Learning Objectives
- Understand what SAM 2 is and how zero-shot segmentation works
- Identify practical applications of image and video segmentation in AI workflows
- Evaluate SAM 2's capabilities and limitations for production use cases
What Is SAM 2?
SAM 2 (Segment Anything Model 2) is Meta's open-source foundation model for image and video segmentation — the task of identifying and outlining specific objects or regions within visual content. Released in July 2024 as the successor to the original SAM (April 2023), SAM 2 extends the "segment anything" capability from static images to real-time video.
The breakthrough is zero-shot segmentation: SAM 2 can segment objects it was never specifically trained on, in images or video, without task-specific training data. Click on something and SAM 2 outlines it, with no labels, fine-tuning, or category-specific models required.
Previously, this kind of segmentation generally required a model trained for each object category or domain (a separate model for people, cars, medical images, and so on). The SAM family makes segmentation a general-purpose capability.
✅Tip
Try SAM 2: segment-anything-2.com — interactive demo. Source code and weights at github.com/facebookresearch/sam2. Apache 2.0 license.
How SAM 2 Works
Prompt-Based Segmentation
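SAM 2 accepts several types of prompts to identify what to segment:
- Point prompts: click one or more points on the object to include (foreground) or on areas to exclude (background)
- Box prompts: draw a bounding box around the area of interest
- Mask prompts: provide a rough mask and SAM 2 refines it
Text is not a native prompt type in the released SAM 2 models. Text-driven workflows typically pair SAM 2 with a separate open-vocabulary detector that turns a phrase into boxes, which are then passed in as box prompts.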
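As a rough sketch of how point and box prompts are passed in practice, the snippet below builds them as NumPy arrays and hands them to the image predictor from the `sam2` package. The `build_sam2`, `set_image`, and `predict` calls follow the usage shown in the repository README; the helper names (`make_point_prompt`, `make_box_prompt`, `load_predictor`, `segment`) are hypothetical, and the config and checkpoint paths are left to the caller because they depend on which weights were downloaded.

```python
import numpy as np


def make_point_prompt(clicks):
    """Convert (x, y, is_foreground) clicks into coordinate and label arrays.

    Label 1 marks a foreground click, label 0 a background click.
    """
    coords = np.array([[x, y] for x, y, _ in clicks], dtype=np.float32)
    labels = np.array([1 if fg else 0 for _, _, fg in clicks], dtype=np.int32)
    return coords, labels


def make_box_prompt(x0, y0, x1, y1):
    """Return a box in XYXY pixel order, tolerating swapped corners."""
    return np.array(
        [min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)], dtype=np.float32
    )


def load_predictor(model_cfg, checkpoint, device="cuda"):
    """Build the image predictor. Requires `sam2` and downloaded weights."""
    from sam2.build_sam import build_sam2
    from sam2.sam2_image_predictor import SAM2ImagePredictor

    return SAM2ImagePredictor(build_sam2(model_cfg, checkpoint, device=device))


def segment(predictor, image, clicks=None, box=None):
    """Segment one RGB image (H x W x 3 uint8 array) from points and/or a box."""
    import torch

    coords, labels = make_point_prompt(clicks) if clicks else (None, None)
    with torch.inference_mode():
        predictor.set_image(image)
        masks, scores, _ = predictor.predict(
            point_coords=coords,
            point_labels=labels,
            box=box,
            multimask_output=True,
        )
    return masks, scores
```

A call might look like `segment(load_predictor(cfg, ckpt), image, clicks=[(500, 375, True)])`, where `cfg`, `ckpt`, and the click position are placeholders.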
Image Segmentation
For static images, SAM 2 generates a pixel-level mask for each prompted object:
- Segments any object regardless of category (people, animals, products, medical structures, industrial parts)
- Handles complex scenes with overlapping objects
- Produces multiple plausible segmentation masks when the prompt is ambiguous
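When a prompt is ambiguous, the predictor can return several candidate masks along with a predicted quality score for each. A minimal sketch of choosing among them is shown below, using toy arrays in place of real model output; `pick_best_mask` is a hypothetical helper, and picking the top score is only one possible policy (an interactive tool might show all candidates to the user instead).

```python
import numpy as np


def pick_best_mask(masks, scores):
    """Return (boolean mask, index) for the highest-scoring candidate."""
    masks = np.asarray(masks)
    scores = np.asarray(scores, dtype=np.float64)
    if masks.shape[0] != scores.shape[0]:
        raise ValueError("need exactly one score per candidate mask")
    best = int(np.argmax(scores))
    return masks[best].astype(bool), best


# Toy stand-ins for three candidates: a part, a sub-object, the whole object.
candidates = np.zeros((3, 4, 4), dtype=np.float32)
candidates[0, 1, 1] = 1
candidates[1, 1:3, 1:3] = 1
candidates[2, :, :] = 1

mask, idx = pick_best_mask(candidates, [0.61, 0.93, 0.78])
print(idx, int(mask.sum()))  # prints "1 4"
```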
Video Segmentation
SAM 2's major advance over the original SAM is promptable video segmentation at interactive speeds:
- Object tracking — segment an object in one frame, and SAM 2 tracks it across the entire video
- Temporal consistency — masks remain stable as objects move, rotate, and change scale
- Occlusion handling — maintains tracking when objects are temporarily hidden behind other objects
- Real-time capable — fast enough for interactive video editing workflows
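The prompt-once, propagate-everywhere workflow can be sketched roughly as follows. The `build_sam2_video_predictor`, `init_state`, `add_new_points_or_box`, and `propagate_in_video` calls follow the repository's video example; `track_object`, `collect_masks`, and `to_numpy` are hypothetical helpers, and the sketch assumes the video is supplied as a directory of JPEG frames with the object clicked on frame 0.

```python
import numpy as np


def to_numpy(x):
    """Accept a NumPy array or a torch tensor (possibly on the GPU)."""
    if hasattr(x, "detach"):
        return x.detach().float().cpu().numpy()
    return np.asarray(x)


def collect_masks(propagation, threshold=0.0):
    """Turn (frame_idx, obj_ids, mask_logits) tuples into
    {frame_idx: {obj_id: boolean H x W mask}}."""
    results = {}
    for frame_idx, obj_ids, mask_logits in propagation:
        logits = to_numpy(mask_logits)
        height_width = logits.shape[-2:]
        results[frame_idx] = {
            obj_id: logits[i].reshape(height_width) > threshold
            for i, obj_id in enumerate(obj_ids)
        }
    return results


def track_object(video_dir, click_xy, model_cfg, checkpoint, device="cuda"):
    """Prompt one object on frame 0 and track it through the video.

    Requires the `sam2` package, `torch`, and downloaded weights.
    """
    import torch
    from sam2.build_sam import build_sam2_video_predictor

    predictor = build_sam2_video_predictor(model_cfg, checkpoint, device=device)
    with torch.inference_mode():
        state = predictor.init_state(video_path=video_dir)
        predictor.add_new_points_or_box(
            inference_state=state,
            frame_idx=0,
            obj_id=1,  # arbitrary ID chosen by the caller
            points=np.array([click_xy], dtype=np.float32),
            labels=np.array([1], dtype=np.int32),  # 1 = foreground click
        )
        return collect_masks(predictor.propagate_in_video(state))
```

Keeping `collect_masks` separate from the model call makes the post-processing testable with synthetic logits, without a GPU or weights.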
Applications
Creative and Media
- Video editing — isolate subjects for background replacement, color grading, or effects
- Photo editing — precise object selection without manual masking
- Content creation — extract objects from images for compositing
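Once a mask exists, background replacement reduces to a per-pixel selection between two images. The sketch below uses tiny synthetic arrays rather than real frames; `replace_background` is a hypothetical helper, and it assumes a hard binary mask (production compositing would usually feather the mask edge).

```python
import numpy as np


def replace_background(image, mask, background):
    """Keep masked pixels from `image`; take all other pixels from `background`."""
    image = np.asarray(image)
    background = np.asarray(background)
    mask = np.asarray(mask, dtype=bool)
    if image.shape != background.shape or image.shape[:2] != mask.shape:
        raise ValueError("image, background, and mask must share height and width")
    # Broadcast the H x W mask across the colour channels.
    return np.where(mask[..., None], image, background)


frame = np.full((2, 2, 3), 200, dtype=np.uint8)   # stand-in for a video frame
backdrop = np.zeros((2, 2, 3), dtype=np.uint8)    # stand-in for the new background
subject = np.array([[True, False], [False, False]])

composite = replace_background(frame, subject, backdrop)
```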
Medical Imaging
- Radiology — segment tumors, organs, or anomalies in CT/MRI scans
- Pathology — identify cellular structures in microscopy images
- Surgical planning — 3D reconstruction from segmented medical images
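Segmentation masks become clinically useful mainly through measurement. As an illustrative sketch only, the functions below convert pixel counts into physical area and an approximate volume; the helper names and the spacing values are assumptions, and in a real workflow the spacing and slice thickness would come from the scan metadata.

```python
import numpy as np


def mask_area_mm2(mask, pixel_spacing_mm):
    """Area of a 2D mask, given (row_spacing, col_spacing) in millimetres."""
    row_mm, col_mm = pixel_spacing_mm
    return float(np.count_nonzero(mask)) * row_mm * col_mm


def stack_volume_mm3(masks, pixel_spacing_mm, slice_thickness_mm):
    """Approximate volume: sum of per-slice areas times slice thickness."""
    total_area = sum(mask_area_mm2(m, pixel_spacing_mm) for m in masks)
    return total_area * slice_thickness_mm


# Two synthetic slices, each with a 2 x 3 segmented region (6 pixels).
slice_mask = np.zeros((4, 4), dtype=bool)
slice_mask[1:3, 0:3] = True
volume = stack_volume_mm3([slice_mask, slice_mask], (0.5, 0.5), 2.0)
print(volume)  # prints 6.0
```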
Industrial and Robotics
- Quality inspection — identify defects on manufacturing lines
- Autonomous navigation — segment road surfaces, obstacles, and lanes
- Robotic manipulation — identify graspable objects and their boundaries
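For inspection or manipulation, a downstream system usually needs simple geometry rather than the raw mask. A minimal sketch of deriving area, bounding box, and centroid from a binary mask follows; `mask_geometry` is a hypothetical helper, and a real grasp planner would need considerably more than a 2D centroid (depth and orientation, for example).

```python
import numpy as np


def mask_geometry(mask):
    """Return area, inclusive XYXY bounding box, and centroid of a binary mask.

    Returns None for an empty mask.
    """
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return {
        "area_px": int(ys.size),
        "bbox_xyxy": (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())),
        "centroid_xy": (float(xs.mean()), float(ys.mean())),
    }


part = np.zeros((5, 5), dtype=bool)
part[1:3, 2:5] = True  # rows 1-2, columns 2-4
geometry = mask_geometry(part)
```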
Data Annotation
- Training data creation — dramatically accelerates the labeling of segmentation datasets
- Active learning — use SAM predictions as starting points for human annotators to refine
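One way to organise model-assisted labeling is to route low-confidence proposals to annotators and to track how much annotators change each proposal. The sketch below illustrates both ideas; `triage`, `mask_iou`, and the 0.9 threshold are hypothetical choices, not values taken from SAM 2.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two binary masks (1.0 if both are empty)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(a, b).sum() / union)


def triage(proposals, accept_above=0.9):
    """Split (item_id, score) pairs into accepted and needs-review lists."""
    accepted = [item for item, score in proposals if score >= accept_above]
    review = [item for item, score in proposals if score < accept_above]
    return accepted, review


proposed = np.array([[1, 1], [0, 0]])
corrected = np.array([[1, 0], [0, 0]])
agreement = mask_iou(proposed, corrected)  # 1 shared pixel out of 2 in the union
accepted, review = triage([("img_001", 0.95), ("img_002", 0.40)])
```

A persistently low IoU between proposals and corrected masks on some class of images is a signal that the model needs more prompts or fine-tuning for that domain.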
Access
| Detail | Info |
|---|---|
| Price | Free (open source) |
| License | Apache 2.0 |
| Source Code | github.com/facebookresearch/sam2 |
| Model Weights | Downloadable in four sizes (tiny, small, base plus, large) |
| Framework | PyTorch |
| Hardware | GPU recommended for real-time performance; CPU possible for batch processing |
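Because the hardware available varies, scripts commonly select a device at startup. The following is a small sketch of that pattern using standard PyTorch availability checks; `pick_device` is a hypothetical helper, and the Apple MPS branch is included as an assumption that may not give the same results or speed as CUDA.

```python
def pick_device():
    """Prefer CUDA, then Apple MPS, then CPU; use CPU if torch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
```

The resulting string can be passed as the `device` argument when building the model.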
Strengths
- Zero-shot segmentation: works on a wide range of objects without task-specific training
- Image and video: one model handles still images and video, with object tracking and occlusion handling
- Open source (Apache 2.0): permissive license that allows commercial use, subject to its attribution and notice terms
- Multiple prompt types: points, boxes, and masks provide flexible interaction
- Broad adoption: used in creative tools, medical imaging, robotics, and data annotation
- Foundation for fine-tuning: can be fine-tuned on domain-specific data for better performance in specialized applications
Limitations & Considerations
- GPU recommended — real-time performance requires a capable GPU; CPU inference is possible but slow
- Segmentation only — SAM identifies object boundaries but doesn't classify what the object is (no semantic labels)
- Complex scenes — performance degrades with very cluttered scenes or tiny objects
- Video memory — processing long videos requires significant GPU memory for temporal tracking
- Not a complete pipeline — segmentation is one step; applications typically need additional models for classification, measurement, or action
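Because the masks carry no class labels, a common pattern is to crop each segmented region and pass it to a separate classifier. The sketch below shows only the cropping step; `crop_to_mask` is a hypothetical helper, the classifier itself is out of scope, and zeroing the background is one choice among several.

```python
import numpy as np


def crop_to_mask(image, mask, pad=0):
    """Crop `image` to the mask's bounding box (plus padding) and zero the
    pixels outside the mask. Returns None for an empty mask."""
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    height, width = mask.shape
    y0, y1 = max(int(ys.min()) - pad, 0), min(int(ys.max()) + pad + 1, height)
    x0, x1 = max(int(xs.min()) - pad, 0), min(int(xs.max()) + pad + 1, width)
    crop = np.asarray(image)[y0:y1, x0:x1].copy()
    crop[~mask[y0:y1, x0:x1]] = 0
    return crop


image = np.full((4, 4, 3), 9, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

tight = crop_to_mask(image, mask)          # 2 x 2 crop, all object pixels
padded = crop_to_mask(image, mask, pad=1)  # 4 x 4 crop, border zeroed
```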
Key Takeaways
- SAM 2 is Meta's open-source foundation model for image and video segmentation. It segments objects zero-shot from click, box, or mask prompts; text is not a native prompt type
- The extension to video (object tracking, temporal consistency, occlusion handling) makes SAM 2 practical for video editing, medical imaging, robotics, and data annotation
- Fully open source (Apache 2.0) and built on PyTorch; widely adopted across creative, medical, industrial, and research applications
- Represents the emergence of foundation models beyond language — general-purpose vision capabilities that previously required task-specific models