Image Segmentation
Pixel-level region labeling — semantic, instance, or panoptic segmentation. Specialized models (SAM family, Mask2Former) dominate. Critical for medical imaging, robotics, content creation.
Setup walkthrough
- Install SAM 2 from Meta's repo: pip install "git+https://github.com/facebookresearch/sam2.git" (Meta's SAM 2 is the SOTA open-weight segmentation family; the older segment-anything PyPI package is SAM 1).
- Download the model (automatic on first use, ~150 MB for SAM 2.1 Tiny, ~2.4 GB for SAM 2.1 Large).
For automatic segmentation (SAM 2 auto):
import cv2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

generator = SAM2AutomaticMaskGenerator.from_pretrained("facebook/sam2-hiera-large")
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # sam2 expects RGB arrays
masks = generator.generate(image)  # segments every object in the image
for i, m in enumerate(masks):
    seg = m["segmentation"]  # binary (H, W) numpy mask
    cv2.imwrite(f"mask_{i}.png", (seg * 255).astype("uint8"))
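Automatic mode can return dozens of masks per image; sorting by pixel area and keeping the biggest few is a cheap way to cut clutter. A minimal sketch, assuming the generator's list-of-dicts output with an "area" key (the synthetic dicts below are stand-ins, not real model output):

```python
def largest_masks(masks, k=3):
    """Keep the k largest masks by pixel area (assumes sam2-style
    automatic-generator output: one dict per mask with an 'area' key)."""
    return sorted(masks, key=lambda m: m["area"], reverse=True)[:k]

# usage with synthetic mask dicts
fake = [{"id": i, "area": a} for i, a in enumerate([120, 9000, 450])]
top = largest_masks(fake, k=2)  # the two largest regions, biggest first
```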
- First segmentation in 2-5 seconds on GPU, 10-30 seconds on CPU.
- For interactive: click a point or drag a box → SAM segments the object at that location.
- For video segmentation: SAM 2 video model propagates masks across frames with minimal re-prompting.
The cheap setup
SAM 2.1 Tiny (150 MB) runs on CPU at 5-15 seconds per image — practical for batch processing of 100s of images overnight. A used GTX 1060 6 GB ($60) runs SAM 2.1 Large at 2-5 seconds per image. For video segmentation (SAM 2 video): GTX 1660 Super 6 GB (~$100) handles 720p video at 5-10 fps. Pair with Ryzen 5 5600 + 16 GB DDR4 + 512 GB NVMe. Total: ~$320-370. Segmentation is moderate compute — the models are small but the mask operations are spatially expensive.
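The CPU numbers above put overnight batches comfortably in reach. A quick sanity check using the section's own latency estimates (these are rough figures, not benchmarks):

```python
def images_per_run(seconds_per_image, hours=8):
    # how many images fit in an overnight window at a given per-image latency
    return int(hours * 3600 / seconds_per_image)

worst = images_per_run(15)  # slow end of the 5-15 s/image CPU estimate
best = images_per_run(5)    # fast end
```

Even at the slow end, an 8-hour run covers well over a thousand images, so a few hundred is easy margin.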
The serious setup
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs SAM 2.1 Large at 1-2 seconds per image, SAM 2 video at 15-30 fps for 1080p. Can segment 10K+ images/hour in batch. For medical imaging (3D segmentation): 12 GB handles typical CT/MRI volumes (512×512×200 voxels) with sliding window inference. Total build: ~$700-900. For very large 3D volumes (whole-body CT, 1000+ slices): 24 GB GPU recommended. Segmentation is VRAM-light for 2D, VRAM-hungry for 3D.
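Sliding-window inference over a CT/MRI volume just tiles the voxel grid with overlapping patches and stitches per-patch predictions back together; VRAM only has to hold one patch at a time. A sketch of the tiling step alone (patch size and overlap are illustrative choices, not model requirements):

```python
def sliding_windows(shape, patch, overlap):
    """Yield (start, stop) index pairs per axis that cover a 3D volume,
    clamping the final window so it ends exactly at the boundary."""
    starts_per_axis = []
    for dim, p, o in zip(shape, patch, overlap):
        step = p - o
        starts = list(range(0, max(dim - p, 0) + 1, step))
        if starts[-1] + p < dim:  # last window must reach the edge
            starts.append(dim - p)
        starts_per_axis.append(starts)
    for z in starts_per_axis[0]:
        for y in starts_per_axis[1]:
            for x in starts_per_axis[2]:
                yield ((z, z + patch[0]), (y, y + patch[1]), (x, x + patch[2]))

# tile a typical 200-slice CT volume (depth, height, width)
windows = list(sliding_windows((200, 512, 512), (64, 128, 128), (16, 32, 32)))
```

Each window is run through the model independently; overlapping regions are usually averaged when stitching.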
Common beginner mistake
The mistake: Running SAM in "automatic everything" mode on a complex scene with 50+ objects, then spending hours manually sorting through 50 masks to find the one you want. Why it fails: SAM's automatic mode segments everything — background clutter, shadows, reflections. You get 50 masks when you only needed 3. The fix: Use prompt-based segmentation. Give SAM a single point or bounding box on the object you want: predictor.predict(point_coords=np.array([[x, y]]), point_labels=np.array([1])). SAM segments exactly that object. For batch processing similar images, use SAM-assisted labeling: segment 10 images with prompts, fine-tune a smaller model (YOLO-seg, Mask R-CNN) on those masks, then run the fine-tuned model on 10K images at 100× speed. SAM is a labeling tool, not a classifier.
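With a point prompt, SAM-style predictors typically return several candidate masks plus a per-candidate quality score; keeping the top-scoring candidate is the usual move. A minimal sketch with synthetic arrays (the (masks, scores) shapes mirror SAM's multimask output, but treat that layout as an assumption):

```python
import numpy as np

def best_mask(masks, scores):
    # masks: (N, H, W) candidate binary masks; scores: (N,) predicted quality
    return masks[int(np.argmax(scores))]

# synthetic stand-in for a 3-candidate multimask result
masks = np.zeros((3, 4, 4), dtype=bool)
masks[1, 1:3, 1:3] = True            # candidate 1 covers a 2x2 region
scores = np.array([0.42, 0.91, 0.55])
picked = best_mask(masks, scores)    # the highest-scoring candidate
```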
Recommended setup for image segmentation
Browse all tools for runtimes that fit this workload.
Reality check
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
Common mistakes
- Buying for spec-sheet VRAM without modeling KV cache + activation overhead
- Underestimating quantization quality loss below Q4
- Skipping flash-attention support (real perf gap on long context)
- Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)
What breaks first
The errors most operators hit when running image segmentation locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle image segmentation before committing money.