Image Segmentation
Pixel-level region labeling — semantic, instance, or panoptic segmentation. Specialized models (SAM family, Mask2Former) dominate. Critical for medical imaging, robotics, content creation.
Setup walkthrough
- Install SAM 2 from Meta's repo: pip install "git+https://github.com/facebookresearch/sam2.git" (Meta's SAM 2 is the SOTA open-weight segmentation family; the older segment-anything PyPI package is SAM 1).
- Download the model (automatic on first use, ~150 MB for SAM 2.1 Tiny, ~2.4 GB for SAM 2.1 Large).
For automatic segmentation (SAM 2 auto):
import cv2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

generator = SAM2AutomaticMaskGenerator.from_pretrained("facebook/sam2-hiera-large")
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # sam2 expects RGB arrays
masks = generator.generate(image)  # segments every object in the image
for i, m in enumerate(masks):
    seg = m["segmentation"]  # binary (H, W) numpy mask
    cv2.imwrite(f"mask_{i}.png", (seg * 255).astype("uint8"))
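Automatic mode can return dozens of masks per image; sorting by pixel area and keeping the biggest few is a cheap way to cut clutter. A minimal sketch, assuming the generator's list-of-dicts output with an "area" key (the synthetic dicts below are stand-ins, not real model output):

```python
def largest_masks(masks, k=3):
    """Keep the k largest masks by pixel area (assumes sam2-style
    automatic-generator output: one dict per mask with an 'area' key)."""
    return sorted(masks, key=lambda m: m["area"], reverse=True)[:k]

# usage with synthetic mask dicts
fake = [{"id": i, "area": a} for i, a in enumerate([120, 9000, 450])]
top = largest_masks(fake, k=2)  # the two largest regions, biggest first
```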
- First segmentation in 2-5 seconds on GPU, 10-30 seconds on CPU.
- For interactive: click a point or drag a box → SAM segments the object at that location.
- For video segmentation: SAM 2 video model propagates masks across frames with minimal re-prompting.
The cheap setup
SAM 2.1 Tiny (150 MB) runs on CPU at 5-15 seconds per image — practical for batch processing of 100s of images overnight. A used GTX 1060 6 GB ($60) runs SAM 2.1 Large at 2-5 seconds per image. For video segmentation (SAM 2 video): GTX 1660 Super 6 GB (~$100) handles 720p video at 5-10 fps. Pair with Ryzen 5 5600 + 16 GB DDR4 + 512 GB NVMe. Total: ~$320-370. Segmentation is moderate compute — the models are small but the mask operations are spatially expensive.
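The CPU numbers above put overnight batches comfortably in reach. A quick sanity check using the section's own latency estimates (these are rough figures, not benchmarks):

```python
def images_per_run(seconds_per_image, hours=8):
    # how many images fit in an overnight window at a given per-image latency
    return int(hours * 3600 / seconds_per_image)

worst = images_per_run(15)  # slow end of the 5-15 s/image CPU estimate
best = images_per_run(5)    # fast end
```

Even at the slow end, an 8-hour run covers well over a thousand images, so a few hundred is easy margin.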
The serious setup
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs SAM 2.1 Large at 1-2 seconds per image, SAM 2 video at 15-30 fps for 1080p. Can segment 10K+ images/hour in batch. For medical imaging (3D segmentation): 12 GB handles typical CT/MRI volumes (512×512×200 voxels) with sliding window inference. Total build: ~$700-900. For very large 3D volumes (whole-body CT, 1000+ slices): 24 GB GPU recommended. Segmentation is VRAM-light for 2D, VRAM-hungry for 3D.
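Sliding-window inference over a CT/MRI volume just tiles the voxel grid with overlapping patches and stitches per-patch predictions back together; VRAM only has to hold one patch at a time. A sketch of the tiling step alone (patch size and overlap are illustrative choices, not model requirements):

```python
def sliding_windows(shape, patch, overlap):
    """Yield (start, stop) index pairs per axis that cover a 3D volume,
    clamping the final window so it ends exactly at the boundary."""
    starts_per_axis = []
    for dim, p, o in zip(shape, patch, overlap):
        step = p - o
        starts = list(range(0, max(dim - p, 0) + 1, step))
        if starts[-1] + p < dim:  # last window must reach the edge
            starts.append(dim - p)
        starts_per_axis.append(starts)
    for z in starts_per_axis[0]:
        for y in starts_per_axis[1]:
            for x in starts_per_axis[2]:
                yield ((z, z + patch[0]), (y, y + patch[1]), (x, x + patch[2]))

# tile a typical 200-slice CT volume (depth, height, width)
windows = list(sliding_windows((200, 512, 512), (64, 128, 128), (16, 32, 32)))
```

Each window is run through the model independently; overlapping regions are usually averaged when stitching.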
Common beginner mistake
The mistake: Running SAM in "automatic everything" mode on a complex scene with 50+ objects, then spending hours manually sorting through 50 masks to find the one you want. Why it fails: SAM's automatic mode segments everything — background clutter, shadows, reflections. You get 50 masks when you only needed 3. The fix: Use prompt-based segmentation. Give SAM a single point or bounding box on the object you want: predictor.predict(point_coords=np.array([[x, y]]), point_labels=np.array([1])). SAM segments exactly that object. For batch processing similar images, use SAM-assisted labeling: segment 10 images with prompts, fine-tune a smaller model (YOLO-seg, Mask R-CNN) on those masks, then run the fine-tuned model on 10K images at 100× speed. SAM is a labeling tool, not a classifier.
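With a point prompt, SAM-style predictors typically return several candidate masks plus a per-candidate quality score; keeping the top-scoring candidate is the usual move. A minimal sketch with synthetic arrays (the (masks, scores) shapes mirror SAM's multimask output, but treat that layout as an assumption):

```python
import numpy as np

def best_mask(masks, scores):
    # masks: (N, H, W) candidate binary masks; scores: (N,) predicted quality
    return masks[int(np.argmax(scores))]

# synthetic stand-in for a 3-candidate multimask result
masks = np.zeros((3, 4, 4), dtype=bool)
masks[1, 1:3, 1:3] = True            # candidate 1 covers a 2x2 region
scores = np.array([0.42, 0.91, 0.55])
picked = best_mask(masks, scores)    # the highest-scoring candidate
```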
Recommended setup for image segmentation
Browse all tools for runtimes that fit this workload.
Reality check
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
Common mistakes
- Buying for spec-sheet VRAM without modeling KV cache + activation overhead
- Underestimating quantization quality loss below Q4
- Skipping flash-attention support (real perf gap on long context)
- Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)
What breaks first
The errors most operators hit when running image segmentation locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle image segmentation before committing money.