Object Detection

Locating and labeling specific objects in images with bounding boxes. Specialized detection models (YOLO family, DETR) dominate, though VLMs increasingly handle simple detection via prompting.

Setup walkthrough

  1. pip install ultralytics (YOLO — the standard open-weight detection library).
  2. For zero-shot detection (no training): pip install transformers → use OWLv2:
from transformers import pipeline

# Zero-shot detection: classes are supplied at inference time, no training needed
detector = pipeline("zero-shot-object-detection", model="google/owlv2-base-patch16")
results = detector("living_room.jpg", candidate_labels=["sofa", "lamp", "tv", "plant"])
for r in results:
    # Each result carries a label, a confidence score, and pixel box coordinates
    print(f"{r['label']}: {r['score']:.2f} at {r['box']}")
  3. First detection in <1 second on GPU, 2-5 seconds on CPU.
  4. For custom classes with training: yolo detect train data=custom_dataset.yaml model=yolo11n.pt epochs=50 — trains in 10-30 minutes on 100 labeled images.
  5. For real-time detection: YOLO11n processes 200+ fps on an RTX 3060, 30+ fps on CPU.
  6. For video: YOLO processes frame-by-frame with tracking (BoT-SORT/ByteTrack built-in); see the sketch after this list.
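
For steps 5 and 6, the Python API mirrors the CLI. A minimal sketch, assuming local files named living_room.jpg and cameras.mp4 (placeholder paths):

from ultralytics import YOLO

# Load the nano model; weights download automatically on first use
model = YOLO("yolo11n.pt")

# Single-image detection: results carry boxes, class IDs, and confidences as tensors
results = model("living_room.jpg")
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)

# Frame-by-frame video tracking with the built-in BoT-SORT tracker
for frame in model.track(source="cameras.mp4", tracker="botsort.yaml", stream=True):
    track_ids = frame.boxes.id  # persistent per-object IDs; None until tracks initialize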

The cheap setup

A used GTX 1060 6 GB (~$60) runs YOLO11n at 100-150 fps, real-time detection for multiple camera feeds, and OWLv2 at 5-10 fps for zero-shot detection. It trains a custom YOLO11s on 500 images in ~20 minutes (see the sketch below). Pair it with a Ryzen 5 5600, 16 GB DDR4, and a 512 GB NVMe drive. Total: ~$320-370. For CPU-only builds: YOLO11n runs at 25-35 fps on a modern laptop CPU, viable for single-camera real-time. Object detection is unusually GPU-efficient; even integrated graphics can handle real-time inference.
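
The same training run from Python instead of the CLI, assuming a YOLO-format dataset YAML (custom_dataset.yaml is the placeholder name from the walkthrough above):

from ultralytics import YOLO

model = YOLO("yolo11s.pt")  # start from pretrained small-variant weights
# Roughly 20 minutes on a GTX 1060-class GPU for ~500 labeled images, per the estimate above
model.train(data="custom_dataset.yaml", epochs=50, imgsz=640, batch=16)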

The serious setup

A used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) is the sweet spot: it runs YOLO11x (the largest variant) at 80-120 fps, handles 8+ 1080p camera streams simultaneously with tracking, and trains a custom YOLO11x on 10K images in 1-2 hours. For production surveillance systems (16+ cameras): add a second RTX 3060; a multi-stream sketch follows below. Total build: ~$700-900. For Jetson edge deployment: see /hardware/jetson-ai. Object detection is one of the most mature and GPU-efficient AI tasks, so prioritize camera quality and storage, not the GPU.
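
One way to feed many cameras on a single card is one tracker loop per stream. A sketch, not a production design; the RTSP URLs are placeholders:

from threading import Thread
from ultralytics import YOLO

def track_camera(url: str) -> None:
    model = YOLO("yolo11x.pt")  # one model instance per stream keeps tracker state independent
    for result in model.track(source=url, tracker="botsort.yaml", stream=True):
        pass  # hand result.boxes to your recording/alerting pipeline here

# Placeholder camera URLs; a real deployment would load these from config
cameras = [f"rtsp://192.168.1.{10 + i}/stream1" for i in range(8)]
threads = [Thread(target=track_camera, args=(url,), daemon=True) for url in cameras]
for t in threads:
    t.start()
for t in threads:
    t.join()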

Common beginner mistake

The mistake: using a multimodal LLM (GPT-4V, Qwen2-VL) for object detection on a video stream, then wondering why it processes 1 frame every 5 seconds.

Why it fails: VLMs generate text; they describe what they see in natural language. Object detection requires numerical output (bbox coordinates, class IDs) at 30+ fps. A VLM doing detection is like using a novelist to read a spreadsheet.

The fix: use a dedicated detection model (YOLO, DETR, or OWLv2 for zero-shot). These output bounding boxes and class scores directly as tensors. YOLO11n processes 200+ fps vs. 0.2 fps for a VLM. Reserve VLMs for the tiny fraction of detections that need reasoning ("Is this person holding a weapon or a phone?"). Detection is a throughput problem; use the right tool for throughput.
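
What that split looks like in practice: the detector handles every frame, and only rare, ambiguous detections get cropped and escalated. A sketch; ask_vlm is a hypothetical helper standing in for whatever VLM endpoint you use:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")

def ask_vlm(crop) -> str:
    # Hypothetical slow path: send the cropped region to a VLM for reasoning
    raise NotImplementedError

for result in model.track(source="camera.mp4", stream=True):
    frame = result.orig_img  # original BGR frame as a numpy array
    for box in result.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])  # fast path: tensors at full frame rate
        # Escalate only low-confidence "person" detections (class 0 in COCO)
        if int(box.cls) == 0 and float(box.conf) < 0.5:
            answer = ask_vlm(frame[y1:y2, x1:x2])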

Recommended setup for object detection

Recommended runtimes

Browse all tools for runtimes that fit this workload.

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

What breaks first

The errors most operators hit when running object detection locally. Each links to a diagnose+fix walkthrough.

Before you buy

Verify your specific hardware can handle object detection before committing money.
