Image Classification
Assigning labels to images — single-label or multi-label. Foundational vision task; modern multimodal LLMs handle this competently in addition to specialized classifiers.
Setup walkthrough
- Install dependencies: pip install transformers torch torchvision
- Python script for zero-shot classification (no training needed):
from transformers import pipeline
classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14")
result = classifier("photo.jpg", candidate_labels=["cat", "dog", "bird", "car"])
print(result) # [{"label": "cat", "score": 0.95}, ...]
- First classification in <1 second on GPU, ~2-5 seconds on CPU.
- For custom classes with training: pip install timm → load a pre-trained ViT or ConvNeXt → fine-tune on your labeled images (50-200 images per class, ~5-30 minutes of training on GPU); see the fine-tuning sketch after this list.
- For production: export the model to ONNX and serve it with ONNX Runtime for 2-3× faster inference (export sketch below).
- For video classification: process frames with the same pipeline and aggregate predictions across frames (frame-aggregation sketch below).
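A minimal fine-tuning sketch for the custom-classes path, assuming labeled images arranged one subfolder per class under data/train/ and data/val/ (hypothetical paths); the model name, image size, and hyperparameters are illustrative defaults, not tuned values.

import timm
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Pre-trained ViT from timm; num_classes swaps in a fresh classification head.
train_dir, val_dir = "data/train", "data/val"  # hypothetical layout: one subfolder per class
num_classes = 4

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=num_classes)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Standard ImageNet preprocessing; match it to the model's pretraining config.
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_ds = datasets.ImageFolder(train_dir, transform=tfm)
val_ds = datasets.ImageFolder(val_dir, transform=tfm)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)
val_dl = DataLoader(val_ds, batch_size=64, num_workers=4)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):  # a few epochs is usually enough at 50-200 images per class
    model.train()
    for images, labels in train_dl:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # Quick validation-accuracy check after each epoch.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_dl:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    print(f"epoch {epoch}: val accuracy {correct / total:.3f}")

torch.save(model.state_dict(), "vit_finetuned.pt")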
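A sketch of the production export path, assuming the fine-tuned checkpoint from the previous sketch (vit_finetuned.pt); the output file name, opset version, and CPU execution provider are assumptions to adapt to your deployment.

import numpy as np
import onnxruntime as ort
import timm
import torch

# Rebuild the classifier, load the fine-tuned weights, and export to ONNX.
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=4)
model.load_state_dict(torch.load("vit_finetuned.pt", map_location="cpu"))
model.eval()

dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, "classifier.onnx",
    input_names=["pixel_values"], output_names=["logits"],
    dynamic_axes={"pixel_values": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)

# Inference with ONNX Runtime; feed preprocessed images as float32 NCHW arrays.
session = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])
batch = np.random.rand(8, 3, 224, 224).astype(np.float32)  # stand-in for real preprocessed images
logits = session.run(["logits"], {"pixel_values": batch})[0]
print(logits.argmax(axis=1))  # predicted class index per image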
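A frame-aggregation sketch for video, assuming OpenCV (opencv-python) for decoding and a hypothetical clip.mp4 input; sampling one frame per second and averaging scores is one simple aggregation scheme, not the only option.

import cv2
from PIL import Image
from transformers import pipeline

classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14")
labels = ["cat", "dog", "bird", "car"]

# Sample roughly one frame per second and average label scores across sampled frames.
cap = cv2.VideoCapture("clip.mp4")  # hypothetical input file
fps = int(cap.get(cv2.CAP_PROP_FPS)) or 30
scores = {label: 0.0 for label in labels}
frames_scored = 0
frame_idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % fps == 0:
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        for pred in classifier(image, candidate_labels=labels):
            scores[pred["label"]] += pred["score"]
        frames_scored += 1
    frame_idx += 1
cap.release()

averaged = {label: s / max(frames_scored, 1) for label, s in scores.items()}
print(max(averaged, key=averaged.get), averaged)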
The cheap setup
Image classification is extremely hardware-efficient. CLIP ViT-B/16 (300 MB) classifies images at 50-100 images/second on any laptop CPU, so a $300 refurbished laptop can classify 100K+ images per day. For fine-tuning: a used GTX 1060 6 GB ($60) trains a ResNet-50 on 1,000 images per class in 15-30 minutes. For real-time video classification (30 fps), add a used GTX 1660 Super 6 GB ($100). Classification is the cheapest AI task: even a Raspberry Pi 5 manages 5-10 classifications/second.
The serious setup
Any RTX GPU is overkill for classification. A used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) runs ViT-Large at 500-1,000 images/second — 3M+ images/hour. Fine-tunes ViT on 10K images in ~5 minutes. For massive-scale classification (100M+ images): the bottleneck is storage I/O and data loading, not GPU compute. Pair with fast NVMe storage and a high-core-count CPU. Total: ~$800-1,000. Classification is almost never the compute bottleneck in a pipeline — spend your budget on storage and data preprocessing instead.
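A sketch of the batched-inference loop where data loading, not the GPU, sets the throughput ceiling; the directory layout, batch size, and worker count are assumptions to tune against your own storage, and it assumes a CUDA-capable GPU is present.

import timm
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Parallel workers and pinned memory keep the GPU fed from fast NVMe storage.
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
dataset = datasets.ImageFolder("images/", transform=tfm)  # hypothetical directory, one subfolder per class
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)

model = timm.create_model("vit_large_patch16_224", pretrained=True).cuda().eval()

with torch.inference_mode():
    for images, _ in loader:
        logits = model(images.cuda(non_blocking=True))
        preds = logits.argmax(dim=1)
        # ...store preds; writing results back out is another I/O cost to budget for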
Common beginner mistake
The mistake: Using a multimodal LLM (GPT-4V, Qwen2-VL) for batch image classification when you have 100K images to process. Why it fails: VLMs generate text autoregressively; they read the image, then generate "cat" token by token. Each classification takes 2-5 seconds and costs ~1,000 tokens, versus ~0.01 seconds for a dedicated classifier. At 100K images, that's roughly 55-140 hours and millions of tokens, versus about 17 minutes with a classifier. The fix: Use VLMs for one-off tasks or ambiguous cases. For batch classification, use a dedicated classifier (CLIP, ViT, ConvNeXt): one forward pass yields a probability distribution over all labels in milliseconds. Use the VLM only for the 1% of images the classifier is uncertain about.
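A sketch of this uncertainty-routing pattern, assuming a confidence threshold of 0.8 and a placeholder ask_vlm function standing in for whatever VLM call you use; both are assumptions to adapt to your own data.

from transformers import pipeline

classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14")
labels = ["cat", "dog", "bird", "car"]
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune on a held-out sample
image_paths = ["photo_001.jpg", "photo_002.jpg"]  # your file list (100K paths in the real run)

def ask_vlm(path, labels):
    # Placeholder for a multimodal-LLM call (local VLM or API); only reached for uncertain images.
    raise NotImplementedError

results = {}
uncertain = []
for path in image_paths:
    top = classifier(path, candidate_labels=labels)[0]  # predictions come back sorted by score
    if top["score"] >= CONFIDENCE_THRESHOLD:
        results[path] = top["label"]
    else:
        uncertain.append(path)  # defer the ambiguous minority to the VLM

for path in uncertain:
    results[path] = ask_vlm(path, labels)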
Recommended setup for image classification
Browse all tools for runtimes that fit this workload.
Reality check
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
Common mistakes
- Buying for spec-sheet VRAM without modeling KV cache + activation overhead
- Underestimating quantization quality loss below Q4
- Skipping flash-attention support (real perf gap on long context)
- Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)
What breaks first
The errors most operators hit when running image classification locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle image classification before committing money.