Image Classification
Assigning labels to images — single-label or multi-label. Foundational vision task; modern multimodal LLMs handle this competently in addition to specialized classifiers.
Setup walkthrough
- Install dependencies: pip install transformers torch torchvision
- Python script for zero-shot classification (no training needed):
from transformers import pipeline
classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14")
result = classifier("photo.jpg", candidate_labels=["cat", "dog", "bird", "car"])
print(result) # [{"label": "cat", "score": 0.95}, ...]
- First classification in <1 second on GPU, ~2-5 seconds on CPU.
- For custom classes with training: pip install timm → load a pre-trained ViT or ConvNeXt → fine-tune on your labeled images (50-200 images per class, ~5-30 minutes of training on GPU); see the fine-tuning sketch after this list.
- For production: export the model to ONNX and serve it with ONNX Runtime for 2-3× faster inference (export sketch below).
- For video classification: process frames with the same pipeline and aggregate predictions across frames (frame-aggregation sketch below).
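A minimal fine-tuning sketch for the custom-classes path, assuming labeled images arranged one subfolder per class under data/train/ and data/val/ (hypothetical paths); the model name, image size, and hyperparameters are illustrative defaults, not tuned values.

import timm
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Pre-trained ViT from timm; num_classes swaps in a fresh classification head.
train_dir, val_dir = "data/train", "data/val"  # hypothetical layout: one subfolder per class
num_classes = 4

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=num_classes)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Standard ImageNet preprocessing; match it to the model's pretraining config.
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_ds = datasets.ImageFolder(train_dir, transform=tfm)
val_ds = datasets.ImageFolder(val_dir, transform=tfm)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)
val_dl = DataLoader(val_ds, batch_size=64, num_workers=4)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):  # a few epochs is usually enough at 50-200 images per class
    model.train()
    for images, labels in train_dl:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # Quick validation-accuracy check after each epoch.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_dl:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    print(f"epoch {epoch}: val accuracy {correct / total:.3f}")

torch.save(model.state_dict(), "vit_finetuned.pt")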
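A sketch of the production export path, assuming the fine-tuned checkpoint from the previous sketch (vit_finetuned.pt); the output file name, opset version, and CPU execution provider are assumptions to adapt to your deployment.

import numpy as np
import onnxruntime as ort
import timm
import torch

# Rebuild the classifier, load the fine-tuned weights, and export to ONNX.
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=4)
model.load_state_dict(torch.load("vit_finetuned.pt", map_location="cpu"))
model.eval()

dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, "classifier.onnx",
    input_names=["pixel_values"], output_names=["logits"],
    dynamic_axes={"pixel_values": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)

# Inference with ONNX Runtime; feed preprocessed images as float32 NCHW arrays.
session = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])
batch = np.random.rand(8, 3, 224, 224).astype(np.float32)  # stand-in for real preprocessed images
logits = session.run(["logits"], {"pixel_values": batch})[0]
print(logits.argmax(axis=1))  # predicted class index per image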
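A frame-aggregation sketch for video, assuming OpenCV (opencv-python) for decoding and a hypothetical clip.mp4 input; sampling one frame per second and averaging scores is one simple aggregation scheme, not the only option.

import cv2
from PIL import Image
from transformers import pipeline

classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14")
labels = ["cat", "dog", "bird", "car"]

# Sample roughly one frame per second and average label scores across sampled frames.
cap = cv2.VideoCapture("clip.mp4")  # hypothetical input file
fps = int(cap.get(cv2.CAP_PROP_FPS)) or 30
scores = {label: 0.0 for label in labels}
frames_scored = 0
frame_idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % fps == 0:
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        for pred in classifier(image, candidate_labels=labels):
            scores[pred["label"]] += pred["score"]
        frames_scored += 1
    frame_idx += 1
cap.release()

averaged = {label: s / max(frames_scored, 1) for label, s in scores.items()}
print(max(averaged, key=averaged.get), averaged)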
The cheap setup
Image classification is extremely hardware-efficient. CLIP ViT-B/16 (300 MB) classifies images at 50-100 images/second on any laptop CPU, so a $300 refurbished laptop can classify 100K+ images per day. For fine-tuning: a used GTX 1060 6 GB ($60) trains a ResNet-50 on 1,000 images per class in 15-30 minutes. For real-time video classification (30 fps), add a used GTX 1660 Super 6 GB ($100). Classification is the cheapest AI task: even a Raspberry Pi 5 manages 5-10 classifications/second.
The serious setup
Any RTX GPU is overkill for classification. A used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) runs ViT-Large at 500-1,000 images/second — 3M+ images/hour. Fine-tunes ViT on 10K images in ~5 minutes. For massive-scale classification (100M+ images): the bottleneck is storage I/O and data loading, not GPU compute. Pair with fast NVMe storage and a high-core-count CPU. Total: ~$800-1,000. Classification is almost never the compute bottleneck in a pipeline — spend your budget on storage and data preprocessing instead.
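A sketch of the batched-inference loop where data loading, not the GPU, sets the throughput ceiling; the directory layout, batch size, and worker count are assumptions to tune against your own storage, and it assumes a CUDA-capable GPU is present.

import timm
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Parallel workers and pinned memory keep the GPU fed from fast NVMe storage.
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
dataset = datasets.ImageFolder("images/", transform=tfm)  # hypothetical directory, one subfolder per class
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)

model = timm.create_model("vit_large_patch16_224", pretrained=True).cuda().eval()

with torch.inference_mode():
    for images, _ in loader:
        logits = model(images.cuda(non_blocking=True))
        preds = logits.argmax(dim=1)
        # ...store preds; writing results back out is another I/O cost to budget for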
Common beginner mistake
The mistake: Using a multimodal LLM (GPT-4V, Qwen2-VL) for batch image classification when you have 100K images to process. Why it fails: VLMs generate text autoregressively; they read the image, then generate "cat" token by token. Each classification takes 2-5 seconds and costs ~1,000 tokens, versus ~0.01 seconds for a dedicated classifier. At 100K images, that's roughly 55-140 hours and millions of tokens, versus about 17 minutes with a classifier. The fix: Use VLMs for one-off tasks or ambiguous cases. For batch classification, use a dedicated classifier (CLIP, ViT, ConvNeXt): one forward pass yields a probability distribution over all labels in milliseconds. Use the VLM only for the 1% of images the classifier is uncertain about.
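A sketch of this uncertainty-routing pattern, assuming a confidence threshold of 0.8 and a placeholder ask_vlm function standing in for whatever VLM call you use; both are assumptions to adapt to your own data.

from transformers import pipeline

classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14")
labels = ["cat", "dog", "bird", "car"]
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune on a held-out sample
image_paths = ["photo_001.jpg", "photo_002.jpg"]  # your file list (100K paths in the real run)

def ask_vlm(path, labels):
    # Placeholder for a multimodal-LLM call (local VLM or API); only reached for uncertain images.
    raise NotImplementedError

results = {}
uncertain = []
for path in image_paths:
    top = classifier(path, candidate_labels=labels)[0]  # predictions come back sorted by score
    if top["score"] >= CONFIDENCE_THRESHOLD:
        results[path] = top["label"]
    else:
        uncertain.append(path)  # defer the ambiguous minority to the VLM

for path in uncertain:
    results[path] = ask_vlm(path, labels)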
Recommended setup for image classification
Browse all tools for runtimes that fit this workload.
Reality check
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
Common mistakes
- Buying for spec-sheet VRAM without modeling KV cache + activation overhead
- Underestimating quantization quality loss below Q4
- Skipping flash-attention support (real perf gap on long context)
- Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)
What breaks first
The errors most operators hit when running image classification locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle image classification before committing money.