Visual Question Answering
Answering natural-language questions about image contents. Modern vision-language models make this accessible locally: Qwen2.5-VL, InternVL, and LLaVA are all credible options.
Setup walkthrough
- Install Ollama, then `ollama pull minicpm-v` (8 GB) or `ollama pull llava:13b` (8 GB).
- Python script for visual question answering:
```python
import ollama

# Read the image as raw bytes; the ollama client also accepts file paths.
with open("image.jpg", "rb") as f:
    img = f.read()

resp = ollama.chat(model="minicpm-v", messages=[{
    "role": "user",
    "content": "How many people are in this image? What are they doing? Is anyone wearing a hat?",
    "images": [img],
}])
print(resp["message"]["content"])
```
- First answer in 5-10 seconds. Works for counting, attribute detection, spatial reasoning, and activity recognition.
- For stronger VQA: `ollama pull qwen2.5-vl:7b` for higher accuracy on detail-oriented questions, better OCR, and 128K context for multi-image Q&A.
- For batch VQA (100s of images): use a local VL model with batched inference via vLLM (see the sketch after this list).
- For specialized VQA benchmarks (VQAv2, GQA, TextVQA): use lm-evaluation-harness with the local VL model.
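A minimal sequential sketch of the batch loop, reusing the ollama API from the script above; the `photos/` directory, `answers.jsonl` output path, and question text are placeholders. vLLM replaces this per-image loop with true batched inference once throughput matters, but the control flow is the same.

```python
import glob
import json

import ollama

QUESTION = "How many people are in this image, and what are they doing?"

# Sequential sketch: one request per image. For hundreds of images,
# vLLM's batched inference amortizes prefill cost across the batch.
with open("answers.jsonl", "w") as out:
    for path in sorted(glob.glob("photos/*.jpg")):
        with open(path, "rb") as f:
            img = f.read()
        resp = ollama.chat(model="minicpm-v", messages=[{
            "role": "user",
            "content": QUESTION,
            "images": [img],
        }])
        out.write(json.dumps({"image": path,
                              "answer": resp["message"]["content"]}) + "\n")
```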
The cheap setup
A used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) runs MiniCPM-V at 5-10 seconds per question, practical for exploring photo collections interactively. Qwen2-VL 7B runs at similar speed with better detail recognition. Pair it with a Ryzen 5 5600, 16 GB DDR4, and a 512 GB NVMe drive. Total: ~$360-405. For CPU-only use: LLaVA 7B via llama.cpp answers in 30-60 seconds per question, functional for occasional use. VQA is one of the most practically useful local AI tasks; "what's in this photo?" comes up constantly.
The serious setup
A used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090) runs Qwen2-VL 72B at 10-20 seconds per question, the strongest local VQA available. It is near GPT-4V quality on detail questions, spatial reasoning, and OCR-based VQA. For production VQA applications (customer support analyzing product photos, insurance claim photo analysis), the 72B model's accuracy justifies the slower speed. Total: ~$1,800-2,200. For speed-focused VQA: Qwen2-VL 7B on an RTX 4090 answers in 1-3 seconds per question.
Common beginner mistake
The mistake: Asking a VLM "What's in this image?" and being satisfied with the generic answer "a living room with furniture."

Why it fails: Open-ended questions produce open-ended answers. The model gives the most salient answer, not the answer you need. If you're looking for a specific detail (is the window open?), you won't get it from a generic prompt.

The fix: Ask specific questions: "Is the window on the left open or closed? What color is the couch? Is there a coffee table? What's on it?" Multiple specific questions get better answers than one vague one. For batch analysis, define a questionnaire template and run it on every image (see the sketch below). VQA quality depends more on prompt specificity than model size: a good prompt on a 7B model beats a bad prompt on a 72B model.
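A minimal questionnaire-template sketch, again using the ollama Python API from the setup walkthrough; the question list, image path, and `run_questionnaire` helper are illustrative, not a fixed API.

```python
import ollama

# Questionnaire template: specific, closed questions beat one vague prompt.
QUESTIONS = [
    "Is the window on the left open or closed?",
    "What color is the couch?",
    "Is there a coffee table? If so, what is on it?",
]

def run_questionnaire(image_path, model="minicpm-v"):
    with open(image_path, "rb") as f:
        img = f.read()
    answers = {}
    for q in QUESTIONS:
        resp = ollama.chat(model=model, messages=[{
            "role": "user",
            "content": q,
            "images": [img],
        }])
        answers[q] = resp["message"]["content"]
    return answers

print(run_questionnaire("living_room.jpg"))
```

Asking each question in its own turn keeps the model from answering only the most salient part of a combined prompt.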
Recommended setup for visual question answering
Browse all tools for runtimes that fit this workload.
Reality check
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
Common mistakes
- Buying for spec-sheet VRAM without modeling KV cache + activation overhead (see the sketch after this list)
- Underestimating quantization quality loss below Q4
- Skipping flash-attention support (real perf gap on long context)
- Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)
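To make the first mistake concrete, here is a back-of-envelope VRAM model: weights plus KV cache, ignoring activations. The architecture numbers in the example call (28 layers, 4 KV heads, head dim 128 for a Qwen2-class 7B) are assumptions; read the real values from the model's config.json, and budget another 10-20% on top for activations and, for VLMs, the vision tower.

```python
def vram_gib(params_b, weight_bits, n_layers, n_kv_heads, head_dim,
             ctx_len, kv_bits=16):
    """Rough lower bound on VRAM: weights + KV cache, ignoring activations."""
    weights = params_b * 1e9 * weight_bits / 8                  # weight bytes
    # KV cache: keys + values, per layer, per KV head, per token
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bits / 8
    return (weights + kv) / 2**30

# Illustrative: a Q4-quantized 7B with a 32K context window
print(f"{vram_gib(7, 4, 28, 4, 128, 32_768):.1f} GiB")  # ~5.0 GiB before overhead
```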
What breaks first
The errors most operators hit when running visual question answering locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle visual question answering before committing money.