Video Understanding
Comprehending video content — captioning, Q&A, action recognition. Multimodal video LLMs like Qwen2.5-VL handle this.
Setup walkthrough
- Install Ollama → `ollama pull qwen2.5-vl:7b` (~5 GB — a strong video understanding model).
- Extract keyframes from your video: `pip install opencv-python` → `ffmpeg -i input.mp4 -vf "fps=1" frames/frame_%04d.jpg` (one frame per second).
- Python script to analyze the extracted frames:
```python
import ollama, glob

# Caption each extracted keyframe in order
frames = sorted(glob.glob("frames/*.jpg"))
for i, frame in enumerate(frames):
    with open(frame, "rb") as f:
        img = f.read()
    resp = ollama.chat(model="qwen2.5-vl:7b", messages=[{
        "role": "user",
        "content": f"Frame {i+1}/{len(frames)}: Describe what's happening. Focus on actions, people, and scene changes.",
        "images": [img],  # raw bytes; the ollama client also accepts file paths
    }])
    print(f"Frame {i+1}: {resp['message']['content']}")
```
- For a 1-minute video: 60 frames, ~5-10 minutes of total processing on a 12 GB GPU.
- For temporal understanding (action sequences, events): Qwen2.5-VL supports video input natively — feed it a video file and it auto-samples frames. Check whether your runtime actually exposes video input before relying on this; if it only accepts images (as Ollama's chat API currently does), keep extracting frames yourself. The sketch after this list shows one way to recover temporal structure from per-frame captions.
- Use cases: security footage review, sports analysis, content moderation, video search.
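A common follow-up question is "what happened in this video?" rather than per-frame output. One workable pattern: timestamp each caption, then make a second text-only call over the caption list. A minimal sketch, assuming frames were extracted at 1 fps as above (so frame i is second i); both prompts are illustrative, not fixed:

```python
import ollama, glob

# Pass 1: one short caption per keyframe
captions = []
for i, frame in enumerate(sorted(glob.glob("frames/*.jpg"))):
    resp = ollama.chat(model="qwen2.5-vl:7b", messages=[{
        "role": "user",
        "content": "Describe this frame in one sentence.",
        "images": [frame],  # the ollama client accepts file paths as well as bytes
    }])
    # At 1 fps, frame i corresponds to second i of the video
    captions.append(f"[{i // 60:02d}:{i % 60:02d}] {resp['message']['content']}")

# Pass 2: summarize the timestamped captions as plain text (no images attached)
summary = ollama.chat(model="qwen2.5-vl:7b", messages=[{
    "role": "user",
    "content": "These are timestamped captions of frames from one video:\n"
               + "\n".join(captions)
               + "\n\nSummarize the video's events in chronological order.",
}])
print(summary["message"]["content"])
```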
The cheap setup
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs Qwen2.5-VL 7B at 5-10 seconds per frame — a 1-minute video (60 frames at 1 fps) takes 5-10 minutes to analyze. For longer videos (10-60 minutes), process overnight — a resumable-loop sketch follows below. Pair with a Ryzen 5 5600 + 32 GB DDR4 + 2 TB NVMe (video storage). Total: ~$420-490. For CPU-only: MiniCPM-V via llama.cpp at 30-60 seconds per frame — 30-60 minutes for a 1-minute video, workable for occasional use.
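For overnight runs, make the loop resumable so a crash or reboot doesn't discard hours of work. A minimal sketch, assuming the frames/ layout from the walkthrough above; captions.txt is a placeholder output path and the prompt is illustrative:

```python
import os, glob, ollama

# Frames already captioned in a previous run (one "path<TAB>caption" line each)
done = set()
if os.path.exists("captions.txt"):
    with open("captions.txt") as f:
        done = {line.split("\t", 1)[0] for line in f}

with open("captions.txt", "a") as out:
    for frame in sorted(glob.glob("frames/*.jpg")):
        if frame in done:
            continue  # resume where the last run stopped
        resp = ollama.chat(model="qwen2.5-vl:7b", messages=[{
            "role": "user",
            "content": "Describe what's happening in this frame.",
            "images": [frame],
        }])
        text = resp["message"]["content"].replace("\n", " ")  # one line per frame
        out.write(f"{frame}\t{text}\n")
        out.flush()  # a crash loses at most the frame in flight
```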
The serious setup
Used RTX 3090 24 GB ($700-900, see /hardware/rtx-3090). Runs Qwen2.5-VL 72B at 10-20 seconds per frame — the highest-quality local video understanding. For production video analysis (surveillance, content moderation), run Qwen2.5-VL 7B on 2× RTX 3060 in parallel to analyze two videos simultaneously — a sketch of splitting work across two cards follows below. Total: ~$1,500-2,200. For the best quality-to-speed ratio: Qwen2.5-VL 7B on an RTX 4090 ($2,000) at 1-3 seconds per frame. Video understanding scales linearly with frame count — a faster GPU means more frames processed per hour.
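One way to use two cards is one Ollama server per GPU, with frames round-robined between them. A minimal sketch; the second port (11435) and the GPU pinning are assumptions about your setup, e.g. a second server started with `CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve`:

```python
from concurrent.futures import ThreadPoolExecutor
import glob, ollama

# One client per server; each server is pinned to one GPU at launch
clients = [ollama.Client(host=f"http://127.0.0.1:{port}") for port in (11434, 11435)]

def describe(job):
    i, frame = job
    client = clients[i % len(clients)]  # round-robin frames across the GPUs
    resp = client.chat(model="qwen2.5-vl:7b", messages=[{
        "role": "user",
        "content": "Describe what's happening in this frame.",
        "images": [frame],
    }])
    return frame, resp["message"]["content"]

frames = sorted(glob.glob("frames/*.jpg"))
with ThreadPoolExecutor(max_workers=len(clients)) as pool:
    for frame, desc in pool.map(describe, enumerate(frames)):
        print(frame, "->", desc)
```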
Common beginner mistake
The mistake: feeding a 10-minute video directly to a VLM and asking "Summarize this video."

Why it fails: VLMs have context limits — Qwen2.5-VL supports video input but typically samples 16-64 frames total. A 10-minute video at 1 fps has 600 frames, so the model either samples sparsely (missing key moments) or exceeds its context window. You get a summary of whatever frames happened to be sampled.

The fix: always extract keyframes yourself with ffmpeg at a frame rate that fits your use case. For sports: 10 fps (action is fast). For surveillance: 1 fps (changes are slow). For lectures: scene-change detection — only process frames when the slide changes (see the sketch below). Feed the model individual frames with timestamps. Control your sampling rate — don't trust the model's default video loader. The model sees only what you show it; if you skip a frame, the model misses the event.
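For the slide/scene-change case, simple frame differencing is often enough, and it uses the opencv-python package installed in the walkthrough above. A minimal sketch; the downscale resolution and the difference threshold of 15 are guesses to tune per video, and input.mp4 is a placeholder path:

```python
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if metadata is missing
prev_gray, saved, idx = None, 0, 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Downscale + grayscale so the comparison is cheap and noise-tolerant
    gray = cv2.cvtColor(cv2.resize(frame, (320, 180)), cv2.COLOR_BGR2GRAY)
    # Keep a frame only when it differs enough from the last kept one
    if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > 15:
        cv2.imwrite(f"frames/scene_{saved:04d}_t{idx / fps:.1f}s.jpg", frame)
        prev_gray, saved = gray, saved + 1
    idx += 1

cap.release()
print(f"Kept {saved} keyframes")
```

Embedding the timestamp in the filename keeps the time reference attached when you later prompt the model frame by frame.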
Recommended setup for video understanding
Browse all tools for runtimes that fit this workload.
Reality check
Local video gen is genuinely possible in 2026 (LTX-Video, Mochi) but VRAM-hungry. 24 GB is the working minimum; 32 GB is the comfort zone for long-form workflows. Below 24 GB, video gen isn't realistic with current models.
Common mistakes
- Trying video gen on 16 GB cards (model + KV cache doesn't fit)
- Underestimating runtime VRAM (peak draw ~1.5× model size on long sequences)
- Mixing video gen with concurrent LLM serving on same GPU
- Assuming Apple Silicon matches CUDA for video gen — it's viable, but 30-50% slower
What breaks first
The errors most operators hit when running video understanding locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle video understanding before committing money.