Video Understanding
Comprehending video content — captioning, Q&A, action recognition. Multimodal video LLMs like Qwen2.5-VL handle this.
Setup walkthrough
- Install Ollama → `ollama pull qwen2.5-vl:7b` (~5 GB — a strong video understanding model).
- Extract keyframes from your video: `pip install opencv-python` → `ffmpeg -i input.mp4 -vf "fps=1" frames/frame_%04d.jpg` (one frame per second).
- Python script to analyze the extracted frames:
```python
import ollama, glob

# Caption each extracted keyframe in order
frames = sorted(glob.glob("frames/*.jpg"))
for i, frame in enumerate(frames):
    with open(frame, "rb") as f:
        img = f.read()
    resp = ollama.chat(model="qwen2.5-vl:7b", messages=[{
        "role": "user",
        "content": f"Frame {i+1}/{len(frames)}: Describe what's happening. Focus on actions, people, and scene changes.",
        "images": [img],  # raw bytes; the ollama client also accepts file paths
    }])
    print(f"Frame {i+1}: {resp['message']['content']}")
```
- For a 1-minute video: 60 frames, ~5-10 minutes of total processing on a 12 GB GPU.
- For temporal understanding (action sequences, events): Qwen2.5-VL supports video input natively — feed it a video file and it auto-samples frames. Check whether your runtime actually exposes video input before relying on this; if it only accepts images (as Ollama's chat API currently does), keep extracting frames yourself. The sketch after this list shows one way to recover temporal structure from per-frame captions.
- Use cases: security footage review, sports analysis, content moderation, video search.
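A common follow-up question is "what happened in this video?" rather than per-frame output. One workable pattern: timestamp each caption, then make a second text-only call over the caption list. A minimal sketch, assuming frames were extracted at 1 fps as above (so frame i is second i); both prompts are illustrative, not fixed:

```python
import ollama, glob

# Pass 1: one short caption per keyframe
captions = []
for i, frame in enumerate(sorted(glob.glob("frames/*.jpg"))):
    resp = ollama.chat(model="qwen2.5-vl:7b", messages=[{
        "role": "user",
        "content": "Describe this frame in one sentence.",
        "images": [frame],  # the ollama client accepts file paths as well as bytes
    }])
    # At 1 fps, frame i corresponds to second i of the video
    captions.append(f"[{i // 60:02d}:{i % 60:02d}] {resp['message']['content']}")

# Pass 2: summarize the timestamped captions as plain text (no images attached)
summary = ollama.chat(model="qwen2.5-vl:7b", messages=[{
    "role": "user",
    "content": "These are timestamped captions of frames from one video:\n"
               + "\n".join(captions)
               + "\n\nSummarize the video's events in chronological order.",
}])
print(summary["message"]["content"])
```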
The cheap setup
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs Qwen2.5-VL 7B at 5-10 seconds per frame — a 1-minute video (60 frames at 1 fps) takes 5-10 minutes to analyze. For longer videos (10-60 minutes), process overnight — a resumable-loop sketch follows below. Pair with a Ryzen 5 5600 + 32 GB DDR4 + 2 TB NVMe (video storage). Total: ~$420-490. For CPU-only: MiniCPM-V via llama.cpp at 30-60 seconds per frame — 30-60 minutes for a 1-minute video, workable for occasional use.
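For overnight runs, make the loop resumable so a crash or reboot doesn't discard hours of work. A minimal sketch, assuming the frames/ layout from the walkthrough above; captions.txt is a placeholder output path and the prompt is illustrative:

```python
import os, glob, ollama

# Frames already captioned in a previous run (one "path<TAB>caption" line each)
done = set()
if os.path.exists("captions.txt"):
    with open("captions.txt") as f:
        done = {line.split("\t", 1)[0] for line in f}

with open("captions.txt", "a") as out:
    for frame in sorted(glob.glob("frames/*.jpg")):
        if frame in done:
            continue  # resume where the last run stopped
        resp = ollama.chat(model="qwen2.5-vl:7b", messages=[{
            "role": "user",
            "content": "Describe what's happening in this frame.",
            "images": [frame],
        }])
        text = resp["message"]["content"].replace("\n", " ")  # one line per frame
        out.write(f"{frame}\t{text}\n")
        out.flush()  # a crash loses at most the frame in flight
```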
The serious setup
Used RTX 3090 24 GB ($700-900, see /hardware/rtx-3090). Runs Qwen2.5-VL 72B at 10-20 seconds per frame — the highest-quality local video understanding. For production video analysis (surveillance, content moderation), run Qwen2.5-VL 7B on 2× RTX 3060 in parallel to analyze two videos simultaneously — a sketch of splitting work across two cards follows below. Total: ~$1,500-2,200. For the best quality-to-speed ratio: Qwen2.5-VL 7B on an RTX 4090 ($2,000) at 1-3 seconds per frame. Video understanding scales linearly with frame count — a faster GPU means more frames processed per hour.
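One way to use two cards is one Ollama server per GPU, with frames round-robined between them. A minimal sketch; the second port (11435) and the GPU pinning are assumptions about your setup, e.g. a second server started with `CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve`:

```python
from concurrent.futures import ThreadPoolExecutor
import glob, ollama

# One client per server; each server is pinned to one GPU at launch
clients = [ollama.Client(host=f"http://127.0.0.1:{port}") for port in (11434, 11435)]

def describe(job):
    i, frame = job
    client = clients[i % len(clients)]  # round-robin frames across the GPUs
    resp = client.chat(model="qwen2.5-vl:7b", messages=[{
        "role": "user",
        "content": "Describe what's happening in this frame.",
        "images": [frame],
    }])
    return frame, resp["message"]["content"]

frames = sorted(glob.glob("frames/*.jpg"))
with ThreadPoolExecutor(max_workers=len(clients)) as pool:
    for frame, desc in pool.map(describe, enumerate(frames)):
        print(frame, "->", desc)
```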
Common beginner mistake
The mistake: feeding a 10-minute video directly to a VLM and asking "Summarize this video."

Why it fails: VLMs have context limits — Qwen2.5-VL supports video input but typically samples 16-64 frames total. A 10-minute video at 1 fps has 600 frames, so the model either samples sparsely (missing key moments) or exceeds its context window. You get a summary of whatever frames happened to be sampled.

The fix: always extract keyframes yourself with ffmpeg at a frame rate that fits your use case. For sports: 10 fps (action is fast). For surveillance: 1 fps (changes are slow). For lectures: scene-change detection — only process frames when the slide changes (see the sketch below). Feed the model individual frames with timestamps. Control your sampling rate — don't trust the model's default video loader. The model sees only what you show it; if you skip a frame, the model misses the event.
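For the slide/scene-change case, simple frame differencing is often enough, and it uses the opencv-python package installed in the walkthrough above. A minimal sketch; the downscale resolution and the difference threshold of 15 are guesses to tune per video, and input.mp4 is a placeholder path:

```python
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if metadata is missing
prev_gray, saved, idx = None, 0, 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Downscale + grayscale so the comparison is cheap and noise-tolerant
    gray = cv2.cvtColor(cv2.resize(frame, (320, 180)), cv2.COLOR_BGR2GRAY)
    # Keep a frame only when it differs enough from the last kept one
    if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > 15:
        cv2.imwrite(f"frames/scene_{saved:04d}_t{idx / fps:.1f}s.jpg", frame)
        prev_gray, saved = gray, saved + 1
    idx += 1

cap.release()
print(f"Kept {saved} keyframes")
```

Embedding the timestamp in the filename keeps the time reference attached when you later prompt the model frame by frame.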
Recommended setup for video understanding
Browse all tools for runtimes that fit this workload.
Reality check
Local video gen is genuinely possible in 2026 (LTX-Video, Mochi) but VRAM-hungry. 24 GB is the working minimum; 32 GB is the comfort zone for long-form workflows. Below 24 GB, video gen isn't realistic with current models.
Common mistakes
- Trying video gen on 16 GB cards (model + KV cache doesn't fit)
- Underestimating runtime VRAM (peak draw ~1.5× model size on long sequences)
- Mixing video gen with concurrent LLM serving on same GPU
- Assuming Apple Silicon matches CUDA for video gen — it's viable, but 30-50% slower
What breaks first
The errors most operators hit when running video understanding locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle video understanding before committing money.