Neural network architectures

Vision-Language Model (VLM)

Also known as: vision-language model

A Vision-Language Model (VLM) processes both images and text, enabling tasks like image captioning, visual question answering, and document understanding. In local AI, VLMs typically consist of a vision encoder (e.g., CLIP) and a language model (e.g., Llama) fused via a projection layer. Operators encounter VLMs when running multimodal models like LLaVA, CogVLM, or Qwen-VL. VRAM usage is higher than for text-only models because both the vision encoder and the language model must fit in memory: a 7B VLM at Q4 may require ~6-8 GB of VRAM, plus additional memory for image embeddings.

Deeper dive

VLMs pair a vision encoder (often a ViT or CLIP variant) that converts images into embeddings with a language model (a decoder-only transformer) that generates text conditioned on those embeddings. The projection layer aligns the vision and text spaces. Common architectures include LLaVA (simple projection), Qwen-VL (cross-attention), and CogVLM (deep fusion). Operators running VLMs locally must consider three things: (1) VRAM, since vision encoders add 1-3 GB at FP16; (2) context length, since image tokens (e.g., 576 for CLIP) consume context; and (3) quantization, since both the encoder and the LLM can be quantized, though encoder quantization is less common. Inference speed is typically slower than for text-only models due to the extra encoder pass. Tools like llama.cpp support VLMs via multimodal patches, while Ollama and LM Studio offer built-in VLM support for models like LLaVA.
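
The fusion step is easiest to see as a few tensor operations. The sketch below is a conceptual illustration of a LLaVA-style design rather than any specific model's implementation: a vision encoder yields patch embeddings, a small MLP projector maps them into the LLM's embedding space, and the projected image tokens are concatenated with the text embeddings before decoding. The class name, layer sizes, and token counts are illustrative assumptions.

import torch
import torch.nn as nn

class LlavaStyleFusion(nn.Module):
    """Minimal sketch: project vision embeddings into the LLM's token space."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Two-layer MLP projector in the style of LLaVA 1.5/1.6 (illustrative sizes).
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_embeds, text_embeds):
        # image_embeds: (batch, 576, vision_dim) patch embeddings from a CLIP/ViT encoder
        # text_embeds:  (batch, seq, llm_dim) token embeddings from the language model
        image_tokens = self.projector(image_embeds)   # align with the LLM embedding space
        # Image tokens are spliced into the sequence, so they consume context
        # exactly like text tokens before the decoder-only LLM generates output.
        return torch.cat([image_tokens, text_embeds], dim=1)

fusion = LlavaStyleFusion()
image_embeds = torch.randn(1, 576, 1024)        # 576 image tokens (CLIP ViT-L/14 at 336 px)
text_embeds = torch.randn(1, 32, 4096)          # a 32-token prompt, already embedded
print(fusion(image_embeds, text_embeds).shape)  # torch.Size([1, 608, 4096])

The 576 + 32 = 608 tokens in the output show why image inputs eat into the usable context window.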

Practical example

Running LLaVA 1.6 7B (Q4_K_M) on an RTX 3060 12 GB: the model uses ~6 GB for the LLM, ~1 GB for the vision encoder, and ~1 GB for context. With a 4K context, the rig stays within VRAM and achieves ~15 tok/s. On an 8 GB card, the same model would exceed VRAM, forcing system-RAM offload and dropping to ~3 tok/s. Operators should check VRAM requirements before pulling a VLM.
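
That budget is simple enough to write down as arithmetic. The figures below restate the rough estimates from this example rather than measured values, and the 0.5 GB overhead allowance for the CUDA context and desktop is an added assumption:

llm_weights_gb    = 6.0   # LLaVA 1.6 7B weights at Q4_K_M
vision_encoder_gb = 1.0   # vision encoder + mmproj
kv_cache_gb       = 1.0   # roughly a 4K context of KV cache
overhead_gb       = 0.5   # CUDA context, desktop, fragmentation (assumed allowance)

needed_gb = llm_weights_gb + vision_encoder_gb + kv_cache_gb + overhead_gb

for card_gb in (12, 8):
    if needed_gb <= card_gb:
        print(f"{card_gb} GB card: ~{needed_gb:.1f} GB needed -> fits in VRAM")
    else:
        print(f"{card_gb} GB card: ~{needed_gb:.1f} GB needed -> spills to system RAM")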

Workflow example

In LM Studio, operators download a VLM like 'llava-v1.6-mistral-7b-Q4_K_M.gguf', load it, and select an image via the UI. The runtime processes the image through the vision encoder, then the LLM generates a caption or answers questions. In llama.cpp, the multimodal example binary runs the same workflow: ./llama-llava-cli -m llava-v1.6-7b-Q4_K_M.gguf --mmproj llava-v1.6-7b-mmproj-f16.gguf --image photo.jpg -p "Describe this image". Ollama supports VLMs with ollama run llava:7b and automatically handles the multimodal projection.
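
The same Ollama model can also be driven from a script over its local HTTP API, which is handy for batch captioning. The sketch below assumes a default Ollama install listening on port 11434 with llava:7b already pulled and a local photo.jpg on disk; verify the payload against the API reference of your installed Ollama version.

import base64
import json
import urllib.request

# Read and base64-encode the image; Ollama's /api/generate endpoint accepts
# base64-encoded images in the "images" field for multimodal models.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = json.dumps({
    "model": "llava:7b",
    "prompt": "Describe this image",
    "images": [image_b64],
    "stream": False,          # return one JSON object instead of a token stream
}).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])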

Related terms

  • Large Language Model (LLM)
  • Diffusion Model
  • Vision Transformer (ViT)
  • Multimodal AI

Reviewed by Fredoline Eruo. See our editorial policy.
