Vision-Language Model (VLM)
Also known as: VLM, multimodal LLM
A Vision-Language Model (VLM) processes both images and text, enabling tasks like image captioning, visual question answering, and document understanding. In local AI, VLMs typically consist of a vision encoder (e.g., CLIP) and a language model (e.g., Llama) fused via a projection layer. Operators encounter VLMs when running multimodal models like LLaVA, CogVLM, or Qwen-VL. VRAM usage is higher than text-only models because both the vision encoder and language model must fit in memory—a 7B VLM at Q4 may require ~6-8 GB VRAM, plus additional memory for image embeddings.
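To make that encoder-plus-LLM pipeline concrete, here is a minimal PyTorch sketch of how a projection layer bridges the two components. The dimensions (1024 for the encoder output, 4096 for a 7B LLM's hidden size) and the identity-stub encoder are illustrative assumptions, not any specific model's internals:

    import torch
    import torch.nn as nn

    # Illustrative sizes: ~CLIP ViT-L embedding width and a 7B LLM hidden size (assumed).
    VISION_DIM, LLM_DIM, NUM_IMAGE_TOKENS = 1024, 4096, 576

    vision_encoder = nn.Identity()              # stand-in for a real CLIP/ViT encoder
    projector = nn.Linear(VISION_DIM, LLM_DIM)  # the projection layer aligning the two spaces

    # Fake encoder output: one embedding per image patch.
    image_patches = torch.randn(1, NUM_IMAGE_TOKENS, VISION_DIM)
    image_embeds = projector(vision_encoder(image_patches))  # (1, 576, 4096)

    # Text token embeddings from the LLM side (random here, not a real tokenizer).
    text_embeds = torch.randn(1, 12, LLM_DIM)

    # LLaVA-style fusion: prepend image tokens to the text sequence.
    llm_input = torch.cat([image_embeds, text_embeds], dim=1)
    print(llm_input.shape)  # torch.Size([1, 588, 4096]): 576 image + 12 text tokens

Because the projected image embeddings are concatenated into the same sequence as text embeddings, they occupy context slots exactly like text tokens, which is why VLM context budgets shrink per image.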
Deeper dive
VLMs combine a vision encoder (often a ViT or CLIP variant) that converts images into embeddings with a language model (a decoder-only transformer) that generates text conditioned on those embeddings; a projection layer aligns the vision and text embedding spaces. Common architectures include LLaVA (simple linear/MLP projection), Qwen-VL (cross-attention adapter), and CogVLM (deep fusion). Operators running VLMs locally must consider: (1) VRAM: vision encoders add roughly 1-3 GB at FP16; (2) context length: image tokens (e.g., 576 for CLIP ViT-L/14 at 336 px) consume context like text tokens; (3) quantization: both the encoder and the LLM can be quantized, but encoder quantization is less common. Inference is typically slower than for text-only models because of the extra encoder pass. llama.cpp supports VLMs through its multimodal projector (mmproj) files, while Ollama and LM Studio offer built-in VLM support for models like LLaVA.
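Point (2) is easy to underestimate: every attached image spends tokens from the context window before any text is processed. A quick sketch of the arithmetic, assuming the 576-token CLIP figure from above:

    # Image tokens consume context exactly like text tokens, so each attached
    # image shrinks the budget left for the prompt and the response.
    CLIP_IMAGE_TOKENS = 576  # CLIP ViT-L/14 at 336 px: a 24 x 24 patch grid

    def text_budget(context_length: int, num_images: int,
                    tokens_per_image: int = CLIP_IMAGE_TOKENS) -> int:
        return context_length - num_images * tokens_per_image

    for n in (1, 2, 4):
        print(f"{n} image(s) in a 4096-token context leaves "
              f"{text_budget(4096, n)} tokens for text")

With four images, fewer than half of a 4K context remains for the conversation itself.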
Practical example
Running LLaVA 1.6 7B (Q4_K_M) on an RTX 3060 12 GB: the model uses ~6 GB for the LLM, ~1 GB for the vision encoder, and ~1 GB for context. With a 4K context, the rig stays within VRAM and achieves ~15 tok/s. On an 8 GB card, the same model would exceed VRAM, forcing system-RAM offload and dropping to ~3 tok/s. Operators should check VRAM requirements before pulling a VLM.
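That pre-pull check can be scripted. The component sizes below are the LLaVA 1.6 7B Q4_K_M estimates from this example; the 1 GB reserved for the OS and display is an assumed safety margin, not a measured figure:

    def fits_in_vram(vram_gb, llm_gb=6.0, encoder_gb=1.0,
                     context_gb=1.0, reserved_gb=1.0):
        # llm_gb/encoder_gb/context_gb: the LLaVA 1.6 7B Q4_K_M estimates above;
        # reserved_gb: assumed headroom for the OS and display.
        needed = llm_gb + encoder_gb + context_gb
        return needed <= vram_gb - reserved_gb, needed

    for card, vram in [("RTX 3060 12 GB", 12), ("8 GB card", 8)]:
        ok, needed = fits_in_vram(vram)
        verdict = "fits" if ok else "spills to system RAM (expect a steep slowdown)"
        print(f"{card}: ~{needed:.0f} GB needed -> {verdict}")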
Workflow example
In LM Studio, operators download a VLM such as 'llava-v1.6-mistral-7b-Q4_K_M.gguf', load it, and attach an image via the UI. The runtime passes the image through the vision encoder, then the LLM generates a caption or answers questions about it. In llama.cpp, the command ./llama-llava-cli -m llava-v1.6-7b-Q4_K_M.gguf --mmproj llava-v1.6-7b-mmproj-f16.gguf --image photo.jpg -p "Describe this image" runs the same workflow. Ollama supports VLMs with ollama run llava:7b and handles the multimodal projection automatically.
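For scripted rather than interactive use, the same Ollama workflow is available over its local HTTP API, which accepts base64-encoded images alongside the prompt. A minimal standard-library sketch, assuming the server is running on its default port (11434) and a photo.jpg exists in the working directory:

    import base64
    import json
    import urllib.request

    # Ollama's generate endpoint takes base64-encoded images for multimodal models.
    with open("photo.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    payload = {
        "model": "llava:7b",
        "prompt": "Describe this image",
        "images": [image_b64],
        "stream": False,  # one JSON object back instead of a token stream
    }

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])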