by OpenGVLab (Shanghai AI Lab)
OpenGVLab's open VLM family. InternVL 2.5 series spans 1B to 78B. Strong multilingual vision capability; dense alternative to Qwen-VL.
Start with InternVL2 8B at FP16 via vLLM on an RTX 4090 24 GB — InternVL2 is the strongest open-weight vision-language model family at each size class, with the 8B variant delivering OCR, document understanding, chart reading, and general VQA at quality matching GPT-4V on several benchmarks. The 8B runs entirely in GPU memory at FP16 (~16 GB VRAM for the LLM plus vision encoder). For higher quality, InternVL2 76B at Q4 (~48 GB) fits on 2× RTX 4090 or a single Mac Studio M3 Ultra. Skip InternVL 1.5 — the v2 generation supersedes it across the board, with a broader size range, stronger training, and dynamic high-resolution input (up to 4K images via tiled processing). InternVL2 is released under the MIT license with no commercial restrictions, though the Llama-3-based 76B additionally inherits the Llama 3 license for its backbone. For OCR-first workloads, InternVL2 outperforms LLaVA by 15-20 points on OCR-heavy benchmarks.
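As a concrete starting point, here is a minimal single-image inference sketch, assuming vLLM 0.6.x with InternVL2 support and the Hugging Face model id OpenGVLab/InternVL2-8B; the file name invoice.png and the question are placeholders.

```python
# Minimal sketch: single-image VQA with InternVL2-8B under vLLM on one 24 GB GPU.
# Assumptions: vLLM 0.6.x InternVL2 support, model id OpenGVLab/InternVL2-8B,
# and a local "invoice.png" as the test image.
from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "OpenGVLab/InternVL2-8B"

llm = LLM(
    model=MODEL,
    trust_remote_code=True,            # InternVL2 ships custom preprocessing code
    dtype="float16",
    max_model_len=8192,                # keep the KV cache inside 24 GB alongside the weights
    limit_mm_per_prompt={"image": 1},  # one image per prompt for this sketch
)

# Build the InternLM2-style chat prompt with the <image> placeholder token.
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
messages = [{"role": "user", "content": "<image>\nRead the total amount on this receipt."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image = Image.open("invoice.png").convert("RGB")
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```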
For single-user local: vLLM 0.6.2+ with the InternVL2 multimodal backend on an RTX 4090 24 GB. Ollama supports InternVL2 via the llava backend (the Modelfile must specify the vision tower — FROM ./internvl2-8b with the proper projector config). For multi-user serving: vLLM on 2× H100 SXM for InternVL2 76B — image preprocessing (tiled dynamic resolution) is CPU-bound, so allocate 4+ CPU cores per concurrent request. For document/OCR pipelines: deploy InternVL2 behind SGLang v0.2.5+ with constrained JSON output for structured data extraction from documents (sketched below). Architecturally, InternVL2 pairs an InternViT vision encoder with an InternLM2.5 text backbone in the 8B (the 76B swaps in a Llama-3-70B-based backbone) — the 8B's 300M-parameter encoder is ~0.6 GB at FP16, the MLP projector is ~30 MB, and the LLM backbone dominates VRAM. The dynamic tiled resolution adds 200-500 ms of image preprocessing latency per request — cache processed vision embeddings for repeated document types.
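A hedged sketch of the document-extraction path: it assumes SGLang (or vLLM) is serving InternVL2-8B behind its OpenAI-compatible API at localhost:30000, that your server version supports JSON-constrained responses, and that the invoice field names are illustrative only.

```python
# Sketch: structured field extraction from a scanned document via an
# OpenAI-compatible endpoint (SGLang or vLLM serving InternVL2-8B).
# Assumptions: server at localhost:30000, JSON response mode enabled,
# field names (vendor, invoice_date, total) are illustrative.
import base64
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Extract vendor, invoice_date (ISO 8601), and total as JSON."},
        ],
    }],
    # Constrained JSON output; assumption: your server version accepts this flag.
    response_format={"type": "json_object"},
    temperature=0.0,
    max_tokens=256,
)
fields = json.loads(resp.choices[0].message.content)
print(fields)
```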
Verify InternVL2 runs on your specific hardware before committing to a purchase.
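Before buying anything, a back-of-the-envelope VRAM estimate catches obvious misfits; the per-parameter byte counts and overhead allowances below are rough assumptions, not measurements.

```python
# Rough VRAM budgeting sketch for the configurations discussed above.
# The bytes-per-parameter and overhead figures are assumptions, not measured values.
def vram_gb(params_b: float, bytes_per_param: float, overhead_gb: float = 4.0) -> float:
    """Weights plus a flat allowance for KV cache, activations, and CUDA overhead."""
    return params_b * bytes_per_param + overhead_gb

# InternVL2-8B at FP16: ~8B params x 2 bytes + overhead -> fits a 24 GB card.
print(f"InternVL2-8B  FP16: ~{vram_gb(8, 2.0):.0f} GB")

# InternVL2-76B at Q4 (~0.55 bytes/param incl. quant scales): ~48 GB across GPUs.
print(f"InternVL2-76B Q4:   ~{vram_gb(76, 0.55, overhead_gb=6.0):.0f} GB")
```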