26B parameters · Commercial OK · Multimodal · Reviewed May 2026

InternVL 2.5 26B

The mid-tier entry in the InternVL 2.5 family — a Shanghai AI Lab vision-language model with strong document and chart understanding.

License: MIT · Released Dec 5, 2024 · Context: 32,768 tokens


How to run it

InternVL 2.5 26B is OpenGVLab's 26B-parameter vision-language model — the smaller sibling of InternVL 2.5 78B. It pairs a 20B InternLM2.5 text backbone with the 6B InternViT vision encoder and is built for document understanding, OCR, and visual QA.

Run it at Q4_K_M via llama.cpp with a vision-capable (llava-style) server build. Budget roughly 15 GB for the Q4_K_M text weights plus ~3-5 GB for the vision stack. Minimum VRAM is 16 GB — an RTX 4080 (16 GB) runs Q4_K_M text-only, or Q3_K_M with vision. Recommended: RTX 4090 24 GB at Q4_K_M with vision. Throughput: ~30-50 tok/s on an RTX 4090 at Q4_K_M text-only; vision encoding adds ~1-3 s per image.

One architectural caveat: the InternViT encoder is large (6B), so vision VRAM is proportionally higher than in Llama/Qwen vision models with a comparable text backbone. Check that your llama.cpp build supports InternVL 26B — support may differ from the 78B.

Use it for document OCR, chart understanding, visual QA, and UI screenshot analysis. Skip it for text-only general chat (a standard text model of similar size is the better pick). Context is 32K advertised, but the practical ceiling with vision at Q4 on 24 GB is 4-8K. Need a larger vision model? See InternVL 2.5 78B.
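If you want to sanity-check GGUF outputs against the original weights (useful given the tokenizer caveats below), OpenGVLab's reference path is transformers with remote code. A minimal sketch assuming the model card's model.chat() interface — the single-tile preprocessing here is a simplification of the reference load_image() helper, and bf16 weights need roughly 52 GB, so treat this as a multi-GPU or cloud exercise rather than a 24 GB one:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/InternVL2_5-26B"

# InternVL ships custom modeling code, so trust_remote_code is required.
# bf16 weights are ~52 GB: use device_map / quantization on smaller setups.
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID, trust_remote_code=True, use_fast=False
)

# Simplified single-tile preprocessing (448x448, ImageNet normalization).
# The model card's load_image() does dynamic multi-tile preprocessing
# instead — use it for production-quality results.
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("chart.png").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

# model.chat() is InternVL's remote-code generation entry point.
response = model.chat(
    tokenizer,
    pixel_values,
    "<image>\nSummarize this chart.",
    dict(max_new_tokens=512, do_sample=False),
)
print(response)
```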

Hardware guidance

Minimum: RTX 3060 12GB at Q3_K_M + vision (tight). Recommended: RTX 4090 24GB at Q4_K_M + vision (8K context).

VRAM math: 26B weights at Q4 ≈ 15 GB, InternViT encoder ~4-6 GB, KV cache at 8K ~5 GB — total with vision ~24-26 GB.

  • RTX 4090 24GB: Q4 + vision + 4K context — tight. Offload vision-encoder activations for headroom.
  • RTX 4080 16GB: Q3_K_M + vision at 4K.
  • MacBook Pro M4 Max 36GB+: Q4 + vision at 5-10 tok/s.
  • Cloud: A10 24GB at Q4_K_M + vision.

InternViT is the bottleneck — budget 4-6 GB specifically for the vision encoder. AWQ INT4 drops the text weights to ~13 GB, which helps the fit.
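The budget above is simple addition, so it's easy to re-run for your own card and context length. A back-of-envelope sketch using this page's estimates (not measurements):

```python
# Back-of-envelope VRAM budget for InternVL 2.5 26B with vision.
# All per-component figures are this page's estimates, not measurements.
text_q4_gb  = 15.0  # full weight file at Q4_K_M
vision_gb   = 5.0   # InternViT encoder + projector, upper estimate
kv_8k_gb    = 5.0   # KV cache at 8K context
overhead_gb = 1.0   # CUDA context, activations, fragmentation (assumed)

total_8k = text_q4_gb + vision_gb + kv_8k_gb + overhead_gb
print(f"8K context: ~{total_8k:.0f} GB")    # ~26 GB -> over a 24 GB card

# KV cache scales roughly linearly with context, so 4K halves it:
total_4k = text_q4_gb + vision_gb + kv_8k_gb / 2 + overhead_gb
print(f"4K context: ~{total_4k:.1f} GB")    # ~23.5 GB -> tight on 24 GB
```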

What breaks first

  1. InternViT VRAM domination. The vision encoder is proportionally huge: the 6B InternViT is roughly a quarter of total parameters and takes 25-30% of total VRAM — a much higher ratio than Llama/Qwen vision models.
  2. Multimodal GGUF scarcity. Pre-converted InternVL 26B GGUFs with vision are rare. You may need to convert from the Hugging Face weights yourself or run text-only.
  3. Resolution sensitivity. InternViT's quality degrades sharply on low-resolution inputs, but high-resolution inputs spike vision-encoder VRAM by 3-5 GB. Find the resolution sweet spot for your use case — the tile arithmetic after this list shows why.
  4. Tokenizer format. InternVL uses a custom vision+text tokenizer format, and standard llama.cpp llava paths may not handle its multimodal token embedding correctly. Validate vision outputs against the reference implementation.
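Why resolution moves VRAM: InternVL's dynamic preprocessing slices each image into 448×448 tiles, and every tile costs a fixed number of vision tokens — 256 per tile after pixel shuffle in the InternVL papers, though treat that constant and the tile cap as assumptions for your exact build. A rough counting sketch:

```python
import math

TILE = 448             # InternViT input tile size
TOKENS_PER_TILE = 256  # vision tokens per tile after pixel shuffle (assumed)

def vision_tokens(width: int, height: int, max_tiles: int = 12) -> int:
    """Rough vision-token count for a dynamically tiled image.
    The real InternVL helper picks an aspect-ratio-matched grid rather
    than naive ceiling division, and appends one thumbnail tile."""
    tiles = math.ceil(width / TILE) * math.ceil(height / TILE)
    tiles = min(tiles, max_tiles) + 1  # +1 for the thumbnail tile
    return tiles * TOKENS_PER_TILE

print(vision_tokens(640, 480))    # 5 tiles -> 1280 tokens
print(vision_tokens(1920, 1080))  # capped at 12 + thumbnail -> 3328 tokens
```

More tiles means both more encoder activations and a longer effective prompt — that is where the 3-5 GB spike comes from.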

Runtime recommendation

llama.cpp with an InternVL-compatible llava-style server — verify InternVL support in your specific build before relying on it. OpenGVLab's reference code is the fallback, and vLLM works if InternVL is registered in your version. Avoid Ollama unless an InternVL vision tag actually exists.
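Whichever server you land on, anything exposing an OpenAI-compatible chat endpoint (recent llama.cpp server builds and vLLM both do) takes images the same way. A sketch — the port, model name, and whether your particular build accepts image parts are all assumptions:

```python
import base64
import json
import urllib.request

# Assumes a multimodal-capable OpenAI-compatible server on localhost:8080.
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "internvl-2.5-26b",  # placeholder; some servers ignore this
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the line items from this invoice."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    "max_tokens": 512,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```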

Common beginner mistakes

  • Mistake: Expecting InternVL 26B to have the same vision-to-text VRAM ratio as Llama 3.2 Vision. Fix: InternViT is ~6B — roughly 20× larger than a CLIP-class encoder. Budget 4-6 GB for the vision encoder alone; a 16 GB GPU may not fit vision + text at Q4.
  • Mistake: Using the InternVL 26B vision projector with the 78B GGUF. Fix: Different model sizes use different projectors. Match the files exactly.
  • Mistake: Assuming 26B = half the quality of 78B. Fix: The 26B is significantly weaker at complex visual reasoning. 78B is the recommendation for document understanding and OCR; 26B is the budget option.
  • Mistake: Sending images without preprocessing. Fix: InternVL expects specific image preprocessing — use the model's image processor or resize to the encoder's expected input size (a simplified tiling sketch follows this list).
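For the last fix, the shape of correct preprocessing looks roughly like this — a simplified, hypothetical stand-in for the dynamic_preprocess() helper in OpenGVLab's model card (the real one also filters candidate grids by covered area and appends a thumbnail tile; prefer it for actual use):

```python
from PIL import Image

TILE = 448  # InternViT's expected tile size

def pick_grid(width: int, height: int, max_tiles: int = 12) -> tuple[int, int]:
    """Choose the (cols, rows) tile grid whose aspect ratio best
    matches the image. Simplified vs. InternVL's reference logic."""
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(width / height - cols / rows)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best

def tile_image(path: str, max_tiles: int = 12) -> list[Image.Image]:
    """Resize to the best-matching grid, then cut 448x448 tiles."""
    img = Image.open(path).convert("RGB")
    cols, rows = pick_grid(*img.size, max_tiles)
    img = img.resize((cols * TILE, rows * TILE))
    return [
        img.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]

tiles = tile_image("screenshot.png")
print(f"{len(tiles)} tiles of {tiles[0].size}")
```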

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Family siblings (internvl-2.5)
  • InternVL 2.5 26B (26B) — you are here
  • InternVL 2.5 78B (78B) — datacenter-class

Distilled / fine-tuned from this
  • InternVL 2.5 78B (78B) — datacenter-class

Strengths

  • MIT license
  • Strong on charts and documents

Weaknesses

  • Smaller community than Qwen-VL

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization   File size   VRAM required
Q4_K_M         16.0 GB     20 GB

Get the model

HuggingFace

Original weights

huggingface.co/OpenGVLab/InternVL2_5-26B

Source repository — you'll need to quantize the weights yourself.

Hardware that runs this

Cards with enough VRAM for at least one quantization of InternVL 2.5 26B.

NVIDIA GB200 NVL72
13824GB · nvidia
AMD Instinct MI355X
288GB · amd
AMD Instinct MI325X
256GB · amd
AMD Instinct MI300X
192GB · amd
NVIDIA B200
192GB · nvidia
NVIDIA H100 NVL
188GB · nvidia
NVIDIA H200
141GB · nvidia
Intel Gaudi 3
128GB · intel

Frequently asked

What's the minimum VRAM to run InternVL 2.5 26B?

20 GB of VRAM is enough to run InternVL 2.5 26B at the Q4_K_M quantization (file size 16.0 GB). Budget several more GB if you load the vision encoder — see the hardware guidance above. Higher-quality quantizations need more.

Can I use InternVL 2.5 26B commercially?

Yes — InternVL 2.5 26B ships under the MIT license, which permits commercial use. Always read the license text before deployment.

What's the context length of InternVL 2.5 26B?

InternVL 2.5 26B supports a context window of 32,768 tokens (32K).

Does InternVL 2.5 26B support images?

Yes — InternVL 2.5 26B is multimodal and accepts text + vision inputs. Vision support requires a runner that handles its image-conditioning architecture.

Source: huggingface.co/OpenGVLab/InternVL2_5-26B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier
Models in the same parameter band as this one
  • Qwen 3 30B-A3B
    qwen · 30B
    unrated
  • Gemma 4 31B Dense
    gemma · 31B
    unrated
  • Nemotron 3 Nano (30B-A3B)
    other · 30B
    unrated
  • DeepSeek Coder V3
    deepseek · 33B
    unrated
Step up
More capable — bigger memory footprint
  • Llama 3.3 70B Instruct
    llama · 70B
    9.1/10
  • DeepSeek R1 Distill Llama 70B
    deepseek · 70B
    9.0/10
Step down
Smaller — faster, runs on weaker hardware
  • DeepSeek V3 Lite (16B MoE)
    deepseek · 16B
    unrated
  • Mistral Small 3 24B
    mistral · 24B
    8.4/10