Phind CodeLlama 34B v2
Overview
Phind-CodeLlama-34B-v2 is Phind's code-specialized fine-tune of CodeLlama 34B. It is an older release, retained for historical and continuity value; the newer Qwen Coder lineage has since surpassed it.
How to run it
Run at Q4_K_M via Ollama (ollama pull phind-codellama:34b-v2) or llama.cpp with -ngl 999 -fa -c 4096. The Q4_K_M file is ~20 GB on disk. Minimum VRAM: 16 GB, e.g. an RTX 4080 (16GB) at Q4_K_M with KV offload. Recommended: RTX 4090 24GB, which runs Q4_K_M comfortably at 8-16K context at roughly 35-55 tok/s. The CodeLlama architecture is well supported across runtimes.

Phind's fine-tune targets code generation with search-augmented context: the model is trained to use retrieved code snippets effectively. It is strong on code generation, debugging, code explanation, and technical Q&A, and weaker on general chat, creative writing, and non-technical tasks. v2 improves on v1 with better instruction-following and broader multi-language code support.

Context: 16K advertised (from the CodeLlama base); the practical ceiling at Q4 on 24 GB is 8-16K. For larger or newer code models, consider DeepSeek Coder V2 236B or Qwen 2.5 Coder 32B.
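If you want to sanity-check the Ollama route from code, the sketch below sends one prompt to a local Ollama server over its REST API (/api/generate) and prints the completion. It assumes Ollama is running on its default port and the tag above has already been pulled.

```python
# Minimal smoke test against a local Ollama server (default port 11434),
# assuming phind-codellama:34b-v2 has already been pulled.
import json
import urllib.request

payload = json.dumps({
    "model": "phind-codellama:34b-v2",
    "prompt": "Write a Python function that parses an ISO-8601 date string.",
    "stream": False,               # return one JSON object instead of a stream
    "options": {"num_ctx": 4096},  # match the -c 4096 suggested above
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```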
Hardware guidance
Minimum: RTX 3060 12GB at Q3_K_M with KV offload. Recommended: RTX 4090 24GB at Q4_K_M (16K context). Optimal: RTX 4090 24GB at Q4_K_M.

VRAM math: 34B dense at Q4_K_M is ≈ 19-20 GB of weights; KV cache at 16K adds ~8 GB, for ~27 GB total at 16K.

- RTX 4090 24GB: Q4 + 8K ≈ 23 GB, fits on-GPU; at 16K (~27 GB), offload the KV cache.
- RTX 3090 24GB: same as the 4090.
- RTX 4080 16GB: Q4 + 2K context on-GPU.
- RTX 5090 32GB: Q4 at 32K context, comfortable.
- MacBook Pro M4 Pro 24GB+: Q4 at 8-15 tok/s.
- Cloud: A10 24GB at Q4_K_M.

Code generation typically doesn't need 16K+ context; 4-8K is sufficient for most coding tasks. AWQ-INT4 drops weights to ~17 GB.
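The VRAM math above reduces to two terms: quantized weight bytes, plus KV-cache bytes per token times context length. The sketch below is a back-of-envelope calculator; its constants are assumptions tuned to reproduce this page's figures, not measured values.

```python
# Back-of-envelope VRAM estimate for a 34B dense model, using this page's
# figures as assumptions: Q4_K_M weights ~= 19-20 GB (~4.7 bits/weight
# effective) and ~0.5 MiB of KV cache per token (matching ~8 GB at 16K).
PARAMS = 34e9
BITS_PER_WEIGHT_Q4_K_M = 4.7         # assumption; varies slightly per file
KV_BYTES_PER_TOKEN = 0.5 * 1024**2   # assumption; depends on KV precision

def vram_gb(context_tokens: int) -> float:
    weights = PARAMS * BITS_PER_WEIGHT_Q4_K_M / 8
    kv_cache = context_tokens * KV_BYTES_PER_TOKEN
    return (weights + kv_cache) / 1024**3

for ctx in (2048, 4096, 8192, 16384):
    print(f"{ctx:>6} ctx: ~{vram_gb(ctx):.1f} GB")
```

At 8K this lands near the ~23 GB figure above, and at 16K near ~27 GB, which is why the 24 GB cards need KV offload for full context.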
What breaks first
1. Code quality at Q3. Code generation is precision-sensitive; Q3 quantization introduces subtle bugs such as variable-name errors, syntax mistakes, and incorrect API calls. Use Q4_K_M minimum for code.
2. Fill-in-the-middle (FIM) support. Phind-CodeLlama supports FIM for code completion. If your inference stack doesn't format FIM prompts correctly, completion quality degrades (see the prompt-assembly sketch after this list).
3. CodeLlama chat template. CodeLlama uses a specific infill + chat template; applying the standard Llama chat template breaks code-generation formatting.
4. Language-specific quality variance. Code quality varies by language: Python and TypeScript are strongest, while less common languages see more errors. Test your target language.
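Item 2 comes down to prompt formatting: CodeLlama's infill scheme wraps the code before and after the cursor in sentinel tokens and asks the model to generate the middle. A minimal sketch of that assembly, assuming your runtime passes the sentinels through as special tokens rather than plain text:

```python
# CodeLlama-style fill-in-the-middle (FIM) prompt assembly, following the
# upstream CodeLlama infilling layout: <PRE> {prefix} <SUF>{suffix} <MID>
# Caveat (assumption): upstream CodeLlama trained the infilling objective
# on its 7B/13B variants; verify FIM quality on this 34B fine-tune before
# relying on it for completions.

def fim_prompt(prefix: str, suffix: str) -> str:
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

before = "def mean(xs: list[float]) -> float:\n    "
after = "\n    return total / len(xs)\n"
print(fim_prompt(before, after))  # model should fill in the 'total' line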
Runtime recommendation
Ollama for a quick start. llama.cpp with FIM support for code completion. vLLM for serving. For IDE integration, use Continue.dev or TabbyAPI with FIM-aware formatting. The CodeLlama architecture is well supported everywhere.
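For the vLLM route, the server speaks the OpenAI-compatible API, so any OpenAI client works. A sketch, assuming the server was launched on vLLM's default port with the HF repo name as the model id (and enough GPU memory or an AWQ build, per the hardware notes above); the prompt layout follows Phind's published ### System Prompt / ### User Message / ### Assistant format:

```python
# Query a local vLLM server over its OpenAI-compatible API.
# Assumes something like: vllm serve Phind/Phind-CodeLlama-34B-v2
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.completions.create(
    model="Phind/Phind-CodeLlama-34B-v2",
    prompt=(
        "### System Prompt\nYou are an expert programmer.\n\n"
        "### User Message\nWrite a binary search in Python.\n\n"
        "### Assistant\n"
    ),
    max_tokens=512,
    temperature=0.1,  # keep sampling conservative for code
)
print(resp.choices[0].text)
```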
Common beginner mistakes
- Mistake: Using Phind-CodeLlama for general chat. Fix: It's code-specialized; general knowledge and conversational ability are degraded versus same-sized general models. Use it for code tasks only.
- Mistake: Ignoring FIM formatting. Fix: CodeLlama uses fill-in-the-middle format for completions; standard chat format produces worse code completions. Use an FIM-aware frontend.
- Mistake: Using Q3 for production code generation. Fix: Q3 introduces subtle bugs. Test your code outputs at Q3 vs Q4 (a comparison harness is sketched below); you'll likely find more syntax errors and hallucinated APIs. Use Q4_K_M minimum.
- Mistake: Expecting Phind-v2 to know APIs released after its training cutoff. Fix: CodeLlama's knowledge is frozen; use RAG with current documentation for recent API and language features.
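To make the Q3-vs-Q4 check concrete, the sketch below runs the same prompt against two quantization tags via Ollama and prints both outputs for comparison. The tag names are assumptions; substitute whatever ollama list actually shows on your machine.

```python
# Compare the same prompt across two quantization levels via Ollama.
# Tag names are assumptions; check `ollama list` for your local tags.
import json
import urllib.request

PROMPT = "Write a Python function that merges two sorted lists."
TAGS = ["phind-codellama:34b-v2-q3_K_M", "phind-codellama:34b-v2-q4_K_M"]

def generate(tag: str) -> str:
    payload = json.dumps({"model": tag, "prompt": PROMPT, "stream": False})
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

for tag in TAGS:
    print(f"--- {tag} ---\n{generate(tag)}\n")
```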
Strengths
- Historical baseline for open coding models
Weaknesses
- Older — Qwen 2.5 Coder 32B is sharper
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 20.0 GB | 24 GB |
Get the model
HuggingFace
Original weights
Source repository with the original fp16 weights; quantize them yourself, or use a community-made GGUF.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Phind CodeLlama 34B v2.
Frequently asked
What's the minimum VRAM to run Phind CodeLlama 34B v2?
About 12 GB at Q3_K_M with KV offload (e.g. RTX 3060); 16 GB for Q4_K_M with KV offload. A 24 GB card runs Q4_K_M comfortably.
Can I use Phind CodeLlama 34B v2 commercially?
It inherits the Llama 2 community license from CodeLlama, which permits commercial use subject to Meta's terms; review the model card's license before shipping.
What's the context length of Phind CodeLlama 34B v2?
16K advertised (from the CodeLlama base); 8-16K is practical at Q4 on a 24 GB card.
Source: huggingface.co/Phind/Phind-CodeLlama-34B-v2
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Phind CodeLlama 34B v2 runs on your specific hardware before committing money.