Best open-source AI models for coding

Code is where open-source AI models have closed the gap with proprietary frontier models fastest. The current crop of code-specialised open-weight models rivals GPT-4-class quality on HumanEval and BigCodeBench at parameter counts that fit on a single high-end GPU.

What we optimise for

We're optimising for HumanEval and MATH scores, broad language coverage (not just Python), licence permissiveness for commercial deployments, and tractable hardware footprints.

Why it matters

Code assistants run in tight feedback loops with the developer. Latency matters as much as raw quality, which is why a fast 32B coder often beats a slower 70B generalist in practice.
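
To make the latency argument concrete, here is a rough sketch of wall-clock time for a chain of sequential LLM calls. The time-to-first-token and tokens-per-second figures are hypothetical placeholders, not measurements of any model listed here; substitute numbers from your own serving stack.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    ttft_s: float        # time to first token per call, in seconds (hypothetical)
    tokens_per_s: float  # decode throughput, in tokens per second (hypothetical)


def chain_latency_s(model: ModelProfile, calls: int, tokens_per_call: int) -> float:
    """Wall-clock seconds for `calls` sequential calls of `tokens_per_call` output tokens each."""
    per_call = model.ttft_s + tokens_per_call / model.tokens_per_s
    return calls * per_call


# Illustrative profiles only; real values depend on hardware, quantisation and batch size.
fast_32b = ModelProfile("32B coder", ttft_s=0.3, tokens_per_s=60.0)
slow_70b = ModelProfile("70B generalist", ttft_s=0.6, tokens_per_s=25.0)

for model in (fast_32b, slow_70b):
    seconds = chain_latency_s(model, calls=4, tokens_per_call=300)
    print(f"{model.name}: {seconds:.1f} s for a 4-call chain")
```

Under these assumed numbers the 4-call chain takes more than twice as long on the slower model, which is the kind of gap a developer feels on every edit.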

Our picks

#1 Qwen2.5 Coder 32B
GPT-4o-class HumanEval at 32B, Apache 2.0, fits on a single H100.
#2 DeepSeek Coder V2
Code-focused MoE with 21B active params; supports 338 languages.
#3 DeepSeek V3
Generalist V3 outperforms most code specialists on HumanEval at frontier scale.
#4 Qwen2.5 32B Instruct
Strongest non-coder-specialised model at this size; runs on a 4090 at Q4.
#5 Phi-4 14B
14B reasoning model that punches above its weight on code+math benchmarks.

Things to watch out for

Most coding assistants chain multiple LLM calls. Picking a model that streams output quickly on your serving stack (vLLM, TGI) often matters more than the absolute HumanEval score.
Code workloads are sensitive to context length on large monorepos. Prefer 128K+ context models if you plan to feed in surrounding files.
Apache 2.0 simplifies downstream redistribution if you're building a commercial product on top.
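
The context-length point can be checked before you commit to a model. The sketch below budgets surrounding files against a context window. The 4-characters-per-token ratio, the reserved output budget and the helper names are all assumptions for illustration; use the model's real tokenizer for anything load-bearing.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count. The chars-per-token ratio is a heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def select_files(files: dict[str, str], context_tokens: int,
                 reserved_for_output: int = 4096) -> list[str]:
    """Greedily keep files, in the given order, while they still fit the prompt budget."""
    budget = context_tokens - reserved_for_output
    chosen = []
    for path, text in files.items():
        cost = estimate_tokens(text)
        if cost > budget:
            continue  # skip files that would overflow the remaining budget
        chosen.append(path)
        budget -= cost
    return chosen


# Hypothetical repository slice: file sizes are made up for the example.
repo = {
    "service.py": "x" * 40_000,
    "generated_client.py": "y" * 400_000,
    "utils.py": "z" * 8_000,
}

print("16K window: ", select_files(repo, context_tokens=16 * 1024))
print("128K window:", select_files(repo, context_tokens=128 * 1024))
```

With these made-up sizes, the 16K window has to drop the large generated file, while the 128K window keeps all three.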

All picks at a glance

Qwen2.5 Coder 32B

32B

Coding-specialised Qwen2.5 32B fine-tune. GPT-4o-class on HumanEval and BigCodeBench at the time of release. Trained on additional code-heavy data with extended pre-training. Apache 2.0. Natural pick for self-hosted coding assistants, code-review automation, and any agent loop that primarily writes code.

Context: 128K
License: apache-2-0
VRAM Q4: 19.2 GB

DeepSeek Coder V2

236B

Coding-focused MoE model with 21B active parameters out of 236B total. Supports 338 programming languages with strong performance across mainstream stacks (Python, TypeScript, Go, Rust, Java, C++) and competent results on niche languages where most open models falter. The DeepSeek licence applies — commercial use permitted with some application restrictions.

Context: 128K
License: deepseek
VRAM Q4: 141.6 GB

DeepSeek V3

671B

671B-parameter MoE model with 37B active per token. Trained for a reported $5.6M of compute, a landmark in cost-efficient frontier training, it delivers frontier-class quality at a fraction of the cost of closed proprietary models. The DeepSeek licence permits commercial use with limited restrictions on military and unlawful applications. Running V3 yourself requires serious hardware: the fp8 weights alone are roughly 671 GB, more than the 640 GB on an 8× H100 80 GB node, so plan for a larger multi-GPU setup. Most teams will use it via the DeepSeek API or providers like Together.

Context: 128K
License: deepseek
VRAM Q4: 402.6 GB

Qwen2.5 32B Instruct

32B

32B sweet-spot model: strong reasoning, fits on one H100 in fp16, on a 4090 at Q4. The 32B size in particular hits a quality/cost knee — quality scales with parameters faster than cost up to ~32B, and slower afterwards. Favoured for production chat where 7B isn't sharp enough and where 70B+ would over-spec the hardware budget. Apache 2.0 licence.

Context: 128K
License: apache-2-0
VRAM Q4: 19.2 GB

Phi-4 14B

14B

14B model trained primarily on synthetic data. Punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 in particular demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.

Context: 16K
License: mit
VRAM Q4: 8.4 GB
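
The VRAM Q4 figures in the cards above are consistent with roughly 0.6 GB per billion parameters (for example, 19.2 GB / 32B). The sketch below reproduces them and checks each pick against a GPU's memory. The 0.6 factor is inferred from the listed numbers rather than taken from a stated formula, and the 2 GB headroom for KV cache and activations is a hypothetical placeholder. Note that for MoE models the full parameter count drives memory, while only the active parameters drive per-token compute.

```python
GB_PER_BILLION_PARAMS_Q4 = 0.6  # inferred from the listed figures, e.g. 19.2 GB / 32B

# (name, total params in billions, active params per token in billions)
# Dense models activate every parameter, so active == total for them.
PICKS = [
    ("Qwen2.5 Coder 32B", 32, 32),
    ("DeepSeek Coder V2", 236, 21),
    ("DeepSeek V3", 671, 37),
    ("Qwen2.5 32B Instruct", 32, 32),
    ("Phi-4 14B", 14, 14),
]


def q4_vram_gb(total_params_b: float) -> float:
    """Approximate Q4 weight footprint in GB, driven by total (not active) parameters."""
    return round(total_params_b * GB_PER_BILLION_PARAMS_Q4, 1)


def fits(total_params_b: float, gpu_vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if Q4 weights plus an assumed headroom fit in the given GPU memory."""
    return q4_vram_gb(total_params_b) + headroom_gb <= gpu_vram_gb


for name, total_b, active_b in PICKS:
    print(
        f"{name}: {q4_vram_gb(total_b)} GB at Q4, "
        f"{active_b / total_b:.1%} of params active per token, "
        f"fits 24 GB: {fits(total_b, 24)}, fits 80 GB: {fits(total_b, 80)}"
    )
```

Raise `headroom_gb` when you plan to use long contexts, since the KV cache grows with the number of tokens held in memory.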

Last reviewed 2026-07-05. We refresh these picks as new models ship. See the full directory at /models.