Best open-source AI models for coding
Code is where open-source AI models have closed the gap with the proprietary frontier fastest. The current crop of code-specialised open-weight models rivals GPT-4-class quality on HumanEval and BigCodeBench at parameter counts that fit on a single high-end GPU.
We're optimising for HumanEval and MATH scores, broad language coverage (not just Python), licence permissiveness for commercial deployments, and tractable hardware footprints.
Code assistants run in tight feedback loops with the developer. Latency matters as much as raw quality, which is why a fast 32B coder often beats a slower 70B generalist in practice.
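To make the latency argument concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simple model of response time (time to first token plus output tokens divided by decode throughput). The function names are hypothetical, and the throughput and first-token figures are made-up placeholders for illustration, not measurements of any model.

```python
def response_seconds(output_tokens: int, tokens_per_second: float,
                     ttft_seconds: float = 0.0) -> float:
    """Rough wall-clock time for one completion: first-token delay plus decode time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_seconds + output_tokens / tokens_per_second


def loop_seconds(calls: int, output_tokens: int, tokens_per_second: float,
                 ttft_seconds: float = 0.0) -> float:
    """Total time for a chain of sequential calls that each emit the same number of tokens."""
    return calls * response_seconds(output_tokens, tokens_per_second, ttft_seconds)


if __name__ == "__main__":
    # Hypothetical figures for illustration only, not benchmarks.
    fast_32b = loop_seconds(calls=4, output_tokens=300,
                            tokens_per_second=60.0, ttft_seconds=0.3)
    slow_70b = loop_seconds(calls=4, output_tokens=300,
                            tokens_per_second=25.0, ttft_seconds=0.8)
    print(f"fast 32B: {fast_32b:.1f}s, slow 70B: {slow_70b:.1f}s")
```

Plugging in throughput measured on your own hardware and serving stack shows how much waiting time a quality gain would have to justify.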
Our picks
GPT-4o-class HumanEval at 32B, Apache 2.0, fits on a single H100.
Code-focused MoE with 21B active params; supports 338 languages.
Generalist DeepSeek V3 outperforms most code specialists on HumanEval at frontier scale.
Strongest non-coder-specialised model at this size; runs on a 4090 at Q4.
14B reasoning model that punches above its weight on code+math benchmarks.
Things to watch out for
- Most coding assistants chain multiple LLM calls. Picking a model that streams output quickly on your serving stack (for example vLLM or TGI) often matters more than the absolute HumanEval score.
- Code workloads are sensitive to context length on large monorepos. Prefer 128K+ context models if you plan to feed in surrounding files.
- Apache 2.0 simplifies downstream redistribution if you're building a commercial product on top.
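For the context-length point above, a quick estimate can tell you whether a set of surrounding files is likely to fit before you commit to a model. This is a minimal sketch that assumes roughly 4 characters per token, a common rule of thumb that varies by tokenizer and programming language. The helper names and the reserved output budget are hypothetical, and a real check should use the model's own tokenizer.

```python
import math
from typing import Iterable

# Assumption: ~4 characters per token. Real ratios depend on the tokenizer
# and on the language being tokenized.
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Approximate token count from character length, rounded up."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(texts: Iterable[str], context_tokens: int,
                    reserved_output_tokens: int = 2048) -> bool:
    """True if the estimated prompt plus a reserved output budget fits the window."""
    prompt_tokens = sum(estimate_tokens(t) for t in texts)
    return prompt_tokens + reserved_output_tokens <= context_tokens


if __name__ == "__main__":
    files = ["x" * 400_000]  # stand-in for ~400 KB of source text
    print("16K window: ", fits_in_context(files, 16_000))
    print("128K window:", fits_in_context(files, 128_000))
```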
All picks at a glance
Coding-specialised Qwen2.5 32B fine-tune. GPT-4o-class on HumanEval and BigCodeBench at the time of release. Trained on additional code-heavy data with extended pre-training. Apache 2.0. Natural pick for self-hosted coding assistants, code-review automation, and any agent loop that primarily writes code.
- Context: 128K
- License: Apache 2.0
- VRAM (Q4): 19.2 GB
Coding-focused MoE model with 21B active parameters out of 236B total. Supports 338 programming languages with strong performance across mainstream stacks (Python, TypeScript, Go, Rust, Java, C++) and competent results on niche languages where most open models falter. The DeepSeek licence applies — commercial use permitted with some application restrictions.
- Context: 128K
- License: DeepSeek
- VRAM (Q4): 141.6 GB
671B-parameter MoE model with 37B active per token. Trained for roughly $5.6M of compute, a landmark in cost-efficient frontier training. Frontier-class quality at a fraction of the cost of the proprietary frontier. The DeepSeek licence permits commercial use with limited restrictions on military and unlawful applications. Running V3 yourself requires serious hardware: at fp8 the weights alone come to roughly 671 GB, more than the 640 GB of an 8× H100 (80 GB) node, so plan for a larger or multi-node deployment. Most teams will use it via the DeepSeek API or providers like Together.
- Context: 128K
- License: DeepSeek
- VRAM (Q4): 402.6 GB
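The two MoE picks above quote both total and active parameter counts, and the distinction matters for sizing. All experts must be resident in memory, so weight memory follows the total count, while per-token compute follows the active count. The sketch below uses only the parameter counts stated in the descriptions. The helper names are hypothetical, and the memory figure covers weights only (no KV cache or activations).

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of parameters used per token in a mixture-of-experts model."""
    if not 0 < active_params_b <= total_params_b:
        raise ValueError("expected 0 < active <= total")
    return active_params_b / total_params_b


def weight_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return total_params_b * bytes_per_param


if __name__ == "__main__":
    # (total, active) parameter counts in billions, as stated in the text above.
    picks = {"236B total / 21B active": (236.0, 21.0),
             "671B total / 37B active": (671.0, 37.0)}
    for name, (total, active) in picks.items():
        share = active_fraction(total, active)
        fp8 = weight_gb(total, 1.0)  # fp8 stores one byte per parameter
        print(f"{name}: {share:.1%} active per token, ~{fp8:.0f} GB of weights at fp8")
```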
32B sweet-spot model: strong reasoning, fits on one H100 in fp16 and on a 4090 at Q4. The 32B size in particular hits a quality/cost knee: quality improves faster than cost up to ~32B and more slowly afterwards. Favoured for production chat where 7B isn't sharp enough and where 70B+ would over-spec the hardware budget. Apache 2.0 licence.
- Context: 128K
- License: Apache 2.0
- VRAM (Q4): 19.2 GB
14B model trained primarily on synthetic data. Punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 in particular demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.
- Context: 16K
- License: MIT
- VRAM (Q4): 8.4 GB
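The Q4 VRAM figures listed for the picks are all consistent with about 0.6 bytes per parameter (roughly 4.8 bits per weight) applied to the total parameter count. That ratio is inferred from the numbers on this page, not a documented formula, and it covers weights only. The sketch below reproduces the figures under that assumption and adds a simple headroom check. The function names and the 2 GB headroom default are hypothetical.

```python
# Assumption inferred from the listed figures: ~0.6 bytes per parameter at Q4.
Q4_BYTES_PER_PARAM = 0.6


def q4_vram_gb(total_params_b: float) -> float:
    """Estimated Q4 weight footprint in GB for a model with the given billions of parameters."""
    return round(total_params_b * Q4_BYTES_PER_PARAM, 1)


def fits_on_gpu(total_params_b: float, gpu_vram_gb: float,
                headroom_gb: float = 2.0) -> bool:
    """True if Q4 weights plus a fixed headroom (KV cache, runtime) fit in GPU memory."""
    return q4_vram_gb(total_params_b) + headroom_gb <= gpu_vram_gb


if __name__ == "__main__":
    for params_b in (14, 32, 236, 671):
        print(f"{params_b}B -> ~{q4_vram_gb(params_b)} GB at Q4")
    # An RTX 4090 has 24 GB of VRAM.
    print("32B on a 24 GB card:", fits_on_gpu(32, 24.0))
```

Long contexts grow the KV cache well beyond a fixed headroom, so treat the result as a lower bound on what a deployment needs.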