Best open-source AI models for RAG
RAG workloads have specific demands: large context windows (so retrieved chunks fit), strong instruction following (so the model actually grounds in the retrieval), and predictable formatting (for downstream citation rendering).
We're optimising for context length, instruction-following quality, and explicit RAG-mode features where the model author has published them.
A RAG model that doesn't reliably ground in its context is worse than no RAG at all — it produces confident-sounding hallucinations attached to retrieved sources, which users trust by default.
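The grounding and formatting demands above can be made concrete with a minimal sketch. It numbers the retrieved chunks in the prompt and then checks that every citation in the answer points at a chunk that was actually supplied. The `[n]` citation style, the prompt wording, and the function names are assumptions for illustration, not any model's native RAG format.

```python
import re


def build_grounded_prompt(question, chunks):
    """Number each retrieved chunk so the answer can cite it as [n]."""
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below. "
        "Cite each claim with its source number in square brackets. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


def invalid_citations(answer, num_chunks):
    """Return cited source numbers that do not match any supplied chunk."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)


# Hypothetical chunks and a hypothetical model answer, for illustration only.
chunks = [
    "Llama 3.3 has a 128K context window.",
    "Command R is a 35B model.",
]
prompt = build_grounded_prompt("How large is Command R?", chunks)
bad = invalid_citations("Command R has 35B parameters [2][5].", len(chunks))
print(bad)
```

A non-empty result from `invalid_citations` is a cheap signal that the model cited a source it was never given, which is worth logging before the answer reaches a citation renderer.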
Our picks
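- Command R+: Cohere built it specifically for RAG; native citation-friendly output format.
- Command R: same family, 35B size. Sweet spot for cost-sensitive RAG, subject to the licence caveat below.
- Jamba 1.5 Large: 256K native context, the longest window among these picks.
- Llama 3.3 70B: 128K context with strong instruction following.
- Qwen 2.5 72B: an alternative to Llama 3.3 with similar quality. The 72B ships under the Qwen License rather than Apache 2.0.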
Things to watch out for
- Effective context length is often much shorter than advertised. Test your model with realistic chunk-heavy prompts before committing.
- Many RAG failures are retrieval failures, not generation failures. Spend equal effort on your embedding model and chunking strategy.
- For commercial deployments, note that Command R / R+ open weights are CC-BY-NC — production use requires Cohere's hosted API.
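One way to act on the effective-context advice is a small "needle in a haystack" probe: plant a known fact at different depths in a long, chunk-heavy context and measure how often the model retrieves it. The sketch below only builds the probes and scores answers; the call to your model is left out, and the helper names, filler text, and needle are all hypothetical.

```python
def build_probe(needle, filler_chunks, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) among filler chunks."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_chunks))
    chunks = filler_chunks[:position] + [needle] + filler_chunks[position:]
    return "\n\n".join(chunks), position


def recall(answers, expected):
    """Fraction of model answers that contain the expected fact."""
    hits = sum(expected.lower() in answer.lower() for answer in answers)
    return hits / len(answers) if answers else 0.0


# Illustrative data; in practice use real retrieved chunks at realistic lengths.
filler = [f"Filler paragraph {i} about an unrelated topic." for i in range(10)]
needle = "The access code for the archive is 7341."

for depth in (0.0, 0.5, 1.0):
    context, position = build_probe(needle, filler, depth)
    # Send `context` plus a question about the access code to your model here,
    # collect the answers per depth, and score them with recall(answers, "7341").
    print(depth, position, len(context.split("\n\n")))
```

Repeating the probe at increasing context lengths shows where recall starts to drop, which is a more useful number for planning than the advertised window.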
All picks at a glance
Command R+ is Cohere's flagship 104B model. It is RAG-focused, with native multilingual support across ~10 high-resource languages. The weights are CC-BY-NC; commercial use goes through Cohere's hosted API.
- Context: 128K
- License: cc-by-nc
- VRAM Q4: 62.4 GB
Command R is Cohere's 35B model tuned for RAG and tool use. The open weights are released under CC-BY-NC (commercial use requires the Cohere API). It offers strong multilingual coverage and a fine-grained RAG-mode output format that makes downstream citation easier.
- Context: 128K
- License: cc-by-nc
- VRAM Q4: 21 GB
Jamba 1.5 Large is a hybrid Mamba-Transformer-MoE model with native 256K context (effective beyond 140K). It has 94B active parameters out of 398B total. The state-space-model layers scale linearly in time with sequence length, which makes the model interesting for very long contexts. It is licensed under AI21's open model licence, which permits most commercial use.
- Context: 256K
- License: jamba-open
- VRAM Q4: 238.8 GB
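To see why linear-time scaling matters at long context, the sketch below compares how a linear-cost layer and a quadratic-cost layer (such as full self-attention) grow relative to a 32K-token baseline. It is a deliberately simplified model: the baseline and the exponents are assumptions for illustration, and a hybrid model that still contains attention layers will land somewhere between the two curves.

```python
def relative_cost(tokens, baseline_tokens, exponent):
    """Cost relative to the baseline when cost grows as tokens ** exponent."""
    return (tokens / baseline_tokens) ** exponent


BASELINE = 32_000  # assumed reference length, chosen only for illustration

for tokens in (32_000, 128_000, 256_000):
    linear = relative_cost(tokens, BASELINE, 1)     # state-space-style layer
    quadratic = relative_cost(tokens, BASELINE, 2)  # full-attention-style layer
    print(f"{tokens:>7} tokens: linear x{linear:.0f}, quadratic x{quadratic:.0f}")
```

Under these assumptions, going from 32K to 256K tokens multiplies the linear term by 8 and the quadratic term by 64, which is the gap the hybrid design tries to narrow.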
Llama 3.3 70B is Meta's December 2024 refresh of Llama 3 70B. It closes most of the gap with Llama 3.1 405B for chat workloads while remaining tractable on a single H100. Strong instruction following, robust tool-use behaviour, and a 128K context window make it the default choice for production chat at 70B scale. The 3.3 release was trained on a refreshed instruction-tuning data mix and benefits from Meta's most recent alignment work at the time of release. It outperforms the much larger 3.1 405B on several reasoning benchmarks at a fraction of the inference cost. The licence is the Llama 3.3 Community License, which permits commercial use unless your service exceeds 700M monthly active users, in which case you need a separate licence from Meta. Good pick for: production chat at scale, RAG over long documents, agentic workflows where tool use matters, and any 70B-tier replacement for closed proprietary models.
- Context: 128K
- License: llama-3
- VRAM Q4: 42 GB
Qwen 2.5 72B is the flagship Qwen 2.5 release. It competes with Llama 3.1 405B on many benchmarks at under one-fifth the parameter count. Note that the 72B specifically uses the Qwen License (commercial use up to 100M MAU), whereas most of the smaller Qwen2.5 sizes are Apache 2.0.
- Context: 128K
- License: qwen
- VRAM Q4: 43.2 GB
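The VRAM Q4 figures listed for the five picks are each consistent with roughly 0.6 GB per billion parameters. That ratio is inferred from the numbers on this page rather than stated by it, so treat the sketch below as a rule of thumb for sizing other models, not as the page's actual method. It covers weights only; the KV cache for long RAG contexts needs additional memory on top.

```python
# Assumed ratio, inferred from the listed figures (about 4.8 bits per weight).
GB_PER_BILLION_PARAMS_Q4 = 0.6


def estimate_q4_vram_gb(params_billion):
    """Rough weights-only VRAM estimate for a Q4-quantised model."""
    return round(params_billion * GB_PER_BILLION_PARAMS_Q4, 1)


# Total parameter counts (in billions) and the VRAM Q4 values listed above.
# For the MoE pick the total count (398B) is used, since all experts must be loaded.
listed = {104: 62.4, 35: 21.0, 398: 238.8, 70: 42.0, 72: 43.2}

for params, listed_gb in listed.items():
    estimate = estimate_q4_vram_gb(params)
    print(f"{params}B: estimated {estimate} GB, listed {listed_gb} GB")
```

When budgeting hardware, add headroom for the KV cache and activations at the context lengths you actually plan to use.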