Best open-source AI models for RAG

RAG workloads have specific demands: large context windows (so retrieved chunks fit), strong instruction following (so the model actually grounds in the retrieval), and predictable formatting (for downstream citation rendering).

What we optimise for

We're optimising for context length, instruction-following quality, and explicit RAG-mode features where the model author has published them.

Why it matters

A RAG model that doesn't reliably ground in its context is worse than no RAG at all — it produces confident-sounding hallucinations attached to retrieved sources, which users trust by default.
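
One cheap guard against this failure is to check that every citation in an answer points at a chunk that was actually retrieved. The sketch below assumes the model has been prompted to cite sources with bracketed numbers such as [1]; that marker format and the function name check_citations are illustrative assumptions, not any model's native output format.

```python
import re

def check_citations(answer: str, num_chunks: int) -> dict:
    """Check bracketed numeric citations like [2] against the retrieved chunk count."""
    # Collect every distinct citation number that appears in the answer.
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    # A citation is invalid if it points outside the 1..num_chunks range.
    invalid = sorted(c for c in cited if c < 1 or c > num_chunks)
    return {
        "cited": sorted(cited),
        "invalid": invalid,
        "uncited_answer": not cited,  # True when the answer cites nothing at all
    }

report = check_citations(
    "Paris is the capital [1]. It has 2M residents [4].", num_chunks=3
)
print(report)
```

This only verifies that the citation markers are well-formed and in range. It does not confirm that the cited chunk supports the claim, which needs a separate entailment or human review step.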

Our picks

#1 Command R+
Built by Cohere specifically for RAG, with a native citation-friendly output format.
#2 Command R
Same family, 35B size. Sweet spot for cost-sensitive RAG.
#3 Jamba 1.5 Large
256K native context, the longest window among these picks.
#4 Llama 3.3 70B Instruct
128K context with strong instruction following.
#5 Qwen2.5 72B Instruct
Qwen-licensed alternative to Llama 3.3 with similar quality.

Things to watch out for

Effective context length is often much shorter than advertised. Test your model with realistic chunk-heavy prompts before committing.
Many RAG failures are retrieval failures, not generation failures. Spend at least as much effort on your embedding model and chunking strategy as on choosing the generator.
For commercial deployments, note that Command R / R+ open weights are CC-BY-NC — production use requires Cohere's hosted API.
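
A rough way to test effective context length is to hide a known fact at different depths in a long prompt and see where the model stops finding it. The sketch below only builds the probe prompts. The function name build_probe, the prompt wording, and the filler text are all assumptions for illustration; in practice you would use real chunks from your own corpus and send each prompt to the model you are evaluating.

```python
def build_probe(filler_chunks, needle, depth, question):
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler chunks."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Convert the relative depth into an insertion index.
    position = round(depth * len(filler_chunks))
    chunks = filler_chunks[:position] + [needle] + filler_chunks[position:]
    # Number the chunks the way a RAG prompt typically would.
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return f"Answer using only the sources below.\n\n{numbered}\n\nQuestion: {question}"

# Placeholder filler; swap in realistic chunks from your own documents.
filler = [f"Filler passage number {i}." for i in range(200)]
needle = "The maintenance window is 02:00 UTC on Sundays."
prompts = {
    depth: build_probe(filler, needle, depth, "When is the maintenance window?")
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
}
```

Increase the number of filler chunks until the prompt approaches the advertised window, then record at which lengths and depths the answer stops containing the hidden fact.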

All picks at a glance

Command R+

104B

Cohere's flagship 104B model. RAG-focused with native multilingual support across ~10 high-resource languages. CC-BY-NC weights; commercial use via Cohere's hosted API.

Context: 128K
License: cc-by-nc
VRAM Q4: 62.4 GB

Command R

35B

Cohere's 35B model tuned for RAG and tool use. The open weights are released under CC-BY-NC (commercial use requires the Cohere API). Strong multilingual coverage and a fine-grained RAG-mode output format that makes downstream citation easier.

Context: 128K
License: cc-by-nc
VRAM Q4: 21 GB

Jamba 1.5 Large

398B

Hybrid Mamba-Transformer-MoE model with native 256K context (effective beyond 140K). 94B active parameters out of 398B total. The state-space-model layers give it linear-time scaling with sequence length, making it interesting for very long contexts. Licensed under AI21's open model licence, which permits most commercial use.

Context: 256K
License: jamba-open
VRAM Q4: 238.8 GB

Llama 3.3 70B Instruct

70B

Meta's December 2024 refresh of Llama 3 70B, which closes most of the gap with Llama 3.1 405B for chat workloads while remaining tractable on a single H100 at Q4. Strong instruction following, robust tool-use behaviour, and a 128K context window make it the default choice for production chat at 70B scale. The 3.3 release was trained on a refreshed instruction-tuning data mix and benefits from Meta's most recent alignment work. It outperforms the much larger 3.1 405B on several reasoning benchmarks at a fraction of the inference cost. The licence is the Llama 3.3 Community License, which permits commercial use unless your service exceeds 700M monthly active users. Good pick for: production chat at scale, RAG over long documents, agentic workflows where tool use matters, and any 70B-tier replacement for closed proprietary models.

Context: 128K
License: llama-3
VRAM Q4: 42 GB

Qwen2.5 72B Instruct

72B

The flagship Qwen 2.5 release. Competes with Llama 3.1 405B on many benchmarks at one-fifth the parameter count. Note the 72B specifically uses the Qwen License (commercial use up to 100M MAU) — the smaller Qwen2.5 sizes are Apache 2.0.

Context: 128K
License: qwen
VRAM Q4: 43.2 GB
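
The VRAM Q4 figures listed above all work out to 0.6 GB per billion parameters (for example, 104 × 0.6 = 62.4). The sketch below reproduces that arithmetic. The 0.6 factor is inferred from the listed numbers rather than stated on this page, and it covers weights only: the KV cache, which grows with context length, needs additional memory on top.

```python
# Assumption: 0.6 GB per billion parameters, inferred from the figures listed above.
GB_PER_BILLION_PARAMS_Q4 = 0.6

def estimate_q4_vram_gb(params_billion: float) -> float:
    """Estimate weight-only VRAM at Q4, rounded to one decimal place."""
    return round(params_billion * GB_PER_BILLION_PARAMS_Q4, 1)

# Total parameter counts in billions, as listed for each pick.
models = {
    "Command R+": 104,
    "Command R": 35,
    "Jamba 1.5 Large": 398,
    "Llama 3.3 70B Instruct": 70,
    "Qwen2.5 72B Instruct": 72,
}

estimates = {name: estimate_q4_vram_gb(size) for name, size in models.items()}
for name, gb in estimates.items():
    print(f"{name}: {gb} GB")
```

For long-context RAG, budget extra headroom beyond these numbers, since a prompt full of retrieved chunks can add a substantial KV cache.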

Last reviewed 2026-07-05. We refresh these picks as new models ship. See the full directory at /models.