OSAIM
Open Source AI Models

Best open-source AI models for RAG

RAG workloads have specific demands: large context windows (so the retrieved chunks fit), strong instruction following (so the model actually grounds its answer in those chunks), and predictable formatting (so citations can be rendered downstream).
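As a minimal sketch of those three demands in practice, the Python below packs numbered chunks into a prompt under a context budget, then checks that every citation in an answer points at a chunk that was actually supplied. The `[n]` citation format, the four-characters-per-token estimate, and the function names are assumptions made for illustration; none of them come from any particular model's RAG mode.

```python
import re

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, swap in the model's real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; only good enough for budgeting."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def build_prompt(question: str, chunks: list[str], budget_tokens: int) -> tuple[str, int]:
    """Pack chunks in ranked order until the budget is hit.

    Returns the prompt and the number of chunks that were included.
    """
    header = "Answer using only the sources below. Cite sources as [n].\n\n"
    footer = f"\nQuestion: {question}\nAnswer:"
    used = estimate_tokens(header) + estimate_tokens(footer)
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}\n"
        cost = estimate_tokens(entry)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that does not fit
        parts.append(entry)
        used += cost
    return header + "".join(parts) + footer, len(parts)


def invalid_citations(answer: str, n_chunks: int) -> list[int]:
    """Return cited indices that do not correspond to a supplied chunk."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(c for c in cited if not 1 <= c <= n_chunks)
```

A non-empty result from `invalid_citations` is a cheap signal that the model cited a source it was never given, which is worth logging before the answer reaches a user.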

What we optimise for

We rank on context length, instruction-following quality, and explicit RAG-mode features where the model author has published them.

Why it matters

A RAG model that doesn't reliably ground in its context is worse than no RAG at all — it produces confident-sounding hallucinations attached to retrieved sources, which users trust by default.

Our picks

  1. Command R+: Cohere built it specifically for RAG; native citation-friendly output format.

  2. Command R: same family, 35B size. Sweet spot for cost-sensitive RAG.

  3. Jamba 1.5 Large: 256K native context, the longest window among these picks.

  4. Llama 3.3 70B Instruct: 128K context with strong instruction following.

  5. Qwen2.5 72B Instruct: alternative to Llama 3.3 with similar quality, released under the Qwen License rather than Apache 2.0.

Things to watch out for

  • Effective context length is often much shorter than advertised. Test your model with realistic chunk-heavy prompts before committing.
  • Many RAG failures are retrieval failures, not generation failures. Spend equal effort on your embedding model and chunking strategy.
  • For commercial deployments, note that Command R / R+ open weights are CC-BY-NC — production use requires Cohere's hosted API.
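One way to act on the first point is a planted-fact probe: bury a known sentence at different depths in filler text sized like your real prompts, then check whether the model can still retrieve it. The sketch below only builds the probes and scores the answers; `ask_model` is a hypothetical callable you would supply for your own inference stack, and the needle, question, and depths are illustrative assumptions.

```python
from typing import Callable

NEEDLE = "The maintenance code for unit 7 is 4912."  # illustrative planted fact
QUESTION = "What is the maintenance code for unit 7?"
EXPECTED = "4912"


def build_probe(filler_chunks: list[str], depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(filler_chunks))
    chunks = filler_chunks[:position] + [NEEDLE] + filler_chunks[position:]
    return "\n\n".join(chunks) + f"\n\nQuestion: {QUESTION}\nAnswer:"


def run_probe(
    ask_model: Callable[[str], str],
    filler_chunks: list[str],
    depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, bool]:
    """Map each depth to whether the model's answer contained the planted value."""
    return {d: EXPECTED in ask_model(build_probe(filler_chunks, d)) for d in depths}
```

Running this with filler drawn from your own corpus, at several total prompt lengths, gives a rough picture of where retrieval starts to degrade for a given model.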

All picks at a glance

Command R+
104B

Cohere's flagship 104B model. RAG-focused with native multilingual support across ~10 high-resource languages. CC-BY-NC weights; commercial use via Cohere's hosted API.

Context: 128K
License: cc-by-nc
VRAM Q4: 62.4 GB

Command R
35B

Cohere's 35B model tuned for RAG and tool use. The open weights are released under CC-BY-NC (commercial use requires the Cohere API). Strong multilingual coverage and a fine-grained RAG-mode output format that makes downstream citation easier.

Context: 128K
License: cc-by-nc
VRAM Q4: 21 GB

Jamba 1.5 Large
398B

Hybrid Mamba-Transformer-MoE model with native 256K context (effective beyond 140K). 94B active parameters out of 398B total. The state-space-model layers give it linear-time scaling with sequence length, making it interesting for very long contexts. Licensed under AI21's open model licence, which permits most commercial use.

Context: 256K
License: jamba-open
VRAM Q4: 238.8 GB

Llama 3.3 70B Instruct
70B

Meta's December 2024 refresh of Llama 3 70B that closes most of the gap with Llama 3.1 405B for chat workloads while remaining tractable on a single H100. Strong instruction following, robust tool-use behaviour, and a 128K context window make it the default choice for production chat at 70B scale. The 3.3 release was trained on a refreshed instruction-tuning data mix and benefits from Meta's most recent alignment work. It outperforms the much larger 3.1 405B on several reasoning benchmarks at a fraction of inference cost. The licence is the Llama 3.3 Community License, which permits commercial use unless your service exceeds 700M monthly active users. Good pick for: production chat at scale, RAG over long documents, agentic workflows where tool use matters, and any 70B-tier replacement for closed proprietary models.

Context: 128K
License: llama-3
VRAM Q4: 42 GB

Qwen2.5 72B Instruct
72B

The flagship Qwen 2.5 release. Competes with Llama 3.1 405B on many benchmarks at roughly one-fifth the parameter count. Note that the 72B specifically uses the Qwen License (commercial use up to 100M MAU); most of the smaller Qwen2.5 sizes are Apache 2.0.

Context: 128K
License: qwen
VRAM Q4: 43.2 GB
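
The VRAM Q4 figures on these cards all work out to 0.6 GB per billion parameters (for example, 104 × 0.6 = 62.4). The sketch below reproduces that arithmetic. The 0.6 factor is inferred from the listed numbers rather than stated anywhere on this page, and it covers weights only: the KV cache grows with context length and is not included.

```python
GB_PER_BILLION_PARAMS_Q4 = 0.6  # assumption: inferred from the figures listed above


def q4_vram_gb(params_billion: float) -> float:
    """Estimate weight memory at 4-bit quantisation, excluding KV cache."""
    return round(params_billion * GB_PER_BILLION_PARAMS_Q4, 1)


picks = {
    "Command R+": 104,
    "Command R": 35,
    "Jamba 1.5 Large": 398,
    "Llama 3.3 70B Instruct": 70,
    "Qwen2.5 72B Instruct": 72,
}

for name, size in picks.items():
    print(f"{name}: {q4_vram_gb(size)} GB")
```

For a mixture-of-experts model such as Jamba 1.5 Large, the estimate uses the total parameter count, since all experts normally have to be resident in memory even though only a subset is active per token. For long RAG prompts, budget extra headroom on top of these numbers.
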
Last reviewed 2026-06-08. We refresh these picks as new models ship. See the full directory at /models.