Cohere's 35B model tuned for RAG and tool use. The open weights are released under CC-BY-NC (commercial use requires the Cohere API). Strong multilingual coverage and a fine-grained RAG-mode output format that makes downstream citation easier.
- Parameters
- 35B
- Context length
- 128K
- Modality
- text
- Released
- 2024-03-11
Memory & hardware
- VRAM (fp16)
- 70 GB
- VRAM (Q4)
- 21 GB
- Recommended
- A100 80GB or 2× RTX 4090
- Quantizations
- fp16, q8_0, q4_k_m
License: Mistral Research License
- SPDX
- —
- Commercial use
- No
- Modification
- Yes
- Redistribution
- Yes
Benchmarks
Hosted inference pricing
USD per million tokens.
| Provider | Input | Output | |
|---|---|---|---|
| togetherCheapest | $0.50 | $1.50 |
Run it yourself
Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.
ollama run command-r
vllm serve CohereForAI/c4ai-command-r-v01
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
model = AutoModelForCausalLM.from_pretrained(
"CohereForAI/c4ai-command-r-v01", device_map="auto", torch_dtype="auto"
)CohereForAI/c4ai-command-r-v01 Related models
Same family or similar size — useful when shopping around.
Bilingual EN/中 34B chat model. Apache 2.0 licensed with strong Chinese-language performance and competitive English chat quality. Good default for bilingual production workloads.
- Context
- 33K
- License
- apache-2-0
- VRAM Q4
- 20.4 GB
Vision-language variant of Yi 34B. Image-text reasoning via an MLP adapter on a CLIP encoder. Useful for bilingual EN/中 multimodal workloads where the major Western vision-language models underperform on Chinese text in images.
- Context
- 4K
- License
- apache-2-0
- VRAM Q4
- 20.4 GB
32B sweet-spot model: strong reasoning, fits on one H100 in fp16, on a 4090 at Q4. The 32B size in particular hits a quality/cost knee — quality scales with parameters faster than cost up to ~32B, and slower afterwards. Favoured for production chat where 7B isn't sharp enough and where 70B+ would over-spec the hardware budget. Apache 2.0 licence.
- Context
- 128K
- License
- apache-2-0
- VRAM Q4
- 19.2 GB
Coding-specialised Qwen2.5 32B fine-tune. GPT-4o-class on HumanEval and BigCodeBench at the time of release. Trained on additional code-heavy data with extended pre-training. Apache 2.0. Natural pick for self-hosted coding assistants, code-review automation, and any agent loop that primarily writes code.
- Context
- 128K
- License
- apache-2-0
- VRAM Q4
- 19.2 GB
Qwen's reasoning-focused 'thinking' model. Generates long chains-of-thought before answering, similar to OpenAI's o1 and DeepSeek R1 lineage. Optimised for math and competition-style problem solving. The Preview tag means Qwen is iterating quickly; later versions may obsolete this one. Useful today for math-heavy workloads where a slow, careful answer is preferred to a fast wrong one.
- Context
- 33K
- License
- apache-2-0
- VRAM Q4
- 19.2 GB
The mixture-of-experts release that introduced 8 experts of 7B each, 2 active per token. ~13B active parameters with 47B total, which makes per-token inference roughly as fast as a 13B dense model while approaching 70B dense quality. Apache 2.0 weights mean it's still a popular self-hosting choice. Memory footprint is the main constraint — the full 47B parameters must be loaded even though only a quarter are active per token.
- Context
- 33K
- License
- apache-2-0
- VRAM Q4
- 28 GB