Mistral Small 3
24B dense model from Mistral's January 2025 release that competes with Llama 3.3 70B on many tasks at a third of the parameter count. Apache 2.0 licensed and small enough to run on a single 4090 at Q4.
Good pick when you want Llama-3.3-70B-class chat quality but at a friendlier hardware budget, or when the licence matters and Llama's community terms don't fit.
- Parameters
- 24B
- Context length
- 33K
- Modality
- text
- Released
- 2025-01-30
Memory & hardware
- VRAM (fp16)
- 48 GB
- VRAM (Q4)
- 14.4 GB
- Recommended
- A100 40GB or RTX 4090 24GB (Q4)
- Quantizations
- fp16, q8_0, q4_k_m
Benchmarks
Hosted inference pricing
USD per million tokens.
| Provider | Input | Output | |
|---|---|---|---|
| togetherCheapest | $0.80 | $0.80 |
Run it yourself
Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.
ollama run mistral
vllm serve mistralai/Mistral-Small-24B-Instruct-2501
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-24B-Instruct-2501")
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-Small-24B-Instruct-2501", device_map="auto", torch_dtype="auto"
)mistralai/Mistral-Small-24B-Instruct-2501 Related models
Same family or similar size — useful when shopping around.
Flagship Gemma 2 release. Uses logit-distillation from a larger teacher model, which is how Google delivers near-70B quality from a 27B student. A solid choice when the Llama community licence doesn't fit and you need quality at the 27B–40B size range.
- Context
- 8K
- License
- gemma
- VRAM Q4
- 16.2 GB
Phi-3's mid-tier model with extended 128K context. MIT licence. Strong reasoning relative to its parameter count thanks to Microsoft's heavy investment in synthetic training data.
- Context
- 128K
- License
- mit
- VRAM Q4
- 8.4 GB
14B model trained primarily on synthetic data. Punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 in particular demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.
- Context
- 16K
- License
- mit
- VRAM Q4
- 8.4 GB
Mid-size Qwen2.5 with broad task coverage. The sweet spot for users who want noticeably better quality than 7B but can't justify the hardware footprint of 32B or 72B.
- Context
- 128K
- License
- apache-2-0
- VRAM Q4
- 8.4 GB
Larger OLMo 2 release. Same fully-open philosophy as the 7B variant. The 13B size makes it more competitive with mainstream production-grade chat models.
- Context
- 4K
- License
- apache-2-0
- VRAM Q4
- 7.8 GB
Joint Mistral × NVIDIA model with 128K context, designed as a drop-in upgrade to Mistral 7B. Trained with NVIDIA's Megatron stack and released under Apache 2.0. Strong multilingual coverage thanks to the Tekken tokenizer.
- Context
- 128K
- License
- apache-2-0
- VRAM Q4
- 7.2 GB