Mistral Small 3

Mistral AI · Mistral24B13B – 30B Apache 2.0 chat code

24B dense model from Mistral's January 2025 release that competes with Llama 3.3 70B on many tasks at a third of the parameter count. Apache 2.0 licensed and small enough to run on a single 4090 at Q4. Good pick when you want Llama-3.3-70B-class chat quality but at a friendlier hardware budget, or when the licence matters and Llama's community terms don't fit.

Parameters: 24B
Context length: 33K
Modality: text
Released: 2025-01-30

Memory & hardware

VRAM (fp16): 48 GB
VRAM (Q4): 14.4 GB
Recommended: A100 40GB or RTX 4090 24GB (Q4)
Quantizations: fp16, q8_0, q4_k_m

License: Apache 2.0

SPDX: Apache-2.0
Commercial use: Yes
Modification: Yes
Redistribution: Yes

License detail →

Benchmarks

HumanEval

84.8

IFEval

82.6

MMLU

81.0

ArenaHard

77.2

MATH

70.6

Benchmarks last verified 2026-07-02.

Hosted inference pricing

USD per million tokens.

Provider	Input	Output
togetherCheapest	$0.80	$0.80

Pricing last verified 2026-05-18. Providers update rates frequently; confirm before integrating.

Run it yourself

Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.

Run Mistral Small 3 locally

Ollama (easiest)

ollama run mistral

Single-line install + run; uses the official Ollama registry tag for this family.

vLLM (production)

vllm serve mistralai/Mistral-Small-24B-Instruct-2501

High-throughput hosted inference; one command to expose an OpenAI-compatible HTTP server.

Transformers (Python)

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-24B-Instruct-2501")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-24B-Instruct-2501", device_map="auto", torch_dtype="auto"
)

Direct PyTorch usage. Pin a torch / cuda version that matches your GPU.

Hugging Face ID: mistralai/Mistral-Small-24B-Instruct-2501

Related models

Same family or similar size — useful when shopping around.

Gemma 2 27B

27B

Flagship Gemma 2 release. Uses logit-distillation from a larger teacher model, which is how Google delivers near-70B quality from a 27B student. A solid choice when the Llama community licence doesn't fit and you need quality at the 27B–40B size range.

Context: 8K
License: gemma
VRAM Q4: 16.2 GB

Llama 4 Scout 17B (16E)

17B

Meta's April 2025 mixture-of-experts release. 17B active parameters across 16 experts (109B total). Natively multimodal with an unprecedented 10M-token context window — a leap far beyond Llama 3's 128K. Scout was designed to run on a single GPU at Q4 while beating Llama 3.3 70B on reasoning and multilingual benchmarks. The Llama 4 licence tightened acceptable-use provisions vs Llama 3.

Context: 10.0M
License: llama-4
VRAM Q4: 10.2 GB

Llama 4 Maverick 17B (128E)

17B

Larger Llama 4 sibling of Scout — 17B active across 128 experts (400B total). 1M-token native context. Positioned as GPT-4o-class on chat and reasoning while remaining tractable on a single high-end host at fp8. Multimodal from the ground up; instruction-tuned by Meta with a heavier synthetic-data pipeline than Llama 3.

Context: 1.0M
License: llama-4
VRAM Q4: 10.2 GB

Phi-3 Medium 14B

14B

Phi-3's mid-tier model with extended 128K context. MIT licence. Strong reasoning relative to its parameter count thanks to Microsoft's heavy investment in synthetic training data.

Context: 128K
License: mit
VRAM Q4: 8.4 GB

Phi-4 14B

14B

14B model trained primarily on synthetic data. Punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 in particular demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.

Context: 16K
License: mit
VRAM Q4: 8.4 GB

Qwen2.5 14B Instruct

14B

Mid-size Qwen2.5 with broad task coverage. The sweet spot for users who want noticeably better quality than 7B but can't justify the hardware footprint of 32B or 72B.

Context: 128K
License: apache-2-0
VRAM Q4: 8.4 GB