OSAIM
Open Source AI Models

Qwen2.5 14B Instruct

Mid-size Qwen2.5 with broad task coverage. The sweet spot for users who want noticeably better quality than 7B but can't justify the hardware footprint of 32B or 72B.

Parameters
14B
Context length
128K
Modality
text
Released
2024-09-18

Memory & hardware

VRAM (fp16)
28 GB
VRAM (Q4)
8.4 GB
Recommended
A100 40GB or RTX 4090 (Q4)
Quantizations
fp16, q8_0, q5_k_m, q4_k_m

License: Apache 2.0

SPDX
Apache-2.0
Commercial use
Yes
Modification
Yes
Redistribution
Yes

Benchmarks

HumanEval
83.5
MATH
80.0
MMLU
79.7
Benchmarks last verified 2026-05-18.

Hosted inference pricing

USD per million tokens.

ProviderInputOutput
togetherCheapest$0.30$0.30
Pricing last verified 2026-05-18. Providers update rates frequently; confirm before integrating.

Run it yourself

Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.

Run Qwen2.5 14B Instruct locally
Ollama (easiest)
ollama run qwen2.5:14b
Single-line install + run; uses the official Ollama registry tag for this family.
vLLM (production)
vllm serve Qwen/Qwen2.5-14B-Instruct
High-throughput hosted inference; one command to expose an OpenAI-compatible HTTP server.
Transformers (Python)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct", device_map="auto", torch_dtype="auto"
)
Direct PyTorch usage. Pin a torch / cuda version that matches your GPU.
Hugging Face ID: Qwen/Qwen2.5-14B-Instruct

Related models

Same family or similar size — useful when shopping around.

Phi-3 Medium 14B
14B

Phi-3's mid-tier model with extended 128K context. MIT licence. Strong reasoning relative to its parameter count thanks to Microsoft's heavy investment in synthetic training data.

Context
128K
License
mit
VRAM Q4
8.4 GB
Phi-4 14B
14B

14B model trained primarily on synthetic data. Punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 in particular demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.

Context
16K
License
mit
VRAM Q4
8.4 GB
OLMo 2 13B
13B

Larger OLMo 2 release. Same fully-open philosophy as the 7B variant. The 13B size makes it more competitive with mainstream production-grade chat models.

Context
4K
License
apache-2-0
VRAM Q4
7.8 GB
Qwen2.5 7B Instruct
7B

Apache-2.0-licensed 7B model with surprisingly strong reasoning and multilingual chops. Qwen 2.5 trains on a larger and more carefully filtered corpus than the original Qwen series, and the 7B variant punches well above its weight on coding and math benchmarks. A strong default for cost-sensitive chat workloads and for fine-tuning experiments where the Apache licence simplifies downstream redistribution.

Context
128K
License
apache-2-0
VRAM Q4
4.2 GB
Mistral Small 3
24B

24B dense model from Mistral's January 2025 release that competes with Llama 3.3 70B on many tasks at a third of the parameter count. Apache 2.0 licensed and small enough to run on a single 4090 at Q4. Good pick when you want Llama-3.3-70B-class chat quality but at a friendlier hardware budget, or when the licence matters and Llama's community terms don't fit.

Context
33K
License
apache-2-0
VRAM Q4
14.4 GB
Gemma 2 27B
27B

Flagship Gemma 2 release. Uses logit-distillation from a larger teacher model, which is how Google delivers near-70B quality from a 27B student. A solid choice when the Llama community licence doesn't fit and you need quality at the 27B–40B size range.

Context
8K
License
gemma
VRAM Q4
16.2 GB