OSAIM
Open Source AI Models

Qwen 3 8B

The April 2025 refresh of Qwen at 8B. Native mixed-mode reasoning: the model can 'think' before answering when triggered, or answer directly for simple queries — configurable per request. Apache 2.0. A strong upgrade over Qwen 2.5 7B on math and code, with much better instruction following.

Parameters
8B
Context length
33K
Modality
text
Released
2025-04-29

Memory & hardware

VRAM (fp16)
16 GB
VRAM (Q4)
4.8 GB
Recommended
RTX 3090 or Apple M-series
Quantizations
fp16, q8_0, q5_k_m, q4_k_m, gguf

License: Apache 2.0

SPDX
Apache-2.0
Commercial use
Yes
Modification
Yes
Redistribution
Yes

Benchmarks

HumanEval
84.8
IFEval
83.2
MATH
80.2
MMLU
76.9
ArenaHard
74.4
MMLU-Pro
56.7
Benchmarks last verified 2026-07-02.

Hosted inference pricing

No hosted pricing listed — this model is currently self-host-only on this site.

Run it yourself

Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.

Run Qwen 3 8B locally
No official Ollama registry tag for this model — use transformers or vLLM below.
vLLM (production)
vllm serve Qwen/Qwen3-8B
High-throughput hosted inference; one command to expose an OpenAI-compatible HTTP server.
Transformers (Python)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", device_map="auto", torch_dtype="auto"
)
Direct PyTorch usage. Pin a torch / cuda version that matches your GPU.
Hugging Face ID: Qwen/Qwen3-8B

Related models

Same family or similar size — useful when shopping around.

Qwen2.5 7B Instruct
7B

Apache-2.0-licensed 7B model with surprisingly strong reasoning and multilingual chops. Qwen 2.5 trains on a larger and more carefully filtered corpus than the original Qwen series, and the 7B variant punches well above its weight on coding and math benchmarks. A strong default for cost-sensitive chat workloads and for fine-tuning experiments where the Apache licence simplifies downstream redistribution.

Context
128K
License
apache-2-0
VRAM Q4
4.2 GB
Llama 3.1 8B Instruct
8B

The workhorse 8B instruction-tuned model. Excellent quality-to-cost ratio and the broadest ecosystem support of any open-weights model — every major inference engine, fine-tuning library, and quantization toolchain has a 3.1 8B preset. Fits in 24 GB of VRAM at fp16, ~6 GB at Q4. Strong default for production chat where 70B is overkill, for fine-tuning on a specialist task, and for any workload where you want a known-good baseline.

Context
128K
License
llama-3
VRAM Q4
4.8 GB
Hermes 3 Llama 3.1 8B
8B

NousResearch's community-driven fine-tune on the Llama 3.1 8B base. Tuned for strong tool use, function calling and steerable persona behaviour. Inherits Llama 3's community licence and its 128K context.

Context
128K
License
llama-3
VRAM Q4
4.8 GB
Falcon 3 7B Instruct
7B

TII's latest dense 7B from December 2024. Strong scores on commonsense reasoning benchmarks. TII's Falcon licence permits royalty-free commercial use with attribution.

Context
33K
License
falcon-2
VRAM Q4
4.2 GB
Falcon Mamba 7B
7B

The first major open-weights state-space model. Linear-time decoding, no KV cache — memory usage stays flat as context grows, which makes it interesting for very long-context workloads. Falcon licence.

Context
16K
License
falcon-2
VRAM Q4
4.2 GB
Gemma 2 9B
9B

Mid-tier Gemma. Strong general-purpose chat model at small scale. The Gemma Terms of Use permit commercial use subject to Google's prohibited-use policy.

Context
8K
License
gemma
VRAM Q4
5.4 GB