Qwen 3 32B

Alibaba · Qwen32B30B – 70B Apache 2.0 chat reasoning code

32B sweet-spot Qwen 3, Apache 2.0. Reasoning-mode toggle inherited from smaller siblings; strong on math, code and agentic tool use. Fits on a single H100 in fp16 and on a 4090 at Q4.

Parameters: 32B
Context length: 33K
Modality: text
Released: 2025-04-29

Memory & hardware

VRAM (fp16): 64 GB
VRAM (Q4): 19.2 GB
Recommended: H100 80GB (fp16) or RTX 4090 (Q4)
Quantizations: fp16, fp8, q8_0, q5_k_m, q4_k_m

License: Apache 2.0

SPDX: Apache-2.0
Commercial use: Yes
Modification: Yes
Redistribution: Yes

License detail →

Benchmarks

HumanEval

89.6

MATH

87.4

ArenaHard

87.2

IFEval

85.7

MMLU

83.4

MMLU-Pro

71.0

GPQA

47.2

SWE-bench Verified

26.7

Benchmarks last verified 2026-07-02.

Hosted inference pricing

USD per million tokens.

Provider	Input	Output
together	$0.40	$0.40
deepinfraCheapest	$0.10	$0.30

Pricing last verified 2026-07-02. Providers update rates frequently; confirm before integrating.

Run it yourself

Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.

Run Qwen 3 32B locally

No official Ollama registry tag for this model — use transformers or vLLM below.

vLLM (production)

vllm serve Qwen/Qwen3-32B

High-throughput hosted inference; one command to expose an OpenAI-compatible HTTP server.

Transformers (Python)

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B", device_map="auto", torch_dtype="auto"
)

Direct PyTorch usage. Pin a torch / cuda version that matches your GPU.

Hugging Face ID: Qwen/Qwen3-32B

Related models

Same family or similar size — useful when shopping around.

Qwen2.5 32B Instruct

32B

32B sweet-spot model: strong reasoning, fits on one H100 in fp16, on a 4090 at Q4. The 32B size in particular hits a quality/cost knee — quality scales with parameters faster than cost up to ~32B, and slower afterwards. Favoured for production chat where 7B isn't sharp enough and where 70B+ would over-spec the hardware budget. Apache 2.0 licence.

Context: 128K
License: apache-2-0
VRAM Q4: 19.2 GB

Qwen2.5 Coder 32B

32B

Coding-specialised Qwen2.5 32B fine-tune. GPT-4o-class on HumanEval and BigCodeBench at the time of release. Trained on additional code-heavy data with extended pre-training. Apache 2.0. Natural pick for self-hosted coding assistants, code-review automation, and any agent loop that primarily writes code.

Context: 128K
License: apache-2-0
VRAM Q4: 19.2 GB

QwQ 32B Preview

32B

Qwen's reasoning-focused 'thinking' model. Generates long chains-of-thought before answering, similar to OpenAI's o1 and DeepSeek R1 lineage. Optimised for math and competition-style problem solving. The Preview tag means Qwen is iterating quickly; later versions may obsolete this one. Useful today for math-heavy workloads where a slow, careful answer is preferred to a fast wrong one.

Context: 33K
License: apache-2-0
VRAM Q4: 19.2 GB

Yi 1.5 34B Chat

34B

Bilingual EN/中 34B chat model. Apache 2.0 licensed with strong Chinese-language performance and competitive English chat quality. Good default for bilingual production workloads.

Context: 33K
License: apache-2-0
VRAM Q4: 20.4 GB

Yi VL 34B

34B

Vision-language variant of Yi 34B. Image-text reasoning via an MLP adapter on a CLIP encoder. Useful for bilingual EN/中 multimodal workloads where the major Western vision-language models underperform on Chinese text in images.

Context: 4K
License: apache-2-0
VRAM Q4: 20.4 GB

Command R

35B

Cohere's 35B model tuned for RAG and tool use. The open weights are released under CC-BY-NC (commercial use requires the Cohere API). Strong multilingual coverage and a fine-grained RAG-mode output format that makes downstream citation easier.

Context: 128K
License: mrl
VRAM Q4: 21 GB