OSAIM
Open Source AI Models

All models

36 of 40 open-source models (filtered).

DeepSeek V3
671B

671B-parameter MoE model with 37B active per token. Trained for roughly $5.6M of compute — a landmark in cost-efficient frontier training. Frontier-class quality at a fraction of the cost of the closed proprietary frontier. The DeepSeek licence permits commercial use with limited restrictions on military and unlawful applications. Running V3 yourself requires serious hardware (8× H100 at fp8); most teams will use it via the DeepSeek API or providers like Together.

Context
128K
License
deepseek
VRAM Q4
402.6 GB
Jamba 1.5 Large
398B

Hybrid Mamba-Transformer-MoE model with native 256K context (effective beyond 140K). 94B active parameters out of 398B total. The state-space-model layers give it linear-time scaling with sequence length, making it interesting for very long contexts. Licensed under AI21's open model licence, which permits most commercial use.

Context
256K
License
jamba-open
VRAM Q4
238.8 GB
Nemotron-4 340B Instruct
340B

NVIDIA's reward-modelling research vehicle. Trained primarily to be a synthetic-data-generation specialist rather than a chat-first model. Useful for teams building instruction-tuning datasets at scale.

Context
4K
License
llama-3
VRAM Q4
204 GB
Grok 1
314B

xAI's first open-weights release: a 314B-parameter mixture-of-experts model. Apache 2.0 licensed. Largely a research artefact at this size — most users will run smaller models for production — but useful as a permissively-licensed reference for MoE research.

Context
8K
License
apache-2-0
VRAM Q4
188.4 GB
Mixtral 8×22B Instruct
141B

Scaled-up Mixtral with 22B-parameter experts. ~39B active parameters out of 141B total. Strong long-context performance and competitive coding scores. Apache 2.0 makes it attractive for self-hosting where the licence terms of Llama 3 are a non-starter.

Context
66K
License
apache-2-0
VRAM Q4
84.6 GB
Command R+
104B

Cohere's flagship 104B model. RAG-focused with native multilingual support across ~10 high-resource languages. CC-BY-NC weights; commercial use via Cohere's hosted API.

Context
128K
License
mrl
VRAM Q4
62.4 GB
Llama 3.2 90B Vision
90B

Larger vision-language Llama variant, competitive with the proprietary multimodal frontier on standard image-understanding benchmarks. Drops in as a vision upgrade where 11B isn't sharp enough. Requires substantial GPU memory in fp16; most teams will run it quantized or on multi-GPU. A natural pairing with retrieval pipelines that fetch image-rich chunks alongside text.

Context
128K
License
llama-3
VRAM Q4
54 GB
Qwen2.5 72B Instruct
72B

The flagship Qwen 2.5 release. Competes with Llama 3.1 405B on many benchmarks at one-fifth the parameter count. Note the 72B specifically uses the Qwen License (commercial use up to 100M MAU) — the smaller Qwen2.5 sizes are Apache 2.0.

Context
128K
License
qwen
VRAM Q4
43.2 GB
Llama 3.3 70B Instruct
70B

Meta's December 2024 refresh of Llama 3 70B that closes most of the gap with Llama 3.1 405B for chat workloads while remaining tractable on a single H100. Strong instruction following, robust tool-use behaviour, and a 128K context window make it the default choice for production chat at 70B scale. The 3.3 release was trained on a refreshed instruction-tuning data mix and benefits from Meta's most recent alignment work. It outperforms the much larger 3.1 405B on several reasoning benchmarks at a fraction of inference cost. The licence is the Llama 3 Community License, which permits commercial use unless your service exceeds 700M monthly active users. Good pick for: production chat at scale, RAG over long documents, agentic workflows where tool use matters, and any 70B-tier replacement for closed proprietary models.

Context
128K
License
llama-3
VRAM Q4
42 GB
DeepSeek R1 Distill Llama 70B
70B

R1 reasoning capabilities distilled into a Llama 3.3 70B base. The most accessible way to run R1-class reasoning locally — fits on a single H100 in fp16 or on a 4090 at Q4. Inherits Llama 3's community licence (commercial use under 700M MAU). Great pick for production reasoning workloads where the full R1 is too expensive to host but o1/R1-style quality is required.

Context
128K
License
llama-3
VRAM Q4
42 GB
Llama 3.1 Nemotron 70B Instruct
70B

NVIDIA's RLHF-tuned Llama 3.1 70B. Tops several Arena-style human-preference leaderboards and shipped with NVIDIA's reward-model research. Inherits the Llama 3 community licence.

Context
128K
License
llama-3
VRAM Q4
42 GB
Mixtral 8×7B Instruct
46.7B

The mixture-of-experts release that introduced 8 experts of 7B each, 2 active per token. ~13B active parameters with 47B total, which makes per-token inference roughly as fast as a 13B dense model while approaching 70B dense quality. Apache 2.0 weights mean it's still a popular self-hosting choice. Memory footprint is the main constraint — the full 47B parameters must be loaded even though only a quarter are active per token.

Context
33K
License
apache-2-0
VRAM Q4
28 GB
Command R
35B

Cohere's 35B model tuned for RAG and tool use. The open weights are released under CC-BY-NC (commercial use requires the Cohere API). Strong multilingual coverage and a fine-grained RAG-mode output format that makes downstream citation easier.

Context
128K
License
mrl
VRAM Q4
21 GB
Yi VL 34B
34B

Vision-language variant of Yi 34B. Image-text reasoning via an MLP adapter on a CLIP encoder. Useful for bilingual EN/中 multimodal workloads where the major Western vision-language models underperform on Chinese text in images.

Context
4K
License
apache-2-0
VRAM Q4
20.4 GB
Yi 1.5 34B Chat
34B

Bilingual EN/中 34B chat model. Apache 2.0 licensed with strong Chinese-language performance and competitive English chat quality. Good default for bilingual production workloads.

Context
33K
License
apache-2-0
VRAM Q4
20.4 GB
Qwen2.5 32B Instruct
32B

32B sweet-spot model: strong reasoning, fits on one H100 in fp16, on a 4090 at Q4. The 32B size in particular hits a quality/cost knee — quality scales with parameters faster than cost up to ~32B, and slower afterwards. Favoured for production chat where 7B isn't sharp enough and where 70B+ would over-spec the hardware budget. Apache 2.0 licence.

Context
128K
License
apache-2-0
VRAM Q4
19.2 GB
Gemma 2 27B
27B

Flagship Gemma 2 release. Uses logit-distillation from a larger teacher model, which is how Google delivers near-70B quality from a 27B student. A solid choice when the Llama community licence doesn't fit and you need quality at the 27B–40B size range.

Context
8K
License
gemma
VRAM Q4
16.2 GB
Mistral Small 3
24B

24B dense model from Mistral's January 2025 release that competes with Llama 3.3 70B on many tasks at a third of the parameter count. Apache 2.0 licensed and small enough to run on a single 4090 at Q4. Good pick when you want Llama-3.3-70B-class chat quality but at a friendlier hardware budget, or when the licence matters and Llama's community terms don't fit.

Context
33K
License
apache-2-0
VRAM Q4
14.4 GB
Phi-4 14B
14B

14B model trained primarily on synthetic data. Punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 in particular demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.

Context
16K
License
mit
VRAM Q4
8.4 GB
Phi-3 Medium 14B
14B

Phi-3's mid-tier model with extended 128K context. MIT licence. Strong reasoning relative to its parameter count thanks to Microsoft's heavy investment in synthetic training data.

Context
128K
License
mit
VRAM Q4
8.4 GB
Qwen2.5 14B Instruct
14B

Mid-size Qwen2.5 with broad task coverage. The sweet spot for users who want noticeably better quality than 7B but can't justify the hardware footprint of 32B or 72B.

Context
128K
License
apache-2-0
VRAM Q4
8.4 GB
OLMo 2 13B
13B

Larger OLMo 2 release. Same fully-open philosophy as the 7B variant. The 13B size makes it more competitive with mainstream production-grade chat models.

Context
4K
License
apache-2-0
VRAM Q4
7.8 GB
Mistral Nemo 12B
12B

Joint Mistral × NVIDIA model with 128K context, designed as a drop-in upgrade to Mistral 7B. Trained with NVIDIA's Megatron stack and released under Apache 2.0. Strong multilingual coverage thanks to the Tekken tokenizer.

Context
128K
License
apache-2-0
VRAM Q4
7.2 GB
Stable LM 2 12B
12B

Stability AI's general-purpose 12B model. Apache 2.0. Useful default when you need a permissively-licensed 12B-scale model.

Context
4K
License
apache-2-0
VRAM Q4
7.2 GB
Llama 3.2 11B Vision
11B

Llama 3's first vision-language model. Image understanding via a separately-trained ViT adapter bolted onto Llama 3 weights. Useful for OCR-adjacent workloads, document understanding, and image captioning at a permissive licence. The 11B size makes it cheap to host. Combined with the 128K text context, it handles long PDF-with-images workflows comfortably on a single 4090.

Context
128K
License
llama-3
VRAM Q4
6.6 GB
Gemma 2 9B
9B

Mid-tier Gemma. Strong general-purpose chat model at small scale. The Gemma Terms of Use permit commercial use subject to Google's prohibited-use policy.

Context
8K
License
gemma
VRAM Q4
5.4 GB
Llama 3.1 8B Instruct
8B

The workhorse 8B instruction-tuned model. Excellent quality-to-cost ratio and the broadest ecosystem support of any open-weights model — every major inference engine, fine-tuning library, and quantization toolchain has a 3.1 8B preset. Fits in 24 GB of VRAM at fp16, ~6 GB at Q4. Strong default for production chat where 70B is overkill, for fine-tuning on a specialist task, and for any workload where you want a known-good baseline.

Context
128K
License
llama-3
VRAM Q4
4.8 GB
Falcon 3 7B Instruct
7B

TII's latest dense 7B from December 2024. Strong scores on commonsense reasoning benchmarks. TII's Falcon licence permits royalty-free commercial use with attribution.

Context
33K
License
falcon-2
VRAM Q4
4.2 GB
OLMo 2 7B
7B

Fully-open 7B model: weights, training data and code all released under permissive licences. Useful as a reference for reproducibility research and for teams that need full transparency on training data provenance.

Context
4K
License
apache-2-0
VRAM Q4
4.2 GB
Falcon Mamba 7B
7B

The first major open-weights state-space model. Linear-time decoding, no KV cache — memory usage stays flat as context grows, which makes it interesting for very long-context workloads. Falcon licence.

Context
16K
License
falcon-2
VRAM Q4
4.2 GB
Qwen2.5 7B Instruct
7B

Apache-2.0-licensed 7B model with surprisingly strong reasoning and multilingual chops. Qwen 2.5 trains on a larger and more carefully filtered corpus than the original Qwen series, and the 7B variant punches well above its weight on coding and math benchmarks. A strong default for cost-sensitive chat workloads and for fine-tuning experiments where the Apache licence simplifies downstream redistribution.

Context
128K
License
apache-2-0
VRAM Q4
4.2 GB
Mistral 7B v0.3
7B

The original Mistral 7B refresh with 32K context and extended vocabulary. Permissive Apache 2.0 weights and the first widely-deployed sliding-window-attention model. Still useful in 2026 for very-low-cost inference and as a baseline for fine-tuning experiments.

Context
33K
License
apache-2-0
VRAM Q4
4.2 GB
Phi-3 Mini 4K Instruct
3.8B

Microsoft's flagship small-model demonstration: GPT-3.5-class on academic benchmarks at <4B parameters. The 4K context-window variant is the lightest; a 128K variant ships separately. MIT licensed, well-suited to on-device assistants and structured-extraction workloads where compactness matters more than absolute quality.

Context
4K
License
mit
VRAM Q4
2.3 GB
Llama 3.2 3B
3B

Pocket-sized Llama 3 variant for edge deployment. Surprising chat quality after instruction tuning makes it competitive with much larger models from a previous generation. At Q4 it fits in ~2 GB of VRAM and runs on consumer GPUs and recent Apple Silicon. A strong default for on-device chat, summarisation, and structured extraction tasks where the workload doesn't need frontier reasoning quality.

Context
128K
License
llama-3
VRAM Q4
1.8 GB
Gemma 2 2B
2.6B

Compact Gemma variant designed for on-device inference. Trained with knowledge distillation from larger Gemma 2 teachers. Runs comfortably on a phone at Q4.

Context
8K
License
gemma
VRAM Q4
1.6 GB
Llama 3.2 1B
1B

The smallest Llama 3 release, designed for on-device inference on phones and laptops. The 1B model runs comfortably in <2 GB of RAM at Q4 quantization and is fast enough for real-time chat on a modern smartphone. Useful for edge inference, on-device assistants where round-tripping to a server is undesirable, and as a draft model for speculative decoding in front of a larger Llama 3 variant.

Context
128K
License
llama-3
VRAM Q4
0.6 GB