OSAIM
Open Source AI Models

Kimi K2 Instruct

Moonshot AI's 1-trillion-parameter mixture-of-experts (32B active per token). Trained on 15.5T tokens with a heavy emphasis on tool-use and agentic behaviour. Modified-MIT licence with an attribution clause for very-large deployments. Exceptional at long-horizon agent tasks; benchmarked well against Claude Sonnet on SWE-bench Verified.

Parameters
1000B
Context length
128K
Modality
text
Released
2025-07-14

Memory & hardware

VRAM (fp16)
2000 GB
VRAM (Q4)
600 GB
Recommended
8× H100 80GB at fp8, or hosted via Together / Groq
Quantizations
fp16, fp8, q4_k_m

License: Modified MIT (Kimi K2)

SPDX
Commercial use
Yes
Modification
Yes
Redistribution
Yes

Benchmarks

ArenaHard
95.7
MATH
90.0
IFEval
89.8
MMLU
89.5
HumanEval
88.4
MMLU-Pro
82.1
BFCL
76.5
GPQA
75.1
SWE-bench Verified
65.8
Benchmarks last verified 2026-07-02.

Hosted inference pricing

USD per million tokens.

ProviderInputOutput
togetherCheapest$1.00$3.00
groq$1.00$3.00
Pricing last verified 2026-07-02. Providers update rates frequently; confirm before integrating.

Run it yourself

Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.

Run Kimi K2 Instruct locally
No official Ollama registry tag for this model — use transformers or vLLM below.
vLLM (production)
vllm serve moonshotai/Kimi-K2-Instruct
High-throughput hosted inference; one command to expose an OpenAI-compatible HTTP server.
Transformers (Python)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2-Instruct", device_map="auto", torch_dtype="auto"
)
Direct PyTorch usage. Pin a torch / cuda version that matches your GPU.
Hugging Face ID: moonshotai/Kimi-K2-Instruct

Related models

Same family or similar size — useful when shopping around.

DeepSeek R1
671B

Reasoning model trained with reinforcement learning on top of DeepSeek V3-Base. MIT licence — even the weights are unrestricted, making R1 the most permissively-licensed frontier reasoning model. Generates long internal chains-of-thought before answering, trading latency for accuracy on math, code, and reasoning benchmarks. Distilled variants (e.g. R1 Distill Llama 70B) recover most of the quality at much smaller scales.

Context
128K
License
mit
VRAM Q4
402.6 GB
DeepSeek V3
671B

671B-parameter MoE model with 37B active per token. Trained for roughly $5.6M of compute — a landmark in cost-efficient frontier training. Frontier-class quality at a fraction of the cost of the closed proprietary frontier. The DeepSeek licence permits commercial use with limited restrictions on military and unlawful applications. Running V3 yourself requires serious hardware (8× H100 at fp8); most teams will use it via the DeepSeek API or providers like Together.

Context
128K
License
deepseek
VRAM Q4
402.6 GB
Llama 3.1 405B Instruct
405B

Meta's July 2024 flagship — the first open-weights model at 405B parameters. Trained on 15T tokens with 128K context. Rivals GPT-4o on many academic benchmarks and set the ceiling for open-weights quality for most of 2024. Running it self-hosted requires serious hardware (8× H100 at fp8 or multi-node at fp16); most users will run it via a hosted provider (Together, Groq, Fireworks). Llama 3.3 70B closed most of the practical gap at a fraction of the cost, so 405B is now most useful when 70B specifically hits its ceiling.

Context
128K
License
llama-3
VRAM Q4
243 GB
Jamba 1.5 Large
398B

Hybrid Mamba-Transformer-MoE model with native 256K context (effective beyond 140K). 94B active parameters out of 398B total. The state-space-model layers give it linear-time scaling with sequence length, making it interesting for very long contexts. Licensed under AI21's open model licence, which permits most commercial use.

Context
256K
License
jamba-open
VRAM Q4
238.8 GB
Nemotron-4 340B Instruct
340B

NVIDIA's reward-modelling research vehicle. Trained primarily to be a synthetic-data-generation specialist rather than a chat-first model. Useful for teams building instruction-tuning datasets at scale.

Context
4K
License
llama-3
VRAM Q4
204 GB
Grok 1
314B

xAI's first open-weights release: a 314B-parameter mixture-of-experts model. Apache 2.0 licensed. Largely a research artefact at this size — most users will run smaller models for production — but useful as a permissively-licensed reference for MoE research.

Context
8K
License
apache-2-0
VRAM Q4
188.4 GB