Methodology

Benchmark scores

Scores are taken from each model's official Hugging Face card, the accompanying paper, or the release blog post — whichever the lab publishes as authoritative. If multiple sources disagree, we prefer the most recent. When a model lacks an official score for a benchmark we track (MMLU, HumanEval, MATH), we leave the field blank rather than guess.

Why this matters: labs use slightly different evaluation harnesses, prompt formats, and few-shot counts. Cross-family comparisons can over-state real-world differences. Treat the leaderboards as guidance, not gospel. For workloads where benchmark accuracy materially affects your decision, re-run the eval on your own infrastructure.

Each benchmark row stores a last_verified_at date. We do a sweep of the headline benchmarks (MMLU, HumanEval, MATH) periodically; specialist benchmarks (GPQA, MMLU-Pro) are updated when we add them.

VRAM estimates

The headline VRAM figures on each model page are weight-storage estimates: fp16 ≈ params × 2 GB and Q4 ≈ params × 0.6 GB. These are loading-only floors — real inference also needs memory for the KV cache, activations, and CUDA / runtime overhead.

For long-context workloads, KV cache often dominates. Plan for 1.5–2× the listed VRAM at full context length. The "Recommended hardware" field on each model page accounts for typical usage and is hand-curated rather than computed.

Inference pricing

Prices are listed in USD per million tokens for each provider's hosted endpoint, at the time the row was last verified. Providers update pricing frequently — always confirm on the provider's own pricing page before integrating. Every pricing row carries a last_verified_at date.

We attempt to list at least one provider per model that supports hosted inference. Models without listed pricing are typically self-host-only on this site — either because no major provider hosts them, or because we haven't verified pricing yet.

The "Cheapest" label on the model detail page is computed from input price; ties broken on output. It's a starting point, not a recommendation — latency, region, and rate-limit availability matter as much as raw $/MT for production workloads.

Licence summaries

The licence comparison page and the per-licence detail pages distil each licence's terms into yes/no answers for commercial use, modification, redistribution, and attribution requirements. These summaries are not legal advice — they're a starting point. For any commercial deployment, read the actual licence text. We link to the canonical source on every licence detail page.

Inclusion criteria

To be listed, a model must publish its weights under a licence that allows local inference. Source-available licences with usage restrictions (Llama 3 community licence, Gemma terms, Qwen 72B licence) qualify and are flagged on their licence pages. We don't list closed-API-only models or models whose weights are restricted to research-only deployment behind explicit approval (e.g. some early academic releases).

Corrections

The site is open source. If you spot a wrong number, stale price, or missing model, open an issue or pull request at github.com/bryanflowers/opensourceaimodels. Substantial corrections will trigger a re-verification sweep across related rows.