OSAIM
Open Source AI Models

Hermes 3 Llama 3.1 8B

NousResearch's community-driven fine-tune on the Llama 3.1 8B base. Tuned for strong tool use, function calling and steerable persona behaviour. Inherits Llama 3's community licence and its 128K context.

Parameters
8B
Context length
128K
Modality
text
Released
2024-08-15

Memory & hardware

VRAM (fp16)
16 GB
VRAM (Q4)
4.8 GB
Recommended
RTX 3090 or Apple M-series
Quantizations
fp16, q8_0, q5_k_m, q4_k_m, gguf

License: Llama 3 Community License

SPDX
Commercial use
Yes
Modification
Yes
Redistribution
Yes

Benchmarks

IFEval
66.9
MMLU
65.4
HumanEval
60.4
Benchmarks last verified 2026-07-02.

Hosted inference pricing

No hosted pricing listed — this model is currently self-host-only on this site.

Run it yourself

Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.

Run Hermes 3 Llama 3.1 8B locally
Ollama (easiest)
ollama run llama3.1:8b
Single-line install + run; uses the official Ollama registry tag for this family.
vLLM (production)
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
High-throughput hosted inference; one command to expose an OpenAI-compatible HTTP server.
Transformers (Python)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-3-Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Hermes-3-Llama-3.1-8B", device_map="auto", torch_dtype="auto"
)
Direct PyTorch usage. Pin a torch / cuda version that matches your GPU.
Hugging Face ID: NousResearch/Hermes-3-Llama-3.1-8B

Related models

Same family or similar size — useful when shopping around.

Llama 3.1 8B Instruct
8B

The workhorse 8B instruction-tuned model. Excellent quality-to-cost ratio and the broadest ecosystem support of any open-weights model — every major inference engine, fine-tuning library, and quantization toolchain has a 3.1 8B preset. Fits in 24 GB of VRAM at fp16, ~6 GB at Q4. Strong default for production chat where 70B is overkill, for fine-tuning on a specialist task, and for any workload where you want a known-good baseline.

Context
128K
License
llama-3
VRAM Q4
4.8 GB
Qwen 3 8B
8B

The April 2025 refresh of Qwen at 8B. Native mixed-mode reasoning: the model can 'think' before answering when triggered, or answer directly for simple queries — configurable per request. Apache 2.0. A strong upgrade over Qwen 2.5 7B on math and code, with much better instruction following.

Context
33K
License
apache-2-0
VRAM Q4
4.8 GB
Falcon 3 7B Instruct
7B

TII's latest dense 7B from December 2024. Strong scores on commonsense reasoning benchmarks. TII's Falcon licence permits royalty-free commercial use with attribution.

Context
33K
License
falcon-2
VRAM Q4
4.2 GB
Falcon Mamba 7B
7B

The first major open-weights state-space model. Linear-time decoding, no KV cache — memory usage stays flat as context grows, which makes it interesting for very long-context workloads. Falcon licence.

Context
16K
License
falcon-2
VRAM Q4
4.2 GB
Gemma 2 9B
9B

Mid-tier Gemma. Strong general-purpose chat model at small scale. The Gemma Terms of Use permit commercial use subject to Google's prohibited-use policy.

Context
8K
License
gemma
VRAM Q4
5.4 GB
Mistral 7B v0.3
7B

The original Mistral 7B refresh with 32K context and extended vocabulary. Permissive Apache 2.0 weights and the first widely-deployed sliding-window-attention model. Still useful in 2026 for very-low-cost inference and as a baseline for fine-tuning experiments.

Context
33K
License
apache-2-0
VRAM Q4
4.2 GB