Best open-source AI models for edge / on-device inference

Edge inference is where the model meets the user: no network round-trip, no server bills, no data leaving the device. The current crop of sub-4B models is good enough for many real workloads.

What we optimise for

We're optimising for three things: a memory footprint under 4 GB at Q4 quantisation, usable CPU-only performance, and broad ecosystem support (Ollama, llama.cpp, MLX, NPU-accelerated runtimes).
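
As a rough illustration of the memory criterion, the sketch below estimates the size of a model's quantised weights from its parameter count. It assumes about 4.5 bits per weight for a Q4-style format and a flat 0.5 GB allowance for the KV cache and runtime buffers. Both numbers are assumptions for illustration, and the function names are hypothetical; real footprints depend on the quantisation scheme, the context length, and the runtime.

```python
def q4_weight_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantised weights in decimal gigabytes.

    bits_per_weight=4.5 is an assumed average for a Q4-style format
    (4-bit values plus per-block scale metadata); adjust it for your format.
    """
    return params_billion * bits_per_weight / 8


def fits_budget(
    params_billion: float,
    budget_gb: float = 4.0,
    overhead_gb: float = 0.5,  # assumed allowance for KV cache and buffers
    bits_per_weight: float = 4.5,
) -> bool:
    """Return True if weights plus the assumed overhead fit in the budget."""
    return q4_weight_gb(params_billion, bits_per_weight) + overhead_gb <= budget_gb


if __name__ == "__main__":
    # Nominal parameter counts (in billions) taken from the model names.
    for name, size in [("1B", 1.0), ("3B", 3.0), ("3.8B", 3.8), ("8B", 8.0)]:
        print(name, round(q4_weight_gb(size), 2), "GB, fits:", fits_budget(size))
```

Treat the result as a first-pass filter only; measuring the loaded model on the target device is the reliable check.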

Why it matters

On-device inference enables fundamentally different products: offline assistants, latency-sensitive UX, and private inference with zero cloud dependency.

Our picks

#1 Llama 3.2 1B
Runs on a modern smartphone. The de facto on-device default.
#2 Llama 3.2 3B
Step up in quality; still fits comfortably on a laptop or recent phone.
#3 Phi-3 Mini 4K Instruct
GPT-3.5-class on academic benchmarks at <4B params.
#4 Gemma 2 2B
Compact Gemma. A strong default if the Llama community licence doesn't fit.

Things to watch out for

Smaller models hallucinate more. Pair them with strict structured-output formats (JSON schema) when correctness matters.
On-device fine-tuning with LoRA is now practical at this scale, so consider personalising a 1–3B model on the user's own data, directly on the device.
Battery and thermal budget matter as much as raw model quality on mobile.
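
To make the structured-output advice concrete, here is a minimal sketch that rejects any model reply that is not valid JSON with exactly the expected fields and types. The schema, its field names, and the function name are hypothetical, and this uses a simple type map rather than a full JSON Schema validator. It only checks the reply after generation; some runtimes can additionally constrain decoding to a grammar or schema, which this sketch does not attempt.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA: Dict[str, type] = {"title": str, "priority": int, "tags": list}


def parse_strict(raw: str, schema: Dict[str, type] = SCHEMA) -> Optional[Dict[str, Any]]:
    """Return the parsed object if it matches the schema exactly, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the reply was not JSON at all

    # Require an object with exactly the expected keys (no missing, no extras).
    if not isinstance(data, dict) or set(data) != set(schema):
        return None

    for key, expected in schema.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        if isinstance(value, bool) and expected is not bool:
            return None
        if not isinstance(value, expected):
            return None
    return data


if __name__ == "__main__":
    good = '{"title": "Renew passport", "priority": 2, "tags": ["admin"]}'
    bad = '{"title": "Renew passport", "priority": "high", "tags": []}'
    print(parse_strict(good))
    print(parse_strict(bad))
```

A caller would typically retry generation, or fall back to a safe default, whenever the function returns None.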

All picks at a glance

Llama 3.2 1B

The smallest Llama 3 release, designed for on-device inference on phones and laptops. The 1B model runs comfortably in <2 GB of RAM at Q4 quantization and is fast enough for real-time chat on a modern smartphone. Useful for edge inference, on-device assistants where round-tripping to a server is undesirable, and as a draft model for speculative decoding in front of a larger Llama 3 variant.

Context: 128K
License: llama-3
VRAM Q4: 0.6 GB
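
The draft-model use case mentioned above can be sketched with toy stand-ins. This is a simplified, greedy form of speculative decoding, not any particular runtime's implementation: a cheap draft model proposes a few tokens, the larger target model checks them, and the longest agreeing prefix is kept. The two "models" here are hypothetical deterministic functions over integer token IDs, and the target is called once per position for clarity, whereas a real runtime would verify all proposed positions in a single batched forward pass.

```python
from typing import Callable, List

# A "model" here maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    draft: NextToken, target: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> List[int]:
    """Greedy speculative decoding: the output matches target-only greedy decoding."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target model checks each proposed token in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # Keep the agreeing prefix, then take the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every proposed token was accepted
    return out


def toy_target(ctx: List[int]) -> int:
    """Hypothetical 'large' model: an arbitrary deterministic rule."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target most of the time."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print(speculative_decode(toy_draft, toy_target, prompt, 12))
    print(greedy_decode(toy_target, prompt, 12))
```

The speed-up in practice comes from the batched verification step, so the benefit depends on how often the draft model agrees with the target.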

Llama 3.2 3B

Pocket-sized Llama 3 variant for edge deployment. Surprising chat quality after instruction tuning makes it competitive with much larger models from a previous generation. At Q4 it fits in ~2 GB of VRAM and runs on consumer GPUs and recent Apple Silicon. A strong default for on-device chat, summarisation, and structured extraction tasks where the workload doesn't need frontier reasoning quality.

Context: 128K
License: llama-3
VRAM Q4: 1.8 GB

Phi-3 Mini 4K Instruct

Parameters: 3.8B

Microsoft's flagship small-model demonstration: GPT-3.5-class on academic benchmarks at <4B parameters. The 4K context-window variant is the lightest; a 128K variant ships separately. MIT licensed, well-suited to on-device assistants and structured-extraction workloads where compactness matters more than absolute quality.

Context: 4K
License: mit
VRAM Q4: 2.3 GB

Gemma 2 2B

Parameters: 2.6B

Compact Gemma variant designed for on-device inference. Trained with knowledge distillation from larger Gemma 2 teachers. Runs comfortably on a phone at Q4.

Context: 8K
License: gemma
VRAM Q4: 1.6 GB

Last reviewed 2026-07-05. We refresh these picks as new models ship. See the full directory at /models.