Size

Under 3B

2 open-source models in this size bucket.

Gemma 2 2B

Compact Gemma variant designed for on-device inference. Trained with knowledge distillation from larger Gemma 2 teachers. Runs comfortably on a phone at Q4 quantization.

Context: 8K
License: gemma
VRAM Q4: 1.6 GB

Llama 3.2 1B

The smallest Llama 3 release, designed for on-device inference on phones and laptops. At Q4 quantization the 1B model runs comfortably in under 2 GB of RAM and is fast enough for real-time chat on a modern smartphone. Useful for edge inference, for on-device assistants where round-tripping to a server is undesirable, and as a draft model for speculative decoding in front of a larger Llama 3 variant.
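To make the draft-model use case concrete, here is a minimal sketch of greedy speculative decoding. It is a toy under stated assumptions, not how any particular runtime implements it: the `NextToken` interface, `speculative_decode`, and the two toy models are all hypothetical names introduced for illustration. The small draft model proposes a few tokens, the large target model checks them, and the longest agreeing prefix is kept. Real systems verify all proposed positions in one batched forward pass and use a probabilistic acceptance rule when sampling; this sketch covers only the greedy case.

```python
from typing import Callable, List

# Hypothetical interface: given the tokens so far, return the greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if expected != tok:
                # First disagreement: keep the target's token, drop the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every proposal matched: the target adds one more token if there is room.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))

        out.extend(accepted)
    return out


# Toy stand-ins for a large target and a small draft model (assumptions for the demo).
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except after a 4, where it guesses wrong.
    return 0 if seq[-1] % 5 == 4 else (seq[-1] + 1) % 10


result = speculative_decode(toy_target, toy_draft, prompt=[0], max_new=12, k=3)
```

In the greedy setting the output is identical to decoding with the target alone. The draft only changes how many target steps can be confirmed per round, which is why a small model from the same family is a natural draft.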

Context: 128K
License: llama-3
VRAM Q4: 0.6 GB
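As a rough sanity check on Q4 figures like the one above, weight memory can be approximated as parameter count times bits per weight. The sketch below assumes about 1.24 billion parameters for Llama 3.2 1B and exactly 4 bits per weight. Both values are assumptions for illustration, and `q4_weight_gb` is a hypothetical helper, not part of any library. Practical Q4 formats also store per-block scales, and the KV cache and activations need memory on top of the weights, so real usage is somewhat higher.

```python
def q4_weight_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight


# Assumed parameter count for Llama 3.2 1B (approximate).
llama_1b_gb = q4_weight_gb(1.24)

# Same model with an assumed 4.5 bits per weight to account for quantization scales.
llama_1b_with_scales_gb = q4_weight_gb(1.24, bits_per_weight=4.5)

print(f"4.0 bpw: {llama_1b_gb:.2f} GB, 4.5 bpw: {llama_1b_with_scales_gb:.2f} GB")
```

Under these assumptions the 4-bit estimate rounds to the listed 0.6 GB.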