Google has released new quantization-aware trained (QAT) versions of its Gemma 2B and 7B models, enabling state-of-the-art performance while running efficiently on consumer-grade GPUs. Unlike post-training quantization, QAT simulates low-precision arithmetic during training itself, so the models maintain accuracy even after their weights are compressed to 4 bits. Notably, they outperform competitors such as Mistral and Llama 2 across multiple benchmarks, and their open weights support local deployment via platforms like Hugging Face and NVIDIA TensorRT-LLM. This positions Gemma QAT as a major step toward democratizing high-performance AI inference for individual developers and small teams.

Google Developer Blog
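
To make the idea concrete, here is a minimal PyTorch sketch of the core QAT trick: fake-quantizing weights to 4 bits in the forward pass while letting gradients flow to the full-precision weights via a straight-through estimator. This illustrates the general technique only, not Google's actual training recipe; the `fake_quantize` helper, the per-tensor scaling, and all parameter choices are illustrative assumptions.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Quantize-dequantize w in the forward pass; pass gradients straight
    through to the full-precision weights (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = w.abs().max().clamp(min=1e-8) / qmax    # per-tensor scale (assumed)
    q = (w / scale).round().clamp(-qmax - 1, qmax)  # round onto the int grid
    dq = q * scale                                  # back to float
    return w + (dq - w).detach()                    # forward: dq; backward: d/dw = 1

# Training "sees" the quantization error, so the optimizer learns weights
# that stay accurate when the model is later exported as real 4-bit ints.
w = torch.randn(8, 8, requires_grad=True)
x = torch.randn(4, 8)
loss = (x @ fake_quantize(w).t()).pow(2).mean()
loss.backward()                                     # gradients reach the float weights
print(w.grad.shape)                                 # torch.Size([8, 8])
```

Because the quantization error is present throughout training, the optimizer steers the weights toward values that survive the 4-bit rounding, which is why QAT checkpoints degrade far less than post-hoc quantized ones.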