2026-06-03 18:46 UTCIn-site rewrite4 min readUpdated: 2026-06-30 13:03 UTC

Google DeepMind Releases Gemma 4 12B: An Encoder-Free Multimodal Model with Native Audio That Runs on a 16 GB Laptop

Google DeepMind has released Gemma 4 12B, a 12-billion-parameter dense multimodal model that eliminates traditional encoders, feeding vision and audio directly into the LLM backbone. It runs locally on consumer laptops with 16 GB RAM, under the Apache 2.0 license. The model natively handles text, images, audio, and video, making it the first mid-sized Gemma with native audio input.

SourceMarkTechPostAuthor: Asif Razzaq

Article intelligence

EngineersAdvanced

Key points

Encoder-free design: removes separate 550M vision and 300M audio encoders, using a lightweight 35M vision embedder and direct audio wave projection.
Achieves near-26B MoE performance with under half the memory footprint, running on 16 GB devices.
First mid-sized Gemma with native audio support, including ASR and diarization; adds video understanding.
Open-source under Apache 2.0, compatible with llama.cpp, MLX, vLLM, and more.

Why it matters

This matters because encoder-free design: removes separate 550M vision and 300M audio encoders, using a lightweight 35M vision embedder and direct audio wave projection.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

This panel is AI-generated and reviewed for accuracy.

Google DeepMind just released Gemma 4 12B, a dense multimodal model that strips out traditional encoders entirely. Vision and audio flow straight into the LLM backbone. The result is a model that runs agentic workflows on a consumer laptop with 16 GB of RAM. It ships under the Apache 2.0 license.

Model Overview & Access

Gemma 4 12B is a 12-billion-parameter decoder-only transformer. It handles text, images, audio, and video natively. There are no separate vision or audio encoders. The decoder uses the same structure as the Gemma 4 31B Dense model. It bridges the gap between the edge-friendly E4B and the larger 26B Mixture of Experts variant.

Architecture: Unified, encoder-free decoder-only transformer.

Modalities: Text, image, video, and native audio input — the first mid-sized Gemma with audio.

Hardware requirement: 16 GB VRAM or unified memory. Runs on consumer GPU laptops and Apple Silicon Macs.

License: Apache 2.0. Weights are open and publicly downloadable.

Inference stack: Compatible with llama.cpp, MLX, vLLM, Ollama, SGLang, Unsloth, and LM Studio.

Download: Hugging Face and Kaggle. The instruct variant is google/gemma-4-12B-it.

Integration: Hugging Face Transformers, LiteRT-LM CLI, and an OpenAI-compatible local API server via litert-lm serve.

A dedicated Multi-Token Prediction (MTP) drafter model is also released. It reduces inference latency on local hardware.

Architecture: The Encoder-Free Design

Every prior mid-sized Gemma model used separate Transformer encoders for vision and audio. Those encoders added latency and parameter overhead. The medium-sized Gemma 4 models carry a 550M-parameter vision encoder. The E2B and E4B models include a 300M-parameter audio encoder. All of that is gone in the 12B.

Vision embedder (35M parameters): Raw images are split into 48×48 pixel patches. Each patch is projected to the LLM’s hidden dimension with a single matrix multiplication. There is no attention layer; each patch is processed independently. Spatial position is injected using a factorized coordinate lookup: a learned X matrix and a learned Y matrix. For a patch at (x, y), the model looks up two learned embeddings and adds them to form a position vector. This is added to the patch embedding, followed by normalization. That is the entire vision pipeline.

Audio wave projection: Raw 16 kHz audio is sliced into 40 ms frames. Each frame contains 640 values. Those values are linearly projected into the same embedding space as text tokens. There is no feature extraction and no conformer layers. The LLM’s existing Rotary Position Embedding (RoPE) handles the 1-D temporal sequence. The audio encoder in the E2B and E4B used 12 conformer layers. All of that is removed.

Importance: The unified weight space means you no longer co-tune separate frozen encoders. Downstream fine-tuning with LoRA or full tuning updates vision, audio, and text processing in a single pass. Hugging Face Transformers and Unsloth already support this.

The encoder-free design reduces multimodal latency. The LLM backbone starts processing immediately. No encoder must finish first.

Capabilities & Performance

Google DeepMind team has not published full benchmark results in the initial launch materials. The official release notes state the 12B model performs nearing the 26B MoE model on standard benchmarks, at less than half the total memory footprint.

https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/

The model’s demonstrated capabilities include:

Automatic speech recognition. Transcribes audio natively without an external ASR pipeline.

Agentic reasoning. Runs multi-step workflows locally, with performance approaching the 26B MoE model.

Diarization. Distinguishes speakers in audio input.

Video understanding. Processes video frames alongside audio. A demo analyzed a 5-minute Google I/O keynote segment using 313 frames at 1 FPS with a visual token budget of 70 per frame.

Coding. Built a Gradio image-processing app using its own code generation, served locally with llama.cpp.

Multimodal agentic workflows. The official Gemma Skills repository at github.com/google-gemma/gemma-skills provides pre-built agent capabilities.

In Google’s own Google AI Edge Eloquent app, the switch to Gemma 4 12B produced what Google reports as a 60%+ jump in overall quality, with improved instruction following and scope adherence.

Marktechpost’s Visual Explainer

Released June 3, 2026

Gemma 4 12B

Google DeepMind’s unified, encoder-free multimodal model

A 12-billion-parameter decoder-only transformer that drops separate vision and audio encoders. Vision and audio flow straight into the LLM backbone. It runs locally on a 16 GB laptop under an Apache 2.0 license.

Encoder-free — no separate vision or audio encoders

First mid-sized Gemma with native audio input; adds video

Local-ready — 16 GB VRAM or unified memory

Overview & Access

What ships

Specs, weights, and the inference stack

Architecture — decoder-only, same structure as Gemma 4 31B Dense

Modalities — text, image, video, and native audio

Hardware — 16 GB VRAM / unified memory; GPU laptops and Apple Silicon

License — Apache 2.0; weights on Hugging Face and Kaggle

Instruct variant — google/gemma-4-12B-it

Speed — a dedicated Multi-Token Prediction (MTP) drafter is also released

Architecture · Vision

A 35M vision embedder

Replacing the 550M vision encoder of the medium-sized models

Raw images split into 48×48 pixel patches

Each patch projected to the LLM hidden dimension with a single matrix multiplication

No attention layer — each patch is processed independently

Position via a factorized X/Y coordinate lookup, then normalization

That is the entire vision pipeline

Architecture · Audio

Direct audio wave projection

No conformer layers, no feature extraction

Removes the 12 conformer layers used in Gemma 4 E2B and E4B

Raw 16 kHz audio sliced into 40 ms frames (640 values each)

Frames projected into the same embedding space as text tokens

The LLM’s existing RoPE handles the temporal sequence

The first mid-sized Gemma to natively ingest audio

Capabilities & Performance

Near-26B reasoning, half the memory

Google reports performance nearing the 26B MoE at under half the memory footprint

ASR & diarization — native transcription, speaker separation

Agentic reasoning — multi-step workflows run locally

Video — demo on a 5-min I/O keynote: 313 frames at 1 FPS, token budget 70

Coding — built a Gradio app via gemma-skills, served with llama.cpp

No full benchmark tables published at launch

Run It Locally

Three paths on day one

Native macOS apps plus a drop-in local server

Google AI Edge Gallery (macOS) — sandboxed Python execution loop

Google AI Edge Eloquent (macOS) — on-device dictation and editing

LiteRT-LM CLI — litert-lm serve exposes an OpenAI-compatible endpoint

Works with Continue, Aider, OpenCode, Open WebUI

Also LM Studio, Ollama, Transformers, Unsloth, vLLM, SGLang, MLX

Deploy on Cloud Run, GKE, or Gemini Enterprise Agent Platform Model Garden

Key Takeaways

The bottom line

What the encoder-free design buys you

No separate vision (550M) or audio (300M) encoders

35M vision embedder plus direct audio wave projection

Fine-tuning updates vision, audio, and text in a single pass

Nears 26B performance at under half the memory; runs on 16 GB

Apache 2.0 with broad ecosystem support out of the gate

1 / 7

Marktechpost — AI research, model releases & developer tools for 1M+ practitioners. marktechpost.com

Key Takeaways

Google DeepMind released Gemma 4 12B, a dense encoder-free multimodal model under the Apache 2.0 license.

Vision and audio feed straight into the LLM backbone — no separate vision (550M) or audio (300M) encoders.

A 35M vision embedder uses a single matmul plus factorized X/Y position lookup; audio projects raw 16 kHz frames directly.

It is the first mid-sized Gemma with native audio, and adds video, running on a 16 GB laptop.

Benchmark performance nears the 26B MoE model at less than half the memory footprint.

Check out the Model Weights and Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

The post Google DeepMind Releases Gemma 4 12B: An Encoder-Free Multimodal Model with Native audio that runs on a 16 GB laptop appeared first on MarkTechPost.