2026-06-18站内改写2 min readUpdated: 2026-06-18

Gemma 4 on Cerebras—The Fastest Inference is Now Multimodal

Gemma 4 is now in private preview on Cerebras Inference, with general availability later this month. This multimodal model runs at over 1,500 tokens per second, enabling computer use and image-driven agentic workflows, 15x faster than Claude Haiku.

SourceCerebras Blog

Jun 18 2026

Cerebras Team

Gemma 4 is now in private preview on Cerebras Inference, with general availability later this month. This multimodal model unlocks an entirely new class of applications on Cerebras Inference, from computer use to image-driven agentic workflows, all running at over 1,500 tokens per second.

As the category leader in fast inference, Cerebras has set benchmarks across numerous open-weight models including Kimi, GLM, GPT-OSS, and Qwen. Gemma 4 is the first Google DeepMind model we have brought to the platform, and the first to let developers feed images—screenshots, documents, charts, UI states—into a model running at wafer-scale speed. The result: visual and agentic loops that once felt sluggish on GPUs become fast and responsive.

The Fastest Multimodal Model

Cerebras runs Gemma 4 at over 1,500 output tokens per second. By comparison, Claude Haiku runs at roughly 100 tokens per second. That is a 15x speedup against the most directly comparable production model, at quality that lands in the same band and at a price lower per output token.

Speed compounds in exactly the workloads Gemma 4 is built for. Multimodal and agentic loops rarely call a model once: they inspect a visual input, reason over it, produce structured output, call tools, check the result, and try again. At 100 tokens per second those loops are too slow to provide realtime input. At 1,500 TPS, the application and user can work together at the same time. Front-end iteration feels near-instant, document and screenshot workflows return in a fraction of the time, and developers can fit more verification and more retries into the same product.

The Smartest Gemma 4 Model

Gemma 4 31B is the flagship of Google DeepMind's open-weight Gemma family—a dense, multimodal model built for quality and efficiency rather than raw parameter count. Dense models achieve high model intelligence without the large memory footprint of MoE models. Gemma 4 hits a sweet spot: strong enough for serious work, efficient to serve, and open enough to build around without vendor lock-in.

On the Artificial Analysis Intelligence Index, Gemma 4 31B scores 29—effectively matching Claude Haiku at 30. The difference is that Gemma 4 is open-weight under Apache 2.0, and on Cerebras it runs an order of magnitude faster.

Gemma 4 is the first model on Cerebras to support image understanding. It enables workflows combining text with images—screenshots, charts, UI states, scanned pages, forms, diagrams. It also unlocks computer use and robotics applications

Bringing vision to wafer-scale hardware is a milestone for the platform. Multimodal support starts with Gemma 4, and we will extend it to additional models going forward. The combination of image understanding and wafer-scale speed is what unlocks new product experiences: a model that can see a dashboard, reason over it, return structured output, and act on it fast enough to keep a human or an agent in the loop.

Examples Include:

Screenshot Insight. Feed the model a dense dashboard screenshot or document page and watch it identify what matters, explain the finding, and return structured output—in real time rather than after a wait.

Long-context summarization. Hand it a research report or technical brief and get a crisp, decision-ready summary back fast enough to read, react, and re-query in a single sitting.

Screenshot to Patch. Play to medium-model strengths—take a broken UI screenshot, the source, and the console error, then returns a minimal patch and the checks to verify it.

Available Now in Private Preview

Gemma 4 enters private preview on Cerebras on June 18, with general availability at the end of the month. We recommend it as the reference medium size model on the platform: if you are looking to migrate from Llama, GPT-OSS, or Haiku, Gemma 4 provides equal or higher intelligence at Cerebras speed.

If you are building multimodal reasoning, document understanding, fast summarization, or targeted coding workflows and inference speed is the bottleneck, we would love to hear from you.