Google DeepMind's DiffusionGemma is an experimental open-weight model that uses diffusion to generate text blocks in parallel, offering faster local inference compared to traditional autoregressive models. Built on the Gemma 4 26B A4B MoE architecture, it trades some quality for speed, making it ideal for interactive and editing tasks. The article explains its architecture, how text diffusion works, benchmark results, and provides a step-by-step guide to run it locally using llama.cpp.
DiffusionGemma generates and refines blocks of tokens in parallel, reducing latency for local inference.
It uses bidirectional attention and a 256-token canvas with multiple denoising steps.
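The canvas-and-denoising idea above can be illustrated with a toy sketch. It shows confidence-based iterative unmasking, one common family of text-diffusion decoders; whether DiffusionGemma decodes exactly this way is not stated in the summary. The names `toy_denoiser` and `diffusion_decode` are hypothetical, and the "model" here simply reads from a known target sequence.

```python
import math

MASK = None  # marks a canvas position that has not been decided yet


def toy_denoiser(canvas, target):
    """Hypothetical stand-in for the model (assumption, not the real network).

    Proposes a token and a confidence for every position at once. Confidence
    rises when the left or right neighbour is already filled in, which loosely
    mimics the benefit of bidirectional context.
    """
    proposals = []
    for i in range(len(canvas)):
        left = i > 0 and canvas[i - 1] is not MASK
        right = i + 1 < len(canvas) and canvas[i + 1] is not MASK
        confidence = 0.4 + 0.3 * left + 0.3 * right - 0.001 * i
        proposals.append((target[i], confidence))
    return proposals


def diffusion_decode(target, steps):
    """Fill a fully masked canvas over a fixed number of denoising steps."""
    canvas = [MASK] * len(target)
    per_step = math.ceil(len(target) / steps)  # positions committed per step
    history = []
    for _ in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok is MASK]
        if not masked:
            break
        proposals = toy_denoiser(canvas, target)
        # Commit only the most confident masked positions in this step.
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:per_step]:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return canvas, history


tokens = ["the", "cat", "sat", "on", "the", "mat", "today", "."]
final, history = diffusion_decode(tokens, steps=4)
```

With 8 positions and 4 steps, two positions are committed per step, so the number of model calls depends on the step count rather than on the sequence length. The sketch leaves out any re-masking or refinement of already committed tokens.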
Cohere has released its first developer-facing coding model, North Mini Code, a 30B total parameter mixture-of-experts model with only 3B active parameters per token. It runs on a single H100 GPU, supports 256K context length, and is optimized for code generation, agentic software engineering, and terminal tasks. The weights are open under Apache 2.0.
North Mini Code is Cohere’s first coding model, 30B total parameters with 3B active, supporting 256K context and 64K max output.
Runs on a single H100 at FP8; weights open under Apache 2.0 via Hugging Face, Cohere API, and more.
Ollama's MLX engine has been updated to deliver its highest performance on Apple Silicon yet. By leaning more heavily on Apple's unified memory and the Metal-backed MLX framework, models output higher quality responses, respond faster, and use less memory. The update includes support for NVFP4 format, up to 20% faster output, and a snapshot system for agent workflows.
Ollama's MLX engine now supports NVFP4 format, halving quantization quality loss.
Output speed increased by up to 20% due to fused Metal kernels and optimized sampling.
Google has released DiffusionGemma, a new open-weight model under Apache 2 license, available for free via NVIDIA's NIM cloud API. It delivers impressive generation speeds exceeding 500 tokens per second.
Google releases open-source DiffusionGemma model under Apache 2 license.
Google released DiffusionGemma, a 26-billion-parameter model that generates text via diffusion, achieving 1,000 tokens per second on an H100 GPU—four times faster than autoregressive models, but with lower quality. It's currently experimental.
26-billion-parameter diffusion model for text generation
DiffusionGemma is Google DeepMind's experimental open text generation model that uses text diffusion instead of standard autoregressive decoding, achieving up to 4x faster generation on dedicated GPUs. The 26B MoE model (3.8B active parameters) is built on the Gemma 4 backbone, supports multimodal inputs (text, image, video), has a 256K context window, covers 140+ languages, and is released under Apache 2.0.
DiffusionGemma is a 26B Mixture of Experts (MoE) model with 3.8B active parameters that generates text in parallel via diffusion, not token-by-token.
It achieves 1000+ tokens/s on a single NVIDIA H100 and 700+ tokens/s on an RTX 5090, fitting in 18GB VRAM when quantized.
Google DeepMind released DiffusionGemma, an experimental open model for fast text generation using parallel token generation. NVIDIA optimized it to run faster on GeForce RTX, RTX PRO, and DGX Spark systems, achieving up to 1000 tokens/sec locally.
DiffusionGemma generates up to 256 tokens in parallel per step, unlike traditional autoregressive models. It is based on Gemma 4 (26B parameters, MoE) and activates only 3.8B parameters per step, for up to 4x faster generation. It is open source under Apache 2.0 and runs locally with no cloud dependency.
NeuroBait is a fine-tuned AI model designed to help ADHD brains by providing dopamine sparks to overcome task initiation paralysis. Created from real observation of the author's wife, it uses warm, flowing prose to offer one tiny actionable step instead of overwhelming to-do lists. Built with LoRA on Gemma 3 12B and deployed on Hugging Face, it aims to help anyone feeling stuck, not just those with ADHD.
NeuroBait generates warm, flowing text to give a tiny actionable step, helping ADHD brains start tasks. It focuses on emotional barriers, not to-do lists.
Fine-tuned with LoRA on Gemma 3 12B using a small curated synthetic dataset derived from real ADHD friction.
A study evaluated LLaMA 3.1 for extracting structured data from Dutch brain MRI reports. The model showed high performance on categorical variables like visual rating scores but lower performance on numerical variables. Few-shot prompting improved numerical extraction accuracy significantly.
Doubleword's batch inference offering keeps costs down by keeping throughput high, which is hard to do given the architecture of popular Mixture-of-Experts models. While MoE's sparse expert weights make these models quick to train, they also mean that at each layer of every forward pass, each request in a batch typically requires different expert weights to be loaded. This makes inference severely memory-bandwidth bound and cuts throughput relative to dense models. However, by reordering inputs so that similar prompts batch together, we can overlap the experts needed and reduce the number of unique experts loaded per forward pass.
MoE inference is memory-bandwidth bound due to sparse expert loading, reducing throughput.
Reordering inputs to batch similar prompts reduces unique expert loads per forward pass.
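A minimal sketch of why reordering helps, assuming each request's expert set for one layer is already known (in practice prompt similarity would stand in for it). The function names and the sample routing data are hypothetical and are not Doubleword's implementation.

```python
def unique_experts_per_batch(requests, batch_size):
    """Count the distinct experts each batch must load for one layer.

    `requests` is a list of sets of expert ids (hypothetical routing results).
    """
    counts = []
    for start in range(0, len(requests), batch_size):
        batch = requests[start:start + batch_size]
        counts.append(len(set().union(*batch)))
    return counts


def reorder_by_similarity(requests):
    """Crude reordering: place requests with identical expert sets side by side."""
    return sorted(requests, key=lambda experts: tuple(sorted(experts)))


# Hypothetical routing: two kinds of prompt, interleaved in arrival order.
arrival_order = [{0, 1}, {6, 7}, {0, 1}, {6, 7}]

before = unique_experts_per_batch(arrival_order, batch_size=2)
after = unique_experts_per_batch(reorder_by_similarity(arrival_order), batch_size=2)
```

In arrival order each batch of two touches four experts; after reordering each batch touches two, so half as many expert weights are read per batch in this toy case.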
This article details the configuration and memory calibration required to run the Qwen 3.6 35B MoE model at a 450,000 token context window on a single 32GB VRAM GPU (NVIDIA RTX 5090) using llama.cpp with TurboQuant and YaRN scaling. It covers model selection, quantization trade-offs, KV cache quantization, RoPE scaling, multimodal setup, replication guide, VRAM lifecycle management, and performance evaluation.
Run Qwen3.6-35B-A3B-Q6_K on a single RTX 5090 with 450K context using llama.cpp TurboQuant fork and YaRN scaling.
Achieve 450K context by compressing KV cache to 3-bit (turbo3) and extending RoPE beyond native 262K with YaRN, but at cost of perplexity and retrieval accuracy.
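The two levers named above reduce to simple arithmetic. The sketch below assumes the native 262K window means 262,144 tokens and that the uncompressed KV cache is 16-bit; both are assumptions, and per-block quantization overhead is ignored. It does not reproduce any llama.cpp or TurboQuant flag.

```python
def rope_scale_factor(target_ctx, native_ctx):
    """How far the RoPE positions must be stretched beyond the native window."""
    return target_ctx / native_ctx


def kv_cache_size_ratio(quant_bits, base_bits=16):
    """Quantized KV cache size relative to a 16-bit cache (assumed baseline)."""
    return quant_bits / base_bits


NATIVE_CTX = 262_144  # assumption: "262K" read as 2**18 tokens
TARGET_CTX = 450_000

scale = rope_scale_factor(TARGET_CTX, NATIVE_CTX)  # roughly 1.72
ratio = kv_cache_size_ratio(3)                     # 0.1875
```

Under these assumptions the context is stretched by a factor of about 1.72 while the KV cache shrinks to under a fifth of its 16-bit size, which is the trade against perplexity and retrieval accuracy that the article describes.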
Harness-1 is a 20B retrieval subagent built on gpt-oss-20b, trained with reinforcement learning inside a stateful search harness. The harness handles bookkeeping—candidate pool, curated set, evidence graph, verification records—while the policy focuses on search, curation, and verification decisions. It achieves 0.730 average curated recall across eight benchmarks, outperforming the next open subagent by 11.4 points and trailing only Opus-4.6. Weights and harness code are public.
Harness-1 is a 20B retrieval subagent trained with RL in a stateful search harness.
The harness manages bookkeeping; the policy handles semantic decisions.
Unlike GPT-4o or Qwen3.5-Omni, Audio Interaction doesn't wait for a recording to end: it translates, transcribes, chats, and picks up everyday noises like coughing in a single stream. Code, model weights, and download instructions are available on GitHub under the Apache 2.0 open-source license, with the training data to follow.
The Audio Interaction model continuously listens to audio streams, making decisions every 0.4 seconds.
It can translate, transcribe, chat, and recognize everyday noises in a single stream.
A personal project called tinderbox allows users to export Claude.ai conversations, index them locally, and search them from any Claude session via an MCP server. Supports hybrid retrieval, Supabase storage, and Ollama embeddings.
Export Claude.ai conversation ZIPs, automatically parse and ingest them
Hybrid semantic + full-text search over messages and artifacts
Google DeepMind released Quantization-Aware Training checkpoints for Gemma 4, targeting edge devices and consumer GPUs. This comparison of BF16, Q4_0 QAT, and the new mobile QAT format focuses on memory footprint, quality preservation, and deployment suitability using published data.
Q4_0 QAT reduces E2B memory from 9.6 GB (BF16) to 3.2 GB, and E4B from 15 GB to 5 GB.
The new mobile QAT format brings E2B to ~1 GB; text-only goes under 1 GB.
Google releases new Gemma 4 checkpoints optimized with Quantization-Aware Training (QAT) to reduce memory usage and enable local deployment on edge devices and consumer GPUs. The models include a custom mobile quantization format that cuts memory footprint to 1GB for the E2B model.
QAT integration during training minimizes quality loss from compression.
Custom mobile quantization schema includes static activations, channel-wise quantization, targeted 2-bit quantization, and embedding/KV cache optimization.
On June 3, 2026, Google introduced Gemma 4 12B Unified, an open-source multimodal model that understands text, images, audio, and video within a single architecture. It combines a 256K context window with a laptop-friendly design for agentic workflows and local deployment. This article covers its architecture, features, benchmarks, and practical guidance for developers.
Gemma 4 12B Unified is a mid-sized open-source multimodal model with an encoder-free design that projects image and audio directly into the LLM embedding space.
It supports 256K context, function calling, 35+ languages, speech recognition, video understanding, and can run locally via tools like Ollama.
At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution — a difference invisible to the scalar error rate. The Errorquake-10k benchmark scores each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, revealing that severity profiles provide information beyond error rate.
Errorquake-10k benchmark scores LLM responses on a 0-4 severity scale, revealing heavy-tailed severity distributions.
Many model pairs show significantly different severity distributions at matched accuracy, indicating that error rate alone is insufficient.
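A small illustration of the point that a scalar error rate can hide differences in severity. The two score lists are made-up data on the 0-4 scale, not Errorquake-10k results, and the summary statistics chosen here are assumptions rather than the benchmark's own metrics.

```python
from statistics import mean


def error_rate(severities):
    """Share of responses with any error (severity above 0)."""
    return sum(s > 0 for s in severities) / len(severities)


def severity_profile(severities, severe_threshold=3):
    """Summarize only the erroneous responses."""
    errors = [s for s in severities if s > 0]
    return {
        "mean_error_severity": mean(errors),
        "max_severity": max(errors),
        "share_severe": sum(s >= severe_threshold for s in errors) / len(errors),
    }


# Hypothetical severity scores for two models on the same ten prompts.
model_a = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
model_b = [0, 0, 0, 0, 0, 0, 0.5, 0.5, 4, 4]

profile_a = severity_profile(model_a)
profile_b = severity_profile(model_b)
```

Both lists give an error rate of 0.4, yet half of the second model's errors are severe and none of the first model's are.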
Ollama 0.30 is now available with improved performance and GGUF model compatibility through llama.cpp, augmenting MLX on Apple silicon and supporting more models on wider hardware.
NVIDIA has released Nemotron 3 Ultra, a 550B total (55B active) open Mixture-of-Experts hybrid Mamba-Transformer for long-running agents. It pairs a 1M-token context with up to ~6x higher inference throughput than comparable open LLMs at on-par accuracy, and ships with open weights, training data, and recipes under OpenMDW-1.1.
NVIDIA releases Nemotron 3.5 Content Safety, a unified model combining multimodal input, multilingual coverage, custom enterprise policy enforcement, and auditable reasoning for content safety. Built on Google Gemma 3 4B IT and fine-tuned with LoRA, it supports explicit training in 12 languages with zero-shot generalization to ~140 languages. New features include custom policy enforcement via natural language specifications and a THINK mode for auditable step-by-step reasoning. The model achieves ~85% average accuracy across multiple multilingual and multimodal safety benchmarks while maintaining a compact 4B-parameter size and low latency. NVIDIA also releases a safety dataset with multimodal, multilingual safety reasoning traces.
Gemma 4 12B, which Google released under the Apache 2.0 license, is another example of how cloud providers are enabling enterprises to run models on local devices for agentic workflows.
Google released Gemma 4 12B under Apache 2.0 license.
The model enables enterprises to run AI on local devices for agentic workflows.
Learn how to perform text classification using locally hosted open-source LLMs like Llama 3, Mistral, and Gemma via Ollama and the Scikit-LLM Python library, all without API costs.
Install Ollama and pull open-source LLMs for local use.
Configure Scikit-LLM to route requests to local Ollama endpoint.
This paper introduces Orli (Ordered Regression of Lines), an end-to-end model that unifies text line detection and reading order prediction as a single image-to-sequence task. Trained on 196,691 pages across ten writing systems, Orli marginally exceeds state-of-the-art on cBAD line detection without dataset-specific training, achieves near-perfect coverage and ordering on multiple reading-order benchmarks zero-shot, and adapts to specialized out-of-domain layouts with limited fine-tuning. Code and weights are open-sourced.
Orli frames line detection and reading order as a single image-to-sequence problem
Uses chord-frame parameterization for baselines with iterative refinement and local visual refiner
POLARIS is a training recipe for small open-weight models that significantly improves long-form creative writing. By combining a frontier LLM judge with human-reference injection in a GRPO framework, the resulting 9B model matches the performance of much larger models and exhibits strong length generalization.
POLARIS uses an LLM-as-a-judge reward and human-reference injection to enhance small model writing
Trained on Qwen3.5-9B, POLARIS-9B competes with larger models like Qwen3.5-27B
NVIDIA Nemotron 3 Ultra is a 550 billion parameter (55B active) open model designed for long-running agentic workflows, with 1M token context and NVFP4 optimization, leading in agentic benchmarks and cost efficiency.
550B total parameters with 55B active per token, optimized for agent orchestration and coding agents.
1M token context window for entire codebases and tool histories.
Google has released the Gemma 4 12B model, a 12-billion-parameter AI model that can run on consumer laptops with 16GB of RAM, filling a gap in the Gemma 4 lineup between mobile and high-performance models.
Google's new Gemma 4 12B model requires only 16GB of RAM to run locally.
It fits between the mobile-optimized models and the high-end 26B/31B models.
Google DeepMind's Gemma 4 12B is an open-source model that processes text, images, and audio natively and runs on laptops with just 16 GB of RAM. It nearly matches the twice-as-large 26B model in benchmarks and ships under an Apache 2.0 license for commercial use.
Google DeepMind has released Gemma 4 12B, a 12-billion-parameter dense multimodal model that eliminates traditional encoders, feeding vision and audio directly into the LLM backbone. It runs locally on consumer laptops with 16 GB RAM, under the Apache 2.0 license. The model natively handles text, images, audio, and video, making it the first mid-sized Gemma with native audio input.
Encoder-free design: removes separate 550M vision and 300M audio encoders, using a lightweight 35M vision embedder and direct audio wave projection.
Achieves near-26B MoE performance with under half the memory footprint, running on 16 GB devices.
Ideogram releases version 4.0 of its text-to-image model as an open-weight model with native 2K resolution, bounding box control, and improved text rendering. On the DesignArena leaderboard, it ranks first among all open models; only closed systems from OpenAI and Google score higher. Commercial use requires a paid license.
This study evaluates Random Forest and four CNNs (ResNet-50, ResNet-101, EfficientNet-B4, ConvNeXt-Large) for transferable satellite-derived bathymetry over 0-20 m depth using Sentinel-2 imagery. Key design choices include preserving spatial continuity (contiguous reef blocks) and a Smooth Weight Function (SWF)-weighted RMSE loss. Intra-regional RMSE ranges from 1.15-1.92 m (as low as 0.26 m for shallow depths), while cross-regional RMSE is 2.46-2.98 m for deep models. On the MagicBathyNet benchmark, the proposed networks achieve 0.19-0.22 m RMSE, outperforming U-Net and a task-specific transformer with fewer parameters. Multi-temporal imagery and median aggregation reduce noise. Optimized architectures and pretrained weights are released for scalable transfer.
Preserving spatial continuity (contiguous reef blocks) during training is the single most impactful design choice.
A Smooth Weight Function (SWF)-weighted RMSE loss emphasizes near-surface depth accuracy.
This paper introduces Geometry-Aware Tabular Diffusion (GATD), which augments tabular diffusion denoisers with pairwise angles and lengths computed from column value differences as inputs and auxiliary targets. The MLP instantiation achieves state-of-the-art benchmark performance while using 3.5x fewer parameters on average (up to 25x for classification tasks): on ten datasets, it wins 8/10 Shape, 7/10 Trend, and 9/10 downstream utility (F1/RMSE), reducing Shape and Trend error by 27% and 20%. Default loss weights transfer to GNN and Transformer denoisers, improving Shape on 27/30 and Trend on 25/30 architecture-dataset cells. A matched ablation shows supervision (not extra inputs or capacity) drives the gain. This shows explicit relational supervision is a portable inductive bias for tabular diffusion.
GATD enhances tabular diffusion with explicit geometric features (angles and lengths) to model inter-column relationships.
MLP variant achieves SOTA with fewer parameters, significantly reducing Shape and Trend errors.
The CVE AI Agent is an autonomous vulnerability intelligence engine that continuously ingests, enriches, and triages CVE data, delivering findings to platforms like n8n, Jira, Slack, Splunk, or local file exports. It features a token-efficient architecture using deterministic minimization logic to filter noise, with prompts averaging 1,000 tokens. The agent follows a strict Two-Pass architecture: Pass 1 extracts all measurable data deterministically, and Pass 2 uses an LLM to fill qualitative sections. It supports multiple LLM providers, including Gemini, OpenAI, Claude, Groq, and Ollama, and offers a web dashboard.
CVE AI Agent is an autonomous vulnerability intelligence pipeline designed for SOC-grade, auditable security.
Uses a Two-Pass architecture: deterministic engine for data extraction, LLM only for qualitative enrichment, reducing hallucinations.
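The Two-Pass split can be sketched as follows. This is a simplified illustration and not the project's code: the regular expressions, field names, sample advisory text, and the `llm` callable are all hypothetical, and the CVSS pattern only handles the form "CVSS 9.8".

```python
import re

CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}")
CVSS_SCORE = re.compile(r"CVSS[^0-9]*(\d{1,2}\.\d)")  # simplification: "CVSS 9.8"


def pass_one(raw_text):
    """Pass 1: extract measurable fields deterministically, without an LLM."""
    cve = CVE_ID.search(raw_text)
    score = CVSS_SCORE.search(raw_text)
    return {
        "cve_id": cve.group(0) if cve else None,
        "cvss": float(score.group(1)) if score else None,
    }


def pass_two(facts, raw_text, llm):
    """Pass 2: ask the LLM only for the qualitative section.

    The extracted facts are copied into the report as-is, so the model's
    output cannot overwrite them.
    """
    prompt = (
        f"Known facts (do not alter): {facts}\n"
        "Describe the likely impact in one sentence.\n\n"
        f"{raw_text}"
    )
    return {**facts, "impact_summary": llm(prompt)}


# Placeholder advisory text and a stub in place of a real LLM provider.
advisory = "CVE-2099-00001 (CVSS 9.8): remote code execution in example-service."
facts = pass_one(advisory)
report = pass_two(facts, advisory, llm=lambda prompt: "stub summary")
```

Keeping identifiers and scores out of the model's hands is what limits hallucination to the free-text section in this design.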
Researchers propose MIND (Data Manifold-aware Image Diffusion Model), which explicitly models manifold geometry by integrating discrete patch tokenization into the score function of a continuous diffusion model. On ImageNet 256×256, MIND-B with 130M parameters achieves FID 2.06 under guidance, surpassing LlamaGen-3B's performance.
MIND combines structural quantification of discrete tokens with parallel generation flexibility of continuous diffusion.
A novel soft top-k aggregation mechanism enables end-to-end differentiable training.
Proposes SENSE, which uses target model hidden states for semantic retrieval and soft-gated evaluation to improve robustness and efficiency of retrieval-based speculative decoding, achieving up to 4.09 mean acceptance length and 3.26x speedup on LLaMA and Qwen.
SENSE anchors retrieval on hidden states of target model for semantic alignment.
Soft-gated Evaluation validates semantic equivalence instead of surface forms.
Chinese AI company MiniMax has released its new model M3. It's billed as the first open-weight model to combine top-tier coding performance, a one-million-token context window, and native multimodality.
MiniMax releases M3, the first open-weight model combining top coding performance, 1M-token context, and native multimodality.
The model challenges proprietary leaders in AI performance.
NVIDIA announced Nemotron 3 Ultra, a 550B-parameter open weights model with 55B active parameters, achieving the highest intelligence among US open weights models with a score of 48 on the AI Index, and serving over 300 tokens per second on DeepInfra.
Nemotron 3 Ultra is the largest and most intelligent US open weights model to date.
It scores 48 on the AI Index, surpassing other US models but trailing Chinese Kimi K2.6.
MiniMax releases M3, the first open-weights model to combine frontier coding and agentic performance, sparse attention for a 1M-token context, and native multimodality. It posts strong benchmark results such as 59% on SWE-Bench Pro, and API pricing launches with a 7-day 50% discount. Weights and tech report are due in ~10 days.
MiniMax M3 is the first open-weights model combining frontier coding and agentic performance, sparse attention for 1M context, and native multimodality.
Benchmark results: SWE-Bench Pro 59.0%, Terminal Bench 66.0%, SWE-fficiency 34.8%, etc.
This study introduces a multi-model paradigm to study synthetic deception via LoRA fine-tuning of five transformer models. Linear probes detect deception with near-perfect AUC in early layers, and logistic regression probes outperform MLP probes, supporting the Linear Representation Hypothesis. Probes generalize across domains with minimal loss. Different models exhibit distinct representational regimes: collapse in Pythia/Llama/Qwen versus high-dimensional preservation in Gemma-2. The results show that robust, domain-invariant deception representations can be rapidly entrenched through modest supervised fine-tuning, with implications for activation-based monitoring.
Linear probes on mean-pooled hidden states detect synthetic dishonesty with near-perfect AUC (≥0.99) as early as layers 1-3 in four architectures. Logistic regression consistently matches or outperforms MLP probes.
Probes trained on TruthfulQA generalize with near-zero loss (ΔAUC≈0) to held-out MMLU subjects. Late-layer representations show strong robustness to Gaussian noise.
MAVEN (Modular Agentic Verification and Execution Network) is a lightweight symbolic reasoning scaffold designed to enhance generalization in tool-calling environments through structured decomposition, adaptive tool orchestration, and intermediate verification. On the MAVEN-Bench stress test, MAVEN improves the GPT-OSS-120b base model from 48% to 71% accuracy without additional training, using an open-weight backbone at roughly 1/10 the cost of proprietary baselines.
MAVEN is a lightweight symbolic reasoning scaffold for improving generalization in agentic tool calling.
On MAVEN-Bench, MAVEN boosts GPT-OSS-120b accuracy from 48% to 71% without extra training.
Open-weight AI models without guardrails are becoming more accessible, raising safety concerns. A new method called 'abliteration' easily removes restrictions, allowing anyone to use the models for harmful activities like generating terrorist content or making weapons. Although such models have legitimate uses, regulating and safeguarding them remains difficult.
Open-weight models (e.g., DeepSeek) can have safety guardrails easily removed, lowering the barrier to misuse.
The 'abliteration' method simplifies removal, leading to a surge in such models.
According to Epoch's internal capability metric (ECI), open-weight models take an average of 4 months to catch up with state-of-the-art closed models. ECI is a composite measure covering many benchmarks.
Open-weight models lag behind closed models by approximately 4 months on average
Epoch uses the ECI metric to measure model performance
At Mistral AI's summit, CEO Arthur Mensch warned that Europe has just two years to build sufficient AI infrastructure or risk becoming a 'vassal state' to American AI. The event drew a large crowd, highlighting growing European demand for data sovereignty and open-source models, despite the region still lagging behind the US in investment and scale.
Mistral CEO warns Europe has two years to build AI infrastructure or become a vassal state.
Summit attracts large turnout, underscoring Europe's desire for an independent AI ecosystem.
A project demonstrates boosting Qwen3-30B inference speed from 0.09 to 14.03 tok/s on a 2017 MacBook Air by combining a human experimenter, Codex, llama.cpp, a local database, and IBM Quantum sampling. The QPU is used for candidate selection, not for running the model directly.
Runs Qwen3-30B on 2017 MacBook Air (8GB RAM, CPU-only)
Hybrid quantum-classical optimization loop achieves 14.03 tok/s from 0.09 baseline
Personal insights from the Mistral AI Now Summit: Mistral is evolving from a model company to a full AI stack provider with its own compute, models, platforms, and consultancy. The summit emphasized partnerships (ASML, BNP Paribas, Amazon) over new model releases. Specialized small models (Document AI, Voxtral, Robostral) outperform big general ones for specific tasks. Sovereignty and on-prem deployment are key differentiators for European enterprises. An inspiring talk on using AI to decipher ancient papyrus documents showcased AI's potential in humanities.
Mistral is transforming from a model company into a full-stack AI provider with in-house compute, models, platforms, and consultancy.
Summit focused on partnerships (ASML, BNP Paribas, Amazon) rather than new model announcements.