Medical LVLMs are prone to factual inconsistencies and poor visual grounding. Existing alignment methods have three key limitations in the medical domain: sequence-level rewards weight clinically critical tokens the same as every other token, reliance on static SFT references causes off-policy shift, and alignment lacks visual grounding constraints. The proposed method combines a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective, forming a fine-grained on-policy alignment framework that constructs preference pairs by minimally editing model outputs. Experiments validate its effectiveness.
Existing preference optimization methods in medicine suffer from sequence-level rewards, off-policy shift, and lack of visual grounding.
Proposed method combines bidirectional token-wise KL regularizer and visual-contrastive grounding.
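As a rough illustration of what a bidirectional token-wise KL penalty could look like, the sketch below sums KL(policy || reference) and KL(reference || policy) at each token position. This is a generic symmetric KL, not the paper's exact formulation; the per-position `weights` argument (for example, to emphasize clinically critical tokens) is an assumption added for illustration.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def bidirectional_tokenwise_kl(policy_dists, ref_dists, weights=None):
    """Weighted sum over token positions of KL(policy||ref) + KL(ref||policy).

    `weights` is a hypothetical per-position importance (default: all 1.0).
    """
    if weights is None:
        weights = [1.0] * len(policy_dists)
    total = 0.0
    for p, q, w in zip(policy_dists, ref_dists, weights):
        total += w * (kl(p, q) + kl(q, p))
    return total

# Two token positions over a toy two-word vocabulary.
policy = [[0.7, 0.3], [0.5, 0.5]]
reference = [[0.7, 0.3], [0.9, 0.1]]

uniform = bidirectional_tokenwise_kl(policy, reference)
# Doubling the weight of the second (diverging) position doubles the penalty,
# because the first position matches the reference exactly.
weighted = bidirectional_tokenwise_kl(policy, reference, weights=[1.0, 2.0])
```

A sequence-level penalty would collapse both positions into one number before weighting, which is the limitation the summary attributes to existing methods.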
OpenAI is considering cutting API token prices to win customers from Anthropic, according to the Wall Street Journal, signaling a potential price war in the AI industry.
OpenAI plans to lower token prices to attract Anthropic's customers
The move could trigger a broader price war in AI APIs
Cohere has released its first developer-facing coding model, North Mini Code, a 30B total parameter mixture-of-experts model with only 3B active parameters per token. It runs on a single H100 GPU, supports 256K context length, and is optimized for code generation, agentic software engineering, and terminal tasks. The weights are open under Apache 2.0.
North Mini Code is Cohere’s first coding model, 30B total parameters with 3B active, supporting 256K context and 64K max output.
Runs on a single H100 at FP8; weights open under Apache 2.0 via Hugging Face, Cohere API, and more.
Google has released DiffusionGemma, a new open-weight model under the Apache 2.0 license, available for free via NVIDIA's NIM cloud API. It delivers impressive generation speeds exceeding 500 tokens per second.
Google releases open-source DiffusionGemma model under the Apache 2.0 license.
Google released DiffusionGemma, a 26-billion-parameter model that generates text via diffusion, achieving 1,000 tokens per second on an H100 GPU—four times faster than autoregressive models, but with lower quality. It's currently experimental.
26-billion-parameter diffusion model for text generation
DiffusionGemma is Google DeepMind's experimental open text generation model that uses text diffusion instead of standard autoregressive decoding, achieving up to 4x faster generation on dedicated GPUs. The 26B MoE model (3.8B active parameters) is built on the Gemma 4 backbone, supports multimodal inputs (text, image, video), has a 256K context window, covers 140+ languages, and is released under Apache 2.0.
DiffusionGemma is a 26B Mixture of Experts (MoE) model with 3.8B active parameters that generates text in parallel via diffusion, not token-by-token.
It achieves 1000+ tokens/s on a single NVIDIA H100 and 700+ tokens/s on an RTX 5090, fitting in 18GB VRAM when quantized.
Google DeepMind released DiffusionGemma, an experimental open model for fast text generation using parallel token generation. NVIDIA optimized it to run faster on GeForce RTX, RTX PRO, and DGX Spark systems, achieving up to 1000 tokens/sec locally.
DiffusionGemma generates up to 256 tokens in parallel per step, unlike traditional autoregressive models. It is based on Gemma 4 (26B parameters, MoE) and activates only 3.8B parameters per step, for up to 4x faster generation. It is open source under Apache 2.0 and runs locally with no cloud dependency.
Anthropic releases Claude Fable 5 and Mythos 5, with Fable 5 offering the same performance as Mythos 5 but with stricter safety guardrails. The models feature a 1 million token context window, 128k output tokens, and pricing double that of Opus 4.8. Simon Willison spent 5.5 hours testing Fable 5 and found it to be a 'beast'—knowledgeable and capable, but slow and expensive. Fable 5 successfully upgraded micropython-wasm to full Python, implemented pause-resume for tool calls in Datasette Agent and the LLM library, and consumed $110.42 in tokens in a single day.
Claude Fable 5 is Anthropic's new flagship model with same capabilities as Mythos 5 but stricter safety. It has a 1M context window, 128k output, and costs double Opus 4.8.
Fable 5 demonstrated deep knowledge by listing Simon Willison's open source projects in detail and by correcting a typo in the prompt.
Anthropic announced Claude Fable 5, claiming it is the most powerful AI model it has widely released, with exceptional performance in software engineering, knowledge work, and vision. It marks the first broad release from the Mythos class, previously deemed too dangerous due to cybersecurity capabilities. New safeguards block responses in high-risk areas, falling back to Claude Opus 4.8 when necessary. Anthropic also launched Claude Mythos 5, available only in a limited trusted-access program. Pricing is $10 per million input tokens and $50 per million output tokens.
Claude Fable 5 is Anthropic's most powerful widely available AI model, excelling in long and complex tasks.
It is the first public release from the Mythos class, previously restricted due to cybersecurity risks.
Researchers identify a 'concept bottleneck' in CoCoNuT latent reasoning where intermediate hidden states are overwritten, causing performance loss. They propose AGCLR with a persistent gated memory (write, read, forget gates) that consistently improves performance on GSM8K, HotpotQA, and ProsQA using GPT-2, with the gap widening as curriculum depth increases.
CoCoNuT suffers from a concept bottleneck: intermediate states overwritten, losing early facts; performance degrades with depth
AGCLR adds a Gated Concept Stream with write, read, and forget gates for persistent memory
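To make the contrast between overwriting and persistent memory concrete, here is a minimal sketch of a generic gated memory step with write, forget, and read gates. The update rule, the scalar gates, and all names are assumptions for illustration; the summary does not give AGCLR's actual equations.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_memory_step(memory, hidden, write_logit, forget_logit, read_logit):
    """One step of a generic gated memory (hypothetical form, scalar gates).

    new_memory = forget * memory + write * hidden
    output     = hidden + read * new_memory
    """
    write = sigmoid(write_logit)
    forget = sigmoid(forget_logit)  # close to 1.0 means "keep the old memory"
    read = sigmoid(read_logit)
    new_memory = [forget * m + write * h for m, h in zip(memory, hidden)]
    output = [h + read * m for h, m in zip(hidden, new_memory)]
    return new_memory, output

memory = [1.0, 2.0]
hidden = [0.5, 0.5]

# Gates set to preserve: the early content survives the step.
kept, _ = gated_memory_step(memory, hidden, write_logit=-50, forget_logit=50, read_logit=-50)

# Gates set to overwrite: the memory is replaced by the current hidden state,
# which mimics the bottleneck described above.
overwritten, _ = gated_memory_step(memory, hidden, write_logit=50, forget_logit=-50, read_logit=-50)
```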
AI-noleak is a local reverse proxy that intercepts accidentally exposed secrets (API keys, tokens) from AI coding agents and replaces them with deterministic placeholders before they reach the upstream AI model. It operates via three layers (PTY wrapper, HTTP proxy, file watcher) without requiring TLS MITM or root CA certificates, ensuring local security isolation.
Three layers of protection: PTY input, HTTP transport, file storage. No TLS MITM needed.
Secrets are replaced with placeholders (@TOKEN_xxxxxx@); AI models only see placeholders, which are reversibly restored locally.
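The placeholder substitution described above can be sketched as follows. This is not AI-noleak's implementation: the detection regex, the `Redactor` class, and the choice of a truncated SHA-256 digest are assumptions; only the `@TOKEN_xxxxxx@` placeholder shape comes from the summary.

```python
import hashlib
import re

# Hypothetical detector; a real tool would use many provider-specific patterns.
SECRET_RE = re.compile(r"\b(?:sk|ghp|xoxb)[-_][A-Za-z0-9_-]{16,}")

class Redactor:
    def __init__(self):
        self._vault = {}  # placeholder -> original secret, never leaves the machine

    @staticmethod
    def _placeholder(secret):
        # Deterministic: the same secret always maps to the same placeholder.
        digest = hashlib.sha256(secret.encode("utf-8")).hexdigest()[:6]
        return f"@TOKEN_{digest}@"

    def redact(self, text):
        """Replace every detected secret before the text goes upstream."""
        def swap(match):
            secret = match.group(0)
            placeholder = self._placeholder(secret)
            self._vault[placeholder] = secret
            return placeholder
        return SECRET_RE.sub(swap, text)

    def restore(self, text):
        """Put the original secrets back into text returned by the model."""
        for placeholder, secret in self._vault.items():
            text = text.replace(placeholder, secret)
        return text

original = "export KEY=sk-abcdefghijklmnopqrstuvwx now"
redactor = Redactor()
redacted = redactor.redact(original)
restored = redactor.restore(redacted)
```

A six-character digest keeps placeholders short but can collide across many secrets; a longer digest would be the safer choice in practice.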
This paper introduces On-Policy Diffusion Language Model (OPDLM), which transforms autoregressive models into diffusion language models via on-policy distillation, addressing distribution shifts. OPDLM achieves strong performance with 15x to 7,000x fewer training tokens across various tasks, positioning DLM transformation as a form of ARLM post-training.
OPDLM eliminates train-inference mismatch and retains knowledge from autoregressive models via on-policy distillation.
It requires 15x to 7,000x fewer training tokens compared to traditional methods.
The Piggyback Hypothesis proposes that chat-template tokens can piggyback finetuned behavior onto out-of-domain queries. The authors validate it with prefix perturbations and derive Token-Regularized Finetuning (TReFT), which mitigates emergent misalignment while preserving in-domain learning.
Piggyback Hypothesis: chat-template tokens cause LLM overgeneralization to unrelated domains.
Prefix perturbations restore alignment, supporting the hypothesis.
Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, causing a "stability lag" where early decisions remain fragile. Post-Training Quantization (PTQ) error easily flips these borderline decisions at the write frontier, locking them in. FAIR-Calib, a two-stage PTQ framework, probes a full-precision teacher for a position prior and performs off-policy layer-wise calibration with a reweighted hidden-state MSE, protecting fragile frontier states without expensive end-to-end rollouts. Theoretically justified as a surrogate for output KL divergence, FAIR-Calib outperforms baselines on LLaDA and Dream (W4A4), reducing frontier flips and post-commit mismatches.
Diffusion LLMs suffer from stability lag where early token decisions are fragile to quantization error
FAIR-Calib introduces a two-stage PTQ framework with frontier-aware instability reweighting
This article details the configuration and memory calibration required to run the Qwen 3.6 35B MoE model at a 450,000 token context window on a single 32GB VRAM GPU (NVIDIA RTX 5090) using llama.cpp with TurboQuant and YaRN scaling. It covers model selection, quantization trade-offs, KV cache quantization, RoPE scaling, multimodal setup, replication guide, VRAM lifecycle management, and performance evaluation.
Run Qwen3.6-35B-A3B-Q6_K on a single RTX 5090 with 450K context using llama.cpp TurboQuant fork and YaRN scaling.
Achieve 450K context by compressing KV cache to 3-bit (turbo3) and extending RoPE beyond native 262K with YaRN, but at cost of perplexity and retrieval accuracy.
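The arithmetic behind those two levers can be sketched as below. It assumes "262K" means 2**18 tokens, and the layer, head, and dimension values are placeholders rather than the real Qwen configuration; real quantized caches also carry per-block scale overhead that this ignores.

```python
NATIVE_CTX = 262_144   # assumption: "262K" read as 2**18 tokens
TARGET_CTX = 450_000

def context_scale_factor(target_ctx, native_ctx):
    """How far the RoPE range must be stretched beyond the native window."""
    return target_ctx / native_ctx

def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Approximate KV cache size: one K and one V tensor per layer."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return values * bits_per_value / 8

scale = context_scale_factor(TARGET_CTX, NATIVE_CTX)

# Placeholder dimensions, NOT the model's actual configuration.
dims = dict(n_layers=40, n_kv_heads=8, head_dim=128)
fp16_bytes = kv_cache_bytes(TARGET_CTX, bits_per_value=16, **dims)
q3_bytes = kv_cache_bytes(TARGET_CTX, bits_per_value=3, **dims)
compression = fp16_bytes / q3_bytes  # independent of the placeholder dims
```

Whatever the true dimensions are, moving from 16-bit to 3-bit values shrinks the cache by a factor of 16/3, which is what makes a 450K window plausible on 32GB.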
On June 3, 2026, Google introduced Gemma 4 12B Unified, an open-source multimodal model that understands text, images, audio, and video within a single architecture. It combines a 256K context window with a laptop-friendly design for agentic workflows and local deployment. This article covers its architecture, features, benchmarks, and practical guidance for developers.
Gemma 4 12B Unified is a mid-sized open-source multimodal model with an encoder-free design that projects image and audio directly into the LLM embedding space.
It supports 256K context, function calling, 35+ languages, speech recognition, video understanding, and can run locally via tools like Ollama.
NVIDIA Nemotron 3 Ultra, an open large language model with 550B total parameters and 55B active parameters, is now available on Amazon SageMaker JumpStart. It offers 5x faster inference and up to 30% lower cost for agentic AI workloads, with a hybrid Transformer-Mamba MoE architecture and million-token context window.
Nemotron 3 Ultra is now available for one-click deployment on SageMaker JumpStart
Delivers 5x faster inference and up to 30% lower cost for agentic workloads
A comprehensive guide to AI API pricing, covering six key decisions: what to meter, which pricing primitive (tokens, credits, outcomes), cost calculation, tier structure, hard vs. soft caps, and credit wallet design. Includes practical examples and a diagnostic prompt for your own pricing.
Six decisions in order: meter, primitive, per-unit price, tiers, cap type, wallet behavior.
Prefer outcome pricing if definable, then credits, then tokens as last resort.
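Two of these decisions, primitive choice and cap type, lend themselves to a short sketch. The `CreditWallet` class, its fields, and the overage handling are hypothetical illustrations of hard versus soft caps, not the guide's own design.

```python
from dataclasses import dataclass

# Preference order stated in the guide: outcomes, then credits, then tokens.
PREFERENCE = ["outcomes", "credits", "tokens"]

def choose_primitive(feasible):
    """Return the most preferred pricing primitive among the feasible ones."""
    for primitive in PREFERENCE:
        if primitive in feasible:
            return primitive
    raise ValueError("no feasible pricing primitive")

@dataclass
class CreditWallet:
    balance: float
    hard_cap: bool = True
    overage: float = 0.0  # credits consumed beyond the balance (soft cap only)

    def charge(self, credits):
        """Return True if the request may proceed."""
        if credits <= self.balance:
            self.balance -= credits
            return True
        if self.hard_cap:
            return False  # hard cap: reject and leave the balance untouched
        # Soft cap: drain the balance and record the rest as billable overage.
        self.overage += credits - self.balance
        self.balance = 0.0
        return True

hard = CreditWallet(balance=10.0, hard_cap=True)
first_ok = hard.charge(4.0)   # fits within the balance
second_ok = hard.charge(7.0)  # exceeds the remaining 6.0, so it is rejected

soft = CreditWallet(balance=10.0, hard_cap=False)
soft_ok = soft.charge(12.0)   # allowed, with 2.0 credits of overage
```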
Qwen 3.7 Plus is Alibaba's proprietary reasoning model released in June 2026, scoring 53 on the Artificial Analysis Intelligence Index, far above average. However, it is expensive, slow, and very verbose. The model supports text, image, and video input with a 1M-token context window.
Intelligence score of 53, well above the average of 23 for comparable models.
Priced at $0.40/M input tokens and $1.16/M output tokens, placing it in the expensive range.
Walmart is limiting employee use of its internal AI assistant Code Puppy by assigning fixed token allotments, citing high costs from the shift to pay-per-use LLM billing. The retailer aims to control expenses and encourage thoughtful AI usage.
Walmart limits Code Puppy usage and assigns fixed AI token allotments to control costs
LLM providers shift to pay-per-use, causing enterprise AI costs to surge
GitHub Copilot has adopted a usage-based pricing system using credits. Costs vary by model and tokens, with advanced models being more expensive. Users report high credit consumption even for simple tasks, and caution is needed with Auto mode.
New Copilot pricing uses credits based on tokens and model chosen.
Simple queries can consume many credits unexpectedly.
MiniMax officially released MiniMax M3 on June 1, 2026, featuring MiniMax Sparse Attention (MSA) for a 1M-token context window, native image/video input, and desktop computer operation. The API is live now.
M3 introduces MSA, achieving >9× prefill and >15× decoding speedup at 1M-token context versus M2, with 1/20th per-token compute.
Scores 59.0% on SWE-Bench Pro, surpassing GPT-5.5 and Gemini 3.1 Pro.
This post explores how combining Amazon FSx for Lustre, NVIDIA GPUDirect Storage, and sharded parallel loading reduces cold-start time-to-first-token for large language models from minutes to seconds, and how TurboQuant KV cache significantly increases context window size.
CPU-based model loading is a cold-start bottleneck, taking 10–20 minutes for a 405B model.
FSx for Lustre with GPUDirect Storage enables direct GPU HBM loading via EFA, bypassing CPU.
Chinese AI company MiniMax has released its new model M3. It's billed as the first open-weight model to combine top-tier coding performance, a one-million-token context window, and native multimodality.
MiniMax releases M3, the first open-weight model combining top coding performance, 1M-token context, and native multimodality.
The model challenges proprietary leaders in AI performance.
NVIDIA announced Nemotron 3 Ultra, a 550B-parameter open weights model with 55B active parameters, achieving the highest intelligence among US open weights models with a score of 48 on the AI Index, and serving over 300 tokens per second on DeepInfra.
Nemotron 3 Ultra is the largest and most intelligent US open weights model to date.
It scores 48 on the AI Index, surpassing other US models but trailing Chinese Kimi K2.6.
Chinese AI startup MiniMax released its flagship model M3, designed for coding agents and automated workflows. It processes up to 1M tokens, reduces computational costs by 20x, and outperforms OpenAI GPT-5.5 and Google Gemini 3.1 Pro on SWE-Bench Pro. The company is also preparing for a Shanghai IPO and has partnered with Ant Group's Alipay on AI payment infrastructure.
MiniMax unveils M3 with 1M-token context and 20x cost reduction.
M3 beats OpenAI GPT-5.5 and Google Gemini 3.1 Pro on SWE-Bench Pro.
At the 2026 China AIGC Industry Summit, Wang Xiaoye, Technical Director of AWS Product Technology, pointed out that 87% of enterprises claim to have deployed AI at scale, but only 10% have gained actual value. He emphasized the huge gap between personal and enterprise-level agent deployment, and proposed that enterprises need to focus on five layers: compute, models, data & knowledge, agentic platform, and applications. He also noted that token costs are often high because too much useless information is fed to the model.
87% of enterprises have deployed AI, but only 10% see value
Personal and enterprise agent deployment are fundamentally different
Headroom is an open-source context compression layer that reduces token consumption by 50-90% by compressing all content (tool outputs, logs, RAG chunks, files, conversation history) before it reaches the LLM. It offers multiple integration modes (library, proxy, agent wrap, MCP server), supports various AI agents (Claude Code, Codex, Cursor, etc.), and preserves answer accuracy on benchmarks. The community has saved over 60B tokens.
Compresses all AI agent context before LLM processing, cutting token costs by 50-90%.
Available as a Python/TypeScript library, proxy, agent wrapper, and MCP server; supports major coding agents.
Fluiq is an AI Ops platform covering security, optimization, observability, and evaluations. It offers a free tier for early signups and reviews, and detects threats with minimal code.
Fluiq is an AI Ops platform for security, optimization, observability, and evaluations.
Detects prompt injection, PII, and Crescendo attacks with just 2 lines of Python.
At the 2026 China AIGC Industry Summit, Wang Xiaoye, Technical Director of Amazon Web Services, pointed out that 87% of enterprises claim to have deployed AI at scale, but only 10% have gained real production value. He emphasized that enterprise-grade Agent deployment must bridge four major gaps: model selection, construction complexity, usage threshold, and talent shortage. He introduced AWS's five-layer architecture—compute, model, data, harness platform, and agent applications—and products like Quick to help enterprises move from demo to production.
87% of enterprises deploy AI, but only 10% gain production value.
Enterprise-grade agents differ vastly from personal ones, requiring solutions for security, stability, and trust.
Anthropic has released Claude Opus 4.8, an upgrade to Opus 4.7 with improvements in coding, agent work, reasoning, and knowledge work. New features include effort control, dynamic workflows, and live Messages API updates. Pricing remains unchanged at $5/$25 per million tokens for standard and $10/$50 for fast mode (2.5x speed). Early testers report cost parity with GPT-5.5 and fewer tool steps. The company also outlined its roadmap including Mythos-class models and Project Glasswing for cybersecurity.
Claude Opus 4.8 improves on Opus 4.7 in coding, agent work, reasoning, and knowledge work.
New features: effort control, dynamic workflows, and live Messages API updates.
Aryabhata 2 is a reasoning-focused language model for competitive STEM exams like JEE and NEET, fine-tuned via reinforcement learning on GPT-OSS-20B using PhysicsWallah's question banks. It achieves up to 64% fewer output tokens while outperforming the base model on multiple benchmarks.
Aryabhata 2 uses RL post-training optimized for competitive STEM exams.
Built on GPT-OSS-20B with custom training curriculum from PhysicsWallah.
This paper presents RightNow-Arabic-0.5B-Turbo, a 518M-parameter Arabic-specialized LLM built on Qwen2.5-0.5B using vocabulary injection and edge-first deployment. It achieves 35.9% mean accuracy on Arabic benchmarks, outperforming all same-class open models, and ties Falcon-H1-1.5B on COPA-ar at one-third the size. The quantized model is 398 MB and delivers 635 tokens/s on a single H100, enabling efficient edge deployment.
518M-parameter Arabic LLM built on Qwen2.5-0.5B with vocabulary injection of 27,032 Arabic tokens.
Achieves 35.9% mean accuracy on three Arabic benchmarks, surpassing all same-class open-source models.
This paper proposes COM, a strategy that integrates geometric constraints into token initialization and training to preserve the inherent continuity and ordinality of time series tokens, consistently improving the performance of token-based time series LLMs on multiple benchmarks.
Token-based time series LLMs overlook continuity and ordinality, limiting performance.
COM applies geometric constraints during initialization and training to preserve these properties.
llm-anthropic 0.25.1 adds support for Claude Opus 4.8, adds a fast mode option for eligible accounts, and changes the default max_tokens to each model's maximum output.
New model: Claude Opus 4.8 (claude-opus-4.8).
New -o fast 1 option for fast mode (for organizations with the feature enabled).
This article demonstrates how to implement a context pruning pipeline for long-running AI agents to manage conversational memory efficiently using semantic similarity. It covers using sentence transformer embedding models, computing similarities, and assembling a pruned context window.
Unbounded conversation history increases token costs and degrades reasoning in long-running agents.
A context pruning pipeline keeps the current prompt, most recent turn, and top-K semantically similar past turns.
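The selection rule above can be sketched in a few lines. To stay self-contained, this version substitutes a toy bag-of-words vector for the sentence-transformer embeddings the article uses; `embed`, `cosine`, and `prune_context` are illustrative names, not the article's code.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a sentence-transformer embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prune_context(history, prompt, top_k=2):
    """Keep the top_k most similar older turns, the latest turn, and the prompt."""
    if not history:
        return [prompt]
    older, latest = history[:-1], history[-1]
    query = embed(prompt)
    ranked = sorted(range(len(older)),
                    key=lambda i: cosine(embed(older[i]), query),
                    reverse=True)
    keep = sorted(ranked[:top_k])  # restore chronological order
    return [older[i] for i in keep] + [latest, prompt]

history = [
    "the database migration failed on the users table",
    "what is the weather today",
    "remember to buy milk",
    "rollback the migration for the users table",
    "thanks, that helped",
]
prompt = "why did the users table migration fail"
context = prune_context(history, prompt, top_k=2)
```

Here the two migration turns are retained and the weather and shopping turns are dropped, so the context stays bounded regardless of how long the session runs.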
Open Agent Tools (oats) is a self-hosted AI framework that lets local models, from small to large, use local source code for tool-calling, saving expensive large-model tokens by delegating tasks to smaller models.
oats allows local AI models to use local source code for tool-calling without HTTP or MCP.
It mines over 20,000 GitHub repos to create reusable prompt indices.
ICG is a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant cover images. It extracts semantic features via meta tokens, refines them with user embeddings, and injects personalized context into diffusion models. A multi-reward learning strategy combines public rewards with a personalized preference model, eliminating the need for labeled supervision. Experiments show improvements in image quality, semantic fidelity, and personalization, boosting user appeal and recommendation accuracy.
ICG integrates MLLM prompting with personalized preference alignment for end-to-end cover image generation.
Semantic features are extracted via meta tokens and refined with user embeddings for diffusion model injection.
At Databricks, we’ve built a unique inference platform that serves every frontier model, from open source to proprietary, powering some of the largest agentic applications. Serving over 120T tokens per month, we tackle challenges of reliability and latency through abstractions like model units for capacity management, cost-aware load balancing and autoscaling that save over 80% GPU costs, and runtime reliability mechanisms including black-box health checks that detect silent failures. Profiling multimodal bottlenecks unlocked 3x throughput gains.
Databricks' inference platform serves frontier models including open source and proprietary, handling 120T tokens/month.
Model units provide a VM-like abstraction for capacity management, enabling cost-aware routing and scaling.
Mr. Guy Invests is a free, beginner-friendly stock research and portfolio tracker that leverages public SEC filings to track hedge fund and insider activity, offers an AI stock tutor, a $100K virtual trading challenge, daily market briefs, and more. Free tier has daily limits; Pro is $4.99/month for unlimited access.
Uses SEC Form 13F and Form 4 data to show what hedge funds and insiders are buying.
AI Stock Tutor answers questions in plain English, avoiding financial jargon.
One month after DeepSeek V4's release, the open-source community unveiled Reasonix, a tool specifically designed to minimize API costs by maximizing cache efficiency. It achieves a staggering 99.82% cache hit rate, reducing a $61 bill for 400M+ tokens to just $12.
Reasonix is a dedicated coding harness for DeepSeek, focusing on cost reduction.
Its cache-first loop, tool-call repair, and automatic context compression maintain over 90% cache hit rate in long sessions.
This study challenges the assumption that high benchmark scores reflect true visual understanding in vision-language models (VLMs). By removing a large fraction of image tokens with minimal performance drop, the authors reveal a mismatch between accuracy and visual grounding. Through multi-level analyses including global degradation, localized occlusion, question reformulation, answer-space expansion, decision-level analysis, and layer-wise vision-token geometry, they find that models are less sensitive to fine-grained visual evidence than expected, and that visual tokens become more similar in deeper layers. The results indicate that current benchmarks are insufficient for evaluating fine-grained visual grounding in VLMs.
Removing many image tokens only slightly degrades VLM performance, questioning benchmark reliance on vision.
Models incorporate visual input but are insensitive to loss of fine-grained visual evidence.
NVIDIA's Gated DeltaNet-2 is a linear attention layer that decouples memory erasing and writing into channel-wise gates. Trained at 1.3B parameters on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 in language modeling, commonsense reasoning, and long-context retrieval, with the largest gains on RULER benchmarks.
Gated DeltaNet-2 decomposes the scalar gate into a channel-wise erase gate (key axis) and write gate (value axis), enabling independent control of erasing old content and writing new content.
At 1.3B parameters trained on 100B FineWeb-Edu tokens, it achieves best average performance across benchmarks compared to baselines.
DeepSeek is making the 75 percent discount on its top model V4-Pro permanent. At $0.435 per million input tokens, it's at least 11.5 times cheaper than GPT-5.5 on input and over 34 times cheaper on output. For token-hungry agentic systems, this kind of pricing could squeeze Western providers hard.
DeepSeek's 75% discount on V4-Pro is now permanent.
Input token price is $0.435 per million, 11.5x cheaper than GPT-5.5.
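A quick way to sanity-check these figures is to work backwards from them. The sketch below uses only the numbers quoted above; the implied GPT-5.5 input price is derived from the stated 11.5x ratio and is not an official list price, and the 400M-token workload is a made-up example.

```python
DISCOUNTED_INPUT = 0.435   # USD per million input tokens, from the summary
DISCOUNT = 0.75            # the 75 percent discount
MIN_INPUT_RATIO = 11.5     # "at least 11.5 times cheaper"

def price_before_discount(discounted, discount):
    """List price implied by a discounted price and a discount fraction."""
    return discounted / (1 - discount)

def implied_competitor_price(price, ratio):
    """Competitor price implied by a 'ratio times cheaper' claim."""
    return price * ratio

def cost_usd(tokens, price_per_million):
    return tokens / 1_000_000 * price_per_million

original_input = price_before_discount(DISCOUNTED_INPUT, DISCOUNT)       # about 1.74
implied_gpt = implied_competitor_price(DISCOUNTED_INPUT, MIN_INPUT_RATIO)  # about 5.00

# Hypothetical agentic workload of 400M input tokens.
workload_cost = cost_usd(400_000_000, DISCOUNTED_INPUT)
```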
DeepSeek announced permanent price cuts for its V4-Pro API. Meanwhile, CATL, JD, and NetEase are in talks to invest in DeepSeek's first external funding round. Founder Liang Wenfeng emphasizes prioritizing AGI research and maintaining open-source principles.
DeepSeek V4-Pro API permanently reduced to one-quarter of original price
CATL, JD, and NetEase among companies negotiating investment in DeepSeek
NVIDIA introduces Nemotron-Labs Diffusion language models that achieve up to 6.4x faster inference than autoregressive models while maintaining high accuracy by generating tokens in parallel and refining them iteratively. The models support three modes: autoregressive, diffusion, and self-speculation. The 8B model outperforms Qwen3 8B by 1.2% accuracy.
Nemotron-Labs Diffusion models offer three generation modes: autoregressive, diffusion, and self-speculation.
The 8B model achieves 2.6x TPF in diffusion mode and up to 6.4x with self-speculation.
This paper proposes the Ablate-to-Validate diagnostic principle and its instantiation, the Token Replacement Test (TRT), to determine whether vision-language models (VLMs) genuinely use continuous latent tokens for reasoning. Experiments show that VLMs retain most performance gains even when token content is corrupted or replaced, indicating that accuracy improvements are a misleading proxy for latent-token reasoning.
Introduces the Ablate-to-Validate principle and the Token Replacement Test (TRT) to diagnose actual use of continuous thought tokens.
Experiments reveal VLMs retain performance gains even after token content corruption, suggesting gains are not due to reasoning with tokens.
Alibaba's Qwen team announced Qwen3.7-Max, their most advanced agent model, featuring a 1M-token context window, extended-thinking mode, and strong benchmark scores (56.6 on the Artificial Analysis Intelligence Index, 5th overall). The model excels at coding, debugging, and long-horizon autonomous tasks but has caveats like reduced factual recall on AA-Omniscience and no independent verification of long-context reliability.
Qwen3.7-Max offers a 1M-token context window and extended-thinking mode for complex multi-step tasks.
It scored 56.6 on the Artificial Analysis Intelligence Index, ranking fifth among all models.