AI News Hub

Model Pricing updates

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Medical LVLMs are prone to factual inconsistencies and poor visual grounding. Existing alignment methods have three key limitations in the medical domain: sequence-level rewards weight clinically critical tokens the same as every other token, reliance on static SFT references causes off-policy shift, and alignment lacks visual grounding constraints. The proposed method uses a bidirectional token-wise KL regularizer and a visual-contrastive grounding objective, forming a fine-grained on-policy alignment framework that constructs preference pairs by minimally editing model outputs. Experiments validate its effectiveness.

  • Existing preference optimization methods in medicine suffer from sequence-level rewards, off-policy shift, and lack of visual grounding.
  • Proposed method combines bidirectional token-wise KL regularizer and visual-contrastive grounding.

OpenAI vs. Anthropic: A price war over API tokens is brewing

OpenAI is considering cutting API token prices to win customers from Anthropic, according to the Wall Street Journal, signaling a potential price war in the AI industry.

  • OpenAI plans to lower token prices to attract Anthropic's customers
  • The move could trigger a broader price war in AI APIs

Meet ‘North Mini Code’: Cohere’s 30B Open-Weight Mixture-of-Experts Model With 3B Active Parameters for Agentic Coding

Cohere has released its first developer-facing coding model, North Mini Code, a 30B total parameter mixture-of-experts model with only 3B active parameters per token. It runs on a single H100 GPU, supports 256K context length, and is optimized for code generation, agentic software engineering, and terminal tasks. The weights are open under Apache 2.0.

  • North Mini Code is Cohere’s first coding model, 30B total parameters with 3B active, supporting 256K context and 64K max output.
  • Runs on a single H100 at FP8; weights open under Apache 2.0 via Hugging Face, Cohere API, and more.

DiffusionGemma: Google's Open-Source High-Speed Text Generation Model

Google has released DiffusionGemma, a new open-weight model under Apache 2 license, available for free via NVIDIA's NIM cloud API. It delivers impressive generation speeds exceeding 500 tokens per second.

  • Google releases open-source DiffusionGemma model under Apache 2 license.
  • Free hosting on NVIDIA NIM cloud API.

Google's new open model DiffusionGemma generates text from noise instead of word by word

Google released DiffusionGemma, a 26-billion-parameter model that generates text via diffusion, achieving 1,000 tokens per second on an H100 GPU—four times faster than autoregressive models, but with lower quality. It's currently experimental.

  • 26-billion-parameter diffusion model for text generation
  • Reaches 1,000 tokens/sec on a single H100 GPU

Google AI Releases DiffusionGemma, a 26B MoE Open Model Using Text Diffusion for Up to 4x Faster Generation

DiffusionGemma is Google DeepMind's experimental open text generation model that uses text diffusion instead of standard autoregressive decoding, achieving up to 4x faster generation on dedicated GPUs. The 26B MoE model (3.8B active parameters) is built on the Gemma 4 backbone, supports multimodal inputs (text, image, video), has a 256K context window, covers 140+ languages, and is released under Apache 2.0.

  • DiffusionGemma is a 26B Mixture of Experts (MoE) model with 3.8B active parameters that generates text in parallel via diffusion, not token-by-token.
  • It achieves 1000+ tokens/s on a single NVIDIA H100 and 700+ tokens/s on an RTX 5090, fitting in 18GB VRAM when quantized.

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

Google DeepMind released DiffusionGemma, an experimental open model for fast text generation using parallel token generation. NVIDIA optimized it to run faster on GeForce RTX, RTX PRO, and DGX Spark systems, achieving up to 1000 tokens/sec locally.

  • DiffusionGemma generates up to 256 tokens in parallel per step, unlike traditional autoregressive models. It is based on Gemma 4 (26B parameters, MoE) and activates only 3.8B parameters per step. It is up to 4x faster, open source under Apache 2.0, and runs locally with no cloud dependency.

Initial impressions of Claude Fable 5

Anthropic releases Claude Fable 5 and Mythos 5, with Fable 5 offering the same performance as Mythos 5 but with stricter safety guardrails. The models feature a 1 million token context window, 128k output tokens, and pricing double that of Opus 4.8. Simon Willison spent 5.5 hours testing Fable 5 and found it to be a 'beast'—knowledgeable and capable, but slow and expensive. Fable 5 successfully upgraded micropython-wasm to full Python, implemented pause-resume for tool calls in Datasette Agent and the LLM library, and consumed $110.42 in tokens in a single day.

  • Claude Fable 5 is Anthropic's new flagship model with same capabilities as Mythos 5 but stricter safety. It has a 1M context window, 128k output, and costs double Opus 4.8.
  • Fable 5 demonstrated deep knowledge by listing Simon Willison's open source projects in detail, correcting a typo in the prompt.

Anthropic releases its first Mythos-class model Claude Fable 5

Anthropic announced Claude Fable 5, claiming it is the most powerful AI model it has widely released, with exceptional performance in software engineering, knowledge work, and vision. It marks the first broad release from the Mythos class, previously deemed too dangerous due to cybersecurity capabilities. New safeguards block responses in high-risk areas, falling back to Claude Opus 4.8 when necessary. Anthropic also launched Claude Mythos 5, available only in a limited trusted-access program. Pricing is $10 per million input tokens and $50 per million output tokens.

  • Claude Fable 5 is Anthropic's most powerful widely available AI model, excelling in long and complex tasks.
  • It is the first public release from the Mythos class, previously restricted due to cybersecurity risks.
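The per-token prices quoted above can be turned into a per-request estimate. The sketch below is illustrative only: the dictionary key `claude-fable-5` is a hypothetical identifier, the Opus 4.8 figures are the standard-mode prices quoted in the Opus 4.8 item elsewhere on this page, and the calculation assumes flat rates with no caching or batch discounts.

```python
# Prices in USD per million tokens (input, output), as quoted on this page.
PRICES = {
    "claude-fable-5": (10.00, 50.00),   # hypothetical key; prices from the item above
    "claude-opus-4.8": (5.00, 25.00),   # standard-mode prices from the Opus 4.8 item
}

def request_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one request at flat per-token rates."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example request: 200K input tokens and 8K output tokens.
fable = request_cost("claude-fable-5", 200_000, 8_000)
opus = request_cost("claude-opus-4.8", 200_000, 8_000)
print(f"Fable 5: ${fable:.2f}  Opus 4.8: ${opus:.2f}  ratio: {fable / opus:.1f}x")
```

Because both the input and output prices are exactly double, the ratio is 2x regardless of how a request splits between input and output tokens.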

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

Researchers identify a 'concept bottleneck' in CoCoNuT latent reasoning where intermediate hidden states are overwritten, causing performance loss. They propose AGCLR with a persistent gated memory (write, read, forget gates) that consistently improves performance on GSM8K, HotpotQA, and ProsQA using GPT-2, with the gap widening as curriculum depth increases.

  • CoCoNuT suffers from a concept bottleneck: intermediate states are overwritten, losing early facts, and performance degrades with depth
  • AGCLR adds a Gated Concept Stream with write, read, and forget gates for persistent memory

AI-noleak – Local secret proxy for AI CLIs

AI-noleak is a local reverse proxy that intercepts accidentally exposed secrets (API keys, tokens) from AI coding agents and replaces them with deterministic placeholders before they reach the upstream AI model. It operates via three layers (PTY wrapper, HTTP proxy, file watcher) without requiring TLS MITM or root CA certificates, ensuring local security isolation.

  • Three layers of protection: PTY input, HTTP transport, file storage. No TLS MITM needed.
  • Secrets are replaced with placeholders (@TOKEN_xxxxxx@); AI models only see placeholders, which are reversibly restored locally.
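The placeholder idea can be sketched in a few lines. This is not AI-noleak's implementation: the regular expression, the function names, and the choice of a truncated SHA-256 digest are all assumptions made for illustration. It only shows how a deterministic, locally reversible substitution in the `@TOKEN_xxxxxx@` shape might work.

```python
import hashlib
import re

# Hypothetical pattern: strings shaped like a few common API key prefixes.
SECRET_PATTERN = re.compile(r"\b(?:sk|ghp|AKIA)[A-Za-z0-9_-]{16,}\b")

def placeholder_for(secret):
    """Derive a stable placeholder so the same secret always maps to the same token."""
    digest = hashlib.sha256(secret.encode("utf-8")).hexdigest()[:6]
    return f"@TOKEN_{digest}@"

def redact(text, vault):
    """Replace secrets with placeholders and record the mapping in a local vault."""
    def swap(match):
        secret = match.group(0)
        token = placeholder_for(secret)
        vault[token] = secret
        return token
    return SECRET_PATTERN.sub(swap, text)

def restore(text, vault):
    """Put the original secrets back; only possible where the vault lives."""
    for token, secret in vault.items():
        text = text.replace(token, secret)
    return text

vault = {}
original = "export OPENAI_API_KEY=sk-test1234567890abcdefXYZ"
redacted = redact(original, vault)
print(redacted)
print(restore(redacted, vault) == original)
```

A production tool would likely prefer a keyed hash or a random mapping, since a plain truncated digest reveals a little about the secret; the sketch keeps the digest only to make the determinism easy to see.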

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

This paper introduces On-Policy Diffusion Language Model (OPDLM), which transforms autoregressive models into diffusion language models via on-policy distillation, addressing distribution shifts. OPDLM achieves strong performance with 15x to 7,000x fewer training tokens across various tasks, positioning DLM transformation as a form of ARLM post-training.

  • OPDLM eliminates train-inference mismatch and retains knowledge from autoregressive models via on-policy distillation.
  • It requires 15x to 7,000x fewer training tokens compared to traditional methods.

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

The Piggyback Hypothesis proposes that chat-template tokens can piggyback finetuned behavior onto out-of-domain queries. The hypothesis is validated via prefix perturbations and motivates Token-Regularized Finetuning (TReFT), which mitigates emergent misalignment while preserving in-domain learning.

  • Piggyback Hypothesis: chat-template tokens cause LLM overgeneralization to unrelated domains.
  • Prefix perturbations restore alignment, supporting the hypothesis.

FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models

Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, causing a "stability lag" where early decisions remain fragile. Post-Training Quantization (PTQ) error easily flips these borderline decisions at the write frontier, locking them in. FAIR-Calib, a two-stage PTQ framework, probes a full-precision teacher for a position prior and performs off-policy layer-wise calibration with a reweighted hidden-state MSE, protecting fragile frontier states without expensive end-to-end rollouts. Theoretically justified as a surrogate for output KL divergence, FAIR-Calib outperforms baselines on LLaDA and Dream (W4A4), reducing frontier flips and post-commit mismatches.

  • Diffusion LLMs suffer from stability lag where early token decisions are fragile to quantization error
  • FAIR-Calib introduces a two-stage PTQ framework with frontier-aware instability reweighting

Show HN: Best setup local LLM found for a 5090 (llama.cpp fork + turboquant)

This article details the configuration and memory calibration required to run the Qwen 3.6 35B MoE model at a 450,000 token context window on a single 32GB VRAM GPU (NVIDIA RTX 5090) using llama.cpp with TurboQuant and YaRN scaling. It covers model selection, quantization trade-offs, KV cache quantization, RoPE scaling, multimodal setup, replication guide, VRAM lifecycle management, and performance evaluation.

  • Run Qwen3.6-35B-A3B-Q6_K on a single RTX 5090 with 450K context using llama.cpp TurboQuant fork and YaRN scaling.
  • Achieve 450K context by compressing the KV cache to 3-bit (turbo3) and extending RoPE beyond the native 262K with YaRN, at the cost of perplexity and retrieval accuracy.

Google Gemma 4 12B: Architecture, Benchmarks, Access, and Hands-on Guide for Developers

On June 3, 2026, Google introduced Gemma 4 12B Unified, an open-source multimodal model that understands text, images, audio, and video within a single architecture. It combines a 256K context window with a laptop-friendly design for agentic workflows and local deployment. This article covers its architecture, features, benchmarks, and practical guidance for developers.

  • Gemma 4 12B Unified is a mid-sized open-source multimodal model with an encoder-free design that projects image and audio directly into the LLM embedding space.
  • It supports 256K context, function calling, 35+ languages, speech recognition, video understanding, and can run locally via tools like Ollama.

NVIDIA Nemotron 3 Ultra now available on Amazon SageMaker JumpStart

NVIDIA Nemotron 3 Ultra, an open large language model with 550B total parameters and 55B active parameters, is now available on Amazon SageMaker JumpStart. It offers 5x faster inference and up to 30% lower cost for agentic AI workloads, with a hybrid Transformer-Mamba MoE architecture and million-token context window.

  • Nemotron 3 Ultra is now available for one-click deployment on SageMaker JumpStart
  • Delivers 5x faster inference and up to 30% lower cost for agentic workloads

How to design pricing for AI APIs and LLM-powered products

A comprehensive guide to AI API pricing, covering six key decisions: what to meter, which pricing primitive (tokens, credits, outcomes), cost calculation, tier structure, hard vs. soft caps, and credit wallet design. Includes practical examples and a diagnostic prompt for your own pricing.

  • Six decisions in order: meter, primitive, per-unit price, tiers, cap type, wallet behavior.
  • Prefer outcome pricing if definable, then credits, then tokens as last resort.
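The difference between a hard and a soft cap on a credit wallet can be illustrated with a small sketch. The class and field names are hypothetical and not taken from the guide; the assumption here is that a hard cap rejects a request once credits run out, while a soft cap lets it through and records the overage for later billing.

```python
from dataclasses import dataclass

@dataclass
class CreditWallet:
    balance: float          # credits remaining in the current period
    hard_cap: bool = True   # True: reject when short; False: allow and track overage
    overage: float = 0.0    # credits consumed beyond the balance (soft cap only)

    def charge(self, credits):
        """Deduct credits and return True if the request may proceed."""
        if credits <= self.balance:
            self.balance -= credits
            return True
        if self.hard_cap:
            return False
        # Soft cap: drain the balance and record the remainder as overage.
        self.overage += credits - self.balance
        self.balance = 0.0
        return True

hard = CreditWallet(balance=10)
soft = CreditWallet(balance=10, hard_cap=False)
for wallet in (hard, soft):
    wallet.charge(8)
print(hard.charge(5), hard.balance)                 # rejected, balance untouched
print(soft.charge(5), soft.balance, soft.overage)   # allowed, overage recorded
```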

Qwen 3.7 Plus: Alibaba's High-Intelligence but Expensive and Slow Model

Qwen 3.7 Plus is Alibaba's proprietary reasoning model released in June 2026, scoring 53 on the Artificial Analysis Intelligence Index, far above average. However, it is expensive, slow, and very verbose. The model supports text, image, and video input with a 1M-token context window.

  • Intelligence score of 53, well above the average of 23 for comparable models.
  • Priced at $0.40/M input tokens and $1.16/M output tokens, placing it in the expensive range.

Walmart’s AI workflows meet the realities of the balance sheet

Walmart is limiting employee use of its internal AI assistant Code Puppy by assigning fixed tokens due to high costs from the shift to pay-per-use LLM billing. The retailer aims to control expenses and encourage thoughtful AI usage.

  • Walmart limits Code Puppy usage and assigns fixed AI tokens to control costs
  • LLM providers shift to pay-per-use, causing enterprise AI costs to surge

AI costs how much? GitHub Copilot users react to new usage-based pricing system

GitHub Copilot has adopted a usage-based pricing system using credits. Costs vary by model and tokens, with advanced models being more expensive. Users report high credit consumption even for simple tasks, and caution is needed with Auto mode.

  • New Copilot pricing uses credits based on tokens and model chosen.
  • Simple queries can consume many credits unexpectedly.

MiniMax Releases MiniMax M3 with MSA Architecture Supporting 1M-Token Context, Native Multimodality, and Agentic Coding

MiniMax officially released MiniMax M3 on June 1, 2026, featuring MiniMax Sparse Attention (MSA) for a 1M-token context window, native image/video input, and desktop computer operation. The API is live now.

  • M3 introduces MSA, achieving >9× prefill and >15× decoding speedup at 1M-token context versus M2, with 1/20th per-token compute.
  • Scores 59.0% on SWE-Bench Pro, surpassing GPT-5.5 and Gemini 3.1 Pro.

Accelerate LLM model loading and increase context windows with GPUDirect on Amazon FSx for Lustre and TurboQuant

This post explores how combining Amazon FSx for Lustre, NVIDIA GPUDirect Storage, and sharded parallel loading reduces cold-start time-to-first-token for large language models from minutes to seconds, and how TurboQuant KV cache significantly increases context window size.

  • CPU-based model loading is a cold-start bottleneck, taking 10–20 minutes for a 405B model.
  • FSx for Lustre with GPUDirect Storage enables direct GPU HBM loading via EFA, bypassing CPU.

MiniMax M3: Open-weight model with a million-token context challenges proprietary leaders

Chinese AI company MiniMax has released its new model M3. It's billed as the first open-weight model to combine top-tier coding performance, a one-million-token context window, and native multimodality.

  • MiniMax releases M3, the first open-weight model combining top coding performance, 1M-token context, and native multimodality.
  • The model challenges proprietary leaders in AI performance.

Nemotron 3 Ultra: high-speed, leading US open weights intelligence

NVIDIA announced Nemotron 3 Ultra, a 550B-parameter open weights model with 55B active parameters, achieving the highest intelligence among US open weights models with a score of 48 on the AI Index, and serving over 300 tokens per second on DeepInfra.

  • Nemotron 3 Ultra is the largest and most intelligent US open weights model to date.
  • It scores 48 on the AI Index, surpassing other US models but trailing Chinese Kimi K2.6.

MiniMax debuts AI model built for long and complex coding tasks

Chinese AI startup MiniMax released its flagship model M3, designed for coding agents and automated workflows. It processes up to 1M tokens, reduces computational costs by 20x, and outperforms OpenAI GPT-5.5 and Google Gemini on SWE-Bench Pro. The company also prepares for a Shanghai IPO and partners with Ant Group's Alipay for AI payment infrastructure.

  • MiniMax unveils M3 with 1M-token context and 20x cost reduction.
  • M3 beats OpenAI GPT-5.5 and Google Gemini 3.1 Pro on SWE-Bench Pro.

Tokens Are Expensive Because You Feed the Model Too Much Junk | @Wang Xiaoye from AWS AIGC2026

At the 2026 China AIGC Industry Summit, Wang Xiaoye, Technical Director of AWS Product Technology, pointed out that 87% of enterprises claim to have deployed AI at scale, but only 10% have gained actual value. He emphasized the huge gap between personal and enterprise-level agent deployment, and proposed that enterprises need to focus on five layers: compute, models, data & knowledge, agentic platform, and applications. He also noted that token costs are often high because too much useless information is fed to the model.

  • 87% of enterprises have deployed AI, but only 10% see value
  • Personal and enterprise agent deployment are fundamentally different

Headroom compresses everything your AI agent reads before it reaches the LLM

Headroom is an open-source context compression layer that reduces token consumption by 50-90% by compressing all content (tool outputs, logs, RAG chunks, files, conversation history) before it reaches the LLM. It offers multiple integration modes (library, proxy, agent wrap, MCP server), supports various AI agents (Claude Code, Codex, Cursor, etc.), and preserves answer accuracy on benchmarks. The community has saved over 60B tokens.

  • Compresses all AI agent context before LLM processing, cutting token costs by 50-90%.
  • Available as a Python/TypeScript library, proxy, agent wrapper, and MCP server; supports major coding agents.

Show HN: Fluiq – detect prompt injection, PII, Crescendo attack with 2 lines of Python

Fluiq is an AI Ops platform covering security, optimization, observability, and evaluations. It offers a free tier for early signups and reviews, and detects threats with minimal code.

  • Fluiq is an AI Ops platform for security, optimization, observability, and evaluations.
  • Detects prompt injection, PII, and Crescendo attacks with just 2 lines of Python.

Tokens Are Expensive Because You Feed the Model Too Much Junk | @Wang Xiaoye at AIGC2026

At the 2026 China AIGC Industry Summit, Wang Xiaoye, Technical Director of Amazon Web Services, pointed out that 87% of enterprises claim to have deployed AI at scale, but only 10% have gained real production value. He emphasized that enterprise-grade Agent deployment must bridge four major gaps: model selection, construction complexity, usage threshold, and talent shortage. He introduced AWS's five-layer architecture—compute, model, data, harness platform, and agent applications—and products like Quick to help enterprises move from demo to production.

  • 87% of enterprises deploy AI, but only 10% gain production value.
  • Enterprise-grade agents differ vastly from personal ones, requiring solutions for security, stability, and trust.

Compare AI Model Pricing Across 9 Providers (385 Models)

A new tool allows comparing prices for 385 AI models across 9 providers, helping users find the cheapest option.

  • Compare 385 AI models across 9 platforms
  • Supports SilkDock, OpenRouter, Together AI, and more

3000 tokens/sec LLM playground

A high-speed LLM playground achieving 3000 tokens per second, served through an Open WebUI interface.

  • 3000 tokens per second throughput
  • Open WebUI interface

Anthropic releases Claude Opus 4.8

Anthropic has released Claude Opus 4.8, an upgrade to Opus 4.7 with improvements in coding, agent work, reasoning, and knowledge work. New features include effort control, dynamic workflows, and live Messages API updates. Pricing remains unchanged at $5/$25 per million tokens for standard and $10/$50 for fast mode (2.5x speed). Early testers report cost parity with GPT-5.5 and fewer tool steps. The company also outlined its roadmap including Mythos-class models and Project Glasswing for cybersecurity.

  • Claude Opus 4.8 improves on Opus 4.7 in coding, agent work, reasoning, and knowledge work.
  • New features: effort control, dynamic workflows, and live Messages API updates.

Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

Aryabhata 2 is a reasoning-focused language model for competitive STEM exams like JEE and NEET, fine-tuned via reinforcement learning on GPT-OSS-20B using PhysicsWallah's question banks. It achieves up to 64% fewer output tokens while outperforming the base model on multiple benchmarks.

  • Aryabhata 2 uses RL post-training optimized for competitive STEM exams.
  • Built on GPT-OSS-20B with custom training curriculum from PhysicsWallah.

RightNow-Arabic-0.5B-Turbo: An Open Sub-1B Arabic Language Model via Vocabulary Injection and Edge-First Deployment

This paper presents RightNow-Arabic-0.5B-Turbo, a 518M-parameter Arabic-specialized LLM built on Qwen2.5-0.5B using vocabulary injection and edge-first deployment. It achieves 35.9% mean accuracy on Arabic benchmarks, outperforming all same-class open models, and ties Falcon-H1-1.5B on COPA-ar at one-third the size. The quantized model is 398 MB and delivers 635 tokens/s on a single H100, enabling efficient edge deployment.

  • 518M-parameter Arabic LLM built on Qwen2.5-0.5B with vocabulary injection of 27,032 Arabic tokens.
  • Achieves 35.9% mean accuracy on three Arabic benchmarks, surpassing all same-class open-source models.

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

This paper proposes COM, a strategy that integrates geometric constraints into token initialization and training to preserve the inherent continuity and ordinality of time series tokens, consistently improving the performance of token-based time series LLMs on multiple benchmarks.

  • Token-based time series LLMs overlook continuity and ordinality, limiting performance.
  • COM applies geometric constraints during initialization and training to preserve these properties.

llm-anthropic 0.25.1

Release of llm-anthropic 0.25.1 adds support for Claude Opus 4.8, fast mode option for eligible accounts, and changes default max_tokens to each model's maximum output.

  • New model: Claude Opus 4.8 (claude-opus-4.8).
  • New -o fast 1 option for fast mode (for organizations with the feature enabled).

Building a Context Pruning Pipeline for Long-Running Agents

This article demonstrates how to implement a context pruning pipeline for long-running AI agents to manage conversational memory efficiently using semantic similarity. It covers using sentence transformer embedding models, computing similarities, and assembling a pruned context window.

  • Unbounded conversation history increases token costs and degrades reasoning in long-running agents.
  • A context pruning pipeline keeps the current prompt, most recent turn, and top-K semantically similar past turns.
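The selection rule described above (current prompt, most recent turn, top-K similar past turns) can be sketched as follows. This is not the article's code: the `embed` function is a toy bag-of-words stand-in for a sentence-transformer model, and the function names and the `top_k` default are assumptions.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a sentence-transformer model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def prune_context(history, prompt, top_k=2):
    """Keep the latest turn plus the top_k older turns most similar to the prompt."""
    if not history:
        return [prompt]
    older, latest = history[:-1], history[-1]
    query = embed(prompt)
    ranked = sorted(range(len(older)),
                    key=lambda i: cosine(embed(older[i]), query),
                    reverse=True)
    keep = sorted(ranked[:top_k])  # restore chronological order
    return [older[i] for i in keep] + [latest, prompt]

history = [
    "user: set up the postgres database schema",
    "user: recommend a good pizza place",
    "user: add an index to the postgres users table",
    "user: tell me a joke",
    "user: thanks",
]
pruned = prune_context(history, "user: why is the postgres index slow")
for turn in pruned:
    print(turn)
```

With a real embedding model, only `embed` and `cosine` would change; the keep-latest-plus-top-K assembly stays the same.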

Show HN: Local Coding Agent with LLMs to Delegate Tool Calls to Small AI Models

Open Agent Tools (oats) is a self-hosted AI framework that enables small-to-large local models to use local source code for tool-calling, freeing up expensive large model tokens by delegating tasks to smaller models.

  • oats allows local AI models to use local source code for tool-calling without HTTP or MCP.
  • It mines over 20,000 GitHub repos to create reusable prompt indices.

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

ICG is a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant cover images. It extracts semantic features via meta tokens, refines them with user embeddings, and injects personalized context into diffusion models. A multi-reward learning strategy combines public rewards with a personalized preference model, eliminating the need for labeled supervision. Experiments show improvements in image quality, semantic fidelity, and personalization, boosting user appeal and recommendation accuracy.

  • ICG integrates MLLM prompting with personalized preference alignment for end-to-end cover image generation.
  • Semantic features are extracted via meta tokens and refined with user embeddings for diffusion model injection.

Reliable LLM Inference at Scale

At Databricks, we’ve built a unique inference platform that serves every frontier model, from open source to proprietary, powering some of the largest agentic applications. Serving over 120T tokens per month, we tackle challenges of reliability and latency through abstractions like model units for capacity management, cost-aware load balancing and autoscaling that save over 80% GPU costs, and runtime reliability mechanisms including black-box health checks that detect silent failures. Profiling multimodal bottlenecks unlocked 3x throughput gains.

  • Databricks' inference platform serves frontier models including open source and proprietary, handling 120T tokens/month.
  • Model units provide a VM-like abstraction for capacity management, enabling cost-aware routing and scaling.

I built a free stock research tool for beginners using SEC data and AI

Mr. Guy Invests is a free, beginner-friendly stock research and portfolio tracker that leverages public SEC filings to track hedge fund and insider activity, offers an AI stock tutor, a $100K virtual trading challenge, daily market briefs, and more. Free tier has daily limits; Pro is $4.99/month for unlimited access.

  • Uses SEC Form 13F and Form 4 data to show what hedge funds and insiders are buying.
  • AI Stock Tutor answers questions in plain English, avoiding financial jargon.

DeepSeek V4 Gets Even Cheaper: New Tool Boasts 99.82% Cache Hit Rate, Slashes Bills to 20%

One month after DeepSeek V4's release, the open-source community unveiled Reasonix, a tool specifically designed to minimize API costs by maximizing cache efficiency. It achieves a staggering 99.82% cache hit rate, reducing a $61 bill for 400M+ tokens to just $12.

  • Reasonix is a dedicated coding harness for DeepSeek, focusing on cost reduction.
  • Its cache-first loop, tool-call repair, and automatic context compression maintain over 90% cache hit rate in long sessions.

Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

This study challenges the assumption that high benchmark scores reflect true visual understanding in vision-language models (VLMs). By removing a large fraction of image tokens with minimal performance drop, the authors reveal a mismatch between accuracy and visual grounding. Through multi-level analyses including global degradation, localized occlusion, question reformulation, answer-space expansion, decision-level analysis, and layer-wise vision-token geometry, they find that models are less sensitive to fine-grained visual evidence than expected, and that visual tokens become more similar in deeper layers. The results indicate that current benchmarks are insufficient for evaluating fine-grained visual grounding in VLMs.

  • Removing many image tokens only slightly degrades VLM performance, questioning benchmark reliance on vision.
  • Models incorporate visual input but are insensitive to loss of fine-grained visual evidence.

NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule

NVIDIA's Gated DeltaNet-2 is a linear attention layer that decouples memory erasing and writing into channel-wise gates. Trained at 1.3B parameters on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 in language modeling, commonsense reasoning, and long-context retrieval, with the largest gains on RULER benchmarks.

  • Gated DeltaNet-2 decomposes the scalar gate into a channel-wise erase gate (key axis) and write gate (value axis), enabling independent control of erasing old content and writing new content.
  • At 1.3B parameters trained on 100B FineWeb-Edu tokens, it achieves best average performance across benchmarks compared to baselines.

Deepseek makes its 75 percent discount permanent, pricing output tokens at least 34x below GPT-5.5

Deepseek is making the 75 percent discount on its top model V4-Pro permanent. At $0.435 per million input tokens, it's at least 11.5 times cheaper than GPT-5.5 and over 34 times cheaper on output. For token-hungry agentic systems, this kind of pricing could squeeze Western providers hard.

  • Deepseek's 75% discount on V4-Pro is now permanent.
  • Input token price is $0.435 per million, 11.5x cheaper than GPT-5.5.
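The figures in this item imply two other numbers that are not stated directly. The short calculation below derives them from the quoted values only; the variable names are illustrative, and the GPT-5.5 figure is a lower bound inferred from the "at least 11.5 times" claim rather than a published price.

```python
discounted_input = 0.435   # USD per million input tokens, from the item above
discount = 0.75            # the 75 percent discount that is now permanent
input_ratio = 11.5         # "at least 11.5 times cheaper" than GPT-5.5

# Price before the discount: the discounted price is 25% of the list price.
list_price = discounted_input / (1 - discount)

# Lower bound on the GPT-5.5 input price implied by the ratio.
implied_gpt55_input = discounted_input * input_ratio

print(f"Pre-discount V4-Pro input price: ${list_price:.2f} per million tokens")
print(f"Implied GPT-5.5 input price: at least ${implied_gpt55_input:.2f} per million tokens")
```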

DeepSeek V4 Slashes Prices Permanently; CATL, JD, NetEase Rush to Invest; Liang Wenfeng: Goal is AGI

DeepSeek announced permanent price cuts for its V4-Pro API. Meanwhile, CATL, JD, and NetEase are in talks to invest in DeepSeek's first external funding round. Founder Liang Wenfeng emphasizes prioritizing AGI research and maintaining open-source principles.

  • DeepSeek V4-Pro API permanently reduced to one-quarter of original price
  • CATL, JD, and NetEase among companies negotiating investment in DeepSeek

Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

NVIDIA introduces Nemotron-Labs Diffusion language models that achieve up to 6.4x faster inference than autoregressive models while maintaining high accuracy by generating tokens in parallel and refining them iteratively. The models support three modes: autoregressive, diffusion, and self-speculation. The 8B model outperforms Qwen3 8B by 1.2% accuracy.

  • Nemotron-Labs Diffusion models offer three generation modes: autoregressive, diffusion, and self-speculation.
  • The 8B model achieves 2.6x TPF in diffusion mode and up to 6.4x with self-speculation.

Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

This paper proposes the Ablate-to-Validate diagnostic principle and its instantiation, the Token Replacement Test (TRT), to determine whether vision-language models (VLMs) genuinely use continuous latent tokens for reasoning. Experiments show that VLMs retain most performance gains even when token content is corrupted or replaced, indicating that accuracy improvements are a misleading proxy for latent-token reasoning.

  • Introduces the Ablate-to-Validate principle and the Token Replacement Test (TRT) to diagnose actual use of continuous thought tokens.
  • Experiments reveal VLMs retain performance gains even after token content corruption, suggesting gains are not due to reasoning with tokens.

Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model With a 1M-Token Context Window

Alibaba's Qwen team announced Qwen3.7-Max, their most advanced agent model, featuring a 1M-token context window, extended-thinking mode, and strong benchmark scores (56.6 on AI Index, 5th overall). The model excels at coding, debugging, and long-horizon autonomous tasks but has caveats like reduced factual recall on AA-Omniscience and no independent verification of long-context reliability.

  • Qwen3.7-Max offers a 1M-token context window and extended-thinking mode for complex multi-step tasks.
  • It scored 56.6 on the Artificial Analysis Intelligence Index, ranking fifth among all models.
