Model Pricing AI News

Model Pricing updates

Kimi K3 on vLLM: Up to 370 Tokens/sec

2026-07-27 15:44 UTC

vLLM announces day-0 support for Kimi K3, a 2.8-trillion-parameter Mixture-of-Experts model with a 1M-token context. It reaches up to 370 tok/s with DSpark speculative decoding and features hybrid prefix caching, tool calling, and optimizations for production deployment.

Kimi K3 is a 2.8T parameter MoE model with 16 of 896 experts active per token, supporting 1M token context.
vLLM serves Kimi K3 at up to 370 tok/s using DSpark speculative decoding, a 3.14x speedup over baseline.
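As a rough reading of the figures above, the quoted peak and speedup imply a non-speculative baseline near 118 tok/s, and 16 of 896 experts means under 2% of routed experts fire per token. The Python sketch below only redoes that arithmetic; the variable names are illustrative, and the expert ratio is not the active-parameter ratio, which the item does not state.

```python
# Figures quoted in the item above.
peak_tok_per_s = 370.0      # with DSpark speculative decoding
speedup = 3.14              # relative to the non-speculative baseline
active_experts = 16
total_experts = 896

# Baseline throughput implied by peak / speedup.
implied_baseline = peak_tok_per_s / speedup

# Share of routed experts that are active for any single token.
active_expert_fraction = active_experts / total_experts

print(f"implied baseline: {implied_baseline:.1f} tok/s")
print(f"active experts per token: {active_expert_fraction:.2%}")
```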

Ordered Action Tokens for Visuomotor Policy Learning

2026-07-27 04:00 UTC

This paper introduces Ordered Action Tokenization (OAT), a learned action tokenizer that maps continuous robot action chunks to an ordered sequence of discrete tokens, achieving high compression, total decodability, and ordered token space. It demonstrates strong performance across various tasks and backbones.

OAT uses a transformer with registers, finite scalar quantization, and ordering-inducing training to discretize action chunks into an ordered token sequence. Early tokens encode coarse control, later tokens refine details.
OAT is validated on autoregressive policies and token co-training policies, offering anytime tradeoff between inference cost and action fidelity.
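For readers unfamiliar with finite scalar quantization, the step OAT is said to use for discretization, here is a minimal NumPy sketch. It is a simplified illustration (odd level counts only, no straight-through gradient) and not the paper's tokenizer; `fsq_quantize` and `codes_to_index` are hypothetical names.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Bound each latent dimension with tanh, then round to one of `levels[i]` integers.

    Simplification: only odd level counts are handled, so codes are symmetric around 0.
    """
    levels = np.asarray(levels)
    assert np.all(levels % 2 == 1), "this sketch only supports odd level counts"
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half          # each dimension now lies in (-half, half)
    return np.round(bounded).astype(int)

def codes_to_index(codes, levels):
    """Pack per-dimension codes into one discrete token id (mixed-radix encoding)."""
    index = 0
    for code, level in zip(codes.tolist(), levels):
        index = index * level + (code + (level - 1) // 2)
    return index

latent = np.array([0.0, 10.0, -10.0])
codes = fsq_quantize(latent, [5, 5, 5])
token_id = codes_to_index(codes, [5, 5, 5])
print(codes, token_id)
```

With three dimensions of five levels each, the implicit codebook has 5 × 5 × 5 = 125 entries and needs no learned codebook vectors.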

Kimi K3 by Moonshot now available on Modal

2026-07-27 00:00 UTC

Moonshot has released Kimi K3, a 2.8 trillion parameter multimodal model, now available on Modal at 460 tokens per second. The model features a mixture-of-experts architecture, 1M token context window, native vision, and is optimized with a custom DFlash speculator for faster inference.

Kimi K3 is a 2.8 trillion parameter multimodal model ranking 4th on the Artificial Analysis Intelligence Index. It uses MoE with 16/896 experts per token and a 1M context window.
Modal supports K3 on day zero with token-based Shared API and dedicated Auto Endpoint, plus a custom DFlash speculator.

An Inside Look at the Relay Market Powering Token Resellers and Fraud

2026-07-26 19:30 UTC

Matt Lenhard's investigation reveals a market where LLM tokens are resold at a discount by abusing free trials, unprotected support bots, stolen credit cards, and chargeback attacks. Primarily in China, resellers use open-source proxy software like one-api and new-api to pool API keys. Buyers seek cheap tokens, avoid geo-restrictions, or collect data for model distillation. The author calls for stricter API key caps from vendors.

Resellers offer discounted LLM API access by abusing free trials, unprotected support bots, or stolen credit cards and chargeback attacks.
Open-source proxy software one-api and its fork new-api are used to pool API keys.

Running a 28.9M parameter LLM on an $8 microcontroller

2026-07-25 18:59 UTC

A developer successfully runs a 28.9 million parameter language model on an ESP32-S3 microcontroller costing around $8, using Per-Layer Embeddings to store most parameters in flash. The text generation runs entirely on-device at about 9.5 tokens per second, a significant leap from previous 260k-parameter models.

28.9M parameter LLM runs on $8 ESP32-S3 microcontroller with 512KB SRAM and 16MB flash
Uses Google's Per-Layer Embeddings to store 25M parameters in flash, reading only ~450 bytes per token

Meet the New Claude Opus 5: Frontier-Class Agentic Coding and Computer Use at Unchanged Opus Pricing

2026-07-24 21:50 UTC

Anthropic has released Claude Opus 5, replacing Opus 4.8 as the Opus-tier flagship. Pricing remains unchanged at $5/M input and $25/M output tokens. The model approaches Claude Fable 5's intelligence at half the price. Key API changes: thinking is enabled by default, thinking can no longer be disabled at high effort (a breaking change), and verification prompts are removed. It achieves strong results on agentic benchmarks such as OSWorld (70.57%) and Zapier AutomationBench (26.0%), as well as on reasoning tests (perfect score on IMO 2026 problems, 30.16% on ARC-AGI-3). Safety relaxations are limited to source-code vulnerability finding; exploitation remains blocked.

Pricing unchanged with major performance improvements, approaching Fable 5 level
Thinking enabled by default; cannot be disabled at high effort

Show HN: Frontier model pricing became a rip-off, so I built an open-source CLI

2026-07-24 10:33 UTC

Kolega Code is a local-first terminal coding agent with multi-agent orchestration (Gigacode) for broad tasks like large audits, migrations, and parallel checks. It supports model routing, plan/build modes, web search, MCP servers, and is open source under Apache 2.0.

Kolega Code is an open-source, local-first terminal coding agent designed for multi-agent collaboration.
Its Gigacode feature enables parallel execution of sub-agents for efficient handling of large codebases.

Show HN: Generous free tier for SERP and AI web scraping

2026-07-24 04:34 UTC

cloro.dev, the leading AI UI scraping platform, introduces a new recurring free tier offering 500 credits every month. The platform extracts structured data from ChatGPT, Perplexity, Grok, Gemini, Google Search, Google News, Copilot, and AI Overview via a single API. It has crossed 200 customers, mostly large enterprises.

New monthly free tier with 500 credits
Single API for data extraction from multiple AI platforms

Is MoE Routing a Huffman Code? Discovering the Frequency-Diversity Law in Chain-of-Thought

2026-07-24 04:00 UTC

A new study reveals that Mixture-of-Experts (MoE) routing in large language models follows a principle similar to Huffman coding, where common tokens are processed by sparse experts and rare, complex tasks engage diverse expert committees. The paper introduces the Frequency-Diversity Law, identifies a redundancy trap in certain models, and proposes Subset Difference Pruning to improve efficiency.

MoE routing behaves like Huffman coding, allocating sparse resources for frequent tokens and diverse experts for rare tasks.
The Frequency-Diversity Law explains how models like Phi-3.5-MoE and Gemma-4-27B-A4B optimize information-theoretic efficiency.
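The Huffman analogy is easier to follow with the classic algorithm in view: frequent symbols get short codes, rare ones get long codes. The sketch below is standard Huffman code-length construction in Python, not the paper's method; the example frequencies are made up for illustration.

```python
import heapq

def huffman_code_lengths(freqs):
    """Return the Huffman code length (in bits) for each symbol in `freqs`."""
    # Heap entries: (weight, tie_breaker, symbols_in_subtree).
    heap = [(weight, i, [sym]) for i, (sym, weight) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next_id, syms1 + syms2))
        next_id += 1
    return lengths

# Hypothetical token frequencies, chosen only to make the effect visible.
example = {"the": 0.5, "of": 0.25, "entropy": 0.125, "eigenvalue": 0.125}
lengths = huffman_code_lengths(example)
print(lengths)
```

In the study's framing, the short codes correspond to common tokens handled by a sparse set of experts, and the long codes to rare inputs that engage more diverse expert committees.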

Show HN: Frontier model pricing is a rip-off, so I built an open-source CLI

2026-07-23 11:15 UTC

Kolega Code is an open-source, local-first CLI tool that orchestrates multiple AI agents for coding tasks. It supports various model providers, features parallel sub-agent workflows (Gigacode), web search, browser automation, and keeps all data on the user's machine.

Multi-agent coding with specialized sub-agents and Gigacode parallel workflows.
Local-first design: sessions, keys, and state remain on user's machine.

ChronoStitch: Training-Free Composition of Visual KV Memories for Long-Horizon Temporal Reasoning

2026-07-23 04:00 UTC

This paper introduces ChronoStitch, a training-free method for composing independently stored visual key-value (KV) memories to enable long-horizon temporal reasoning in video question answering. By re-basing stored post-rotary keys onto a global three-axis multimodal RoPE coordinate system and selectively recomputing high-deviation visual tokens, it overcomes temporal phase collisions and content gaps from naive concatenation. Experiments on Qwen2.5-VL-3B and the temporal split of TempCompass show improved event-ordering accuracy and 3.3x speedup over full joint re-prefilling.

Long-video QA requires preserving visual evidence over time; KV caching is practical but naive concatenation loses global order.
ChronoStitch re-bases keys to a global RoPE coordinate system and selectively recomputes high-deviation tokens for training-free composition.
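Re-basing a stored post-rotary key works because rotary position embeddings compose: rotating a key already encoded at position p by an extra offset d gives the same vector as encoding it at p + d. The NumPy sketch below checks that property for ordinary one-axis RoPE. It is a simplification (the paper describes a three-axis multimodal RoPE), and `rope_rotate` is a hypothetical helper.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Apply a one-axis rotary embedding to vector `x` at the given position."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per 2-D pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
raw_key = rng.standard_normal(8)

# Key cached with a local position of 5 inside its own clip.
cached_key = rope_rotate(raw_key, 5)

# Re-base onto a global timeline where that clip starts 100 positions later.
rebased_key = rope_rotate(cached_key, 100)

# Same result as encoding the raw key directly at global position 105.
direct_key = rope_rotate(raw_key, 105)
print(np.allclose(rebased_key, direct_key))
```

This is why the cached keys can be shifted without re-running the model; the selective recomputation of high-deviation tokens described in the item is a separate step not shown here.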

Show HN: LiquidBrain – Unlimited Tokens. Unlimited Context. One Fixed Price

2026-07-22 20:59 UTC

LiquidBrain.ai offers unlimited tokens and unlimited context at a fixed price, with a strong emphasis on data privacy.

Unlimited tokens and context processing
Fixed pricing with no extra costs

Gemini 3.6 Flash Is Here: The Efficiency Release

2026-07-22 11:52 UTC

On July 21, 2026, Google quietly released Gemini 3.6 Flash, a mid-cycle update focused on efficiency rather than breakthrough capability. It maintains similar reasoning to 3.5 Flash but with significantly reduced token usage and cost. Improvements in coding, ML tasks, and computer use are notable, with a refreshed knowledge cutoff. The model is priced at $1.50/M input tokens and $7.50/M output, cheaper than its predecessor. The article includes stress tests for readers to evaluate the model themselves.

Gemini 3.6 Flash focuses on efficiency gains, not raw intelligence leaps
Output tokens reduced by ~17%, and by up to 65% on some tasks

The Sequence AI of the Week #899: Inside Inkling: A Trillion-Parameter Model That Only Wakes Up 41 Billion at a Time

2026-07-22 11:04 UTC

Thinking Machines' new model Inkling has 975 billion parameters but activates only 41 billion per token, using a router to select specialist subsets for sparse computation.

Inkling has 975B total parameters but only 41B active per token (about 4.2%).
It uses a sparse activation architecture with a router to choose expert subsets.
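A router that "chooses expert subsets" is commonly implemented as top-k gating: score every expert, keep the k best, and renormalize their weights. The NumPy sketch below shows that generic pattern; Inkling's actual router is not described in the item, so `top_k_route` and the example logits are assumptions.

```python
import numpy as np

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = np.argsort(router_logits)[-k:][::-1]      # expert ids, best first
    shifted = router_logits[top] - router_logits[top].max()
    weights = np.exp(shifted) / np.exp(shifted).sum()
    return top, weights

# Hypothetical router scores for four experts on one token.
logits = np.array([0.0, 3.0, 1.0, 2.0])
experts, weights = top_k_route(logits, k=2)
print(experts, weights)

# Active share of parameters quoted in the item: 41B of 975B.
active_share = 41 / 975
print(f"{active_share:.1%}")
```

Only the selected experts run for that token, which is how total parameter count and per-token compute come apart.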

The US Army Is Burning Through Its AI Tokens

2026-07-22 06:03 UTC

The US Army exhausted its entire year's supply of AI tokens for Ask Sage in just one month, forcing a reimposition of limits. Despite encouragement to use generative AI heavily, token consumption has been staggering, raising questions about the utility and reliability of AI tools.

The Army used up a year's worth of AI tokens by mid-June, requiring new limits in July.
Employees were pushed to use AI, with monthly allotments of at least 200,000 tokens.

Google Releases Gemini 3.6 Flash, 3.5 Flash-Lite, and 3.5 Flash Cyber: A Cheaper, More Token-Efficient Flash Tier Built for Agentic Workloads

2026-07-21 17:45 UTC

Google released Gemini 3.6 Flash, 3.5 Flash-Lite, and 3.5 Flash Cyber on July 21, 2026. The Flash tier gets cheaper and more token-efficient, with 3.6 Flash cutting output tokens 17% and dropping its output price to $7.50 per 1M. Flash-Lite runs at 350 tokens/sec, while gated Flash Cyber powers CodeMender for vulnerability finding. The flagship 3.5 Pro remains delayed.

Gemini 3.6 Flash reduces output tokens by 17% (up to 65% on DeepSWE) and lowers output price from $9.00 to $7.50 per 1M tokens.
Gemini 3.5 Flash-Lite delivers 350 tokens/sec at $0.30/$2.50 per 1M input/output tokens, outperforming older 3 Flash on SWE-Bench Pro and OSWorld-Verified.
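The price cut and the token reduction compound, so the effective saving on output spend is larger than either number alone. The Python sketch below works that out from the quoted figures; it assumes the same workload before and after and ignores input-token cost, and `output_cost_ratio` is an illustrative name.

```python
def output_cost_ratio(old_price, new_price, token_reduction):
    """New output spend as a fraction of old spend for the same workload."""
    return (new_price * (1.0 - token_reduction)) / old_price

# Quoted figures: $9.00 -> $7.50 per 1M output tokens, 17% fewer output tokens
# in general and up to 65% fewer on DeepSWE.
typical = output_cost_ratio(9.00, 7.50, 0.17)
best_case = output_cost_ratio(9.00, 7.50, 0.65)

print(f"typical: {1 - typical:.1%} lower output spend")
print(f"best case: {1 - best_case:.1%} lower output spend")
```

Under these assumptions, output spend falls by roughly 31% in the typical case and roughly 71% in the best case.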

Exploring self-distilled reasoning for supervised fine-tuning with Amazon Nova

2026-07-21 16:23 UTC

This post explores generating thinking tokens for datasets lacking reasoning traces in SFT customization. It examines the reasoning suppression problem, introduces Self-Distilled Reasoning (SDR), validates it across three benchmarks, and provides practical recommendations. SDR reuses the base model's chain of thought as a stand-in, mitigating catastrophic forgetting while maintaining or improving target performance.

SFT on non-reasoning datasets can suppress the model's reasoning ability, even when reasoning mode is enabled.
Self-Distilled Reasoning (SDR) generates reasoning traces from the base model itself, requiring no human annotation.

Show HN: Calyxa – Browser Native AI tutor solving the "cheating" problem

2026-07-21 05:50 UTC

Calyxa is a browser-native AI tutor that helps students learn directly on homework websites and PDFs, eliminating the need to copy-paste problems into separate chats. It provides contextual annotations and voice coaching, aiming to turn cheating into genuine understanding. Free tier offers 10 sessions/month; Pro at $10/month for unlimited use.

Calyxa is an AI tutor that works directly on homework pages, providing annotations and coaching to help students learn step-by-step.
It aims to solve the 'cheating' problem by embedding tutoring into the actual homework process, reducing busywork.

Multi-level context Modeling for consistent expert selection in Mixture-of-Experts

2026-07-21 04:00 UTC

Mixture-of-Experts (MoE) scales Transformers by routing tokens to a subset of experts, but existing routers use shallow or isolated token representations, leading to unstable and semantically inconsistent routing. This work proposes Multi-level Context Fusion MOE (MCF-MOE), which integrates cross-layer semantic aggregation and local token-level interactions for more context-aware representations. Experiments show improved routing consistency and downstream performance.

Existing MoE routers suffer from context incompleteness causing inconsistent expert selection.
MCF-MOE fuses cross-layer semantic and local token signals for better representations.

It Takes 8 Tokens: Weak-to-Strong Off-Policy RL via Auxiliary Branches

2026-07-21 04:00 UTC

A new reinforcement learning method, W2SPO, uses weak auxiliary models to inject short segments into LLM reasoning paths, improving performance and training speed on math reasoning tasks.

Standard RL for LLMs suffers from semantic redundancy and limited support
W2SPO injects short auxiliary segments (as few as 8 tokens) into trajectories

Show HN: Turn casual photos into professional headshots with AI

2026-07-21 03:13 UTC

Portraify is an AI-powered headshot generator that lets users upload 1–3 everyday photos to get a studio-quality portrait in seconds. It offers a free tier with 3 portraits, paid plans starting at $9, and emphasizes privacy by not storing uploaded photos.

Upload 1–3 casual photos; AI generates a professional headshot in 1–2 minutes.
Free tier includes 3 portraits; paid plans from $9 to $50 one-time, no subscription.

Colibrì proof-of-concept gains frontier-level 1.5-TB AI model

2026-07-20 22:58 UTC

Italian engineer Vincenzo (JustVugg) created Colibrì, a proof-of-concept that runs the 744-billion-parameter GLM-5.2 model (1.5TB) on a modest CPU with only 25GB RAM and 1GB/s NVMe. Despite extremely slow speeds (0.05-0.1 tokens per second), it leverages the Mixture-of-Experts architecture to load experts per token, achieving frontier-level answer quality. The project is open-source and aims to explore running large models on consumer hardware.

Colibrì runs a 1.5TB AI model on minimal hardware at 0.05-0.1 tokens/sec.
It uses MoE architecture to load/unload experts per token, enabling operation within tight memory.
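The quoted numbers allow a quick sanity check on why this is slow but possible. The Python sketch below assumes decimal units (1 TB = 1e12 bytes) and that generation is limited purely by NVMe reads, neither of which the item states; it gives an upper bound on bytes read per token, not a measurement.

```python
model_bytes = 1.5e12          # 1.5TB on disk
param_count = 744e9           # 744 billion parameters
nvme_bytes_per_s = 1e9        # 1GB/s NVMe

# Average storage per parameter; about 2 bytes suggests 16-bit weights.
bytes_per_param = model_bytes / param_count

def max_bytes_per_token(tokens_per_s):
    """Most data the drive could deliver during the time spent on one token."""
    return nvme_bytes_per_s / tokens_per_s

slow = max_bytes_per_token(0.05)   # 20 seconds per token
fast = max_bytes_per_token(0.10)   # 10 seconds per token

print(f"{bytes_per_param:.2f} bytes per parameter")
print(f"{fast / 1e9:.0f} to {slow / 1e9:.0f} GB readable per token")
```

Even at the upper bound, each token touches only a small fraction of the 1.5TB file, which is consistent with loading just the experts a token is routed to.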

Complete Guide to Thinking Machines Inkling

2026-07-20 06:37 UTC

Thinking Machines Lab has released Inkling, its first general-purpose open-weights foundation model. It is a multimodal MoE model with 975B parameters, 41B active parameters, and a 1M-token context window. Designed for customization, Inkling excels in reasoning, coding, agentic workflows, and multimodal tasks. This guide covers its architecture, training, benchmarks, deployment, and fine-tuning workflow.

Inkling is a 975B-parameter sparse MoE model with 41B active parameters and up to 1M token context. It supports text, image, and audio input.
Its architecture includes hybrid attention, relative positional embeddings, short convolutions, and multi-token prediction. It was trained on 45 trillion tokens.

Better Starts, Better Ends: Bootstrapped Iterative Self-Reasoning Distillation for Compressed Reasoning

2026-07-20 04:00 UTC

The paper introduces BIRD, a two-stage self-reasoning distillation method that first samples concise solutions with a brevity instruction and performs prompt-switch SFT, then applies on-policy reverse-KL distillation on cleaner prefixes. On Qwen3-8B, MATH-500 accuracy improves from 86.2% to 92.0% while response length drops from 3,099 to 1,115 tokens.

Existing on-policy self-distillation has an initialization bottleneck due to training on noisy prefixes.
BIRD's first stage uses brevity instruction sampling and prompt-switch SFT to make conciseness a default behavior.

AI Trading: Evaluating Large Language Models for Technical Market Analysis

2026-07-20 04:00 UTC

A systematic evaluation of five major LLMs for technical market analysis finds GPT-4 Turbo achieves highest annualized return and Sharpe ratio, while FinGPT shows competitive risk-adjusted performance through domain fine-tuning. The study also identifies failure modes including numerical hallucination and context window limitations.

GPT-4 Turbo achieves highest annualized return and Sharpe ratio among general-purpose models
FinGPT demonstrates competitive risk-adjusted performance via domain-specific fine-tuning

Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

2026-07-20 00:00 UTC

Apple researchers present the Length Value Model (LenVM), a token-level framework that predicts remaining generation length at each decoding step. By framing length modeling as a value estimation problem with a constant negative reward per token, LenVM provides annotation-free, dense, unbiased, and scalable supervision. Experiments on LLMs and VLMs show that on the LIFEBench exact length matching task, LenVM improves a 7B model's length score from 30.9 to 64.8, surpassing frontier closed-source models. LenVM also enables continuous control over the performance-efficiency trade-off and predicts total generation length from the prompt.

LenVM is a token-level framework for predicting remaining generation length.
It uses value estimation with constant negative reward for annotation-free supervision.
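With a constant negative reward per token, the return from any decoding step is just a (possibly discounted) count of the tokens still to come, which is why no annotation is needed. The Python sketch below computes such targets for one finished sequence; the reward of -1 and the optional discount `gamma` are assumptions, and the paper's exact formulation may differ.

```python
def length_value_targets(num_tokens, reward=-1.0, gamma=1.0):
    """Return-to-go at each position when every generated token earns `reward`."""
    targets = [0.0] * num_tokens
    running = 0.0
    # Walk backwards so each step adds its own reward to the discounted future.
    for t in reversed(range(num_tokens)):
        running = reward + gamma * running
        targets[t] = running
    return targets

undiscounted = length_value_targets(4)
discounted = length_value_targets(4, gamma=0.5)
print(undiscounted)
print(discounted)
```

With `gamma=1.0` the target at each step is the negative number of tokens left, counting the current one; a discount below 1 keeps targets bounded for very long generations.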

Free AI Harness Profiler – What tf did Fable do with your tokens

2026-07-18 19:07 UTC

Introducing a free tool to analyze token usage in Fable's AI harness.

This tool audits Fable's token operations.
Free to use, providing insight into token flows.

Controlling Reasoning Effort in LLMs

2026-07-18 11:16 UTC

This article explores how to develop reasoning models with multiple effort modes, covering the evolution from o1 and DeepSeek-R1 to GPT-5.6, and key techniques such as RLVR training, inference scaling, think tokens, and reasoning mode toggles.

Reasoning models output intermediate reasoning traces, distinguishing them from conventional LLMs.
RLVR training rewards only final answer correctness, not the reasoning trace.
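The second point can be made concrete with a toy outcome reward: the score depends only on the extracted final answer, so any reasoning that reaches the right answer earns the same reward. The Python sketch below assumes responses end with a line of the form "Answer: ..."; that format and the name `outcome_reward` are illustrative, not taken from the article.

```python
import re

def outcome_reward(response, gold_answer):
    """Return 1.0 if the last 'Answer:' line matches the reference, else 0.0."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    if not matches:
        return 0.0          # no parseable final answer
    return 1.0 if matches[-1].strip() == gold_answer.strip() else 0.0

careful = "2 + 2 is two pairs, so four.\nAnswer: 4"
sloppy = "I will guess.\nAnswer: 4"
wrong = "2 + 2 is 5.\nAnswer: 5"

print(outcome_reward(careful, "4"), outcome_reward(sloppy, "4"), outcome_reward(wrong, "4"))
```

The careful and the sloppy trace receive identical reward, which is the property the summary describes.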

NVIDIA Vera Rubin Maximizes Intelligence per Dollar for Post-Training Workloads – a Key Metric for Agentic AI

2026-07-17 15:00 UTC

Lowest cost per token from extreme codesign maximizes intelligence per dollar for post-training in the agentic era.

Post-training is a continuous process for agentic AI, essential for adapting to changing environments.
NVIDIA Vera Rubin reduces GPU requirements by 4x compared to Blackwell for large model training.

Polestar: Drift-Aware Cache Calibration and Token Commitment for Efficient Inference of Diffusion LLMs

2026-07-17 04:00 UTC

Polestar is a training-free inference framework that addresses KV-cache reuse and decoding parallelism challenges in diffusion LLMs by leveraging token representation drift. It consists of Polestar-Cache for sparse cache refreshes and Polestar-Commit for identifying commit-ready tokens, achieving up to 10.73% accuracy improvement and 3.7x higher throughput on math and coding benchmarks.

Polestar uses token representation drift to jointly optimize cache efficiency and decoding parallelism.
Polestar-Cache identifies stale KV-cache positions for sparse refreshes, enabling efficient reuse.

Token Time Continuous Diffusion for Language Modeling

2026-07-17 04:00 UTC

This paper introduces token time continuous diffusion (TTCD), a diffusion language model operating in continuous space with per-token times, where tokens proceed from noise to token at varying rates. TTCD avoids parallel sampling inaccuracies and outperforms discrete models at high speedups. A 160M parameter model trained on OpenWebText and self-distilled achieves comparable unconditional and superior conditional generation, with gains in Sudoku solving.

TTCD is a continuous-space diffusion LM with per-token times, allowing tokens to be generated at different rates.
Continuous space avoids inaccuracies from parallel sampling, improving performance at high speedups.

Firefox in WebAssembly

2026-07-16 23:34 UTC

Puter compiled Firefox to WebAssembly, enabling a full browser to run inside another browser. The project used an estimated $25,000 in Claude Opus and Fable tokens, leverages the Wisp protocol for proxying, supports end-to-end encryption, and is open source.

Puter successfully compiled Firefox's Gecko engine to WebAssembly, allowing a browser within a browser.
The project cost approximately $25,000 in AI compute resources, using a Claude Max subscription.

Introducing Grok on Amazon Bedrock

2026-07-16 19:29 UTC

xAI's Grok 4.3 is now generally available on Amazon Bedrock, offering configurable reasoning effort, strong tool use, instruction following, and a 1 million token context window for agentic and enterprise workloads. This post covers its features, access methods, and how to use key capabilities such as chat, reasoning, tool calling, structured output, image input, and multi-turn conversations.

Grok 4.3 is available on Amazon Bedrock via the Mantle inference engine with OpenAI-compatible APIs.
Supports configurable reasoning effort (none, low, medium, high) to balance depth and latency.

Inkling: Our open-weights model

2026-07-16 15:35 UTC

Mira Murati's Thinking Machines Lab released Inkling, a 975B parameter MoE model (41B active) under Apache-2.0 license, multimodal, trained on 45T tokens. It's not frontier but a strong base for fine-tuning via Tinker platform. Inkling-Small (276B, 12B active) is promised. Model card and training data documentation are unusually brief. Inkling is competitive with Chinese open-weight models, adding to the US ecosystem.

Inkling is an open-weights multimodal MoE model with 975B total parameters (41B active), Apache-2.0 licensed, trained on 45 trillion tokens.
It is not a frontier model but designed as a strong base for fine-tuning using Thinking Machines' Tinker platform.

Oracle Agent Memory as an Enterprise Memory Substrate for Long-Horizon AI Agents

2026-07-16 04:00 UTC

A technical report on arXiv introduces Oracle Agent Memory, a database-native memory system built on Oracle Database for long-horizon AI agents. It achieves 93.8% accuracy on LongMemEval while using 10.7x fewer tokens compared to flat-history baselines. The system addresses memory lifecycle, layered architecture with scope control, and evaluation methodology combining task accuracy with memory-specific metrics.

Agent memory is critical for long-horizon AI agents to retain state, user preferences, and procedural knowledge.
Oracle Agent Memory is built on Oracle Database with a lifecycle covering ingestion, extraction, consolidation, retrieval, summarization, and revision/removal.

Open Source, Free Tier Capable Whispr Using Cloudflare AI

2026-07-16 02:57 UTC

Voicebox is an open-source voice-to-text tool that captures speech, transcribes it via Whisper, and formats the output with an LLM, all powered by Cloudflare Workers AI.

Leverages Cloudflare Workers AI for real-time speech-to-text and LLM formatting.
Desktop client built with Wails (Go + React) offers global hotkey and auto-paste.

Show HN: Throttle – Local Claude Cockpit for macOS, now with remote control

2026-07-16 00:12 UTC

Throttle is a macOS menu bar meter for Claude Code usage that evolved into a full cockpit. The free version provides local, no-telemetry monitoring. Pro adds a project cockpit with embedded terminals, auto-hibernate, remote session transfer to Linux servers, and an AI optimizer that audits claude.md and settings.json to reduce output tokens by 65–75%. All data stays local or in iCloud private database. One-time fee of €29.

Free version offers Claude usage meter with no telemetry, no network.
Pro includes project cockpit, embedded terminals, remote control, and AI optimization.

Mira Murati’s Thinking Machines drops Inkling, an open-weights model anyone can access

2026-07-15 23:51 UTC

Mira Murati's Thinking Machines Lab Inc. today launched its first foundation model with the release of Inkling, making its full open weights available to developers so they can fine-tune it as they wish. Inkling is a mixture-of-experts model with 975 billion parameters (41B active) trained on 45 trillion tokens of text, image, audio and video, capable of reasoning across all four modalities but outputting only text. It features "thinking effort" controls and uncertainty flagging to reduce hallucinations. The model is fine-tunable via the Tinker API and aims to provide a Western open-source alternative to Chinese AI models. Thinking Machines plans to generate revenue through the Tinker platform rather than per-token API access, potentially disrupting current AI business models.

Thinking Machines releases Inkling, a 975B-parameter open-weights model (41B active).
Trained on 45T tokens across modalities; outputs text only.

Thinking Machines Lab Releases Inkling: A 975B-Parameter Open-Weights Multimodal MoE With 41B Active Parameters And Controllable Thinking Effort

2026-07-15 23:48 UTC

Thinking Machines Lab released Inkling on July 15, 2026, its first model trained from scratch. The full weights ship under Apache 2.0. It is a 975B-parameter Mixture-of-Experts transformer with 41B active parameters, a 1M-token context window, and native text, image, and audio input. The core differentiator is controllable thinking effort, allowing users to adjust token budgets per call to balance cost and performance.

Inkling is a 975B-parameter MoE transformer with 41B active parameters, supporting a 1M-token context and multimodal input (text, image, audio).
Controllable thinking effort, achieved via RL, enables dynamic token budget adjustment, matching Nemotron 3 Ultra on Terminal Bench with one-third the tokens.

Soofi Consortium Releases Soofi S 30B-A3B: An Open Hybrid Mamba-Transformer MoE Foundation Model For German And English

2026-07-15 21:02 UTC

A German research consortium has published the pretraining report for Soofi S 30B-A3B, an open base model for German and English. It is a Mixture-of-Experts hybrid Mamba Transformer model with 31.6B total parameters, activating 3.2B per token. It achieves the highest English and German aggregate scores among tested fully open base models.

Soofi S 30B-A3B is a hybrid Mamba-Transformer MoE model that activates 3.2B of 31.6B parameters.
It leads open base models with 70.1% English aggregate and 79.1% German aggregate.

Show HN: Limits, an on-device iOS app for tracking AI usage limits

2026-07-15 18:53 UTC

Limits is an iOS app that monitors AI tool quotas for Codex, Claude Code, and Cursor directly on your device. It sends notifications when limits reset, predicts when you'll run out, and helps you redeem expiring rate-limit resets. All data stays on your phone, with tokens stored in the iOS Keychain.

Tracks usage for Codex, Claude Code, and Cursor in one place with real-time session and weekly limit monitoring.
Sends push notifications the moment a limit resets, so you never miss a chance to resume work.

Scaling Point-in-Time Language Models

2026-07-15 04:00 UTC

This paper shows that scaling can substantially narrow the performance gap between point-in-time language models and their unconstrained counterparts. The authors trained decoder-only transformers with up to 4 billion parameters on 1 trillion chronologically filtered tokens from FineWeb, creating monthly model checkpoints from 2013 to 2024. On reasoning and understanding benchmarks, these models approach the performance of leading open-weight models of comparable size (e.g., Gemma-3-4B and LLaMA-7B) trained on temporally unrestricted data. Instruction fine-tuning via LoRA further improves downstream usability. The complete pipeline is released for reproducibility.

Point-in-time language models eliminate lookahead bias by training only on text available up to each calendar date.
Models with up to 4B parameters trained on 1 trillion temporally filtered tokens approach the performance of unconstrained models.
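The core of a point-in-time corpus is a date filter applied once per checkpoint. The Python sketch below illustrates the idea with a hypothetical `Document` record and first-of-month cutoffs; the paper's actual FineWeb filtering pipeline is not described in the item, so every name here is an assumption.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    text: str
    published: date

def point_in_time_corpus(docs, cutoff):
    """Keep only documents that already existed on the cutoff date."""
    return [doc for doc in docs if doc.published <= cutoff]

def monthly_cutoffs(first_year, last_year):
    """One cutoff per calendar month, here the first day of each month."""
    return [date(year, month, 1)
            for year in range(first_year, last_year + 1)
            for month in range(1, 13)]

docs = [
    Document("early post", date(2012, 6, 1)),
    Document("mid post", date(2018, 3, 15)),
    Document("late post", date(2024, 11, 30)),
]
cutoffs = monthly_cutoffs(2013, 2024)
visible_2018 = point_in_time_corpus(docs, date(2018, 4, 1))
print(len(cutoffs), [doc.text for doc in visible_2018])
```

A model trained on the filtered set for a given cutoff cannot have seen later text, which is what removes lookahead bias.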

PrismML Releases Bonsai 27B: 1-bit and Ternary Builds of Qwen3.6-27B That Run on Laptops and Phones

2026-07-14 22:51 UTC

PrismML has released Bonsai 27B, a low-bit representation of Qwen3.6-27B rather than a new pretrain; the architecture is unchanged. Two variants ship under Apache 2.0. Ternary Bonsai 27B uses {−1, 0, +1} weights at a true 1.71 bits per weight, for an ideal size of 5.9GB. 1-bit Bonsai 27B uses binary {−1, +1} weights at 1.125 bits per weight, for 3.9GB. The ternary build retains 94.6% of FP16 performance and the binary build 89.5%. Both are multimodal with a 262K-token context. PrismML claims the 1-bit build is the first 27B-class model to fit on a phone.

Bonsai 27B is a low-bit representation of Qwen3.6-27B, not a new pretrain.
Two variants: ternary (1.71 bits/weight, 5.9GB) and binary (1.125 bits/weight, 3.9GB).
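The quoted sizes follow from bits per weight times parameter count. The Python sketch below checks the arithmetic, assuming 1 GB = 1e9 bytes and treating "27B" as a nominal count; the exact parameter count is not given in the item, so the implied values are estimates.

```python
def size_gb(param_count, bits_per_weight):
    """Ideal storage in GB (1e9 bytes) for a model at the given bit width."""
    return param_count * bits_per_weight / 8 / 1e9

def implied_params(size_in_gb, bits_per_weight):
    """Parameter count that would produce the quoted size at that bit width."""
    return size_in_gb * 1e9 * 8 / bits_per_weight

ternary_nominal = size_gb(27e9, 1.71)      # exactly 27B parameters
binary_nominal = size_gb(27e9, 1.125)

ternary_implied = implied_params(5.9, 1.71)
binary_implied = implied_params(3.9, 1.125)

print(f"nominal 27B: {ternary_nominal:.2f} GB ternary, {binary_nominal:.2f} GB binary")
print(f"implied parameters: {ternary_implied / 1e9:.1f}B and {binary_implied / 1e9:.1f}B")
```

Both quoted sizes point to roughly 27.6B to 27.7B stored parameters, so the two figures are consistent with each other and slightly above the nominal 27B.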

What Context Does a Coding Agent Actually Need to Act?

2026-07-14 04:00 UTC

A new study reveals that coding agents need minimal context when editing code: the signal is only in the code being edited, natural-language summaries fail to answer behavioral questions, surrounding context (UML skeletons) performs no better than deleting it, and compressed context matches full files at one-third the tokens. Temperature-0 inference introduces a ~9% noise floor. The authors release their instrument including gold-validated environments, deterministic patches, and pre-registered hypotheses.

The signal for editing lives solely in the code being edited; natural-language summaries answer almost none of the behavioral questions that source code does, regardless of summarizer size.
Surrounding context rendered as UML skeletons resolves no more issues than outright deletion (N=70, p=0.75).

Anthropic Claude Sonnet 5 vs Sonnet 4.6 vs Opus 4.8: Agentic Coding Benchmarks, API Pricing, and Cost-Performance Tradeoffs Compared

2026-07-14 00:58 UTC

Anthropic has launched Claude Sonnet 5, its most agentic mid-tier model, outperforming Sonnet 4.6 across all benchmarks and narrowing the gap to Opus 4.8. It introduces effort levels to control reasoning costs, offering great value at low/medium effort but potentially exceeding Opus 4.8 cost at extra-high effort. It is now the default model for Free and Pro plans and accessible via API.

Sonnet 5 beats Sonnet 4.6 on SWE-bench Pro, OSWorld-Verified, and HLE, approaching Opus 4.8 scores.
Pricing is lower than Opus 4.8: $2/$10 per million tokens intro (until Aug 31, 2026), then $3/$15.
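A per-request calculation shows how the tiers compare, and how extra reasoning tokens can erase the gap. The Python sketch below uses the Sonnet 5 prices quoted above; the Opus 4.8 price of $5/$25 is taken from the Opus 5 item's note that Opus pricing is unchanged, and the token counts and the 3x output multiplier for extra-high effort are hypothetical.

```python
# (input, output) price in USD per 1M tokens.
PRICES = {
    "sonnet-5-intro": (2.0, 10.0),      # until Aug 31, 2026
    "sonnet-5-standard": (3.0, 15.0),
    "opus-4.8": (5.0, 25.0),            # assumed from the Opus 5 item
}

def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical agentic task: 200k input tokens, 50k output tokens.
intro = request_cost("sonnet-5-intro", 200_000, 50_000)
standard = request_cost("sonnet-5-standard", 200_000, 50_000)
opus = request_cost("opus-4.8", 200_000, 50_000)

# Same task if extra-high effort tripled Sonnet 5's output tokens (assumption).
standard_high_effort = request_cost("sonnet-5-standard", 200_000, 150_000)

print(intro, standard, opus, standard_high_effort)
```

In this example Sonnet 5 costs 40% to 60% of the Opus 4.8 figure at equal token counts, but the tripled-output case comes out more expensive than Opus 4.8, matching the article's caveat about extra-high effort.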

Show HN: QuantumReckon – find your real cloud and AI spend, including tokens

2026-07-13 16:55 UTC

QuantumReckon is a new tool that reveals the true cost of cloud and AI spending, especially the often-hidden token costs from AI APIs. It connects to multiple cloud and AI providers, performs daily automated sweeps, detects anomalies and waste, and provides auditable evidence with sealed receipts. Validated on the founder's own estate, it identified significant savings.

QuantumReckon uncovers AI token spend invisible in traditional cloud bills.
It connects to providers like Azure, AWS, GCP, Anthropic, and OpenAI for daily sweeps.

CLAP: Direct VLM-to-VLA Adaptation via Language-Action Grounding

2026-07-13 04:00 UTC

CLAP converts pretrained VLMs to VLAs by prepending language descriptions to action tokens, avoiding distribution shift. Single-epoch fine-tuning yields 90.8% on LIBERO (+14.9 over VLA-0) and improved robustness. Open-weight models at 0.8B, 2B, 4B to be released.

CLAP adapts VLMs to VLAs by prepending language to action tokens, avoiding output-distribution mismatch
Single-epoch fine-tuning achieves 90.8% on LIBERO for 2B model, +14.9 over VLA-0

Sticky Routing: Training MoE Models for Memory-Efficient Inference

2026-07-13 04:00 UTC

We propose StickyMoE, a differentiable routing consistency loss that penalizes abrupt expert switches between adjacent tokens during training, enabling memory-efficient inference on edge devices. Experiments show up to 60% reduction in expert switch rate with less than 4% perplexity degradation.

MoE models suffer from memory bottlenecks on edge devices due to frequent expert switching.
StickyMoE directly optimizes routing locality at training time via an auxiliary loss, requiring no architectural changes.
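
The two quantities in this summary, the expert switch rate and a penalty on switches between adjacent tokens, can be illustrated with a small sketch. This is not the StickyMoE loss, which the summary does not spell out. It assumes top-1 routing for the switch rate and uses one plausible smooth penalty (one minus the overlap of adjacent routing distributions); both function names are hypothetical.

```python
def switch_rate(top1_experts):
    """Fraction of adjacent token pairs whose top-1 expert differs."""
    pairs = list(zip(top1_experts, top1_experts[1:]))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def consistency_penalty(router_probs):
    """Mean of (1 - overlap) over adjacent tokens' routing distributions.

    Each entry of router_probs is one token's probability vector over experts.
    The overlap is the dot product of neighbouring vectors, so the penalty is
    0 when adjacent tokens put all their mass on the same expert and 1 when
    they put it on disjoint experts. This is an illustrative stand-in, not
    the loss from the paper.
    """
    pairs = list(zip(router_probs, router_probs[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for p, q in pairs:
        overlap = sum(pi * qi for pi, qi in zip(p, q))
        total += 1.0 - overlap
    return total / len(pairs)


# Five tokens routed to experts 0, 0, 1, 1, 2: two switches in four pairs.
print(switch_rate([0, 0, 1, 1, 2]))
# Three tokens with one-hot routing over two experts: one switch in two pairs.
print(consistency_penalty([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

The penalty is a polynomial in the router probabilities, so the same expression written in an autodiff framework could be added to the training loss as an auxiliary term, which matches the "no architectural changes" framing in the summary.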

When Implausible Tokens Get Reinforced: Tail-Aware Credit Calibration for LLM Reinforcement Learning

2026-07-10 04:00 UTC

This paper identifies a failure mode called Positive-Credit Contamination in RL for LLMs, where low-probability erroneous tokens receive the same positive credit as plausible ones. The proposed TACO method computes a tail-risk score to calibrate credit assignment, outperforming GRPO baselines across three LLMs and eight benchmarks while improving training stability in long-horizon RL.

Identifies Positive-Credit Contamination: uniform credit assignment reinforces flawed reasoning by giving erroneous tail tokens the same positive credit as plausible ones.
Proposes TACO, which uses a tail-risk score based on local generation context to modulate positive updates for risky tokens.

Jet-Long: Efficient Long-Context Extension with Dynamic Bifocal RoPE

2026-07-10 04:00 UTC

Jet-Long introduces a tuning-free zero-shot method for extending LLM context windows by using dynamic bifocal RoPE, which adapts the rescaling factor to sequence length, achieving high efficiency and strong performance on multiple benchmarks.

Existing zero-shot context extension methods use a fixed rescaling factor, leading to trade-offs between short and long contexts.
Jet-Long employs dynamic bifocal RoPE with local and long-range windows, automatically adjusting the rescaling factor based on sequence length.
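
The contrast between a fixed and a length-dependent rescaling factor can be sketched as follows. This is not Jet-Long's formula, which the summary does not give. It assumes a simple linear factor (sequence length over trained length, floored at 1) and a two-regime position mapping in which distances inside a local window are left unscaled and longer distances are compressed; all names and numbers are hypothetical.

```python
def rescale_factor(seq_len, trained_len):
    """Length-dependent factor: 1.0 within the trained length, larger beyond it."""
    return max(1.0, seq_len / trained_len)


def effective_distance(distance, local_window, factor):
    """Map a relative token distance to the distance fed into RoPE.

    Distances within the local window are kept exact; the part of the
    distance beyond the window is divided by the rescaling factor.
    Illustrative assumption only, not the paper's mapping.
    """
    if distance <= local_window:
        return float(distance)
    return local_window + (distance - local_window) / factor


# A short input needs no rescaling; a 4x longer one gets factor 4.0.
print(rescale_factor(4_096, 8_192))
print(rescale_factor(32_768, 8_192))
# With a 512-token local window, nearby tokens keep their exact distance.
print(effective_distance(100, 512, 4.0))
print(effective_distance(1_512, 512, 4.0))
```

Under these assumptions a short input is untouched (factor 1.0), which is the trade-off the first key point attributes to fixed-factor methods: a constant factor large enough for long inputs would also distort short ones.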
