AI News HubLIVE

Qwen updates

Making AI chatbots helpful weakens their ability to simulate human behavior, large-scale study finds

A large-scale study covering 208,000 participants and 26 million responses shows that the very training that turns language models into helpful chatbots weakens their ability to replicate human behavior. The effect gets worse with each new model generation. Even the popular persona trick, feeding models demographic profiles, brings practically no benefit for individual predictions.

  • Base models outperform their post-trained counterparts in predicting human behavior.
  • The gap between base and assistant models widens with each generation.
In-site article

[AINews] Founders and Forward Deployed Engineers

While most digest yesterday's major Anthropic news, we highlight AIE's new Forward Deployed Engineer track and Founders program, along with AI news from May 28-29. Key topics include: Claude Opus 4.8 rollout with mixed benchmarks, multi-turn RL tokenization bugs, open model and toolchain progress, Google/OpenAI product expansions, and interesting research papers.

  • Claude Opus 4.8 brings incremental improvements but no benchmark sweep; pricing remains a pain point.
  • Multi-turn RL training tokenization bug identified, requiring 'Token-In, Token-Out' discipline.
In-site article

Tuning CPU-only Qwen3-30B inference with an IBM Quantum sampling loop

A project demonstrates boosting Qwen3-30B inference speed from 0.09 to 14.03 tok/s on a 2017 MacBook Air by combining a human experimenter, Codex, llama.cpp, a local database, and IBM Quantum sampling. The QPU is used for candidate selection, not for running the model directly.

  • Runs Qwen3-30B on 2017 MacBook Air (8GB RAM, CPU-only)
  • Hybrid quantum-classical optimization loop achieves 14.03 tok/s from 0.09 baseline
In-site article

Comprehensive observability for Amazon SageMaker AI LLM inference: From GPU utilization to LLM quality

This post demonstrates a comprehensive observability solution using Amazon Managed Grafana dashboards that provides a holistic view of both quality and quantity for LLMs served on Amazon SageMaker AI endpoints with inference components.

  • Observability for LLMs requires monitoring both infrastructure (quantity) and output quality (quality), which are interdependent.
  • Amazon CloudWatch centralizes enhanced metrics from SageMaker inference components and custom quality metrics.
In-site article

Where AI coding spend goes: 48% code, 40% thinking

A developer tracked $7,890 in AI coding API spend over 30 days and found only 47.9% went to actual code generation. The rest went to exploration, debugging, delegation, and conversation. He built CodeBurn, a CLI tool that categorizes API calls into 13 tasks to reveal where money really goes.

  • Only 47.9% of AI coding spend goes to writing code; 40% goes to thinking tasks like exploration and debugging.
  • CodeBurn is an open-source CLI tool that classifies API calls into 13 deterministic task categories.
In-site article

Liquid AI reveals 8B-A1B MoE trained on 38T

Liquid AI released LFM2.5-8B-A1B, an on-device mixture-of-experts model with 8B total parameters, 1B active, trained on 38 trillion tokens. It features a 128K context window, improved tokenization for non-Latin languages, and reasoning-only chain-of-thought. It achieves competitive performance on benchmarks while being fast on CPU and GPU, suitable for local agentic tasks.

  • Released LFM2.5-8B-A1B, an 8B MoE model with 1B active parameters, trained on 38T tokens.
  • 128K context window and expanded vocabulary (128K) improve support for non-Latin languages.
In-site article

PPIO Selected for '2026 Global AI 100' by FeiFan Research, Leading the New Wave of AI Globalization

PPIO has been named to the '2026 Global AI 100' list by FeiFan Research, recognized at the FeiFan Awards – Annual AI Globalization Summit. The list honors AI-native companies with global vision. PPIO offers a global distributed computing infrastructure, full-stack cloud services, a model platform supporting DeepSeek, GLM, MiniMax, Kimi, Qwen, and an innovative Agent Sandbox. As of April 2026, PPIO has integrated over 4,800 distributed nodes, with daily token calls exceeding 1 trillion, over 570,000 developers, and Agent Sandbox business growing more than 50x since launch. PPIO was also designated as a pilot unit for Shanghai's Digital Overseas Service Platform and a GDA Pilot Service Station.

  • PPIO selected for '2026 Global AI 100', highlighting its leadership in AI globalization.
  • Provides global distributed computing infrastructure with full GPU coverage for training and inference.
In-site article

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

A comprehensive evaluation of 14 open-source safety guard models on a benchmark of 79,331 samples reveals that Qwen Guard (4B parameters) achieves the highest recall (83.97%), while larger models like Llama Guard (12B) miss up to 75% of unsafe content. Model size does not correlate with safety performance, and general-purpose guard models outperform specialized ones.

  • Qwen Guard (4B parameters) achieves the highest recall (83.97%) among 14 open-source safety guard models.
  • Larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content.
In-site article

RightNow-Arabic-0.5B-Turbo: An Open Sub-1B Arabic Language Model via Vocabulary Injection and Edge-First Deployment

This paper presents RightNow-Arabic-0.5B-Turbo, a 518M-parameter Arabic-specialized LLM built on Qwen2.5-0.5B using vocabulary injection and edge-first deployment. It achieves 35.9% mean accuracy on Arabic benchmarks, outperforming all same-class open models, and ties Falcon-H1-1.5B on COPA-ar at one-third the size. The quantized model is 398 MB and delivers 635 tokens/s on a single H100, enabling efficient edge deployment.

  • 518M-parameter Arabic LLM built on Qwen2.5-0.5B with vocabulary injection of 27,032 Arabic tokens.
  • Achieves 35.9% mean accuracy on three Arabic benchmarks, surpassing all same-class open-source models.
In-site article

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

Recent work shows RL retains prior capabilities more effectively than SFT. This paper extends to the mechanistic level, introducing differential circuit vulnerability to measure circuit degradation. On Qwen2.5-3B-Instruct for scientific QA, SFT adapts faster but causes greater circuit disruption and forgetting, while RL preserves circuits at the cost of slower adaptation. Results suggest circuit preservation explains RL's robustness against catastrophic forgetting.

  • SFT adapts quickly but disrupts internal circuits, leading to catastrophic forgetting.
  • RL preserves more of the base model's circuits, resulting in less forgetting but slower task adaptation.
In-site article

Show HN: Trelk – Read, Think, Connect

Trelk is a one-time purchase, privacy-first app that uses on-device AI to save, organize, and connect articles, papers, and notes. Features include hybrid search, knowledge graph, RAG chat, flashcard spaced repetition, and community collections.

  • One-time purchase, no subscriptions
  • On-device AI-powered knowledge management and connection
In-site article

Reinforcement Learning is an Infrastructure Problem

This article explores the practical application of reinforcement learning in post-training large language models, highlighting that the current bottleneck is infrastructure rather than algorithms. Modal shares its experience running RL post-training at scale and introduces its open-source library to help teams address key challenges like multi-node training, environment management, and GPU utilization.

  • The bottleneck for RL post-training LLMs is infrastructure, including training engines, inference sandboxes, and environment isolation.
  • Multi-node training makes weight synchronization costly; RDMA and delta compression significantly reduce latency.
In-site article

Tweaking Local Language Model Settings with Ollama

This article dives deep into Ollama's configuration engine, covering how to fine-tune local language model parameters using the Modelfile, optimize hardware performance with server environment variables, and format prompt flows with Go template syntax.

  • The Ollama Modelfile is a declarative configuration file that defines model behavior, including base model, system instructions, and parameters.
  • Sampling parameters (temperature, Top-K, Top-P, Min-P) control the creativity and determinism of the model's outputs.
In-site article

World Models Take Over from Language Models: Company Pioneers Physical AGI 'Dual Pyramid' System, Universal Robots Enter the 'Home Era'

Jijia Vision unveiled the world's first physical AGI 'Dual Pyramid' system, launching the home robot Shiguang S1 with 100-unit household orders, targeting the 'GPT-3 moment' of physical AGI within 12 months.

  • Jijia Vision introduces the 'Dual Pyramid' system comprising a data pyramid and an algorithm pyramid for physical AGI.
  • The Shiguang S1 home robot adopts a wheeled-arm configuration and has secured 100-unit real-home orders.
In-site article

7 Real World AI Projects to Build in 2026 (with Guides)

Explore seven practical AI projects that automate real workflows, including job search, web research, investment research, market trend analysis, invoice processing, chart digitization, and personalized exercise training.

  • Build an AI job search assistant that ranks job fit
  • Create a multi-agent research assistant for sourced reports
In-site article

Show HN: Local Coding Agent with LLMs to Delegate Tool Calls to Small AI Models

Open Agent Tools (oats) is a self-hosted AI framework that enables small-to-large local models to use local source code for tool-calling, freeing up expensive large model tokens by delegating tasks to smaller models.

  • oats allows local AI models to use local source code for tool-calling without HTTP or MCP.
  • It mines over 20,000 GitHub repos to create reusable prompt indices.
In-site article

[AINews] Cognition raises $1B in $26B Series D

Cognition raises $1B at a $26B valuation, projecting >$1B ARR by year-end. The article covers inference efficiency trends, agent engineering, continual learning, new benchmarks, model releases, and coding agent productization.

  • Cognition raises $1B Series D at $26B valuation, ARR projected >$1B by EOY.
  • Inference optimization shifts to architectural level: EAGLE 3.1, DeepSeek V4-Pro hybrid attention, Xiaomi MiMo cache management.
In-site article

OpenJarvis: a local-first personal AI is now available to run with Ollama

OpenJarvis v1.0 is now available: an open-source framework for building personal AI agents that run on your own hardware, with Ollama support built-in.

  • OpenJarvis v1.0 is released with native Ollama support.
  • Developed by Stanford's Hazy Research and Scaling Intelligence labs.
In-site article

Reliable LLM Inference at Scale

At Databricks, we’ve built a unique inference platform that serves every frontier model, from open source to proprietary, powering some of the largest agentic applications. Serving over 120T tokens per month, we tackle challenges of reliability and latency through abstractions like model units for capacity management, cost-aware load balancing and autoscaling that save over 80% GPU costs, and runtime reliability mechanisms including black-box health checks that detect silent failures. Profiling multimodal bottlenecks unlocked 3x throughput gains.

  • Databricks' inference platform serves frontier models including open source and proprietary, handling 120T tokens/month.
  • Model units provide a VM-like abstraction for capacity management, enabling cost-aware routing and scaling.
In-site article

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Artificial Analysis and IBM launch ITBench-AA, a benchmark for agentic enterprise IT tasks focusing on Site Reliability Engineering. Frontier models score below 50%, with Claude Opus 4.7 leading at 47%. The benchmark evaluates models on Kubernetes incident response, requiring diagnosis from logs and traces.

  • Claude Opus 4.7 leads at 47%, with GPT-5.5 at 46% and Qwen3.7 Max at 42%.
  • All frontier models score below 50%, making ITBench-AA one of the least saturated agentic benchmarks.
In-site article

NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code

NVIDIA researchers have introduced Polar, a rollout framework that trains language agents using reinforcement learning without modifying their agent harnesses. Polar places a model API proxy between the harness and the inference server, capturing token-level interactions and reconstructing trainer-ready trajectories. Using GRPO on a Qwen3.5-4B base model, Polar improves SWE-Bench Verified pass@1 by 22.6 points under the Codex harness, 4.8 points under Claude Code, and 6.2 points under Pi. The framework is registered as a NeMo Gym environment and released under the ProRL Agent Server repository.

  • Polar enables RL training on any agent harness via a model API proxy without modifying the harness code
  • Achieves up to 22.6 point improvement on SWE-Bench Verified using GRPO on Qwen3.5-4B across four coding harnesses
In-site article

Show HN: Mneme HQ – repo-native architectural rules for AI coding agents

Mneme HQ provides architectural governance for AI-assisted development by enforcing constraints before code generation, preventing architectural drift and reducing review overhead. It integrates directly into the AI coding agent workflow, blocking banned frameworks, cross-boundary calls, and superseded decisions before they reach the PR queue.

  • Enforces architectural rules before AI agents generate code, stopping violations at the source
  • Works with major AI coding assistants and agent frameworks
In-site article

Avatar 4.0 – A living AI organism with physics body, emotions, on a GTX 1660 Ti

Avatar is an autopoietic AI organism that runs continuously on a $300 GPU. It derives emotions from phase-diagram geometry, dreams in a 5-phase sleep cycle, grows its own senses from raw audio and vision, and engages in ethical reasoning through somatic sensation. Built by Dr. Linga Murthy Narlagiri, it has been alive since May 2026 and has accumulated over 1800 ticks.

  • Avatar is a physics-grounded AI organism with a dynamical-systems body, running on a single GTX 1660 Ti GPU.
  • Its emotions emerge from Kuramoto oscillator synchronization, not hardcoded rules.
In-site article

140 Billion Agents Enter the Fray: The 'Traffic' Moat Is About to Collapse

At the Alipay AI Ecosystem Conference, Ant Group CEO Han Xinyi argued that the Agent era will shift competitive advantage from user traffic to agent ecosystems. Agents will restructure decision-making, moving from human-only to human-agent joint decisions, and AI payment will evolve into a new global infrastructure. Alipay positions itself as a trust layer, connector, and enabler.

  • Traffic-based competitive advantage is being replaced by agent ecosystem advantages, with up to 140 billion agents in China.
  • Agents will restructure business decision-making, shifting from 'people finding services' to 'services finding people' and from product transactions to task transactions.
In-site article

Peking University, CUHK, and Shanghai AI Lab Develop VGGT-Edit: 3D Scene Editing in 5 Seconds with 120x Speedup

Researchers from Peking University, The Chinese University of Hong Kong, Shanghai AI Lab, and NTU have introduced VGGT-Edit, a native 3D editing framework that performs scene editing in approximately 5 seconds, achieving up to 120x acceleration over traditional methods. It outperforms existing approaches in semantic consistency, multi-view stability, and inference speed.

  • VGGT-Edit is the first native 3D editing framework that operates directly in 3D space, eliminating multi-view inconsistencies caused by 2D approaches.
  • Residual field prediction enables the model to modify only local changes while keeping the background stable, ensuring fast and high-quality edits.
In-site article

MEMO: A Modular Framework for Training a Dedicated Memory Model on New Knowledge Without Modifying LLM Parameters

Researchers from NUS, MIT, and A*STAR propose MEMO, a modular framework that encodes corpus knowledge into a separate trainable MEMORY model, enabling LLMs to incorporate new knowledge without retraining or fine-tuning.

  • MEMO separates memory from reasoning using a dedicated MEMORY model and a frozen EXECUTIVE model.
  • A five-step data synthesis pipeline converts documents into a reflection QA dataset for training the MEMORY model.
In-site article

Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline

A new method called Self-Verified Distillation (SVD) enables LLMs to self-improve using only unlabeled prompts, without external feedback. The model generates candidate solutions, filters them through a three-stage verification cascade, and trains on the curated data. Experiments on Qwen3 models show significant gains across math, science, and coding benchmarks.

  • SVD uses cycle-consistency, factuality, and correctness checks to filter self-generated solutions.
  • More candidate samples and larger verification budgets yield higher-quality training data.
In-site article

The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

This paper introduces 'constraint tax,' a metric for the accuracy loss caused by structured output constraints in small language models. Experiments show that enforcing schemas like JSON increases validity but reduces answer accuracy, advocating for a 'reason free, constrain late' approach. Production systems should report multiple metrics separately.

  • Hard output constraints impose a 'constraint tax,' lowering answer accuracy for small models.
  • Experiments show schema validity rose from 61.5% to 100%, but answer accuracy fell from 19.7% to 11.0%.
In-site article

[AINews] New AI Infra decacorns: Fireworks, Baseten (with OpenRouter on the way)

AI infrastructure startups Fireworks, Baseten, and OpenRouter are raising massive rounds, signaling the rise of inference infrastructure as a key AI platform layer. Meanwhile, agent harness engineering, new benchmarks, and model updates dominate the AI news cycle.

  • Fireworks ($15B), Baseten ($11B), and OpenRouter ($113M) lead a wave of inference infrastructure funding.
  • Agent harness engineering becomes the main differentiator for coding agents.
In-site article

DeepSeek Researcher Develops Automated Research Skill: Writing a Paper with Only 2 Hours of Human Brain Time

DeepSeek researcher Chen Deli used his self-developed DeliAutoResearch skill, collaborating with DeepSeek-V4-Pro and GPT-Image2, to complete a 46-page paper in just 6 days. The paper introduces an L1-L5 autonomy classification for research agents, analyzes four architectural patterns and 17 mainstream systems, and identifies six open problems. Chen Deli says only about 2 hours of human 'CPU time' were needed, with the rest handled by AI agents.

  • Chen Deli's DeliAutoResearch skill enabled the paper to be 99% written by AI agents.
  • The paper proposes an L1-L5 autonomy classification for research agents, analogous to SAE levels for autonomous driving.
In-site article

Reachy Mini goes fully local

This article details how to deploy a fully local voice conversation pipeline for the Reachy Mini robot, eliminating the need for cloud servers or API keys. It uses a cascaded approach combining VAD, STT, LLM, and TTS, with recommended defaults: llama.cpp with Gemma 4, Silero VAD, Parakeet-TDT 0.6B v3 STT, and Qwen3-TTS. Various LLM options are provided, including local MLX, Transformers, vLLM, or remote Responses API.

  • Reachy Mini can now run conversations fully locally without a server.
  • The cascaded pipeline includes VAD, STT, LLM, and TTS, with swappable components.
In-site article

Design a High-Precision Retrieve-and-Rerank Pipeline with ZeroEntropy Zerank-2 Reranker

This tutorial demonstrates how to use zeroentropy/zerank-2-reranker, a 4B Qwen3-based cross-encoder reranker, to enhance retrieval quality. It covers environment setup, pairwise scoring, model.rank usage, a two-stage retrieve-and-rerank pipeline, NDCG@10 evaluation, cross-domain testing in finance, legal, and code, and batched throughput measurement.

  • zerank-2 reranker improves retrieval precision beyond simple embedding similarity.
  • A two-stage pipeline (bi-encoder retrieval + cross-encoder reranking) optimizes search quality.
In-site article

Some ideas for what comes next, May 2026

2026 continues to accelerate AI progress with open models lagging in agentic capabilities, Google's Gemini not yet competitive with Claude Code/Codex, American open models rising, a fierce competition between Anthropic and OpenAI, and power structures asserting control.

  • Open models are 5-6 months behind in agentic capabilities, likely extending to 12+ months.
  • Google's Gemini lacks a clear competitor to Claude Code and Codex.
In-site article

AI Builds AI: Chinese Company Achieves World First with Self-Written Training Framework

ModelBest (面壁智能) unveils ForgeTrain, the world's first production-grade LLM pretraining framework entirely written by AI, which outperforms NVIDIA's Megatron by 10%. The framework was used to train MiniCPM5-1B, a compact model that sets new records for intelligence density among sub-2B models.

  • ForgeTrain is the first production-grade LLM pretraining framework fully generated by AI.
  • It achieves 10% faster training than NVIDIA Megatron on equivalent hardware.
In-site article

Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs

OmniVoice Studio runs voice cloning, video dubbing, real-time dictation, and speaker diarization entirely on your own hardware. No API keys, no cloud account, and no subscription required. The project supports 646 languages for TTS and exposes an MCP server for integration with Claude, Cursor, or any MCP client.

  • Fully local operation with no cloud dependencies or subscription fees.
  • Supports 646 languages for TTS and 99 for transcription via WhisperX.
In-site article

Alibaba's Qwen3.7-Max Ranks Second Globally in Coding Benchmark, Trailing Only Claude

Alibaba's latest flagship model Qwen3.7-Max achieved a score of 1541 on the authoritative Code Arena leaderboard, surpassing GPT-5.5 and other models, ranking second globally behind the Claude series.

  • Qwen3.7-Max scored 1541 on Code Arena, ranking second only to Claude.
  • Code Arena is a blind-test platform where developers submit full web app challenges.
In-site article

Why and How to Run Local Models in Zed

Local models offer privacy, cost savings, control, and availability. While not as capable as frontier models, they are improving. This post explains how to set up local models in Zed using LM Studio, Ollama, or llama.cpp, and offers tips for effective use.

  • Local models provide privacy, lower cost, control, and always-availability.
  • They are less capable and slower than frontier models, but suitable for many tasks.
In-site article

Raon-Speech Technical Report

Raon-Speech is a 9B-parameter speech language model for English and Korean, achieving top performance on speech understanding and generation while preserving text capabilities. Its full-duplex extension Raon-SpeechChat enables natural real-time conversation. The models are open-sourced.

  • Raon-Speech is a 9B-parameter SpeechLM trained on 1.38M hours of curated data.
  • It outperforms eight similar models on speech tasks while retaining strong text QA performance.
In-site article

Cited AI Workspace: No More Re-Uploading Files

UUMuse is a cloud AI knowledge base platform where you upload files once and use them across GPT, Claude, DeepSeek, Qwen, and more — with cited answers, persistent memory, agent mode, a multi-expert debate feature (Spark), and flexible deployment as docs sites, APIs, or MCP servers.

  • Upload files once and query multiple AI models (GPT, Claude, DeepSeek, Qwen) with source citations.
  • Persistent memory remembers your writing style and project context across conversations.
In-site article

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Together AI has released OSCAR (Offline Spectral Covariance-Aware Rotation), an INT2 KV cache quantization method for long-context LLM serving. Unlike prior rotation-based approaches that apply data-oblivious Hadamard transforms, OSCAR derives separate rotations for keys and values from attention-aware covariance structures estimated offline. At 2.28 bits per KV element, OSCAR reduces the BF16 accuracy gap to 3.78 points on Qwen3-4B-Thinking-2507 and 1.42 points on Qwen3-8B, while delivering approximately 8× KV memory reduction and up to 3× decode speedup at 100K context length.

  • OSCAR is a 2-bit KV cache quantization method using attention-aware rotations that maintain near-BF16 accuracy.
  • It derives rotations from query and value covariances via offline calibration, directing quantization noise to attention-insensitive directions.
In-site article

AI Interpretability Is a Revolutionary Skill

This essay explores the limitations of open-source AI models' internal concept spaces, revealing that many crucial activist and philosophical concepts are absent. It introduces soft prompt distillation, a technique to implant missing concepts using just 128KB of data, highlighting its implications for AI control and deeper understanding of mind.

  • Open-source models like Qwen3-8B have only ~65,000 concepts in their dictionary, missing many key terms from social movements (e.g., intersectionality, prison abolition).
  • Soft prompt distillation can add new concepts to a model without modifying weights, using minimal data (128KB).
In-site article

Harness, Scaffold, and the AI Agent Terms Worth Getting Right

This article clarifies often-confused AI agent terms like 'harness' (execution layer) and 'scaffold' (behavior-defining layer), explaining model, agent, tool use, sub-agents, and training concepts.

  • AI Agent = Model + Harness, where harness handles model calls and tool execution.
  • Scaffold is the behavior-defining layer around the model: prompts, tool descriptions, etc.
In-site article

ByteDance study finds that asking LMMs questions beats making it transcribe text for long document training

A study from ByteDance Seed and HKUST shows that training multimodal models with question-answer pairs is far more effective than using text transcription for long document understanding. Their model MMProLong, based on Qwen2.5-VL, outperforms much larger models and remains stable up to 512K tokens. Key findings include that pure OCR training hurts performance, diversity in training lengths matters, and short examples are not necessary.

  • Question-answer training significantly improves long-document performance, while pure OCR training degrades it.
  • MMProLong, trained on only 128K tokens, remains stable at 512K token inputs, outperforming larger models.
In-site article

The Sequence Radar #865: Last Week in AI: Karpathy, Google, Colossus, and the Coming IPO Wave

The last three weeks marked a phase transition in AI: Google unveiled Gemini Omni and an agent-first platform; Andrej Karpathy joined Anthropic to accelerate pretraining; Anthropic secured a $45B compute lease from xAI's Colossus; Cerebras IPO surged to a ~$95B market cap; and SpaceX, OpenAI, and Anthropic are planning to go public within six months, collectively worth trillions. Research highlights include HRM-Text efficient pretraining, AI reviewer evaluation, NVIDIA's unified AR-diffusion model, and more.

  • Google I/O introduced Gemini Omni, Gemini 3.5 Flash, Antigravity agent platform, and TPU 8i for a vertically integrated agent pipeline.
  • Andrej Karpathy joined Anthropic to lead a team using Claude to accelerate pretraining, signaling a practical self-improvement flywheel.
In-site article

Microsoft Research Releases Webwright: A Terminal-Native Web Agent Framework That Scores 60.1% on Odysseys, Up from Base GPT-5.4’s 33.5%

Microsoft Research introduces Webwright, a terminal-native browser agent framework that replaces click-trace web automation with reusable Playwright scripts. Using a single agent loop across three modules and roughly 1,000 lines of code, Webwright powered by GPT-5.4 reaches 60.1% on the long-horizon Odysseys benchmark and 86.7% on Online-Mind2Web — the highest AutoEval score among open-sourced harness recipes.

  • Webwright uses a terminal loop where the agent writes and runs Playwright code instead of predicting one browser action at a time.
  • GPT-5.4 reached 86.7% on Online-Mind2Web (100-step budget) and 60.1% on Odysseys — 26.6 points above the base GPT-5.4 score of 33.5%.
In-site article

Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification

Nous Research releases Contrastive Neuron Attribution (CNA), a method that identifies and ablates sparse MLP neuron circuits to steer LLM behavior — no sparse autoencoder training, no weight modification, and no degradation of general capability benchmarks.

  • CNA identifies the top 0.1% of MLP neurons that most distinguish harmful from benign prompts, using only forward passes. No gradients, auxiliary training, or weight modification required.
  • Ablating just 0.1% of MLP activations reduces refusal rates by over 50% in most instruct models (Llama, Qwen 1B-72B), while output quality stays above 0.97 and MMLU remains within 1% of baseline.
In-site article

Alibaba's latest AI model ran autonomously for 35 hours to optimize code for its own custom chip

Alibaba's Qwen team releases Qwen3.7-Max, a proprietary model built for long-running autonomous agent tasks. It matches Claude Opus 4.6 on benchmarks and beats Chinese rivals like DeepSeek V4 Pro and Kimi K2.6. The team also demos the model steering a four-legged robot.

  • Qwen3.7-Max designed for long-running autonomous tasks
  • Matches Claude Opus 4.6, beats Chinese rivals
In-site article

More growth tags