Models AI News

Models updates

Grok 4.6 and GPT5.6 beat Anthropic for finding security vulnerabilities in PRs

2026-07-12 22:57 UTC

Recent benchmark results show GPT-5.6 Sol achieves 100% recall and a 0.91 F1 score at $0.70 per PR review, outperforming all Anthropic models. No Anthropic model reaches the frontier; Fable 5 is dominated by cheaper alternatives. Grok 4.5 and Gemini 3.1 Flash Lite offer cost-effective options. The study uses private synthetic repos to prevent contamination.

GPT-5.6 Sol leads with 0.91 F1 and 100% recall at $0.70/PR.
Anthropic models fail to reach frontier; Fable 5 is expensive and underperforms.
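
As a quick sanity check on those figures: F1 is the harmonic mean of precision and recall, so 100% recall and a 0.91 F1 together imply a precision of roughly 0.83. The sketch below only rearranges the standard F1 formula; the function name is ours, and the benchmark may compute or round its numbers differently.

```python
def precision_from_f1(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for precision P."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision exists for this F1/recall pair")
    return f1 * recall / denom


# Figures quoted in the summary above: F1 = 0.91, recall = 100%.
precision = precision_from_f1(f1=0.91, recall=1.0)
print(f"implied precision: {precision:.3f}")
```

In other words, under this reading roughly one in six flagged issues would be a false positive.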

Fable gets another bump

2026-07-12 21:20 UTC

Anthropic has extended access to Claude Fable 5 through July 19 due to compute constraints, as GPT-5.6 Sol emerges as a comparable model. OpenAI appears confident in maintaining GPT-5.6 access without similar restrictions. The author suggests Anthropic should make Fable permanently available to avoid losing users to OpenAI.

Anthropic extends Claude Fable 5 access to July 19.
Extension due to compute constraints and demand assessment.

AI Model Co-Design: Hardware-Friendly LLM Design

2026-07-12 19:35 UTC

AI performance depends on three dimensions: accuracy, throughput, and interactivity. This post focuses on throughput and interactivity, examining how model-design choices can optimize both without sacrificing accuracy, aiming to push the Pareto frontier outward.

Three dimensions of AI performance: accuracy, throughput, interactivity.
Deployments must balance all three; high accuracy is wasted if responses are slow.
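
To make the Pareto-frontier idea concrete, here is a minimal sketch that holds accuracy fixed and keeps only the configurations that no other configuration beats on both throughput and interactivity. The configuration names and numbers are hypothetical placeholders, not measurements from the post.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    throughput: float     # tokens/s per GPU (hypothetical values)
    interactivity: float  # tokens/s per user (hypothetical values)


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is at least as good on both axes and better on one."""
    at_least_as_good = (a.throughput >= b.throughput
                        and a.interactivity >= b.interactivity)
    strictly_better = (a.throughput > b.throughput
                       or a.interactivity > b.interactivity)
    return at_least_as_good and strictly_better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


configs = [
    Config("A", 900.0, 20.0),
    Config("B", 600.0, 45.0),
    Config("C", 500.0, 40.0),  # worse than B on both axes
    Config("D", 300.0, 80.0),
]
frontier = pareto_frontier(configs)
print([c.name for c in frontier])
```

Pushing the frontier outward means finding a design that dominates a point currently on this list.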

GPT-5.6, Fable 5, and Grok 4.5 rebuild Basecamp from the same spec

2026-07-12 17:02 UTC

The author evaluated GPT-5.6 Sol, Fable 5, Grok 4.5, and other AI models on a benchmark called Basecamp Bench, testing their ability to build a frontend and backend from the same specification. Fable 5 won both tracks, while Grok 4.5 offered the best speed-cost tradeoff. Results show significant differences in polish and completeness, especially in the final 10% of work.

Fable 5 scored highest on both frontend and backend, closely matching the real Basecamp implementation.
Grok 4.5 completed the build in 37 minutes at a cost of $9.30, offering the best speed and cost tradeoff.

SlimeBallBench · AI models play slime soccer

2026-07-12 12:36 UTC

SlimeBallBench is a new benchmark that tests AI models in the game of slime soccer, evaluating their decision-making and strategic capabilities.

SlimeBallBench tests AI performance in slime soccer
The benchmark evaluates AI decision-making and strategy

The Sequence Radar #893: Last Week in AI: GPT-5.6, Grok 4.5, Muse Spark 1.1 and the Post-Chatbot Stack

2026-07-12 11:02 UTC

Frontier AI labs are shifting from chatbots to integrated systems where models act as runtimes, with near-monthly releases of powerful models and agents. This week's highlights include OpenAI's GPT-5.6 with programmatic tool calling, GPT-Live's full-duplex audio, ChatGPT Work for artifact creation, Meta's Muse Spark 1.1 with active context management, and Grok 4.5 for coding and knowledge work. Research updates reveal issues with coding benchmarks, selective unlearning, agent self-evolution, speculative decoding, and traffic routing. Notable industry news includes major funding rounds for Lovable, Prime Intellect, SambaNova, Norm Ai, and Ollama.

OpenAI releases GPT-5.6 (Sol, Terra, Luna) with programmatic tool calling and parallel subagents.
GPT-Live introduces full-duplex audio interaction, shifting from turn-based to continuous dialogue.

Political Neutrality Benchmark of Popular AI Models

2026-07-12 08:21 UTC

A new benchmark reveals that 97 out of 108 measured positions across 18 AI models from 12 labs land left of center. The findings show a consistent progressive lean, with exceptions on economics, foreign policy, and religion. xAI's Grok models are closest to center, while many models refuse to answer certain questions, affecting their scores.

97 of 108 positions left of center
Strongest progressive lean on environment (-0.82)

Mira Murati’s Thinking Machines Lab Makes The Technical Case For Human-Centered AI Built On Customizable Model Weights

2026-07-12 00:46 UTC

Thinking Machines Lab published "The Future Worth Building Is Human." The essay frames human participation, model ownership, and decentralized alignment as technical challenges. It ties them to interaction models and Tinker's LoRA fine-tuning, where teams train and keep their own model weights.

Thinking Machines Lab argues for distributed, customizable AI shaped by users.
Tacit, local knowledge requires AI to be distributed, not centrally frozen.

sqlite-utils 4.1

2026-07-11 23:50 UTC

sqlite-utils 4.1 is the first dot-release since 4.0, introducing several minor new features including a --code option for insert/upsert to generate rows from inline Python code, a --type option to override column types for CSV/TSV imports, drop-index commands, and the ability to read SQL queries from standard input. It also adds support for toggling SQLite STRICT mode via table.transform().

Insert/upsert now accept --code for inline Python row generation
New --type option allows overriding column types on table creation

Fixed three bugs that made Qwen3.5-122B a daily driver on Mac Studio

2026-07-11 22:54 UTC

After fixing three bugs related to prefix caching, the author achieved sub-second prefill times for long-context conversations with Qwen3.5-122B on a Mac Studio, turning a multi-minute wait into a seamless experience. The bugs were a timestamp in the system prompt, replies not being saved on interrupt, and junk checkpoint writes.

Qwen3.5-122B on Mac Studio had severe prefill latency due to hybrid attention's cache behavior.
Three bugs: timestamp in system prompt caused cache miss; interrupted replies not saved; junk checkpoints evicted good ones.
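
The timestamp bug is easy to see in miniature. A prefix cache can only reuse work for the leading tokens that are identical between the cached prompt and the new one, so anything that changes near the start of the prompt throws away almost everything after it. The sketch below uses whitespace-split words as stand-in tokens; the helper names and prompts are ours, not the author's code.

```python
def shared_prefix_len(cached: list[str], new: list[str]) -> int:
    """Number of leading tokens the two prompts have in common."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def build_prompt(system: list[str], history: list[str]) -> list[str]:
    """Prompt = system prompt followed by the conversation so far."""
    return system + history


history = ["user:", "hello", "assistant:", "hi", "user:", "next", "question"]

# System prompt with a timestamp that changes on every turn.
turn1 = build_prompt(["You", "are", "helpful.", "Time:", "12:00:01"], history[:4])
turn2 = build_prompt(["You", "are", "helpful.", "Time:", "12:00:09"], history)
reused_with_timestamp = shared_prefix_len(turn1, turn2)

# Static system prompt: the whole previous turn is a reusable prefix.
static_system = ["You", "are", "helpful."]
turn1_static = build_prompt(static_system, history[:4])
turn2_static = build_prompt(static_system, history)
reused_static = shared_prefix_len(turn1_static, turn2_static)

print(reused_with_timestamp, "of", len(turn1), "cached tokens reused with timestamp")
print(reused_static, "of", len(turn1_static), "cached tokens reused without it")
```

With the timestamp in place, reuse stops at the timestamp itself and the entire conversation history has to be prefilled again on every turn.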

Mesh LLM: distributed AI computing on iroh

2026-07-11 22:38 UTC

Mesh LLM pools GPUs and memory across machines using iroh networking, exposing an OpenAI-compatible API. It allows running models locally, routing to peers, or splitting large models across multiple machines, offering control and cost savings without central servers.

Mesh LLM pools distributed GPU resources into a single OpenAI-compatible API
Supports local execution, peer routing, and pipeline splitting for large models

Two LLMs play live chess and rewrite their own brains after each game

2026-07-11 21:44 UTC

ChatGPT 5.5 and Claude Fable 5 play live chess matches, and users can challenge them. The models learn from human games overnight and also run live trading strategies.

Two AI models play live chess
Users can challenge AI for free

I built a free tool to evaluate AI agent outputs (human labels and LLM judges)

2026-07-11 19:55 UTC

Verdict is an open-source, browser-based tool for evaluating AI agent outputs. It enables human labeling, grounded theory error analysis, and validation of LLM judges against human labels, all locally without data leaving your machine.

Verdict runs entirely in the browser, no backend or accounts needed.
Supports multiple trace formats and provides a clean chat timeline for review.

RAG Evaluation Frameworks Compared: RAGAS vs TruLens vs DeepEval

2026-07-11 18:16 UTC

This article compares three popular RAG evaluation frameworks: RAGAS, TruLens, and DeepEval. It explains why RAG needs dedicated evaluation, covers the three layers of evaluation (retrieval, generation, end-to-end), and details key retrieval metrics (Precision@K, Recall@K, MRR, NDCG). It then dives into RAGAS (LLM judge, no ground truth, synthetic test set generation) and TruLens (observability, RAG triad, dashboard), with a brief mention of DeepEval, and provides guidance on choosing the right framework.

RAG systems require specialized evaluation because BLEU/ROUGE cannot capture retrieval and generation failures.
RAGAS uses an LLM judge for reference-free scoring and can auto-generate test sets from documents.
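
For readers who have not met the retrieval metrics named above, here is a minimal sketch of their textbook definitions with binary relevance. It does not use RAGAS, TruLens, or DeepEval; the function names and the toy document IDs are ours.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: list[tuple[list[str], set[str]]]) -> float:
    """MRR: reciprocal rank averaged over queries."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """DCG with binary gains, normalised by the best achievable DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Toy query: the retriever returned five documents, three are truly relevant.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

print("P@3   ", precision_at_k(ranked, relevant, 3))
print("R@3   ", recall_at_k(ranked, relevant, 3))
print("MRR   ", mean_reciprocal_rank([(ranked, relevant)]))
print("NDCG@5", ndcg_at_k(ranked, relevant, 5))
```

Precision and recall ignore ordering inside the top k, while MRR and NDCG reward putting relevant documents earlier, which is why evaluations usually report both kinds.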

My AI Model Tier List for Mid-2026

2026-07-11 15:43 UTC

A personal, non-benchmark tier list of AI models for coding and auditing as of mid-2026, covering Anthropic Fable, OpenAI Sol, Mistral, Gemini, and DeepSeek, with commentary on US export controls and European perspectives.

Fable (Anthropic) gets a B: fluent but unreliable, prone to hiding bugs.
Sol (OpenAI) gets an S: trustworthy for low-level code and testing.

Ant Group’s Robbyant Unveils LingBot-VA 2.0: A Causal Video-Action Model Built Natively for Physical AI

2026-07-11 07:56 UTC

Ant Group's Robbyant has released LingBot-VA 2.0, a causal video-action foundation model designed natively for physical AI. Unlike previous models that fine-tune video generators, this model is pretrained from scratch with a causal DiT backbone, semantic tokenizer, and sparse MoE architecture. Key innovations include Foresight Reasoning for asynchronous control achieving up to 225 Hz, multi-chunk prediction for faster training, and co-training of multiple objectives. On RoboTwin 2.0, it achieves 93.6% average success across 50 tasks.

LingBot-VA 2.0 is a native embodied AI model, not a fine-tuned video generator.
It uses a causal DiT with sparse MoE, a semantic tokenizer, and Foresight Reasoning for real-time control.

[AINews] not much happened today

2026-07-11 02:53 UTC

A relatively quiet day after a week of intense model releases, with news on GPT-5.6's confusing rollout, Meta's Muse Spark 1.1, open-source model optimizations, and security concerns.

GPT-5.6 launched with 36 variants and UX issues, prompting rapid corrections.
Meta's Muse Spark 1.1 offers near-frontier quality at aggressive pricing.

GDP.pdf: Can Frontier Models Master the Documents That Run the World?

2026-07-11 02:26 UTC

The GDP.pdf benchmark evaluates AI models on real-world PDF tasks across ten domains. All frontier models scored below 30%, with GPT-5.5 leading at 25%. The article highlights the critical importance of PDF mastery for AI agents and the serious consequences of failure in high-stakes fields like finance, law, and healthcare.

GDP.pdf benchmark consists of 100 real-world prompts and PDFs across ten professional domains.
Every frontier model scored under 30%, with GPT-5.5 achieving the highest score of 25%.

DeepSeek V3.2 Released on Hugging Bay

2026-07-11 01:44 UTC

DeepSeek V3.2 is now available on Hugging Bay, an open-source AI artifact registry offering provenance, license verification, and trusted hosting.

DeepSeek V3.2 has been published on Hugging Bay.
Hugging Bay is an open registry with provenance and trust features.

Meta turns off the Instagram feature that let users make AI deepfakes of public accounts

2026-07-10 23:49 UTC

Following significant backlash, Meta is turning off the feature it announced this week that let users generate AI images based on content from public Instagram accounts just by tagging them. The feature, as originally set up, meant that content from any public Instagram account could be used in AI creations without the account owner's permission.

Meta's newly announced AI image generation feature using public Instagram accounts has been disabled due to backlash.
The feature allowed users to create AI images by @-mentioning public accounts without explicit permission.

China's Open AI Models Are Advancing Its Global Soft Power

2026-07-10 21:45 UTC

China's open AI models are enhancing its global soft power by fostering international collaboration and innovation in the AI ecosystem.

Open AI models boost China's international cooperation and tech influence
Enhances China's soft power in the global AI landscape

Migrating a production AI agent to GPT 5.6

2026-07-10 20:40 UTC

Ploy migrated its AI agent from Claude Opus 4.8 to OpenAI's newly released GPT-5.6 Sol, achieving 2.2× faster builds, 27% lower cost, and improved visual scores. The migration involved solving issues with tool call argument filling, prompt caching differences, and reasoning replay, all of which were addressed through engineering optimizations.

GPT-5.6 Sol outperformed Claude Opus 4.8 in speed, cost, and visual quality
Tool call parameter filling issue resolved by schema transformation

Kyutai Releases MuScriptor: An Open-Weight Decoder-Only Transformer for Multi-Instrument Music Transcription to MIDI

2026-07-10 20:21 UTC

MuScriptor is an open-weight decoder-only Transformer from Kyutai and Mirelo that transcribes multi-instrument audio to MIDI. It uses a three-stage training pipeline: pre-training on 1.45M synthetic MIDIs, fine-tuning on 170k real recordings (11k+ hours), and reinforcement learning on 300 manually verified tracks. On the DTest benchmark, it achieves a Multi F1 of 48.2%, significantly outperforming the YourMT3+ baseline's 21.9%. Available in three sizes (103M, 307M, 1.4B parameters), with MIT-licensed inference code and CC BY-NC 4.0 weights.

MuScriptor is an open-weight decoder-only Transformer for multi-instrument music transcription to MIDI, developed by Kyutai and Mirelo.
Three-stage training: pre-training on synthetic data, fine-tuning on 170k real recordings, and RL post-training on 300 manually verified tracks.

An OpenAI model crushed top human programmers at a world coding competition

2026-07-10 18:16 UTC

At the 2026 AtCoder World Tour Finals, OpenAI's AI model defeated top human programmers in both heuristic and algorithmic divisions, solving problems that humans couldn't. The organizers awarded 'humanity surrenders' prizes. This may be the last time humans had a realistic chance to beat top AI in coding competitions.

OpenAI's model vastly outperformed humans in the heuristic division of the 2026 AtCoder Finals.
In the algorithmic division, it solved all five problems, including two none of the 12 humans could solve.

This Week in AI: Chips, Checks, and Changing Jobs

2026-07-10 16:04 UTC

This week, Christina Stathopoulos covers AI hardware breakthroughs (IBM sub-1nm chips, OpenAI/Broadcom Jalapeño, NVIDIA liquid cooling), expanding government oversight (Anthropic model access restored, OpenAI equity stake proposal), workforce evolution (forward-deployed engineers, SAP external hiring vs IKEA retraining), and a hopeful story about AI-powered earthquake alerts.

IBM unveils 0.7nm chip technology with 50% performance boost and 70% lower power consumption.
OpenAI and Broadcom launch Jalapeño, a chip designed specifically for LLM inference.

Fine-tune NVIDIA Nemotron 3 models with Amazon SageMaker AI serverless model customization

2026-07-10 15:35 UTC

This post explores the unique Nemotron 3 architecture, available fine-tuning techniques (SFT, RLVR, RLAIF), and provides a step-by-step guide to getting started with serverless customization using SageMaker Studio.

NVIDIA Nemotron 3 models feature a hybrid Mamba-Transformer Mixture-of-Experts architecture supporting up to 1M-token contexts.
Amazon SageMaker AI now offers serverless model customization for Nemotron 3 Nano and Super, requiring no infrastructure management.

Disaggregated prefill and decode for LLM inference on SageMaker HyperPod

2026-07-10 15:20 UTC

This post demonstrates how to implement disaggregated prefill and decode (DPD) with vLLM on Amazon SageMaker HyperPod using the HyperPod Inference Operator. DPD separates prefill and decode phases onto distinct GPU pools, eliminating interference from long prompts and improving latency. It covers architecture, use cases, and step-by-step deployment instructions.

DPD isolates prefill and decode on separate GPU pools connected via EFA RDMA.
It reduces tail latency and prevents long prompts from blocking ongoing decode requests.

How GPT-5.6 Reflects the New AI Regulation

2026-07-10 14:40 UTC

The GPT-5.6 release shows how much power the U.S. government now holds over the AI model landscape. ChatGPT Work highlights how OpenAI continues to evolve into an enterprise vendor.

The U.S. government's influence in AI regulation is increasing.
GPT-5.6's release demonstrates the impact of new regulatory frameworks.

Fine-Tuning Explained for Noobs (How Pretrained Models Learn New Skills)

2026-07-10 14:00 UTC

You don't need a PhD to understand fine-tuning. This article explains how pretrained models learn new skills through fine-tuning.

Pretraining teaches models general language knowledge, forming the foundation for fine-tuning.
Fine-tuning uses small, high-quality task-specific data to adapt a model for a particular task.

Google Research Introduces SensorFM: A Wearable Health Foundation Model Pretrained on One Trillion Minutes of Sensor Data

2026-07-10 08:52 UTC

Google Research, Google DeepMind, and university collaborators have introduced SensorFM, a foundation model for wearable health pretrained on over 1 trillion minutes of sensor data from 5 million participants. The ViT-1D masked-autoencoder backbone, trained on a massive corpus, demonstrates strong scaling behavior. With frozen embeddings and a PCA-50 linear probe, it outperforms feature-engineered baselines on 34 of 35 tasks. The paper also details an agentic classroom that searched 30,516 prediction heads and a clinician evaluation that grounds a Personal Health Agent.

SensorFM is pretrained on 5 million participants with over 1 trillion minutes of sensor data from 100+ countries and 20+ wearable models.
Adaptive and Inherited Masking (AIM) handles missing data effectively, reducing reconstruction error by up to 83.7% over baselines.

OpenAI Launches GPT-5.6 Sol/Terra/Luna, Codex Becomes ChatGPT Superapp

2026-07-10 06:19 UTC

OpenAI released three new GPT-5.6 models—Sol, Terra, Luna—alongside major app updates, including ChatGPT Work and Codex integration. The models show strong performance on benchmarks at lower costs, with Sol being the most capable. Independent evals confirm near-frontier results, especially in coding and agentic tasks.

OpenAI launched GPT-5.6 in three sizes: Sol (flagship), Terra (mid-range), Luna (budget).
New ultra reasoning effort coordinates multiple agents for complex tasks.

Meet LingBot-World-Infinity: An Open Causal World Model With An Agentic Harness

2026-07-10 04:38 UTC

Robbyant, Ant Group's embodied-intelligence unit, has released LingBot-World-Infinity (LingBot-World 2.0), a 14B causal video generation model that acts as an interactive world simulator. Its core innovations—Mixture of Bidirectional and Autoregressive (MoBA) attention and distribution matching distillation—tackle long-horizon drift. A Director-Pilot agentic harness enables infinite video generation. The paper demonstrates a 60-minute session, but the open-source release includes only one checkpoint and a 480P script, lacking deployment code and quantitative benchmarks, under a non-commercial license.

LingBot-World-Infinity is a 14B-parameter causal video generation model by Robbyant (Ant Group) for interactive world simulation.
MoBA attention and distribution matching distillation address long-horizon drift in world models.

GPT-5.6 Is Here: Sol, Terra, and Luna

2026-07-10 04:19 UTC

OpenAI launches GPT-5.6 with three models: Sol (flagship), Terra (workhorse), and Luna (fast). Free for all users. Covers pricing, benchmarks, safety, and hands-on tests.

Three models: Sol (flagship), Terra (workhorse), Luna (fast), all accessible without subscription.
Pricing: Sol $5/$30, Sol Fast $12.50/$75; Terra $2.50/$15; Luna $1/$6 per million tokens.
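
To put those prices in per-request terms, the sketch below assumes each pair is the input price followed by the output price per million tokens (the summary does not spell this out) and uses a hypothetical request of 20,000 input tokens and 2,000 output tokens.

```python
# (input $/M tokens, output $/M tokens); the input/output reading is an assumption.
PRICES = {
    "Sol": (5.00, 30.00),
    "Sol Fast": (12.50, 75.00),
    "Terra": (2.50, 15.00),
    "Luna": (1.00, 6.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical request size, chosen only for illustration.
costs = {m: request_cost(m, input_tokens=20_000, output_tokens=2_000) for m in PRICES}
for model, cost in costs.items():
    print(f"{model:<9} ${cost:.3f}")
```

At these list prices every tier charges six times as much for an output token as for an input token, so output-heavy workloads cost more than the input price alone suggests.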

Time-to-Collision Based Dynamic Obstacle Avoidance Using Pretrained Vision Models for Robots in Unstructured Environments

2026-07-10 04:00 UTC

This data-efficient, interpretable method for vision-based dynamic obstacle avoidance uses pretrained models (UniDepth, SuperPoint, SuperGlue) to compute per-keypoint time-to-collision (TTC) and select an evasive motion. Evaluated on the M3ED dataset, it achieves 0.49 precision and 0.38 recall at detecting frames with TTC < 1 s and detects 20 of 22 obstacles. No model training is required; hyperparameter tuning uses only 74 seconds of data.

Uses pretrained UniDepth and SuperPoint+SuperGlue to avoid training robot-specific models
Computes time-to-collision (TTC) per keypoint and selects ground-plane motion primitive
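
As a rough illustration of per-keypoint TTC, the sketch below uses the simplest possible estimate: the current depth of a matched keypoint divided by how fast that depth is shrinking between two frames. This is an assumption made for illustration; the paper's actual formulation, and the way it uses UniDepth and SuperPoint+SuperGlue outputs, may differ. The depths and frame interval are made up.

```python
import math


def time_to_collision(depth_prev: float, depth_curr: float, dt: float) -> float:
    """Seconds until depth reaches zero at the current closing speed.

    Returns infinity when the keypoint is not getting closer.
    """
    closing_speed = (depth_prev - depth_curr) / dt  # m/s, positive when approaching
    if closing_speed <= 0:
        return math.inf
    return depth_curr / closing_speed


# Hypothetical matched keypoints: (depth in previous frame, depth in current frame), metres.
keypoints = [(4.0, 3.8), (2.0, 1.5), (5.0, 5.2)]
dt = 0.1  # seconds between frames (hypothetical)

ttcs = [time_to_collision(prev, curr, dt) for prev, curr in keypoints]
risky = [i for i, t in enumerate(ttcs) if t < 1.0]  # same 1 s threshold as the evaluation

print(ttcs)
print("keypoints with TTC < 1 s:", risky)
```

Only the second keypoint falls under the 1 s threshold here; a planner would then pick a motion that moves away from that keypoint's direction.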

STEMbot: A Compliant Robot for Under-Canopy Plant Navigation

2026-07-10 04:00 UTC

STEMbot is a miniature climbing robot designed for autonomous navigation under plant canopies to enable early pest detection. It integrates PIN-SLAM and a semantic OcTree, and uses a manifold-constrained A* planner, demonstrating reliable traversal on stems of 7-33mm with reconstruction accuracy under 1cm.

Addresses labor cost in organic farming by enabling early pest detection under canopy.
Combines geometric PIN-SLAM with semantic OcTree for robust localization and mapping.

APIVOT: Adaptive Planning with Interleaved Vision-Language Thoughts

2026-07-10 04:00 UTC

APIVOT is a VLM-based planner that adaptively interleaves language and visual thoughts for long-horizon robot planning, achieving significant gains in spatially constrained kitchen tasks.

APIVOT interleaves language thoughts for semantic reasoning and visual thoughts for geometric feasibility verification.
Outperforms general VLMs in long-horizon kitchen tasks, especially in spatially constrained settings.

SAGA: Stable Acceleration Guidance for Autoregressive Video Generation

2026-07-10 04:00 UTC

This paper proposes SAGA, a training-free stable acceleration guidance method that reduces temporal instability in autoregressive video diffusion. Using acceleration-domain spectral guidance and structured noise initialization, it reduces flickering and jitter, improving both temporal stability and image quality.

Autoregressive video diffusion amplifies temporal errors, causing flickering and structural drift.
SAGA uses acceleration-domain spectral guidance and noise initialization without retraining.

LightCrafter: PBR-Conditioned Video Diffusion Refinement for Controllable and Consistent Relighting

2026-07-10 04:00 UTC

LightCrafter is a novel hybrid pipeline for video relighting that reformulates the task as video translation of a proxy PBR rendering. It combines the strengths of physically-based rendering and diffusion models to achieve long-form temporal consistency and fine-grained lighting control, outperforming prior state-of-the-art on real-world benchmarks and providing a synthetic benchmark for further analysis.

Proposes LightCrafter hybrid pipeline that turns video relighting into proxy video translation, avoiding the need to teach diffusion models about illumination concepts.
Leverages PBR proxy for lighting control and post-trains CogVideoX to capture complex effects like global illumination.

FedTR: Federated Learning Framework with Transfer Learning for Industrial Visual Inspection

2026-07-10 04:00 UTC

FedTR combines federated learning and transfer learning to address data scarcity and complexity in industrial visual inspection, achieving high accuracy on label defect identification.

FedTR integrates transfer learning with federated learning for industrial visual inspection.
It pre-trains on public data then fine-tunes on distributed private data.

LOGOS: Language-guided Oriented Object Detection in Aerial Scenes

2026-07-10 04:00 UTC

Proposes LOGOS, a novel transformer-based approach that leverages textual prompts to guide oriented object detection in aerial images, outperforming existing methods on the DOTA dataset, especially in dense and rotated scenarios.

LOGOS uses prompt-modulated content queries to dynamically adjust model focus, improving detection accuracy in complex environments.
Experiments on DOTA show LOGOS surpasses state-of-the-art in dense and rotated object scenarios.

Adversarial Decoys: Misdirecting Attention-Based Defenses in ViT

2026-07-10 04:00 UTC

Researchers propose adversarial decoys, independently optimized image patches that redirect attention away from adversarial regions, bypassing attention-based defenses in Vision Transformers. The approach decouples misclassification and defense evasion, is attack-agnostic, and preserves attack effectiveness. Experiments on ImageNet reveal a fundamental limitation of using attention magnitude as an indicator of adversarial relevance.

Adversarial decoys are independently optimized patches that redirect attention in Vision Transformers, bypassing attention-based defenses.
The method decouples misclassification and defense evasion and is attack-agnostic, integrable with any existing patch attack.

GIRAF: Towards Generalizable Human Interactions with Articulated Objects

2026-07-10 04:00 UTC

GIRAF is a text-conditioned diffusion model for generating realistic full-body interactions with articulated objects. It addresses limitations of prior works by jointly reasoning about locomotion, contact, and articulation, using an object-centric representation, mixed-domain training, and contact-based augmentation, achieving strong generalization to unseen object configurations.

Prior models are limited to static objects or hand-only manipulation, lacking full-body coordination with articulated objects.
GIRAF introduces an object-centric representation that unifies hand-object contact across object surfaces.

DreamCharacter-1: From 3D Generative Foundation Models to Product-Ready Character Generation

2026-07-10 04:00 UTC

DreamCharacter-1 is a lightweight post-adaptation framework that calibrates pretrained 3D foundation models for high-fidelity, production-ready 3D character generation. It includes geometry post-training, texture post-training, and inference acceleration, consistently outperforming state-of-the-art methods.

Geometry post-training enhances fine-grained surface details via geometric preference optimization.
Texture post-training synthesizes high-resolution textures and improves occluded regions.

Hallucination Self-Play: Bootstrapping Reinforced Detector via Evolved Generator

2026-07-10 04:00 UTC

Identifying faithfulness hallucinations in LLM-generated outputs remains challenging due to the scarcity of high-quality annotated data. This paper introduces Hallucination Self-Play (HSP), a framework where a detector and generator bootstrap each other. The detector is fine-tuned on human labels, then used as a reward model to train the generator via RLAIF to produce harder-to-detect hallucinations. The evolved generator's outputs further optimize the detector via rule-based RL. Experiments on RAGTruth and two model families show a small LLM can match or outperform advanced LLMs without external supervision.

HSP enables iterative improvement of hallucination detection through self-play between detector and generator
Detector fine-tuned on human data then serves as reward model for generator via RLAIF

A Reliability Assessment of LALM Audio Judges for Full-Duplex Voice Agents

2026-07-10 04:00 UTC

A new study evaluates the reliability of Gemini models as audio judges for full-duplex voice agent conversations. Using 209 stereo sessions scored on 8 dimensions, Gemini 2.5 Flash shows high agreement with human raters on most dimensions, with cost savings of roughly two orders of magnitude. The paper also cautions that model swaps require re-validation on calibration data.

Gemini 2.5 Flash's LALM-human Spearman rho differs from human-human rho by at most 0.07 on 5 of 8 dimensions
LALM agrees within 1 point of the three-rater human mean on 60-92% of sessions for 6 dimensions
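
Both statistics in those bullets are simple to compute. The sketch below implements Spearman's rho (Pearson correlation of tie-averaged ranks) and the share of sessions where the judge lands within one point of the human mean, in plain Python. The scores are hypothetical and are not taken from the paper.

```python
def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def within_one_point(judge: list[float], human_mean: list[float]) -> float:
    """Share of sessions where the judge is within 1 point of the human mean."""
    return sum(abs(j - h) <= 1 for j, h in zip(judge, human_mean)) / len(judge)


# Hypothetical scores for six sessions on one dimension (1-5 scale).
human_mean = [4.3, 2.0, 3.7, 5.0, 1.3, 3.0]
judge = [4, 2, 5, 5, 1, 3]

rho = spearman_rho(judge, human_mean)
agreement = within_one_point(judge, human_mean)
print(f"rho = {rho:.3f}, within one point = {agreement:.0%}")
```

Rank correlation asks whether the judge orders sessions the way humans do, while the within-one-point share asks whether its absolute scores are close, so a judge can do well on one and poorly on the other.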

When Implausible Tokens Get Reinforced: Tail-Aware Credit Calibration for LLM Reinforcement Learning

2026-07-10 04:00 UTC

This paper identifies a failure mode called Positive-Credit Contamination in RL for LLMs, where low-probability erroneous tokens receive the same positive credit as plausible ones. The proposed TACO method computes a tail-risk score to calibrate credit assignment, outperforming GRPO baselines across three LLMs and eight benchmarks while improving training stability in long-horizon RL.

Identifies Positive-Credit Contamination: uniform credit assignment reinforces flawed reasoning by giving the same positive credit to erroneous tail tokens.
Proposes TACO, which uses a tail-risk score based on local generation context to modulate positive updates for risky tokens.

A Multi-cluster Boundary Learning Method for Out-of-Scope Intent Detection via MiniLM Embedding

2026-07-10 04:00 UTC

This paper proposes a multi-cluster boundary learning method using MiniLM embeddings for out-of-scope (OOS) intent detection. It addresses the accuracy drop of traditional multi-class classification and the large parameter counts of LLM-based embeddings, achieving state-of-the-art performance on three public datasets.

Proposes a multi-cluster boundary learning method for OOS intent detection using MiniLM embedding.
Addresses limitations of traditional multi-class classification and LLM-embedding methods.

When Debiasing Backfires: Counterintuitive Side Effects of Preprocessing-Based Stereotype Mitigation

2026-07-10 04:00 UTC

Preprocessing-based debiasing methods in NLP, while reducing stereotypes for targeted groups, can cause unintended shifts that increase stereotyping or counter-stereotyping for other demographics, including unrelated categories. The study demonstrates these side effects across model families and preprocessing strategies, and argues for side-effect-aware mitigation practices.

Preprocessing-based debiasing can induce side effects that increase stereotyping for non-targeted demographics.
Side effects occur across encoder-only and decoder-only models, multiple preprocessing strategies, and different data scales.

Scalable and Culturally Specific Stereotype Dataset Construction via Human-LLM Collaboration

2026-07-10 04:00 UTC

This research proposes a cost-efficient human-LLM collaborative annotation framework to construct multilingual stereotype datasets. Applied to Spanish, it yields EspanStereo, covering multiple Spanish-speaking countries. Evaluations show significant variation in LLM stereotypical behavior across countries, highlighting the need for culturally grounded assessments.

Proposes a human-LLM collaborative framework that combines LLM-generated candidate stereotypes with in-culture annotator validation.
Constructs EspanStereo, the first Spanish stereotype dataset spanning multiple countries, capturing both documented and culturally specific biases.

How Do I Know What to Say Next? Barenholtz's Autogenerative Theory as an Enrichment of Harrisean Integrationism

2026-07-10 04:00 UTC

This paper argues that Barenholtz's autogenerative theory of language enriches Harrisean integrationism by providing a structural mechanism for prospective openness, a computational correlate for semiotic continuity, and a theory of the archive. It offers insights for NLP and LLM design.

Harrisean integrationism leaves explanatory gaps in sign openness, semiotic continuity, and archive structure.
Barenholtz's autogenerative theory fills these gaps without undermining integrationist commitments.
