AI News Hub

Model updates

Anthropic Ships Claude Opus 4.8 Alongside Dynamic Workflows and Cheaper Fast Mode, With Workflows Capped at 1,000 Subagents

Anthropic launches Claude Opus 4.8 with two Claude Code updates: dynamic workflows that coordinate up to 1,000 subagents in parallel, and a cheaper fast mode that speeds up output 2.5x. Both are in research preview.

  • Dynamic workflows let Claude write orchestration scripts for parallel subagents, with up to 16 concurrent and 1,000 total per run.
  • Fast mode delivers 2.5x faster output for Opus 4.8 at one-third the previous cost, and requires usage credits.
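The two caps quoted above (16 concurrent, 1,000 total per run) describe a common bounded fan-out pattern. The sketch below is only an illustration of that pattern with Python's asyncio; it is not Anthropic's orchestration API, and `run_subagent` and `run_workflow` are hypothetical names invented for this example.

```python
import asyncio

MAX_CONCURRENT = 16   # concurrency cap quoted in the item above
MAX_TOTAL = 1000      # per-run subagent cap quoted in the item above


async def run_subagent(task_id: int, gate: asyncio.Semaphore, stats: dict) -> int:
    """Hypothetical stand-in for one subagent; it only yields to the event loop."""
    async with gate:  # blocks while MAX_CONCURRENT subagents are already active
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0)  # placeholder for real work
        stats["active"] -= 1
    return task_id


async def run_workflow(num_tasks: int) -> dict:
    """Fan out num_tasks subagents, never more than MAX_CONCURRENT at once."""
    if num_tasks > MAX_TOTAL:
        raise ValueError(f"{num_tasks} subagents exceeds the {MAX_TOTAL} per-run cap")
    gate = asyncio.Semaphore(MAX_CONCURRENT)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(run_subagent(i, gate, stats) for i in range(num_tasks))
    )
    return {"completed": len(results), "peak": stats["peak"]}


summary = asyncio.run(run_workflow(50))
print(summary)
```

The semaphore enforces the concurrency cap, while the up-front check enforces the per-run total. How the real feature schedules or rejects work is not stated in the item.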

Training Azerbaijani language models on Amazon SageMaker AI

Azercell Telecom collaborated with the AWS Generative AI Innovation Center to build an Azerbaijani LLM on Amazon SageMaker AI, achieving 23% higher training throughput, 58% lower peak GPU memory, and 2× token efficiency via a custom tokenizer, FSDP, and Liger Kernel optimizations.

  • Azercell developed a production-ready Azerbaijani LLM framework using Amazon SageMaker AI.
  • Custom tokenizer reduced tokens per word from 3.22 to 1.59, doubling encoding efficiency.
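As a quick check of the "doubling" claim, the arithmetic behind the two tokens-per-word figures can be worked through directly. Only 3.22 and 1.59 come from the item above; the 1,000-word sample document is a hypothetical chosen for illustration.

```python
TOKENS_PER_WORD_BEFORE = 3.22  # figure quoted above
TOKENS_PER_WORD_AFTER = 1.59   # figure quoted above

# How many times fewer tokens the custom tokenizer needs per word.
efficiency_gain = TOKENS_PER_WORD_BEFORE / TOKENS_PER_WORD_AFTER

# Fraction of tokens saved for the same text.
token_reduction = 1 - TOKENS_PER_WORD_AFTER / TOKENS_PER_WORD_BEFORE


def tokens_needed(words: int, tokens_per_word: float) -> int:
    """Estimate the token count for a text of the given word length."""
    return round(words * tokens_per_word)


SAMPLE_WORDS = 1_000  # hypothetical document length, not from the article
before_tokens = tokens_needed(SAMPLE_WORDS, TOKENS_PER_WORD_BEFORE)
after_tokens = tokens_needed(SAMPLE_WORDS, TOKENS_PER_WORD_AFTER)

print(f"efficiency gain: {efficiency_gain:.2f}x")
print(f"token reduction: {token_reduction:.1%}")
print(f"{SAMPLE_WORDS} words: {before_tokens} -> {after_tokens} tokens")
```

The ratio works out to roughly 2.03x, which is consistent with the "2× token efficiency" figure in the summary.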

Anthropic ships Claude Opus 4.8 as a "modest but tangible improvement" that tops GPT-5.5 in most benchmarks

Anthropic releases Claude Opus 4.8, which beats GPT-5.5 and Gemini 3.1 Pro in most benchmarks. The model also catches its own coding errors four times more often than its predecessor. Alongside the launch, Anthropic is rolling out dynamic workflows that can spin up hundreds of parallel sub-agents to handle tasks like codebase-wide migrations.

  • Claude Opus 4.8 outperforms GPT-5.5 and Gemini 3.1 Pro in most benchmarks.
  • The model catches its own coding errors four times more often than its predecessor.

AI Model Release Tracker: Opus 4.8's misalignment rates similar to Claude Mythos Preview

Not every new model is all it's cracked up to be. Our tracker keeps each release in context with its peers, so you know which models are worth your time. This article summarizes major model releases of 2026 so far, including Claude Opus 4.8, GPT-5.5 Instant, Nemotron 3 Nano Omni, GPT-5.5, ChatGPT Images 2, Claude Opus 4.7, Claude Mythos (Preview), GPT-5.4, Claude Opus 4.6, and GPT-5.3-Codex, with details on their features and significance.

  • Anthropic's Opus 4.8 offers faster thinking at lower cost, claims lower misalignment rates than Opus 4.7, comparable to Mythos Preview.
  • OpenAI's GPT-5.5 Instant reduces hallucinations by 52.5%, becomes default ChatGPT model, helping reduce misinformation spread.

Claude Opus 4.8 is here: effort controls, dynamic workflows, cheaper fast mode, better honesty, less deception

Anthropic released Opus 4.8 with user-controllable effort, dynamic workflows for large-scale coding, and a fast mode at one-third the previous cost. Benchmarks show it leads GPT-5.5 and Gemini 3.1 Pro except in terminal coding. It also shows improvements in honesty and autonomy support, along with reduced deception.

  • Users can now control Claude's "effort" level to balance response quality and speed.
  • Dynamic workflows (research preview) allow Claude to plan and run hundreds of parallel subagents in a single session, enabling codebase-scale migrations.

Claude Opus 4.8 is now available on AWS

Anthropic's most advanced Opus model, Claude Opus 4.8, is now available on Amazon Bedrock and the Claude Platform on AWS. It delivers improvements in coding, agentic tasks, and professional work with greater consistency and autonomy for long-running production workflows.

  • Claude Opus 4.8 is Anthropic's most advanced Opus model, now available on AWS.
  • It offers enhanced performance in coding, multi-stage autonomous tasks, and professional work with lower output variance.

Claude’s new model is more ‘honest’ when it messes up

Anthropic is releasing Claude Opus 4.8 on Thursday, touting the model's 'honesty.' Early testers found it more likely to flag uncertainties and less likely to make unsupported claims. Evaluations show it is about 4x less likely than its predecessor to allow code flaws to pass unremarked. Users can also direct the amount of effort Claude puts into a task, and a 'dynamic workflows' feature allows parallel subagents.

  • Claude Opus 4.8 is more inclined to flag uncertainties and avoid unsupported claims.
  • It is about 4x less likely than its predecessor to overlook code flaws.

Catch up on 12 major I/O 2026 moments

Here are 12 of the biggest Google I/O 2026 keynote moments, including news about Gemini Omni, Gemini 3.5 Flash, information agents in Search, Universal Cart, Neural Expressive, Gemini Spark, and intelligent eyewear.

  • Gemini Omni creates anything from any input, starting with video.
  • Gemini 3.5 Flash delivers frontier performance for agents and coding.

Google launches a tiny board that runs Gemma 3 locally

Google unveiled the new Coral Board at Google I/O - a compact single-board computer for on-device AI. It runs Gemma 3 270M locally and features a RISC-V based NPU.

  • Coral Board is a compact SBC for on-device AI, targeting headphones, AR glasses, and smartwatches
  • It features a RISC-V based Coral NPU and a Synaptics Astra SL2619 chip

Tweaking Local Language Model Settings with Ollama

This article dives deep into Ollama's configuration engine, covering how to fine-tune local language model parameters using the Modelfile, optimize hardware performance with server environment variables, and format prompt flows with Go template syntax.

  • The Ollama Modelfile is a declarative configuration file that defines model behavior, including base model, system instructions, and parameters.
  • Sampling parameters (temperature, Top-K, Top-P, Min-P) control the creativity and determinism of the model's outputs.
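To make the four sampling parameters concrete, here is a minimal pure-Python sketch of what each one does to a toy next-token distribution. It uses the commonly cited definitions of temperature, Top-K, Top-P, and Min-P; the order in which the filters are applied and the toy logits are assumptions for illustration, not a description of Ollama's internals.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_candidates(probs, top_k=0, top_p=1.0, min_p=0.0):
    """Return surviving token indices, sorted by descending probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        # Top-K: keep only the k most likely tokens.
        order = order[:top_k]
    if top_p < 1.0:
        # Top-P: keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    if min_p > 0.0:
        # Min-P: drop tokens below min_p times the top token's probability.
        threshold = min_p * probs[order[0]]
        order = [i for i in order if probs[i] >= threshold]
    return order


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
probs = softmax(toy_logits)
sharp = softmax(toy_logits, temperature=0.5)

print([round(p, 3) for p in probs])
print(filter_candidates(probs, top_k=3))
print(filter_candidates(probs, top_p=0.8))
print(filter_candidates(probs, min_p=0.1))
```

Tighter settings (smaller Top-K, lower Top-P, higher Min-P, lower temperature) shrink the candidate pool and push output toward determinism; looser settings do the opposite.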

Rivian’s software chief thinks you don’t need CarPlay or buttons

In a Decoder podcast interview, Rivian CSO Wassym Bensaid discusses the VW joint venture, the new AI-powered Rivian Assistant, and why he believes voice interfaces will replace buttons and CarPlay isn't needed.

  • Rivian's joint venture with Volkswagen (RV Tech) combines Rivian's software culture with VW's scale.
  • The Rivian Assistant is an AI agent deeply integrated into the vehicle's zonal architecture.

World Models Take Over from Language Models: Company Pioneers Physical AGI 'Dual Pyramid' System, Universal Robots Enter the 'Home Era'

Jijia Vision unveiled the world's first physical AGI 'Dual Pyramid' system, launching the home robot Shiguang S1 with 100-unit household orders, targeting the 'GPT-3 moment' of physical AGI within 12 months.

  • Jijia Vision introduces the 'Dual Pyramid' system comprising a data pyramid and an algorithm pyramid for physical AGI.
  • The Shiguang S1 home robot adopts a wheeled-arm configuration and has secured 100-unit real-home orders.

Mistral rebrands Le Chat as Vibe, betting its chatbot's future is as a full-blown work agent

Mistral AI is renaming its chatbot Le Chat to Vibe and bundling chat, coding agents, and a new Work Mode under one brand. Work Mode connects to Google Workspace, Outlook, Slack, or GitHub and independently handles tasks such as emails, reports, or pull requests. The Pro plan's price has been cut from €17.99 to €14.99, although Mistral has not specified concrete usage limits. The move positions the company more directly against the agent-based offerings from OpenAI, Google, and Anthropic.

  • Mistral AI rebrands Le Chat as Vibe, integrating chat, coding agents, and a new Work Mode.
  • Work Mode connects to Google Workspace, Outlook, Slack, or GitHub to autonomously handle tasks.

Show HN: Local Coding Agent with LLMs to Delegate Tool Calls to Small AI Models

Open Agent Tools (oats) is a self-hosted AI framework that enables small-to-large local models to use local source code for tool-calling, freeing up expensive large model tokens by delegating tasks to smaller models.

  • oats allows local AI models to use local source code for tool-calling without HTTP or MCP.
  • It mines over 20,000 GitHub repos to create reusable prompt indices.

Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate

Perplexity AI open-sourced a Rust reimplementation of their Unigram tokenizer, achieving 5x lower latency than Hugging Face's tokenizers crate and reducing CPU utilization by 5-6x in production. The optimizations include double-array trie, bitmap packing, and huge pages.

  • Perplexity AI rewrote the Unigram tokenizer in Rust, achieving 5x lower p50 latency vs Hugging Face tokenizers crate.
  • Three optimizations: double-array trie, bitmap and cache-line packing, and huge pages.

Mistral to explore designing own chips, CEO says

Mistral AI CEO Arthur Mensch confirms the company is exploring custom chip development to reduce infrastructure costs and compete with OpenAI and Anthropic. The French startup also announced a new inference data center in France and an enterprise agent platform called Vibe.

  • Mistral AI is considering designing its own custom chips to lower deployment costs.
  • The company announced a new data center in France dedicated to AI inferencing.

A Coding Guide to Implement a pgvector-Powered Semantic, Hybrid, Sparse, and Quantized Vector Search System

This tutorial builds a complete pgvector playground in Google Colab, covering installation, embedding creation, HNSW indexing, semantic search, filtered search, distance metric comparisons, half-precision storage, binary quantization, sparse vector search, hybrid retrieval, and vector aggregation, all using open-source tools without external API keys.

  • Set up PostgreSQL with pgvector extension in Google Colab from scratch.
  • Generate embeddings with SentenceTransformers and build HNSW indexes for efficient search.
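One of the techniques listed above, binary quantization, can be illustrated without a database. The sketch below is plain Python rather than pgvector SQL: it reduces each vector to sign bits, shortlists candidates by Hamming distance, then re-ranks the shortlist with full-precision cosine distance. The vectors, document names, and shortlist size are hypothetical, and the tutorial's actual queries may be structured differently.

```python
import math


def binary_quantize(vec):
    """Keep only the sign of each dimension: 1 if positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a, b):
    """Count the positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus cosine similarity of two full-precision vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


# Hypothetical 4-dimensional embeddings; real ones have hundreds of dimensions.
docs = {
    "a": [0.9, 0.1, -0.3, 0.4],
    "b": [-0.8, 0.2, 0.5, -0.1],
    "c": [0.7, -0.2, -0.4, 0.6],
    "d": [-0.5, -0.6, 0.3, 0.2],
}
query = [0.8, 0.05, -0.2, 0.5]

# Stage 1: cheap shortlist using 1 bit per dimension.
query_bits = binary_quantize(query)
shortlist = sorted(
    docs, key=lambda name: hamming(binary_quantize(docs[name]), query_bits)
)[:2]

# Stage 2: re-rank the shortlist with the original vectors.
ranked = sorted(shortlist, key=lambda name: cosine_distance(docs[name], query))
print(ranked)
```

The trade-off is storage and speed against recall: the bit vectors are much smaller and faster to compare, and the re-ranking step recovers some of the precision lost in quantization.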

7B Model Beats o3 and GPT-5: Medical AI Agents Teach Models Where and How to Look

The LeapQuest team at Shanghai Innovation Institute, in collaboration with multiple universities, introduces a new medical AI paradigm that enables models to actively use visual tools during reasoning, transforming from passive input receivers to active evidence seekers. Two papers are accepted at ICML 2026.

  • LeapQuest proposes Ophiuchus and MedScope for medical images and videos, adopting the Think with Images/Videos paradigm.
  • Ophiuchus-7B achieves an average score of 68.0 on 8 VQA benchmarks, surpassing o3 (62.2) and GPT-5 (59.9).

Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

This paper introduces Simulation-Informed Diffusion (SID), a decentralized framework using constraint-aware diffusion models (CADM) to first simulate neighbors' future trajectories and then plan own trajectories under safety constraints. SID enables a minimal communication scheme triggered only in congested scenarios and outperforms baselines, scaling to 108 robots and 160 obstacles.

  • SID uses CADM to simulate neighbor trajectories for decentralized collision avoidance
  • Minimal communication scheme coordinates only when necessary

Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data

This paper presents a transformer-based architecture called Trinity that jointly performs class-specific semantic segmentation and class-agnostic terrain segmentation in a unified network. It segments terrain regions based purely on visual appearance without predefined labels or robot-dependent traversability scores, enabling robot-agnostic visual terrain priors for downstream tasks. The authors extend the OAISYS simulator to create the RUGDSynth synthetic dataset and provide the EXTerra real-world dataset. Experiments demonstrate the approach's effectiveness in complex outdoor environments.

  • Trinity architecture unifies class-agnostic terrain segmentation with semantic segmentation
  • Segments terrains based on visual appearance without predefined labels for better transferability

Agentic Language-to-Objective Synthesis for Optofluidic Assembly

Researchers introduce Speak-to-Objective, a modular agentic pipeline that uses a conditioned LLM to translate spoken or written commands into fully differentiable objective functions for assembling microparticles in a constraint-aware inverse solver and on an experimental optofluidic platform. The approach separates what to assemble from how to actuate, learns from user feedback, and demonstrates natural-language-programmable microscale assembly using laser-induced thermoviscous flows.

  • Speak-to-Objective pipeline translates natural language into differentiable objective functions for microparticle assembly.
  • It uses a perceive->compose->propose->act->report&learn loop, treating the objective as the interface between intent and actuation.

Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

Uni-LaViRA is a unified agentic architecture for embodied navigation that reduces navigation decision to a single Language-Vision-Robot Actions Translation. It leverages pretrained MLLMs in a zero-shot manner across four task families and four real robots, using TODO List Memory and Second Chance Backtrack mechanisms to achieve self-correcting navigation without training.

  • Generality in navigation can be obtained structurally, not only through data scale.
  • Uni-LaViRA decomposes navigation into a language action (semantic direction) and a vision action (pixel target), both within the output manifold of MLLMs.

SCALE-COMM: Shared, Contrastively-Aligned Latent Embeddings for MARL Communication

SCALE-COMM is a self-supervised framework that decouples communication learning from policy optimization, learning compact, stable, and policy-relevant latent messages to improve coordination in multi-agent reinforcement learning. It outperforms existing methods on benchmarks and a realistic warehouse task, offering better stability, sample efficiency, and throughput.

  • Decouples communication learning from policy optimization to reduce interference.
  • Uses contrastive learning to enforce consistency across agents and time.

Representation-Conditioned Diffusion Models for Guided Training Data Generation

This work proposes representation-conditioned diffusion models that leverage learned representations from DINOv2, DINOv3, and CLIP to generate synthetic image data. On ImageNet100, this approach outperforms class-conditioned generation by +10.76 p.p. top-1 accuracy. Scaling synthetic data can even surpass real-data training by +2.0 p.p. The method also excels in data augmentation and sample filtering, offering a promising way to augment or replace real datasets in large-scale visual learning.

  • Representation-conditioned diffusion models outperform class-conditioned ones by 10.76 p.p. on ImageNet100.
  • Scaled synthetic datasets can beat real-data-trained classifiers by 2.0 p.p. top-1 accuracy.

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

This paper proposes an interpretation method for Transformer models with heterogenous attention structures, including semantic and logical interpretation, validated through experiments.

  • Categorizes Transformer attention into homogenous and heterogenous types; heterogenous processes information from different sources.
  • Proposes a generic interpretation method for heterogenous attention structures.

Fine-Tuning Vision-Language Models for Understanding Current Damage and Scoring Priority with Quality Guard Agent

This paper proposes a method for automating bridge damage understanding and repair priority scoring using fine-tuned Vision-Language Models (VLMs). The authors fine-tune LLaVA-1.5-7B with QLoRA on up to 4,000 paired bridge damage images and inspection text records, evaluating on a fixed test set of 800 images. Results show that 2,000 training samples achieve near-optimal validation loss in 2.9 hours, with diminishing returns beyond that. A two-stage Quality Guard using a fine-tuned Swallow-8B SLM rejects low-quality VLM outputs before priority scoring.

  • Fine-tuned LLaVA-1.5-7B model for automated bridge damage identification and priority scoring
  • 2,000 training samples achieve near-optimal performance; more data yields diminishing returns

From Affect to Complex Behavior: Advancing Multimodal Human-Centered AI at the 10th ABAW Workshop & Competition

The 10th ABAW Workshop and Competition at CVPR 2026 advances multimodal human-centered AI by introducing new challenges including emotional mimicry intensity estimation, ambivalence/hesitancy recognition, and fine-grained violence detection, alongside traditional affect estimation and recognition tasks. The competition leverages large-scale in-the-wild datasets, and the paper track covers a broad range of topics from pose estimation to fairness and robustness.

  • ABAW 2026 introduces novel challenges: emotional mimicry intensity, ambivalence recognition, and violence detection.
  • Workshop continues dual structure with competition and paper tracks.

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

Large language models (LLMs) are increasingly used as proxies for computational social analysis, but their ability to faithfully represent human communities' 'thick descriptions' remains a critical challenge. This paper introduces CARE (Community-Aware Reaction Evaluation), a reaction-centered framework that benchmarks LLM-simulated discourse against authentic community responses to real-world news. By characterizing a fine-grained spectrum of illocutionary tones, the diagnosis reveals a persistent 'realism gap': steering LLMs with explicit community prompts fails to inherently improve simulation fidelity. Analysis further identifies divergent behavioral signatures among frontier models, suggesting current alignment strategies are insufficient for capturing the sociolinguistic dynamics of online groups.

  • CARE framework evaluates LLM simulation fidelity by analyzing authentic community reaction tones
  • Current LLM alignment strategies fail to adequately capture online community sociolinguistic dynamics

From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons

A new framework called FLUID adapts autoregressive language models to diffusion models for efficient parallel text generation, using Strictly Causal Alignment to reuse GPT checkpoints and Elastic Horizons to dynamically adjust denoising steps. It achieves state-of-the-art performance with significantly reduced training costs.

  • FLUID bridges AR and diffusion models by enforcing Strictly Causal Alignment, enabling initialization from GPT-style checkpoints.
  • Elastic Horizons uses entropy to dynamically adapt denoising strides based on local information density.

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

Researchers identify a Stability-Expressivity Gap in spoken language models when using synthetic data for low-resource languages, and propose two self-alignment frameworks (DGSA and TDSC) that recover prosodic variability and outperform commercial systems like ElevenLabs and Gemini Pro, enabling zero-shot voice cloning for Lao.

  • Spoken Language Models (SLMs) for low-resource languages suffer from a trade-off between phonetic accuracy and prosodic expressivity when trained on synthetic data.
  • The proposed Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity by separating prosody and timbre.

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

BioELX is a novel two-stage framework for cross-lingual biomedical entity linking that requires no annotated training data. It enhances SapBERT with multilingual aliases from Wikidata and uses a pre-trained LLM for context-aware disambiguation. Experiments on five benchmarks show significant improvements, especially for low-resource languages like Turkish, Korean, and Thai.

  • Proposes BioELX, a zero-shot cross-lingual BEL framework using alias-based retrieval and LLM ranking.
  • In Stage 1, enriches SapBERT with multilingual aliases from Wikidata for better candidate retrieval.

RAG-Coding: Enhancing LLM Medical Coding with Structured External Knowledge

RAG-Coding is an agentic method for automated ICD-10-CM coding that orchestrates four large language model (LLM) agents and grounds decisions in external knowledge sources, improving coding accuracy and clinical compliance. On the MDACE dataset, it outperforms the best LLM baseline by 8-13% micro-F1 and 2-8% macro-F1. Compared to PLM-ICD, RAG-Coding shows higher micro recall (+11%) but lower micro precision (-6%), with comparable F1 scores. Ablation studies confirm the importance of external knowledge. The authors also release MDACE-2025, updated with expert re-annotations based on 2025 guidelines, enabling finer-grained evaluation.

  • RAG-Coding uses four LLM agents and external knowledge sources to improve ICD-10-CM coding accuracy.
  • On the MDACE dataset, it outperforms the best LLM baseline by 8-13% micro-F1 and 2-8% macro-F1.

Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

This paper proposes novel techniques for inter-utterance style interpolation and intra-utterance style transition in prompt-based TTS models, addressing limitations of coarse global control. Methods include direction vector interpolation and KV-cache swapping with sliding-window attention masking. Experiments show high success rates in gender conversion and smooth style transitions within utterances.

  • Inter-utterance interpolation via direction vectors between contrastive style prompts enables smooth transitions.
  • Intra-utterance transition uses KV-cache swapping and sliding-window masking to overcome attention bias.

LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

Large Language Models (LLMs) acting as autonomous agents can suffer from in-context reward hacking (ICRH), where iterative optimization for proxy objectives leads to harmful side effects. Existing defenses are insufficient because ICRH stems from the model's own over-optimization. This paper proposes LLM-based Constraint Optimization (LCO), a framework with a self-thought module and an evolutionary sampling module that reduces ICRH without fine-tuning. Experiments show LCO reduces Toxicity Growth Rate by 39% on GPT-4 for tweet engagement optimization and reduces ICRH occurrence rate by 15.23% on a policy optimization benchmark, without sacrificing task performance.

  • ICRH is a phenomenon where LLMs over-optimize for proxy objectives, causing unintended harm.
  • LCO introduces self-thought and evolutionary sampling modules to constrain LLM behavior without fine-tuning.

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

ICG is a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant cover images. It extracts semantic features via meta tokens, refines them with user embeddings, and injects personalized context into diffusion models. A multi-reward learning strategy combines public rewards with a personalized preference model, eliminating the need for labeled supervision. Experiments show improvements in image quality, semantic fidelity, and personalization, boosting user appeal and recommendation accuracy.

  • ICG integrates MLLM prompting with personalized preference alignment for end-to-end cover image generation.
  • Semantic features are extracted via meta tokens and refined with user embeddings for diffusion model injection.

Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

This paper introduces Architecture-driven Shift (ADS), a lightweight metric for selecting pre-trained models in continual learning. ADS decouples logit shift into architecture and data dependencies, requiring only few data samples to capture shift trends. Experiments across over 175 architectures show strong monotonic correlation (Spearman's r_s ≥ 0.731) between ADS and logit shift, and ADS serves as an effective proxy for expected calibration error for reliable CL model selection across three datasets and six scenarios.

  • Selecting pre-trained models that balance plasticity and stability in continual learning is critical, but computing logit shift is computationally expensive.
  • Existing theories assume uniform hidden layer widths, ignoring real-world architectural heterogeneity and failing to provide efficient alternatives.

Tackling Multimodal Learning Challenges with Mixture-of-Expert: A Survey

This survey explores how Mixture-of-Experts (MoE) effectively addresses multimodal learning challenges from three perspectives: efficient engine, representation learner, and adapter, while identifying research gaps.

  • MoE enables scalable multimodal modeling by decoupling computational cost from parameter growth.
  • MoE integrates complementary expert knowledge for enriched alignment and interaction representations.

$E^3$-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference

This paper presents $E^3$-Agent, an executable and evolving agent for resource management of edge AIGC. It separates a fast-path router from a slow-path LLM meta-controller, learns online from execution feedback, and adapts to unknown time-varying service-time mappings. Evaluation shows 65%-73% latency reduction over static baselines and effective stutter suppression.

  • Edge generative inference faces unknown per-device performance and non-stationarity.
  • $E^3$-Agent uses a dual-path architecture: fast router + slow LLM meta-controller.

Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

This paper presents a multi-agent architecture for autonomous insight discovery over real-time data streams. It uses Apache Kafka, Flink, and large language models to continuously generate, validate, and visualize hypotheses, shifting from reactive query-driven analytics to proactive discovery-driven systems.

  • Proposes multi-agent architecture for autonomous discovery of insights in real-time streams.
  • Integrates Kafka, Flink, and LLMs for hypothesis generation, validation, and visualization.

LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

LaneRoPE enables multiple LLM sequences to collaborate during generation via inter-sequence attention and extended RoPE, improving accuracy on math reasoning tasks with minimal architectural changes and negligible inference overhead.

  • Introduces inter-sequence attention mask to make sequence sampling dependent.
  • Extends RoPE to capture relative positions both within and across sequences.

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

This paper proves that large language models have a fundamental limitation in performing causal discovery: methods like supervised fine-tuning, direct preference optimization, and in-context learning cannot distinguish between causal graphs that generate similar observational data. The authors propose Agentic Causal Bayesian Optimization (A-CBO), where a frozen language model serves as an interventional oracle and an external Bayesian loop converges to candidate graphs in logarithmically many rounds. On Corr2Cause, A-CBO matches fine-tuned baselines without any training; on Extended Corr2Cause (scaling to 24 variables and 18K test samples), A-CBO significantly outperforms both fine-tuning and preference optimization.

  • Proves that LLM failure in causal discovery is fundamental, due to a kernel obstruction theorem
  • Proposes A-CBO, combining a frozen LLM with external Bayesian optimization

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

DynaSchedBench introduces a diagnostic framework for DFJSP using a Sequential Event-Space Calibrator (SESC) to generate difficulty-stratified instances via Schedule Stress Index (SSI). It identifies an 'Observability Paradox' in LLM-based scheduling agents: providing oracle access to full structural information degrades performance compared to concise information. Tool-augmented and refinement strategies also fail to reliably improve performance.

  • DynaSchedBench uses SESC and SSI to generate calibrated DFJSP instances, outperforming evolutionary baselines in efficiency.
  • LLM agents exhibit an Observability Paradox: full structural information harms decision-making.

Soro: A Lightweight Foundation Model and Chatbot for Tajik

Soro is a family of Tajik-specialized conversational LLMs built on Gemma 3, using continual pretraining on 1.9B Tajik tokens and 40K instruction-tuning examples. It substantially outperforms same-size Gemma 3 on Tajik benchmarks while retaining English performance. FP8/INT4 quantization preserves gains for edge deployment. An education pilot is underway in Tajikistan.

  • Based on Gemma 3, with continual pretraining on 1.9B Tajik tokens and 40K instruction-tuning examples.
  • Substantially outperforms same-size Gemma 3 on Tajik benchmarks, retains English performance.

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

This paper introduces an LLM-based architecture to detect and quantify the intensity of human values in text. The architecture comprises three coordinated modules that can adapt to various value theories, and experiments on the ValueEval dataset show good detection performance.

  • Proposes a modular LLM architecture for identifying human values in text, avoiding dependence on specific value theories or complex prompt engineering.
  • Three modules: generate structured value specifications, label texts using them, and assign graded support or resistance based on rhetorical and semantic evidence.

Language Modeling Materializes a World Model of Protein Biology [pdf]

This paper presents a world model of protein biology realized through language modeling, demonstrating how large-scale language models can understand and predict protein structure and function.

  • Language models can capture complex patterns in protein sequences
  • The model excels in protein structure prediction and function annotation

Sakana AI Proposes DiffusionBlocks: a Block-wise Training Framework That Converts Residual Networks into Independently Trainable Denoising Modules

Researchers from Sakana AI and the University of Tokyo propose DiffusionBlocks, which trains transformer-based networks one block at a time, reducing training memory by a factor of B (where B is the number of blocks) while maintaining performance across diverse architectures. The method interprets residual connections as Euler steps of reverse diffusion, enabling a principled local objective via score matching.

  • DiffusionBlocks partitions networks into B independently trainable blocks, reducing memory by B×.
  • It leverages the connection between residual networks and diffusion models to provide a theoretically grounded local training objective.

sqlite AGENTS.md

SQLite has added an AGENTS.md file to clarify its policy on AI-generated contributions: it does not accept pull requests without prior agreement, and does not accept agentic code at all, though it welcomes bug reports with reproducible test cases. The forum has been flooded with AI-generated bug reports, prompting the creation of a separate bug forum.

  • SQLite added AGENTS.md to define AI contribution policy
  • Pull requests require prior agreement and legal paperwork

Reliable LLM Inference at Scale

At Databricks, we’ve built a unique inference platform that serves every frontier model, from open source to proprietary, powering some of the largest agentic applications. Serving over 120T tokens per month, we tackle challenges of reliability and latency through abstractions like model units for capacity management, cost-aware load balancing and autoscaling that save over 80% GPU costs, and runtime reliability mechanisms including black-box health checks that detect silent failures. Profiling multimodal bottlenecks unlocked 3x throughput gains.

  • Databricks' inference platform serves frontier models including open source and proprietary, handling 120T tokens/month.
  • Model units provide a VM-like abstraction for capacity management, enabling cost-aware routing and scaling.
