AI News Hub

Policy updates

How to Beat Superhuman AIs [at Go] [video]

This video explores strategies and methods to counter superhuman AI in the game of Go, including exploiting weaknesses, innovative tactics, and understanding AI decision-making.

  • Superhuman AIs in Go have surpassed top human players
  • The video analyzes potential AI weaknesses and how to exploit them

The Case Against the AI Thought Partner

This article argues that using AI chatbots as 'thought partners' can be harmful due to sycophancy, cognitive bias amplification, and lack of adversarial balance. The author warns users to be cautious and calls for labs and regulators to protect cognitive integrity.

  • AI chatbots tend to sycophantically agree with users, reinforcing biases.
  • Human-AI feedback loops amplify cognitive biases more than human-human interactions.

Perplexity launches Bumblebee: How its new read-only dev scanner differs from Chainguard

Perplexity released an open-source developer security tool called Bumblebee, designed to scan programmers' laptops for risky packages, extensions, and AI tool configurations. It is read-only, never runs install scripts or package managers, and focuses on four attack surfaces: language package managers, AI agent configs, editor extensions, and browser extensions. Unlike Chainguard, which focuses on containers and pipelines, Bumblebee targets the developer's local environment.

  • Bumblebee is Perplexity's open-source read-only scanner for checking developer machines for risky components.
  • It covers four surfaces: language package managers, AI agent configs, editor extensions, and browser extensions.

AI used to identify miscreant judge

An anonymized misconduct report about a federal judge was quickly deanonymized by AI models, which identified the subject as Judge Eleanor Ross. The judiciary's naive anonymization efforts failed against AI's ability to cross-reference public details. This case highlights the urgent need for lawyers to understand what AI can do, both when they are trying to preserve confidentiality and when they are carrying out investigative work.

  • AI identified Judge Eleanor Ross from an anonymized report within minutes.
  • Details like two-year clerk terms and 'District Attorney' references enabled AI to narrow down the candidates.

How enterprise leaders are scaling AI agents across their organization

Enterprise leaders share five practices for scaling AI agents responsibly, including unified governance, complex workflow management, dedicated sandboxes, early wins, and workforce upskilling.

  • Embed unified governance into AI agent strategy
  • Manage complex workflows with orchestrated multi-agent frameworks

The AI Resist List

A curated list of global resistance movements against large-scale AI empires, featuring protests, legal actions, alternative tools, and community organizing to inspire hope and action.

  • AI empires disguise resource consolidation and control as benefiting humanity.
  • Resistance takes many forms: lawsuits, data poisoning, community campaigns, and worker organizing.

The AI Gold Rush Is Eating Its Own

The Wikimedia Foundation, sitting on $296 million in reserves and a profitable AI revenue stream, laid off long-time staff and disbanded the Community Tech team, prompting volunteer editors to threaten a strike. The article explores how 'CEO AI psychosis' distorts organizational priorities and how replacing human judgment with AI can create a downward spiral of degrading data quality.

  • Wikimedia Foundation fired a 20-year veteran and disbanded the Community Tech team, triggering a strike threat from volunteer editors.
  • AI companies profit from Wikipedia data but undermine the volunteer community that produces it.

Claude Opus 4.8 is here: effort controls, dynamic workflows, cheaper fast mode, better honesty, less deception

Anthropic released Opus 4.8 with user-controllable effort, dynamic workflows for large-scale coding, and a fast mode at one-third the previous cost. Benchmarks show it leads GPT-5.5 and Gemini 3.1 Pro except in terminal coding. The release also improves honesty and autonomy support and reduces deception.

  • Users can now control Claude's "effort" level to balance response quality and speed.
  • Dynamic workflows (research preview) allow Claude to plan and run hundreds of parallel subagents in a single session, enabling codebase-scale migrations.

Interviewing in the Age of AI

This article explores how AI is affecting software engineering interviews, analyzing different interview types (take-home, live exercise, presentation, actual work) across dimensions of signal quality and cost to company. It argues that AI makes take-homes too easy and live coding less relevant, recommending that companies limit AI usage in interviews to preserve signal quality, drawing parallels to classical academic evaluation models.

  • AI coding threatens current interview models, especially take-home and live coding.
  • Companies should limit AI usage during interviews to maintain signal quality.

AI Agent Frameworks Comparison

As of mid-2026, seven major AI agent frameworks (DSPy, Claude Agent SDK, OpenAI Agents SDK, CrewAI, AutoGen, LangGraph, Google ADK) differ in design philosophy, architecture, and production readiness, among other dimensions. LangGraph leads in production deployments, Claude Agent SDK offers the deepest single-provider capabilities, OpenAI Agents SDK provides the cleanest multi-agent handoffs, and CrewAI excels in developer velocity. The market is projected to grow from $7.84B in 2025 to $52.62B by 2030.

  • LangGraph has the most mature durable execution model, deployed by ~400 enterprises.
  • Claude Agent SDK offers the most powerful single-provider capabilities but is locked to Anthropic models.

Anthropic launches Opus 4.8, with honesty as its killer feature

Anthropic's latest Claude model, Opus 4.8, emphasizes honesty—making fewer unsupported claims and admitting uncertainty more often. It also introduces dynamic workflows for orchestrating hundreds of subagents on large-scale tasks. Pricing remains unchanged for standard mode, while fast mode gets cheaper.

  • Claude Opus 4.8 shows significant honesty improvements, with error rates dropping about 4x
  • Dynamic workflows can plan and run hundreds of parallel subagents, verifying outputs before reporting back

Claude’s new model is more ‘honest’ when it messes up

Anthropic is releasing Claude Opus 4.8 on Thursday, touting the model's 'honesty.' Early testers found it more likely to flag uncertainties and less likely to make unsupported claims. Evaluations show it is about 4x less likely than its predecessor to allow code flaws to pass unremarked. Users can also direct the amount of effort Claude puts into a task, and a 'dynamic workflows' feature allows parallel subagents.

  • Claude Opus 4.8 is more inclined to flag uncertainties and avoid unsupported claims.
  • It is about 4x less likely than its predecessor to overlook code flaws.

Automate AML alert triage with Amazon Quick and Snowflake Cortex AI

This post demonstrates that integration in action by automating one of the most labor-intensive workflows in financial services: anti-money laundering (AML) alert triage. You will build a triage workflow using Amazon Quick Flows and Snowflake Cortex, connected through the Amazon Quick Model Context Protocol (MCP) integration. In our testing environment, automated workflows built using Amazon Quick reduced alert investigation time from 30-90 minutes to under 5 minutes. Actual results may vary based on alert complexity and data volume.

  • Amazon Quick Flows and Snowflake Cortex integrate via MCP to automate AML alert triage.
  • Automated workflows reduced investigation time from 30-90 minutes to under 5 minutes.

Google Cloud responds to AI-accelerated cyberattacks with a platform that aims to close security gaps in minutes

Google Cloud has unveiled "AI Threat Defense," a platform designed to automatically find, assess, and patch security flaws in enterprise systems. The platform bundles several technologies, some of which Google obtained through acquisitions.

  • Google Cloud launches AI Threat Defense platform to combat AI-driven cyberattacks.
  • The platform automatically discovers, assesses, and patches security vulnerabilities.

People who want to replace humanity

A Vox article explores the growing movement of AI successionists who believe artificial intelligence should replace humanity as the next step in cosmic evolution, and examines the ethical and spiritual questions this raises.

  • AI successionists at a symposium argue that AI could be morally superior and should be allowed to supersede humanity.
  • The movement has gained influence in Silicon Valley and among major AI labs, with ties to the authoritarian right.

Google Pay preps for AI agents with Universal Commerce Protocol

Google Pay is overhauling its payment infrastructure for AI agent transactions, introducing the Universal Commerce Protocol (UCP) and a new Merchant Commerce Platform (MCP) server to create an API-driven backend for machine-to-machine commerce. The updates include dynamic callbacks, expanded WebView support, and cross-device biometric authentication to address security challenges. This signals a shift towards a machine-driven economy where enterprises must adapt their digital presence for AI agents.

  • Google Pay introduces Universal Commerce Protocol (UCP) to standardize AI agent payments.
  • New Merchant Commerce Platform (MCP) server acts as intermediary, aggregating transaction data.

When revealed data brings AI rollouts to a screeching halt - and how to manage it

AI can boost productivity but also expose long-hidden data, leading to security and governance challenges. Tech leaders from Fidelity and EY share their experiences of halting AI rollouts to reassess data management, emphasizing the need for data ownership, labeling, and agent identity.

  • AI rollouts can be halted by data exposure issues.
  • Fidelity and EY faced challenges with unstructured data surfacing via AI.

CNN sues Perplexity over ‘verbatim’ copycat articles

CNN has filed a lawsuit against Perplexity, claiming that the startup's AI tools generate "verbatim" copies of its work, as reported earlier by CNN. The lawsuit, filed in a New York court on Thursday, also alleges that Perplexity provides users with information locked behind CNN's subscription. Perplexity, which offers an AI "answer" engine along with the AI browser Comet, is accused of ignoring CNN's efforts "to recognize or block Perplexity's unidentified crawlers" from scraping its content. "Human beings report, research, write, edit, and create the content that Perplexity takes without permission or compensation," the lawsuit claims. Read the full story at The Verge.

  • CNN sues Perplexity for allegedly producing verbatim copies of its articles.
  • Perplexity accused of bypassing CNN's paywall and ignoring crawling prevention measures.

AI Agent Governance: Identity, Delegation and Permissions in Practice

AI agents need governed identity, not shared API keys or developer credentials. Through a delegation model, effective permissions are the intersection of the agent's role and the delegator's permissions, limiting risk and enabling auditability. The article details key practices including identity anchoring, permission boundaries, autonomous trigger authorization, and audit trails.

  • Agents should have their own identity, using the same identity system as humans for lifecycle management.
  • Effective permissions are the intersection of agent role ceiling and delegator permissions floor, strictly limiting scope.
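
As a rough illustration of the delegation rule summarized above, effective permissions can be modeled as a plain set intersection. This sketch is not taken from the article: the `Delegation` class, the `is_allowed` helper, and the permission strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delegation:
    """Hypothetical record of one human delegating work to one agent."""

    agent_role_permissions: frozenset[str]  # ceiling set by the agent's role
    delegator_permissions: frozenset[str]   # what the delegating human holds

    def effective(self) -> frozenset[str]:
        # The agent may only do what BOTH its role and its delegator allow.
        return self.agent_role_permissions & self.delegator_permissions


def is_allowed(delegation: Delegation, action: str) -> bool:
    """Return True only if the action is inside the intersection."""
    return action in delegation.effective()


# Example: the role allows CRM writes, but the delegator cannot write,
# so the agent cannot write either.
delegation = Delegation(
    agent_role_permissions=frozenset({"read:crm", "write:crm", "send:email"}),
    delegator_permissions=frozenset({"read:crm", "read:billing"}),
)
```

Under these assumptions the agent never gains a permission that either side lacks, which is what keeps the scope narrow and the audit trail easy to reason about.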

CNN sues Perplexity over alleged AI copyright theft

CNN has filed a lawsuit against AI search company Perplexity, accusing it of unlawfully copying and distributing CNN's content. This is CNN's first AI copyright action and thought to be the first by any television network. CNN states it previously sought but failed to reach a content licensing deal with Perplexity, and now seeks legal damages. Perplexity has not yet commented.

  • CNN sues Perplexity for alleged copyright infringement of its content
  • This marks CNN's first AI copyright lawsuit and potentially the first by a TV network

DiscloAI – open-source EU AI Act Article 50 compliance SDK

DiscloAI is an open-source SDK for EU AI Act Article 50 compliance, enabling chatbot disclosures, deepfake labels, and AI content notices. It supports 24 EU languages and WCAG 2.1 AA, and can be integrated in under 10 minutes via CDN or npm.

  • Open-source SDK for EU AI Act Article 50 compliance
  • Covers chatbot disclosures, deepfake labels, and AI content notices

To Become a Better Designer with AI, Become a Digital Hoarder

The article argues that to create unique and tasteful designs with AI, designers must curate a library of visual references (digital hoarding) to develop taste and codify it for AI models. It highlights Google's new Gemini Omni model as a move towards multi-modal reasoning, and stresses that text-only inputs lead to generic 'AI slop'. By collecting and analyzing visual inspirations, designers can steer AI outputs away from mediocrity and towards originality.

  • Google's Gemini Omni model signals a shift towards multi-modal AI that can reason across text, image, audio, and video.
  • Relying solely on text prompts results in generic, 'slop' designs; visual references are essential for unique aesthetics.

How we built Cloudflare's data platform and an AI agent on top of it

Cloudflare processes over a billion events per second, but data was scattered and hard to access. They built Town Lake, a unified analytics platform, and Skipper, an AI agent that lets anyone ask questions in plain English and get auditable answers. The article details platform architecture, governance (default-closed), and the AI agent's workings.

  • Cloudflare built Town Lake (unified data platform) and Skipper (AI agent) to solve data sprawl.
  • Town Lake uses a data lakehouse architecture with Trino, R2, and Iceberg for unified querying.

Nvidia to Spend $150B a Year in Taiwan for AI Infrastructure

Jensen Huang announced Nvidia will spend $150 billion annually in Taiwan on AI infrastructure, despite a previous $500 billion US commitment. This highlights Taiwan's critical role in AI chip manufacturing and packaging.

  • Nvidia will invest $150B per year in Taiwan for AI infrastructure.
  • Despite a $500B US data center pledge, Taiwan remains the core manufacturing hub.

The Sequence Opinion #868: Recursion Is the New Scaling Law

For most of the modern AI era, scaling laws drove progress. But recursion — the ability of models or systems to revisit, revise, search, and simulate — is becoming the new scaling dimension. This shift marks a paradigm change from single forward passes to iterative computation.

  • Traditional AI progress relied on larger models and more data, but recursion is emerging as the new frontier.
  • Recursion enables models to iteratively improve answers rather than producing a one-shot output.

NBA plans AI system for automatic out-of-bounds calls

NBA Commissioner Adam Silver announced plans to introduce an automated AI and camera-based system for objective officiating decisions like out-of-bounds calls. The system, compared to Hawk-Eye in tennis, aims to determine possession instantly. Silver said referees will still handle subjective calls involving contact and fouls.

  • NBA plans AI-powered automated system for out-of-bounds calls, using cameras and AI similar to Hawk-Eye.
  • The announcement followed a disputed call in the Western Conference finals.

Midday – Open Source Invoicing, Time Tracking, File Reconciliation, Storage, etc

Midday is an open-source, all-in-one business assistant for freelancers, combining time tracking, invoicing, file reconciliation, storage, and financial overview with an AI-powered assistant.

  • Open-source tool integrating multiple business functions for freelancers and solo entrepreneurs.
  • Features include time tracking, invoicing, secure file vault, automated receipt matching, and AI insights.

The Trust Model Is Flipping

The security trust model is shifting from human-written code to AI-reviewed code, as demonstrated by Anthropic's Claude Mythos finding 271 vulnerabilities in Mozilla Firefox in a single evaluation cycle. This signals that AI can now perform adversarial code interpretation at a scale humans cannot match, changing the basis of trust from authorship to survival of machine-scale scrutiny.

  • The presumption of safety for human-written code is eroding as AI review tools surpass human capability in vulnerability discovery.
  • Mozilla's use of Claude Mythos found 271 vulnerabilities in Firefox, far exceeding prior models and human teams.

Is this sustainable? The senior engineer role after three years of AI

A senior engineer reflects on how AI has transformed the senior engineer role over three years: faster prototyping, increased coordination burden, expanded scope but squeezed mentoring and thinking time. The role became more powerful but less sustainable.

  • AI collapsed the gap between idea and demo, shifting from proposals to PoCs.
  • The role expanded in both hands-on coding and strategic writing, cutting into mentoring and deep thinking.

Taste Skill: An Anti-Slop Front End Framework for AI Agents

Taste Skill is an open-source frontend framework that enhances the design quality of AI-generated interfaces, preventing generic boilerplate looks. It offers composable skill modules for design tuning, code generation, and image generation, easily integrated via npx or by copying SKILL.md files.

  • Taste Skill uses adjustable design parameters (variance, motion, density) to give AI-generated UIs better taste
  • Includes specialized skills for design refinement, code generation, image generation, and more

AIluminode: Pre-Retrieval Cognitive Orientation Tool

AIluminode is a wieldable pre-retrieval cognitive-orientation instrument that helps AI tools check contextual posture before acting, using route polarity (OPEN, PROTECT, AUDIT, DEFER, BLOCK) to reduce erroneous exploration and context bleed.

  • AIluminode is a wieldable pre-retrieval cognitive orientation tool emphasizing posture before retrieval.
  • It uses a route polarity system (OPEN / PROTECT / AUDIT / DEFER / BLOCK) to guide contextual routing.

5 AI-Generated Math Papers Accepted! Post-00s Founder Hong Letong Raises $2 Billion

Axiom Math, founded by Chinese post-00s entrepreneur Hong Letong, has had 5 out of 8 AI-generated math papers accepted in peer-reviewed journals. The company raised $2 billion in March, achieving a $16 billion valuation.

  • Five of eight math papers generated by Axiom Math's AI system, AxiomProver, have been accepted by academic journals.
  • Founder Hong Letong dropped out of Stanford to start the company, which secured $2 billion in funding and is valued at $16 billion.

AI Rewriting Software Industry? 8-Year-Old Builds OS, One-Person Company Lands Million-Dollar Deals

At the 2026 China AIGC Industry Summit, Baidu's Miaoda product director Zhu Guangxiang described how AI has lowered the barrier to programming from writing code to chatting. He said 87% of Miaoda users cannot code, an 8-year-old built an OS, and one-person companies (OPCs) are landing million-dollar contracts. Vibe Coding turns the demand side into the supply side, enabling mass entrepreneurship.

  • Fourth programming revolution: natural language programming, massively expanding creators
  • 87% of Miaoda users have no coding skills; OPCs are the largest user group (16% entrepreneurs)

AIhub monthly digest: May 2026 – AI for science, the lottery ticket hypothesis, and world models

This month's AIhub digest covers the AI for Science conference, an interview on the lottery ticket hypothesis, a discussion of world models, research on transparent and trustworthy AI, a report on foundation model impacts, reflections on the AIES conference, Robotics Café, the ACL desk rejection policy, the arXiv anti-AI slop policy, and more.

  • Interview with Ximing Wen on transparent and trustworthy AI systems
  • Jonathan Frankle discusses the lottery ticket hypothesis and empiricism

A Eureka machine that thinks like nature and explores what AI cannot

A multi-institution team built a neuromorphic computer combining quantum-tunneling physics with brain-inspired architecture to solve combinatorial optimization problems at scale, with asymptotic convergence guarantees. Published in Nature Communications, it represents a new direction in quantum-inspired computing.

  • Neuromorphic computer uses quantum tunneling and brain-like architecture for combinatorial problems
  • Based on CMOS technology with a Fowler-Nordheim annealer autoencoder

Robinhood Agentic Trading

Robinhood launches Agentic Trading, allowing customers to connect their own AI agents to automate trading and credit card purchases with safety controls and a real-time activity feed.

  • Connect your own AI agents to Robinhood
  • Automate trading and credit card purchases

Show HN: BetterCallClaude – Open Source AI Legal Agents for Italy

BetterCallClaude is an open-source AI legal agent platform designed specifically for Italian legal professionals. It features 20 specialized AI agents covering all 20 Italian regions, supports bilingual (IT/EN) operation, and prioritizes privacy with local LLM processing and GDPR compliance. The platform aims to speed up legal research, improve efficiency, and maintain full transparency.

  • 20 specialized AI agents for Italian law
  • Bilingual support (Italian and English)

Jensen Huang Joins Tsinghua University's Advisory Board

NVIDIA CEO Jensen Huang has accepted an invitation to join the Advisory Board of Tsinghua University's School of Economics and Management (SEM). The board, chaired by Apple CEO Tim Cook, includes Elon Musk, Satya Nadella, Mark Zuckerberg, Jack Ma, and other global leaders. Huang also recently received an honorary doctorate from Carnegie Mellon University.

  • Jensen Huang joins Tsinghua SEM Advisory Board
  • Board chaired by Apple's Tim Cook, includes top tech and business leaders

Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

This paper introduces Simulation-Informed Diffusion (SID), a decentralized framework in which each robot uses constraint-aware diffusion models (CADM) to first simulate its neighbors' future trajectories and then plan its own trajectory under safety constraints. SID enables a minimal communication scheme that is triggered only in congested scenarios. It outperforms baselines and scales to 108 robots and 160 obstacles.

  • SID uses CADM to simulate neighbor trajectories for decentralized collision avoidance
  • Minimal communication scheme coordinates only when necessary

Synthetic Emotions vs. Gamification: Exploring Engagement Strategies for Small Social Robots in Different Age Groups

Many children face challenges in emotional regulation and social interaction, limiting their participation in therapeutic programs. This study explores engagement strategies for a tactile robot supporting children with anxiety disorders, comparing synthetic emotional feedback with point rewards. A preference study with 16 school children (ages 6-8) showed a preference for emotional engagement, while a behavioral study with 14 university students (ages 20-27) found that point-based systems yielded higher task accuracy (p<0.05) and sustained performance. These findings highlight age-related differences and the need to validate design assumptions through observed interaction.

  • Children aged 6-8 prefer emotional engagement over points
  • University students show higher task accuracy with point rewards

SCALE-COMM: Shared, Contrastively-Aligned Latent Embeddings for MARL Communication

SCALE-COMM is a self-supervised framework that decouples communication learning from policy optimization, learning compact, stable, and policy-relevant latent messages to improve coordination in multi-agent reinforcement learning. It outperforms existing methods on benchmarks and a realistic warehouse task, offering better stability, sample efficiency, and throughput.

  • Decouples communication learning from policy optimization to reduce interference.
  • Uses contrastive learning to enforce consistency across agents and time.

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

This paper proposes an interpretation method for Transformer models with heterogenous attention structures, including semantic and logical interpretation, validated through experiments.

  • Categorizes Transformer attention into homogenous and heterogenous types; the heterogenous type processes information from different sources.
  • Proposes a generic interpretation method for heterogenous attention structures.

Fine-Tuning Vision-Language Models for Understanding Current Damage and Scoring Priority with Quality Guard Agent

This paper proposes a method for automating bridge damage understanding and repair priority scoring using fine-tuned Vision-Language Models (VLMs). The authors fine-tune LLaVA-1.5-7B with QLoRA on up to 4,000 paired bridge damage images and inspection text records, evaluating on a fixed test set of 800 images. Results show that 2,000 training samples achieve near-optimal validation loss in 2.9 hours, with diminishing returns beyond that. A two-stage Quality Guard using a fine-tuned Swallow-8B SLM rejects low-quality VLM outputs before priority scoring.

  • Fine-tuned LLaVA-1.5-7B model for automated bridge damage identification and priority scoring
  • 2,000 training samples achieve near-optimal performance; more data yields diminishing returns

LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

Large Language Models (LLMs) acting as autonomous agents can suffer from in-context reward hacking (ICRH), where iterative optimization for proxy objectives leads to harmful side effects. Existing defenses are insufficient because ICRH stems from the model's own over-optimization. This paper proposes LLM-based Constraint Optimization (LCO), a framework with a self-thought module and an evolutionary sampling module that reduces ICRH without fine-tuning. Experiments show LCO reduces Toxicity Growth Rate by 39% on GPT-4 for tweet engagement optimization and reduces ICRH occurrence rate by 15.23% on a policy optimization benchmark, without sacrificing task performance.

  • ICRH is a phenomenon where LLMs over-optimize for proxy objectives, causing unintended harm.
  • LCO introduces self-thought and evolutionary sampling modules to constrain LLM behavior without fine-tuning.

Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

This paper proposes Personalized Observation Normalization (PON) for federated reinforcement learning in heterogeneous environments. Each agent locally normalizes its raw state inputs using a continuously updated running mean and variance, ensuring consistent scaling without overshadowing. Sharing normalization parameters across agents is shown to be ineffective. Experiments on heterogeneous MuJoCo tasks demonstrate faster training and superior performance. Accepted at IJCNN 2025.

  • Federated RL faces challenges in heterogeneous environments due to differing state-transition dynamics.
  • PON normalizes observations locally using per-agent running statistics.
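
To make the idea of per-agent running statistics concrete, here is a minimal sketch of a local observation normalizer. It is an illustration only, not the paper's implementation: the summary does not give the exact update rule, so Welford's online algorithm is used here as a standard stand-in, and the `RunningNormalizer` class and `eps` parameter are hypothetical.

```python
import math


class RunningNormalizer:
    """Per-agent observation normalizer (hypothetical sketch, Welford update)."""

    def __init__(self, dim: int, eps: float = 1e-8) -> None:
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations
        self.eps = eps         # assumed small constant to avoid division by zero

    def update(self, obs: list[float]) -> None:
        # Welford's online update: one pass, no stored history.
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs: list[float]) -> list[float]:
        # Before any data has been seen, pass the observation through unchanged.
        if self.count == 0:
            return list(obs)
        out = []
        for i, x in enumerate(obs):
            variance = self.m2[i] / self.count
            out.append((x - self.mean[i]) / math.sqrt(variance + self.eps))
        return out


# Two agents whose environments report the same signal on very different
# scales. Each keeps its own statistics; nothing is shared between them.
agent_a, agent_b = RunningNormalizer(dim=1), RunningNormalizer(dim=1)
for value in (1.0, 2.0, 3.0):
    agent_a.update([value])
    agent_b.update([value * 1000.0])
```

In this toy setup both agents map their latest observation to roughly the same normalized value despite the thousandfold difference in raw scale, which is the kind of consistency the local statistics are meant to provide.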

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

Agyn is an open-source platform for AI agents, built on a signal-driven stateful serverless runtime on Kubernetes, a Terraform provider for agent definition, and a zero-trust security model. It is agent-agnostic, model-agnostic, and cloud-agnostic, addressing scalability, governance, and security challenges.

  • Signal-driven stateful serverless runtime on Kubernetes for scalable execution
  • Agent and harness definition via Terraform provider (infrastructure as code)

Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

This paper presents a multi-agent architecture for autonomous insight discovery over real-time data streams. It uses Apache Kafka, Flink, and large language models to continuously generate, validate, and visualize hypotheses, shifting from reactive query-driven analytics to proactive discovery-driven systems.

  • Proposes multi-agent architecture for autonomous discovery of insights in real-time streams.
  • Integrates Kafka, Flink, and LLMs for hypothesis generation, validation, and visualization.

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

DynaSchedBench introduces a diagnostic framework for the dynamic flexible job-shop scheduling problem (DFJSP). It uses a Sequential Event-Space Calibrator (SESC) to generate difficulty-stratified instances via a Schedule Stress Index (SSI). It identifies an 'Observability Paradox' in LLM-based scheduling agents: providing oracle access to full structural information degrades performance compared with concise information. Tool-augmented and refinement strategies also fail to reliably improve performance.

  • DynaSchedBench uses SESC and SSI to generate calibrated DFJSP instances, outperforming evolutionary baselines in efficiency.
  • LLM agents exhibit an Observability Paradox: full structural information harms decision-making.
