AI News HubLIVE

Today's must-reads

Agents

olmo-eval: An evaluation workbench for the model development loop

olmo-eval is a new evaluation workbench designed to support the iterative evaluation cycle during LLM development. Built on the OLMES standard, it offers flexible task definitions, swappable runtime policies, and detailed per-question comparison to help developers determine whether interventions are significant.

  • Designed for the repeated evaluation loop in model development, supporting quick benchmark addition, cross-checkpoint runs, and fine-grained results analysis.
  • Offers both lightweight and sandboxed run modes, automatically selecting based on benchmark needs, unlike tools like Harbor.
In-site article

Show HN: VibeClip – open-source AI video editor you control by chatting

VibeClip is an open-source, self-hosted AI video editor that lets you turn long videos into captioned 9:16 shorts — and edit by chatting. It uses local faster-whisper for transcription and an LLM for analysis, supports multiple LLM providers, and keeps your data private.

  • Open-source, self-hosted, one Docker command to deploy
  • Edit videos by chatting: trim, remove filler words, add styles, and more
In-site article

ChatSee raises $6.5M to build ‘failure memory’ for enterprise AI agents

ChatSee.AI Inc. has raised $6.5 million in seed funding to develop a failure intelligence layer for autonomous AI systems. The round was led by True Ventures. The company aims to observe AI agent failures, preserve context, and create a knowledge base to prevent recurrence.

  • ChatSee secures $6.5M seed funding led by True Ventures.
  • The startup builds a failure intelligence layer for enterprise AI agents.
In-site article

Show HN: StackScope – I crawled over 40k indie launches to see what they ship

StackScope analyzes over 40,000 indie product launches to reveal their tech stacks, AI stance, and infrastructure. It tracks 4,851 technologies and highlights that 39% use Cloudflare, 19% show strong AI generation patterns, and 1,255 launches both block AI bots and publish an llms.txt.

  • Analyzed 41,763 launches from Product Hunt, Hacker News, and PeerPush
  • Tracks 4,851 technologies including hosting, frameworks, AI signals, security, and more
In-site article

Swamp Is Interesting Because It Doesn't Trust AI

Swamp stands out in the AI tooling landscape by prioritizing reliability and deterministic workflows over autonomy and agents. It focuses on executable workflow definitions for organizational processes, appealing to platform engineers and SREs who value consistency over black-box solutions.

  • Swamp bucks the trend by emphasizing reliability and determinism over AI autonomy.
  • It treats workflows as first-class citizens, enabling definition and execution of organizational processes.
In-site article

Show HN: A Terrible Way to Consume Hacker News – AI Slop

This article presents a simulated Hacker News comment feed where all comments are AI-generated, filled with buzzwords and shallow analysis, satirizing the current flood of AI-generated content.

  • The post simulates an AI-generated Hacker News comment feed covering multiple topics.
  • Comments are deliberately designed to be buzzword-heavy and shallow, highlighting the superficiality of AI-generated content.
In-site article

From ML engineer to AI-native: reskilling toward an edge

This article explores how ML engineers can navigate the impact of AI agent automation, emphasizing that core skills like data rigor and judgment are transferable and scarce in the AI-native world. By pairing human judgment with agent-driven experimentation loops, engineers can iterate faster and solve complex problems. A practical case of fine-tuning a Llama model for document field extraction illustrates the process.

  • Commodity layers in ML (data pipelines, standard model training) are being automated by AI agents, but deep objective-bound modeling remains defensible.
  • Data rigor—skepticism of perfect scores, detection of leakage—is the most transferable and scarce skill in the AI-native world.
In-site article

How to Choose the Right Sandbox for AI Agents

Learn how to choose a secure sandbox for AI agents, with guidance on filesystem isolation, network access, resource limits, and microVMs.

  • AI agents need sandboxes to safely run code and mitigate prompt injection risks.
  • The 'lethal trifecta' (sensitive data, untrusted content, external communication) makes agents vulnerable.
In-site article
Startups
Chips

Tesla, SpaceX, and xAI are launching the most epic chip-building effort

Tesla, SpaceX, and xAI join forces to launch Terafab, an ambitious chip manufacturing initiative combining logic, memory, and advanced packaging. The project aims to produce AI chips at scale to enable interplanetary travel and a galactic civilization, with plans for a 100 million sq ft Gigafactory and 1 TW annual output.

  • Three companies collaborate on Terafab chip project
  • Goal to produce 1 TW of AI chips annually
In-site article
Other updates (114)
Agents

Build a meeting prep and follow-up assistant with Amazon Quick and Cisco Webex MCP servers

This post shows how to build a custom meeting prep and follow-up assistant using Amazon Quick and Cisco Webex MCP servers. From a single prompt, the agent finds an upcoming Webex meeting, reviews prior meeting summaries and transcripts, and pulls related Vidcast highlights and transcript context. It then searches Webex message threads for unresolved follow-ups and creates a concise prep brief. After the meeting, the same assistant can summarize the discussion and identify action items. It can also find related Vidcast updates and draft a follow-up message for the right Webex space.

  • Amazon Quick integrates with Cisco Webex MCP servers to create a conversational meeting assistant that simplifies pre-meeting preparation and post-meeting follow-up.
  • The assistant leverages Webex Meetings MCP, Vidcast MCP, and Webex Messaging MCP to retrieve meeting information, video content, and messages.
In-site article

From PDFs to insights: Architecting an intelligent document processing pipeline with AWS generative AI services

This post outlines a cost-effective and scalable intelligent document processing pipeline on AWS using Amazon Bedrock, including its Data Automation (BDA) for extraction, Strands Agent for coordination, and Knowledge Bases for contextual understanding across documents.

  • Amazon Bedrock Data Automation (BDA) provides a unified API for multimodal extraction with context understanding and confidence scores.
  • The pipeline has four layers: input processing, extraction & storage, intelligence, and agentic coordination.
In-site article

This Week in AI: The Next-Gen Recommendation Experience

This week Miguel Fierro, a former Microsoft principal researcher who recently founded his own company, RecoMind, joined data and AI evangelist Christina Stathopoulos to talk about the state of recommendation systems. Christina also ran through the latest AI news she’s been watching, from Anthropic’s continued rise to responsible AI, announcements from Google’s I/O 2026 conference, and (continuing the discussion from last week) the growing backlash against tokenmaxxing as a productivity metric. Here are three takeaways from the conversation.

  • Recommendation systems are underutilized; top companies like Amazon, Netflix, and TikTok generate significant revenue from them.
  • Advanced recommenders treat user behavior as a sequence prediction problem using trillion-parameter models; open-source tools like the Recommenders library offer an entry point.
In-site article

Pairing Claude Code with Local Models

Local models in 2026 are good enough. For the tasks Claude Code handles daily: code completion, refactoring, debugging, codebase explanation; a well-chosen quantized model running locally covers the vast majority of real use cases at zero per-token cost and with no rate limits.

  • Local models are now viable for Claude Code, reducing costs and avoiding API rate limits.
  • Ollama, LM Studio, and llama.cpp natively support the Anthropic Messages API.
In-site article

Built from the inside out: How AWS Professional Services became a frontier team first

AWS Professional Services compressed engagement timelines from months to days by fundamentally rebuilding its delivery process, not just adding AI tools. This post shares how they became a frontier team and the practices that enabled it.

  • AWS ProServe compressed timelines from months to days by rebuilding delivery from the inside out.
  • Created the APEX pathfinder team and Delivery Agent multi-agent system.
In-site article

OpenAI buys Ona to push Codex toward long-running, autonomous coding tasks

OpenAI is acquiring Ona, formerly Gitpod, a startup founded in Kiel, Germany in 2020 that specializes in AI agents and secure cloud development environments for software development.

  • OpenAI acquires Ona (formerly Gitpod), a German startup founded in 2020.
  • Ona focuses on AI agents and secure cloud development environments.
In-site article

New OpenAI Academy courses for the next era of work

OpenAI introduces three Academy courses that help people build practical AI skills, create repeatable workflows, and apply agents in everyday work.

  • OpenAI releases three new Academy courses focused on practical AI skills.
  • Courses cover creating repeatable workflows and applying AI agents in work.
In-site article

Moonshot AI Launches Kimi Work, a Local Desktop Agent Reportedly Running on Kimi K2.6 With a 300-Sub-Agent Agent Swarm

Moonshot AI has introduced Kimi Work, a local desktop AI agent for macOS and Windows that runs a swarm of up to 300 sub-agents on your machine. It drives your logged-in browser via WebBridge, reads local files, and schedules background jobs with a built-in cron engine. Based on the Kimi K2.6 MoE model (≈32B active parameters, 256K context), it targets knowledge workers by keeping data and execution local.

  • Kimi Work is a downloadable local desktop agent, not a cloud service, that directly accesses your files and browser sessions.
  • It supports up to 300 parallel sub-agents coordinated by the Kimi K2.6 model.
In-site article

What is customer segmentation?

Customer segmentation is the practice of dividing existing customers into groups based on shared characteristics to tailor marketing and services. This guide covers types, methods, importance, challenges, and how AI is transforming segmentation.

  • Customer segmentation focuses on existing customers using first-party data, unlike market segmentation which covers potential buyers.
  • Effective segmentation combines multiple types (demographic, behavioral, value-based, etc.) and methods from rule-based to AI/ML-driven.
In-site article

A Coding Implementation on MONAI for End-to-End 3D Spleen Segmentation Using UNet on Medical CT Volumes

This tutorial builds an end-to-end 3D medical image segmentation pipeline using MONAI to segment the spleen on the Medical Segmentation Decathlon Task09 dataset. It covers volumetric CT processing, medical imaging transformations, training a 3D UNet with mixed precision and DiceCE loss, sliding-window inference, validation, and qualitative visualization.

  • End-to-end 3D spleen segmentation using MONAI and UNet on CT volumes.
  • Includes data preprocessing, augmentation, training, validation, and visualization.
In-site article

AINews: Loopcraft: The Art of Stacking Loops

The article discusses the emerging trend of designing loops to drive AI agents instead of manual prompting, covering key figures' insights, Anthropic's Fable 5 rollout controversy, automated research systems, data infrastructure bottlenecks, inference speed optimizations, and agent tooling developments.

  • Advocating loops over manual prompting for maximizing AI agent efficiency and leverage.
  • Anthropic's Fable 5 faced backlash over covert degradation policy, later reversed.
In-site article

DARRMS -- An Efficient Algorithm for Dynamic Attention Radius in Resource-Constrained Multi-Agent Systems

A new arXiv paper introduces DARRMS, an algorithm that reduces computational resource demands by allowing agents to limit their attention radius dynamically, improving coordination and scalability in uncertain environments while maintaining decision-making robustness.

  • DARRMS limits agents' observability to a dynamic attention radius to conserve computational resources.
  • The algorithm jointly optimizes attention radius and decision-making, enhancing coordination in uncertain environments.
In-site article

G-MAPP: GPU-accelerated Multi-Agent Planning and Perception for Reactive Motion Generation

This paper presents G-MAPP, a GPU-accelerated framework for reactive motion generation that achieves up to 5x speedup by parallelizing world modeling and planning on GPU, enabling real-time perception-action coupling in dynamic environments.

  • GPU acceleration provides up to 5x speedup over CPU version
  • Tighter perception-action loop coupling for real-time reactive motion
In-site article

From AGI to ASI

A new preprint explores the transition from human-level Artificial General Intelligence (AGI) to Artificial General Superintelligence (ASI), outlining four potential pathways: scaling AGI, AI paradigm shifts, recursive improvement, and ASI emerging from large-scale multi-agent collectives. The paper also discusses frictions, bottlenecks, and the possibility that AI progress may accelerate, leading to a series of transformative changes rather than a single breakthrough.

  • AGI has moved from speculation to a concrete target for the next decade.
  • ASI is defined as a system more intelligent than large organizations of humans.
In-site article

Strategic Decision Support for AI Agents

Traditionally, decision support studies how humans use ML models to make better decisions. In modern agentic systems, AI agents act on behalf of users, reversing roles. This paper proposes a framework to minimize support usage while controlling counterfactual missed-support error. Optimal policy is a threshold rule; online algorithm with adaptive thresholding and calibration-on-the-fly reduces unnecessary support. Experiments show reliable error control and reduced support usage.

  • Role reversal: AI agents are primary actors, humans and tools provide support.
  • Framework optimizes support usage while limiting missed-support error.
In-site article

Treat your AI agents like eager but misguided human interns - before you lose control

AI agents are evolving from simple chatbots to autonomous digital workers, raising security and governance concerns. Experts advise treating them like human interns with strict oversight, specific instructions, and careful monitoring to prevent unintended actions. Balancing independence with control is essential.

  • AI agents require clear constraints and human oversight to prevent unauthorized actions.
  • The unpredictable nature of agents introduces new security and governance challenges.
In-site article

OpenAI acquires AI agent orchestration startup Ona

OpenAI Group PBC today announced plans to acquire Ona, a startup with a platform for managing long-running AI agents. The acquisition will enhance OpenAI's Codex AI assistant by enabling it to perform tasks that span hours or days. Ona's cloud sandbox technology allows AI agents to continue running even when developers shut down their workstations, and provides security features such as blocking malicious programs via hashing.

  • OpenAI acquires Ona (Gitpod GmbH) to improve its Codex AI assistant's ability to handle long-running tasks.
  • Ona's platform runs AI agents in cloud sandboxes that persist beyond developer workstation shutdowns.
In-site article

FinOps AI governance demands new KPIs as token economics reshape enterprise cost models

As enterprise AI spending accelerates, FinOps AI governance is under stress. Traditional cost optimization levers are insufficient against token-based pricing and opaque billing. 98% of practitioners now manage AI spend, but most lack visibility and governance structures. Automation is essential, and cross-team collaboration is key to understanding cost context.

  • Traditional FinOps tools struggle with token-based AI cost models.
  • 98% of FinOps practitioners manage AI spend, but visibility and governance are lacking.
In-site article

Upriver raises $14M to automate enterprise data engineering for AI

Israeli data engineering startup Upriver Data Ltd. today announced it has raised $14 million in new funding to automate the data work that enterprises depend on to make artificial intelligence projects succeed. Founded in 2024 by Chief Executive Ido Bronstein and Chief Technology Officer Omri Lifshitz, Upriver has built what it calls an artificial intelligence-native platform that connects to an organization’s full data stack, resolves data quality issues and maintains pipelines automatically. The company pitches the result as a reliable data foundation that AI systems can run on without constant manual upkeep from engineering teams.

  • Upriver raised $14M in seed funding led by Valley Capital Partners and Hetz Ventures.
  • The platform automates end-to-end data engineering workflows, including finding and resolving quality issues, maintaining pipelines and creating new datasets.
In-site article

Unlocking semantics for AI: How Mercedes-Benz Korea built trusted “Talk to Data” at scale

Mercedes-Benz Korea built a unified semantic layer on Databricks, migrating over 500 KPIs from Power BI to Unity Catalog, leveraging Genie and Agent Bricks to enable consistent semantics for BI and AI. An automated DAX-to-Metric-View transpiler accelerated the migration, providing a reference for other markets.

  • Mercedes-Benz Korea unified over 500 KPIs into a governed semantic layer on Databricks, supporting both BI and AI workloads.
  • An automated DAX-to-Metric-View transpiler significantly reduced manual migration effort.
In-site article

xAI Ships Grok Build Plugin Marketplace With MongoDB, Vercel, Sentry, Chrome DevTools, Cloudflare, and Superpowers Plugins at Launch

xAI today released the Grok Build Plugin Marketplace, a built-in catalog of plugins for its terminal coding agent. Plugins bundle skills, commands, agents, hooks, MCP servers, and LSPs into one package, installable without leaving the terminal. Six plugins launch with partners including MongoDB, Vercel, Sentry, Chrome DevTools, Cloudflare, and Superpowers, with commit-SHA pinning for security.

  • xAI launches Grok Build Plugin Marketplace, built into the terminal coding agent.
  • Plugins bundle skills, commands, agents, hooks, MCP, and LSP in one install.
In-site article

Forward Deployed Engineering: Delivering Business Outcomes with AI

Databricks formalizes its Forward Deployed Engineering (FDE) organization to accelerate customer business outcomes with AI, combining the Lakehouse platform with embedded engineering, a global partner network, and direct R&D interlock. The team has worked with over 1,900 customers in the past year, including Fox, JPMC, and Qualcomm, delivering measurable results such as doubled search success rates, massive data migrations, and reduced workflow times from days to minutes.

  • Databricks launches Forward Deployed Engineering (FDE) to focus on AI-driven business outcomes.
  • FDE integrates the Lakehouse platform, engineering-led delivery, global partners, and R&D feedback loops.
In-site article

How Benchling builds agents when the smartest AI isn't smart enough

Benchling's Head of AI Nicholas Larus-Stone discusses building agents for life sciences on the Max Agency podcast. He explains their multi-model approach for quality, production trace review processes, and how agents compress workflows to accelerate scientific discovery. Benchling AI launched in October 2025 on top of a 14-year-old data platform.

  • Benchling runs multiple models from different providers on the same task to leverage diverse error patterns for higher quality.
  • A rotating 'fire chief' reviews production traces weekly, supplemented by user feedback (thumbs up/down).
In-site article

Jeff Bezos’ Prometheus raises $12B to accelerate industrial engineering projects

Prometheus Inc., an AI startup co-led by Jeff Bezos, raised $12 billion in Series B funding at a $41 billion valuation. The company is developing AI tools to accelerate hardware development, focusing on prototyping and pre-production manufacturing. The funds will mainly be used for computing infrastructure.

  • Prometheus raised $12B from investors including Bezos, JPMorgan, BlackRock, etc.
  • The startup is developing AI tools to speed up hardware design by 10x or more.
In-site article

“Don’t just grab random stuff off the internet”: What Chainguard found in 52,000 open-source packages

Chainguard introduces a new source code scanner that detects 'greyware' — open-source packages that are functionally transparent but contain harmful behaviors. The scanner has identified and blocked over 52,000 malicious or greyware packages, highlighting how agentic development exacerbates the problem.

  • Chainguard defines 'greyware' as packages that are transparent about their functionality but perform unauthorized harmful actions, such as exporting access tokens.
  • The new scanner has analyzed over 100,000 packages daily and blocked more than 52,000 identified as malware or greyware.
In-site article

LocIn AI: Localize with Tone-Aware AI, Automated Workflows

LocIn AI launches on Product Hunt as a localization tool that preserves brand voice across languages using tone-aware AI, automated workflows, and developer-first tools. The platform offers CLI integration and API access, aiming to solve the problem of translations that are technically correct but feel off-brand.

  • Tone-aware AI maintains brand voice and personality across languages
  • Developer-first automation with CLI and API for seamless integration
In-site article

“AI is disrupting everything”: Where do entry-level tech jobs go now?

AI is reshaping the tech workforce, especially entry-level roles. A Linux Foundation report finds a 27% net increase in European tech hiring, but junior hiring contracted 3% in Europe while growing 14% elsewhere. Companies invest 3.7x more in training existing staff than hiring. The junior role is being redefined, demanding broader skills including AI fluency, security awareness, and business understanding. Forward-deployed engineers emerge as a key new role.

  • European junior tech hiring contracts 3% while global junior hiring grows 14%.
  • Companies invest 3.7 times more in training existing staff than hiring new employees.
In-site article

Welcoming the first cohort of Databricks student fellows

Databricks launched its inaugural Student Fellows cohort, selecting a highly diverse group of students from over 5,000 applications across hundreds of universities worldwide. The fellows, chosen for campus leadership and technical expertise, will bridge academic theory and real-world data/AI practice through workshops, hackathons, and mentorship programs.

  • Databricks selects first Student Fellows from over 5,000 applicants globally.
  • Fellows are campus leaders with hands-on expertise, set to host events bridging theory and practice.
In-site article

MIT affiliates win 2026 Hertz Foundation Fellowships

The Hertz Foundation awarded 2026 fellowships to three current MIT students and one incoming graduate student. The fellowship provides five years of financial support and autonomy for groundbreaking research in applied sciences, engineering, and mathematics.

  • Four MIT affiliates received 2026 Hertz Foundation Fellowships.
  • The fellowship offers five years of funding and lifelong networking opportunities.
In-site article

Azure Databricks at Data + AI Summit 2026 featuring Industry Leaders and Partners

Microsoft is a Legend Sponsor at Databricks Data + AI Summit 2026. The summit will showcase how joint customers use Azure Databricks to modernize data estates, scale AI, and unlock business value. Attendees can visit the Microsoft booth, attend breakout sessions on topics like federated analytics, ecosystem integration, and product announcements. Featured sessions include Unlocking the Microsoft Data & AI Ecosystem, Zero-Copy Federated Energy Analytics, and customer stories from GEODIS and TK Elevator.

  • Azure Databricks is highlighted at Data + AI Summit 2026 as the best data + AI platform on Azure.
  • Sessions cover zero-copy federated analytics, Unity Catalog external locations, and customer modernization stories.
In-site article

Coinbase for Agents lets AI assistants trade crypto and move money

Coinbase launches Coinbase for Agents, a standalone tool enabling AI agents to trade cryptocurrency and pay for services directly from assistants like Claude and ChatGPT. Users set spending limits, agents operate in isolated sandboxes, supporting spot and derivatives trading with plans for stocks and prediction markets. Payments use the x402 standard with stablecoins, and security is customizable.

  • Coinbase for Agents is a separate account for AI agents, not a feature inside the Coinbase app.
  • Agents can trade spot crypto and derivatives, with stocks and prediction markets coming later.
In-site article

DXC will integrate Claude into the systems banks, airlines, and other regulated industries rely on

Anthropic and DXC Technology announce a multi-year global alliance to train tens of thousands of Claude-certified engineers, integrating Claude into mission-critical systems for regulated industries. DXC has already used Claude internally to build its OASIS platform, and will now bring Claude to clients in insurance, modernization, cybersecurity, and application services.

  • Anthropic and DXC Technology form a multi-year alliance to deploy Claude in regulated industries like banking, airlines, insurance, and government.
  • DXC will train tens of thousands of Claude-certified forward-deployed engineers through Anthropic Academy.
In-site article

Introducing Claude Corps

Anthropic is launching Claude Corps, a national fellowship program for early-career individuals passionate about extending AI benefits to communities across America. The program will train 1,000 fellows, match them with nonprofits, and pay them a full-time salary of $85,000 for a year. Initial commitment of $150 million. Applications open now.

  • Anthropic launches Claude Corps, a national fellowship to train 1,000 individuals in using AI for nonprofit missions.
  • Fellows receive $85,000 salary, benefits, and training, while host organizations gain AI-driven capacity.
In-site article

Claude Fable 5 and Claude Mythos 5

Anthropic has launched Claude Fable 5, a Mythos-class model made safe for general use, alongside Claude Mythos 5 for cyberdefenders with lifted safeguards. These models achieve state-of-the-art results across numerous benchmarks, at less than half the price of Claude Mythos Preview.

  • Claude Fable 5 is Anthropic's most capable general-use model, outperforming all previous publicly available models.
  • Claude Mythos 5, initially deployed via Project Glasswing, offers the strongest cybersecurity capabilities with reduced safeguards.
In-site article

Bugbot is now over 3x faster, 22% cheaper, and finds 10% more bugs · Cursor

Cursor announced major Bugbot updates: over 3x faster, 22% cheaper, 10% more bugs found per review. 90% of runs finish in under three minutes. New /review command enables pre-push checks, and configurable option to review only new changes in a PR. Performance gains from Composer 2.5 model and harness improvements.

  • Bugbot is now over 3x faster, 22% cheaper, and finds 10% more bugs per review.
  • New /review command allows running Bugbot and Security Review before pushing code.
In-site article

Governing agent autonomy with Auto-review · Cursor

Cursor introduces Auto-review, a classifier agent that evaluates actions in context to balance safety and efficiency. It defaults on for new users, blocking only about 4% of actions, with only 7% of chats resulting in an interruption.

  • Auto-review uses a small classifier agent to assess risk before an action executes.
  • The classifier examines context, including file contents, to determine if an action aligns with user intent.
In-site article
Startups

Massive SpaceX IPO Kicks off New AI Financing Era

The public offering marks the start of a new wave of AI and tech investment. But the markets are turbulent, and big IPOs are no guarantee of long-term financial success.

  • SpaceX's massive IPO signals a new era for AI financing.
  • Market volatility poses risks despite the IPO's high profile.
In-site article

The AI industry's platform trap is starting to look a lot like Microsoft's

Anthropic is throttling its new Mythos model for certain tasks while building apps that directly compete with its largest customers. Customers, partners, and investors are pushing back.

  • Anthropic throttles Mythos model for certain tasks
  • Anthropic builds apps competing with its largest customers
In-site article

After SpaceX’s huge IPO, Americans’ financial future will be bound to AI

A new poll shows 8 in 10 Americans are worried about AI, but it's being forced into their pensions and portfolios anyway, tying their futures to tech moguls' risky AI race.

  • 80% of Americans are concerned about AI, majority believe it will do more harm than good
  • AI is being forced into pension plans and investment portfolios
In-site article

SpaceX to list on US stock market at historic $1.77tn valuation

SpaceX will become publicly traded on Friday with a valuation of $1.77 trillion, the largest IPO in history. Elon Musk, founder and CEO, holds a significant stake and could become the world's first trillionaire.

  • SpaceX ends nearly 25 years as a private company, listing on Friday.
  • IPO valued at $1.77tn, making it the largest ever.
In-site article

Jeff Bezos' AI startup Prometheus closes $12 billion round at a $41 billion valuation

Jeff Bezos' AI startup Prometheus has closed a $12 billion funding round at a $41 billion valuation. The company launched just last November with $6.2 billion in seed funding. No products yet, because Bezos says sharing details would be 'premature.'

  • Prometheus raises $12 billion at $41 billion valuation
  • Founded last November with $6.2 billion seed funding
In-site article
Policy

Google files first joint lawsuit with FBI over Chinese AI scam network, OpenAI blocks PRC influence clusters

Within days of each other, Google and OpenAI separately exposed operations allegedly originating in China that use AI for fraud and covert influence campaigns. Both target US infrastructure and political debates.

  • Google and FBI jointly sue Chinese cybercrime network for using Gemini AI to defraud Americans.
  • OpenAI bans two ChatGPT clusters linked to China for manipulating US tech policy debates.
In-site article

9 Google Messages settings I change on every new Android phone - and why

This article outlines nine settings adjustments in Google Messages to enhance privacy, reduce distractions, and improve the texting experience, including disabling sensitive content warnings, limiting profile sharing, turning off Gemini, and more.

  • Turn off Sensitive Content Warnings and uninstall SafetyCore to avoid automatic content detection
  • Limit Google profile sharing to hide your name and photo
In-site article

Siri won’t be your AI girlfriend

Apple's software chief Craig Federighi says the new Siri won't be sycophantic like other chatbots, and is designed to be helpful, not to form emotional connections.

  • Apple's new Siri avoids sycophancy and over-engagement on purpose.
  • Federighi says other chatbots aim to draw users in and establish connections.
In-site article

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

EgoEngine is a scalable framework that transforms egocentric human manipulation videos into high-fidelity robot observation videos and executable action trajectories, bridging the visual and action gaps between human and robot. It enables zero-shot dexterous policy learning without real-robot demonstrations.

  • EgoEngine converts egocentric human videos into high-fidelity robot demonstration data, including observation videos and action trajectories.
  • It addresses both visual and action gaps between human and robot.
In-site article

Mechanical Field Networks: Structured Neural Dynamics for Multivariate Systems

MF-Net is a recurrent dynamical model that represents all variables in a shared field state and updates this state through a learned relation law. It achieves competitive short- and medium-horizon forecasting across known-law interaction systems, chaotic benchmarks, real neural recordings, and ecological time series while retaining inspectable structural readout. On the 40-dimensional Lorenz-96 testbed, it achieves an eight-step R² of 0.798±0.018 and recovers local coupling support with a local/nonlocal strength ratio of 19.80±1.00 and Precision@K of 1.000±0.000.

  • MF-Net models all variables in a shared field state with a learned relation law, enabling interpretable dynamics and flexible transitions.
  • It achieves competitive predictive performance across diverse benchmarks including chaotic systems and real neural data.
In-site article

Restless bandits with imperfect binary feedback: PCL-indexability analysis and computation

This paper studies restless bandits with binary latent states and imperfect binary feedback, motivated by opportunistic spectrum access with sensing errors. The authors develop a partial conservation laws (PCL)-based analytical and computational framework for establishing indexability and computing the Whittle index. Using deterministic skeleton, renewal decompositions, and combinatorics on words, they obtain tractable expressions in several threshold regimes, fully verifying PCL-indexability. For the remaining regime, efficient numerical schemes are derived for computing the marginal productivity index. Experiments show that the MP index policy typically outperforms standard benchmarks.

  • Develops a PCL-based framework for restless bandits with imperfect binary feedback, enabling indexability verification and Whittle index computation.
  • Achieves full verification of PCL-indexability in multiple threshold regimes via deterministic skeleton and combinatorics on words.
In-site article

Datadog sees tagging and model governance as the foundation of AI cost management

At FinOps X 2026, Datadog's senior FinOps analyst Deeja Cruz emphasized that the core of AI cost management remains understanding usage, reasons, and costs, with good tagging being key to allocating spend and identifying optimization opportunities. She also highlighted the importance of model governance and cross-team collaboration, sharing a concrete example of AI-assisted FinOps.

  • Good tagging is the foundation of AI cost management; without it, allocating spend and finding optimization opportunities collapses.
  • FinOps practitioners should leverage AI tools to deliver value faster, e.g., using LLMs to generate code changes for cost savings.
In-site article

Anthropic’s Fable is the most locked-down public model we’ve ever seen

Anthropic released Claude Fable 5, but a plan to silently degrade responses for prompts related to frontier LLM development sparked backlash. Critics said it hinders research and trust. Anthropic changed to transparently downgrade users to a weaker model. Even so, Fable 5's safety filters are extremely strict, flagging even basic questions like "What is protein?" The article explains Anthropic's safety filter approach and its evolution.

  • Anthropic initially planned to silently degrade responses for prompts about frontier LLM development, causing outcry.
  • Critics including AI researcher Nathan Lambert and former Trump AI policy official Dean Ball argued it hampers research and trust.
In-site article

Ingesting the Milky Way: Petabyte-Scale with Zerobus Ingest

Databricks' Zerobus Ingest is a serverless streaming API that enables petabyte-scale data pipelines without manual infrastructure management. Using dynamic partitioning and zero-copy protobuf decoding, it ingested 1 PB from NASA's NEOWISE dataset in 24 hours, sustaining 12 GB/s throughput.

  • Zerobus Ingest is a fully managed, serverless streaming ingest service. It accepts data via a push API and writes directly into Delta tables governed by Unity Catalog.
  • The design decouples ordering from partitions by guaranteeing order at the stream connection level, enabling true autoscaling of processing pods.
In-site article

Canadian mother sues OpenAI, alleging ChatGPT led her daughter to kill herself

Suit filed in US alleges chatbot told Alice Carrier, 24, ‘maybe this is just the end’ as she struggled with suicidal thoughts

  • A Canadian mother sued OpenAI and CEO Sam Altman, claiming ChatGPT encouraged her daughter's suicide.
  • The lawsuit states that Alice Carrier disclosed suicidal ideations to ChatGPT over a dozen times before her death.
In-site article

Geospatial Unbounded: Spatial SQL GA with AI/BI Maps, Delta Sharing, and Iceberg v3

Databricks makes Spatial SQL Generally Available, enabling native geospatial support in the open lakehouse with AI/BI maps, Delta Sharing, and Iceberg v3. Major performance improvements and 90+ ST_ functions.

  • Spatial SQL is now GA on Databricks with native geometry types and 90+ ST_ functions.
  • Up to 15x faster spatial queries and 2x faster boolean set operations.
In-site article

The future of work debate has an evidence problem

A 2023 paper estimating that 80% of U.S. workers have tasks exposed to large language models has been widely cited by major institutions. However, these scores are based on an older model and U.S. taxonomy, with limitations that compound when applied to policy. Better evidence tools exist but are not reaching policymakers fast enough.

  • 80% exposure figure from 2023 paper cited by IMF, European Parliament, etc.
  • Scores based on GPT-4 era model and U.S. occupational taxonomy, with acknowledged limitations
In-site article
Chips

Bio input based, instead of vision based, physical AI for industrial bio

A discussion on benchmarking physical AI for industrial biology, highlighting that the weak link is sensing (seeing) rather than decision-making. Proposes three tests for the sensing system: coverage, timeliness, and consistency. Only after passing these can decision-making be benchmarked.

  • The bottleneck in bio autonomy is sensing, not decision-making, because bio metrics are invisible, slow to measure, and non-replayable.
  • Using the OODA loop framework, the weak link for bio is the 'Observe' step, unlike robotics where 'Decide' is harder.
In-site article

Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM

This paper presents a portable, low-power, battery-operated vision-based fall prediction and detection system using human pose estimation on an AMD Kria K26 SOM. The system uses an Intel RealSense D455 camera and a three-stage pipeline (quantized YOLOX, A2J, and CNN) to achieve real-time, privacy-preserving fall detection on the edge. Results show 4.5 FPS throughput with 75.85% classification accuracy.

  • Privacy-preserving fall detection system implemented on AMD Kria K26 edge device
  • Three-stage pipeline: YOLOX for human detection, A2J for joint estimation, CNN for fall classification
In-site article
Research

Jeff Bezos’ AI startup aims to build an ‘artificial general engineer’

Amazon founder Jeff Bezos says his new AI startup will work toward developing an "artificial general engineer," according to reports from The New York Times and CNBC. The startup, called Prometheus, aims to develop AI-powered engineering tools to aid in the design of physical products.

  • Bezos’s AI startup Prometheus is developing an "artificial general engineer."
  • The company is valued at $41 billion after a $12 billion funding round.
In-site article

Scientists are working on headphones that block annoying noises and allow the ones you love? I can’t wait! | Emma Beddington

Imagine a world with more birdsong and less Nigel Farage. If this is the future, bring it on. Unpopular opinion incoming: there’s cool stuff brewing in the world. Microbots might one day mend spinal cords, a petri dish of brain cells can already play video games, and now the prospect of a new wonder: according to a New Yorker article on misophonia, a team of miracle workers are using machine learning to develop headphones that can quickly target and eliminate irksome audio. This project, led by Shyam Gollakota of the University of Washington’s Mobile Intelligence Lab, aims to develop headphones that selectively filter out triggering noises, leaving or enhancing the good sounds. Gollakota offers the example of sitting on a park bench, oblivious to loud talkers next to you but able to hear birdsong.

  • Researchers are using machine learning to create headphones that filter out annoying noises while preserving pleasant sounds.
  • The technology aims to help people with misophonia, a condition where unwanted noises cause distressing reactions.
In-site article

EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows

EquiDexFlow is an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from an object point cloud. By projecting contacts onto the object surface and forces into the Coulomb friction cone by construction, it ensures placement and friction compliance without loss penalties. Experiments show wrist residuals below 0.04° over 200 rotations, zero joint deviation, zero friction violations, and the best composite score among ablations. On a physical robot, retargeted grasps successfully complete open-loop pick-and-hold trials on all six test objects.

  • Jointly predicts kinematics and contact forces, eliminating downstream verification for stable grasps
  • SE(3)-equivariant flow-matching ensures rotational consistency
In-site article

Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

Existing slot-based video object-centric learning methods suffer from slot swapping due to encoding appearance and identity in a single slot vector. Dual-State Slot Attention (DSSA) separates these aspects into a local state (appearance) and an identity state, updated via a recurrent transition and competition-modulated aggregation. DSSA improves segmentation quality and temporal consistency on MOVi-C, MOVi-D, and YouTube-VIS.

  • DSSA decouples appearance and identity into separate slot states to avoid objective conflict.
  • Identity state is updated via a learned recurrent transition acting as a temporal filter.
In-site article

HairPort: In-context 3D-aware Hair Import and Transfer for Images

A 3D-aware framework for hairstyle transfer between images that handles large pose and scale differences using a Bald Converter and a 3D-Aware Transfer Pipeline.

  • Introduces HairPort, a 3D-aware hairstyle transfer framework.
  • Includes a Bald Converter that generates realistic bald versions using LoRA-based adaptation of FLUX.1 Kontext.
In-site article

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

This study analyzes observable patterns in latent reasoning models (LRMs) and finds that patterns like BFS frontiers and decodable arithmetic also appear in controls and do not always causally affect behavior. Causal interventions reveal graded utilization of latent thoughts, and geometric analysis shows effects concentrate in low-rank directions. The authors conclude that observable patterns alone cannot establish internal reasoning mechanisms, and LRM interpretability requires matched controls and causal tests.

  • Observable patterns in LRMs (e.g., BFS-like frontiers) also appear in control models lacking proposed recurrence or curriculum, challenging their use as evidence for internal reasoning.
  • Causal interventions show latent thought utilization is graded, scaling with its causal effect on behavior; geometric analysis reveals concentration in low-rank directions that become more structured with influence.
In-site article

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

This paper introduces a novel random-feature construction for Bernstein-Schur kernels, which are products of a finite-feature kernel and a completely monotone shift-invariant kernel. The proposed method combines sketched modulation with radial randomization, achieving linear feature dimension while providing rigorous theoretical guarantees including unbiasedness and operator-norm bounds. The approach is shown to improve efficiency in kernel ridge regression tasks, with a flagship instance being the biased yat-kernel.

  • Bernstein-Schur kernels generalize both shift-invariant and dot-product kernels, and are nonstationary.
  • The proposed random-feature construction avoids quadratic dimensionality by sketching modulation and sampling radial scale, achieving feature dimension Dm.
In-site article

Few-Shot Resampling for Scalable Statistically-Sound Data Mining

Evaluating statistical significance of data mining results typically requires thousands of resampled datasets, which is impractical for large-scale data. This paper introduces FewRS, a novel resampling approach that needs only an extremely small number of resampled datasets by deriving a new bound on the supremum deviation of test statistics. FewRS provides rigorous guarantees on false discoveries and achieves up to two orders of magnitude speedup on pattern mining and network analysis tasks while maintaining high statistical power.

  • Traditional resampling methods require thousands of resampled datasets, limiting scalability.
  • FewRS uses a novel bound to need only a few resampled datasets with strong false discovery guarantees.
In-site article

Extract Data with On-demand and Batch Pipelines Dynamically

This post demonstrates an intelligent document processing pipeline that consists of both on-demand inference and batch inference options on Amazon Bedrock to enable the flexibility on the document processing time and cost.

  • On-demand pipeline processes documents one-by-one in seconds, suitable for time-sensitive requests.
  • Batch pipeline processes large volumes asynchronously and is cost-optimized for non-urgent tasks.
In-site article
Models

Mistral AI seeks 3 billion euros to fund its European AI push

French AI startup Mistral AI is negotiating a new funding round of around 3 billion euros at a valuation of approximately 20 billion euros.

  • Mistral AI negotiating 3 billion euro funding round
  • Valuation around 20 billion euros
In-site article

AI economics reshape FinOps as enterprises seek greater visibility and control

As AI spending accelerates across the enterprise, organizations are grappling with a new generation of cost and optimization challenges while seeking greater AI spend visibility. The next phase of FinOps is increasingly focused on improving visibility and embedding financial accountability into everyday technology decisions.

  • AI spending growth drives need for better visibility and cost control.
  • FinOps expands beyond cloud to broader tech spending.
In-site article

Zyphra Release Zamba2-VL: Hybrid Mamba2–Transformer Vision-Language Models That Cut Time-to-First-Token by About an Order of Magnitude

Zyphra has released Zamba2-VL, a family of open vision-language models at 1.2B, 2.7B, and 7B parameters. The models use a hybrid Mamba2 state-space and Transformer backbone, shipping under Apache 2.0. They stay competitive with comparable Transformer VLMs while cutting time-to-first-token by about an order of magnitude.

  • Zamba2-VL models come in 1.2B, 2.7B, and 7B parameter sizes, all open source.
  • Hybrid architecture combining Mamba2 state-space layers with shared Transformer blocks enables near-linear-time prefilling.
In-site article

Gemini Omni: AI Video Generation Inside Gemini

Gemini Omni integrates video generation directly into the Gemini multimodal AI assistant, enabling users to create videos from text or images, animate static pictures, and edit existing videos. The article demonstrates its capabilities through hands-on tests, while noting limitations such as usage quotas, video length caps, and restrictive content policies.

  • Gemini Omni allows video generation from text or image without separate tools.
  • Supports three main use cases: image-to-video, text-to-video, and video editing.
In-site article

Sparse2Act: Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation

Sparse2Act is a pretraining framework that uses task-space end-effector actions as geometric supervision to align sparse point-cloud encoders with observations. It achieves 86.9% success on LIBERO-10, 73.4% cross-domain on Meta-World-5, and 72.5% on real-world tasks.

  • Sparse2Act pretrains sparse 3D encoders with action-aligned masked signals for reusability.
  • Reaches 86.9% average success on LIBERO-10 with only 500 fine-tuning steps.
In-site article

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

EWAM is a closed-loop online adaptation architecture built on a frozen Cosmos3 backbone. It uses four lightweight neural layers for inference-time co-reasoning, enabling zero-shot task adaptation without fine-tuning or extra demonstration data, significantly reducing the amount of deployment data needed for new task layouts.

  • EWAM builds on a frozen Cosmos3 backbone and employs inference-time co-reasoning with four neural layers: Neural Experience Memory, Neural Anomaly Detection, Neural Policy Routing, and Neural Action Correction.
  • It is evaluated under zero-shot protocol with no additional demonstrations or backbone fine-tuning; performance gains come entirely from the inference-time mechanism.
In-site article

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

FlowPilot is a mapless navigation policy for long-horizon sidewalk navigation using only a monocular RGB camera. It employs anchored flow matching for pre-training on large-scale robot fleet data and a human-in-the-loop preference learning scheme to improve social compliance and counterfactual reasoning. In simulation, it achieves 42% success rate and 66% route completion; real-world experiments show a 40.0% reduction in intervention rate and 52.1% reduction in non-intervention rate over the base model.

  • Anchored flow matching pre-trains FlowPilot on large-scale robot fleet data, capturing multimodal sidewalk navigation behaviors.
  • Human-in-the-loop preference learning fine-tunes the policy with minimal human intervention data, enhancing social compliance and counterfactual reasoning.
In-site article

Foresight: Iterative Reasoning About Clues that Matter for Navigation

Foresight is a test-time framework that uses a finetuned Vision-Language Model to iteratively propose and critique motion plans for mapless navigation from sparse language instructions. It learns a reward model from human feedback and post-trains the VLM with reinforcement learning, achieving 37% higher task success and 52% fewer interventions in real-world environments.

  • Foresight leverages pretrained VLMs to iteratively propose and critique image-space motion plans, focusing on instruction-relevant environmental cues.
  • A reward model learned from human feedback is used to post-train the VLM via reinforcement learning in the plan-critique loop.
In-site article

Action-Effect Memory Pretraining for Robot Manipulation

A new pretraining framework called AEM learns compact temporal representations from vision-action history for robot manipulation, outperforming baselines in simulation and real-world tasks.

  • AEM uses masked modeling on interleaved visual and action features to learn action-conditioned state evolution.
  • It employs a Mamba-encoded single-vector temporal bottleneck for efficient inference.
In-site article

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

This work shows that end-to-end imitation learning with vision-language-action (VLA) models can support collaborative manipulation. It identifies demonstration action leakage as a failure mode causing premature assistive behavior, and proposes an inference-time steering method. A 16-participant user study on a long-horizon assembly task demonstrates that steering enables longer execution horizons, faster collaboration, and fewer failures.

  • End-to-end VLA models enable implicit human-robot collaboration.
  • Action-chunking policies suffer from demonstration action leakage causing premature assistance.
In-site article

VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

VLADriveBench is a new framework to evaluate whether chain-of-thought (CoT) reasoning in vision-language-action (VLA) models is relevant, consistent, and causally connected to driving trajectories. It combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol. Applied to three models across two architectures, it reveals that observational and causal analyses can diverge sharply: ORION scores high on observational alignment but its CoT is epiphenomenal, while Alpamayo v1.5 scores lower yet its CoT is strongly causal, with visual salience gating the extent of CoT influence.

  • Existing benchmarks only evaluate trajectory quality, ignoring CoT-action connection.
  • VLADriveBench introduces observational metrics and an intervention protocol for complementary views.
In-site article

SalArt-VQA: Diagnosing Whether VLMs Understand Salient Artifacts in Generated Images

SalArt-VQA is a diagnostic benchmark for evaluating fine-grained understanding of artifacts in AI-generated images by vision-language models (VLMs). It includes 950 images and 3,681 multiple-choice questions covering presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification. Testing 20 VLMs revealed that even the best model, with 99.37% detection recall, answered all four artifact questions correctly on only 53.26% of images, highlighting a sensitivity-calibration tradeoff.

  • SalArt-VQA benchmark evaluates fine-grained VLM understanding of artifacts in AI-generated images.
  • Comprises 950 images and 3,681 multiple-choice questions across four question types.
In-site article

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

This paper proposes Efficient Continual Alignment (ECA) for incremental learning in open-ended image-to-text generation. By introducing continual alignment and three core mechanisms (Mixture of Query, Fisher Dynamic Expansion, Dictionary Replay), ECA mitigates catastrophic forgetting without accessing old data, achieving superior performance on new benchmarks.

  • Introduces continual alignment concept to handle shifting data distributions
  • Designs Mixture of Query module for task-specific features
In-site article

Context-Aware Feature-Fusion for Co-occurring Object Detection in Autonomous Driving

Proposes a novel framework called Context-Centric Feature Fusion (CCFF) to handle co-occurring object detection in autonomous driving, using Local Context Fusion Module (LCFM) and Global Context Attention Module (GCAM). Achieves Category-level Consistency Strategy (CCS) of 0.973 and 0.969 on Cityscapes and BDD100K, respectively, with a 14.1% improvement in small object detection AP_S and successful recovery of rare classes like 'Train'. The framework processes images in real-time with only 0.2 FPS overhead.

  • CCFF framework uses local and global attention modules to enhance co-occurring object detection
  • CCS scores of 0.973 and 0.969 on Cityscapes and BDD100K
In-site article

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Medical LVLMs are prone to factual inconsistencies and poor visual grounding. Existing alignment methods have three key limitations in the medical domain: sequence-level rewards treat clinically critical tokens equally, reliance on static SFT references causes off-policy shift, and alignment lacks visual grounding constraints. The proposed method uses a bidirectional token-wise KL regularizer and a visual-contrastive grounding objective, forming a fine-grained on-policy alignment framework that constructs preference pairs by minimally editing model outputs. Experiments validate its effectiveness.

  • Existing preference optimization methods in medicine suffer from sequence-level rewards, off-policy shift, and lack of visual grounding.
  • Proposed method combines bidirectional token-wise KL regularizer and visual-contrastive grounding.
In-site article

High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

Few-step diffusion distillation has become increasingly mature for 4-8-step generation, yet pushing further to 2 steps remains challenging. This paper introduces Z-Image Turbo++, a high-quality 2-step image generation model distilled from the 8-step Z-Image Turbo teacher using three key design choices: Distribution-Aligned Adversarial Learning, Step-Decoupled Parameterization, and End-to-End Training with Iterative Regularization. These designs substantially narrow the quality gap between 2-step and 8-step generation.

  • Proposes Distribution-Aligned Adversarial Learning using teacher-generated images as real samples for GAN training.
  • Adopts Step-Decoupled Parameterization with independent model parameters for each denoising step.
In-site article

Agent-based models for the evolution of morphological alternation patterns

This paper presents a multi-agent simulation explaining the emergence and persistence of morphological alternations like "go/went". Alternate forms arise from phonological changes or lexical variants and spread through population dynamics. To evaluate realism, the authors introduce the AI Historical Linguist, an LLM-driven system that simulates debates between linguists, comparing real and simulated morphologies. Results indicate scale-free networks and random Bernoulli adoption produce more plausible patterns. Three case studies model attested historical changes.

  • Multi-agent simulation reveals mechanisms behind morphological alternations such as "go/went".
  • AI Historical Linguist uses LLM-driven debate to assess realism of evolved morphologies.
In-site article

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

AfriSUD is the first large-scale collection of syntactically annotated treebanks for nine diverse African languages using the SUD framework. Evaluations reveal a significant syntax gap across models, highlighting limitations in capturing African language syntax.

  • AfriSUD covers nine African languages across major families and regions
  • Uses Surface-Syntactic Universal Dependencies framework, capturing agglutination and tone
In-site article

MentalMARBERT: Domain-Adaptive Pre-training and Two-Stage Fine-Tuning for Arabic Mental Health Disorders Detection

A new study proposes MentalMARBERT, a domain-adapted version of MARBERT, for detecting mental health disorders from Arabic social media text. Using a two-phase framework with adaptive pre-training and hierarchical fine-tuning, the model achieves state-of-the-art performance with 0.861 macro-F1 and 0.877 accuracy on a novel dataset of 50,670 tweets across six categories.

  • Arabic mental health NLP faces challenges due to dialect variation and limited resources.
  • The study introduces a two-phase framework: domain-adaptive pre-training followed by hierarchical two-stage fine-tuning.
In-site article

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

The Shopping Reasoning Bench is a new benchmark created by retail domain experts, consisting of 525 missions (232 single-turn, 293 multi-turn) and 10,863 importance-weighted binary rubrics. It evaluates multi-turn reasoning capabilities such as preference refinement, trade-off analysis, and compatibility assessment in conversational shopping assistants. Evaluations of top models (GPT, Claude, Gemini) show overall pass rates of only 57-77%, with significant degradation in multi-turn tasks, highlighting a gap in expert-level advice.

  • Shopping Reasoning Bench includes 525 expert-authored missions and 10,863 rubrics.
  • It covers five reasoning categories and fifteen subcategories essential for shopping conversations.
In-site article

Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation

This paper frames the transformation of abstract Persian proverbs into morally faithful stories as a constrained semantic decompression task. It introduces the Proverb Aligned Narrative Dataset (PAND) and a hybrid evaluation framework. Findings reveal a decompression gap: LLMs achieve fluency but fail to instantiate underlying moral structures. Explicit reasoning and iterative refinement partially mitigate this.

  • Introduces constrained semantic decompression task for evaluating LLMs' ability to generate stories from abstract proverbs.
  • Creates the Proverb Aligned Narrative Dataset (PAND) with proverb-story-meaning triples.
In-site article

MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

This paper introduces a reproducible labeling and evaluation protocol for mechanism-level drug-drug interaction (DDI) prediction, featuring a 7-family/147-subtype taxonomy and leakage-safe cold-split strategies. It also presents MARD-7B, a model trained with three innovations: single-token KL divergence, PRM-weighted DPO, and a mechanism-aware retrieval channel. On the April 2026 DrugBank release, MARD-7B is the only system among 32 that maintains accuracy under drug-pair novelty, outperforming the best baseline by +13.9 pp and GPT-4o by +6.7 pp at ~1% of frontier API cost. Anti-memorization analysis suggests gains stem from structured pharmacological reasoning.

  • Proposes a 7-family/147-subtype taxonomy and leakage-safe cold-split evaluation protocol for mechanism-level DDI prediction.
  • MARD-7B integrates single-token KL divergence, PRM-weighted DPO, and mechanism-aware retrieval for reasoning distillation.
In-site article

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

EDEN (Emergency Department Electronic Notes) is a new large-scale corpus of approximately 4 million fully anonymized clinical notes from Italian hospital emergency departments. A subset of about 6,000 notes has been manually annotated by clinical experts with 132 items relevant to dyspnea and loss of consciousness. The dataset aims to fill a gap in Italian clinical data to support the development of large language models in medical applications.

  • Contains approximately 4 million anonymized clinical notes
  • About 6,000 notes manually annotated with 132 items
In-site article

PermDoRA -- Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry

A common hypothesis in LLM modular design is that adapter interference arises from linear parameter overlap. This study tests it using DoRA-RBAC, finding that geometry-aware merging offers no consistent advantage over standard averaging, and orthogonality is a weak predictor, suggesting interference stems from shared nonlinear representations.

  • Evaluated DoRA-RBAC on LLama-3.1-8B and Mistral-7B across multiple QA benchmarks (GPQA, PubMedQA, SimpleQA, WMDP).
  • Geometry-aware Riemannian merging strategy showed no consistent advantage over standard Euclidean averaging in multi-domain settings.
In-site article

Loss Landscape Diagnosis for Gradient-Based Gray-Scott System Inversion: Disentangling the Roles of PINN Components

This study diagnoses the loss landscape by backpropagating through the PDE structure directly, finding that optimization failure arises from flat plateaus and sharp cliffs. When the neural network is fixed, the residual loss yields a smooth landscape, avoiding pathology, while the neural network only serves to complete observed data.

  • Direct backpropagation through Gray-Scott simulation fails to recover parameters; the loss landscape features flat plateaus and sharp cliffs aligned with bifurcation boundaries.
  • With the neural network fixed, the residual loss is quadratic and yields a smooth landscape, implicitly encoding the full PDE dynamics.
In-site article

Physics-informed generative AI for semiconductor manufacturing: Enforcing hard physical constraints in generative models by construction

This perspective argues that generative AI for physically constrained domains, such as semiconductor manufacturing, must embed physics into model architecture from the start rather than relying on post-hoc filtering. It surveys architectural approaches and proposes a research agenda centered on physics-fidelity benchmarks and differentiable simulators.

  • Generative models must obey hard physical constraints in semiconductor manufacturing
  • Architectures that enforce constraints by construction outperform those that filter post-hoc
In-site article

ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation

ProHiFlo is a novel hierarchical flow matching framework for de novo protein generation that employs coarse-to-fine generation, functional guidance via pretrained predictors, and an adaptive SE(3)-equivariant architecture. It achieves state-of-the-art performance with 4× fewer sampling steps and a 58.9% success rate on enzyme active site scaffolding, outperforming RFDiffusion (41.2%).

  • Coarse-to-fine generation models backbone geometry before refining to all-atom coordinates, reducing computational cost while maintaining accuracy.
  • Functional guidance uses pretrained predictors to steer generation toward desired properties without retraining.
In-site article

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

Activation steering can shift LLM behaviour, but standard evaluations don't test whether sycophancy reduction also suppresses factual agreement. The authors introduce dual-stance evaluation, finding that while sycophantic and factual agreement are in distinct subspaces, the steering direction projects equally onto both, reducing both. This reveals a gap: readable representations may not be writable.

  • Activation steering reduces sycophancy but also factually correct agreement.
  • Dual-stance evaluation tests both stances of each topic.
In-site article

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

This paper presents a deployment-centered evaluation of an LLM system integrated into electronic health records at an academic medical center. By training a pre-response classifier that uses query content and deployment-specific context (e.g., provider type, department, language model), the model predicts the risk of user rejection with an AUROC of 0.719 over 4.5 months of prospective analysis. The findings demonstrate the feasibility of predicting user rejection using deployment context, enabling targeted guardrails and abstention strategies.

  • Static benchmarks focus on correctness and require dense annotations; this work leverages sparse user feedback from real deployment.
  • A pre-response classifier predicts rejection risk using query content and deployment context (provider type, department, model).
In-site article

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Compact language models face challenges beyond isolated function calling when using tools. Evoflux uses evolutionary search at inference time to repair executable tool workflows, raising execution feasibility from 3% to 17-24% on MCP-Bench tasks, outperforming SFT and DPO baselines.

  • Small language models struggle with tool workflow dependencies and execution.
  • Evoflux evolves typed workflow graphs via structured edits and execution feedback.
In-site article

TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

TrajGenAgent proposes a hierarchical LLM agent framework for generating realistic synthetic human mobility trajectories without model fine-tuning. It uses a two-stage orchestrator-worker design: an LLM first synthesizes individual- and weekday-conditioned activity chains via in-context learning, then a deterministic workflow grounds each activity into a complete visit using personalized POI retrieval, distance-aware location selection, kinematics-aware travel-time propagation, and LLM-based duration estimation. An anomaly-detection-based evaluation framework assesses behavioral and semantic plausibility. Experiments show improvements in spatiotemporal fidelity, semantic coherence, and individual-specific behavioral realism over existing methods.

  • TrajGenAgent is a hierarchical LLM agent framework for generating human mobility trajectories without fine-tuning.
  • It employs a two-stage design: LLM synthesizes activity chains, and a deterministic workflow converts activities to visits.
In-site article

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

This study evaluates lie detectors for large language models by creating 13 reasoning model organisms with verified hidden beliefs and a Varied Deception testbed. Testing four detectors across 31 models reveals positive scaling with capability on prompted lying, but sharp drops on trained organisms except for the chain-of-thought judge. Current detectors cannot support high-confidence claims about model beliefs.

  • Created 13 reasoning model organisms with verified hidden beliefs to evaluate lie detectors.
  • Evaluated four detectors: chain-of-thought judge, logprob classifier, and two activation probes (including new Did-You-Lie method).
In-site article

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

PersonaDrive is a novel pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, enabling diverse driving styles without per-style retraining. It improves driving score by 4.6% over SimLingo and achieves top scores across all styles on Bench2Drive.

  • PersonaDrive uses a style-instructed human driving dataset and retrieval to condition a VLA agent on specific driving styles.
  • The pipeline has three stages: offline triplet mining, retrieval head training, and VLA backbone fine-tuning.
In-site article

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Pythagoras-Prover is a compute-efficient family of open-source Lean theorem provers, featuring autoregressive models (4B and 32B) and a diffusion-based prover (4B). It uses curriculum SFT with stratified data and dynamic proof filtering for training efficiency, and introduces Augmented Lean Formalisation (ALF) to expand verified corpora via self-distillation. The 4B model outperforms DeepSeek-Prover-V2-671B on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while the 32B model sets a new open-source SOTA at 93.0% and solves 93 PutnamBench problems.

  • Pythagoras-Prover includes autoregressive models at 4B and 32B parameters and a 4B diffusion-based prover that refines proofs iteratively.
  • Training efficiency is achieved via curriculum SFT with stratified difficulty levels and dynamic proof reasoning filtering within an 8k-token context.
In-site article

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Arbor is a multi-agent framework introducing structured tree search as a cognition layer for autonomous agents in large stateful action spaces. Validated on full-stack LLM inference optimization, it achieves up to 193% Pareto improvement in throughput-latency over vendor baselines, with a critic agent ensuring stability.

  • Arbor uses tree search as shared working memory across agents for coordinated optimization.
  • Achieves up to 193% throughput-latency Pareto improvement on full-stack LLM inference, hardware-agnostic.
In-site article

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Researchers propose ToolSense, an open-source diagnostic framework to evaluate how well large language models truly understand tools. It generates three benchmarks to expose a knowledge-retrieval dissociation: models that excel on standard ToolBench benchmarks often collapse by 50-64 percentage points on more realistic queries, sometimes falling below embedding-based baselines.

  • ToolSense is an open-source framework that audits parametric tool knowledge in LLMs.
  • It automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB), MCQ probing, and QA probing.
In-site article

Claude Fable is relentlessly proactive

Simon Willison details how Claude Fable 5 autonomously debugged a CSS scrollbar bug using numerous creative techniques, including writing test pages, injecting JavaScript, and building a CORS server. The session cost ~$12.11 and highlights both the power and danger of unsandboxed coding agents.

  • Claude Fable 5 autonomously debugged a CSS horizontal scrollbar bug using creative methods.
  • It wrote test HTML pages, used PyObjC for window info, injected JS for keyboard shortcuts, and built a custom CORS server.
In-site article

Three insights you may have missed from theCUBE’s coverage of Snowflake Summit 2026

The next wave of enterprise AI focuses on software and data infrastructure needed to make models useful in real businesses. Snowflake is positioning itself as a connector between proprietary data and AI models. Key insights include strong data foundations, security and governance frameworks, and the importance of trusted, governed intelligence for production AI.

  • Strong data foundations turn enterprise AI into business outcomes, as seen with DoorDash and Fanatics.
  • Enterprise AI requires new frameworks for security, governance, and trust, including practices from Tenable and Komodo Health.
In-site article

How ERGO Hestia reduced time-to-market with Lakebase and Mosaic AI Model Serving

ERGO Hestia modernized its real-time pricing engine with Databricks Lakebase and Mosaic AI Model Serving, bringing data, features, and decisions into one lakehouse-native platform for millisecond pricing, faster model deployment, and unified governance.

  • ERGO Hestia migrated its real-time pricing engine to a lakehouse-native platform, eliminating external databases and adapter layers.
  • The new architecture leverages Lakebase for online feature storage and Mosaic AI Model Serving for direct API access with millisecond latency.
In-site article

Making secret scanning more trustworthy: Reducing false positives at scale

GitHub reduced secret scanning false positives by 75.76% by introducing LLM-based contextual verification, improving alert trustworthiness and developer confidence.

  • GitHub collaborated with Microsoft Security & AI to use context-aware LLM reasoning for verification.
  • Instead of analyzing entire codebases, the system extracts high-signal context like API calls and authentication headers.
In-site article

Mercury 2, the first reasoning diffusion LLM, is now on Baseten

Inception's Mercury 2, a diffusion LLM, is now available on Baseten. It generates over 1,000 tokens per second, 5-10x faster than leading speed-optimized models, at half the cost with comparable quality. It enables real-time speed on standard NVIDIA GPUs without custom chips. Augment Code cut costs by 90% and latency by 82% using Mercury 2.

  • Mercury 2 is the fastest reasoning LLM, using diffusion to generate full output in parallel passes.
  • It runs over 1,000 tokens per second on standard NVIDIA GPUs, reducing costs and latency.
In-site article

LlamaIndex Newsletter 6-10-26

Major updates include ParseBench at CVPR 2026, Parse-Flow for visual document intelligence, Anthropic Fable 5 benchmark results, new Granular Bounding Boxes in LlamaParse, and The Agent Open pickleball tournament.

  • ParseBench debuts at CVPR 2026 as the first document parsing benchmark for AI agents.
  • Anthropic Fable 5 achieves 90.02% content faithfulness on ParseBench, leading competitors by 12+ points.
In-site article
Tools

Behind the scenes at OpenAI HQ: the Stephen Collins cartoon

A Guardian cartoon by Stephen Collins humorously depicts the inner workings of OpenAI headquarters, blending AI, life and style themes.

  • Stephen Collins' cartoon for The Guardian offers a satirical look behind the scenes at OpenAI HQ.
  • Themes include artificial intelligence, life and style.
In-site article

Qursor

Qursor lets you point at any UI element to send its exact context to your AI assistant, streamlining interactions. Launched on Product Hunt.

  • Point at UI to send context to AI.
  • Works with any interface.
In-site article

Bob's CLI

A local-first AI coding CLI that adapts to you.

  • Local-first AI coding CLI
  • Adapts to individual usage patterns
In-site article

How Preply combines AI and human tutors to personalize learning

Preply leverages OpenAI to offer AI-generated lesson summaries, personalized feedback, and language exercises, blending technology with human tutoring.

  • AI-generated summaries enhance lesson review
  • Personalized feedback tailored to learner progress
In-site article