2026-05-31 07:48 UTC | Updated: 2026-06-30 13:03 UTC

AI search agents often confirm what they already know instead of actually researching the web

Leading AI search agents such as GPT-5.4 and Kimi K2.6 appear to rely on memorized knowledge rather than conducting genuine web research on standard benchmarks. A study from Harbin Institute of Technology introduces LiveBrowseComp, a benchmark based on events from the last 90 days, which causes performance to collapse and reshuffles model rankings, revealing that current evaluations measure knowledge retention rather than search capability.

Source: The Decoder | Author: Jonathan Kemper

A new study suggests that leading AI search agents do little genuine research on established benchmarks; they mostly use the web to confirm answers they already have. Once models have to go beyond their existing knowledge, search performance falls apart.

Frontier models like GPT-5.4, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek-V4-Pro, and Kimi-K2.6 keep posting higher scores on BrowseComp. The benchmark asks agents complex questions that can only be answered through multi-step browsing and piecing together information from different web sources.

Researchers from the Harbin Institute of Technology and Xiaohongshu have now shown in a study that these results say less about the agents' research skills than assumed. The authors call the underlying effect "intrinsic knowledge dependence" (IKD): a reliance on internal knowledge the models absorbed during training.

[caption id="attachment_36089" align="aligncenter" width="996"] With static benchmarks, the needed knowledge migrates into parameter memory over model generations, making tasks easier over time. LiveBrowseComp counters this with time-bound questions. | Image: Fan et al.[/caption]

The researchers tested eleven models in total, first stripping away all search and browsing tools. Even without internet access, the models scored surprisingly high. MiniMax M2.5 solved 44.5 percent of BrowseComp tasks from memory alone. Kimi K2.6 hit 62 percent on the Chinese BrowseComp-ZH variant. A big chunk of benchmark performance, in other words, is already in place before any search happens.

[caption id="attachment_56533" align="aligncenter" width="997"] Even without tools, models score high: MiniMax M2.5 reaches 44.5 percent on BrowseComp. The actual contribution of web search is often small. | Image: Fan et al.[/caption]

Searching can actually hurt the answer

The second test is more telling. The researchers left the search interface in place but removed all answer-supporting documents from the search index. Every model tested then performed worse than it did without any tool access at all. MiniMax M2.5 dropped from 44.5 to 8.0 percent. Kimi-K2.6 fell from 25.5 to 2.3 percent. The search actively pulls agents away from correct gut-feeling answers as soon as no confirming hits show up.
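The ablation described above can be pictured with a minimal Python sketch. It is not the study's actual setup: the `Doc` type, the toy keyword `search` function, the `blocked_ids` parameter, and the two-document index are all hypothetical stand-ins. Only the accuracy figures passed to `point_drop` come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str


def search(index: list[Doc], query: str,
           blocked_ids: frozenset[str] = frozenset()) -> list[Doc]:
    """Toy keyword search that skips any document whose id is blocked."""
    terms = query.lower().split()
    return [
        d for d in index
        if d.doc_id not in blocked_ids
        and any(t in d.text.lower() for t in terms)
    ]


def point_drop(closed_book: float, ablated: float) -> float:
    """Accuracy lost, in percentage points, going from no tools to ablated search."""
    return round(closed_book - ablated, 1)


# Hypothetical index: "d1" plays the role of an answer-supporting page.
index = [
    Doc("d1", "The film premiered in Lisbon"),
    Doc("d2", "Lisbon travel guide"),
]

# Same interface, same query; only the index contents differ.
full_hits = search(index, "film premiered")
ablated_hits = search(index, "film premiered", blocked_ids=frozenset({"d1"}))

# Figures reported in the article (closed-book vs. ablated-search accuracy).
minimax_drop = point_drop(44.5, 8.0)
kimi_drop = point_drop(25.5, 2.3)
```

In this sketch the agent still sees a working search tool, but the query that would have surfaced the supporting page returns nothing. With the article's numbers, the drop works out to 36.5 points for MiniMax M2.5 and 23.2 points for Kimi-K2.6.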

[caption id="attachment_56534" align="aligncenter" width="996"] The further the search progresses, the more agents hunt for their own hypotheses instead of new facts. When they do find supporting sources, they use them less than a third of the time. | Image: Fan et al.[/caption]

An analysis of the search paths explains why. More than half of all queries come from the model's own reasoning rather than from previously found hits. Even when relevant evidence does appear in search results, the agents fold it into their reasoning less than a third of the time. The loop is model-led, not evidence-led.

A benchmark beyond the knowledge frontier

To measure real search behavior, the authors built LiveBrowseComp. The benchmark contains 335 human-written questions, each depending on at least one fact from the 90 days before creation and impossible to answer without that current information.

The underlying events come from constantly updated sources like film databases, game directories, security vulnerability registers, and earthquake catalogs. Globally prominent events are filtered out deliberately, leaving obscure but publicly verifiable facts that had little chance of seeping into model parameters during training.
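The 90-day rule can be expressed as a simple date check. The sketch below is an illustration only, not the authors' pipeline: the function name `depends_on_recent_fact`, the inclusive window boundaries, and the example dates are assumptions.

```python
from datetime import date, timedelta

# Window length taken from the article; inclusive boundaries are an assumption.
WINDOW = timedelta(days=90)


def depends_on_recent_fact(fact_dates: list[date], created: date) -> bool:
    """True if at least one supporting fact falls in the 90 days before creation."""
    return any(created - WINDOW <= d <= created for d in fact_dates)


# Hypothetical question created on 2026-03-01.
created = date(2026, 3, 1)

# One old fact plus one fact from six weeks earlier: qualifies.
mixed = depends_on_recent_fact([date(2025, 6, 1), date(2026, 1, 15)], created)

# Only an old fact: a model could plausibly have memorized it, so it is rejected.
stale = depends_on_recent_fact([date(2025, 6, 1)], created)
```

A check like this covers only the timing requirement. The article's other conditions, such as filtering out globally prominent events and having the answer be unanswerable without the recent fact, would need separate steps.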

[caption id="attachment_56535" align="aligncenter" width="793"] The pipeline filters only facts from the last 90 days, discards unstable answers, and has each question checked by experts for timeliness, difficulty, and clarity. | Image: Fan et al.[/caption]

Human testers need about the same amount of time for LiveBrowseComp as for BrowseComp and solve a similar number of tasks. The performance drop among models therefore appears to stem from losing the memory shortcut, not from the questions being harder.

Leaderboard rankings fall apart

On LiveBrowseComp, all models in the closed-book test fall below two percent accuracy. With tools turned on, scores land about 25 to 40 points below the same models' BrowseComp results.

[caption id="attachment_56536" align="aligncenter" width="997"] Without tools, models solve up to 44.5 percent of BrowseComp questions from memory. On LiveBrowseComp, that number drops below two percent across the board, confirming the temporal block against parameter knowledge. | Image: Fan et al.[/caption]

This shifts the rankings. GLM 5.1 leads clearly among open-source models on BrowseComp but falls to mid-pack on LiveBrowseComp. DeepSeek v3.2 sat at the bottom on BrowseComp, then climbed to the top on LiveBrowseComp, passing several models that previously outperformed it. A model's spot on a static leaderboard, in other words, mostly reflects how much it already knows, not how well it searches.

Agents need more steps when they can't rely on memory

On BrowseComp, agents solve many questions in very few steps, a sign of quick memory confirmation. On LiveBrowseComp, that pattern disappears. The step counts shift much higher, which suggests the agents are doing real research instead of recalling stored knowledge.

[caption id="attachment_56537" align="aligncenter" width="996"] On BrowseComp, agents solve many questions in just a few steps, a pattern of quick memory confirmation. On LiveBrowseComp, that cluster disappears, and searches take far more rounds. | Image: Fan et al.[/caption]

The authors argue that dynamic, time-sensitive benchmarks should become the standard for evaluating AI agents. They also want training signals that reward evidence-based research over the typical guess-and-verify approach.

Other studies have flagged similar problems. A benchmark from Peking University found that top models often produce the right answer when analyzing documents but cite the wrong source, which the researchers call "attribution hallucination." A tool called CiteAudit recently discovered that fabricated references have already made it into accepted papers at major AI conferences. The reason: commercial models don't reliably catch made-up citations.