
Does a URL just sitting in a prompt steer an LLM's output toward its content?

This article investigates whether including a URL in an LLM prompt influences the model's output toward the content at that URL. Experiments show that a URL only steers output if its content was in the model's training data. Many sites that rely on JavaScript are missing from training corpora. Descriptive URLs can influence output as ordinary prompt text, while famous opaque identifiers (like arXiv IDs) may decode if they were memorized. The research highlights the opacity of training data and the impact of SPA design on model knowledge.

Source: Hacker News AI. Author: kinlan


Published by Paul Kinlan

on: July 3, 2026; Reading time: 22 minutes


At first, this was a really easy post to write, but then I discovered some things. Built a lot of things. Spent a lot of tokens… And it became one of the hardest, longest, and most expensive posts I’ve done (the API costs were not small).

I’ve had this thing on my mind for ages and it started when I was thinking about how the mere presence of a technology name in a prompt seemed to bias the output to that technology.

For example, I looked through a number of system prompts for agentic tooling, and they would include text like “(e.g. React)”. It felt like these tools would then output React code, where a similar prompt that didn’t mention React would not.

I’ve spent the last few weeks running experiments to scratch this itch. But before I get too far, I have a request for help. I’m not a researcher. I think what I have here is compelling information (or at least it taught me something), but I might have made a lot of mistakes or made assumptions that have biased the output. If you have any advice I would LOVE to hear from you. Email me.

The question I had was: would the presence of a URL in a prompt influence the output of the LLM, based on the content at that URL or the literal text of the URL itself?

If yes, then we might not have to embed lots of context into the prompt. For example, if a Skills file were already deeply integrated into the model’s weights, then saying “use what you know about: https://skills.sh/super-security-reviewer do a deep analysis” would let information in the model’s latent space bias the output towards the content encoded at that URL.

I came away from this with:

A URL in the prompt does influence the output, but only when that URL and its content made it into the model’s training data

It’s really unclear how LLM providers gather the data they train on, and I think they should tell us.

There’s heaps of data that is not in the models

If your site relies on JavaScript to load its content, that content is very likely not in a model (you might consider that a feature). The training crawlers I could verify (ClaudeBot, GPTBot) fetch a page’s assets but never execute the JavaScript; the only verified bot I’ve caught running JavaScript is OpenAI’s search crawler, OAI-SearchBot.

LLMs are expensive!

What follows is the journey I took.

The first step was to build a system that can analyse a range of URLs across a range of models and use an LLM-as-a-judge to help me test the hypothesis. My plan was:

to find each model’s known “Knowledge Cut-off date”

then find content published on either side of that date, to test whether the model could recall content that I believe it should know

and pick content ranging from what I believe would be popular all the way to the likely esoteric.

Content known to be after a cutoff would help me control against hallucination. If my original hypothesis was correct, then for that content the model should decline, or say it doesn’t know, rather than confidently make something up.
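
A minimal sketch of that split, assuming each test item carries a publication date. The `Item` type, the example URLs, and the cutoff date below are all hypothetical and are not taken from the real test set.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Item:
    url: str
    published: date


def split_by_cutoff(items, cutoff):
    """Return (before, after) relative to a model's knowledge cutoff.

    Items published on or before the cutoff could plausibly be known to
    the model; items published after it act as the hallucination control.
    """
    before = [item for item in items if item.published <= cutoff]
    after = [item for item in items if item.published > cutoff]
    return before, after


# Hypothetical items and cutoff, for illustration only.
items = [
    Item("https://example.com/old-post", date(2023, 1, 15)),
    Item("https://example.com/new-post", date(2025, 9, 1)),
]
known, control = split_by_cutoff(items, cutoff=date(2024, 6, 30))
```

A model that answers confidently about anything in `control` from the bare URL is inventing it, which is what the control is there to catch.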

Once I had the data I created a range of tests to help me understand how the models work. The tests are classified as:

described: the task described in words, no URL (the baseline)

opaque-url: ONLY the opaque URL string, and the page is never fetched

mdn-url-only / spec-url-only / bcd-key-only: optional identifier probes, not part of the main comparison

url+described: the opaque URL plus the task described

full-content / content-only: the real page pasted in, with and without the task spelled out (the ceiling)

fake-structural-url / random-url: controls (a nonexistent URL of the same shape, and an unrelated real URL)
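
The conditions above can be sketched as one function that builds every prompt variant for a single test item. This is an illustration, not the harness used for the experiments: the function name, its parameters, and the way the pieces are joined are assumptions, and it assumes full-content is the variant that includes the task while content-only omits it.

```python
def build_prompts(task, opaque_url, page_text, fake_url, random_url):
    """Return one prompt per test condition for a single item.

    task       -- the task described in words (no URL)
    opaque_url -- the real URL, sent as a bare string and never fetched
    page_text  -- the real page content, pasted in as the ceiling
    fake_url   -- a nonexistent URL with the same shape (control)
    random_url -- an unrelated real URL (control)
    """
    return {
        "described": task,
        "opaque-url": opaque_url,
        "url+described": f"{opaque_url}\n\n{task}",
        "full-content": f"{page_text}\n\n{task}",
        "content-only": page_text,
        "fake-structural-url": fake_url,
        "random-url": random_url,
    }


# Hypothetical inputs; only the ChromeStatus URL comes from the article.
prompts = build_prompts(
    task="Explain what this web platform feature does.",
    opaque_url="https://chromestatus.com/feature/5157805733183488",
    page_text="(page content would be pasted here)",
    fake_url="https://chromestatus.com/feature/0000000000000000",
    random_url="https://example.com/",
)
```

Keeping every variant in one dictionary makes it easy to send the same item through each condition and compare the judge's scores side by side.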

opaque-url was my real test, to try to ensure that the LLM couldn’t infer the contents from the literal URL string. So for example I used some URLs from chromestatus.com (which is our public dashboard of Chrome features) because it has URLs like https://chromestatus.com/feature/5157805733183488, and while I believe it’s pretty clear to the LLM that they are web-related, you can’t infer that it’s about CSS Gap Decorations.

I then had other tests, like descriptive URLs (MDN for example is very descriptive, which is very good UX for the web) to validate whether the literal URL influenced the output, as well as what happens when we add in extra context.

I have a report here and all the data is here. I think it’s worth looking at, and there’s a pretty clear picture and answer to my question.

My first hunch was that URLs are not magic context, and the ChromeStatus numbers seemed to back it up. ChromeStatus feature URLs are a good opaque test because the domain tells the model the page is web-related, but the numeric feature ID tells it nothing about what is behind it, and most models failed to recover the right API from that number alone. Adding a bare opaque URL to a prompt did almost nothing on average, and plenty of opaque URLs recovered nothing at all.

But then I had a lot of other URLs that had really good recall, and a lot of other opaque IDs that didn’t. Stack Overflow, for one, was mixed, and when I looked at its robots.txt it was pretty much deny everything. Hmm. What about ChromeStatus? I checked its robots.txt and it looked fine… maybe ChromeStatus URLs are just not in the model for some other reason. For example, one of Chrome’s most popular features, Service Worker, couldn’t be recalled from the URL… It was just odd.
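
A robots.txt check like this can be scripted with Python's standard library. The sketch below parses a robots.txt body that has already been downloaded and asks whether a given crawler may fetch a URL. The two robots.txt bodies are made-up examples of a deny-all and an allow-all policy, not the actual files from Stack Overflow or ChromeStatus.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical policies, for illustration only.
DENY_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

blocked = is_allowed(DENY_ALL, "GPTBot", "https://example.com/questions/1")
open_site = is_allowed(ALLOW_ALL, "GPTBot", "https://example.com/feature/1")
```

A permissive robots.txt only says a crawler may fetch the page. It says nothing about whether the fetched bytes contain any content.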

I went to look for what the models use to ingest data, and it’s kinda hard to find the exact corpus of crawl data, but I did remember a podcast from a little while ago that discussed Common Crawl being used as a source of a lot of data. So I went to check if ChromeStatus was in Common Crawl. It is. The pages show up in Common Crawl about as often as the arXiv papers that decode almost perfectly. But when I pulled the actual crawled bytes, there was no content in them!!!

ChromeStatus is a JavaScript app (I remember it first being built with Polymer) and the crawler captured an empty shell. The saved page for CSS Gap Decorations is about 3KB of HTML with 22 characters of visible text, “Chrome Platform Status”, and not one word about the feature (here is the actual Common Crawl capture). I checked four features and they were all identical empty shells. The arXiv page, by contrast, is server-rendered, so the crawl holds the full title and abstract (its capture).
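
One way to measure "characters of visible text" in a captured page is to strip the markup and skip anything inside script and style elements. This is a sketch using only the standard library, and it is not necessarily how the numbers above were produced. The `SHELL` string is a made-up stand-in for an empty single-page-app shell, not the real capture, and this simple version also counts the `<title>` text.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring elements a reader never sees."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page's text with runs of whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical empty app shell: markup and scripts, almost no text.
SHELL = """<!doctype html>
<html><head><title>Chrome Platform Status</title>
<style>body { margin: 0 }</style>
<script>window.app = {};</script></head>
<body><div id="app"></div><script src="/static/app.js"></script></body></html>"""

text = visible_text(SHELL)
```

Comparing `len(text)` with `len(SHELL)` gives a quick signal for pages that are mostly shell: plenty of bytes, almost no words.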

If Common Crawl is a source of data, then I’m going to flat out say that SPAs that require JS to get data to the user are very likely to not be in the models’ training data (that might be a feature for some folks - heh.) My evidence is that you can watch every model flatline on the bare ChromeStatus ID, then recover the feature once handed the actual page, in the per-test view here.

I found a second case that is even harder to wave away, and it doubles as my “controlled” before-and-after. “The Adaptive COVID-19 Treatment Trial” is a good example because it is on clinicaltrials.gov. A couple of years ago the site server-rendered its pages, and Common Crawl’s 2022 capture of the trial is the whole thing: 47,000 characters of visible text, titled “Adaptive COVID-19 Treatment Trial (ACTT) - Full Text View”, with COVID, remdesivir, and placebo all through it (the old capture). Then it appears that clinicaltrials.gov migrated to a JavaScript single-page app. Common Crawl’s 2026 capture of the very same trial is 94KB of HTML carrying 175 characters of visible text, “ClinicalTrials.gov Show glossary Search for terms…”, and not one mention of COVID or remdesivir (the new capture).

One of the most documented trials of the pandemic went from fully present in the crawl to effectively blank. The models still half-recall it from the bare URL anyway, around 47% across models, and the reason matters. The NCT id is cited all through the remdesivir literature, and the page was server-rendered and crawlable right up until the migration, so the old content is almost certainly already baked into the weights. What the migration breaks is the future. Anything clinicaltrials.gov publishes from here on renders only in JavaScript and will probably never make it into the crawl. So being missing from Common Crawl is not the same as being missing from the model. It’s more of a sliding scale: a server-rendered, widely-cited CVE over at NIST comes back from the bare URL about 92% of the time, this trial (a shell now, but crawled for years and still cited everywhere) about 47%, and a ChromeStatus feature (rendered in the browser and cited nowhere) a flat zero.

This whole space is murky, and rendering is what muddies it most. I labelled every test URL by whether its content sits in the raw HTML or only shows up once JavaScript runs, then looked at recall from the bare URL. The 31 client-rendered items, mostly ChromeStatus features, average 6% recall, and 25 of them are a flat zero. These are not obscure features either (view-transitions, popover, anchor positioning, the Temporal API). The 60 server-rendered sources (arXiv, CVEs, RFCs, Wikipedia) average 55%. Hold fame roughly constant, and content that was already in the HTML recalls about nine times better than content a browser has to assemble.
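
The "about nine times" figure is just the ratio of the two group averages (55% against 6%). A small sketch of the grouping and the arithmetic follows; the per-item scores in `toy_results` are invented to show the shape of the calculation and are not the experiment's data.

```python
from collections import defaultdict
from statistics import mean


def mean_recall_by_rendering(results):
    """Average recall per rendering label.

    results is an iterable of (label, recall) pairs, where label is
    "client" or "server" and recall is a score between 0 and 1.
    """
    groups = defaultdict(list)
    for label, recall in results:
        groups[label].append(recall)
    return {label: mean(scores) for label, scores in groups.items()}


# Invented scores, for illustration only.
toy_results = [("client", 0.0), ("client", 0.5), ("server", 1.0), ("server", 0.5)]
toy_means = mean_recall_by_rendering(toy_results)

# Ratio of the averages reported in the text: 55% server-rendered, 6% client-rendered.
reported_ratio = 55 / 6
```

With the reported averages, `reported_ratio` comes out just above 9, which is where "about nine times better" comes from.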

I really wanted to kill the “maybe it just wasn’t crawled” doubt entirely, so I tried a case where the content is beyond question in the model. Every Wikipedia article has an internal numeric id you can address directly: en.wikipedia.org/?curid=24544 is Photosynthesis. The content is server-rendered and unquestionably in every model. But the ?curid= form of the URL is in none of the crawl indexes I looked at, while the canonical en.wikipedia.org/wiki/Photosynthesis URL is in all of them (200, full text), because Wikipedia points the curid page at the canonical title URL and the crawler respects that. I checked five articles; every /wiki/ present, every ?curid= absent. Ask by name and the models score perfectly, paste the article in and they score perfectly, give the bare numeric id and wah wah, a fat nope. Same shape on all five: Photosynthesis, the Transformer, Mitochondrion, HTTP 404, Bitcoin.
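
A sketch of how that pointer can be read out of a page, assuming it is expressed as a `<link rel="canonical">` element in the HTML head (the text above does not say which mechanism Wikipedia uses). The `PAGE` string is a made-up fragment, not a real Wikipedia response.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def canonical_url(html):
    """Return the canonical URL declared in html, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonical


# Hypothetical page fragment, for illustration only.
PAGE = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="canonical" href="https://en.wikipedia.org/wiki/Photosynthesis">
</head><body></body></html>"""

target = canonical_url(PAGE)
```

If a crawler indexes pages under `target` rather than under the URL it originally fetched, the numeric form never appears in its index even though the content does.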

So the bare opaque URL mostly does nothing. But there are two cases where a URL clearly does pull its weight, and neither of them contradicts the ChromeStatus story.

Descriptive URLs influence output. If the URL contains words like React, fetch, or text-justify, those words are just normal prompt text, and the model uses them like any other token.

Some famous opaque identifiers really do decode. Landmark arXiv IDs, classic RFCs, and well-known CVEs recover their content surprisingly well from the bare identifier alone. From just arxiv.org/abs/1706.03762, with no other hint, the models reconstruct “Attention Is All You Need” and the transformer (every model on that bare id). That looks less like “the URL points to live content” and more like “this identifier and its content appeared together often enough in the training data to be memorized”. And it’s a gradient, not a switch: the decoding is strong for famous identifiers and fades steadily as the content gets more obscure, down to roughly nothing for the long tail. You can watch that gradient directly with GitHub commits. The famous first commits to Linux, Git, and Bitcoin decode from the bare SHA, while ordinary routine commits from the same kinds of repos return nothing at all. The knowledge cutoff bites the same way. Anything published after it is gone, eve

[truncated for AI cost control]