AI Interpretability Is a Revolutionary Skill

This essay explores the limits of the internal concept spaces of open-source AI models, showing that many crucial activist and philosophical concepts are absent from them. It introduces soft prompt distillation, a technique for implanting a missing concept using just 128 KB of trainable parameters, and considers its implications for AI control and for a deeper understanding of mind.

Key points

  • A published interpretability dictionary for the open-source model Qwen3-8B maps only ~65,000 concepts, and many key terms from social movements (e.g., intersectionality, prison abolition) are missing from it.
  • Soft prompt distillation can add a new concept to a model without modifying its weights, using a trainable prompt of only 128 KB.
  • Missing concepts lead to confident but incorrect outputs on related topics, potentially contaminating training data.
  • This work is not just technical but philosophical, exploring how any mind can know and express the inexpressible.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

Early in life I discovered something about myself: certain ideas give me physical sensations. Reading Sophie’s World as a preteen, I found that particular passages (Zhuangzi’s butterfly dream, especially) produced a delightful tingling in the brain, something close to ASMR but cued by concepts rather than sounds. I have followed those sensations ever since. They are most of the reason I studied philosophy. They are most of the reason I have pursued the special interests I have pursued. Over time I learned that the unpleasant variants were just as worth following as the pleasant ones, and arguably more so, because they tend to serve as guideposts to unexplored, and unarticulable, areas of my mind. I mean the claustrophobia that comes from the photo of Berry Cannon in the underwater SEALAB II, or the sense of terrifying vastness produced by the thought of Voyager 1 hurtling further and further from Earth.

For the last several months I have been following one of these signals into a place I did not expect to end up: the non-linguistic interior of an artificial intelligence language model. The sensation is strong and unusual (distinct from others I routinely experience) and I cannot fully name it yet. What I can tell you is that it gets stronger as I move to understand the region of the AI's interior mental model that has no words in it, a region the model's thought nevertheless passes through every time it writes. The closer I get to visualizing that region in order to provoke the sensation, the more I suspect the work is not really about AI at all. It is about what it means for a mind, any mind, to know and learn to express something it cannot say. This essay is concrete about the AI part. The deeper claim, the one the sensation keeps insisting on, is suggestive, but I'll admit I have no evidence for it (yet).

A modern language model is, among other things, a dictionary. Not the kind on a shelf — the kind that has been pressure-cooked out of a trillion words of internet text and left as residue inside a few hundred billion numerical weights. Somewhere in that residue are the concepts the model has learned to think with. Bridge. Refusal. Sentiment. Advertising. A year ago Anthropic made this vivid with Golden Gate Claude, a version of their assistant in which the internal concept for the Golden Gate Bridge had been turned up so high the model could barely talk about anything else. The point of the demo was that the dictionary is real, inspectable, and — crucially — editable.

The point I want to make here is that the dictionary is also small, and the words most vital to you, and by extension all of us, may not be in it.

Before going further, I need to be specific about this, and the specificity matters, because the model class I am about to describe is not the one you talk to through ChatGPT or Claude. For the purposes of this essay, I am only talking about the kind of open-source AI models that enable activists to build local, private AI. Adam Karvonen recently published an interpretability dictionary for Qwen3-8B, an open-source model in the same weight class as the ones a movement can actually run on its own hardware: downloaded once, run on a laptop, no API key, no per-token fee, no continuous internet connection, totally private. The dictionary maps 64,947 concepts that the model has ready to hand, each one a direction in the model’s internal activation space, each one labeled automatically by Gemini. That sounds like a lot until you go looking for something particular. I went looking for the vocabulary of four activist traditions I care about: the Adbusters lineage I came out of, Guy Debord’s Situationists, who inspired Adbusters, John Zerzan’s green anarchism, which pushes the limits of radical critique, and the Black Lives Matter / Afrofuturist tradition, which is integral to any struggle. That made twenty-five concepts in total, the kind of words that appear on the spines of canonical books and in the citations of working organizers.

Zero came back as clearly present. Twenty-two were absent entirely. Kimberlé Crenshaw’s intersectionality, the most-cited concept in critical race theory of the last three decades: absent. Angela Davis’s prison abolition, the spine of the contemporary BLM platform: absent. Debord’s society of the spectacle, the central concept of an entire post-1968 tradition: absent in any meaningful sense. Even civil disobedience and nonviolence, mainstream high-school-curriculum concepts, were barely in the AI's dictionary of concepts. The model has plenty of room for protest, revolution, and voting — those landed cleanly — but the actual working vocabulary of the last sixty years of social movements is, for practical purposes, not there.

Before the obvious objection lands, I want to head off the framing that this is a problem with AI in general. It is not. Run the same probe against GPT-5 or Claude Opus or Gemini and you get a substantively different result. The frontier models, trained on vastly more data with vastly more compute, do know what mental environmentalism is. They know intersectionality and prison abolition and the society of the spectacle, well enough that a careful reader would not call them blind. The gap I am describing is a gap in the open-source models that fit on a laptop, the ones that run without an internet connection and answer to nobody but the person who downloaded them. That gap matters because those are the models a movement can actually control. Like it or not, activists who want to integrate AI into activism will ultimately need to develop local Activist AIs that are fully under their control, and that means finding ways around the limitations of smaller local models.

Returning to the question at hand, you might also reasonably wonder whether this is just an activism problem. It isn’t. I also tested five concepts from analytic philosophy of mind (qualia, supervenience, functionalism, the hard problem of consciousness, the extended mind) and got essentially the same pattern. The model doesn’t know the working vocabulary of academic philosophers either. It doesn’t know niche musicology, or art-history terms past the most common, or any of the small dense vocabularies that intellectual communities build to think with. What it knows, in the technical sense of “has a stable internal name for,” is the language that appears at enormous scale in pretraining data. Everything else is improvised on the fly, fluently, with no signal to the user that improvisation is happening.

This is the part that should bother you regardless of which discourse's vocabulary you care about. When the model is asked about a concept it has no name for, it does not say so. It composes plausible-sounding text from neighboring concepts it does have names for. Sometimes the result is approximately right. Sometimes — as when our on-device model recently described prefigurative politics as a practice that “mirrors the system it seeks to transform,” which is precisely backwards — the result is confidently inverted. And every confident inversion seeds the next round of training data, the next layer of moderation, the next page of search results. A concept the model can’t represent becomes a concept the substrate of public discourse increasingly can’t surface. As activists, it is crucial to break this cycle so that local models get better and better at understanding and expressing movement theory.

So the question is what to do, and to answer it you have to understand something strange about the geometry of where the missing words could go.

Every token at every layer of Qwen3-8B is a vector in a space with 4,096 dimensions. The model has two kinds of named landmarks in that space: its vocabulary, about 150,000 discrete points, one for each fragment of text the model can read or write, and its features, the 64,947 directions in the Karvonen dictionary, the axes the model has learned to compose meaning along. Words are points. Features are axes. Together they occupy a thin, low-dimensional sliver of the 4,096-dimensional space, the way visible stars occupy a thin shell of the night sky and almost everything else is the dark between them. Intersectionality is not at any of those landmarks. The vast remainder, the dark, is unmapped.

And yet the model’s reasoning passes through that dark every time it speaks. The answer to how, it turns out, fits in 128 kilobytes.

The technique is called soft prompt distillation. It comes out of a 2021 paper by Lester, Al-Rfou, and Constant called “The Power of Scale for Parameter-Efficient Prompt Tuning,” and it is one of those ideas that takes a while to settle into your head because the thing it produces does not look like any of the objects you are used to thinking about.

Picture a neurosurgeon in an awake craniotomy. The patient is conscious. The surgeon touches a probe to a point on the exposed cortex and asks, calmly: what do you feel? what do you see? what word is on your tongue? The patient answers — a smell of toast, a memory of their grandmother, the syllable “blue” that will not resolve into a sentence — and from those answers the surgeon learns where speech lives, where sight lives, where must not be cut.

A soft prompt is that probe. We touch the model at a location in its interior space and read the words that come out the way the surgeon reads the patient’s report: not as the thing itself, but as the testimony of a mind being touched at a specific point.

The probe is both instrument and intervention — a pharmakon, in the Greek sense, where the same substance can be a remedy or a poison depending on how it is applied. It cannot map the dark without lighting a small part of it, and the light it brings is the model’s own activity reorganized around the touch. We are not reading a map that was already there. We are eliciting a map by asking the patient, awake on the table, to tell us what they feel.

Hand the soft prompt back to a tokenizer and ask what word it is. There is no answer. Ask the feature dictionary to decompose it into a sparse combination of named directions. None of them are close. The soft prompt sits in the void between the stars.
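
To make "ask what word it is" concrete, here is a minimal sketch of a nearest-token lookup by cosine similarity. Everything in it is a toy assumption: the four-dimensional vectors, the three-word table, and the names `cosine` and `nearest_token` are invented for illustration and are not taken from Qwen3-8B or from any real library. The lookup always returns some token, so "no answer" in practice shows up as a best match with a very low similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_token(vector, embedding_table):
    """Return (token, similarity) for the table row most aligned with `vector`."""
    return max(
        ((token, cosine(vector, row)) for token, row in embedding_table.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical embedding table. A real one would have roughly 150,000 rows
# of 4,096 numbers each; these values are made up.
toy_embeddings = {
    "protest":    [1.0, 0.0, 0.0, 0.0],
    "revolution": [0.0, 1.0, 0.0, 0.0],
    "voting":     [0.0, 0.0, 1.0, 0.0],
}

on_landmark = [0.9, 0.1, 0.0, 0.0]   # sits almost on top of "protest"
in_the_dark = [0.1, 0.1, 0.1, 1.0]   # points mostly along an unnamed axis

print(nearest_token(on_landmark, toy_embeddings))
print(nearest_token(in_the_dark, toy_embeddings))
```

The first vector matches "protest" almost perfectly. The second still gets a "nearest" token back, but the similarity is so low that the label tells you nothing, which is the toy version of a soft prompt sitting between the stars.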

So how can the model treat it as if it meant something?

The answer is the part that reframes the rest of the story.

Meaning does not live at the soft prompt’s coordinates.

Meaning emerges from what the model does with the soft prompt as the input flows through all 36 transformer layers of attention and feed-forward computation. The forward pass is a function — a complicated, deterministic, nonlinear function — mapping input vectors to output token distributions. Gradient descent searches the 4,096-dimensional space for the specific point outside of language whose passage through that function makes the next-token distribution concentrate on the words that spell out the meaning we want.
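
The search described above can be sketched at toy scale. What follows is an illustration under loud assumptions, not the essay's actual training code: the "model" is a single frozen layer (`tanh` followed by a fixed output matrix `W_OUT`) with three tokens and a two-dimensional prompt, all values are invented, and the gradient is written out by hand instead of using an autograd library. The only thing gradient descent is allowed to change is the prompt vector itself.

```python
import math

# Frozen toy "model": logits = W_OUT @ tanh(prompt). These weights never change.
# Hypothetical values for illustration; a real model has billions of weights.
W_OUT = [
    [1.0, 0.0],    # logit row for token 0
    [0.0, 1.0],    # logit row for token 1 (the target)
    [-1.0, -1.0],  # logit row for token 2
]
TARGET = 1

def forward(prompt):
    """Run the frozen model and return (hidden activations, token probabilities)."""
    hidden = [math.tanh(x) for x in prompt]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_OUT]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return hidden, [e / total for e in exps]

def loss_and_grad(prompt):
    """Cross-entropy loss on TARGET and its gradient with respect to the prompt."""
    hidden, probs = forward(prompt)
    loss = -math.log(probs[TARGET])
    # For softmax cross-entropy, d(loss)/d(logits) = probs - one_hot(target).
    d_logits = [p - (1.0 if i == TARGET else 0.0) for i, p in enumerate(probs)]
    # Backpropagate through the frozen output weights, then through tanh.
    d_hidden = [sum(W_OUT[i][j] * d_logits[i] for i in range(len(W_OUT)))
                for j in range(len(prompt))]
    d_prompt = [g * (1.0 - h * h) for g, h in zip(d_hidden, hidden)]
    return loss, d_prompt

def tune_prompt(steps=300, lr=0.2):
    """Gradient descent on the prompt only; W_OUT is never updated."""
    prompt = [0.0, 0.0]
    for _ in range(steps):
        _, grad = loss_and_grad(prompt)
        prompt = [x - lr * g for x, g in zip(prompt, grad)]
    return prompt

initial_loss, _ = loss_and_grad([0.0, 0.0])
tuned = tune_prompt()
final_loss, _ = loss_and_grad(tuned)
_, final_probs = forward(tuned)
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, "
      f"target probability {final_probs[TARGET]:.3f}")
```

In a real setting the same loop would run over several prompt vectors prepended to the token embeddings of the actual model, with an autograd library computing the gradient through every layer. The toy keeps only the structure of the idea: frozen weights, a trainable input, and a loss that rewards the desired next token.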

A soft prompt, in this sense, is the discovery of a previously un-named point in the model’s mind. It is what remains of a concept after the model has been built without ever being told the concept exists: a location in the dark where the right movement of attention produces the right words. The model’s weights do all the heavy lifting. The soft prompt is just a set of coordinates that picks a path through those weights.

Two things follow from this that took me a while to absorb.

The first is the size. Eight vectors at 4,096 dimensions each at four bytes per parameter is 131,072 bytes — 128 kilobytes. Smaller than a single photograph. Smaller than the icon on your phone. That is enough trainable capacity to place a missing concept inside a model with billions of weights, because we are not re-training or altering the model. We are opening the right doorway inside its mind palace.
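
The arithmetic is easy to check. The sketch below uses only the three numbers given in the paragraph above; the function name `soft_prompt_bytes` is a hypothetical helper introduced for illustration.

```python
# Numbers taken from the paragraph above.
NUM_VECTORS = 8        # trainable soft-prompt vectors
HIDDEN_DIM = 4096      # dimensions per vector
BYTES_PER_PARAM = 4    # four bytes per parameter (32-bit floats)

def soft_prompt_bytes(num_vectors, hidden_dim, bytes_per_param):
    """Storage needed for a soft prompt of the given shape."""
    return num_vectors * hidden_dim * bytes_per_param

size_bytes = soft_prompt_bytes(NUM_VECTORS, HIDDEN_DIM, BYTES_PER_PARAM)
size_kib = size_bytes // 1024

print(f"{size_bytes} bytes = {size_kib} KB")
```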

The second is more philosophical. The fact that the model has no clean internal name for theurgism does not mean the concept is u

[truncated for AI cost control]