The Sequence Knowledge #878: Beyond Transformer: What We Learned
This article concludes the series on alternatives to the Transformer, covering four families: recurrent/linear-recurrent models, state space models, text diffusion models, and liquid/continuous-time models. It also announces a new series on knowledge distillation.
Today, we bring you a summary of our series on Transformer alternatives.
For the better part of a decade, the entire field has been a giant, spectacularly funded wrapper around a single operation: self-attention. The Transformer didn’t win because it was the most elegant or the most brain-like design. It won because it had the best scaling story and it won the hardware lottery. Every token looks at every other token, the whole thing maps cleanly onto a GPU grid, and you train it all at once. Add data, parameters, compute, context — and the loss curve cooperates. That smoothness is rare. Most clever ideas in deep learning never become industrial. This one did.
But the tax was always there in plain sight. Self-attention buys you something genuinely valuable — perfect, lossless recall over the entire context, with every token able to address every other token directly, and a training pass that parallelizes across the whole sequence at once. That’s the benefit, and it’s a real one. The cost is that attention scales quadratically with sequence length, and autoregressive decoding drags around a KV-cache that grows linearly with every token you’ve already seen. When you’re pushing past a million tokens, or watching a 70B model’s cache eat 40GB of VRAM, O(n²) compute and O(n) memory stop being footnotes and become the actual bill. So the interesting question was never “are Transformers good?” They’re spectacular. The question is whether they’re the final architecture or just the first truly scalable one — soon to be absorbed into something richer.
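To put rough numbers on that bill, here is a back-of-the-envelope sketch in Python. The layer count, head counts, and head size describe a hypothetical 70B-class shape with 16-bit cache entries. They are assumptions chosen for illustration, not the specification of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value vector.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


def attention_score_count(seq_len):
    """Pairwise scores per head per layer in full self-attention."""
    return seq_len * seq_len


# Hypothetical 70B-class shape (assumed): 80 layers, 64 KV heads,
# head_dim 128, 16-bit values, one sequence of 16,384 tokens.
cache_mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128,
                           seq_len=16_384)
print(f"KV cache, 64 KV heads: {cache_mha / 2**30:.1f} GiB")

# Same shape with 8 shared KV heads (grouped-query attention, also assumed).
cache_gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                           seq_len=16_384)
print(f"KV cache,  8 KV heads: {cache_gqa / 2**30:.1f} GiB")

# Doubling the context doubles the cache but quadruples the score matrix.
print(attention_score_count(32_768) / attention_score_count(16_384))
```

Under these assumptions the cache reaches 40 GiB at 16K tokens with one KV head per query head, and 5 GiB with eight shared KV heads. Either way it keeps growing linearly with every token decoded, while the number of attention scores grows with the square of the context.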
That was the thesis we set out to test, and the cleanest way to read the eight issues is as four families, each making a different bet against attention.
The first family is recurrent and linear-recurrent models — the RNN comeback and xLSTM. Their pitch is constant memory: instead of a cache that grows forever, they carry a fixed-size hidden state and pay O(n) compute over a sequence rather than O(n²). The classic objection was that RNNs train serially and can’t saturate a GPU, but the modern variants reformulate the recurrence so it parallelizes during training while staying cheap at inference. The benefit is brutally efficient generation; the open challenge is whether a fixed-size state can hold enough to match attention’s exact recall on long-range, retrieval-heavy tasks.
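The parallel-training trick is easiest to see on a toy case. The sketch below uses a scalar linear recurrence, h_t = a_t * h_(t-1) + b_t, which is our simplification and not the actual xLSTM cell. Because composing two such steps is associative, all prefixes can be computed in a logarithmic number of rounds, and the updates within each round are independent of one another. The function names are illustrative, and the plain Python lists stand in for what a real kernel would do on a GPU.

```python
def sequential_scan(a, b, h0=0.0):
    """Inference-style recurrence: one step per token, one number of state."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def combine(left, right):
    """Compose two steps h -> a*h + b (left applied first, then right)."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def parallel_style_scan(a, b):
    """Training-style prefix scan (Hillis-Steele), assuming h0 = 0.

    Each round reads only the previous round's values, so every position
    could be updated at the same time on parallel hardware.
    """
    elems = list(zip(a, b))
    n = len(elems)
    step = 1
    while step < n:
        elems = [elems[i] if i < step else combine(elems[i - step], elems[i])
                 for i in range(n)]
        step *= 2
    return [b_t for _, b_t in elems]


a = [0.5, 0.9, 0.1, 0.7, 0.3]   # per-token decay (gate) values, arbitrary
b = [1.0, 2.0, 3.0, 4.0, 5.0]   # per-token inputs, arbitrary
seq = sequential_scan(a, b)
par = parallel_style_scan(a, b)
print(all(abs(s - p) < 1e-12 for s, p in zip(seq, par)))
```

The two functions compute the same hidden states. The first is what generation looks like, with memory that does not grow with the sequence; the second is the shape of the computation that lets training use the whole sequence at once.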
The second family is state space models: the SSM/Mamba line, the most serious challenger of the bunch. SSMs model a sequence as a linear dynamical system, discretized from a continuous-time one. For the time-invariant versions, that gives a near-magical dual form: a parallelizable convolution for training and a step-by-step recurrence for inference. Selective variants such as Mamba make the dynamics input-dependent, which gives up the convolution, and train with a parallel scan instead. Either way, compute scales linearly with sequence length and the inference state stays a fixed size, so long contexts come cheap. The trade-off is expressivity: pure SSMs can struggle with precise in-context copying and lookup, which is exactly why the strongest results today are hybrids that interleave a few attention layers among many SSM layers.
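The dual form can be checked in a few lines. The sketch below assumes a discrete, time-invariant SSM with a diagonal state matrix, x_k = A x_(k-1) + B u_k and y_k = C x_k, and shows that stepping the recurrence gives the same outputs as convolving the input with the kernel K_j = C A^j B. The parameter values are arbitrary. Input-dependent (selective) models do not have this fixed kernel, so the equivalence shown here applies only to the time-invariant case.

```python
def ssm_recurrent(A, B, C, u):
    """Inference view: carry a fixed-size state x and update it per token."""
    x = [0.0] * len(A)
    ys = []
    for u_k in u:
        x = [a * x_i + b * u_k for a, x_i, b in zip(A, x, B)]
        ys.append(sum(c * x_i for c, x_i in zip(C, x)))
    return ys


def ssm_kernel(A, B, C, length):
    """Convolution kernel K_j = C * A**j * B for a diagonal A."""
    return [sum(c * (a ** j) * b for a, b, c in zip(A, B, C))
            for j in range(length)]


def ssm_convolutional(A, B, C, u):
    """Training view: every output is an independent causal convolution."""
    K = ssm_kernel(A, B, C, len(u))
    return [sum(K[j] * u[k - j] for j in range(k + 1))
            for k in range(len(u))]


A = [0.9, 0.5]     # diagonal state transition (arbitrary, stable)
B = [1.0, 0.5]     # input projection
C = [0.3, -0.2]    # output projection
u = [1.0, 0.0, 2.0, -1.0, 0.5]

y_rec = ssm_recurrent(A, B, C, u)
y_conv = ssm_convolutional(A, B, C, u)
print(all(abs(r - c) < 1e-12 for r, c in zip(y_rec, y_conv)))
```

The direct convolution here is quadratic in sequence length; it is written that way for readability, and it is only meant to show that each output can be computed independently of the others.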
The third family is text diffusion — generation that abandons left-to-right decoding entirely, refining a whole sequence in parallel over a handful of denoising steps. The benefit is non-autoregressive speed and bidirectional context at generation time; the challenge is matching the raw quality and controllability of autoregressive models, which LLaDA, Gemini Diffusion, and Mercury are now pushing on hard.
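As a rough picture of what decoding without a left-to-right order looks like, here is a toy masked-denoising loop. It is a sketch under heavy assumptions. The `toy_denoiser` function is a hypothetical stand-in for a trained model: it simply returns the right token with an arbitrary confidence score. The unmask-the-most-confident schedule is one common choice, not a description of how LLaDA, Gemini Diffusion, or Mercury actually work.

```python
import math

MASK = "<mask>"


def toy_denoiser(seq, target):
    """Hypothetical stand-in for a model: (position, token, score) per mask."""
    preds = []
    for i, tok in enumerate(seq):
        if tok == MASK:
            score = ((i * 5) % 8 + 1) / 8.0   # arbitrary made-up confidence
            preds.append((i, target[i], score))
    return preds


def diffusion_decode(target, num_steps):
    """Start fully masked; each step commits the most confident predictions."""
    seq = [MASK] * len(target)
    per_step = math.ceil(len(target) / num_steps)
    history = [list(seq)]
    for _ in range(num_steps):
        preds = toy_denoiser(seq, target)   # all masked positions at once
        if not preds:
            break
        preds.sort(key=lambda p: p[2], reverse=True)
        for i, tok, _ in preds[:per_step]:
            seq[i] = tok
        history.append(list(seq))
    return seq, history


target = "the cat sat on the mat right now".split()
final, history = diffusion_decode(target, num_steps=4)
for state in history:
    print(" ".join(state))
```

Eight tokens are produced in four model calls instead of eight, and they appear out of order, because every call sees the whole partially filled sequence rather than only a left-hand prefix.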
The fourth family is liquid and continuous-time models, which throw out the parallel-lookup mental model altogether in favor of dynamics that evolve continuously in time, aiming for far smaller, more adaptive networks. The benefit is parameter efficiency and a different inductive bias; the challenge is scaling that story to frontier sizes.
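To make the idea of continuous-time dynamics concrete, here is a one-neuron sketch based on the liquid time-constant formulation as we understand it from the published literature: dx/dt = -(1/tau + f) * x + f * A, where the gate f depends on the current input and state. The weights, the explicit Euler integrator, and the function names are all assumptions made for illustration; real implementations use many neurons and more careful solvers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ltc_step(x, inp, dt, tau=1.0, A=1.0, w=2.0, u=0.5, b=-1.0):
    """One explicit Euler step of dx/dt = -(1/tau + f) * x + f * A."""
    f = sigmoid(w * inp + u * x + b)       # input- and state-dependent gate
    dx = -(1.0 / tau + f) * x + f * A
    tau_eff = tau / (1.0 + tau * f)        # time constant shrinks as f grows
    return x + dt * dx, tau_eff


def run(inputs, timestamps, x0=0.0):
    """Integrate over irregularly spaced observations."""
    x, prev_t, trace = x0, 0.0, []
    for inp, t in zip(inputs, timestamps):
        x, tau_eff = ltc_step(x, inp, dt=t - prev_t)
        trace.append((t, x, tau_eff))
        prev_t = t
    return trace


timestamps = [0.1, 0.15, 0.4, 0.5, 0.9, 1.0]   # uneven gaps between samples
inputs = [1.0, 1.0, 0.0, 0.0, 2.0, 2.0]
trace = run(inputs, timestamps)
for t, x, tau_eff in trace:
    print(f"t={t:.2f}  x={x:.4f}  tau_eff={tau_eff:.4f}")
```

Two things stand out relative to attention. The step size is an explicit argument, so uneven gaps between observations are handled naturally, and the effective time constant changes with the input, which is where the "liquid" in the name comes from.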
None of these has dethroned attention. But the monoculture is over, and the most likely future is explicitly hybrid: attention where exact recall earns its quadratic cost, something linear-time everywhere else.
Here is the full series, in order:
#846 — Beyond Transformer: A New Series — The kickoff, framing the palpable vibe shift on arXiv toward post-attention architectures and the decade we’ve spent as a wrapper around self-attention. It lays out the plan to map every major viable alternative to the Transformer.
#850 — The Unexpected Comeback of RNNs — The case for recurrent networks as the alternative most people overlooked, revisiting why linear-time recurrence is attractive again. It positions modern RNN variants as a serious challenger rather than a relic.
#854 — Return of the King: Unrolling the xLSTM Architecture — Traces the lineage from the 1990s LSTM through the 2017 Transformer pivot into xLSTM, the modernized revival of Hochreiter and Schmidhuber’s design. It explains how reworked gating and scaling let xLSTM compete with attention-based models.
#858 — How State Space Models Went from Curiosity to Serious Transformer Competitor — Charts the rise of SSMs as the O(n²) attention bottleneck becomes a real constraint at million-token contexts and large KV-caches. It argues state space models have quietly matured into a genuine rival to the dominant paradigm.
#862 — Learning About Text Diffusion Models — Introduces text diffusion as one of the most credible non-autoregressive alternatives to transformers. It covers how diffusion-style generation breaks from strict left-to-right next-token prediction.
#866 — Three Text Diffusion Models You Need To Know About — A practical follow-up profiling the leading players in the space: LLaDA, Gemini Diffusion, and Mercury. It compares how each implements diffusion-based text generation.
#870 — Liquid Models and the Search for a Post-Transformer Architecture — Dives into liquid neural networks as one of the more promising non-Transformer architectures, contrasting their continuous-time dynamics with attention’s parallel lookup-table approach. It frames them within the broader hunt for a successor.
#874 — Transformers or Not? — The capstone, asking whether the Transformer is the final architecture or merely the first truly scalable one, soon absorbed into something richer. It leans toward the latter and surveys the full landscape the series has covered.
What’s next: a new series on distillation
If the last series was about changing the architecture, the next one is about compressing it. We’re starting a deep dive into knowledge distillation — the set of techniques for taking a large, expensive teacher model and pressing its capabilities into a smaller, faster student. It’s one of the least glamorous and most economically important ideas in modern AI: it’s how frontier capability actually reaches production. We’ll cover the classics (logit matching, the original Hinton formulation), the modern variants (sequence-level, on-policy, and self-distillation), what actually transfers and what doesn’t, and why nearly every model you can afford to run is, in some sense, a distilled one. See you in the first issue.