2026-05-08 on-site rewrite

Sparser, Faster, Lighter Transformer Language Models

Modern large language models (LLMs) are powerful but costly. A new collaboration between Sakana AI and NVIDIA introduces the TwELL sparse packing format and custom CUDA kernels that exploit unstructured sparsity in feedforward layers, achieving over 20% speedups on H100 GPUs with minimal performance loss.

Article intelligence

Engineers · Advanced

Key points

Sakana AI and NVIDIA collaborate to introduce TwELL (Tile-wise ELLPACK), a sparse packing format designed for tiled matrix multiplication kernels.
Custom CUDA kernels fuse multiple matrix multiplications to maximize throughput and reduce storage.
L1 regularization induces >95% sparsity in ReLU-based LLMs with negligible downstream impact.
Over 20% speedup for batched inference and training on H100 GPUs, with reduced energy and memory consumption.

Why it matters

Feedforward layers account for a large share of LLM cost. TwELL (Tile-wise ELLPACK), a sparse packing format from Sakana AI and NVIDIA designed for tiled matrix multiplication kernels, lets the sparsity of their activations translate into measurable speedups.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.


Sparser, Faster, Lighter Transformer Language Models

tl;dr

In collaboration with NVIDIA, we introduce new sparse data structures and GPU kernels to leverage unstructured sparsity for efficient inference and training of LLMs. This work will be presented at ICML 2026.

Summary

Modern large language models (LLMs) are powerful. They can write code, reason through complex problems, and synthesize vast amounts of technical knowledge. However, providing such capabilities at scale comes with tremendous real-world costs.

A large part of these costs comes from the feedforward layers, a ubiquitous component of LLMs since their inception. Yet, inside these layers, an interesting phenomenon can be observed: for any given token, only a small fraction of the hidden activations actually matter, with the rest effectively approximating zero and wasting computation. With ReLU and L1 regularization, this sparsity can even be made to exceed 95% with little to no impact on downstream performance.

So, can we leverage this sparsity to make LLMs faster and lighter?

The difficulties of achieving this objective lie in the hardware. Modern NVIDIA GPUs are extraordinarily good at homogeneous workloads, with specialized units dedicated to dense matrix multiplications. Traditional algorithms for efficiently leveraging unstructured sparsity clearly do not fit these assumptions, and introduce non-trivial constant overheads and bookkeeping that cancel out their theoretical savings.

This work, a collaboration between Sakana AI and NVIDIA, is about resolving this paradox:

First, we introduce a new sparse packing format, called TwELL (Tile-wise ELLPACK), designed specifically to integrate with the tiled matrix multiplication kernels that power modern accelerators without disrupting execution pipelines or introducing additional memory overhead.

Second, we develop a new set of custom CUDA kernels for both LLM inference and training, fusing multiple matrix multiplications to maximize throughput and compressing TwELL to a sparse representation that trivializes storage costs.

To substantiate these gains, we study sparse LLMs at billion-parameter scales and show that mild L1 regularization can induce high levels of sparsity after training with negligible impact on downstream performance. By leveraging our kernels, these sparsity levels translate into over 20% speedups for both batched inference and training on H100 GPUs, while also cutting energy consumption and memory requirements.

Get ready for a deep dive into GPUs, LLMs, and sparsity!

Transformers and Feedforward Layers

At a high level, the "body" of modern transformers is composed of a repeated stack of just two components: an attention block (linear or quadratic), followed by a feedforward block. While attention lets tokens communicate with each other, the feedforward block processes each input token independently, performing multiple matrix multiplications with a set of large learned weight matrices.

Gated feedforward block. The input x branches into an up projection and a gated projection, recombines through elementwise multiplication, and is projected back to the embedding dimension.

Formally, a modern feedforward block takes as input a matrix $x \in \mathbb{R}^{M \times K}$, where $M$ is the total number of tokens (number of batched sequences $\times$ sequence length), and $K$ is the token embedding size. The input $x$ is then multiplied by three different weight matrices to first expand this representation into a much larger hidden space of size $N \gg K$, apply a simple non-linearity, and then project it back down for the next block. Formally, this computation looks as follows:

$$h_u = x W_u, \quad h_g = \sigma(x W_g), \quad h = h_u \odot h_g, \quad y = h W_d.$$

Here, $W_u, W_g \in \mathbb{R}^{K \times N}$ are commonly referred to as the weight matrices for the "up" and "gate" projections, and $W_d \in \mathbb{R}^{N \times K}$ as the weight matrix for the "down" projection. The non-linearity $\sigma$, such as SiLU or ReLU, is what introduces sparsity in the gate activations $h_g$, which is carried over to the hidden activations $h$ following the elementwise product $\odot$. As the hidden dimension $N$ is typically much larger than $K$ (often 4x or more), these layers can often be responsible for the majority of both parameters and FLOPs in modern LLMs.
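The equations above can be traced by hand on a toy example. The following is a minimal sketch in plain Python (lists of rows, no GPU), assuming ReLU as the non-linearity; the helper names `matmul`, `relu`, and `gated_ffn` and all numeric values are illustrative assumptions, not code from this work.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][n] for k in range(inner)) for n in range(cols)]
            for row in a]


def relu(v):
    """ReLU non-linearity: negative pre-activations become exactly zero."""
    return v if v > 0.0 else 0.0


def gated_ffn(x, w_up, w_gate, w_down):
    """Gated feedforward block: y = ((x W_u) * relu(x W_g)) W_d."""
    h_u = matmul(x, w_up)                                        # up projection, M x N
    h_g = [[relu(v) for v in row] for row in matmul(x, w_gate)]  # gate projection, M x N
    h = [[u * g for u, g in zip(ru, rg)] for ru, rg in zip(h_u, h_g)]  # elementwise product
    y = matmul(h, w_down)                                        # down projection, M x K
    return h, y


# Toy sizes (hypothetical): M = 2 tokens, K = 2 embedding dims, N = 4 hidden dims.
x = [[1.0, -1.0], [0.5, 2.0]]
w_up = [[1.0, 0.0, 2.0, -1.0], [0.0, 1.0, 1.0, 1.0]]
w_gate = [[1.0, -1.0, 0.0, 1.0], [1.0, 1.0, -1.0, 0.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, -1.0]]

h, y = gated_ffn(x, w_up, w_gate, w_down)
```

Wherever the gate pre-activation is not positive, the corresponding entry of `h` is zero regardless of the up projection, which is the sparsity the rest of the article sets out to exploit.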

In this overparameterized regime, prior work has shown that pretrained transformers with ReLU activations can already exhibit high levels of unstructured sparsity in their feedforward blocks, with less than 5-10% non-zero elements in their hidden activations $h$. Other works on SiLU-based LLMs have shown that a substantial fraction of feedforward activations can also be set to zero with little or no additional training, either through training-free thresholding or lightweight finetuning. In this work, we study how different levels of sparsity affect both performance and efficiency by combining ReLU activations with a simple auxiliary L1 regularization loss on the hidden activations, weighted by a coefficient $L_1$ and averaged across all neurons:

$$L_1 \times \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \lvert h[m,n] \rvert.$$
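As a small illustration of this auxiliary term, the sketch below computes the penalty and the fraction of exactly-zero activations for a toy matrix of hidden activations. The function names, the coefficient value `0.5`, and the activation values are assumptions chosen for readability; in real training the term would be added to the language-modeling loss inside an autodiff framework.

```python
def l1_penalty(h, coeff):
    """coeff * mean(|h[m, n]|) over all M tokens and N hidden neurons."""
    m_dim, n_dim = len(h), len(h[0])
    total = sum(abs(v) for row in h for v in row)
    return coeff * total / (m_dim * n_dim)


def sparsity(h):
    """Fraction of hidden activations that are exactly zero."""
    count = sum(len(row) for row in h)
    zeros = sum(1 for row in h for v in row if v == 0.0)
    return zeros / count


# Toy hidden activations (hypothetical): M = 2 tokens, N = 4 hidden neurons.
h = [[0.0, 0.0, 1.0, -2.0],
     [1.25, 3.0, 0.0, 0.75]]

penalty = l1_penalty(h, coeff=0.5)  # coefficient value is illustrative only
zero_fraction = sparsity(h)
```

Because the penalty grows with the magnitude of every non-zero activation, minimizing it pushes more gate pre-activations below zero, where ReLU maps them to exact zeros.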

GPUs, Memory, and Tiling

To understand the challenges of leveraging sparsity to accelerate matrix multiplication, we provide a brief primer on how modern GPUs execute programs and on the hardware characteristics that shape kernel design.

Whenever code runs on an NVIDIA GPU, whether through PyTorch or any other library, it launches a GPU kernel: a massively parallel program executed across many threads on the device. The kernel's computation is first distributed across a grid of many independent units of work called cooperative thread arrays (CTAs; a CTA is also commonly referred to as a thread block in several tutorials and books). Each CTA is scheduled on a single Streaming Multiprocessor (SM), the GPU execution unit that schedules and runs CTAs, somewhat analogous to a CPU core but built for massive parallelism. A CTA consists of many threads working together to carry out the kernel operations on a set portion of the input data. This design allows the same optimized kernel, written from the perspective of the threads in a single CTA, to be easily ported to different input sizes by simply scaling the number of CTAs in the grid. The threads in a CTA are not entirely independent of each other but are grouped into units of 32 called warps, which execute the same instructions at each clock cycle on different inputs. All these constraints help reduce the cost and chip space dedicated to each execution unit and enable GPUs to achieve massive parallelism: thousands of threads are scheduled across many CTAs, all performing arithmetic and memory operations concurrently.
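To make the grid, CTA, and warp bookkeeping concrete, here is a small sketch of the arithmetic involved. It assumes each CTA covers a fixed rectangular block of a two-dimensional output; the function names, block sizes, and thread counts are hypothetical values chosen for illustration, and only the warp size of 32 comes from the text above.

```python
WARP_SIZE = 32  # threads per warp, as described above


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def launch_geometry(rows, cols, rows_per_cta, cols_per_cta, threads_per_cta):
    """Number of CTAs needed to cover a rows x cols output, and warps per CTA."""
    grid = (ceil_div(rows, rows_per_cta), ceil_div(cols, cols_per_cta))
    return {
        "grid": grid,
        "num_ctas": grid[0] * grid[1],
        "warps_per_cta": ceil_div(threads_per_cta, WARP_SIZE),
    }


def warp_and_lane(thread_index):
    """Warp index and position within the warp for a thread of a CTA."""
    return divmod(thread_index, WARP_SIZE)


# The same per-CTA program covers a larger input simply by launching more CTAs.
small = launch_geometry(1000, 4096, 128, 128, threads_per_cta=256)
large = launch_geometry(4096, 11008, 128, 128, threads_per_cta=256)
```

Only the grid shape changes between the two launches; the work assigned to each CTA, and the eight warps inside it, stay the same.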

However, not all operations are equal. As theoretical FLOPs on modern NVIDIA devices have risen dramatically in recent years, the bottleneck of many common kernels is often memory bandwidth.

GPU memory hierarchy. Registers sit closest to the compute unit, shared memory is CTA-local, and global memory is shared across the whole GPU.

Modern GPUs have a hierarchical memory system with three distinct levels. At the bottom level, we have global memory, available to all threads running on the GPU and residing separately from the compute units on a stack of high-bandwidth memory (HBM). While global memory can be large (80GB for an H100 GPU), it comes with high latency, on the order of 500 cycles, and limited per-device bandwidth. Surrounding the compute units themselves, on the second level, we have on-chip shared memory. Shared memory is much faster, with latencies closer to ~10–20 cycles, and is private to each CTA. At the top level, we have on-chip register memory, which can be accessed in a single cycle and is private to each thread. Both shared memory and registers are built from SRAM, which is extremely fast but physically large and expensive per bit. Moving up the hierarchy, each level becomes over an order of magnitude faster, but correspondingly smaller due to the physical constraints of on-chip storage. For instance, while an H100 GPU provides 80GB of HBM, each CTA is limited to at most 228KB of shared memory, and each thread to only a few kilobytes of register storage.

Thus, designing efficient kernels requires minimizing accesses to global memory. This is directly reflected in the design of modern matrix multiplication kernels. Suppose we want to compute $h = x W$, where $x \in \mathbb{R}^{M \times K}$ and $W \in \mathbb{R}^{K \times N}$. Each output element in this operation can be seen as an independent dot product:

$$h[m, n] = \sum_{k=1}^{K} x[m, k] \, W[k, n].$$

An intuitive but suboptimal approach to implement this operation would distribute work across threads at the granularity of individual output elements. For instance, each thread could be assigned a single output element $h[m, n]$, load the row $m$ of $x$ and column $n$ of $W$ from global memory, perform the dot product, and store the result. The main issue with this approach is that it requires two separate loads for $x[m, k]$ and $W[k, n]$ before each partial accumulation. As a result, the same values of $x$ and $W$ are redundantly reloaded across many different threads, leading to a large number of expensive global memory transactions that dominate execution time.
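The redundancy can be counted directly. The sketch below is a sequential Python stand-in for the one-thread-per-output scheme (the two outer loops play the role of the threads) that tallies one simulated global memory load per operand read. It ignores hardware caches and coalescing, and the function name and inputs are illustrative assumptions.

```python
def naive_matmul_with_load_count(x, w):
    """Compute h = x W one output element at a time, counting operand loads."""
    m_dim, k_dim, n_dim = len(x), len(w), len(w[0])
    h = [[0.0] * n_dim for _ in range(m_dim)]
    global_loads = 0
    for m in range(m_dim):          # each (m, n) pair stands in for one thread
        for n in range(n_dim):
            acc = 0.0
            for k in range(k_dim):
                acc += x[m][k] * w[k][n]  # two operands fetched per multiply-accumulate
                global_loads += 2
            h[m][n] = acc
    return h, global_loads


# Toy 4 x 4 problem (hypothetical values).
x = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
w = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]

h, naive_loads = naive_matmul_with_load_count(x, w)
distinct_inputs = len(x) * len(x[0]) + len(w) * len(w[0])
```

For this toy problem the two inputs hold 32 distinct values in total, yet the scheme issues $2MNK = 128$ loads, so each value is fetched four times on average; the gap widens as the matrices grow.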

Modern GPU kernels avoid this by using tiling. Instead of each thread operating on individual output elements, computation is reorganized by having the threads in each CTA cooperatively compute a separate tile of the output of shape $T_m \times T_n$. With this approach, each CTA iterates over the reduction dimension $K$ in chunks of size $T_k$, and at each step all its threads cooperatively load a tile of $x$ of shape $T_m \times T_k$ and a tile of $W$ of shape $T_k \times T_n$ from global memory into shared memory.

Once both input tiles are loaded in fast shared memory, the threads in the CTA can reuse them to perform all necessary multiply-accumulate operations, updating a full $T_m \times T_n$ tile of the output and storing it back at the end:

Algorithm 1. A sampled CTA tile illustrates tiled matrix multiplication. The CTA advances through the $K$ dimension, reading the relevant tiles of $x$ and $W$, performing a tiled matmul, and updating the running-sum accumulator. Finally, the output tile of $h$ is committed to global memory.
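The tiled schedule can be sketched sequentially in plain Python, with the two outer loops standing in for the grid of CTAs and a counter tracking how many elements are fetched from simulated global memory. This is an illustrative sketch rather than the kernel from this work: it assumes the matrix dimensions are divisible by the tile sizes, and the function name, tile sizes, and inputs are hypothetical.

```python
def tiled_matmul_with_load_count(x, w, tile_m, tile_n, tile_k):
    """Compute h = x W tile by tile, counting elements loaded from global memory."""
    m_dim, k_dim, n_dim = len(x), len(w), len(w[0])
    if m_dim % tile_m or n_dim % tile_n or k_dim % tile_k:
        raise ValueError("this sketch assumes dimensions divisible by the tile sizes")
    h = [[0.0] * n_dim for _ in range(m_dim)]
    global_loads = 0
    for m0 in range(0, m_dim, tile_m):        # each (m0, n0) pair stands in for one CTA
        for n0 in range(0, n_dim, tile_n):
            acc = [[0.0] * tile_n for _ in range(tile_m)]  # running-sum accumulator
            for k0 in range(0, k_dim, tile_k):
                # Cooperative load of one tile of x and one tile of W ("shared memory").
                x_tile = [row[k0:k0 + tile_k] for row in x[m0:m0 + tile_m]]
                w_tile = [row[n0:n0 + tile_n] for row in w[k0:k0 + tile_k]]
                global_loads += tile_m * tile_k + tile_k * tile_n
                # Every loaded value is reused across the whole output tile.
                for i in range(tile_m):
                    for j in range(tile_n):
                        for k in range(tile_k):
                            acc[i][j] += x_tile[i][k] * w_tile[k][j]
            for i in range(tile_m):           # commit the finished output tile
                for j in range(tile_n):
                    h[m0 + i][n0 + j] = acc[i][j]
    return h, global_loads


# Toy 4 x 4 problem with 2 x 2 tiles (hypothetical values).
x = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
w = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]

h, tiled_loads = tiled_matmul_with_load_count(x, w, tile_m=2, tile_n=2, tile_k=2)
```

On this toy problem the tiled schedule fetches 64 elements, whereas reading both operands from global memory for every multiply-accumulate would take $2MNK = 128$ loads; larger tiles increase the savings further.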

This reorganization allows amortizing the cost of each global memory load over many arithmetic operations: each element of $x$ is reused across $T_n$ output columns, while each element of $W$ is reused across $T_m$ output rows. As a result, the ratio of computation to memory traffic, known as the kernel's arithmetic intensity (FLOPs performed divided by bytes moved to and from global memory), increases dramatically, which can often shift the computation from being memory-bound to compute-bound.
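A rough worked estimate shows the size of this effect. It is a sketch under stated assumptions, not a figure from this work: only the loads of the two input tiles are counted, the final store of the output tile and any caching are ignored, and $b$ (bytes per element) is a symbol introduced here for illustration. In one step over the $K$ dimension, a CTA performs $2\,T_m T_n T_k$ FLOPs while loading $T_m T_k + T_k T_n$ elements, whereas the one-thread-per-output scheme loads two elements for every multiply-accumulate:

```latex
\begin{aligned}
\text{intensity}_{\text{tiled}}
  &= \frac{2\,T_m T_n T_k}{b\,(T_m T_k + T_k T_n)}
   = \frac{2\,T_m T_n}{b\,(T_m + T_n)}, \\[4pt]
\text{intensity}_{\text{naive}}
  &= \frac{2}{2b} = \frac{1}{b}.
\end{aligned}
```

With the illustrative choices $T_m = T_n = 128$ and $b = 2$ (16-bit elements), the tiled estimate is $\frac{2 \cdot 128 \cdot 128}{2 \cdot 256} = 64$ FLOPs per byte, against $0.5$ FLOPs per byte for the naive scheme. Note that $T_k$ cancels: under these assumptions, the reuse comes from the output tile dimensions.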

The hardware of modern GPUs has increasingly specialized to support tiling. Many common GPUs have dedicated hardware units called Tensor Cores, with which warps can cooperatively issue matrix multiply-accumulate operations that execute asynchronously on whole tiles of data stored in shared memory, achieving extremely high throughput. On modern devices like the H100, this is further complemented by the Tensor Memory Accelerator (TMA), which enables asynchronous loading and storing tiles of data across global memory and shared memory and allows pipeli

[truncated for AI cost control]