AI News HubLIVE
In-site rewrite6 min read

Show HN: Proxy Block-CAGE, a new sparse block attention

Introducing Proxy Block-CAGE, a sparse attention prototype for long-context prefill. By splitting Q/K/V into blocks, using pooled summaries to select key blocks, and computing exact attention only within selected blocks, it achieves 2.61x speedup over dense attention at 16K tokens on a small Transformer, using only 0.39% of dense capacity while preserving 100% final-token agreement. The article details the evolution from genetic algorithms to mathematical optimization.

SourceHacker News AIAuthor: iqbal1980

Notifications You must be signed in to change notification settings

Fork 0

Star 0

Copy path

More file actions

More file actions

Latest commit

History

History

History

893 lines (616 loc) · 23.3 KB

Copy path

Raw

Copy raw file

Download raw file

Outline

Proxy Block-CAGE: Working with ChatGPT as a research assistant to create a new type of Sparce Attention algorithm

I started this project with a question that was half technical and half personal:

Could I use ChatGPT as a serious research partner, not just as a code generator, to help me explore a core AIML research problem?

I am Iqbal Addou, a PhD student in Bioinformatics and Computational Biology at George Mason University. I have about 23 years of software engineering experience and about 5 years of applied AIML engineering experience, mostly in computer vision. I know PyTorch, training workflows, and how to use modern AI frameworks, but I do not come from a long background in core model research. I am trying to pivot toward that world: efficient inference, sparse attention, long-context systems, and the mathematical/computational side of Transformers.

In my PhD work I have used genetic programming, genetic algorithms, optimization techniques, and physics-informed neural networks. Those methods are not always fashionable in modern LLM work, but I have learned to respect them because they force you to think in terms of search spaces, objectives, constraints, and failure modes. So I wanted to see if some of that mindset could be useful in a Transformer problem.

The first prompt was intentionally broad and a little unusual. I asked ChatGPT to help me find a computationally and mathematically intensive problem in LLMs and Transformers, and to explore out-of-the-box methods such as genetic programming, genetic algorithms, classical AI/ML, and older optimization ideas. I wanted something that could be formulated mathematically, implemented locally, and tested on my own hardware.

The project eventually became Proxy Block-CAGE, a small sparse-attention prototype for long-context prefill. The useful version is simple:

Split Q/K/V into blocks, use cheap pooled query/key block summaries to select a few key blocks per query block, then compute exact local attention only inside those selected key/value blocks.

The strongest result I got was on my laptop GPU: an NVIDIA RTX 2080 Max-Q 8GB. In a tiny 2-layer character-level Transformer trained on a custom text corpus, at 16,384 tokens, Proxy Block-CAGE ran generation-style prefill 2.61x faster than dense PyTorch attention, while using only 0.39% of dense attention capacity and preserving 100% final-token top-1/top-5 agreement on that small eval run.

This is not a production LLM claim. It is a reproducible research prototype and a request for honest feedback.

The result in one table

The most important runs were on a trained 2-layer character-level Transformer, using CUDA fp16 on the RTX 2080 Max-Q.

Context length Dense prefill Proxy Block-CAGE, capacity 64 Speedup Attention density Final-token top-1 agreement Final-token top-5 containment

8,192 7.925 ms 5.263 ms 1.51x 0.781% 95% 100%

12,288 14.980 ms 7.018 ms 2.13x 0.521% 100% 100%

16,384 24.301 ms 9.323 ms 2.61x 0.391% 100% 100%

For the 16,384-token run:

dense prefill_last_ms_median: 24.3009 ms Proxy Block-CAGE 64 prefill median: 9.3229 ms speedup: 2.61x density: 0.00390625 = 0.390625% final-token top-1 agreement: 100% final-token top-5 containment: 100%

The model was tiny, the corpus was small, and the eval sample count was small. I am treating this as a stronger sanity check than random tensors, not as evidence that this works for production LLMs.

Why attention is expensive

After projection, a Transformer layer has:

$$ Q,K,V \in \mathbb{R}^{B \times H \times n \times d} $$

where:

B = batch size H = number of attention heads n = sequence length d = head dimension

Dense causal attention computes:

$$ s_{ij} = \frac{q_i^\top k_j}{\sqrt d} $$

for valid causal pairs:

$$ j \le i $$

Then:

$$ a_{ij} = \frac{\exp(s_{ij})}{\sum_{\ell \le i}\exp(s_{i\ell})} $$

and:

$$ y_i = \sum_{j \le i} a_{ij}v_j $$

The compute is approximately:

$$ O(BHn^2d) $$

At 16,384 tokens, one head has:

$$ \frac{n(n+1)}{2} = \frac{16384 \cdot 16385}{2} = 134{,}225{,}920 $$

causal token-token score locations. With 4 heads:

$$ 4 \cdot 134{,}225{,}920 = 536{,}903{,}680 $$

That is about 537 million query-key score locations for one layer.

Dense attention is extremely optimized in PyTorch and FlashAttention-style kernels, so reducing arithmetic does not automatically mean a faster wall-clock runtime. But the arithmetic shows why long context is a natural place to look for sparsity.

The first idea: evolve attention cages

The original CAGE idea was not block routing. It was genetic/evolutionary attention-mask search.

For a query token (i), instead of attending to every previous key, define a small selected set:

$$ C_i \subseteq {0,1,\dots,i} $$

Then compute attention only inside that selected set:

$$ y_i^{\text{CAGE}}

\sum_{j\in C_i} \operatorname{softmax}_{j\in C_i} \left( \frac{q_i^\top k_j}{\sqrt d} \right)v_j $$

In the genetic algorithm version, an individual chromosome was a list of selected key indices. For example:

query token i = 100 candidate cage C_i = [2, 7, 19, 44, 88, 91, 97, 100]

That candidate means token 100 is only allowed to attend to those key/value positions.

A simple fitness function was:

$$ f(C_i)

\log \sum_{j\in C_i} \exp\left(\frac{q_i^\top k_j}{\sqrt d}\right) $$

The GA operations were ordinary evolutionary operations:

selection: keep better cages elitism: preserve the best cages crossover: mix two cages mutation: replace some indices repair: remove duplicates and enforce causal validity

This was useful conceptually and terrible computationally. Token-level CAGE required one GA search per query token. At 512 tokens, batch 1, 4 heads, that means:

$$ 1 \cdot 4 \cdot 512 = 2048 $$

separate GA units.

In one early benchmark, at only 512 tokens, token-level CAGE took about 7.29 seconds, while dense PyTorch attention took about 0.25 ms. That failure was important. It told me that the idea had to move from token-level search to something coarser.

The second idea: evolve key blocks instead of tokens

The next step was to group the sequence into blocks.

A key block is just a fixed-size contiguous group of key vectors after projection. A Transformer forms:

$$ Q=XW_Q, \qquad K=XW_K, \qquad V=XW_V $$

For each token position (j), there is a key vector (k_j) and a value vector (v_j). With:

key_block_size = 32

we get:

key block 0 = token positions 0..31 key block 1 = token positions 32..63 key block 2 = token positions 64..95 ...

Mathematically:

$$ R_s = {sb_k, sb_k+1, \dots, (s+1)b_k-1} $$

where (b_k) is the key block size.

A key block is not a paragraph, sentence, document chunk, or learned memory slot. In this prototype it is simply a contiguous range of key/value vectors.

Instead of evolving token cages, the block version evolves:

$$ C_r \subseteq {0,1,\dots,N_k-1} $$

where (C_r) is the set of selected key blocks for query block (r), and (N_k) is the number of key blocks.

Example:

query block = tokens 9600..9631 candidate block cage = [41, 299]

If each key block has 32 tokens, selecting 2 key blocks gives each query token at most:

$$ 2 \cdot 32 = 64 $$

candidate key/value positions.

This reduced the number of GA searches from one per query token to one per query block. It was a better direction, but it was still too slow in Python.

The mathematical correction: the GA objective collapsed to top-k

The block-level objective I used was approximately:

$$ F(C_r)

\log\sum_{s\in C_r}\exp(A_{r,s}) $$

where (A_{r,s}) is the score for key block (s) with respect to query block (r).

For a fixed cage size (m), this objective does not need a genetic algorithm. The exact solution is:

$$ C_r^* = \operatorname{TopK}_s(A_{r,s}, m) $$

The proof is simple. Suppose a selected block (a\in C_r) has lower score than an unselected block (b\notin C_r):

$$ A_{r,b} > A_{r,a} $$

Then:

$$ \exp(A_{r,b}) > \exp(A_{r,a}) $$

Replacing (a) with (b) increases:

$$ \sum_{s\in C_r}\exp(A_{r,s}) $$

and therefore increases:

$$ \log\sum_{s\in C_r}\exp(A_{r,s}) $$

So any optimum must contain the top (m) block scores.

This was one of the most useful moments in the project. The genetic algorithm helped me frame the problem, but the math killed the GA for this simple objective.

That changed the story from:

evolve attention cages with GA

to:

use the cage formulation, but select blocks with exact top-k

The third idea: avoid the full token-score matrix

Exact top-k block selection helped, but the early version still depended on too much full token-pair information. That defeats the purpose.

The next refinement was Proxy Block-CAGE.

Instead of scoring all token pairs first, I pool each query block and key block into summaries:

$$ \bar q_r = \frac{1}{|U_r|}\sum_{i\in U_r} q_i $$

$$ \bar k_s = \frac{1}{|R_s|}\sum_{j\in R_s} k_j $$

Then score query block (r) against key block (s):

$$ S_{r,s} = \frac{\bar q_r^\top \bar k_s}{\sqrt d} $$

Then select:

$$ C_r = \operatorname{TopK}_s(S_{r,s}, m) $$

where (m) is the number of selected key blocks.

In the best default configuration:

query_block_size = 32 key_block_size = 32 block_cage_size = 2 selected capacity = 64 tokens pool = mean

At 16,384 tokens:

$$ N_q = N_k = \frac{16384}{32} = 512 $$

The number of proxy block scores across 4 heads is:

$$ 4 \cdot 512 \cdot 512 = 1{,}048{,}576 $$

The local selected token-score upper bound is:

$$ 4 \cdot 16384 \cdot 64 = 4{,}194{,}304 $$

So the approximate score locations are:

$$ 1{,}048{,}576 + 4{,}194{,}304 = 5{,}242{,}880 $$

Dense causal attention at the same size has about:

$$ 536{,}903{,}680 $$

score locations across 4 heads.

The ratio is:

$$ \frac{536{,}903{,}680}{5{,}242{,}880} \approx 102.4 $$

So at 16K context, this configuration touches roughly 100x fewer score locations than dense causal attention. That does not translate to 100x wall-clock speedup because dense attention kernels are highly optimized and this prototype is not. But it explains why the sparse method begins to win at long context.

Proxy Block-CAGE algorithm

For each layer, batch item, and attention head:

Compute (Q,K,V).

Split query positions into query blocks (U_r).

Split key/value positions into key blocks (R_s).

Mean-pool each query block and key block:

$$ \bar q_r = \frac{1}{|U_r|}\sum_{i\in U_r}q_i $$

$$ \bar k_s = \frac{1}{|R_s|}\sum_{j\in R_s}k_j $$

Compute proxy block scores:

$$ S_{r,s}=\frac{\bar q_r^\top \bar k_s}{\sqrt d} $$

Apply a causal block mask so query blocks cannot route to future-only key blocks.

Select the top (m) key blocks:

$$ C_r = \operatorname{TopK}s(S{r,s},m) $$

Expand selected key blocks into selected key/value token positions.

For every query token (i\in U_r), compute exact causal attention only over selected positions:

$$ J_i = {j \in \cup_{s\in C_r}R_s : j \le i} $$

$$ a_{ij}^{\text{CAGE}}

\frac{ \exp\left(q_i^\top k_j / \sqrt d\right) }{ \sum_{\ell\in J_i} \exp\left(q_i^\top k_\ell / \sqrt d\right) } $$

$$ y_i^{\text{CAGE}}

\sum_{j\in J_i} a_{ij}^{\text{CAGE}}v_j $$

The approximation is in the selection of which blocks to include. Once blocks are selected, the local attention inside the selected cage is exact softmax attention.

Complexity

Dense attention is:

$$ O(BHn^2d) $$

Proxy Block-CAGE has two main costs.

First, all block-pair proxy scoring:

$$ O\left(BH d \frac{n^2}{b_qb_k}\right) $$

Second, exact local attention inside selected blocks:

$$ O(BHnmb_kd) $$

where:

b_q = query block size b_k = key block size m = number of selected key blocks

So the total is approximately:

$$ O\left( BHd\left[ \frac{n^2}{b_qb_k} + nmb_k \right] \right) $$

With:

b_q = 32 b_k = 32 m = 2

this becomes:

$$ O\left(BHd\left[\frac{n^2}{1024}+64n\r

[truncated for AI cost control]