First Steps Toward Automated AI Research
Recursive releases early results from its automated AI research system, achieving state-of-the-art performance on fixed-budget language model training, small-model training speed, and GPU kernel optimization. The system automates the research loop: proposing, implementing, experimenting, validating, and iterating. On NanoChat, it achieved 0.9109 BPB, surpassing community solutions; on NanoGPT Speedrun, it reduced training time to 77.5 seconds; on SOL-ExecBench, it reached a SOL score of 0.754. The system discovered innovations including hash-table n-gram embeddings and byte-level features.
Early results from Recursive’s automated AI research system on model training and GPU kernel benchmarks
JUNE 11, 2026
Today we are releasing early results from Recursive’s automated AI research system. The system achieves state-of-the-art results on three benchmarks, covering fixed-budget language model training, small-model training speed, and GPU kernel optimization.
The system automates the research loop for a target objective: it proposes an idea, implements it, runs an experiment, validates the result, and uses what it learns to choose the next experiment. It runs many research threads over long horizons, keeps useful context from prior experiments, combines promising branches, and checks results for reward hacks and variance before treating improved performance as real progress. It is designed to scale and harnesses principles of open-ended algorithms, building on ideas from previous work by our team and others on recursively self-improving AI.
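As a rough illustration of the propose, run, validate, and iterate cycle described above, here is a minimal single-threaded sketch on a toy objective. It is not Recursive's system: `run_experiment`, `validate`, `research_loop`, the quadratic "loss", the noise level, and the `min_gain` threshold are all hypothetical stand-ins, and the sketch leaves out parallel research threads, branch merging, and reward-hack checks.

```python
import random
import statistics


def run_experiment(candidate: float, seed: int) -> float:
    """Toy stand-in for one training run: returns a noisy loss (lower is better)."""
    noise = random.Random(seed).gauss(0.0, 0.05)
    return (candidate - 3.0) ** 2 + noise


def validate(candidate: float, seeds=range(10)) -> float:
    """Re-run the candidate on several seeds and average, to damp seed variance."""
    return statistics.mean(run_experiment(candidate, s) for s in seeds)


def research_loop(start: float, steps: int = 200, min_gain: float = 0.02):
    proposer = random.Random(0)
    best, best_loss = start, validate(start)
    history = [(best, best_loss)]
    for step in range(steps):
        idea = best + proposer.gauss(0.0, 0.5)               # propose a change
        quick_loss = run_experiment(idea, seed=1000 + step)  # run one cheap experiment
        if quick_loss >= best_loss - min_gain:
            continue                                         # not promising, move on
        loss = validate(idea)                                # validate across seeds
        if loss < best_loss - min_gain:                      # accept only clear gains
            best, best_loss = idea, loss
            history.append((best, best_loss))
    return best, best_loss, history


best, best_loss, history = research_loop(start=0.0)
print(f"accepted {len(history) - 1} changes, final loss {best_loss:.3f}")
```

The design point the sketch tries to capture is the separation between a cheap single-run signal, used only to decide whether an idea deserves more compute, and a multi-seed validation that must clear a margin before the idea becomes the new baseline.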
We tested the system on benchmarks chosen for both practical importance and tight feedback loops. They stress three core levers of AI progress: better training algorithms, faster training, and more efficient use of hardware. They are also well suited to automated research because they have clear metrics, relatively low variance, and evaluators that can be hardened against reward hacks.
We are open-sourcing artifacts from these runs so others can inspect and build on the system’s outputs.
| Benchmark | Task type | Metric | Previous state of the art | Recursive | Improvement |
| --- | --- | --- | --- | --- | --- |
| NanoChat Autoresearch | Train a small language model to highest performance given a small compute budget | Validation BPB | 0.9372 | 0.9109 | 0.0263 lower validation BPB, or a 1.3x speedup to reach the same loss |
| NanoGPT Speedrun | Train a small language model to a certain performance as fast as possible | Training time required to reach a 3.28 validation loss | 79.7 s | 77.5 s | 2.2 s faster training |
| SOL-ExecBench | Optimize GPU kernels toward hardware limits | Mean SOL score across 235 kernels | 0.699 | 0.754 | 18% reduction in gap to the optimal performance estimate of 1.0 |
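The improvement figures in the table follow directly from the reported scores. The short sketch below reproduces the arithmetic; the helper name `gap_reduction` is ours, introduced only for illustration.

```python
def gap_reduction(previous: float, new: float, optimum: float = 1.0) -> float:
    """Fraction of the remaining gap to `optimum` closed by moving from `previous` to `new`."""
    return (new - previous) / (optimum - previous)


bpb_gain = 0.9372 - 0.9109                    # NanoChat: lower BPB is better
speedrun_gain = 79.7 - 77.5                   # NanoGPT Speedrun: seconds saved
sol_gap_closed = gap_reduction(0.699, 0.754)  # SOL-ExecBench: share of gap to 1.0 closed

print(f"BPB improvement:    {bpb_gain:.4f}")
print(f"Seconds saved:      {speedrun_gain:.1f}")
print(f"SOL gap reduction:  {sol_gap_closed:.1%}")
```

For SOL-ExecBench, the gap to the optimum shrinks from 1.0 - 0.699 = 0.301 to 1.0 - 0.754 = 0.246, which is where the roughly 18% figure comes from.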
Case study 1: NanoChat Autoresearch
Andrej Karpathy’s NanoChat autoresearch repo is a popular starting point for automated research systems. The task is to train a small language model to the lowest validation loss, measured in bits per byte (BPB), within a fixed five-minute budget on a single GPU. It is a natural test of our system because experiments are fast, variance is low, and reward hacks are relatively easy to detect.
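For readers unfamiliar with the metric, bits per byte normalizes the model's cross-entropy by the number of raw bytes in the evaluated text rather than by the number of tokens, which makes it comparable across tokenizers. The sketch below shows the usual conversion, assuming the loss is a summed cross-entropy in nats; the function name and the example numbers are hypothetical, and the benchmark's own evaluator may differ in details such as special-token handling.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a text into bits per byte."""
    return total_loss_nats / (math.log(2) * total_bytes)


# Hypothetical numbers: 1,000 tokens at a mean loss of 3.0 nats, covering 4,500 bytes.
num_tokens, mean_loss_nats, num_bytes = 1_000, 3.0, 4_500
bpb = bits_per_byte(num_tokens * mean_loss_nats, num_bytes)
print(f"{bpb:.4f} bits per byte")
```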
Perhaps for those reasons, a public collaborative effort has already formed around this setup. The autoresearch@home project extends the original setup into a collaborative setting where several dozen humans and hundreds of their agents collectively improve performance. That gives us a stronger comparison point than Karpathy’s single overnight run. We wanted to test whether our system could improve on solutions produced by an entire community of humans and agents.
Our system starts from the same initial seed solution as the Autoresearch code. We initially searched on NVIDIA H100 GPUs, then transferred the discovered solution to an NVIDIA B200 GPU for a fair comparison with public results. After removing minor reward hacks from the previous best autoresearch@home solution and evaluating it on 10 random seeds, its mean performance is 0.9372 BPB. Our system found a solution that reached 0.9109 BPB, a 0.0263 BPB improvement. Measured another way, our solution reaches the BPB of Karpathy’s original overnight autoresearch run roughly 1.3x faster than the best autoresearch@home solution does.
FIGURE 1
Autoresearch starts from an already optimized model with some non-trivial design decisions baked in. We therefore tested whether our system could also make improvements from a much weaker starting point: a naive initial implementation (a vanilla Transformer with AdamW). Our system improved the model from 1.059 BPB to 0.9344 BPB (evaluated on an NVIDIA B200 GPU), again outperforming the best solution produced by the autoresearch@home community. This does not necessarily prove independent rediscovery, since the underlying models may know many public techniques, including those used or created by the autoresearch@home community, but it does show that the search process can assemble a competitive training stack from a much weaker starting point. The resulting solution also differed in several ways from the public best solution.
FIGURE 2
FIGURE 3
What modifications did our system come up with? The best solutions were not driven by one trick. They combined changes to architecture, short-context memory, auxiliary losses, attention, optimizer behavior, weight decay schedules, compiler settings, and more.
One of the biggest gains came from a richer short-context memory mechanism. The baseline already uses value embeddings; our system extended this idea with hashed bigram and trigram embedding tables, mixed into the attention value path through learned gates. This gave the model a cheap way to use local n-gram information without paying the time cost of slower convolutional or attention-heavy alternatives.
This connects to recent work such as DeepSeek Engram, which explores hash tables as a sparsity axis. In our setting, hash tables can add 1-2 billion sparse parameters to a roughly 50M parameter model: most entries are inactive on any given batch, and lookup is cheap. Similar hash-table and n-gram ideas also appear in top NanoGPT Speedrun submissions. The system adapted this family of ideas to the fixed-budget setting by injecting hashed bigram and trigram embeddings into attention value vectors across multiple layers, with different hashes per layer to reduce repeated collisions. We are not aware of prior work using this exact variant.
The expandable boxes below include selected technical details from the system’s solutions. We manually inspected these outputs and used AI-assisted analysis to understand the techniques and screen for reward hacks. We may still have missed errors in kernel optimization where we are not specialists, but that is also part of the point: the ideas presented here came from the system, not from our prior expertise.
Hash tables
In our best solution, on top of the standard unigram value embedding from the starting solution, the model hashes each bigram and trigram into fixed-size tables and mixes the looked-up vectors into the attention value path through learned, input-dependent per-head gates, effectively folding a classic n-gram model into the transformer's value stream.
    for j, layer_i in enumerate(ve_layers):
        self.bigram_ves[str(layer_i)] = nn.ModuleList([
            nn.Embedding(self.bigram_table_size, half_kv_dim),
            nn.Embedding(self.bigram_table_size, half_kv_dim),
        ])
        self.bigram_hash_primes_per_layer[layer_i] = _decorr_bigram_primes[j]

    for j, layer_i in enumerate(sorted(self.trigram_ve_layers)):
        self.trigram_ves[str(layer_i)] = nn.ModuleList([
            nn.Embedding(self.trigram_table_size, half_kv_dim),
            nn.Embedding(self.trigram_table_size, half_kv_dim),
        ])
        self.trigram_hash_primes_per_layer[layer_i] = _decorr_trigram_primes[j]

    v = v + gate.unsqueeze(-1) * ve                # standard value embedding
    v = v + bg_gate.unsqueeze(-1) * bigram_ve      # additional bigram embedding from lookup table
    v = v + tg_gate.unsqueeze(-1) * trigram_ve     # additional trigram embedding from lookup table
The solution also gives different transformer layers different hash functions (with disjoint hash prime pairs).
    self.bigram_hash_primes_per_layer[layer_i] = _decorr_bigram_primes[j]
    self.trigram_hash_primes_per_layer[layer_i] = _decorr_trigram_primes[j]
That means collisions still happen, but they are less likely to happen in the same way across layers.
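To make the per-layer hashing idea concrete, here is a small pure-Python sketch. The excerpts above do not show the actual hash function, so the multiply-add-modulo form, the prime pairs in `LAYER_PRIMES`, the table size, and the toy vocabulary are all assumptions chosen for illustration. The sketch counts how many bigram pairs share a slot in one layer versus in both layers at once.

```python
from itertools import product

TABLE_SIZE = 64                             # toy table size (assumption)
LAYER_PRIMES = {0: (31, 17), 1: (101, 53)}  # hypothetical disjoint prime pairs per layer


def bigram_slot(prev_tok: int, cur_tok: int, primes, table_size: int = TABLE_SIZE) -> int:
    """Map a (previous token, current token) pair to a table index."""
    p1, p2 = primes
    return (p1 * prev_tok + p2 * cur_tok) % table_size


bigrams = list(product(range(32), repeat=2))  # 1,024 bigrams squeezed into 64 slots


def colliding_pairs(layers) -> int:
    """Count unordered bigram pairs that share a slot in every listed layer."""
    buckets = {}
    for prev_tok, cur_tok in bigrams:
        key = tuple(bigram_slot(prev_tok, cur_tok, LAYER_PRIMES[layer]) for layer in layers)
        buckets[key] = buckets.get(key, 0) + 1
    return sum(n * (n - 1) // 2 for n in buckets.values())


same_in_layer0 = colliding_pairs([0])
same_in_both = colliding_pairs([0, 1])
print(f"pairs colliding in layer 0: {same_in_layer0}")
print(f"pairs colliding in both layers: {same_in_both}")
```

Because two bigrams must collide under both hash functions to be confused in every layer, the second count can never exceed the first, and with decorrelated primes it should be much smaller. That is the sense in which later layers can help disambiguate entries that an earlier layer merged.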
The run optimizing the vanilla Transformer used some of the same techniques as our best solution, including hash tables and squared-ReLU MLPs. But it also converged on a different (yet equally competitive) final stack, including token-shifting, weight averaging before eval, and byte-level feature embeddings. This suggests the system was not merely repeating the same discoveries it found in the other run. The expandable box below shows a few modifications unique to the vanilla Transformer run.
Optimizing a vanilla Transformer
Many of the changes in the vanilla Transformer solution also appear in our best solution (which came from starting our system with the Autoresearch initial seed code), such as replacing AdamW with Muon and adding hash tables. A few other improvements did not emerge in the main run that produced the best solution, yet stood out to us. The first is causal token shifting, which blends the previous token's hidden state into the current token's input to the attention projections Q and K, with a learned coefficient per dimension.
    B, T, C = x.size()
    x_prev = F.pad(x[:, :-1, :], (0, 0, 1, 0))
    q = self.c_q(x + self.q_shift_beta * x_prev).view(B, T, self.n_head, self.head_dim)
    k = self.c_k(x + self.k_shift_beta * x_prev).view(B, T, self.n_head, self.head_dim)
    v = self.c_v(x).view(B, T, self.n_head, self.head_dim)
The second is a set of byte-level features injected right after the token embedding. The byte-level features represent information about what bytes (e.g., individual characters) tokens are composed of. Tokens consisting of similar bytes will get similar byte-level embeddings. The byte feature embedding matrix is built as follows:
    combined = torch.zeros(vocab_size, 769)
    for token_id in range(vocab_size):
        raw_bytes = tokenizer_enc.decode_single_token_bytes(token_id)  # variable length
        if len(raw_bytes) > 0:
            for b in raw_bytes[:max_bytes]:  # max_bytes=16
                combined[token_id, b] += 1.0 / len(raw_bytes)      # [0:256] byte-frequency histogram
            combined[token_id, 256 + raw_bytes[0]] = 1.0           # first byte one-hot
            combined[token_id, 512 + raw_bytes[-1]] = 1.0          # last byte one-hot
            combined[token_id, 768] = len(raw_bytes) / max_bytes   # length feature

    torch.manual_seed(1337)
    proj = torch.randn(769, embed_dim) * 0.01
    init_emb = combined @ proj  # [vocab, embed_dim]
    self.embed = nn.Embedding(vocab_size, embed_dim)
    self.embed.weight.data.copy_(init_emb)  # used only as the INIT
These embeddings are then updated by gradient descent during training, and added after the token embedding alongside the bigram and trigram embeddings:
    x_base = self.wte(idx)                # token embedding
    gates = self.embed_mixer(x_base)      # per-token gates over 4 sources [B,T,4]
    x = x_base
    x = x + gates[:,:,0:1] * bi_raw                             # bigram hash
    x = x + gates[:,:,1:2] * tri_raw                            # trigram hash
    x = x + gates[:,:,2:3] * self.byte_embed.get_raw(idx)       # byte-content
    x = x + gates[:,:,3:4] * self.byte_boundary.get_raw(idx)    # byte-boundary
    x = x + self.ssm_light(x)
    x = self.embed_ctx_norm(x)
These are just a few of the changes our system made in this run.
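The claim that tokens made of similar bytes receive similar byte-level features can be checked on a toy example. The sketch below rebuilds the 769-dimensional feature layout from the byte-feature excerpt above (byte histogram, first-byte one-hot, last-byte one-hot, length) in plain Python for three made-up tokens. It deliberately stops before the random projection and any training, and the helper names `byte_features` and `cosine` are ours.

```python
import math

MAX_BYTES = 16
FEATURE_DIM = 769  # 256 histogram + 256 first-byte + 256 last-byte + 1 length


def byte_features(raw: bytes, max_bytes: int = MAX_BYTES) -> list:
    """Build the raw byte-level feature vector for one token's bytes."""
    feats = [0.0] * FEATURE_DIM
    if raw:
        for b in raw[:max_bytes]:
            feats[b] += 1.0 / len(raw)       # byte-frequency histogram
        feats[256 + raw[0]] = 1.0            # first byte one-hot
        feats[512 + raw[-1]] = 1.0           # last byte one-hot
        feats[768] = len(raw) / max_bytes    # length feature
    return feats


def cosine(u, v) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


cat, car, xyz = (byte_features(tok) for tok in (b"cat", b"car", b"xyz"))
sim_close = cosine(cat, car)  # shares the bytes "c" and "a" and the first byte
sim_far = cosine(cat, xyz)    # shares only the length feature
print(f"cat vs car: {sim_close:.3f}, cat vs xyz: {sim_far:.3f}")
```

In this toy case the pair that shares bytes scores far higher than the pair that shares only a length, which is the structure the initialization hands to the model before gradient descent takes over.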
NanoChat shows how asking our system to improve fixed-budget training led to the discovery of many compounding, budget-aware improvements. The next test was whether the same process could still find gains after years of public human optimization on a benchmark. We tested that on NanoGPT Speedrun, whose best public solution has been highly optimized by the community over two years.
Case study 2: NanoGPT Speedrun
NanoGPT Speedrun is a similar task, yet it is much harder to beat the state of the art because a large community has been optimizing solutions for it for over two years. Instead of asking how low a validation loss can be achieved within a fixed time budget, the benchmark asks how quickly a small GPT-style model can be trained to a fixed validation loss of 3.28 on the FineWeb text dataset, using a single HGX H100 8-GPU node.
This is a mature community effort, with 83 human record-setting contributions to the leaderboard so far and hundreds of proposed PRs. Since mid-2024, the training time has been pushed from roughly 45 minutes down t
[truncated for AI cost control]