Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch
NanoEuler is a GPT-2-class language model built entirely in C/CUDA from scratch, with no external ML libraries. It includes a hand-written BPE tokenizer, forward and backward passes, pretraining on books and web data, and supervised fine-tuning (SFT). The project runs on CPU for a small showcase model and on GPU using cuBLAS and FlashAttention. It is an educational artifact demonstrating a complete training pipeline.
A GPT-2-class language model built entirely from scratch in C/CUDA — no PyTorch, no autograd, no ML libraries. The forward and backward passes are written and verified by hand, and the whole training pipeline lives in this repo: a hand-written byte-level BPE tokenizer, pretraining on a books + web corpus, and supervised fine-tuning into a chat model (RLHF/DPO planned). It runs on CPU (libm + OpenMP) for a small showcase model, and a full from-scratch CUDA engine — cuBLAS matmuls, a hand-written FlashAttention, validated against a CPU reference by a full-model gradient check — trains a ~116M-parameter model on a single RTX 4070.
Status & honesty. This is a research/educational artifact, built in public. At ~116M parameters trained on a single consumer GPU, it is a text generator in the spirit of GPT-2-small: fluent-ish English, no real world knowledge. It is not a capable assistant — the chat model demonstrates that the pretrain→SFT pipeline works end to end, it is not a useful chatbot. The point of the project is the from-scratch engineering and the complete, understandable training pipeline.
    make check             # verify the backward pass (gradient check, double precision)
    make                   # build the training binary
    ./nanoeuler train      # train the small showcase model (~0.76M params)
    ./nanoeuler train big  # train the larger model (~10M params; meant for a GPU)
    ./nanoeuler chat       # REPL: type a prompt, the model continues it
Why "Euler"?
A residual block computes
x = x + f(x)
Read it as a step of numerical integration. The forward-Euler method advances an ordinary differential equation dx/dt = f(x) by
x(t+Δt) = x(t) + Δt · f(x(t))
With step size Δt = 1 this is exactly the residual update. So a deep residual network is a discretized ODE: depth is integration time, and each layer integrates the hidden state forward by one Euler step. This is the view behind work like Neural ODEs (a ResNet is the Euler discretization of a continuous flow). The project is named after Leonhard Euler, who gave us that integration method.
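A minimal sketch of that correspondence in C, assuming a toy layer function: `toy_f`, `euler_step` and `residual_update` are hypothetical names for illustration and do not come from the repository. The residual update is written as a call to the Euler step with a step size of 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layer function, for illustration only: f(x) = -0.5 * x. */
void toy_f(const float *x, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = -0.5f * x[i];
}

/* One forward-Euler step for dx/dt = f(x): x <- x + dt * f(x).
   scratch must hold n floats; it receives f(x). */
void euler_step(float *x, float *scratch, size_t n, float dt) {
    assert(n > 0);
    toy_f(x, scratch, n);
    for (size_t i = 0; i < n; i++) x[i] += dt * scratch[i];
}

/* The residual update x = x + f(x) is the same step with dt = 1. */
void residual_update(float *x, float *scratch, size_t n) {
    euler_step(x, scratch, n, 1.0f);
}
```

Stacking several `residual_update` calls, each with its own layer function, is what a deep residual network does under this reading: each layer advances the hidden state by one unit of integration time.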
Example output
A sample from the ~116M model after a partial pretraining run on the books + web corpus (prompt Alessandro eat a):
Alessandro eat a icing textile: the satisfied by the servants in order to keep your weight [Using to a heated, collaborated young people that attend the metric process where the rank is authorized and to contain the sedentary. Some state lawyers were able to insert ...
The content is not meaningful, but notice what it learned on its own: real grammar, long clauses, and an encyclopedic register picked up from the web data. This is the expected behaviour of a small model trained on a single GPU — fluent shape, shallow substance. More training and (far) more data improve fluency; world knowledge needs scale this project does not pretend to have.
Architecture
Decoder-only transformer with the building blocks common to current models:
RMSNorm (pre-norm, no bias)
Rotary position embeddings (RoPE) applied to queries and keys
SwiGLU feed-forward: down(silu(gate(x)) * up(x))
Grouped-query attention (GQA): query heads share a smaller set of key/value heads
Multi-token prediction (MTP): K output heads predict the next K tokens; the auxiliary heads improve the learned representation and enable speculative decoding. Generation uses head 0.
No biases anywhere.
Byte-level BPE tokenizer, hand-written, with GPT-2-style pretokenization (a single leading space attaches to the following word, so spaces are not wasted as standalone tokens). Merges are learned on a sample of the corpus; the GPU model uses a 4096-token vocabulary (~3.4 bytes/token on English).
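As an illustration of how byte-level BPE merges are learned, here is a sketch of a single training step: find the most frequent adjacent pair, then replace it with a new token id. The function names are hypothetical and the quadratic pair count is chosen for readability; the repository's tokenizer may be organised quite differently. The sketch assumes the usual convention that ids 0..255 are raw bytes and merged tokens start at 256.

```c
#include <assert.h>
#include <stddef.h>

/* Find the most frequent adjacent pair in ids[0..n). Ties go to the pair
   seen first. Returns the pair's count (0 if n < 2) and writes the pair to
   *best_a and *best_b. O(n^2) for clarity, not speed. */
int most_frequent_pair(const int *ids, size_t n, int *best_a, int *best_b) {
    assert(best_a != NULL && best_b != NULL);
    int best = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        int count = 0;
        for (size_t j = 0; j + 1 < n; j++) {
            if (ids[j] == ids[i] && ids[j + 1] == ids[i + 1]) count++;
        }
        if (count > best) {
            best = count;
            *best_a = ids[i];
            *best_b = ids[i + 1];
        }
    }
    return best;
}

/* Replace each non-overlapping occurrence of (a, b) with new_id, in place,
   scanning left to right. Returns the new sequence length. */
size_t merge_pair(int *ids, size_t n, int a, int b, int new_id) {
    size_t out = 0;
    size_t i = 0;
    while (i < n) {
        if (i + 1 < n && ids[i] == a && ids[i + 1] == b) {
            ids[out++] = new_id;
            i += 2;
        } else {
            ids[out++] = ids[i++];
        }
    }
    return out;
}
```

Repeating these two calls until the vocabulary reaches the target size (4096 for the GPU model) yields the merge table. A practical trainer would count pairs within pretokenized words rather than across the raw stream.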
Each block is x = x + attn(rmsnorm(x)) followed by x = x + swiglu(rmsnorm(x)). A residual connection x = x + f(x) is one step of the forward-Euler method for the ODE dx/dt = f(x) — hence the name, and a nod to Leonhard Euler.
Configurations:
| where | dim | q/kv heads | layers | context | vocab | params |
|---|---|---|---|---|---|---|
| small (CPU, nanoeuler.c) | 128 | 4 / 2 | 4 | 128 | 512 | ~1.05M |
| GPU pipeline (cuda/, run_train) | 768 | 12 / 4 | 16 | 512 | 4096 | ~116M |
The CPU small model trains in a few hours on 12 cores and is a self-contained showcase. The ~116M GPU model is the real pipeline: it pretrains on the books + web mix and is then fine-tuned into a chat model (see below). The head size is 64 (768/12), which fits the FlashAttention kernel.
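The head arithmetic in the table can be checked with a short sketch. It assumes the common GQA convention that consecutive query heads share one key/value head; the function names are hypothetical and the repository may lay heads out differently.

```c
#include <assert.h>

/* Per-head width: the model dimension is split evenly across query heads. */
int head_size(int dim, int n_q_heads) {
    assert(n_q_heads > 0 && dim % n_q_heads == 0);
    return dim / n_q_heads;
}

/* Index of the key/value head used by a given query head, assuming
   consecutive query heads are grouped onto the same KV head. */
int kv_head_for_query(int q_head, int n_q_heads, int n_kv_heads) {
    assert(n_kv_heads > 0 && n_q_heads % n_kv_heads == 0);
    assert(q_head >= 0 && q_head < n_q_heads);
    int group_size = n_q_heads / n_kv_heads;  /* query heads per KV head */
    return q_head / group_size;
}
```

For the GPU configuration (12 query heads, 4 KV heads), each KV head serves three query heads under this grouping, so the key/value projections are a third the size they would be with full multi-head attention.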
Verified backward pass
Hand-written back-propagation is easy to get subtly wrong, so every analytic gradient is compared against a central finite difference. The check runs in double precision so that floating-point cancellation in the finite difference does not make correct gradients look wrong:
    $ make check
    tok   : max rel err 1.02e-04
    qkvw  : max rel err 7.20e-07
    gatew : max rel err 6.86e-08
    ...
    max relative error: 1.02e-04
    >>> backward OK (error

[...] end marker.
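The comparison of analytic gradients against a central finite difference can be sketched on a toy loss. Everything here is an assumption made for illustration: `toy_loss`, `toy_grad`, the step size `h`, and the relative-error formula are hypothetical and may not match what `make check` computes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar loss for illustration: L(w) = sum_i w[i]^3. */
double toy_loss(const double *w, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += w[i] * w[i] * w[i];
    return s;
}

/* Its hand-derived gradient: dL/dw[i] = 3 * w[i]^2. */
void toy_grad(const double *w, double *g, size_t n) {
    for (size_t i = 0; i < n; i++) g[i] = 3.0 * w[i] * w[i];
}

double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Largest relative error between the analytic gradient and the central
   difference (L(w + h e_i) - L(w - h e_i)) / (2h), over all parameters.
   w is perturbed in place and restored before returning. */
double max_rel_err(double *w, const double *analytic, size_t n, double h) {
    assert(h > 0.0);
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double saved = w[i];
        w[i] = saved + h;
        double loss_plus = toy_loss(w, n);
        w[i] = saved - h;
        double loss_minus = toy_loss(w, n);
        w[i] = saved;
        double numeric = (loss_plus - loss_minus) / (2.0 * h);
        double denom = abs_d(analytic[i]) + abs_d(numeric) + 1e-12;
        double rel = abs_d(numeric - analytic[i]) / denom;
        if (rel > worst) worst = rel;
    }
    return worst;
}
```

A correct gradient gives a tiny error, limited by the step size and round-off. A gradient that is off by a constant factor or a sign gives an error of order one, which is what makes the check a useful tripwire.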
After fine-tuning the model answers in the right shape — it follows the instruction→response format, writes complete sentences and stops on its own. The content, though, is shallow and often wrong: this is a small model trained on a single GPU, so it has little world knowledge to express. SFT teaches the model how to respond, not what it knows — that comes from pretraining and scale. This is a faithful, fully-from-scratch demonstration that the pretrain→SFT pipeline works end to end, not a capable assistant.
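The roadmap describes the SFT objective as a response-masked loss. One way to read that, sketched here as an assumption rather than a description of the repository's code, is that per-token losses are averaged over response tokens only, so prompt tokens contribute no gradient. The function name and the mask layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mean negative log-likelihood over response tokens only.
   nll[t]         : per-token loss -log p(target_t), computed elsewhere
   is_response[t] : 1 if token t belongs to the response, 0 for the prompt
   Masked-out positions are skipped, so they add nothing to the loss. */
double masked_mean_loss(const double *nll, const unsigned char *is_response, size_t n) {
    double sum = 0.0;
    size_t count = 0;
    for (size_t t = 0; t < n; t++) {
        if (is_response[t]) {
            sum += nll[t];
            count++;
        }
    }
    assert(count > 0);  /* an example with no response tokens is a data error */
    return sum / (double)count;
}
```

With this masking the model is trained to produce the response given the instruction, not to reproduce the instruction itself.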
Data
Pretraining uses a real books + web mix:
Books — data/get_gutenberg.sh downloads ~95 public-domain Project Gutenberg classics (Austen, Dickens, Dostoevsky, Tolstoy, Melville, the complete Shakespeare, ...). Each book's Project Gutenberg license header/footer is stripped (only the text between the * START ... * / * END ... * markers is kept) so the model trains on prose.
Web — data/get_web.sh pulls a slice of FineWeb-Edu (high-quality educational web text) straight from the Hugging Face parquet files using the DuckDB CLI (a single static binary — no Python, no libraries).
Then concatenate them into the pretraining corpus the trainer reads:
    sh data/get_gutenberg.sh   # books -> data/gutenberg.txt
    sh data/get_web.sh         # web   -> data/web.txt (~1 GB by default)
    cat data/gutenberg.txt data/web.txt > data/pretrain.txt
    sh data/get_alpaca.sh      # instruction data for SFT -> data/alpaca.json
Corpora and model checkpoints are git-ignored (regenerable).
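The header/footer stripping described for the books corpus is done by a shell script in the repository. The same idea can be sketched in C, assuming the marker lines begin with `*** START OF` and `*** END OF` as in current Project Gutenberg files; the function name is hypothetical and the script's matching rules may differ.

```c
#include <assert.h>
#include <string.h>

/* Locate the body of a Project Gutenberg text: everything after the line
   containing the START marker and before the END marker. Returns a pointer
   into text and writes the body length to *len, or returns NULL if either
   marker is missing. The marker strings are an assumption. */
const char *gutenberg_body(const char *text, size_t *len) {
    assert(text != NULL && len != NULL);
    const char *start = strstr(text, "*** START OF");
    if (start == NULL) return NULL;
    const char *body = strchr(start, '\n');  /* skip the rest of the marker line */
    if (body == NULL) return NULL;
    body++;
    const char *end = strstr(body, "*** END OF");
    if (end == NULL) return NULL;
    *len = (size_t)(end - body);
    return body;
}
```

Returning NULL when a marker is missing lets the caller decide whether to skip the file or keep it whole, rather than silently training on license text.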
Roadmap
✅ Hand-written byte-level BPE with GPT-2-style pretokenization.
✅ From-scratch CUDA engine (cuBLAS + FlashAttention), validated by a full-model gradient check.
✅ Pretraining on a books + web mix, with checkpoint/resume.
✅ Supervised fine-tuning (Alpaca) with response-masked loss → a chat model.
⏳ DPO (preference optimization) — the alignment stage, next to build.
⏳ Scale the model and data (toward ~270M) and publish a trained checkpoint people can try.
Files
    nanoeuler.c              CPU model: forward, backward, training, sampling, chat REPL
    cuda/nanoeuler_cuda.cu   GPU engine: BPE, kernels, FlashAttention, pretrain/SFT/infer/chat, gradient check
    data/get_gutenberg.sh    downloads + cleans the Gutenberg books corpus
    data/get_web.sh          downloads a FineWeb-Edu web slice via the DuckDB CLI (no Python)
    data/get_alpaca.sh       downloads the Alpaca instruction data for fine-tuning
    Makefile
    LICENSE
    shakespeare.txt
    .gitignore
License
MIT. See LICENSE.