The Hardware Behind AI
A deep dive into the hardware fundamentals of AI, covering transistors, semiconductors, chip fabrication, die shrinks, the rise of GPUs, and NVIDIA's architecture.
Sidwyn Koh
Jun 06, 2026
Welcome back to Path to Staff. I recently left Meta for personal reasons (I was not laid off!) and have found much more time to write. That means learning as much as I can about AI, distilling what I’ve learned, and sharing it with you all.
I’m an engineer, but never really dug into how AI works. I use it every day, yet I feel so far away from the tech. A few weeks ago, I finally dove really deep to understand AI from the bottom up. Understanding its internals has helped make me far better at using it.
And I want to share those learnings with you.
This new series is called Unpacking AI.
A five-part series
This deep dive is a five-part series:
The Hardware Behind AI. Transistors, semiconductors, and fabricators. Learn about the big players (TSMC, NVIDIA, ASML). The memory-compute bottleneck. And all the acronyms you always wondered about (TPU, ASIC, FPGA, CUDA, etc.).
Data & Model Architecture. Learn about what models are made of. We’ll cover the paper that started it all (“Attention Is All You Need”), plus talk about transformers and diffusion models. And of course, we’ll cover how training data is prepared for these models (what sources? how is the data decontaminated and filtered?).
Training. The mechanics of teaching a model. How does pretraining work? What goes into it (backpropagation, optimizers, loss functions)? What scaling laws should we understand before we kick off an expensive training run (which can cost up to hundreds of millions of dollars)?
Post-Training & Alignment. How does one guide a model once it’s been taught? How do we apply safety? How do we benchmark and know the model got better? How do we evaluate a model’s performance?
Inference, Serving and Agents. This might be the most familiar topic, since it’s closest to you as an AI user. How does a model output its tokens and stream the result to you (SSE)? How do systems stay fair and fast? What tools are available (MCP, RAG, tool use) and how do agents work?
Over the course of this series, I expect the syllabus to change as I learn more about AI. For sections with several acronyms, I also list their definition at the top of the section. I also welcome questions in the comments! This will help me improve this series.
Quick note: As much as I love using AI, this article and future ones will be written by hand, with light editing by AI. After all, we’re all too sick and tired of reading AI slop. Some images, however, will be generated by AI.
Transistors and Their Importance
To understand Artificial Intelligence, we first have to go to the core of it.
AI runs off GPU chips. These chips are made from transistors which are manufactured using EUV machines.
Let’s break each of these down, starting with a transistor.
A transistor is a semiconductor device that controls the flow of electricity. It uses a small electrical signal at one terminal to control a much larger current. It either (1) boosts a signal, or (2) decides whether a current can pass. In other words, it acts as either an amplifier or switch.
A semiconductor, most commonly silicon, is a material that conducts electricity only under certain conditions. Its conductivity can be modified by adding impurities.
Who designs these chips? NVIDIA and AMD are the biggest players, with Google, Amazon and now Meta not far behind. However, all of these companies operate as “fabless” designers. That means they only design the chip architecture, while outsourcing the physical production to foundries, which can cost up to $20B to build and maintain.
Who then makes these transistors? TSMC (Taiwan Semiconductor Manufacturing Company) does. To do so, it needs special machines, called extreme ultraviolet (EUV) machines, and a process called lithography, which prints circuit patterns onto silicon wafers. There are other foundries like Samsung and Intel, but they are not as advanced as TSMC, which currently holds roughly 70% of global foundry revenue.
Who makes these EUV machines? They are currently manufactured only by ASML (Advanced Semiconductor Materials Lithography), a Dutch company with a monopoly on EUV machines. China is fast catching up, but is still roughly 5 years behind.
A picture of ASML laboratory (source)
Fun fact: there are no major competitors to ASML today. It took them 30 years to reach this stage. They’ve integrated thousands of suppliers to build a generator that fires 50,000 droplets per second. They also own Cymer, the major company that makes these EUV light sources. The light has such a short wavelength (13.5nm) that there is no natural source for it.
OK, now we know what a transistor is and how it’s made, so we can start to understand GPUs (graphics processing units). A single GPU contains billions of transistors, which are packed onto a die (a small rectangular piece of silicon cut from a wafer) manufactured with various fabrication technologies.
We’ll now get into these different types of fabrication technologies.
Die Shrinks: Going from 10,000nm → 2nm in 5 decades
Let’s take a quick history detour of die shrinks.
Why do we need the features on a die to be smaller? Smaller features mean more transistors per square millimeter. This allows manufacturers like TSMC to pack more into each chip and build more capable chips: more cores, more cache and more tensor units. One caveat is that smaller features are harder to manufacture and lead to more defects per wafer, so a shrink isn’t necessarily cheaper to produce.
What’s the history of these dies and die shrinks? It’s complicated, but I’ll try to explain in a couple of paragraphs. The first microprocessor, the Intel 4004, was launched in 1971 on a 10µm process. This means that the gate length (the distance between the drain and source electrodes) was 10µm, so electrons had to travel across 10,000nm whenever a transistor switched on. A shorter gate means faster switching.
The Intel 4004. Picture taken from Wikipedia.
At that point in time (1971), this chip was designed for a Japanese calculator company, Busicom. However, once Intel realized that this was much more useful for the mass market, Intel repurchased the marketing and technology rights from Busicom.
A few years earlier, in 1965, Gordon Moore made the observation you’ve heard of as Moore’s Law: that the number of transistors on a chip would keep doubling at low cost. He first predicted a doubling every year, then revised it in 1975 to roughly every two years.
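To get a feel for what doubling every two years means, here is a minimal sketch that projects transistor counts forward from the Intel 4004. The starting point of roughly 2,300 transistors is the commonly cited figure for the 4004 and is not from the text above. The function name and the fixed two-year doubling period are simplifying assumptions, so treat the output as an idealized curve and not a record of real chips.

```python
def projected_transistors(start_count: int, start_year: int, year: int,
                          doubling_years: float = 2.0) -> float:
    """Idealized Moore's Law: the count doubles every `doubling_years`."""
    doublings = (year - start_year) / doubling_years
    return start_count * 2 ** doublings


# Assumption: ~2,300 transistors on the Intel 4004 in 1971.
for year in (1971, 1981, 1991, 2001, 2011, 2021):
    count = projected_transistors(2_300, 1971, year)
    print(f"{year}: ~{count:,.0f} transistors")
```

Fifty years of idealized doubling lands in the tens of billions, which is the same order of magnitude as today's largest chips.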
Over the next few decades, we went from 600nm → 250nm → 180nm → 130nm → 45nm. In the early 2000s, manufacturers hit a wall: existing lithography could not print features small enough for the next jump. However, a TSMC engineer named Burn-Jeng Lin had a breakthrough: adding water between the lens and the wafer, a technique known as immersion lithography. ASML, at that time a smaller European challenger behind Nikon and Canon, made a huge bet on it in 2003-04. They went all in on immersion and won.
Nikon and Canon stuck to their guns on 157nm dry lithography, but by then it was too late. ASML had a huge head start. Canon essentially exited leading-edge lithography, and while Nikon did eventually build immersion tools, it never recovered its lead, and later sat out EUV entirely.
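Why does water help? A common rule of thumb in lithography is the Rayleigh criterion: the smallest printable feature is roughly k1 × wavelength / NA, where NA (numerical aperture) can only exceed 1 when the gap between the lens and the wafer is filled with something denser than air. The sketch below applies that formula. The formula itself, the k1 value, and the NA values are textbook-style assumptions that do not come from this article, chosen only to show the trend.

```python
def min_feature_nm(wavelength_nm: float, numerical_aperture: float,
                   k1: float = 0.3) -> float:
    """Rayleigh criterion: smallest printable feature, in nanometers."""
    return k1 * wavelength_nm / numerical_aperture


# (label, wavelength in nm, assumed numerical aperture)
setups = [
    ("193nm dry", 193.0, 0.93),
    ("193nm immersion (water)", 193.0, 1.35),
    ("EUV", 13.5, 0.33),
]

for label, wavelength, na in setups:
    print(f"{label}: ~{min_feature_nm(wavelength, na):.1f}nm")
```

With these assumed values, the same 193nm light prints noticeably finer features once water is added, without changing the light source at all.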
Today, your iPhones run on 3nm chips. GPUs today all use TSMC’s 5/4/3 nm variants. The state of development is currently at 2nm, and there are targets to hit 1.6nm (TSMC’s “A16”) around late 2026–2027, with 1.4nm later and true 1nm not expected until the back half of the decade.
Unfortunately, one sad fact is that these numbers no longer mean gate lengths. They’re used more for marketing. The thing that actually improves is transistor density, measured in MTr/mm² (millions of transistors per square millimeter), and even that isn’t measured consistently across foundries.
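Density itself is simple arithmetic: transistor count divided by die area. A minimal sketch with a made-up chip (both input numbers are hypothetical and chosen only for round arithmetic, and the function name is illustrative):

```python
def density_mtr_per_mm2(transistor_count: float, die_area_mm2: float) -> float:
    """Transistor density in millions of transistors per square millimeter."""
    return transistor_count / 1e6 / die_area_mm2


# Hypothetical chip: 50 billion transistors on a 500 mm² die.
print(density_mtr_per_mm2(50e9, 500), "MTr/mm²")
```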
Five decades of shrinking dies
The Shift from CPU to GPU
Let’s talk a bit about how GPUs got famous in the first place. It all started with the CPU.
CPUs have been around since the first microprocessor in 1971. However, when games like Quake were introduced in the 1990s, CPUs struggled badly to render the graphics. I remember my own computer grinding to a halt whenever I played intense games (DotA, anyone?).
Graphics accelerator cards were the answer. Instead of having a few sophisticated cores, you’d have thousands of dumb cores running in parallel. Each individual core is super weak, but when combined together in a GPU, the throughput is gigantic. These were great for games, since rendering a 4K image meant computing the colors of 8M pixels independently. NVIDIA wasn’t the first to build these cards (companies like 3dfx and ATI got there earlier), but they did coin the term GPU with the GeForce 256 in 1999.
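The 8M figure comes straight from the resolution. A quick check, assuming the common 3840 × 2160 definition of 4K and a target of 60 frames per second (the frame rate is an assumption added here for illustration):

```python
WIDTH, HEIGHT = 3840, 2160   # assumed "4K UHD" resolution
FPS = 60                     # assumed target frame rate

pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * FPS

print(f"{pixels_per_frame:,} pixels per frame")
print(f"{pixels_per_second:,} pixel colors to compute every second")
```

Since no pixel needs to wait on another, this is exactly the kind of workload that rewards many weak cores over a few strong ones.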
Fast forward to 2006: Jensen Huang, CEO of NVIDIA, made a huge bet. He believed that Moore’s Law was slowing and that single-threaded CPU performance would not keep improving in the long run. He wanted to build a programming platform for scientific computing on graphics cards. This bet was targeted at scientists who wanted access to supercomputers. At that point in time, these supercomputers were multi-million dollar machines owned only by government labs and a few corporations.
This bet was called CUDA (Compute Unified Device Architecture). It lets a program running on the CPU offload parallel computing tasks to the GPU. This ended up being NVIDIA’s moat. The ecosystem (PyTorch, TensorFlow, which we will cover in later chapters) ended up being built CUDA-first. Silicon and networking were also specialized around this architecture.
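To make the offloading idea concrete: in CUDA's programming model, you write a small function (a kernel) from the point of view of a single thread, and the GPU runs it across a grid of many threads, each handling one element. The sketch below only imitates that shape in plain Python, with an ordinary loop standing in for the parallel launch. The names `saxpy_kernel` and `launch` are hypothetical, and nothing here actually runs on a GPU.

```python
def saxpy_kernel(thread_id, a, x, y, out):
    """Work done by ONE thread: compute a single output element."""
    if thread_id < len(x):  # guard against threads past the end of the data
        out[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, num_threads, *args):
    """Stand-in for a GPU launch. A real GPU would run these concurrently."""
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

# Launching more threads than elements is fine: the extras do nothing.
launch(saxpy_kernel, 8, 2.0, x, y, out)
print(out)
```

The key design point is that the kernel never loops over the data itself. Each thread only knows its own index, which is what lets the hardware spread the work across thousands of cores.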
The first hint of AI leveraging GPUs came about in 2012. Three University of Toronto researchers, Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton, entered a neural network called AlexNet into the ImageNet image-recognition competition. (You might recognize these names!) It was trained on two NVIDIA GTX 580 gaming GPUs in Alex’s bedroom.
This showed, for the first time, that GPUs were a practical way to train deep neural networks. The computation inside these networks is mostly matrix multiplication, and it could now be done in someone’s bedroom. Had the same neural net been trained on a CPU of that era, it would have taken dramatically longer.
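Why does matrix multiplication suit a GPU so well? Every cell of the output is a dot product that depends only on one row and one column of the inputs, so all cells can be computed at the same time. A minimal pure-Python sketch (the function names are illustrative, and real frameworks rely on heavily optimized GPU libraries instead):

```python
def output_cell(a, b, row, col):
    """One independent unit of work: dot product of a's row and b's column."""
    return sum(a[row][k] * b[k][col] for k in range(len(b)))


def matmul(a, b):
    """Each (row, col) cell could be handed to a different GPU thread."""
    rows, cols = len(a), len(b[0])
    return [[output_cell(a, b, r, c) for c in range(cols)] for r in range(rows)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul(a, b))
```

A 2 × 2 result has only four independent cells, but the matrices inside a neural network produce millions of them per multiplication, which is where thousands of simple cores pay off.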
Here is how a CPU (left) compares to a GPU (right):
How a GPU is structured. You can see that there are many more cores in a GPU! Taken from the CUDA programming guide.
Structure of an NVIDIA GPU
Now let’s take a look at a GPU.
Fair warning: it starts to get very technical from here on out. I will try my best to break it down.
We’ll start with an overall view of the Blackwell GPUs (launched Q4 2024). This is not the latest chip architecture: Rubin R100 was recently announced and is planned to ship in a few months. Unfortunately, I could not find a good infographic on R100, so let me know if you do.
Nevertheless, let’s examine the Blackwell Ultras, since they give us a good sense of how a GPU works end-to-end.
Cleaned up version of an image from Blackwell’s overview.
Here in Blackwell, we have 2 dies that have been joined together with a custom interconnect called NV-HBI. An interconnect is basically a physical link between two things. In this case, NV-HBI is an ultra-low latency, proprietary die-to-die interconnect that provides 10 TB/s of bandwidth.
Now, let’s think of each die as a city. Each city contains:
4 Graphics Processing Clusters (GPCs)
A GPC is a district within the city. Think of it as a district housing a group of factories that share some local infrastructure to pass numbers around. Within a GPC, there are 20 SMs (streaming multiprocessors), which means that there are 80 SMs per die (4 × 20), and 160 SMs total across both dies. That’s a lot of compute power!
GigaThread Engine + MIG Control
This is the city-level dispatcher. Its job is simple. It receives work from the CPU (over PCIe Gen 6) and farms it out to the GPCs. The “MIG” part stands for Multi-Instance GPU, which lets the chip be sliced into up to 7 logical GPUs.
Each of these GPUs looks isolated to different tenants, which is important for hyperscalers (e.g. AWS, Google Cloud) to run multipl
[truncated for AI cost control]