Show HN: AI Infrastructure Knowledge Base
AI infrastructure knowledge base
Scope: this is the landing page for the knowledge base. It describes what the knowledge base covers and where to start; it is a reference page, not a single implementation topic.
A practical, citable knowledge base for deploying, operating, and optimising GPU clusters, from the physical datacentre and the InfiniBand fabric up through Kubernetes, Slurm and Ray, distributed training and reinforcement-learning post-training, and LLM inference serving at scale. It covers the full NVIDIA range (Ampere, Hopper, and Blackwell datacenter GPUs, RTX consumer and workstation cards, and DGX systems, including DGX Spark) along with their operational, installation, and networking differences. The current focus is the Blackwell Ultra generation (B300 / GB300 NVL72). Current to mid-2026.
It is written for the people who keep accelerators busy: systems administrators, GPU server engineers, platform engineers, SREs, and MLOps engineers. Every page follows a predictable shape, carries reference examples (Ansible, Helm/Kubernetes, Slurm, PyTorch, vLLM), and includes architecture diagrams and links to the primary papers and documentation.
This knowledge base (ai-infrastructure.net) is built and maintained by setloop.io.
flowchart LR
  HW["GPU hardware"] --> BUILD["Build and commission"]
  BUILD --> PLATFORM["Cluster platform"]
  PLATFORM --> TRAIN["Training and post-training"]
  PLATFORM --> SERVE["Inference serving"]
  TRAIN --> OPS["Operate and optimise"]
  SERVE --> OPS
Open the reading paths · Browse the glossary
What's inside
GPU hardware
The full NVIDIA range: Ampere, Hopper, Blackwell datacenter GPUs; RTX consumer and workstation cards; DGX systems and DGX Spark, and how their ops differ.
GPU generations · RTX & workstation
Build & commission
Bill-of-materials validation, datacentre power and cooling, the HPC networking fabric, and commissioning to acceptance.
Networking fabric · Blackwell platform
Cluster technologies
Kubernetes, k3s, Ray and Slurm, each covering what it is, why and when to use it, and how to use, develop, scale, serve, and fine-tune with it.
Orchestration overview
Training & post-training
FSDP, DDP, ZeRO, tensor and pipeline parallelism, DiLoCo; SFT/LoRA, DPO and GRPO; verl, slime, SkyRL and more.
Fine-tuning & RL · RL libraries
Inference serving
Serving the latest open-weight models (Kimi K2, GLM, DeepSeek, Qwen), continuous batching, KV cache, and disaggregated prefill/decode.
Inference serving · Disaggregation
Operate & optimise
Observability, RAS and XID failure modes, NCCL and hardware tuning, SLOs/SLIs, and error-budget alerting.
Observability · Reliability & RAS
Recipes & runbooks
Ansible playbooks, Helm/Kubernetes manifests, telemetry stacks, and step-by-step operational runbooks for the recurring incidents.
Recipes & manifests · Runbooks
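The Inference serving entry above lists the KV cache among its topics. As a rough illustration of why it matters for capacity planning, the sketch below estimates KV-cache size for standard multi-head or grouped-query attention. The function name and the model dimensions are hypothetical, chosen only for illustration, and the formula does not apply as written to compressed-cache designs such as multi-head latent attention.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # assumption: FP16/BF16 cache entries
) -> int:
    """Estimate KV-cache size for standard MHA/GQA attention.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer.
    """
    return (
        2
        * num_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_elem
    )


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 8192-token context.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
    print(f"{size / 2**30:.2f} GiB per sequence")  # 1.00 GiB per sequence
```

Under these assumed dimensions a single 8192-token sequence needs about 1 GiB of cache, so the number of concurrent sequences a GPU can hold is bounded by the memory left after the weights are loaded.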
How to use this knowledge base
Concept pages explain a topic and its traps: overview, core knowledge, a don't-miss checklist, failure modes, and references.
Recipe and runbook pages are example-first: copy-paste manifests, playbooks, and step-by-step procedures with the commands to apply and verify them.
Per-technology pages give each cluster technology, training algorithm, RL library, and runbook its own page following a fixed shape: what it is, why and when to use it, how to use, develop, scale, serve for inference, fine-tune, and run on optimised hardware, plus a cookbook of common use cases.
Suggested starting points
New here? Read the knowledge base index for the full map and reading paths.
Standing up a cluster? Ansible bring-up → Kubernetes & Helm platform → telemetry.
Serving a model? Serve open-weight models → SLO/SLI catalog.
Fine-tuning? SFT & LoRA → GRPO → RL libraries.
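The serving path above ends at the SLO/SLI catalog, and the Operate & optimise section mentions error-budget alerting. As a minimal sketch of the arithmetic behind those terms, the helpers below compute an availability error budget and a burn rate. The function names, the 30-day window, and the example numbers are assumptions for illustration, not values taken from this knowledge base.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent.

    A burn rate of 1.0 uses up the whole budget exactly at the end of the window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return observed_error_ratio / (1.0 - slo_target)


if __name__ == "__main__":
    # Hypothetical service: 99.9% availability SLO over 30 days.
    print(round(error_budget_minutes(0.999), 1))  # 43.2
    # Hypothetical observation: 0.5% of requests failing right now.
    print(round(burn_rate(0.005, 0.999), 1))  # 5.0
```

At a burn rate of 5 the 30-day budget would be gone in about six days, which is the kind of condition an error-budget alert is meant to catch early.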
References
NVIDIA Blackwell architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
NVIDIA DGX SuperPOD reference architecture: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html
Kubernetes documentation: https://kubernetes.io/docs/home/
PyTorch distributed overview: https://pytorch.org/docs/stable/distributed.html
vLLM documentation: https://docs.vllm.ai/en/latest/
Related: Start here · Glossary · GPU generations · Operational runbooks