2026-07-02 17:11 UTC · Updated: 2026-07-02 17:38 UTC

Show HN: AI Infrastructure Knowledge Base


Source: Hacker News AI · Author: hevalon

AI infrastructure knowledge base

Scope: this is the landing page for the knowledge base. It describes what the knowledge base covers and where to start; it is a reference page, not a single implementation topic.

A practical, citable knowledge base for deploying, operating, and optimising GPU clusters, from the physical datacentre and the InfiniBand fabric up through Kubernetes, Slurm and Ray, distributed training and reinforcement-learning post-training, and LLM inference serving at scale. It covers the full NVIDIA range (Ampere, Hopper, and Blackwell datacenter GPUs, RTX consumer and workstation cards, and DGX systems, including DGX Spark) along with their operational, installation, and networking differences. The current focus is the Blackwell Ultra (B300 / GB300 NVL72) generation. Current to mid-2026.

It is written for the people who keep accelerators busy: systems administrators, GPU server engineers, platform engineers, SREs, and MLOps engineers. Every page follows a predictable shape, carries reference examples (Ansible, Helm/Kubernetes, Slurm, PyTorch, vLLM), and includes architecture diagrams and links to the primary papers and documentation.

This knowledge base (ai-infrastructure.net) is built and maintained by setloop.io.

flowchart LR
    HW["GPU hardware"] --> BUILD["Build and commission"]
    BUILD --> PLATFORM["Cluster platform"]
    PLATFORM --> TRAIN["Training and post-training"]
    PLATFORM --> SERVE["Inference serving"]
    TRAIN --> OPS["Operate and optimise"]
    SERVE --> OPS

Open the reading paths · Browse the glossary

What's inside

GPU hardware

The full NVIDIA range: Ampere, Hopper, Blackwell datacenter GPUs; RTX consumer and workstation cards; DGX systems and DGX Spark, and how their ops differ.

GPU generations · RTX & workstation

Build & commission

Bill-of-materials validation, datacentre power and cooling, the HPC networking fabric, and commissioning to acceptance.

Networking fabric · Blackwell platform

Cluster technologies

Kubernetes, k3s, Ray and Slurm, each covering what it is, why and when to use it, and how to use, develop, scale, serve, and fine-tune with it.

Orchestration overview

Training & post-training

FSDP, DDP, ZeRO, tensor and pipeline parallelism, DiLoCo; SFT/LoRA, DPO and GRPO; verl, slime, SkyRL and more.

Fine-tuning & RL · RL libraries

Inference serving

Serving the latest open-weight models (Kimi K2, GLM, DeepSeek, Qwen), continuous batching, KV cache, and disaggregated prefill/decode.

Inference serving · Disaggregation
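As a rough illustration of why the KV cache features so heavily in serving capacity planning, the sketch below estimates its size for a model using standard multi-head or grouped-query attention. The model shape is hypothetical and not taken from any model named above, and architectures that compress the cache (for example latent-attention variants) will not follow this formula.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_element=2):
    """Estimate KV cache size in bytes for standard MHA/GQA attention.

    Assumes one key and one value vector per KV head, per layer, per token,
    stored at bytes_per_element (2 for FP16/BF16).
    """
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * num_tokens


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=4096)
print(f"{size / 2**20:.0f} MiB")  # prints "512 MiB"
```

The estimate is per sequence, so the total grows with the number of concurrent sequences; this is one reason batching policy and cache management are treated together.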

Operate & optimise

Observability, RAS and XID failure modes, NCCL and hardware tuning, SLOs/SLIs, and error-budget alerting.

Observability · Reliability & RAS
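To make "error-budget alerting" concrete, here is a minimal sketch of a burn-rate check, assuming a ratio-based SLI (bad requests divided by total requests). The function names are hypothetical, and the default threshold of 14.4 is a commonly cited value for fast-burn paging against a 30-day window, used here as an assumption rather than a recommendation from this knowledge base.

```python
def burn_rate(error_ratio, slo_target):
    """Return how fast the error budget is being consumed.

    A value of 1.0 means errors arrive at exactly the rate the SLO allows.
    """
    budget = 1.0 - slo_target  # allowed fraction of bad events
    return error_ratio / budget


def should_page(short_window_ratio, long_window_ratio, slo_target, threshold=14.4):
    """Multi-window rule: page only if both windows exceed the threshold.

    The long window shows the burn is sustained; the short window shows
    it is still happening now.
    """
    return (
        burn_rate(short_window_ratio, slo_target) > threshold
        and burn_rate(long_window_ratio, slo_target) > threshold
    )


# Hypothetical readings against a 99.9% SLO.
print(should_page(0.02, 0.02, slo_target=0.999))   # both windows burning fast
print(should_page(0.02, 0.001, slo_target=0.999))  # long window is on budget
```

Requiring both windows trades a little detection latency for fewer pages on short spikes that have already ended.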

Recipes & runbooks

Ansible playbooks, Helm/Kubernetes manifests, telemetry stacks, and step-by-step operational runbooks for the recurring incidents.

Recipes & manifests · Runbooks

How to use this knowledge base

Concept pages explain a topic and its traps: overview, core knowledge, a don't-miss checklist, failure modes, and references.

Recipe and runbook pages are example-first: copy-paste manifests, playbooks, and step-by-step procedures with the commands to apply and verify them.

Per-technology pages give each cluster technology, training algorithm, RL library, and runbook its own page following a fixed shape: what it is, why and when to use it, how to use, develop, scale, serve for inference, fine-tune, and run on optimised hardware, plus a cookbook of common use cases.

Suggested starting points

New here? Read the knowledge base index for the full map and reading paths.

Standing up a cluster? Ansible bring-up → Kubernetes & Helm platform → telemetry.

Serving a model? Serve open-weight models → SLO/SLI catalog.

Fine-tuning? SFT & LoRA → GRPO → RL libraries.

References

NVIDIA Blackwell architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/

NVIDIA DGX SuperPOD reference architecture: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html

Kubernetes documentation: https://kubernetes.io/docs/home/

PyTorch distributed overview: https://pytorch.org/docs/stable/distributed.html

vLLM documentation: https://docs.vllm.ai/en/latest/

Related: Start here · Glossary · GPU generations · Operational runbooks