2026-07-03 03:48 UTC · In-site rewrite · 6 min read · Updated: 2026-07-03 04:37 UTC

DGX station and "frontier" models, my hunt for answers

An investigation into NVIDIA's DGX Station reveals its true capabilities for running large AI models locally, including memory architecture details, real-world benchmarks, and community skepticism about its 748GB coherent memory claim.

Source: Hacker News AI · Author: connorturland

DGX Station profile: the local AI box everyone wants benchmarked

Connor Turland · July 02, 2026

NVIDIA says DGX Station can support models up to 1 trillion parameters.

That is the sentence that makes everyone stop scrolling.

The careful version is harder. Can you put a 1T-class model somewhere in the 748GB coherent memory pool? Sometimes, depending on the model, quant, cache, and runtime. Can you run the full weights of a frontier open model as if the whole box were 748GB of HBM? No. That would be a very expensive mistake.

So we went looking for the real answer.

We have been trying to understand the actual machine underneath the pitch. We talked to Cornell researchers who had temporary access. We talked to NVIDIA. We inquired about buying one and were quoted about $100,000 per machine. We searched the NVIDIA forums, Reddit, X, LocalMaxxing, model cards, and conference-floor posts. We asked people at the AI Engineer local table for GLM-5.2 numbers because public numbers were thin.

This is a profile of DGX Station from that reporting. It is not a review. We do not have a full benchmark suite yet, but we do have some useful numbers that point the way.

The local AI interest itself is part of the story. This is not just a few people trying to make old GPUs do one more trick. At AI Engineer, the Local AI room was full of people who knew exactly why the memory split mattered and why the public numbers were still not enough.

Photo credit: Ahmad Osman. Original X post: x.com/TheAhmadOsman/status/2072789682254180776.

The question is simple: does DGX Station let you buy a lot of local model capacity without giving up the memory behavior that makes GPU inference fast?

The machine

NVIDIA's DGX Station page says the machine is powered by GB300 Grace Blackwell Ultra, has 748GB of coherent memory, and supports models up to 1 trillion parameters.

It does not give you 748GB of HBM.

The split that matters is:

| Memory tier | Capacity | Bandwidth | Why it matters |
| --- | --- | --- | --- |
| HBM3e | 252GB | 7.1TB/s | fast GPU memory tier |
| LPDDR5X | 496GB | 396GB/s | larger CPU-side memory tier |
| Total coherent memory | 748GB | mixed | addressable pool, not all GPU-speed memory |
| NVLink-C2C | n/a | 900GB/s listed | CPU-GPU link, with real workload caveats |

That is the entire debate in one table. The capacity number is real. The memory split is also real.

If your model and KV cache stay inside HBM, the machine is easy to understand. If the active workload crosses into LPDDR5X, the question becomes empirical: how much do prefill, decode, long-context behavior, and concurrency change?
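To make the "stays inside HBM" condition concrete, here is a minimal sketch that splits a workload across the two tiers using the capacities from the table above. The HBM-first fill policy and the `place_workload` helper are our own simplifying assumptions, not how any particular runtime places memory, and the sketch ignores runtime overhead entirely.

```python
# Tier capacities from the spec table, in GB.
HBM_GB = 252
LPDDR_GB = 496


def place_workload(weights_gb: float, kv_cache_gb: float = 0.0) -> dict:
    """Split a workload across the two tiers, filling HBM first.

    Assumption: a naive HBM-first fill with no runtime overhead. Real
    runtimes decide what lives where, so treat this as a rough bound.
    """
    total = weights_gb + kv_cache_gb
    if total <= 0:
        raise ValueError("workload size must be positive")
    if total > HBM_GB + LPDDR_GB:
        raise ValueError(f"{total}GB exceeds the {HBM_GB + LPDDR_GB}GB coherent pool")
    in_hbm = min(total, HBM_GB)
    return {
        "hbm_gb": in_hbm,
        "lpddr_gb": total - in_hbm,
        "hbm_fraction": in_hbm / total,
    }


# A hypothetical 200GB model with a 20GB KV cache stays in the fast tier.
print(place_workload(200, 20))
# A 467GB artifact cannot: 215GB of it has to live in LPDDR5X.
print(place_workload(467))
```

Under this naive policy, anything past 252GB lands in the slower tier, which is the point where spec sheets stop predicting behavior.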

There is another asterisk on the link between the CPU and GPU. Stas Bekman measured NVLink-C2C on DGX Station and reported that it was not behaving like 900GB/s full duplex end-to-end in his bidirectional test. His point was not that DGX Station is useless. His point was that the marketing number and the workload behavior are not the same thing.

That is why we care about actual runs more than spec sheets.

The $100k question

The DGX Station price we got was about $100,000 per machine.

At that price, the machine is not competing with a hobby rig. It is competing with other ways to answer the same three questions: what can we run, how does it feel, and how much does it cost? The GLM-5.2-specific version of that comparison is in the GLM 5.2 local hardware requirements post.

For DGX Station, the competing buckets are:

| Alternative | What it buys |
| --- | --- |
| multi-GPU RTX PRO 6000 rigs | more conventional VRAM capacity, more assembly and operations work |
| cloud inference | no hardware ownership, but recurring per-token bills |
| Mac Studio-style unified memory | lots of addressable memory, much less GPU-style memory bandwidth |
| DGX Spark clusters | cheaper nodes, but different memory and interconnect tradeoffs |

The buying question is not simply "can I fit a big model?" A lot of machines can fit a heavily quantized big model somewhere in memory.

The buying question is: can we serve useful local frontier-ish workloads with enough speed, context, and concurrency to justify a six-figure workstation?
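One rough way to frame the cost side of that question is to amortize the purchase price over the tokens the machine could produce. The sketch below is a back-of-the-envelope calculation, not a pricing model: the amortization period, the utilization, and the `hardware_cost_per_million_tokens` helper are all our assumptions, and it ignores power, hosting, staff time, and resale value.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def hardware_cost_per_million_tokens(
    price_usd: float, tok_per_s: float, years: float, utilization: float
) -> float:
    """Hardware-only cost per million output tokens.

    Assumptions: straight-line amortization over `years`, the machine
    generating at `tok_per_s` for `utilization` of that time, and no
    power, hosting, staffing, or resale value.
    """
    if price_usd <= 0 or tok_per_s <= 0 or years <= 0:
        raise ValueError("price, throughput, and years must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    total_tokens = tok_per_s * SECONDS_PER_YEAR * years * utilization
    return price_usd / total_tokens * 1_000_000


# $100,000 machine, 60 tok/s, three years, running flat out (all assumptions).
print(round(hardware_cost_per_million_tokens(100_000, 60, 3, 1.0), 2))
# The same machine busy a quarter of the time costs four times as much per token.
print(round(hardware_cost_per_million_tokens(100_000, 60, 3, 0.25), 2))
```

With those inputs the hardware alone works out to roughly $17.6 per million tokens at full utilization. Aggregate throughput under concurrency would pull that number down, and idle time pushes it up, which is why concurrency numbers matter as much as single-stream speed.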

Why people are skeptical

The local AI crowd is not confused about the headline number. The skepticism is sharper than that.

An r/LocalLLaMA thread about 4x-8x RTX PRO 6000 systems gets at the objection, especially in one of its comment threads. Someone asks why not buy a DGX Station instead of 4-8 RTX PRO 6000s. The reply is basically: DGX Station does not actually have 748GB of VRAM. It has a 252GB HBM tier plus a larger LPDDR5X tier.

The NVIDIA Developer Forums have the same question in more formal language: can a 1T-class model really be served well on this memory layout, and has anyone seen real tokens/sec benchmarks for the GB300 DGX Station when context or weights move past HBM3e?

That is the right standard. Not "does the model load?" The standard is "what happens to prefill, decode, context, and concurrency when the workload crosses the fast memory tier?"
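For the decode part of that standard, a simple bandwidth ceiling shows why the tier boundary matters so much. This is a sketch under loud assumptions: decode is treated as purely memory-bandwidth-bound, the bytes read per token are split between tiers by a fixed fraction, the two tiers are read one after the other with no overlap, and the 20GB-per-token figure is a hypothetical input rather than a property of any real model. The bandwidths are the listed spec numbers, not measured ones.

```python
HBM_BW_GB_S = 7100.0   # 7.1TB/s, listed spec
LPDDR_BW_GB_S = 396.0  # 396GB/s, listed spec


def decode_ceiling_tok_s(active_gb_per_token: float, hbm_fraction: float) -> float:
    """Upper bound on single-stream decode speed, in tokens/sec.

    Assumptions (not measurements): decode is bandwidth-bound, each
    token reads `active_gb_per_token` of weights, `hbm_fraction` of
    those bytes sit in HBM, and the tiers are read without overlap.
    """
    if active_gb_per_token <= 0:
        raise ValueError("active_gb_per_token must be positive")
    if not 0.0 <= hbm_fraction <= 1.0:
        raise ValueError("hbm_fraction must be between 0 and 1")
    hbm_seconds = active_gb_per_token * hbm_fraction / HBM_BW_GB_S
    lpddr_seconds = active_gb_per_token * (1.0 - hbm_fraction) / LPDDR_BW_GB_S
    return 1.0 / (hbm_seconds + lpddr_seconds)


# Hypothetical model that touches 20GB of weights per generated token.
for fraction in (1.0, 0.9, 0.5, 0.0):
    print(fraction, round(decode_ceiling_tok_s(20, fraction), 1))
```

In this toy model the ceiling is 355 tok/s with everything in HBM and 19.8 tok/s with everything in LPDDR5X, and even a small share of per-token reads from the slow tier dominates the total. Real runs can land well below or, with smart placement of hot experts, behave better than a fixed split suggests, which is exactly why measured numbers are needed.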

Who has touched one

DGX Station access is still scarce enough that the list of people with credible hands-on signal matters.

From our reporting, the names and groups we kept seeing were: Cornell's group, temporarily; Stas Bekman and Jeff Rasley around Snowflake work; Alex Cheema at Exo with Ahmad Osman and the local AI table around AI Engineer; Andrej Karpathy; and Matthew Berman.

That does not mean all of them published comparable benchmarks. It means the public evidence is still coming from a small circle.

These are the public photos we can point at: lab machines, deliveries, show-floor hardware, Exo's local AI setup, and Snowflake's interior shot.

The man, the legend, Andrej. NVIDIA AI Developer described Karpathy's unit as the first DGX Station GB300, a Dell Pro Max with GB300, personally signed by Jensen Huang. Photo credit: NVIDIA AI Developer. Original X post.

First DGX Station online. Nader Khalil posted this around the first DGX Station coming online. Photo credit: Nader Khalil. Original X post.

Matthew Berman delivery. NVIDIA hand-delivered a pre-production Dell Pro Max with GB300; Berman described it as a 100 lb system with 750GB+ unified memory. Video thumbnail credit: Matthew Berman. Original X post.

Snowflake's Dell Pro Max with GB300. Jeff Rasley's photo shows the interior of the machine Snowflake had delivered. Photo credit: Jeff Rasley. Original X post.

Exo local AI hardware. Alex Cheema's post lists 16 DGX Spark, 3 RTX Spark, 1 DGX Station, ConnectX-7 cables, and two high-speed switches. Photo credit: Alex Cheema. Original X post.

AI Engineer local AI table. This is the table where the GLM-5.2 REAP DGX Station number surfaced. Photo credit: Chris Alexiuk. Original X post.

The most useful academic writeup we found came from Kilian Weinberger's group at Cornell, which had early remote access to a DGX Station in February. Three researchers used it on three different workloads: RL fine-tuning, diffusion language-model retrieval, and synthetic data generation.

That article is worth reading because it is not a launch slide. The RL work used the coherent memory to move from constrained Qwen3-4B runs toward cleaner 4B-7B experiments. The diffusion-LM retrieval work used the local GB300 box to iterate without cluster queues, doubling training batch size and cutting epoch time by about 20% versus their previous shared-cluster setup. The synthetic-data work ran Qwen3-30B-A3B-Instruct in BF16 with vLLM and reported 5.7x the throughput of a single A100 and 2.6x the throughput of a 4xA100 setup.

Stas Bekman and Snowflake published another useful example: post-training Qwen3-32B at 136K sequence length on one DGX Station. That run used ArcticTraining, TiledMLP, Liger-Kernel, BF16 optimizer offload in DeepSpeed, and CPU memory to make a long-sequence SFT recipe fit on a single B300. The writeup also notes the setup work that mattered: stable memory behavior, avoiding fragmentation, and switching the machine into the right coherent-memory mode so CPU memory was available for optimizer-state offload instead of fighting the GPU.

That is useful signal. It is not the same question as serving a 325GB or 500GB-class local frontier artifact on a machine with 252GB of HBM. Cornell and Snowflake showed the machine helping real training and research workloads. They did not settle the local frontier-inference question. We talked to them about the big questions we had, and none of them knew the answer.

So... can it run big models?

This is the actual buyer question.

NVIDIA says DGX Station supports models up to 1 trillion parameters. We believe the capacity claim, with all the caveats above. We do not think a buyer should read that as "load the full FP8 weights of any 700B-1T frontier open model and expect it to behave like it all lives in HBM."
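The arithmetic behind that caution is short. The sketch below computes a weights-only footprint from parameter count and precision and compares it with the two capacities NVIDIA lists. The bytes-per-parameter values are the nominal ones for each format, and the `fit_report` helper is ours. It ignores quantization scales, KV cache, activations, and runtime overhead, so real artifacts will be somewhat larger.

```python
COHERENT_GB = 748  # total coherent memory, listed spec
HBM_GB = 252       # HBM3e tier, listed spec

# Nominal bytes per parameter; real quantized files add scale metadata.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "4-bit": 0.5}


def fit_report(params_billion: float, precision: str) -> dict:
    """Weights-only footprint in GB (1GB = 1e9 bytes) and where it fits."""
    size_gb = params_billion * BYTES_PER_PARAM[precision]
    return {
        "precision": precision,
        "weights_gb": size_gb,
        "fits_coherent_pool": size_gb <= COHERENT_GB,
        "fits_hbm": size_gb <= HBM_GB,
    }


# A 1-trillion-parameter model at each precision.
for precision in BYTES_PER_PARAM:
    print(fit_report(1000, precision))
```

At 1 trillion parameters, BF16 needs about 2,000GB and FP8 about 1,000GB, so neither fits in the 748GB pool. A 4-bit version is about 500GB: it fits in the coherent pool, but roughly half of it sits outside the 252GB HBM tier.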

The best public GLM-5.2 clue we have is the AI Engineer booth run. We had been asking for any real GLM-5.2 number on DGX Station: not a launch slide, not a capacity claim, a number from people standing next to the machine.

On X, we asked the AI Engineer local table whether anyone could post metrics for the GLM-5.2 4-bit quant running on DGX Station. A couple of hours later, 0xSero posted: GLM-5.2-REAP on DGX Station, drumroll...

60 tokens/second

Rick Blalock posted a short X video of that booth demo. You can get a visual sense of the generation speed.


Video credit: Rick Blalock (@rblalock). Original X post: x.com/rblalock/status/2072786938147586503.

For cloud context, OpenRouter's Claude Opus 4.8 page shows 64 tok/s as the best throughput across providers, and about 55 tok/s for Anthropic in its one-week provider breakdown.

Alec Fong later added that their first stab at running GLM with NVFP4 was about 25 tok/s, before Alex Cheema and 0xSero helped push the run to 60 tok/s. The evidence trail points toward Luke Alonso's GLM-5.2-NVFP4: Ahmad had posted that Luke uploaded a 467GB NVFP4 artifact, then replied to Sentdex that he was working on getting it running. Alec's post does not name the exact artifact, so we are treating that as strong context, not a confirmed model ID.

The caveat is the important part. The booth run was not the full GLM-5.2 FP8/BF16 model. The model selector appears to show GLM-5.2 REAP 504B, which points to the 0xSero GLM-5.2 REAP 504B GGUF family, not the full GLM-5.2 model. REAP is expert pruning for MoE models, and 0xSero's own Terminal-Bench 2.1 number is lower than the full-model number we have seen.

That is not a scandal. It is the point. The model that ran locally was shaped to run locally.

Here are the big-model numbers we have been able to get:

| Model / workload | Number | What we know |
| --- | --- | --- |
| Kimi 2.5, 1.1T | 40-50 tok/s total output across all users | NVIDIA rep number; about 595GB model weights; we still need benchmark conditions |
| Nemotron Ultra, 550B | about 35 tok/s at concurrency 1; scales to 4-5 concurrent users | NVIDIA rep number; useful because it includes a concurrency claim |
| GLM-5.2-REAP 504B | about 60 tok/s | public 0xSero number from AI Engineer; Alec Fong says an earlier GLM NVFP4 attempt was about 25 tok/s; still missing exact quant, prefill, context, memory residency, and concurrency |

These numbers deserve credit because they were hard to come by. They also have different measurement conditions, so the table should not be read as a leaderboard.

For the GLM-5.2 model-sizing side, see the GLM 5.2 local hardware requirements. This profile is about DGX Station as a machine.

Configurations, costs, who ships, and when

NVIDIA lists DGX Station systems from ASUS, Dell, Exxact, Gigabyte, HP, MSI, and Supermicro on the official DGX Station page, and its marketplace links out to buying options for GB300 systems.

The lineup is already messy in the way workstation launches are messy: some pages are configure-and-quote, some are sales conversations, some are regional distributors, and some public photos are special hand-delivered early systems rather than normal orders.

A LocalLLaMA user put the GB300 OEM syst
