2026-06-29 01:23 UTC · In-site rewrite · 6 min read · Updated: 2026-06-29 04:22 UTC

Sophon PFG-1: a monolithic-3D AI ASIC with 330 GB of on-die DRAM and no HBM

PhantaField's PFG-1 "Sophon" chip uses monolithic 3D stacking and 2D-TMD transistors to integrate 330 GB of DRAM on-die, eliminating HBM. The whitepaper quotes 2,100 TFLOPS BF16 and 4,200 TFLOPS FP8, and claims roughly 174× the tokens per watt of an NVIDIA Rubin at low batch, on a single die intended for both training and inference.

Source: Hacker News AI · Author: minkowsky


PhantaField PFG-1 Sophon Whitepaper

Revision 4.1 · June 2026

Executive Summary

PFG-1 "Sophon" is a unified training-and-inference die on a 750 mm², 32-tier 2D Transition-Metal Dichalcogenide (TMD) Monolithic 3D (M3D) platform. Weights, gradients, and optimizer state reside in on-die 2T0C 2D-TMD gain-cell DRAM; because the array is fully read-write, the same silicon executes BF16 forward/backward training passes and serves low-batch decode at the compute-bound rate.

Compute is pure digital Compute-In-Memory (CIM): each 256×256 DRAM subarray tile pairs a binary sense amplifier with an 8-level adder tree, driven by a 500 MHz bit-serial activation broadcast. At 131,072 tiles/die this yields 4,200 TFLOPS FP8 and 2,100 TFLOPS BF16 in a 7.5 cm² footprint.

The die comprises a 28 nm Si Complementary Metal-Oxide-Semiconductor (CMOS) base tier, a 32-tier 2D-TMD CMOS MAC stack, and a Monolithic Inter-tier Via (MIV) fabric [5][6][7], with the 2T0C DRAM module embedded at the Back-End-Of-Line (BEOL) Metal-3 layer of each memory tier. Figure 1 shows the die stack cross-section.

| Parameter | PFG-1 "Sophon" |
| --- | --- |
| Memory | 2T0C 2D-TMD gain-cell DRAM |
| Compute paradigm | Pure digital CIM (sense amp + adder tree) |
| Target workload | Training (fwd + bwd + optimizer) and inference (decode + prefill) |
| Capacity | 330 GB |
| Compute | 2,100 TFLOPS BF16 (4,200 TFLOPS FP8 inference mode / 8,400 TOPS INT8) |
| Energy / MAC | 0.620 pJ (BF16 fwd) / 0.940 pJ (fwd + bwd) / 0.310 pJ (FP8 inference) |
| Peak efficiency | 3.72 TFLOPS/W (BF16 training avg.) |
| Tokens per watt | 38.7 tokens/s per W (80B FP8 decode, 373 W); ~174× an NVIDIA Rubin (R200) or AMD Instinct MI455X at low batch (~0.22 tokens/s per W, HBM4-bound) |
| Active power | ≈ 379 W fwd / ≈ 749 W bwd (~564 W training avg.); 373 W FP8 decode |
| 80B model perf. | 2,406 tokens/s training, 0.23 J/tok; 7,219 tokens/s BF16 decode (14,438 tokens/s FP8 mode), 25.8 mJ/tok |
| 80B + INT4 + speculative (FP8 mode) | 72,188 tokens/s effective |
| BOM | $8,358 |
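Several rows of the specification table are ratios of other rows, so they can be cross-checked with simple arithmetic. The sketch below is only a consistency check on the quoted numbers, not the whitepaper's own derivation; the constant and function names are illustrative.

```python
# Headline figures copied from the specification table.
FP8_DECODE_TOK_S = 14_438   # 80B FP8 decode throughput, tokens/s
FP8_DECODE_W = 373          # FP8 decode power, W
TRAIN_TOK_S = 2_406         # 80B training throughput, tokens/s
FWD_W, BWD_W = 379, 749     # active power in forward / backward pass, W
BF16_TFLOPS = 2_100         # peak BF16 compute, TFLOPS
GPU_TOK_S_PER_W = 0.22      # quoted low-batch GPU figure (rounded)


def derived_metrics():
    """Recompute the ratio-type rows from the raw rows."""
    train_avg_w = (FWD_W + BWD_W) / 2            # simple fwd/bwd average
    tok_per_w = FP8_DECODE_TOK_S / FP8_DECODE_W
    return {
        "train_avg_w": train_avg_w,
        "tok_per_w": tok_per_w,
        "decode_mj_per_tok": 1e3 * FP8_DECODE_W / FP8_DECODE_TOK_S,
        "train_j_per_tok": train_avg_w / TRAIN_TOK_S,
        "tflops_per_w": BF16_TFLOPS / train_avg_w,
        "ratio_vs_gpu": tok_per_w / GPU_TOK_S_PER_W,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.3f}")
```

The recomputed values agree with the table to its stated precision (564 W, 38.7 tokens/s per W, 25.8 mJ/tok, 0.23 J/tok, 3.72 TFLOPS/W). With the rounded 0.22 tokens/s per W GPU figure the ratio comes out near 176×, close to the quoted ~174×.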

Sophon eliminates off-die High-Bandwidth Memory (HBM) entirely. For 80B-parameter BF16 training, weights and first-order optimizer state fit fully on-die, leaving ~10 GB of activation headroom for gradient-checkpointed micro-batches. For inference, it serves an 80B model at 7,219 tokens/s in native BF16 or 14,438 tokens/s in FP8 mode. It is therefore a single train-then-serve part that can be elastically repartitioned between training and serving without changing hardware. Against an NVIDIA Rubin (R200) and an AMD Instinct MI455X, both 2026 HBM4 parts, Sophon delivers ~2.7–3.1× higher 80B batch-1 training throughput per die and ~48–53× higher single-stream FP8 decode throughput, because at low batch both GPUs are bound by their HBM4 bandwidth (Rubin 22 TB/s, MI455X 19.6 TB/s). Peak dense FLOPS favor the GPUs (Sophon's dense BF16 figure is only ~0.21–0.24× their peak), but peak FLOPS do not help at low batch, where weight-memory bandwidth governs.

The architecture delivers ~ 191–214× the weight bandwidth of an HBM4 package (191× vs Rubin, 214× vs MI455X) — a gap no HBM roadmap closes (Section 7).
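The two bandwidth ratios can be reproduced from numbers quoted elsewhere in the document: the 4.2 PB/s in-tile weight bandwidth given in the floorplan section and the HBM4 bandwidths given for each GPU. This is a minimal sketch of that division; the dictionary layout and function name are illustrative.

```python
# Bandwidth figures quoted in the whitepaper, in TB/s.
ON_DIE_WEIGHT_BW_TBS = 4_200.0      # 4.2 PB/s in-tile weight bandwidth
HBM4_BW_TBS = {
    "Rubin (R200)": 22.0,
    "MI455X": 19.6,
}


def bandwidth_ratios():
    """On-die weight bandwidth divided by each GPU's HBM4 bandwidth."""
    return {gpu: ON_DIE_WEIGHT_BW_TBS / bw for gpu, bw in HBM4_BW_TBS.items()}


if __name__ == "__main__":
    for gpu, ratio in bandwidth_ratios().items():
        print(f"{gpu}: {ratio:.0f}x")
```

The results round to 191× for Rubin and 214× for MI455X, matching the quoted range.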

The economics follow directly. Morgan Stanley estimates a single NVIDIA VR200 (Rubin) NVL72 rack at ≈ $7.8M, of which HBM alone accounts for ≈ $2.0M (25.7% of the rack, +435% over GB300). Sophon eliminates that line item, giving a ~9.9× lower hardware BOM than a Rubin and ~11.6× lower than an MI455X [17].

Table of Contents

1. Introduction & Motivation

2. Architecture Overview

A. Platform (die, tiers, MIV, TMD MAC)

B. PFG-1 "Sophon": 2T0C DRAM die

C. Die floorplan & on-die system organization

3. Physical Calculations

A. Cell geometry & per-tier density

B. Bandwidth model

C. Per-MAC energy & power envelope

D. Digital CIM tile physics & 1/N scaling

4. SPICE Simulation

5. GPU Architecture & AI Performance

A. Inference

B. Training

C. System view

6. Thermal Analysis

7. Scaling Roadmap

8. Energy-Constrained Ceiling on Model Size

Inference (serving) ceiling

Training ceiling

9. Economic Analysis

10. Radiation Tolerance for Space Applications

11. Validation, Risks & Future Work

12. References

13. Equations Appendix

Introduction & Motivation

Modern AI accelerators face a memory wall on both workloads they must serve:

Inference is read-dominated. The model weights are fixed at deployment, and every decode step reads the full weight tensor once per generated token. The key metrics are read energy per bit, idle leakage (the model must stay resident between requests), and weight-fetch bandwidth at low batch. A conventional High-Bandwidth Memory (HBM) system is bandwidth-bound at low batch: every token's weight-fetch traffic serializes through the ~22 TB/s (Rubin) / 19.6 TB/s (MI455X) HBM4 path, and a 288–432 GB HBM4 subsystem draws ~10–15 W in self-refresh just to keep the model resident.

Training is read-write symmetric. Every forward pass reads weights; every backward pass writes gradient updates; the optimizer updates weights in place each step. In-place writability, low write energy, and capacity for both weights and optimizer state are critical. A non-volatile inference-only memory cannot train — for example, Single-Level Cell (SLC) Resistive RAM endurance caps at ~10⁶ cycles, while training an 80B model requires ~10¹⁰ write cycles per parameter.

A 2T0C 2D-TMD gain-cell DRAM solves both problems with one cell. It exploits the anomalously low off-current density (Joff ≈ 10⁻¹⁵ A/µm = 1 fA/µm at 28 nm, i.e. ≈ 0.5 fA per cell) of TMD transistors to obtain multi-second retention without an explicit storage capacitor, enabling in-place gradient writes at 20 fJ/bit with unlimited write endurance and a refresh overhead of only ≈ 0.08 W. Because the storage node is writable on every cycle, the same die that serves inference can also train; because retention is seconds-long, idle power collapses to ~ 3 W — an inference-grade idle profile on a fully writable training die.

PhantaField's 2D-TMD M3D platform integrates this DRAM module at the BEOL Metal-3 layer of each memory tier, directly above the logic tier whose MAC array consumes its weights.

Architecture Overview

A. Platform

Sophon uses the following physical stack:

| Tier(s) | Function | Process |
| --- | --- | --- |
| Base (Si) | Controller, NoC root, host I/O, PCIe/NVLink PHY | 28 nm bulk Si CMOS |
| Tiers 1 – 32 | Interleaved 2D-TMD stack: 32 logic tiers (MAC array, 750 mm² each) alternating with 32 memory tiers (2T0C DRAM bank, 750 mm² each), forming 32 logic-plus-memory doublets | BEOL 2D-TMD (MoS₂ n-FET / WSe₂ p-FET) on odd tiers + DRAM module on even tiers |
| Lid | Cu / CVD-diamond heat spreader | Optional; enables two-side cooling |

Total stack height: ~22 µm above the Si die (64 tiers × 0.35 µm/tier = 22.4 µm). The 90 nm-pitch MIV grid provides 1.23 × 10⁸ available inter-tier connection slots per mm²; the design populates only ~5.5 × 10⁵/mm² (about 0.45%), leaving > 99% MIV headroom.
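The stack height and MIV headroom follow from the tier count, tier pitch, and via pitch. The sketch below reproduces that arithmetic, assuming a square via grid at the quoted 90 nm pitch; the names are illustrative.

```python
TIERS = 64                  # 32 logic + 32 memory tiers
TIER_PITCH_UM = 0.35        # height per tier, micrometres
MIV_PITCH_NM = 90           # via grid pitch, nanometres (square grid assumed)
POPULATED_PER_MM2 = 5.5e5   # vias actually used per mm^2


def stack_geometry():
    """Return (stack height in um, MIV slots per mm^2, fraction populated)."""
    height_um = TIERS * TIER_PITCH_UM
    slots_per_side = 1e6 / MIV_PITCH_NM      # 1 mm = 1e6 nm
    slots_per_mm2 = slots_per_side ** 2
    populated_fraction = POPULATED_PER_MM2 / slots_per_mm2
    return height_um, slots_per_mm2, populated_fraction


if __name__ == "__main__":
    height, slots, frac = stack_geometry()
    print(f"height: {height:.1f} um")
    print(f"slots per mm^2: {slots:.3g}")
    print(f"populated: {100 * frac:.2f}%  headroom: {100 * (1 - frac):.2f}%")
```

This gives a 22.4 µm stack, about 1.23 × 10⁸ slots/mm², and roughly 0.45% of slots populated, consistent with the stated > 99% headroom.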

Tiers are not split within a single layer; instead the 64-tier stack interleaves dedicated logic and memory tiers in an A/B/A/B… repeating pattern. Two adjacent tiers form one logic-plus-memory doublet; the stack contains 32 such doublets:

Logic tiers (32 × 750 mm² = 24,000 mm² total MAC area): 2D-TMD CMOS MAC array on odd-indexed tiers — MoS₂ n-FETs for NMOS, WSe₂ p-FETs for PMOS. Density 0.175 TFLOPS FP8/mm² (0.0875 TFLOPS BF16/mm²). Clocked at 1.2 GHz, Vdd = 0.6 V.

Memory tiers (32 × 750 mm² = 24,000 mm² total memory area): 2T0C 2D-TMD DRAM on even-indexed tiers, fabricated at the Metal-3 BEOL of that tier. Each memory tier sits directly above its paired logic tier; vertical Monolithic Inter-tier Vias (MIVs) on a sub-100 nm pitch carry bit-line/word-line/sense signals straight up from the logic MAC array into the cells, giving every MAC its own private vertical port to local weights with zero NoC traffic. This interleaved arrangement preserves the same total area and capacity as a hypothetical in-tier 50/50 split, while doubling the per-tier MAC routing area and shortening MAC-to-cell signal paths to a single tier-pitch of 0.35 µm.

Why 2D TMD? TMD CMOS (MoS₂ / WSe₂) is the only transistor technology that simultaneously offers: (1) BEOL-compatible growth at ≤ 450 °C [6]; (2) atomic-scale channel thickness eliminating short-channel leakage [1][2]; (3) electron mobility ≥ 120 cm²/V·s [4]; and (4) intrinsic radiation hardness (no buried-oxide trap volume). Critically, the TMD off-current density Joff ≈ 10⁻¹⁵ A/µm (1 fA/µm) at 28 nm — i.e. ≈ 0.5 fA for a 0.5 µm-wide cell transistor, roughly 4 orders of magnitude lower than Si NMOS at equivalent gate length [2][3] — is what enables a 2T0C cell to retain data for seconds without any storage capacitor [8][9], keeping the cell area at 8 F² rather than the ~20 F² needed for a conventional 1T1C DRAM.

B. PFG-1 "Sophon" — 2T0C DRAM die

Sophon places a 2T0C 2D-TMD gain-cell DRAM (8 F², 1 bit/cell) at the Metal-3 BEOL of each memory tier. The cell structure is shown in Figure 2 and consists of:

Write Transistor (WT): a TMD nFET gated by the Write Word-Line (WWL), which charges the storage node to Vdd or discharges it to GND.

Read Transistor (RT): a TMD nFET whose gate is the storage node; its drain current indicates the stored bit.

Storage node: the parasitic gate capacitance of RT (~2.5 fF at 28 nm TMD) plus the junction capacitance of WT's drain (~0.5 fF). No explicit Metal-Insulator-Metal (MIM) or trench capacitor — that is the "0C" in 2T0C.

The TMD off-current density of 1 fA/µm (Ioff ≈ 0.5 fA for a 0.5 µm cell transistor) gives retention τ = C·Vdd / (2·Ioff) = 1.8 s at 25 °C [8][9] — see Eq. 3 and Figure 3 for the retention curve. Sophon refreshes every 1.0 s (1.8× margin), consuming only ≈ 0.08 W for the full 330 GB die (Eq. 4). Retention derates ≈ 2× per 10 °C; above 60 °C junction temperature, on-die thermal sensors shorten the refresh interval (≈ 159 ms at 60 °C, ≈ 28 ms at 85 °C), with refresh power staying below ~ 4 W even in the hot corner.
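The retention figure can be reproduced from the cell parameters quoted above. The sketch below evaluates τ = C·Vdd / (2·Ioff) and then applies the stated derating of roughly 2× per 10 °C. Two assumptions are mine rather than the whitepaper's: the storage node is taken to swing to the 0.6 V logic Vdd, and the derating is modelled as an exact halving every 10 °C above 25 °C.

```python
C_GATE_F = 2.5e-15        # RT gate capacitance, farads
C_JUNCTION_F = 0.5e-15    # WT drain junction capacitance, farads
VDD_V = 0.6               # ASSUMPTION: storage node uses the logic-tier Vdd
J_OFF_A_PER_UM = 1e-15    # TMD off-current density, A per um of width
CELL_WIDTH_UM = 0.5       # cell transistor width, um


def retention_s(temp_c=25.0):
    """Retention time in seconds at the given junction temperature."""
    i_off = J_OFF_A_PER_UM * CELL_WIDTH_UM                  # 0.5 fA
    tau_25 = (C_GATE_F + C_JUNCTION_F) * VDD_V / (2 * i_off)
    # ASSUMPTION: retention halves for every 10 degC above 25 degC.
    return tau_25 / 2 ** ((temp_c - 25.0) / 10.0)


if __name__ == "__main__":
    for t in (25, 60, 85):
        print(f"{t} degC: {retention_s(t) * 1e3:.0f} ms")
```

Under these assumptions the model returns about 1.8 s at 25 °C, 159 ms at 60 °C, and 28 ms at 85 °C, which line up with the retention and hot-corner refresh-interval figures quoted in the text.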

Because the storage node is writable on every cycle, Sophon supports in-place BF16 gradient accumulation with unlimited endurance — exactly what training requires — while the same array, read-only, serves the inference decode loop. The die loads a model once and either serves it (inference) or updates it in place (training); a powered-off die reloads its weights from off-die Non-Volatile Memory express (NVMe) at boot (§11.2).

C. Die Floorplan & On-Die System Organization

The 131,072 CIM tiles are not a flat array — they are partitioned across the 32 logic tiers of the stack (§2.A), exactly 4,096 tiles per logic tier (derived: 131,072 ÷ 32). Each tile occupies a fixed cell on its tier and is the atomic unit of compute, storage, and redundancy: a 256×256 weight subarray (65,536 weights) feeding a binary sense amp and an 8-level adder tree, with bit-serial activation broadcast at 500 MHz (16 cycles BF16, 8 cycles FP8). The weights for every tile live in the 2T0C cells of the memory tier directly above it (§2.B), so a tile is physically a vertical logic-plus-memory column, not a planar block. A tier is therefore a 4,096-tile mesh of these columns; the full die is 32 such meshes stacked at 0.35 µm pitch, with the 28 nm Si base below carrying everything that is not compute.
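The tile counts and bit-serial cycle counts above are enough to sanity-check the headline throughput. The sketch below does so under one assumption that is mine and not stated in the whitepaper: each tile completes 256 multiply-accumulates (one per subarray column) per bit-serial pass. That assumption is chosen because it lands close to the quoted figures; the whitepaper may derive them differently. The density cross-check uses the per-mm² figures from the platform section.

```python
TILES = 131_072
LOGIC_TIERS = 32
SUBARRAY_ROWS = SUBARRAY_COLS = 256
BROADCAST_HZ = 500e6                      # bit-serial activation broadcast clock
CYCLES_PER_PASS = {"BF16": 16, "FP8": 8}
MACS_PER_TILE_PASS = SUBARRAY_COLS        # ASSUMPTION: one MAC per column per pass
LOGIC_AREA_MM2 = 32 * 750                 # total MAC area across logic tiers
DENSITY_TFLOPS_PER_MM2 = {"BF16": 0.0875, "FP8": 0.175}


def tiles_per_tier():
    return TILES // LOGIC_TIERS


def weights_per_tile():
    return SUBARRAY_ROWS * SUBARRAY_COLS


def peak_tflops_from_tiles(fmt):
    """Peak throughput implied by the tile model (2 FLOPs per MAC)."""
    passes_per_s = BROADCAST_HZ / CYCLES_PER_PASS[fmt]
    macs_per_s = TILES * MACS_PER_TILE_PASS * passes_per_s
    return 2 * macs_per_s / 1e12


def peak_tflops_from_density(fmt):
    """Peak throughput implied by logic area times quoted density."""
    return LOGIC_AREA_MM2 * DENSITY_TFLOPS_PER_MM2[fmt]


if __name__ == "__main__":
    print(tiles_per_tier(), weights_per_tile())
    for fmt in ("FP8", "BF16"):
        print(fmt, peak_tflops_from_tiles(fmt), peak_tflops_from_density(fmt))
```

With that assumption the tile model gives about 4,194 TFLOPS FP8 and 2,097 TFLOPS BF16, within 0.2% of the quoted 4,200 and 2,100, and the area-times-density route gives the quoted figures directly.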

The NoC is a per-tier 2D mesh, not a global fabric. Each logic tier runs its own mesh router fabric at ≈ 290 TB/s bisection, and the 64 tiers together present 18,560 TB/s aggregate (derived: 290 × 64). What rides the NoC is deliberately minimal: activations and partial sums — the operands that must move between tiles to assemble a layer's output across the 4,096-tile fan-in. Weights never touch the NoC. Every weight is read through its tile's private vertical MIV port — a single tier-pitch hop straight down from the cell to its MAC — delivering 4.2 PB/s of in-tile weight bandwidth with zero shared-bus contention (§2.A). This is the load-bearing asymmetry of the floorplan: the multi-petabyte traffic (weight fetch) is kept entirely vertical and local, so the lateral NoC only ever carries the comparatively small activati

[truncated for AI cost control]