AI News HubLIVE
站内改写

Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate

Perplexity AI open-sourced a Rust reimplementation of their Unigram tokenizer, achieving 5x lower latency than Hugging Face's tokenizers crate and reducing CPU utilization by 5-6x in production. The optimizations include double-array trie, bitmap packing, and huge pages.

Article intelligence

EngineersAdvanced

Key points

  • Perplexity AI rewrote the Unigram tokenizer in Rust, achieving 5x lower p50 latency vs Hugging Face tokenizers crate.
  • Three optimizations: double-array trie, bitmap and cache-line packing, and huge pages.
  • Reduced CPU utilization by 5-6x in production, with zero heap allocations on hot path.

Why it matters

This matters because perplexity AI rewrote the Unigram tokenizer in Rust, achieving 5x lower p50 latency vs Hugging Face tokenizers crate.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

Perplexity AI’s research team reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in pplx-garden, their inference technology repository.

At production input lengths, the new encoder cuts p50 latency by roughly 5x versus the Hugging Face tokenizers crate, ~2x versus SentencePiece (C++), and ~1.5x versus IREE’s tokenizer (C), with zero steady-state heap allocations. In production, it reduced CPU utilization in Perplexity’s inference stack by 5-6x and shaved double-digit milliseconds off reranker latency.

Why Tokenization Became a Bottleneck

LLM inference cost is typically framed around GPU work: KV caches, attention kernels, expert routing. But smaller models, such as embedding models, classifiers, and rerankers, tell a different story. These models are two to three orders of magnitude smaller than frontier transformers.

A reranker scoring hundreds of candidate documents per request is a clear example. With a small model, GPU compute often finishes in single-digit milliseconds. Every input still passes through CPU-side tokenization first. When batch sizes are large, tokenization becomes a meaningful fraction of total request latency.

Perplexity’s work targets XLM-RoBERTa, a model with a 250K-token Unigram vocabulary trained with SentencePiece. Fine-tuned RoBERTa-family encoders are a common production choice for ranking, retrieval, and similarity tasks.

What is Unigram Tokenization?

Unigram tokenization was introduced by Kudo in 2018 and is implemented in SentencePiece. It frames segmentation as a most-probable-path problem. Each vocabulary token has a learned log-probability. The tokenizer picks the segmentation whose token scores sum to the highest value.

The algorithm used to find that best path is the Viterbi algorithm, a dynamic programming technique from 1967. Byte positions form graph layers and vocabulary tokens are edges spanning a contiguous byte range. The DP recurrence iterates over byte positions and updates the best-scoring path at each position.

The outer loop runs in linear time relative to input length. The inner loop walks a vocabulary trie (a prefix tree structure) at each byte position. On a 16K-token input, this inner walk executes hundreds of thousands of trie transitions. It is the hot path.

What was Slow in the Hugging Face Implementation

The Hugging Face tokenizers crate is the default Rust tokenizer most teams reach for. Perplexity used it as the benchmark reference. At 514 tokens (512 + BOS/EOS injection), the reference implementation had three costly patterns:

BottleneckMechanismMeasured impact

Allocation per matchString::from_utf8 + AHashMap lookup per trie match7,295 allocations at 514 tokens; 299,171 at 16K

Pointer chase per byteAHashMap at every trie node; 4 dependent loads per byte stepDependent-load latency dominates the hot path

L2 thrashing on long inputsDP table and output buffers freshly allocated each callL2 miss rate climbs from 8% at 128 tokens to 50% at 16K

Per-token allocation is constant: roughly 2 KB and ~18 allocations per token, regardless of input size. The latency problem becomes severe at longer inputs when cumulative allocations overflow the per-core L2 cache.

Establishing a Baseline Before Changing the Trie

Before switching the trie structure, Perplexity first isolated how much cost came from unnecessary work alone. They made a zero-allocation port of the reference: same HashMap trie, but with a caller-owned scratch struct reused across calls and token IDs stored directly in trie nodes (removing the per-match string allocation and secondary hash-map lookup).

This baseline already cut p50 latency to 155 µs at 514 tokens, down from 326 µs in the reference. Instructions retired dropped 2.4x. The remaining cost was the HashMap pointer chase itself, which the next step addressed.

The Three Optimizations

Optimization 1: Double-Array Trie

The Hugging Face trie stores children in a HashMap at every node. Each byte step requires a hash computation, two pointer dereferences, and a heap access. Perplexity replaced this with a double-array trie, the same structure used by SentencePiece and IREE, originally introduced by Aoe in 1989.

A double-array trie encodes the entire trie in two flat integer arrays, base and check. A child lookup is: next = base[node] + byte, then verify check[next] == node. That is two array reads, one integer add, and one comparison, with no hashing and no pointer chasing. For XLM-RoBERTa’s 250K vocab, the whole trie fits in ~9 MB of contiguous memory. The hot working set per encode is on the order of 100 KB, which fits in L2 cache.

Unlike SentencePiece and IREE, which are general-purpose libraries with lattice bookkeeping and multi-stage pipelines, Perplexity inlined the trie directly in the Viterbi loop and dropped that overhead entirely.

Result at 514 tokens: p50 dropped from 155 µs (zero-allocation baseline) to 68 µs. Wall-clock fell 4.8x from the original reference.

Optimization 2: Bitmap and Inline Packing

The double-array trie still requires two dependent array loads per byte step: first the parent’s base offset, then the check array to confirm the transition is valid. Perplexity replaced the check array with a per-node bitmap (four 64-bit words, 32 bytes) that records which of the 256 possible bytes have valid child transitions.

A bitmap lookup compiles to a single bit test against one 64-bit word. The check array is used only during trie construction and dropped from the runtime layout entirely.

They also packed all four per-node fields (bitmap, base, token ID, and score) into a single 64-byte cache line, matching CPU cache line width exactly. One trie step now loads a single cache line covering the bitmap for the next-byte check, the base offset for the child slot, and the token ID and score at terminal nodes.

Trade-off: trie size grows from ~9 MB to ~50 MB (780K nodes x 64 bytes). The hot working set per encode remains ~100 KB.

Result at 514 tokens: Additional 4.5% wall-clock reduction. L2 accesses dropped from 4.6K to 1.8K per encode.

Optimization 3: Huge Pages for the Trie

At 50 MB, the trie spans roughly 12,000 virtual pages on a default Linux system using 4 KB pages. The first-level data TLB on Intel Sapphire Rapids holds 96 entries. Each Viterbi step touches a different trie node, so TLB misses accumulate. Over a 512-token encode, Perplexity estimated roughly 9,000 cycles spent in page-table walks, about 3% of per-encode budget.

Perplexity backed the trie with 2 MB huge pages via mmap with the MAP_HUGETLB flag. The same 50 MB now spans 25 pages, well within the TLB. This requires vm.nr_hugepages configured at boot. In production, 10,561 huge pages are reserved; the trie uses 24.

Result: 3-12% wall-clock reduction depending on input length. The largest gain is at 4,098 tokens (-12.0%), where page-table traffic was actively competing with trie data for L2 bandwidth. Beyond 4K tokens the gain shrinks because L3 misses dominate.

Final Benchmark Results

All measurements are single-threaded, pinned to one core on an Intel Xeon Platinum 8488C, with 10,000 iterations after 1,000 warmup rounds. At 514 tokens:

Enginep50 LatencyInstructionsAllocations

Hugging Face (tokenizers crate)349 µs3.60M7,295

SentencePiece (C++)128 µs1.83M1,559

IREE tokenizer (C)112 µs2.28M1

Perplexity (final, all 3 optimizations)~63 µs1.04M0

Across the full optimization sequence, instructions per encode fell from 3.66M to 1.04M, a 3.5x reduction. Wall-clock matches that ratio at short inputs and widens at long inputs where the reference’s per-token allocations overflow L2 and L3.

One additional finding: off-the-shelf Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x latency overhead compared to the native C/C++ binaries. The sentencepiece crate allocates a fresh list of token pieces on each call. The overhead is measurable but amortizes at long inputs.

The final Perplexity encoder produces token-exact output against the reference. In production, it uses rayon to parallelize across cores.

Marktechpost’s Visual Explainer

Open Source Release

Perplexity AI Rewrites Its Unigram Tokenizer, Cuts CPU Utilization 5-6x

Perplexity reimplemented their Unigram tokenizer from scratch in Rust and open-sourced it in pplx-garden. Three targeted optimizations removed wasted work from the hot path.

5xLower p50 vs HuggingFace tokenizers crate

5-6xCPU utilization reduction in production

0Heap allocations on the hot path

Source: research.perplexity.ai

The Problem

Why CPU Tokenization Became a Bottleneck

LLM inference cost is usually framed around GPU work: KV caches, attention kernels, expert routing. But small models tell a different story.

1

Rerankers and embedders are small

Two to three orders of magnitude smaller than frontier transformers. GPU compute finishes in single-digit milliseconds.

2

Tokenization runs on CPU before each call

Every input passes through CPU-side tokenization first, turning text into vocabulary IDs.

3

Batch size amplifies the cost

A reranker scoring hundreds of documents per request means tokenization runs hundreds of times per query.

Background

What Is Unigram Tokenization?

Introduced by Kudo (2018), implemented in SentencePiece. Perplexity targets XLM-RoBERTa with a 250K-token Unigram vocabulary.

Most-probable-path problem

Each vocabulary token carries a learned log-probability. The tokenizer picks the segmentation whose token scores sum highest.

Viterbi algorithm (1967)

A dynamic programming method that finds the best path. Byte positions are graph layers; vocabulary tokens are edges.

The hot path is the inner trie walk at each byte position. On a 16K-token input, this executes hundreds of thousands of trie transitions and retires tens of millions of instructions per encode.

Root Cause

Three Bottlenecks in the Hugging Face Reference

Measured at 514 tokens (512 + BOS/EOS) on Intel Xeon Platinum 8488C:

BottleneckMechanismImpact

Allocation per matchString::from_utf8 + AHashMap lookup per trie match7,295 allocs at 514 tokens; 299,171 at 16K

Pointer chase per byteAHashMap at every trie node; 4 dependent loads per stepDependent-load latency dominates

L2 thrashingDP table and output buffers freshly allocated each callL2 miss rate: 8% at 128 tokens, 50% at 16K

Per-token allocation is constant: ~2 KB and ~18 allocations per token regardless of input size.

Step 0: Baseline

Zero-Allocation Port Before Changing the Trie

Before touching the trie structure, Perplexity isolated how much cost came from unnecessary allocations alone. They kept the same HashMap trie but made two changes:

Caller-owned scratch struct reused across calls, removing per-encode DP table allocation

Token IDs stored directly in trie nodes, removing per-match String allocation and secondary hash-map lookup

Reference p50

326 µs

Baseline p50

155 µs (-2.1x)

Allocations alone were the dominant cost. Instructions retired dropped 2.4x. The HashMap pointer chase was now the remaining bottleneck.

Optimization 1

Double-Array Trie

The HashMap trie costs 4 dependent loads per byte step. The double-array trie (Aoe, 1989) replaces it with flat integer arrays base and check.

HashMap trie (reference)

Hash byte, load bucket, follow pointer to child, follow pointer to child’s HashMap. 4 dependent loads per step.

Double-array trie

next = base[node] + byte Verify check[next] == node 2 array reads, 1 add, 1 compare. No hashing.

250K vocab fits in ~9 MB contiguous memory. Hot working set per encode is ~100 KB, fitting in L2 cache. Result: p50 drops from 155 µs to 68 µs, wall-clock 4.8x faster than original reference.

Optimization 2

Bitmap + 64-Byte Cache-Line Packing

The double-array trie still needs two dependent array loads per step. Perplexity replaced the check array with a per-node bitmap.

Per-node bitmap: four 64-bit words (32 bytes), one bit per possible byte value. A single bit test replaces the secon

[truncated for AI cost control]