2026-06-08 | Site rewrite | 6 min read | Updated: 2026-06-08

How to Measure Time to First Token (TTFT)

This article explains how to measure Time To First Token (TTFT) in AI systems, why it differs fundamentally from traditional web API performance measurement, and how to instrument LLM workloads using Python, Node.js, and Apache JMeter.

Source: Hacker News AI | Author: qainsights

In this blog post, we will see how to measure Time To First Token (TTFT) in AI systems, why it is fundamentally different from traditional web API performance measurement, and how you can instrument your LLM workloads using Python, Node.js, and Apache JMeter.

Time To First Token (TTFT) measures the elapsed time between sending a prompt request and receiving the first token in the response stream, making it fundamentally different from traditional HTTP response time, which captures only when the final byte arrives. For streaming LLM APIs, this distinction matters because users perceive responsiveness based on when output begins, not when it ends.

The post covers how LLM APIs deliver tokens via Server-Sent Events or chunked transfer encoding, explains a complete set of related metrics including token throughput and inter-token latency, and provides code examples in Python and Node.js for accurate TTFT instrumentation. Common measurement pitfalls include using low-resolution timers, skipping stream mode, and testing only at single-user concurrency.

The Problem with Traditional API Performance Metrics

When you load test a REST API, you typically measure response time, throughput, and error rate. These three metrics tell you almost everything you need to know. You fire a request, wait for the full HTTP response, record the elapsed time, and move on.

That model completely breaks down the moment you point your load generator at an LLM.

Here is the trap. You call the OpenAI API, the Anthropic API, or a locally hosted Ollama endpoint via plain HTTP. It looks and feels exactly like calling any other REST API. You get a JSON response body, a 200 status code, and a response time in your results. Everything looks normal.

But that response time is lying to you.

The LLM did not compute the entire response in one shot and then flush it out. It generated one token at a time and streamed them to you over the wire, and what your traditional performance tool recorded was the time for the last token to arrive, not the first. That is not a meaningful user experience metric. A user who stares at a blank screen for four seconds and then watches a wall of text arrive has a terrible experience, even if the total elapsed time is only five seconds and sits well within your SLA.

Measuring LLM API performance with HTTP response time alone is like judging a restaurant purely by when the bill arrives.

The Restaurant Analogy

Imagine you walk into a restaurant and order a meal. You are hungry and impatient.

Two things determine whether you feel the experience was fast:

How long until the first dish arrives at your table. It does not have to be the main course. Even a bread basket or a soup tells your brain “they heard me, they are working on it.” This is your TTFT.

How quickly the remaining dishes keep coming after that. Steady, predictable delivery. No long gaps between courses. This is your token throughput and Time To Last Token (TTLT).

Your HTTP response time is like timing the entire meal, from the moment you ordered to the moment you paid the bill and walked out the door. That number might be useful for business analytics, but it tells the kitchen nothing about whether they need to improve their response speed.

TTFT is the bread basket. It is the first signal that something is happening. And for AI-powered applications, it is the single most important metric for perceived performance.

What Is Time To First Token (TTFT)?

TTFT is the elapsed time between the moment your client sends the prompt request and the moment the first token byte arrives in the response stream.

In formula terms:

TTFT = Time of first token received - Time of request sent

It captures everything that happens before the first token reaches your client: network latency to the inference server, request queuing, tokenization of the prompt, KV cache lookup, the model forward pass that produces the first token, and serialization back over the wire.

A high TTFT is almost always experienced as the application “hanging.” Even if the model eventually produces a long, high-quality answer, users will have already lost confidence in the system.

TTFT matters most in:

Chat interfaces where users expect real-time streaming

Copilot-style code completion tools

Voice AI pipelines where first-word latency drives naturalness

Agentic workflows where multiple sequential LLM calls compound the delay

The Full LLM Performance Metric Stack

TTFT is the headline metric, but it does not exist in isolation. A complete LLM performance measurement strategy tracks all of the following:

Metric | Definition | Why It Matters
TTFT | Time to first token | Perceived responsiveness, UX
TTLT | Time to last token (end-to-end latency) | Total completion time
Token Throughput | Tokens generated per second | Generation speed, cost efficiency
Inter-token Latency | Average time between consecutive tokens | Streaming smoothness
Goodput | Successful tokens per second under load | Real-world system capacity
Jitter | Variance in inter-token latency | Consistency under concurrent load

Think of it this way: TTFT tells you how fast the kitchen starts. Token throughput tells you how fast the kitchen works. Jitter tells you whether the kitchen has bad days.
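
To make the table concrete, here is a minimal sketch of how most of these metrics could be derived from the arrival timestamps of a single streamed response. The function name summarize_stream and the sample timestamps are illustrative assumptions, not part of any provider SDK. Jitter is reported here as the population standard deviation of the inter-token gaps, which is one reasonable reading of "variance in inter-token latency". Goodput is left out because it needs success and failure counts across many requests.

```python
from statistics import mean, pstdev


def summarize_stream(request_sent: float, arrivals: list[float]) -> dict:
    """Derive per-response metrics from token arrival timestamps.

    request_sent: timestamp (seconds) taken just before the request was sent.
    arrivals: timestamps (seconds) at which each token or chunk arrived, in order.
    """
    if not arrivals:
        raise ValueError("no tokens arrived, so TTFT is undefined")

    ttft = arrivals[0] - request_sent
    ttlt = arrivals[-1] - request_sent
    # Gaps between consecutive arrivals drive inter-token latency and jitter.
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]

    return {
        "ttft_s": ttft,
        "ttlt_s": ttlt,
        "tokens_per_sec": len(arrivals) / ttlt if ttlt > 0 else None,
        "inter_token_latency_s": mean(gaps) if gaps else None,
        # Assumption: jitter expressed as a standard deviation, in seconds.
        "jitter_s": pstdev(gaps) if gaps else None,
    }


# Made-up timestamps: request sent at t=10.0 s, four tokens arrive afterwards.
metrics = summarize_stream(10.0, [10.5, 10.6, 10.7, 10.9])
print(metrics)
```

In this made-up example the first token arrives half a second after the request, so the TTFT is about 0.5 s, while the TTLT is about 0.9 s.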

How LLM Streaming Works Under the Hood

LLM APIs deliver tokens via Server-Sent Events (SSE) or chunked HTTP transfer encoding. Instead of buffering the full response, the server flushes each token (or small group of tokens) as it is generated.

A raw SSE stream from the OpenAI API looks like this:

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":"The"}}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":" capital"}}]}

data: [DONE]

Each data: line is a discrete chunk. The timestamp of the very first data: line with actual content in it is your TTFT measurement point. Everything after that contributes to token throughput and TTLT.

This is a fundamentally different measurement model than HTTP response time. You cannot capture TTFT with a standard HTTP sampler that waits for the full response body. You need a streaming-aware client that hooks into the chunk arrival events.
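
As a rough sketch of what "hooking into chunk arrival" means, the function below walks SSE lines in the OpenAI chunk shape shown above and returns the elapsed time at the first line whose delta carries non-empty content. The name ttft_from_sse, the injectable clock parameter, and the sample lines are assumptions made for illustration. In real use the lines would come from a streaming HTTP client, and request_start would be captured just before the request is sent.

```python
import json
import time
from typing import Callable, Iterable, Optional


def ttft_from_sse(
    lines: Iterable[str],
    request_start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return seconds from request_start to the first content-bearing chunk."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # stream ended without any content
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue
        choices = chunk.get("choices") or []
        delta = (choices[0].get("delta") or {}) if choices else {}
        # Role-only or empty deltas do not count as the first token.
        if delta.get("content"):
            return clock() - request_start
    return None


sample_lines = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"The"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":" capital"}}]}',
    "",
    "data: [DONE]",
]

start = time.perf_counter()
print(ttft_from_sse(sample_lines, request_start=start))
```

The clock is passed in as a parameter so the parsing logic can be checked with a fixed clock, independently of any network call.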

Measuring TTFT in Python

Here is a minimal but accurate TTFT measurement using the httpx library with streaming enabled against the Anthropic Messages API:

    import json
    import time

    import httpx


    def measure_ttft(prompt: str, api_key: str) -> dict:
        url = "https://api.anthropic.com/v1/messages"
        headers = {
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        }
        payload = {
            "model": "claude-sonnet-4-20250514",
            "max_tokens": 512,
            "stream": True,
            "messages": [{"role": "user", "content": prompt}],
        }

        ttft = None
        ttlt = None
        token_count = 0
        # Start the clock before the connection opens so network setup is included.
        request_start = time.perf_counter()

        with httpx.Client(timeout=60) as client:
            with client.stream("POST", url, headers=headers, json=payload) as response:
                # Fail loudly on 4xx/5xx instead of silently returning empty metrics.
                response.raise_for_status()
                for line in response.iter_lines():
                    if not line.startswith("data:"):
                        continue
                    raw = line[len("data:"):].strip()
                    # OpenAI-style end marker; Anthropic streams end with a
                    # message_stop event instead, so this is only a safeguard.
                    if raw == "[DONE]":
                        break
                    try:
                        chunk = json.loads(raw)
                    except json.JSONDecodeError:
                        continue

                    event_type = chunk.get("type", "")

                    if event_type == "content_block_delta":
                        now = time.perf_counter()
                        if ttft is None:
                            ttft = now - request_start
                        # Counts delta events; one event may carry more than one token.
                        token_count += 1
                        ttlt = now - request_start

        return {
            "ttft_ms": round(ttft * 1000, 2) if ttft is not None else None,
            "ttlt_ms": round(ttlt * 1000, 2) if ttlt is not None else None,
            "token_count": token_count,
            "throughput_tokens_per_sec": round(token_count / ttlt, 2) if ttlt else None,
        }


    if __name__ == "__main__":
        result = measure_ttft(
            prompt="Explain how transformer attention works in simple terms.",
            api_key="your-api-key-here",
        )
        print(result)

The key detail here is time.perf_counter(). Do not use time.time() for sub-second precision measurements: it follows the system clock, which can be adjusted while your test is running. perf_counter() is monotonic and uses the highest-resolution timer available on your platform.
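
If you want to check what the two clocks offer on your own machine, the standard library exposes their properties through time.get_clock_info(). The snippet below is a small sketch; the reported resolution values depend on your operating system, so no expected output is shown.

```python
import time

# Collect the platform-reported properties of both clocks.
clocks = {name: time.get_clock_info(name) for name in ("time", "perf_counter")}

for name, info in clocks.items():
    print(
        f"{name}: resolution={info.resolution}s "
        f"monotonic={info.monotonic} adjustable={info.adjustable}"
    )
```

The monotonic and adjustable flags are the ones that matter for latency work: a clock that can be adjusted mid-run can produce negative or inflated durations.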

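You record request_start before the HTTP connection opens. The first content_block_delta event sets ttft, and the last one sets ttlt. Anthropic's stream ends with a message_stop event rather than a [DONE] sentinel, so the [DONE] check only matters for OpenAI-style streams. Note that token_count counts delta events, which approximates the number of tokens but is not guaranteed to equal it. The same pattern ports to any LLM provider that supports SSE streaming once you adjust the event names.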

Measuring TTFT in Node.js

Here is the equivalent measurement in Node.js, written in TypeScript and using the official Anthropic SDK with streaming:

    import Anthropic from "@anthropic-ai/sdk";

    interface LLMMetrics {
      ttft_ms: number | null;
      ttlt_ms: number | null;
      token_count: number;
      throughput_tokens_per_sec: number | null;
    }

    async function measureTTFT(prompt: string): Promise<LLMMetrics> {
      // The SDK reads ANTHROPIC_API_KEY from the environment by default.
      const client = new Anthropic();

      let ttft: number | null = null;
      let ttlt: number | null = null;
      let tokenCount = 0;
      // Start the clock before the request is issued.
      const requestStart = performance.now();

      const stream = client.messages.stream({
        model: "claude-sonnet-4-20250514",
        max_tokens: 512,
        messages: [{ role: "user", content: prompt }],
      });

      for await (const chunk of stream) {
        if (chunk.type === "content_block_delta") {
          const now = performance.now();
          if (ttft === null) {
            ttft = now - requestStart;
          }
          // Counts delta events; one event may carry more than one token.
          tokenCount++;
          ttlt = now - requestStart;
        }
      }

      return {
        ttft_ms: ttft !== null ? Math.round(ttft * 100) / 100 : null,
        ttlt_ms: ttlt !== null ? Math.round(ttlt * 100) / 100 : null,
        token_count: tokenCount,
        throughput_tokens_per_sec:
          ttlt && tokenCount
            ? Math.round((tokenCount / (ttlt / 1000)) * 100) / 100
            : null,
      };
    }

    measureTTFT("What is the difference between TCP and UDP?").then(console.log);

Use performance.now() from the Web Performance API, not Date.now(). The former gives sub-millisecond resolution; the latter is millisecond-granular at best and drifts with system clock adjustments.

What Does Good TTFT Look Like?

There is no universal SLA, but here are rough benchmarks based on observed inference performance across major providers under normal load conditions:

TTFT Range | Interpretation
> 3 s | Unacceptable for interactive use cases

These numbers degrade significantly under concurrent load. A provider that serves a TTFT of 250 ms for a single request may degrade to 1.5 s or more when you ramp up to 50 concurrent users. This is exactly why load testing TTFT under simulated concurrency is important, rather than measuring it only for a single request in isolation.

Your SLA should define a p95 TTFT target, not a mean. Mean TTFT hides the tail latency that your slowest requests, and the users behind them, actually experience.
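
A short sketch of why the mean can mislead: the helper below computes a nearest-rank percentile over a list of TTFT samples. The percentile function and the sample values are made up for illustration; in a real test the samples would come from your load generator, and other percentile definitions (for example, interpolated ones) would give slightly different numbers.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical TTFT samples in milliseconds: nine fast requests, one slow outlier.
ttft_ms = [210, 230, 240, 250, 260, 270, 280, 300, 320, 1900]

mean_ms = sum(ttft_ms) / len(ttft_ms)
p50_ms = percentile(ttft_ms, 50)
p95_ms = percentile(ttft_ms, 95)

print(f"mean={mean_ms} ms  p50={p50_ms} ms  p95={p95_ms} ms")
```

With these made-up samples the mean lands at 426 ms, which describes neither the typical request (260 ms) nor the slow one (1900 ms). The p95 surfaces the outlier directly.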

Common Pitfalls

Measuring HTTP response time instead of TTFT. As discussed, this is the most common mistake. Your standard HTTP sampler records time-to-last-byte, which is TTLT, not TTFT. They are not interchangeable.

Not enabling streaming. If you call the LLM API without "stream": true, the provider buffers the entire response server-side and sends it as a single HTTP response. You will never observe TTFT in that mode because there are no intermediate chunks to measure.

Using wall-clock time with low resolution. Date.now() and time.time() are not appropriate for sub-second latency measurements. Use performance.now() in Node.js and time.perf_counter() in Python.

Testing only at single-user concurrency. TTFT at one virtual user tells you the best-case floor, not the operational reality. Always run TTFT measurements at realistic concurrency levels.

Ignoring prompt length as a variable. Longer prompts take more time to tokenize and run through the prefill phase on the model, which directly inflates TTFT. Your benchmarks should fix prompt length as a controlled variable, otherwise your results are not comparable across runs.

Conflating model latency with network latency. If your client is geographically far from the inference endpoint, network round-trip time dominates TTFT. Run measurements from a machine in the same cloud region as the model when you want to isolate model performance from infrastructure performance.
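
To move beyond single-user measurements, one simple approach is to run the same measurement function from a thread pool and keep the prompt fixed across requests. The sketch below assumes a measurement function that takes a prompt and returns a dict with a ttft_ms key, like the Python example earlier. The names run_concurrent and fake_measure are hypothetical, and fake_measure only sleeps so the harness can run without an API key. A dedicated load testing tool gives you ramp-up control and reporting that this sketch does not.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_concurrent(
    measure: Callable[[str], dict],
    prompts: list[str],
    concurrency: int,
) -> list[dict]:
    """Call measure(prompt) for every prompt with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order and re-raises any worker exception.
        return list(pool.map(measure, prompts))


def fake_measure(prompt: str) -> dict:
    """Stand-in for a real streaming call; it only simulates a short wait."""
    start = time.perf_counter()
    time.sleep(0.01)
    return {"prompt": prompt, "ttft_ms": (time.perf_counter() - start) * 1000}


# Same prompt for every request, so prompt length stays a controlled variable.
results = run_concurrent(fake_measure, ["same fixed prompt"] * 20, concurrency=5)
print(len(results), "measurements collected")
```

Swapping fake_measure for a real streaming measurement and raising concurrency step by step shows how TTFT moves away from its single-user floor.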

Conclusion

Measuring LLM performance is genuinely different from measuring traditional web API performance. HTTP response time, the metric we have trusted for years, gives you a completely wrong picture when applied to streaming L
