Serving Local AI on My Jetson Through Durable Streams
The author built StreamTTS, a local text-to-speech app running Kokoro-82M on an NVIDIA Jetson Orin Nano Super. It replaces traditional request-response with durable streams (S2) to get shareable, live-updating audio generation while addressing slow inference, fair scheduling, and deduplication.
With local AI feeling more and more practical, I wanted to self-host my own models, run my workloads without any third-party provider in the mix, and look into serving a local model reliably to a few users. NVIDIA's Jetson series is a great starting point, and I went with the Jetson Orin Nano Super kit, aka “The most affordable generative AI supercomputer”! It has 1024 CUDA cores and 32 tensor cores and is rated at 67 TOPS (trillion operations per second), which should be good enough for my little experiment: a small text-to-speech app powered by Kokoro-82M, a neural text-to-speech model.
It was mostly born out of need: I don't always want to read a lot of text and would rather hear it. So I want something where I select some text, pick a voice, and get a link that I can come back to later or share with people. For now that means pasting text into a page, but eventually I'd want something even more lazy-proof, which would just be a nicer frontend on top of the same core app. Beyond the app itself, I want to land on a small reference architecture for local inference: a self-contained serving layer that exposes a clean API, so the same setup can back a web app, a CLI, or another service without rework.
Try it out at streamtts.dev (It is self-hosted on my Jetson! 😉):
Not a normal Request/Response API
The simplest way to architect this would be:
    POST /generate
    wait
    return audio.mp3
Inference is slower than a normal web request. Kokoro on this Jetson can produce speech faster than realtime, but it is still a GPU job. A minute of audio can take many seconds of compute. A cold first sentence can be slower while the model stack warms up. If multiple users submit at once, a blocking request turns into a line of sockets waiting on the GPU.
The output is also naturally incremental. TTS does not need to finish the entire paragraph before the listener hears anything. The model can generate one sentence, encode that sentence to MP3, append it somewhere, and move on. If I force the whole thing into a single response body, I throw away the best property of the workload.
And I want the result to be shareable. The user should be directed to a link immediately where they can "await" the model to produce all the bytes. If they open it while the Jetson is still working, they should hear the prefix and then follow the live edge.
If we start with request-response, we end up adding a pile of infrastructure like:
queue
database for job bookkeeping
object storage for the finished file
retry logic
dedupe logic
cleanup process
All of this is reasonable. But together, it is a lot for one basic promise:
    accept work now
    produce output later
    let readers follow along
The request feels like the wrong lifetime for this. I want the inference job to work seamlessly across network disruptions. I also do not want a dropped browser tab to kill a running generation. Thus the output should have an identity before it is complete, and readers should be able to start at the beginning, catch up to the tail, or come back later and replay the same bytes!
In summary, I want:
    submit work
    get an output stream immediately
    worker appends model output
    client awaits the stream
All of this can be cleanly abstracted over durable streams. A stream is an ordered sequence of records, where a record is just some bytes (here, a chunk of audio plus a little metadata). Durable means every record is persisted, so nothing is lost and a reader can come back later and replay the exact same bytes. Putting the two together, we get a simple but powerful building block.
Append records to the tail, and readers can start at the head, seek to a known sequence number, or sit at the tail and wait for the next record to arrive. A stream store gives you named timelines:
    APPEND record
    READ from seq_num=N
    TAIL for live records
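To make those operations concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the S2 API: the `Record` and `InMemoryStream` names are made up for illustration, nothing is persisted to disk, and waiting at the tail for new records is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    seq_num: int
    body: bytes
    headers: dict = field(default_factory=dict)


class InMemoryStream:
    """Toy stand-in for one named stream (hypothetical, not the S2 client)."""

    def __init__(self) -> None:
        self._records: list = []

    @property
    def tail(self) -> int:
        # The sequence number the next appended record will receive.
        return len(self._records)

    def append(self, body: bytes, headers: Optional[dict] = None) -> int:
        record = Record(self.tail, body, dict(headers or {}))
        self._records.append(record)
        return record.seq_num

    def read(self, seq_num: int = 0) -> list:
        # Everything from seq_num up to the current tail, in order.
        return self._records[seq_num:]


stream = InMemoryStream()
stream.append(b"first")
stream.append(b"second")
stream.append(b"third")

late_reader = stream.read(seq_num=0)     # replay from the head
resumed_reader = stream.read(seq_num=2)  # resume from a known offset
```

Because sequence numbers are assigned in append order, a reader only needs to remember one integer to resume where it left off.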
Each record is the unit of progress. A record has a sequence number, timestamp, headers, and a body. StreamTTS does not need much more structure than that. We represent records like so:
    headers:
      e: audio
      i: 3
      d: 4210
      t: "sentence text"
    body:
e = event type
i = index
d = duration (ms)
t = sentence text
And the output will be shaped like:
    pub/casts/4LwnHZDl_vFC
      seq 0  meta
      seq 1  start
      seq 2  audio  sentence 0
      seq 3  audio  sentence 1
      seq 4  audio  sentence 2
      seq 5  eos    # end of stream
That stream is the audio file, the live feed, the replay log, and the progress indicator. It is also the contract between the web server, the GPU worker, and every browser that opens the link. The writer does not need to know who is listening. The reader does not need to know whether the writer is still alive. Both sides just agree on one named sequence of records.
Connection-only SSE or WebSockets are great for live delivery, but they do not give you durable replay by themselves. They move bytes to clients that are currently connected. They do not, on their own, remember the bytes for clients that arrive late, disconnect, or refresh the page. So if nobody is connected, there is nowhere durable for a WebSocket message to go. If a client drops, the server needs some other store to remember what that client missed. If a second listener opens the same link while generation is still running, the WebSocket connection does not tell the server how to replay the beginning and then follow the live edge. You can absolutely solve this by putting a database or object store next to SSE/WebSockets. But now live delivery and replay are two separate pieces that have to agree.
With a durable stream, that split goes away! The worker appends output once, and a live listener tails the stream. A late listener reads from seq_num=0 and then tails the same stream. Replay and live playback are the same read path, just starting from different offsets.
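A rough sketch of that single read path, using asyncio and an in-memory list as a stand-in for the durable stream. `LiveStream`, `follow`, and the event strings are illustrative assumptions, not the StreamTTS or S2 client code.

```python
import asyncio


class LiveStream:
    """In-memory stand-in for a durable stream (assumption, not the S2 client)."""

    def __init__(self) -> None:
        self.records = []
        self._changed = asyncio.Condition()

    async def append(self, event: str) -> None:
        async with self._changed:
            self.records.append(event)
            self._changed.notify_all()

    async def follow(self, seq_num: int = 0):
        # Yield whatever already exists, then wait at the tail for more.
        while True:
            async with self._changed:
                await self._changed.wait_for(lambda: len(self.records) > seq_num)
                batch = self.records[seq_num:]
            for event in batch:
                yield event
                if event == "eos":
                    return
            seq_num += len(batch)


async def listen(stream: LiveStream) -> list:
    return [event async for event in stream.follow(0)]


async def demo():
    stream = LiveStream()
    live = asyncio.create_task(listen(stream))  # connected from the start
    await stream.append("meta")
    await stream.append("audio 0")
    late = asyncio.create_task(listen(stream))  # opens the link mid-generation
    await stream.append("audio 1")
    await stream.append("eos")
    return await live, await late


live_events, late_events = asyncio.run(demo())
```

Both listeners call the same `follow(0)`; the only difference is how much of the stream already exists when they start.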
S2 Lite
S2 Lite is an open source self-hosted, single-binary implementation of the S2 durable streams API. In this setup, it runs on localhost with local disk for durable storage and gives me streams with append, read, tail, and long-polling semantics.
s2 lite --local-root var/s2lite-data --port 4002 --no-cors
We start by creating a basin, which acts as a namespace, and model the whole service as a handful of named streams. The arrows below show which component appends to each stream and which reads from it:
A few streams are shared across all casts:
jobs is the intake log: one record per inference request
jobs/_cursor holds the worker's committed read offset into jobs
jobs/dead collects jobs that failed past retries
progress/done gets one receipt per completed cast
And each cast adds two streams of its own:
catalog/ is the private recipe: full text, voice, title, created time
pub/casts/ is the public output stream: meta, start, audio..., eos
    s2 = S2(
        os.environ.get("S2_ACCESS_TOKEN", "local-token"),
        endpoints=Endpoints(
            account=lite_url,
            basin=lite_url,
        ),
    )

    config = BasinConfig(
        default_stream_config=StreamConfig(
            storage_class=StorageClass.EXPRESS,
            retention_policy=RETENTION_SECS,
        )
    )
    await s2.ensure_basin(basin, config=config)
Each audio record carries the sentence text and duration in milliseconds in its headers, and the raw MP3 bytes in its body. The text gives the browser captions and seek points. The duration lets the player schedule chunks. The browser always starts reading at seq_num=0.
If the stream is complete, the browser reads through eos and stops. If the worker is still appending, the browser reads the existing prefix, reaches the tail, and waits for the next record. The browser player is also built around the stream shape. It does not use Media Source Extensions or build one growing MP3 file. Each audio record is a complete sentence-sized MP3 chunk, which the browser decodes with the Web Audio API and places on a virtual timeline.
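The virtual timeline is simple arithmetic over the `d` headers. The real player runs in the browser with the Web Audio API; the sketch below only shows the bookkeeping, in Python, and `chunk_starts` and `chunk_at` are hypothetical helper names.

```python
from bisect import bisect_right
from itertools import accumulate


def chunk_starts(durations_ms: list) -> list:
    """Start offset (ms) of each chunk: chunk k begins where chunks 0..k-1 end."""
    return [0, *accumulate(durations_ms)][:-1]


def chunk_at(starts: list, position_ms: int) -> int:
    """Index of the chunk that contains a given playback position."""
    return bisect_right(starts, position_ms) - 1


# Durations as they would arrive in the d header of three audio records.
starts = chunk_starts([4210, 1500, 3000])
seek_target = chunk_at(starts, 5000)  # which sentence is playing at 5.0 s?
```

Keeping offsets as integer milliseconds avoids floating-point drift when many short chunks are summed.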
Fair Scheduling
A single Jetson can’t behave like an elastic inference cluster 😅. If, say, three people submit text, I do not want the first long paragraph to finish completely while everyone else waits. The worker keeps several casts active and tracks how far ahead each stream is relative to wall-clock playback:
    def lead(self) -> float:
        return self.total_ms / 1000.0 - (time.monotonic() - self.started)
Positive lead means the stream has generated audio buffered ahead of playback. Negative lead means playback has caught up with generation, so a listener would be waiting at the live tail.
The scheduling loop is:
    admit jobs up to the concurrency cap
    pick the active stream with the lowest lead
    generate exactly one sentence for it
    append that sentence
    recompute lead
    repeat
When every active stream is comfortably ahead, the worker sleeps for a tiny bit instead of sprinting one stream to completion. Call it live-output scheduling: the goal is to keep multiple public streams playable. The unit of fairness is not a request, but one appended sentence.
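A simplified simulation of that loop, assuming a fake clock so the result is deterministic. `rtf` (compute seconds per second of audio) and `max_lead` (the "comfortably ahead" threshold) are assumed parameters, and job admission with its concurrency cap is left out.

```python
from dataclasses import dataclass


@dataclass
class Cast:
    name: str
    pending_ms: list     # durations of sentences not yet generated
    started: float = 0.0  # clock time when the cast was admitted
    total_ms: int = 0     # audio appended so far

    def lead(self, now: float) -> float:
        return self.total_ms / 1000.0 - (now - self.started)


def run(casts: list, rtf: float = 0.25, max_lead: float = 30.0) -> list:
    """Simulate the loop; returns the order in which sentences were appended."""
    now = 0.0
    order = []
    active = [c for c in casts if c.pending_ms]
    while active:
        cast = min(active, key=lambda c: c.lead(now))
        if cast.lead(now) > max_lead:
            now += 0.05                    # everyone is ahead: idle briefly
            continue
        duration = cast.pending_ms.pop(0)  # "generate" exactly one sentence
        now += duration / 1000.0 * rtf     # compute time on the fake clock
        cast.total_ms += duration          # "append" it to the cast's stream
        order.append(cast.name)
        active = [c for c in active if c.pending_ms]
    return order


order = run([
    Cast("long", [4000] * 4),
    Cast("short", [4000] * 2),
])
```

In this run the two casts alternate sentence by sentence until the short one finishes, rather than the long one running to completion first.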
Submitting Work
When a request comes in, the web process does not load the model. It validates the text and voice, computes a deterministic id, and creates a place where audio will appear.
The id is content-addressed:
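    import base64
    import hashlib

    def content_id(text: str, voice: str) -> str:
        # Hash voice + trimmed text (NUL-separated), then keep 12 URL-safe base64 chars.
        h = hashlib.sha256(f"{voice}\x00{text.strip()}".encode()).digest()
        return base64.urlsafe_b64encode(h).decode().rstrip("=")[:12]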
Identical text with the same voice maps to the same stream. That turns repeated submissions into cache hits.
The write path is:
claim catalog/ with the full recipe
claim pub/casts/ with a meta record
append one job to the jobs stream
return /c/
The important operation is the claim. S2 supports conditional append with match_seq_num. StreamTTS uses match_seq_num=0, which means "only append if this stream is empty."
    payload = {
        "records": [{"body": json.dumps(body, separators=(",", ":"))}],
        "match_seq_num": 0,
    }
If two people submit the same text at the same time, exactly one request wins the claim and enqueues the job. The other gets the same link and tails the same output stream.
That one append replaces a lock table, a uniqueness constraint, and a dedupe cache.
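The claim semantics can be sketched with an in-memory stand-in. This is an illustration under assumptions, not the S2 client: `ClaimableStream` and `submit` are hypothetical names, a rejected conditional append is modeled as returning `False`, and only the public stream is claimed (the catalog claim is omitted). The two submissions also run one after the other here; in the real system the store is what makes the check-and-append atomic.

```python
from typing import Optional


class ClaimableStream:
    """In-memory stand-in for a stream with a conditional append."""

    def __init__(self) -> None:
        self.records = []

    def append(self, record: dict, match_seq_num: Optional[int] = None) -> bool:
        # Reject the append unless the tail is exactly where the caller expects.
        if match_seq_num is not None and match_seq_num != len(self.records):
            return False
        self.records.append(record)
        return True


def submit(streams: dict, jobs: list, cast_id: str) -> str:
    name = f"pub/casts/{cast_id}"
    stream = streams.setdefault(name, ClaimableStream())
    if stream.append({"e": "meta"}, match_seq_num=0):
        jobs.append(cast_id)  # only the claim winner enqueues work
    return name               # every submitter gets the same stream


streams = {}
jobs = []
first = submit(streams, jobs, "4LwnHZDl_vFC")
second = submit(streams, jobs, "4LwnHZDl_vFC")
```

The second submission loses the claim, enqueues nothing, and still ends up pointing at the same output stream.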
The Worker is a Durable Consumer
The worker is the only process that owns the model and touches the GPU. It reads from the jobs stream, runs Kokoro-82M, and appends audio records to the cast stream.
On startup, the worker reads the last committed offset from jobs/_cursor:
    jobs/_cursor
      {"offset": 123}
Then it reads jobs starting from that offset. If there is nothing new, it long-polls at the tail.
The subtle part is committing the cursor. StreamTTS can have several active casts at once, and they do not necessarily finish in job order. A short job 10 can finish before a long job 7. The cursor can only move forward when every job up to that point has finished.
The worker uses a contiguous-done watermark:
    def advance_watermark():
        nonlocal committed
        moved = False
        while committed in done_above:
            done_above.discard(committed)
            committed += 1
            moved = True
        if moved:
            self._commit_offset(committed)
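To see how the watermark behaves when jobs finish out of order, here is a standalone version of the same idea. The `Watermark` class and its `finish` method are hypothetical names for this sketch, and committing the offset to a stream is replaced by simply returning it.

```python
class Watermark:
    """Tracks the lowest offset such that every job below it has finished."""

    def __init__(self, committed: int = 0) -> None:
        self.committed = committed  # offset the worker would resume from
        self.done_above = set()     # finished jobs at or above the watermark

    def finish(self, offset: int) -> int:
        self.done_above.add(offset)
        # Advance only across a contiguous run of finished jobs.
        while self.committed in self.done_above:
            self.done_above.discard(self.committed)
            self.committed += 1
        return self.committed


wm = Watermark(committed=7)
after_10 = wm.finish(10)  # short job 10 finishes first; 7 is still running
after_8 = wm.finish(8)
after_7 = wm.finish(7)    # gap closes: 7 and 8 are now contiguous
after_9 = wm.finish(9)    # 9 and the earlier 10 complete the run
```

The cursor stays at 7 until job 7 itself completes, so a crash in between replays jobs 7 through 10 rather than silently skipping the long one.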
If the process crashes, there is no special recovery protocol. On restart, the worker resumes from the last committed offset. Jobs after that offset are read again. Already-complete casts are skipped by checking whether their output stream ends in eos. Incomplete casts run again.
That is at-least-once delivery with idempotent output. It behaves like exactly-once for completed casts because eos is the durable completion marker. An alternative would be to mark a cast as done with a fencing token that acts as the terminal marker.
Retries can leave partial audio in the stream. The start record is therefore an attempt boundary:
    seq 0  meta
    seq 1  start  attempt 1
    seq 2  audio  sentence 0

    worker crashes

    seq 3  start  attempt 2
    seq 4  audio  sentence 0
    seq 5  audio  sentence 1
    seq 6  eos
The player can treat the latest start as the beginning of the playable attempt and ignore earlier partial audio.
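A minimal sketch of that rule, together with the eos completion check, in Python. Records are reduced to dicts carrying the `e` and `i` headers; `playable_audio` and `is_complete` are hypothetical helper names, and the real player runs in the browser.

```python
def playable_audio(events: list) -> list:
    """Keep only the audio records that follow the most recent start record."""
    last_start = -1
    for position, event in enumerate(events):
        if event["e"] == "start":
            last_start = position
    if last_start < 0:
        return []  # no attempt has begun yet
    return [e for e in events[last_start + 1:] if e["e"] == "audio"]


def is_complete(events: list) -> bool:
    """A cast is finished once its stream ends in eos."""
    return bool(events) and events[-1]["e"] == "eos"


# The stream from the crash example above.
events = [
    {"e": "meta"},
    {"e": "start"},
    {"e": "audio", "i": 0},  # partial output from attempt 1
    {"e": "start"},
    {"e": "audio", "i": 0},
    {"e": "audio", "i": 1},
    {"e": "eos"},
]
audio = playable_audio(events)
```

The partial sentence from the first attempt stays in the stream, but it sits before the latest start and is never scheduled for playback.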
Serving Readers
The public read path is intentionally narrower than the internal S2 API
[truncated for AI cost control]