2026-06-30 23:48 UTCIn-site rewrite4 min readUpdated: 2026-07-01 00:29 UTC

RunInfra: Optimize any open model down to the kernel, deploy in 5 min

RunInfra automatically optimizes open-source AI models for production by selecting the best inference engine, GPU, and configuration through benchmarking and tuning, delivering a deployable stack with significant latency, throughput, and cost improvements.

SourceHacker News AIAuthor: OsamaJaber

Optimize open models for production - RunInfra

Backed byCombinator

Optimize openmodel productionOptimize any open model for productionanyfor

Paste a model. RunInfra benchmarks the options and picks the winner. Deploy it, or own the stack.

Describe the inference workload you want to deploy...

Auto engineAuto GPU

Example workloads

Every optimization ends with a result you can inspect and run.

You get a benchmark receipt and a runnable deployment kit. Nothing hidden.

Serving engine

Compared, not assumed

GPU target

Sized to the model

p95 latency

Benchmarked

Throughput

Measured per GPU

VRAM

Checked for fit

Cost

Tracked per run

GPU kernels

Tuned where supported

Deployment kit

Run it or export it

RUNINFRA

From prompt to a production stack you own

RunInfra compares, tunes, and benchmarks the stack. Deploy or export it.

Describe a Llama 3.1 70BQwen 2.5 7BDeepSeek V3Mistral 7BPhi-4Gemma 2 9BMixtral 8x7BWhisper Large V3Llama 3.1 70B workload in plain English.

RunInfra compares vLLMSGLangTensorRT-LLMvLLM-OmniSGLang and every other engine your model can run on.

It tunes speculative decodingkernel generationserver tuningquantizationKV cache reuseFlashAttention v2continuous batchingserver tuning where it helps, with no config to hand-write.

Deploy on NVIDIA H100NVIDIA H200NVIDIA B200NVIDIA A100NVIDIA L40SNVIDIA L4NVIDIA A100 and pay per million tokens, or export the stack and self-host.

New Session

Optimize Llama-3.1-8B-Instruct on vLLM for cheapest GPU with latency and VRAM checks

Capturing cost-first intent for Llama 3.1 8B on vLLM.

Intake updatedModel, engine, goal

ModelLlama 3.1 8B

EnginevLLM

GoalLow cost, latency checked

Requirements collected575ms

Model, engine, GPU target, latency goal

Plan drafted3.1s

10 execution phases prepared

Plan readyreview / 10 phases / ~23m

Optimize Llama 3.1 8B on vLLM for cheapest GPU with latency checks

Recommended path: vLLM on L4. Review the plan before execution.

Working...

Runbook generated from the workload

Llama 3.1 8B on vLLM, lowest viable cost

draft

Latency target

p95 under 60ms

VRAM budget

24 GB

Est. runtime

~23 min

Execution plan10 phases, 3 validated

AWQ int4 quantizationready

Weight-only int4, calibrated offline

FlashAttention v2ready

Fused attention kernels

Continuous batchingqueued

In-flight request scheduling

Paged KV cachequeued

fp8 cache in paged blocks

CUDA graph capturequeued

Replay the decode-step graph

Speculative decodingqueued

Draft model proposes tokens

Prefix cachingqueued

Reuse shared prompt prefixes

Tensor-parallel sizingready

Single GPU, no sharding

Warmup and autotunequeued

Lock kernel shapes pre-serve

Serving-config tunequeued

Batch size and concurrency

Review the plan, then run.vLLM, L4 to A100

Optimization run

running5/6 phases

Benchmarking the tuned config against cheaper GPU candidates.

Candidate set built0.6s

Serving config tuned2.9s

AWQ int4 quantization applied41.7s

FlashAttention v2 kernels compiled58.3s

Candidates benchmarked94.2s

Confirming winnerlive

Baseline vs optimizedbest candidate so far

MetricBaselineOptimizedDelta

P95 latency184ms38ms-79%

Time to first token120ms22ms-82%

Throughput45 tok/s142 tok/s+216%

VRAM28.4 GB12.1 GB-57%

Cost / 1M tokens$0.42$0.12-71%

GPU candidateswinner marked

GPU candidateCost / 1Mp95Pick

NVIDIA L4misses latency

$0.0884ms-

NVIDIA L40S

$0.1238ms

NVIDIA A100overspec

$0.2131ms-

Deploy or export the stack

Pick a target for the optimized L40S build.

ready

TargetSupported GPUs

Managed by RunInfra

selected

Fully managed endpoint, billed per million tokens

H100L40SA100L4

Your RunPod

Deploy to your own RunPod account

H100A100RTX 4090L40S

Modal

Serverless deploy on RunInfra Modal

H200H100A100L40S

Your Modal

Deploy to your own Modal workspace

H100A100L40ST4

Generated stackDockerfileserve.shruninfra.yaml

1FROM runinfra/vllm:0.6.3-l40s

2ENV MODEL=Llama-3.1-8B-Instruct

3ENV QUANTIZATION=awq_marlin

4COPY ./weights /models

5# optimized serving config

6CMD vllm serve $MODEL \

7--quantization awq_marlin \

8--kv-cache-dtype fp8 \

9--enable-prefix-caching \

10--max-num-seqs 256 \

11--gpu-memory-utilization 0.92

$0.12 / 1M38ms p95L40S

Deploy on thehardware you choose

Compare real GPU prices and deploy wherever fits. No lock-in, no rewrites.

RunInfra

L4$0.80/hr

L40S$1.95/hr

A100$2.50/hr

H100$3.95/hr

H200$4.54/hr

B200$6.25/hr

Serverless

Modal

L4$0.39/hr

L40S$0.99/hr

A100$1.39/hr

H100$2.89/hr

H200$4.39/hr

B200$5.89/hr

On-demand

RunPod

L4from $0.32/hr

A100from $0.75/hr

H100from $2.00/hr

H200from $3.29/hr

B200from $4.34/hr

Marketplace

Vast.ai

Your hardware

Local hardware

Own the winning stack before you ship

Not a black box. You get the measured stack to run, deploy, or export.

BenchmarkVerified

p99 latency64 ms64%

throughput3.4k tok/s2.8x

cost / 1M$0.1564%

Llama 3.1 8B, vLLM, L4 24GB

Benchmark receipt

Before and after, in one record you can reproduce.

p99 latency

throughput

VRAM

cost

reproduction

runinfra.yamlyaml

12345

engine: vLLMquantization: awq-int4kv_cache: fp8max_num_seqs: 256speculative: eagle

Optimized runtime config

Every serving setting RunInfra tuned. Read or change it.

engine flags

batch settings

quantization

kernel paths

kv cache

deployment-kitRunnable

Dockerfile

compose.yaml

k8s/

deployment.yaml

serve.py

benchmark.md

Exportable stack

A runnable repo you take with you.

Dockerfile

compose

launch scripts

reports

Livep50 38 ms

POST/v1/chat/completions200

L42 replicasautoscale

Managed endpoint

The same measured stack, hosted by us.

RunInfra Cloud

resource config

inspectable

portable

Why owning your AI stack matters

Data privacy and control

Keep sensitive workloads on infrastructure you choose.

Customization

Tune the model, runtime, and GPU to your workload.

Performance ownership

Real tuning, measured, not assumed.

Portability

Run it on our cloud, or export it to yours.

Supported across the stack

Open models, serving engines, GPUs, and the clouds you deploy to. RunInfra supports every layer of the inference stack.

Models

Models: Llama 3.3, Whisper, Qwen-Image, NV-Embed, Parler-TTS, Qwen2.5, Cosmos, Pixtral, EmbeddingGemma, RoBERTa, DeepSeek-V3, Sana, Parakeet, Mistral, Wan 2.1, GTE, Qwen2-VL, MMS-TTS, Qwen3 Reranker, Gemma 2, MusicGen, DeepSeek-VL2, Nemotron, FastPitch, PaliGemma, NV-RerankQA, Qwen2-Audio, Hermes 3, Canary, BERT, Llama 3.2 Vision, Qwen3 Embedding

LLMLlama 3.3ASRWhisperImageQwen-ImageEmbedNV-EmbedTTSParler-TTSLLMQwen2.5VideoCosmosVisionPixtralEmbedEmbeddingGemmaClassifyRoBERTaLLMDeepSeek-V3ImageSanaASRParakeetLLMMistralVideoWan 2.1EmbedGTEVisionQwen2-VLTTSMMS-TTSRerankQwen3 RerankerLLMGemma 2AudioMusicGenVisionDeepSeek-VL2LLMNemotronTTSFastPitchVisionPaliGemmaRerankNV-RerankQAAudioQwen2-AudioLLMHermes 3ASRCanaryClassifyBERTVisionLlama 3.2 VisionEmbedQwen3 EmbeddingLLMLlama 3.3ASRWhisperImageQwen-ImageEmbedNV-EmbedTTSParler-TTSLLMQwen2.5VideoCosmosVisionPixtralEmbedEmbeddingGemmaClassifyRoBERTaLLMDeepSeek-V3ImageSanaASRParakeetLLMMistralVideoWan 2.1EmbedGTEVisionQwen2-VLTTSMMS-TTSRerankQwen3 RerankerLLMGemma 2AudioMusicGenVisionDeepSeek-VL2LLMNemotronTTSFastPitchVisionPaliGemmaRerankNV-RerankQAAudioQwen2-AudioLLMHermes 3ASRCanaryClassifyBERTVisionLlama 3.2 VisionEmbedQwen3 Embedding

Engines

Engines: vLLM, SGLang, TensorRT-LLM, vLLM Omni, TEI, Transformers

EnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformersEnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformersEnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformersEnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformersEnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformersEnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformers

GPUs

GPUs: L4, A10, L40S, RTX 4090, A100, H100, H200, B200

24 GBL424 GBA1048 GBL40S24 GBRTX 409080 GBA10080 GBH100141 GBH200192 GBB20024 GBL424 GBA1048 GBL40S24 GBRTX 409080 GBA10080 GBH100141 GBH200192 GBB20024 GBL424 GBA1048 GBL40S24 GBRTX 409080 GBA10080 GBH100141 GBH200192 GBB20024 GBL424 GBA1048 GBL40S24 GBRTX 409080 GBA10080 GBH100141 GBH200192 GBB200

Clouds

Clouds: RunInfra Cloud, Modal, RunPod, Vast.ai

B200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.ai

Common questions

Can't find what you're looking for? Get in touch

What is RunInfra?

Describe what you want to run. RunInfra picks compatible open models, benchmarks GPUs, tunes the runtime, and gives you a deploy-ready stack.

Deploy your first optimized model, measured before you ship

Describe the goal. RunInfra builds and optimizes the stack.

Start BuildingView Pricing

End-to-end encryption

Isolated GPU infrastructure

Zero data retention

SOC 2 Type II

RunInfraby RightNow

All systems operational

Backed by

Combinator

AICPA Type II

SOC 2

Ask AI about RunInfra

Part of RightNow