RunInfra: Optimize any open model down to the kernel, deploy in 5 min
RunInfra automatically optimizes open-source AI models for production by selecting the best inference engine, GPU, and configuration through benchmarking and tuning, delivering a deployable stack with significant latency, throughput, and cost improvements.
Optimize open models for production - RunInfra
Backed byCombinator
Optimize openmodel productionOptimize any open model for productionanyfor
Paste a model. RunInfra benchmarks the options and picks the winner. Deploy it, or own the stack.
Describe the inference workload you want to deploy...
Auto engineAuto GPU
Example workloads
Every optimization ends with a result you can inspect and run.
You get a benchmark receipt and a runnable deployment kit. Nothing hidden.
Serving engine
Compared, not assumed
GPU target
Sized to the model
p95 latency
Benchmarked
Throughput
Measured per GPU
VRAM
Checked for fit
Cost
Tracked per run
GPU kernels
Tuned where supported
Deployment kit
Run it or export it
RUNINFRA
From prompt to a production stack you own
RunInfra compares, tunes, and benchmarks the stack. Deploy or export it.
Describe a Llama 3.1 70BQwen 2.5 7BDeepSeek V3Mistral 7BPhi-4Gemma 2 9BMixtral 8x7BWhisper Large V3Llama 3.1 70B workload in plain English.
RunInfra compares vLLMSGLangTensorRT-LLMvLLM-OmniSGLang and every other engine your model can run on.
It tunes speculative decodingkernel generationserver tuningquantizationKV cache reuseFlashAttention v2continuous batchingserver tuning where it helps, with no config to hand-write.
Deploy on NVIDIA H100NVIDIA H200NVIDIA B200NVIDIA A100NVIDIA L40SNVIDIA L4NVIDIA A100 and pay per million tokens, or export the stack and self-host.
New Session
Optimize Llama-3.1-8B-Instruct on vLLM for cheapest GPU with latency and VRAM checks
Capturing cost-first intent for Llama 3.1 8B on vLLM.
Intake updatedModel, engine, goal
ModelLlama 3.1 8B
EnginevLLM
GoalLow cost, latency checked
Requirements collected575ms
Model, engine, GPU target, latency goal
Plan drafted3.1s
10 execution phases prepared
Plan readyreview / 10 phases / ~23m
Optimize Llama 3.1 8B on vLLM for cheapest GPU with latency checks
Recommended path: vLLM on L4. Review the plan before execution.
Working...
Runbook generated from the workload
Llama 3.1 8B on vLLM, lowest viable cost
draft
Latency target
p95 under 60ms
VRAM budget
24 GB
Est. runtime
~23 min
Execution plan10 phases, 3 validated
01
AWQ int4 quantizationready
Weight-only int4, calibrated offline
02
FlashAttention v2ready
Fused attention kernels
03
Continuous batchingqueued
In-flight request scheduling
04
Paged KV cachequeued
fp8 cache in paged blocks
05
CUDA graph capturequeued
Replay the decode-step graph
06
Speculative decodingqueued
Draft model proposes tokens
07
Prefix cachingqueued
Reuse shared prompt prefixes
08
Tensor-parallel sizingready
Single GPU, no sharding
09
Warmup and autotunequeued
Lock kernel shapes pre-serve
10
Serving-config tunequeued
Batch size and concurrency
Review the plan, then run.vLLM, L4 to A100
Optimization run
running5/6 phases
Benchmarking the tuned config against cheaper GPU candidates.
Candidate set built0.6s
Serving config tuned2.9s
AWQ int4 quantization applied41.7s
FlashAttention v2 kernels compiled58.3s
Candidates benchmarked94.2s
Confirming winnerlive
Baseline vs optimizedbest candidate so far
MetricBaselineOptimizedDelta
P95 latency184ms38ms-79%
Time to first token120ms22ms-82%
Throughput45 tok/s142 tok/s+216%
VRAM28.4 GB12.1 GB-57%
Cost / 1M tokens$0.42$0.12-71%
GPU candidateswinner marked
GPU candidateCost / 1Mp95Pick
NVIDIA L4misses latency
$0.0884ms-
NVIDIA L40S
$0.1238ms
NVIDIA A100overspec
$0.2131ms-
Deploy or export the stack
Pick a target for the optimized L40S build.
ready
TargetSupported GPUs
Managed by RunInfra
selected
Fully managed endpoint, billed per million tokens
H100L40SA100L4
Your RunPod
Deploy to your own RunPod account
H100A100RTX 4090L40S
Modal
Serverless deploy on RunInfra Modal
H200H100A100L40S
Your Modal
Deploy to your own Modal workspace
H100A100L40ST4
Generated stackDockerfileserve.shruninfra.yaml
1FROM runinfra/vllm:0.6.3-l40s
2ENV MODEL=Llama-3.1-8B-Instruct
3ENV QUANTIZATION=awq_marlin
4COPY ./weights /models
5# optimized serving config
6CMD vllm serve $MODEL \
7--quantization awq_marlin \
8--kv-cache-dtype fp8 \
9--enable-prefix-caching \
10--max-num-seqs 256 \
11--gpu-memory-utilization 0.92
$0.12 / 1M38ms p95L40S
Deploy on thehardware you choose
Compare real GPU prices and deploy wherever fits. No lock-in, no rewrites.
RunInfra
L4$0.80/hr
L40S$1.95/hr
A100$2.50/hr
H100$3.95/hr
H200$4.54/hr
B200$6.25/hr
Serverless
Modal
L4$0.39/hr
L40S$0.99/hr
A100$1.39/hr
H100$2.89/hr
H200$4.39/hr
B200$5.89/hr
On-demand
RunPod
L4from $0.32/hr
A100from $0.75/hr
H100from $2.00/hr
H200from $3.29/hr
B200from $4.34/hr
Marketplace
Vast.ai
Your hardware
Local hardware
Own the winning stack before you ship
Not a black box. You get the measured stack to run, deploy, or export.
BenchmarkVerified
p99 latency64 ms64%
throughput3.4k tok/s2.8x
cost / 1M$0.1564%
Llama 3.1 8B, vLLM, L4 24GB
Benchmark receipt
Before and after, in one record you can reproduce.
p99 latency
throughput
VRAM
cost
reproduction
runinfra.yamlyaml
12345
engine: vLLMquantization: awq-int4kv_cache: fp8max_num_seqs: 256speculative: eagle
Optimized runtime config
Every serving setting RunInfra tuned. Read or change it.
engine flags
batch settings
quantization
kernel paths
kv cache
deployment-kitRunnable
Dockerfile
compose.yaml
k8s/
deployment.yaml
serve.py
benchmark.md
Exportable stack
A runnable repo you take with you.
Dockerfile
compose
launch scripts
reports
Livep50 38 ms
POST/v1/chat/completions200
L42 replicasautoscale
Managed endpoint
The same measured stack, hosted by us.
RunInfra Cloud
resource config
inspectable
portable
Why owning your AI stack matters
Data privacy and control
Keep sensitive workloads on infrastructure you choose.
Customization
Tune the model, runtime, and GPU to your workload.
Performance ownership
Real tuning, measured, not assumed.
Portability
Run it on our cloud, or export it to yours.
Supported across the stack
Open models, serving engines, GPUs, and the clouds you deploy to. RunInfra supports every layer of the inference stack.
Models
Models: Llama 3.3, Whisper, Qwen-Image, NV-Embed, Parler-TTS, Qwen2.5, Cosmos, Pixtral, EmbeddingGemma, RoBERTa, DeepSeek-V3, Sana, Parakeet, Mistral, Wan 2.1, GTE, Qwen2-VL, MMS-TTS, Qwen3 Reranker, Gemma 2, MusicGen, DeepSeek-VL2, Nemotron, FastPitch, PaliGemma, NV-RerankQA, Qwen2-Audio, Hermes 3, Canary, BERT, Llama 3.2 Vision, Qwen3 Embedding
LLMLlama 3.3ASRWhisperImageQwen-ImageEmbedNV-EmbedTTSParler-TTSLLMQwen2.5VideoCosmosVisionPixtralEmbedEmbeddingGemmaClassifyRoBERTaLLMDeepSeek-V3ImageSanaASRParakeetLLMMistralVideoWan 2.1EmbedGTEVisionQwen2-VLTTSMMS-TTSRerankQwen3 RerankerLLMGemma 2AudioMusicGenVisionDeepSeek-VL2LLMNemotronTTSFastPitchVisionPaliGemmaRerankNV-RerankQAAudioQwen2-AudioLLMHermes 3ASRCanaryClassifyBERTVisionLlama 3.2 VisionEmbedQwen3 EmbeddingLLMLlama 3.3ASRWhisperImageQwen-ImageEmbedNV-EmbedTTSParler-TTSLLMQwen2.5VideoCosmosVisionPixtralEmbedEmbeddingGemmaClassifyRoBERTaLLMDeepSeek-V3ImageSanaASRParakeetLLMMistralVideoWan 2.1EmbedGTEVisionQwen2-VLTTSMMS-TTSRerankQwen3 RerankerLLMGemma 2AudioMusicGenVisionDeepSeek-VL2LLMNemotronTTSFastPitchVisionPaliGemmaRerankNV-RerankQAAudioQwen2-AudioLLMHermes 3ASRCanaryClassifyBERTVisionLlama 3.2 VisionEmbedQwen3 Embedding
Engines
Engines: vLLM, SGLang, TensorRT-LLM, vLLM Omni, TEI, Transformers
EnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformersEnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformersEnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformersEnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformersEnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformersEnginevLLMEngineSGLangEngineTensorRT-LLMEnginevLLM OmniEngineTEIEngineTransformers
GPUs
GPUs: L4, A10, L40S, RTX 4090, A100, H100, H200, B200
24 GBL424 GBA1048 GBL40S24 GBRTX 409080 GBA10080 GBH100141 GBH200192 GBB20024 GBL424 GBA1048 GBL40S24 GBRTX 409080 GBA10080 GBH100141 GBH200192 GBB20024 GBL424 GBA1048 GBL40S24 GBRTX 409080 GBA10080 GBH100141 GBH200192 GBB20024 GBL424 GBA1048 GBL40S24 GBRTX 409080 GBA10080 GBH100141 GBH200192 GBB200
Clouds
Clouds: RunInfra Cloud, Modal, RunPod, Vast.ai
B200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.aiB200RunInfra CloudH100ModalA100RunPodRTX 4090Vast.ai
Common questions
Can't find what you're looking for? Get in touch
What is RunInfra?
Describe what you want to run. RunInfra picks compatible open models, benchmarks GPUs, tunes the runtime, and gives you a deploy-ready stack.
Deploy your first optimized model, measured before you ship
Describe the goal. RunInfra builds and optimizes the stack.
Start BuildingView Pricing
End-to-end encryption
Isolated GPU infrastructure
Zero data retention
SOC 2 Type II
RunInfraby RightNow
© 2026 RunInfra. All rights reserved.
All systems operational
Backed by
Combinator
AICPA Type II
SOC 2
Ask AI about RunInfra
Part of RightNow