Kimi Vendor Verifier

Kimi open-sources the Vendor Verifier (KVV) to help users verify the accuracy of inference implementations of open-source models. It includes six critical benchmarks for detecting common deployment issues and encourages infrastructure providers to fix root causes.

Source: Kimi Blog

Rebuilding the "Chain of Trust": Kimi Vendor Verifier

Alongside the release of the Kimi K2.6 model, we are open-sourcing the Kimi Vendor Verifier (KVV) project, designed to help users of open-source models verify the accuracy of their inference implementations.

We are doing this not as an afterthought, but because we learned the hard way that open-sourcing a model is only half the battle. The other half is ensuring it runs correctly everywhere else.

Official Evaluation Results

The official Kimi API K2VV evaluation results are available as the reference for calculating the F1 score.

Why We Built KVV

From Isolated Incidents to Systemic Issues

Since the release of K2 Thinking, we have received frequent community feedback about anomalous benchmark scores. Our investigation confirmed that a significant portion of these cases stemmed from misused decoding parameters. To mitigate this immediately, we built our first line of defense at the API level: Thinking mode enforces Temperature=1.0 and TopP=0.95, with mandatory validation that thinking content is correctly passed back.
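As an illustration of this kind of API-level check, here is a minimal sketch, not the actual Kimi API implementation. It assumes an OpenAI-style request dictionary with `temperature`, `top_p`, and `messages` keys, and it assumes thinking content travels in a `reasoning_content` field on assistant messages. The function name and the field names are hypothetical; only the two numeric values come from the text above.

```python
# Values stated in the post for Thinking mode.
REQUIRED_TEMPERATURE = 1.0
REQUIRED_TOP_P = 0.95


def check_thinking_params(request, tol=1e-9):
    """Return a list of violations; an empty list means the request passes.

    Assumption: `request` is an OpenAI-style chat request dict, and prior
    assistant turns carry their thinking in a `reasoning_content` field.
    """
    errors = []

    temperature = request.get("temperature")
    if temperature is None or abs(temperature - REQUIRED_TEMPERATURE) > tol:
        errors.append(f"temperature must be {REQUIRED_TEMPERATURE}, got {temperature}")

    top_p = request.get("top_p")
    if top_p is None or abs(top_p - REQUIRED_TOP_P) > tol:
        errors.append(f"top_p must be {REQUIRED_TOP_P}, got {top_p}")

    # Every earlier assistant turn should pass its thinking content back.
    for index, message in enumerate(request.get("messages", [])):
        if message.get("role") == "assistant" and not message.get("reasoning_content"):
            errors.append(f"messages[{index}] is missing reasoning_content")

    return errors
```

A caller would reject the request whenever the returned list is non-empty, which keeps the error message specific enough for a client to fix the offending parameter.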

However, subtler anomalies soon raised alarms. In a specific evaluation on LiveBenchmark, we observed a stark contrast between results from third-party APIs and the official API. After extensive testing across infrastructure providers, we found that this gap is widespread.

This exposed a deeper problem in the open-source model ecosystem: The more open the weights are, and the more diverse the deployment channels become, the less controllable the quality becomes.

If users cannot distinguish between "model capability defects" and "engineering implementation deviations," trust in the open-source ecosystem will inevitably collapse.

Our Solution

Six Critical Benchmarks (selected to expose specific infra failures):

Pre-Verification: Validates that API parameter constraints (temperature, top_p, etc.) are correctly enforced. All tests must pass before proceeding to benchmark evaluation.

OCRBench: A 5-minute smoke test for multimodal pipelines.

MMMU Pro: Verifies vision input preprocessing by testing diverse visual inputs.

AIME2025: Long-output stress test. Catches KV cache bugs and quantization degradation that short benchmarks hide.

K2VV ToolCall: Measures trigger consistency (F1) and JSON Schema accuracy. Tool errors compound in agents; we catch them early.

SWE-Bench: Full agentic coding test. (Not open-sourced because it depends on a sandbox environment.)

Upstream Fix: We work directly with the vLLM, SGLang, and KTransformers communities to fix root causes, not just detect symptoms.

Pre-Release Validation: Rather than waiting for post-deployment complaints, we provide early access to test models. This lets infrastructure providers validate their stacks before users encounter issues.

Continuous Benchmarking: We will maintain a public leaderboard of vendor results. This transparency encourages vendors to prioritize accuracy.
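To make the K2VV ToolCall metrics above concrete, here is a rough sketch of how a trigger-consistency F1 and a schema accuracy could be computed. It assumes the official API's per-request decision (tool call triggered or not) serves as the reference label, and it uses a simplified stand-in for JSON Schema validation (required keys with expected Python types). The function names and data layout are hypothetical and are not taken from the KVV code.

```python
import json


def trigger_f1(reference, candidate):
    """F1 of a vendor's tool-call triggers against reference triggers.

    Both arguments are equal-length lists of booleans, one per request:
    True means a tool call was triggered for that request.
    """
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    tp = sum(1 for r, c in zip(reference, candidate) if r and c)
    fp = sum(1 for r, c in zip(reference, candidate) if not r and c)
    fn = sum(1 for r, c in zip(reference, candidate) if r and not c)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def schema_accuracy(argument_strings, required):
    """Fraction of tool-call argument strings that parse as a JSON object
    and contain every required key with the expected Python type.

    Simplification: a real check would validate against the full JSON Schema.
    """
    if not argument_strings:
        return 0.0
    valid = 0
    for raw in argument_strings:
        try:
            args = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(args, dict) and all(
            key in args and isinstance(args[key], expected)
            for key, expected in required.items()
        ):
            valid += 1
    return valid / len(argument_strings)
```

For example, with reference triggers `[True, True, False, True]` and vendor triggers `[True, False, True, True]`, precision and recall are both 2/3, so the F1 is 2/3.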

Testing Cost Estimation

We validated the full evaluation workflow on two NVIDIA H20 8-GPU servers; sequential execution took approximately 15 hours. To improve efficiency, the scripts are optimized for long-running inference, with streaming inference, automatic retry, and checkpoint resumption.
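The retry and checkpoint-resumption ideas can be sketched as follows. This is an illustrative pattern, not the KVV scripts themselves: the function name, the JSON Lines checkpoint format, and the exponential backoff policy are all assumptions, and streaming inference is left out for brevity.

```python
import json
import os
import time


def run_with_checkpoint(items, evaluate, checkpoint_path, max_retries=3, backoff_s=1.0):
    """Evaluate (item_id, payload) pairs, skipping ids already checkpointed.

    Each finished result is appended to a JSON Lines file, so an interrupted
    multi-hour run can resume without repeating completed items.
    """
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    record = json.loads(line)
                    done[record["id"]] = record["result"]

    with open(checkpoint_path, "a", encoding="utf-8") as f:
        for item_id, payload in items:
            if item_id in done:
                continue  # finished in a previous run
            for attempt in range(1, max_retries + 1):
                try:
                    result = evaluate(payload)
                    break
                except Exception:
                    if attempt == max_retries:
                        raise  # give up after the last attempt
                    time.sleep(backoff_s * 2 ** (attempt - 1))
            done[item_id] = result
            f.write(json.dumps({"id": item_id, "result": result}) + "\n")
            f.flush()  # persist progress before moving on
    return done
```

Writing one record per completed item keeps the cost of a crash to a single in-flight request, which matters when a full sequential run takes many hours.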

An Open Invitation

Weights are open. The knowledge to run them correctly must be too.

We are expanding vendor coverage and seeking lighter agentic tests. Contact Us: [email protected]
