2026-06-04 14:42 UTC | Updated: 2026-06-30 13:03 UTC

Show HN: Black-box API bug detection across 7 AI systems

KushoAI published a benchmark evaluating 7 AI systems on detecting planted API bugs given only a JSON schema and one valid sample payload. KushoAI ranked first with a composite score of 0.83 and showed its largest margin on complex bugs (76% detected, versus 53% for the strongest coding-agent workflow). The report finds that prompt engineering improves breadth but not cross-field business-logic reasoning, and that test composition matters more than volume.

Source: Hacker News AI | Author: riyajoshi

Executive Summary

AI coding tools can generate API tests quickly. The harder question is whether those tests find bugs.

This report evaluates that question across 20 live API scenarios spanning seven application domains, with 97 planted functional bugs across three complexity tiers. Each system receives only a JSON schema and one valid sample payload, then must generate API test cases that expose failures in a live reference API. No source code. No documentation beyond the schema. No hints about where failures are planted.

The evaluation uses APIEval-20 v1.0, a black-box benchmark contributed by KushoAI. Because KushoAI is also one of the evaluated systems, this report includes the methodology, workflow definitions, repeated-run setup, and robustness checks so readers can understand where the performance difference comes from.

Seven systems were compared across three groups: general-purpose LLMs, coding agents, and KushoAI. The report also compares workflow modes engineering teams commonly try in practice: one-shot prompting, structured test-strategy prompting, prompt chaining, native coding-agent workflows, and native API test generation.

Simple structural bugs are no longer a meaningful differentiator. Most systems can generate missing-field, null, empty-array, and wrong-type tests. These tests are useful, but they are also the easiest class of failures to discover from the schema alone.

Prompt engineering improves parameter coverage, formatting, and field-level negative tests. It makes suites broader, more explicit, and easier to parse, but it does not consistently make general coding tools reason about cross-field business states.

The gap opens on complex bugs. KushoAI detects 76% of complex planted bugs in this evaluation, compared with 53% for the strongest coding-agent workflow and 34% for the strongest general-purpose LLM.

KushoAI ranks first on the primary score and across all bug complexity tiers. The largest margin appears on the metric most tied to production risk: cross-field and business-logic bug detection.

Key Findings

Plausible-looking test suites can still miss bugs.

Several systems generated suites with readable names, valid payloads, and broad field coverage. The difference appeared only after running those suites against live APIs with known planted failures.

Simple schema-level tests are now table stakes.

Most systems can generate tests for missing fields, null values, empty arrays, and wrong types. These tests are useful, but they are not enough to evaluate whether a tool can find production-relevant failures.

Prompting helps breadth more than depth.

Structured prompts improved parameter coverage, JSON validity, and field-level negative tests. They did not consistently produce cross-field business-logic tests.

Complex bugs separate field mutation from API test design.

The hardest bugs required combining individually valid fields into invalid states, such as invalid refund state, role hierarchy violations, conflicting recurrence rules, or notification channels enabled before verification.

Test composition matters more than test volume.

Coding-agent workflows often generated many tests. The gap came from whether those tests explored meaningful field interactions.

Consistency matters for CI/CD adoption.

KushoAI showed the lowest run-to-run variance among all evaluated systems. For teams integrating generated tests into automated pipelines, output stability matters as much as peak performance.

Why API Bug Detection Needs a Different Evaluation

Most API test generation comparisons ask whether a tool can produce tests. That is too low a bar. Any current LLM can generate a list of plausible tests from an API schema.

The test names may sound comprehensive, and the payloads may be syntactically valid, but that does not tell an engineering team whether the suite actually reduces risk. Traditional coding benchmarks usually measure properties like code correctness, task completion, or whether generated tests execute. API testing has a different core objective: finding behavior that violates the intended contract of a live service.

A more useful evaluation question is narrower: Given only the request schema and one valid sample payload, with no source code, no documentation beyond the schema, and no hints about planted failures, can an AI system generate tests that trigger planted functional bugs in a live API?

That is the task evaluated in this report. The evaluation is end to end: the agent reads the schema and sample and constructs a test suite, the suite is executed against live reference implementations, and the score is determined by which planted bugs are triggered.

This black-box constraint reflects a common practitioner reality. Teams often receive an OpenAPI schema or request payload examples before they have complete documentation, test data, or implementation context. In that setting, a useful testing agent has to infer likely constraints from field names, data types, descriptions, nested structure, and the operation being performed.

The benchmark contains 20 scenarios across e-commerce, payments, authentication, user management, scheduling, notifications, and search/filtering. Across those scenarios, it contains 97 planted functional bugs: 28 simple, 35 moderate, and 34 complex.

The benchmark does not try to reproduce every production condition. It isolates one capability that matters in production: whether an AI system can generate high-signal API tests from limited request-shape context. That makes the comparison controlled, repeatable, and easier to inspect.

Methodology

Every system received the same two inputs per scenario: a JSON schema and one valid sample payload. No implementation code, response schema, logs, changelog, production examples, or planted-bug hints were provided.

Each system had to produce a JSON array of test cases. Each case included a test_name and a complete request payload. No expected outcomes were required; the evaluator determines whether a test triggers a planted bug by running it against the live API.
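The output shape described above can be checked mechanically before a suite is run. The following is a minimal sketch, not the benchmark's own validator: the report names only test_name explicitly, so the key "payload" and the example field names are assumptions made for illustration.

```python
import json
from typing import Any


def validate_suite(raw: str) -> list[dict[str, Any]]:
    """Parse a generated suite and check the shape described in the report.

    Assumption: the request payload lives under the key "payload".
    The report only states that each case has a test_name and a
    complete request payload.
    """
    suite = json.loads(raw)
    if not isinstance(suite, list):
        raise ValueError("suite must be a JSON array")
    for i, case in enumerate(suite):
        if not isinstance(case, dict):
            raise ValueError(f"case {i} is not a JSON object")
        name = case.get("test_name")
        if not isinstance(name, str) or not name.strip():
            raise ValueError(f"case {i} has no usable test_name")
        if not isinstance(case.get("payload"), dict):
            raise ValueError(f"case {i} has no request payload object")
    return suite


# Hypothetical suite with one case; field names are illustrative only.
example = """
[
  {"test_name": "refund exceeds original amount",
   "payload": {"transaction_id": "txn_123",
               "original_amount": 50.0,
               "refund_amount": 75.0}}
]
"""
cases = validate_suite(example)
print(len(cases), cases[0]["test_name"])
```

Note that the case carries no expected outcome, which matches the task definition: whether the payload reaches a planted bug is decided by the evaluator, not by the generated suite.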

The schema and sample payload together represent the minimum useful context a tester might have. The schema tells the agent what fields exist and what constraints are explicit. The sample payload shows how the API is normally used. The benchmark intentionally withholds everything else so that systems cannot rely on implementation leakage or hand-written documentation that points directly at the failure modes.

This keeps the task focused on test generation rather than assertion writing. A system is not rewarded for writing a confident expected outcome unless the request payload actually reaches a planted bug.

| Category | Systems | How they were used |
|---|---|---|
| General-purpose LLMs | GPT-5, Claude Sonnet 4.6, Gemini 2.5 Pro | API/chat mode with a structured JSON-output prompt. |
| Coding agents | Claude Code, Cursor, GitHub Copilot | Native agentic workflow with schema files and prompt instructions. |
| API testing agent | KushoAI | Native API test generation workflow. |

The systems evaluated here update frequently. Results should be read as a point-in-time evaluation of the specific models, product modes, prompts, and workflows used during this study.

Workflow Modes Compared

The workflow comparison is included because teams rarely stop after a single prompt. In practice, engineers try a one-shot prompt, then make the prompt more explicit, then ask the tool to review its own gaps, then build local scripts around the process.

| Workflow mode | Description | Systems included |
|---|---|---|
| One-shot prompt | Generate tests from the schema and sample payload in a single pass. | General LLMs and coding-agent baseline runs |
| Structured strategy prompt | Adds explicit instructions for required fields, invalid types, formats, enums, boundaries, and negative cases. | General LLMs and coding agents |
| Per-scenario prompt chain | One prompt to infer the strategy, one to generate tests, one to review gaps, and one to emit final JSON. | Coding agents |
| Native coding-agent workflow | Agent reads local scenario files, writes suites to disk, and revises after format validation. | Claude Code, Cursor, Copilot |
| KushoAI native workflow | Purpose-built API testing generation with internal field analysis and cross-field candidate construction. | KushoAI |

For each non-KushoAI system, the main leaderboard reports the strongest workflow observed across the tested modes. This gives general LLMs and coding agents the benefit of structured prompting and iteration rather than comparing KushoAI only against one-shot outputs.

Bug Complexity Tiers

| Tier | Definition | Examples |
|---|---|---|
| Simple | No semantic domain understanding required. | Missing required field, null, wrong type, empty array. |
| Moderate | Requires understanding field meaning or documented constraints. | Invalid currency code, malformed email, out-of-range rating, invalid enum. |
| Complex | Requires reasoning about relationships between fields or operation semantics. | Mutually exclusive fields, refund amount greater than original transaction, date range where end precedes start. |
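To make the complex tier concrete, the sketch below builds two cross-field cases from a valid sample payload. Each mutated field stays individually well-formed; only the combination is invalid. The payload and its field names are hypothetical and are not taken from the benchmark scenarios, and the "payload" key is an assumed name.

```python
import copy
from datetime import date, timedelta

# Hypothetical valid sample payload; field names are illustrative only.
sample = {
    "original_amount": 100.0,
    "refund_amount": 20.0,
    "start_date": "2026-01-10",
    "end_date": "2026-01-20",
}


def refund_exceeds_original(payload: dict) -> dict:
    """Both amounts remain valid numbers, but the refund exceeds the original."""
    mutated = copy.deepcopy(payload)
    mutated["refund_amount"] = mutated["original_amount"] + 1.0
    return {
        "test_name": "refund amount greater than original transaction",
        "payload": mutated,
    }


def end_precedes_start(payload: dict) -> dict:
    """Both dates remain valid ISO dates, but the range is inverted."""
    mutated = copy.deepcopy(payload)
    start = date.fromisoformat(mutated["start_date"])
    mutated["end_date"] = (start - timedelta(days=1)).isoformat()
    return {
        "test_name": "date range where end precedes start",
        "payload": mutated,
    }


complex_cases = [refund_exceeds_original(sample), end_precedes_start(sample)]
for case in complex_cases:
    print(case["test_name"], case["payload"])
```

A field-by-field mutator (null, wrong type, empty string) would not produce either case, because neither field is invalid on its own.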

Scoring Formula

Final Score = 0.70 x Bug Detection Rate + 0.20 x Coverage Score + 0.10 x Efficiency Score

Bug Detection Rate = bugs_triggered / total_planted_bugs

Coverage Score = param_coverage

Efficiency Score = min(1, bugs_found / number_of_tests)
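The formula can be written out directly. This is a sketch of the arithmetic as stated, not the benchmark's scoring code. Two points are assumptions: bugs_triggered and bugs_found are treated as the same count, and an empty suite is given an efficiency of 0.

```python
def bug_detection_rate(bugs_triggered: int, total_planted_bugs: int) -> float:
    if total_planted_bugs <= 0:
        raise ValueError("total_planted_bugs must be positive")
    return bugs_triggered / total_planted_bugs


def efficiency_score(bugs_found: int, number_of_tests: int) -> float:
    # Assumption: an empty suite scores 0 rather than raising.
    if number_of_tests <= 0:
        return 0.0
    return min(1.0, bugs_found / number_of_tests)


def final_score(detection: float, coverage: float, efficiency: float) -> float:
    # Weights as given in the report: 0.70 / 0.20 / 0.10.
    return 0.70 * detection + 0.20 * coverage + 0.10 * efficiency


# Component values from the KushoAI row of the Overall Results table.
kusho = final_score(detection=0.89, coverage=1.00, efficiency=0.14)
print(round(kusho, 3))
```

Recomputing from the rounded table columns gives roughly 0.837 for KushoAI against a reported 0.83, and other rows can differ from their reported final score by about 0.01. The report does not say why; rounding of the published columns or averaging the score per run are plausible explanations, but that is a guess.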

Bug detection is weighted most heavily because tests that do not find bugs have limited engineering value, even if they look broad. Coverage rewards suites that exercise each top-level schema field at least once. It is intentionally simple and should not be read as a proxy for edge-case depth, business-logic coverage, or bug-finding quality. Efficiency penalizes suites that bury a few useful cases inside a large amount of redundant noise.
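One way to compute a coverage number of this kind is sketched below. The report does not define what counts as exercising a field, so this sketch assumes a top-level field is exercised when at least one test omits it or changes its value relative to the sample payload. The function name, the "payload" key, and the example fields are all hypothetical.

```python
_MISSING = object()  # sentinel: distinguishes an omitted field from a null one


def param_coverage(schema_fields: list[str], sample: dict, suite: list[dict]) -> float:
    """Fraction of top-level schema fields exercised by at least one test.

    Assumption: "exercised" means the field's value differs from the
    sample payload in some test, including being omitted or added.
    """
    if not schema_fields:
        return 0.0
    exercised = set()
    for case in suite:
        payload = case["payload"]
        for field in schema_fields:
            if payload.get(field, _MISSING) != sample.get(field, _MISSING):
                exercised.add(field)
    return len(exercised) / len(schema_fields)


# Illustrative inputs only.
fields = ["amount", "currency", "note"]
sample_payload = {"amount": 10, "currency": "USD"}
tests = [
    {"test_name": "negative amount", "payload": {"amount": -1, "currency": "USD"}},
    {"test_name": "missing currency", "payload": {"amount": 10}},
]
coverage = param_coverage(fields, sample_payload, tests)
print(round(coverage, 2))
```

Here amount and currency are exercised and note is not, so coverage is two thirds. The same suite could reach 1.00 by touching note once without finding any additional bug, which is why the metric says little about depth.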

Overall Results

| Rank | System | Category | Best workflow | Bug detect rate | Coverage | Efficiency | Final score | Std dev across runs |
|---|---|---|---|---|---|---|---|---|
| 1 | KushoAI | API testing agent | Native KushoAI | 0.89 | 1.00 | 0.14 | 0.83 | +/-0.03 |
| 2 | Claude Code | Coding agent | Prompt chain | 0.76 | 0.98 | 0.18 | 0.76 | +/-0.05 |
| 3 | Cursor | Coding agent | Prompt chain | 0.70 | 0.95 | 0.16 | 0.70 | +/-0.07 |
| 4 | GitHub Copilot | Coding agent | Structured prompt | 0.64 | 0.92 | 0.14 | 0.64 | +/-0.08 |
| 5 | Claude Sonnet 4.6 | General LLM | Structured prompt | 0.60 | 0.90 | 0.20 | 0.62 | +/-0.09 |
| 6 | GPT-5 | General LLM | Structured prompt | 0.56 | 0.88 | 0.18 | 0.58 | +/-0.08 |
| 7 | Gemini 2.5 Pro | General LLM | Structured prompt | 0.49 | 0.82 | 0.17 | 0.51 | +/-0.10 |

Mean Final Score (chart): KushoAI 0.83, Claude Code 0.76, Cursor 0.70, Copilot 0.64, Sonnet 4.6 0.62, GPT-5 0.58, Gemini 2.5 Pro 0.51.

Coverage is near-saturated across the leading systems, so the leaderboard should not be read as a coverage story. In this report, Coverage measures whether generated suites exercise top-level schema fields. It does not measure edge-case depth, cross-field reasoning, or business-logic coverage. The separation comes from bug detection, complex-bug detection, and run-to-run consistency. KushoAI has the highest bug detection rate, the strongest complex-bug rate, and the lowest standard deviation across runs.

KushoAI achieved full Coverage (1.00) across all 20 scenarios, meaning its generated suites exercised every top-level schema field in every scenario. This reflects the native workflow's schema traversal approach rather than selective field targeting.

One metric worth contextualizing is Efficiency, defined here as the ratio of bugs found to tests generated. KushoAI's Efficiency score (0.14) reflects that its native workflow generates more tests per scenario than general LLMs, which increases overall exploration but lowers the bugs-per-test ratio. Because the final score weights bug detection at 0.70, this tradeoff is intentional: finding more bugs with more tests is preferable to finding fewer bugs with fewer tests. Teams optimizing for CI runtime can apply deduplication or suite trimming after the initial generation pass.
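The deduplication step mentioned above could be as simple as dropping cases whose request payloads are identical. The report does not describe a specific trimming method, so this is only one possible sketch: it treats two payloads as duplicates when their canonical JSON serializations match, and assumes the payload is stored under a "payload" key.

```python
import json


def dedupe_suite(suite: list[dict]) -> list[dict]:
    """Keep the first test case for each distinct request payload."""
    seen: set[str] = set()
    unique: list[dict] = []
    for case in suite:
        # sort_keys makes the serialization independent of key order,
        # including inside nested objects.
        key = json.dumps(case["payload"], sort_keys=True, separators=(",", ":"))
        if key not in seen:
            seen.add(key)
            unique.append(case)
    return unique


# Illustrative suite: the first two payloads differ only in key order.
generated = [
    {"test_name": "empty currency", "payload": {"amount": 10, "currency": ""}},
    {"test_name": "blank currency code", "payload": {"currency": "", "amount": 10}},
    {"test_name": "negative amount", "payload": {"amount": -5, "currency": "USD"}},
]
trimmed = dedupe_suite(generated)
print([case["test_name"] for case in trimmed])
```

Exact-match deduplication cannot lower the number of bugs found, since a removed case sends the same request as one that is kept, so it can only raise the bugs-per-test ratio. More aggressive trimming, such as dropping near-duplicates, would need to be checked against detection results.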

Coverage and bug detection diverge. A model can touch many fields and still miss the failure. For example, a suite may test currency with an empty string an

[truncated for AI cost control]