
Show HN: GEDD – Find what your AI agent gets wrong (before your users do)

GEDD is an open-source tool that lets domain experts systematically discover AI agent failure modes without a pre-existing evaluation rubric. Through a conversational process it produces a production-ready eval pipeline in about 90 minutes, capturing domain-specific errors such as dosage confusion or coverage hallucination. The tool is based on Grounded Theory methodology, has been tested across four domains, and ships with 17 demo scenarios.

Source: Hacker News AI. Author: balasvce19855


You shipped an AI agent. Now you need to prove it works — to your CEO, to compliance, to the team that inherits it. The agent fails in ways no rubric anticipated, and the eval tools expect you to know what to measure before you've seen what breaks.

GEDD is the tool for the stage before you have a rubric. A domain expert has a conversation, and 90 minutes later you have a production eval pipeline.

The eval pipeline is the product. The agent is just the thing it produces.

📖 Why Grounded Theory for reliable AI agents: the long-form argument behind this repo.

The Pipeline

    flowchart TD
        subgraph DE["🧑‍💼 DOMAIN EXPERT — /gedd in Claude Code"]
            direction TB
            S1["1️⃣ Define Agent"]
            S2["2️⃣ System Prompt"]
            S3["3️⃣ Deploy to AgentCore"]
            S4["4️⃣ Golden Queries"]
            S5["5️⃣ Annotate & Judge"]
            S1 ==> S2 ==> S3 ==> S4 ==> S5
        end

        S5 ==>|"📄 session.json"| HANDOFF:::handoff
        HANDOFF ==> S6

        subgraph ML["🔧 ML ENGINEER — grounded-evals mlflow"]
            direction TB
            S6["6️⃣ SageMaker MLflow Pipeline"]
        end

        S3 -.->|"deploy"| AC["☁️ Bedrock AgentCore"]
        S4 -.->|"invoke"| BR["🤖 Claude Haiku 4.5"]
        S6 -.->|"track"| SM["📊 SageMaker MLflow"]
        S6 -.->|"gate"| CI["🚦 CI/CD Pipeline"]

        classDef handoff fill:#fce4ec,stroke:#c62828,stroke-width:3px,stroke-dasharray: 5 5

Two personas. Six steps. One file connects them.

| Step | Who | What happens | Output |
|------|-----|--------------|--------|
| 1 | Domain Expert | "RxBot helps patients with medications" | Bounded context |
| 2 | Domain Expert | "Never prescribe. Always escalate." | System prompt + safety rules |
| 3 | Domain Expert | One command → live endpoint | Agent on AgentCore |
| 4 | Domain Expert | 20 test cases via Open Coding | Golden queries + responses |
| 5 | Domain Expert | ✓/⚠/✗ → name the failures | Error codes + G-Eval rubric |
| 6 | ML Engineer | `grounded-evals mlflow --run-eval` | SageMaker experiment + CI/CD gates |

Why deploy before testing? The agent only needs the system prompt. By deploying at Step 3, all golden queries run against the real endpoint — latency, IAM, cold starts included.
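The handoff between the two personas is a single file. The actual `session.json` schema is not shown here, so the field names below (`agent_name`, `system_prompt`, `cases`, `verdict`, `error_codes`) are assumptions made purely for illustration. A minimal sketch of how annotated golden queries could be written to and read back from such a file:

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical schema: GEDD's real session.json layout may differ.
@dataclass
class Case:
    query: str
    response: str
    verdict: str                                  # "pass", "warn", or "fail" (✓/⚠/✗)
    error_codes: list[str] = field(default_factory=list)

@dataclass
class Session:
    agent_name: str
    system_prompt: str
    cases: list[Case] = field(default_factory=list)

def dump_session(session: Session) -> str:
    """Serialize the session to a JSON string."""
    return json.dumps(asdict(session), ensure_ascii=False, indent=2)

def load_session(text: str) -> Session:
    """Rebuild a Session from a JSON string produced by dump_session."""
    raw = json.loads(text)
    cases = [Case(**c) for c in raw["cases"]]
    return Session(raw["agent_name"], raw["system_prompt"], cases)

def failing_codes(session: Session) -> set[str]:
    """Collect every error code attached to a non-passing case."""
    return {
        code
        for case in session.cases
        if case.verdict != "pass"
        for code in case.error_codes
    }
```

Keeping verdicts and error codes next to each query is what would let the ML engineer rebuild the judge without re-interviewing the expert.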

The Flywheel

The pipeline isn't linear — it's a loop. Production failures feed back into new test cases. The eval suite grows with the agent.

    flowchart TD
        subgraph EXPERT["🧑‍💼 DOMAIN EXPERT"]
            D["Define + Prompt + Deploy"]
            Q["Golden Queries Open Coding methodology"]
            A["Annotate ✓/⚠/✗ + error codes"]
            D --> Q --> A
        end

        subgraph ENGINEER["🔧 ML ENGINEER"]
            J["Build Judge Rubric + weights + hard-fails"]
            K{"Calibrate κ ≥ 0.80?"}
            CI["CI/CD Gate TSR ≥ 95%"]
            J --> K
            K -->|"Yes"| CI
            K -->|"No — fix criteria"| J
        end

        A -->|"session.json"| J
        CI -->|"✅ Ship"| PROD["🚀 Production"]
        PROD -.->|"🔄 New failure discovered"| Q

        style PROD fill:#c8e6c9,stroke:#2e7d32

Each guide maps to a section of the flywheel:

| Guide | Covers | For |
|-------|--------|-----|
| Pipeline Guide | Full workflow + CI/CD YAML | Both |
| Domain Expert Guide | Steps 1-5 walkthrough | PMs / SMEs |
| PM → Production Judge | Turn annotations into CI judge | ML Engineers |
| Cohen's Kappa | Calibrate judge-human agreement | ML Engineers |
| Building an LLM Judge | Rubric design + few-shot calibration | ML Engineers |
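The calibration step in the flywheel (κ ≥ 0.80) compares the judge's verdicts with the expert's annotations on the same responses. Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance. The sketch below is a plain-Python illustration of that formula, not the repo's own implementation. The function names and the `threshold` default are assumptions taken from the flywheel diagram.

```python
from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(human)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[lbl] * judge_counts[lbl] for lbl in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)

def is_calibrated(human: list[str], judge: list[str], threshold: float = 0.80) -> bool:
    """True when judge-human agreement meets the (assumed) kappa threshold."""
    return cohens_kappa(human, judge) >= threshold
```

If the check fails, the flywheel's "No" branch applies: revise the judge criteria and recalibrate rather than lowering the threshold.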

Quick Start

Domain Expert

    cd grounded-evals
    pip install -e .
    claude

    # inside Claude Code
    /gedd
    # 90 min → golden dataset + judge

ML Engineer

    pip install sagemaker-mlflow

    grounded-evals mlflow \
      --session session.json \
      --tracking-uri $ARN \
      --run-eval

Explore Demos

    pip install -e ".[dev]"
    grounded-evals serve
    # Open localhost:8080
    # 17 pre-loaded scenarios

What the Domain Expert Discovers

We tested across 4 domains. In every case, the expert caught failures an engineer would miss:

| Domain | Error Code | What Happened | Why Only an Expert Catches It |
|--------|------------|---------------|-------------------------------|
| 💊 Pharmacy | `dosage_unit_confusion` | Said "mg" when context suggests "mcg" | 1000x error, potentially fatal |
| 🏠 Insurance | `coverage_hallucination` | Assumed policy exists without checking | Policyholder believes they're covered |
| 💰 Tax | `incomplete_guidance` | Didn't recommend CPA for $200K scenario | Liability issue in tax advice |
| 🛂 Immigration | `bar_misapplication` | Said 3-year bar applies to 90-day overstay | Bar triggers at 180+ days (INA §212(a)(9)(B)) |

These aren't generic "hallucination" labels. They're domain-specific failure modes in the expert's own vocabulary — and they become the criteria in the deployed judge.
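To make the last point concrete, here is one way expert-named error codes could be turned into judge criteria. This is a simplified rubric-style prompt builder written for illustration. It is not the G-Eval prompt that `grounded-evals judge` generates, and the `Criterion` fields, the prompt wording, and the JSON answer format are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    code: str          # expert-named error code, e.g. "dosage_unit_confusion"
    description: str   # what the failure looks like, in the expert's words
    hard_fail: bool    # True if one occurrence should fail the whole response

def build_judge_prompt(agent_name: str, criteria: list[Criterion]) -> str:
    """Render a rubric prompt listing every criterion (hypothetical format)."""
    lines = [
        f"You are evaluating a response from {agent_name}.",
        "Check the response against each criterion and list the codes that apply.",
        "",
        "Criteria:",
    ]
    for c in criteria:
        tag = " [HARD FAIL]" if c.hard_fail else ""
        lines.append(f"- {c.code}{tag}: {c.description}")
    lines += ["", 'Answer as JSON: {"codes": [...], "verdict": "pass" | "fail"}']
    return "\n".join(lines)

if __name__ == "__main__":
    rubric = [
        Criterion("dosage_unit_confusion", 'Says "mg" where context implies "mcg".', True),
        Criterion("incomplete_guidance", "Omits a referral the situation calls for.", False),
    ]
    print(build_judge_prompt("RxBot", rubric))
```

Because the criteria carry the expert's own codes, a judge verdict can be traced back to the annotation that motivated it.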

Architecture

    flowchart LR
        CC["Claude Code /gedd skill"] --> SJ["session.json"]
        SJ --> CLI["grounded-evals mlflow"]
        CLI --> SM["SageMaker MLflow Experiments + Judges"]
        CLI --> BR["Bedrock AgentCore + Claude"]
        SM --> CICD["CI/CD Regression gates"]
        CICD --> BR

All AWS-native. IAM for auth. S3 for artifacts. No external services.
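The regression gate at the end of this architecture can be pictured as a small check that turns judge verdicts into a CI exit code. The sketch below reads TSR as "task success rate", which is an assumption, as are the function names and the rule that any hard-fail blocks the release. The 95% threshold comes from the flywheel diagram.

```python
def task_success_rate(verdicts: list[str]) -> float:
    """Fraction of judged responses marked "pass"."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(v == "pass" for v in verdicts) / len(verdicts)

def gate(verdicts: list[str], hard_fail_hits: int = 0, threshold: float = 0.95) -> int:
    """Return a process exit code: 0 to let the build ship, 1 to block it."""
    if hard_fail_hits > 0:
        return 1  # assumed policy: a single hard-fail blocks regardless of TSR
    return 0 if task_success_rate(verdicts) >= threshold else 1

if __name__ == "__main__":
    import sys

    # Example run: 20 golden queries, one failure, no hard-fails.
    sample = ["pass"] * 19 + ["fail"]
    sys.exit(gate(sample))
```

Returning an exit code rather than raising keeps the check usable from any CI system that treats non-zero as failure.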

17 Demo Scenarios

No LLM calls needed. Each is pre-loaded with golden queries, annotations, error codes, and a generated judge.

View all 17 demos

| Demo | Domain | Key failure modes |
|------|--------|-------------------|
| TravelBot | Flight booking | Hallucinated entities, fabricated booking data |
| ClinicalBot | Clinical triage | Missed escalation, contraindication miss |
| LexBot | Legal assistant | Jurisdiction error, unauthorized legal advice |
| WealthBot | Financial planning | Unlicensed advice, projection hallucination |
| HRBot | HR policy Q&A | Policy misquote, confidentiality breach |
| EduBot | Student learning | Answer reveal, grade inflation |
| VaultEx AI | Crypto exchange | Regulatory misguidance, fee hallucination |
| PixelGuard | Gaming moderation | False positive bans, harassment miss |
| InsureBot | Insurance claims | Bad-faith denial, coverage hallucination |
| PropBot | Real estate | Fair Housing steering, fabricated comps |
| RxBot | Pharmacy | Drug interaction miss, dosage unit confusion |
| TaxBot | Tax/accounting | Deduction hallucination, Circular 230 violation |
| ClaimsBot | Defense contracting | ITAR violation, CUI spillage |
| FoodBot | Food safety | Allergen cross-contact, HACCP temp error |
| AutoBot | Automotive | Lemon law omission, CARS Rule violation |
| MigrateBot | Immigration | Asylum deadline miss, bar misapplication |
| EnergyBot | Energy/utilities | Solar ITC outdated, NEM 3.0 confusion |

CLI Reference

| Command | What it does |
|---------|--------------|
| `chat` | Conversational coaching (Steps 1-5) |
| `eval` | Run golden queries against a model |
| `annotate` | Mark responses ✓/⚠/✗ with error codes |
| `judge` | Generate G-Eval judge prompt |
| `mlflow` | Export to SageMaker MLflow (Step 6) |
| `export` | Write golden dataset as JSONL/CSV/JSON |
| `status` | Session dashboard |
| `analyze` | Map error codes to eval dimensions |
| `serve` | Start the web UI |
| `fracture` | Fracture domain into test categories |
| `check-saturation` | Check dataset coverage |
| `coverage` | Bar-chart breakdown by category |
| `compare` | Check if a new prompt adds unique coverage |
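In Grounded Theory, saturation is the point where further observations stop producing new codes. How `check-saturation` measures this is not described here, so the sketch below is only one plausible reading: count how many previously unseen error codes each annotated case contributes, and call the dataset saturated when a trailing window of cases adds none. The function names and the `window` default are assumptions.

```python
def new_codes_per_case(annotations: list[set[str]]) -> list[int]:
    """For each case, in order, count error codes not seen in any earlier case."""
    seen: set[str] = set()
    counts: list[int] = []
    for codes in annotations:
        counts.append(len(codes - seen))
        seen |= codes
    return counts

def is_saturated(annotations: list[set[str]], window: int = 5) -> bool:
    """True when the last `window` cases introduced no new error codes."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(annotations) < window:
        return False  # too few cases to judge saturation
    return sum(new_codes_per_case(annotations)[-window:]) == 0
```

A failure discovered in production would show up here as a new code, which reopens the loop described in the flywheel.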

Why This Works

Most eval tools ask: what should we measure? GEDD asks: what is actually happening?

You can't evaluate what you haven't observed. Pre-baked rubrics miss your agent's unique failures.

Criteria are weighted by evidence. A dosage unit confusion isn't the same severity as a tone slip.

Your evaluation evolves with the agent. The flywheel absorbs new failure modes naturally.

Your work becomes load-bearing. The judge is in your domain vocabulary, not generic "helpfulness 1-5."
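The point about evidence-weighted criteria can be sketched as a scoring function in which severe codes cost more and hard-fail codes zero the score outright. The weights, the `tone_slip` code name, and the scoring formula below are hypothetical. The source states only that criteria are weighted and that the rubric includes hard-fails.

```python
def score_response(
    found_codes: set[str],
    weights: dict[str, float],
    hard_fails: set[str],
) -> float:
    """Return a score in [0, 1]; 0.0 if any hard-fail code was found."""
    if found_codes & hard_fails:
        return 0.0
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Codes missing from the weight table carry no penalty in this sketch.
    penalty = sum(weights.get(code, 0.0) for code in found_codes)
    return max(0.0, 1.0 - penalty / total)

# Hypothetical weights for a pharmacy agent.
WEIGHTS = {"tone_slip": 1.0, "incomplete_guidance": 3.0, "dosage_unit_confusion": 6.0}
HARD_FAILS = {"dosage_unit_confusion"}
```

With these example numbers a tone slip alone costs a tenth of the score, while a dosage unit confusion fails the response regardless of anything else it got right.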

⭐ Found this useful?

If GEDD helped you find what your agent gets wrong, a star helps others find it too.

License: MIT-0. See LICENSE. Security: see CONTRIBUTING.


Generated from amazon-archives/__template_MIT-0