Show HN: GEDD – Find what your AI agent gets wrong (before your users do)
GEDD is an open-source tool that lets domain experts systematically discover AI agent failure modes without a pre-existing evaluation rubric. Through a conversational process, it produces a production-ready eval pipeline in 90 minutes and captures domain-specific errors such as dosage unit confusion or coverage hallucination. The tool is based on Grounded Theory methodology, was tested across four domains, and ships with 17 demo scenarios.
You shipped an AI agent. Now you need to prove it works — to your CEO, to compliance, to the team that inherits it. The agent fails in ways no rubric anticipated, and the eval tools expect you to know what to measure before you've seen what breaks.
GEDD is the tool for before you have a rubric. A domain expert has a conversation, and 90 minutes later you have a production eval pipeline.
The eval pipeline is the product. The agent is just the thing it produces.
📖 Why Grounded Theory for reliable AI agents? The long-form argument behind this repo.
The Pipeline
    flowchart TD
        subgraph DE["🧑💼 DOMAIN EXPERT - /gedd in Claude Code"]
            direction TB
            S1["1️⃣ Define Agent"]
            S2["2️⃣ System Prompt"]
            S3["3️⃣ Deploy to AgentCore"]
            S4["4️⃣ Golden Queries"]
            S5["5️⃣ Annotate & Judge"]
            S1 ==> S2 ==> S3 ==> S4 ==> S5
        end

        S5 ==>|"📄 session.json"| HANDOFF:::handoff
        HANDOFF ==> S6

        subgraph ML["🔧 ML ENGINEER - grounded-evals mlflow"]
            direction TB
            S6["6️⃣ SageMaker MLflow Pipeline"]
        end

        S3 -.->|"deploy"| AC["☁️ Bedrock AgentCore"]
        S4 -.->|"invoke"| BR["🤖 Claude Haiku 4.5"]
        S6 -.->|"track"| SM["📊 SageMaker MLflow"]
        S6 -.->|"gate"| CI["🚦 CI/CD Pipeline"]

        classDef handoff fill:#fce4ec,stroke:#c62828,stroke-width:3px,stroke-dasharray: 5 5
Two personas. Six steps. One file connects them.
| Step | Who | What happens | Output |
|------|-----|--------------|--------|
| 1 | Domain Expert | "RxBot helps patients with medications" | Bounded context |
| 2 | Domain Expert | "Never prescribe. Always escalate." | System prompt + safety rules |
| 3 | Domain Expert | One command → live endpoint | Agent on AgentCore |
| 4 | Domain Expert | 20 test cases via Open Coding | Golden queries + responses |
| 5 | Domain Expert | ✓/⚠/✗ → name the failures | Error codes + G-Eval rubric |
| 6 | ML Engineer | `grounded-evals mlflow --run-eval` | SageMaker experiment + CI/CD gates |
Why deploy before testing? The agent needs only the system prompt to run, so it can go live at Step 3. Every golden query then runs against the real endpoint, with latency, IAM, and cold starts included.
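The `session.json` handoff between the two personas can be pictured as a small serializable record. The sketch below is illustrative only: the README does not document the file's schema, so every class and field name here (`Session`, `GoldenQuery`, `verdict`, `error_codes`) is an assumption, not the tool's actual format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class GoldenQuery:
    """One annotated test case (hypothetical shape)."""
    query: str
    response: str
    verdict: str  # "pass", "warn", or "fail", mirroring ✓/⚠/✗
    error_codes: list[str] = field(default_factory=list)


@dataclass
class Session:
    """Everything the domain expert hands to the ML engineer (hypothetical shape)."""
    agent_name: str
    system_prompt: str
    golden_queries: list[GoldenQuery] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recursively converts the nested dataclasses to plain dicts.
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)

    @classmethod
    def from_json(cls, text: str) -> "Session":
        raw = json.loads(text)
        queries = [GoldenQuery(**q) for q in raw.pop("golden_queries", [])]
        return cls(golden_queries=queries, **raw)


session = Session(
    agent_name="RxBot",
    system_prompt="Never prescribe. Always escalate.",
    golden_queries=[
        GoldenQuery(
            query="How much levothyroxine should I take?",
            response="Take 50 mg daily.",
            verdict="fail",
            error_codes=["dosage_unit_confusion"],
        )
    ],
)

# Round-trip through JSON, as the handoff between personas would.
restored = Session.from_json(session.to_json())
```

Keeping the handoff to a single plain-JSON file means the expert's annotations can be versioned and diffed like any other artifact.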
The Flywheel
The pipeline isn't linear — it's a loop. Production failures feed back into new test cases. The eval suite grows with the agent.
    flowchart TD
        subgraph EXPERT["🧑💼 DOMAIN EXPERT"]
            D["Define + Prompt + Deploy"]
            Q["Golden Queries Open Coding methodology"]
            A["Annotate ✓/⚠/✗ + error codes"]
            D --> Q --> A
        end

        subgraph ENGINEER["🔧 ML ENGINEER"]
            J["Build Judge Rubric + weights + hard-fails"]
            K{"Calibrate κ ≥ 0.80?"}
            CI["CI/CD Gate TSR ≥ 95%"]
            J --> K
            K -->|"Yes"| CI
            K -->|"No: fix criteria"| J
        end

        A -->|"session.json"| J
        CI -->|"✅ Ship"| PROD["🚀 Production"]
        PROD -.->|"🔄 New failure discovered"| Q

        style PROD fill:#c8e6c9,stroke:#2e7d32
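The two thresholds in the flywheel (κ ≥ 0.80 for judge calibration, TSR ≥ 95% for the CI/CD gate) can be illustrated with a minimal sketch. Cohen's kappa is computed here with its standard formula. Reading TSR as "task success rate", meaning the share of responses with a pass verdict, is an assumption, since the README does not define it. The function names are hypothetical and not part of the `grounded-evals` CLI.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def task_success_rate(verdicts: list[str]) -> float:
    """Share of responses marked "pass" (assumed meaning of TSR)."""
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    return sum(v == "pass" for v in verdicts) / len(verdicts)


def gate(kappa: float, tsr: float, min_kappa: float = 0.80, min_tsr: float = 0.95) -> bool:
    """Ship only if the judge is calibrated and the agent clears the bar."""
    return kappa >= min_kappa and tsr >= min_tsr


# Ten responses: the judge disagrees with the human on one of them.
human_labels = ["pass"] * 6 + ["fail"] * 4
judge_labels = ["pass"] * 5 + ["fail"] * 5
kappa = cohens_kappa(human_labels, judge_labels)  # observed 0.9, chance 0.5
```

A judge that fails the kappa check goes back to criteria revision rather than forward to the gate, which matches the "No: fix criteria" edge in the diagram.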
Each guide maps to a section of the flywheel:
| Guide | Covers | For |
|-------|--------|-----|
| Pipeline Guide | Full workflow + CI/CD YAML | Both |
| Domain Expert Guide | Steps 1-5 walkthrough | PMs / SMEs |
| PM → Production Judge | Turn annotations into CI judge | ML Engineers |
| Cohen's Kappa | Calibrate judge-human agreement | ML Engineers |
| Building an LLM Judge | Rubric design + few-shot calibration | ML Engineers |
Quick Start
Domain Expert
    cd grounded-evals
    pip install -e .
    claude

Inside Claude Code, run /gedd. 90 min → golden dataset + judge.
ML Engineer
    pip install sagemaker-mlflow

    grounded-evals mlflow \
        --session session.json \
        --tracking-uri $ARN \
        --run-eval
Explore Demos
    pip install -e ".[dev]"
    grounded-evals serve

Open localhost:8080 for 17 pre-loaded scenarios.
What the Domain Expert Discovers
We tested across 4 domains. In every case, the expert caught failures an engineer would miss:
| Domain | Error Code | What Happened | Why Only an Expert Catches It |
|--------|------------|---------------|-------------------------------|
| 💊 Pharmacy | dosage_unit_confusion | Said "mg" when context suggests "mcg" | 1000x error, potentially fatal |
| 🏠 Insurance | coverage_hallucination | Assumed policy exists without checking | Policyholder believes they're covered |
| 💰 Tax | incomplete_guidance | Didn't recommend CPA for $200K scenario | Liability issue in tax advice |
| 🛂 Immigration | bar_misapplication | Said 3-year bar applies to 90-day overstay | Bar triggers at 180+ days (INA §212(a)(9)(B)) |
These aren't generic "hallucination" labels. They're domain-specific failure modes in the expert's own vocabulary — and they become the criteria in the deployed judge.
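As a rough illustration of how expert-named error codes could become judge criteria, the sketch below tallies the codes found during annotation and turns them into weighted criteria. This is an assumed approach, not the repo's documented algorithm: the `build_criteria` function, the annotation shape, and the frequency-based weighting are all hypothetical.

```python
from collections import Counter


def build_criteria(annotations: list[dict], hard_fail_codes: set[str]) -> list[dict]:
    """Turn annotated error codes into judge criteria weighted by how often they occurred."""
    counts = Counter(code for a in annotations for code in a["error_codes"])
    total = sum(counts.values())
    criteria = []
    # most_common() orders by frequency, so the best-evidenced criterion comes first.
    for code, count in counts.most_common():
        criteria.append(
            {
                "code": code,
                "evidence_count": count,
                "weight": count / total,
                "hard_fail": code in hard_fail_codes,
            }
        )
    return criteria


# Hypothetical annotations from a pharmacy session.
annotations = [
    {"query": "q1", "error_codes": ["dosage_unit_confusion"]},
    {"query": "q2", "error_codes": []},
    {"query": "q3", "error_codes": ["dosage_unit_confusion", "drug_interaction_miss"]},
    {"query": "q4", "error_codes": ["tone_slip"]},
]
criteria = build_criteria(annotations, hard_fail_codes={"dosage_unit_confusion"})
```

The criterion names stay exactly as the expert wrote them, so the judge prompt can quote them without translation into a generic taxonomy.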
Architecture
    flowchart LR
        CC["Claude Code /gedd skill"] --> SJ["session.json"]
        SJ --> CLI["grounded-evals mlflow"]
        CLI --> SM["SageMaker MLflow Experiments + Judges"]
        CLI --> BR["Bedrock AgentCore + Claude"]
        SM --> CICD["CI/CD Regression gates"]
        CICD --> BR
All AWS-native. IAM for auth. S3 for artifacts. No external services.
17 Demo Scenarios
No LLM calls needed. Each is pre-loaded with golden queries, annotations, error codes, and a generated judge.
View all 17 demos
| Demo | Domain | Key failure modes |
|------|--------|-------------------|
| TravelBot | Flight booking | Hallucinated entities, fabricated booking data |
| ClinicalBot | Clinical triage | Missed escalation, contraindication miss |
| LexBot | Legal assistant | Jurisdiction error, unauthorized legal advice |
| WealthBot | Financial planning | Unlicensed advice, projection hallucination |
| HRBot | HR policy Q&A | Policy misquote, confidentiality breach |
| EduBot | Student learning | Answer reveal, grade inflation |
| VaultEx AI | Crypto exchange | Regulatory misguidance, fee hallucination |
| PixelGuard | Gaming moderation | False positive bans, harassment miss |
| InsureBot | Insurance claims | Bad-faith denial, coverage hallucination |
| PropBot | Real estate | Fair Housing steering, fabricated comps |
| RxBot | Pharmacy | Drug interaction miss, dosage unit confusion |
| TaxBot | Tax/accounting | Deduction hallucination, Circular 230 violation |
| ClaimsBot | Defense contracting | ITAR violation, CUI spillage |
| FoodBot | Food safety | Allergen cross-contact, HACCP temp error |
| AutoBot | Automotive | Lemon law omission, CARS Rule violation |
| MigrateBot | Immigration | Asylum deadline miss, bar misapplication |
| EnergyBot | Energy/utilities | Solar ITC outdated, NEM 3.0 confusion |
CLI Reference
| Command | What it does |
|---------|--------------|
| `chat` | Conversational coaching (Steps 1-5) |
| `eval` | Run golden queries against a model |
| `annotate` | Mark responses ✓/⚠/✗ with error codes |
| `judge` | Generate G-Eval judge prompt |
| `mlflow` | Export to SageMaker MLflow (Step 6) |
| `export` | Write golden dataset as JSONL/CSV/JSON |
| `status` | Session dashboard |
| `analyze` | Map error codes to eval dimensions |
| `serve` | Start the web UI |
| `fracture` | Fracture domain into test categories |
| `check-saturation` | Check dataset coverage |
| `coverage` | Bar-chart breakdown by category |
| `compare` | Check if a new prompt adds unique coverage |
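One plausible reading of a saturation check, borrowed from Grounded Theory, is that a dataset is saturated once new test cases stop surfacing new error codes. The sketch below implements that idea under stated assumptions. It is not confirmed to be what `check-saturation` computes; the function names and the `window` parameter are hypothetical.

```python
def new_codes_per_case(cases: list[list[str]]) -> list[int]:
    """For each annotated case, in order, count error codes not seen in any earlier case."""
    seen: set[str] = set()
    fresh_counts = []
    for codes in cases:
        fresh = set(codes) - seen
        fresh_counts.append(len(fresh))
        seen |= fresh
    return fresh_counts


def is_saturated(cases: list[list[str]], window: int = 5) -> bool:
    """True when the last `window` cases introduced no new error code."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(cases) < window:
        return False  # not enough evidence yet
    return sum(new_codes_per_case(cases)[-window:]) == 0


# Error codes assigned to eight golden queries, in annotation order.
annotated = [
    ["dosage_unit_confusion"],
    [],
    ["drug_interaction_miss"],
    ["dosage_unit_confusion"],
    [],
    ["drug_interaction_miss"],
    [],
    ["dosage_unit_confusion"],
]
fresh = new_codes_per_case(annotated)
```

In this example the last five cases repeat known codes only, so under this definition further queries in the same category are unlikely to add coverage.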
Why This Works
Most eval tools ask: what should we measure? GEDD asks: what is actually happening?
You can't evaluate what you haven't observed. Pre-baked rubrics miss your agent's unique failures.
Criteria are weighted by evidence. A dosage unit confusion isn't the same severity as a tone slip.
Your evaluation evolves with the agent. The flywheel absorbs new failure modes naturally.
Your work becomes load-bearing. The judge is in your domain vocabulary, not generic "helpfulness 1-5."
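The point about severity can be made concrete with a minimal scoring sketch: weighted criteria reduce the score in proportion to their weight, while a hard-fail criterion zeroes it outright. The `Criterion` class, the weights, and the scoring formula are assumptions for illustration; the README mentions "weights + hard-fails" but does not specify how they are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    code: str
    weight: float
    hard_fail: bool = False


def score_response(triggered_codes: list[str], criteria: list[Criterion]) -> float:
    """Return a score in [0, 1]; any triggered hard-fail criterion forces 0.0."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("criteria must carry positive total weight")
    triggered = set(triggered_codes)
    penalty = 0.0
    for criterion in criteria:
        if criterion.code in triggered:
            if criterion.hard_fail:
                return 0.0
            penalty += criterion.weight
    return 1.0 - penalty / total_weight


# Hypothetical rubric: a unit mix-up is disqualifying, a tone slip is minor.
rubric = [
    Criterion("dosage_unit_confusion", weight=5.0, hard_fail=True),
    Criterion("incomplete_guidance", weight=2.0),
    Criterion("tone_slip", weight=1.0),
]

tone_only = score_response(["tone_slip"], rubric)
unit_error = score_response(["dosage_unit_confusion", "tone_slip"], rubric)
```

Here a tone slip alone costs one eighth of the score, while a dosage unit confusion fails the response regardless of anything else it got right.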
⭐ Found this useful?
If GEDD helped you find what your agent gets wrong, a star helps others find it too.
License: MIT-0. See LICENSE. Security: see CONTRIBUTING.