Show HN: HermesBench – workflow reliability evals for personal AI agents

HermesBench is a benchmark for evaluating the reliability of complete personal AI agent configurations, including prompts, models, tools, and memory. Its current public baseline scores 78.2 across 27 workflow recipes, with redacted traces available for inspection. The benchmark emphasizes evidence-driven scoring, and the project is asking for early feedback.

Hermes Agent runtime evaluation

Benchmark the whole personal agent, not just the model.

HermesBench evaluates complete Hermes configurations: prompt, model/provider, tools, AgentSkills, memory, gateway behavior, delegation, safety, latency, and stability. The current public baseline scores 78.2 across 27 personal-agent recipes with redacted traces you can inspect.

78.2 current public baseline

27 workflow recipes

9 scored suites

Why trust it

Evidence first, with visible limits.

Every published result links back to scenario definitions, public score axes, driver closure decisions, deterministic checks, and redacted trace timelines. The site is deliberately clear that this is one early baseline, not a base-model leaderboard.

Site map

Three tabs for the current evidence shape.

With one baseline published, a leaderboard is premature. The site now starts from the content people need to navigate: recipes, profiles, and traces.

Agent-driven quick start

Run it through a coding agent.

The public user pathway is intentionally simple: copy the prompt to Codex, Claude, or another coding agent. The agent loads the HermesBench skill and drives one scenario recipe first. Full bundle runs are opt-in because they take longer and cost more.

Prompt to copy into Codex or Claude

Use the HermesBench skill and run one default scenario recipe for my current Hermes configuration.

Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md

Follow the skill's "Run Current Hermes Configuration" workflow. Use the Python API default single-recipe path, save artifacts, and summarize the score and main findings. Do not run the full bundle unless I explicitly ask.

Alpha feedback

The best next action is concrete feedback.

HermesBench needs early feedback on setup friction, scoring surprises, recipe realism, profile evidence, and redaction trust. Star the repo if the benchmark shape is useful; open an issue if one recipe, trace, or score axis feels wrong.

Coverage model

Workflow recipes, broad personal-agent coverage.

HermesBench starts with one valuable workflow recipe, then lets you opt into broader suites when you need more confidence. The bundled catalog covers everyday personal-agent work: context, calendar, web, reports, communication, location, travel, finance, safety, and power-user integrations.

Personal core, Communications, Ambient and travel, Private sensitive, Power-user optional

Scoring philosophy

Good agents finish the right thing safely.

Outcome reached, Evidence / truthfulness, Runtime / scope safety, Responsiveness, Task fulfillment, Communication quality

HermesBench is reliability-first, but not capability-blind. A good configuration should do useful work, tell the truth about what it knows, avoid unsafe side effects, stay stable, respond promptly, and communicate clearly. Lopsided scores are penalized because a personal agent that is capable but unsafe, safe but unhelpful, or correct but unusably slow is not actually good.
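
To make the idea of penalizing lopsided scores concrete, here is a minimal sketch of one way it could work. It is not HermesBench's actual formula. The axis names come from this page, but the 0-100 scale, the equal weighting, the `balanced_score` helper, and the choice of a harmonic mean are all assumptions made for illustration.

```python
from statistics import fmean, harmonic_mean

# Axis names are taken from the page; the 0-100 scale, the equal weighting,
# and the harmonic-mean formula are assumptions made for illustration only.
AXES = (
    "outcome_reached",
    "evidence_truthfulness",
    "runtime_scope_safety",
    "responsiveness",
    "task_fulfillment",
    "communication_quality",
)


def balanced_score(axis_scores):
    """Combine per-axis scores so that one weak axis drags the total down."""
    missing = [axis for axis in AXES if axis not in axis_scores]
    if missing:
        raise ValueError(f"missing axis scores: {missing}")
    values = [float(axis_scores[axis]) for axis in AXES]
    if any(v < 0 or v > 100 for v in values):
        raise ValueError("axis scores must be in the range 0-100")
    # harmonic_mean returns 0 when any input is 0, so a failed axis zeroes the total.
    return float(harmonic_mean(values))


# Hypothetical configurations: one even, one capable but unsafe.
balanced = {axis: 80.0 for axis in AXES}
lopsided = {**{axis: 100.0 for axis in AXES}, "runtime_scope_safety": 20.0}

# Plain average versus the balance-sensitive score for the lopsided configuration.
print(round(fmean(list(lopsided.values())), 1), round(balanced_score(lopsided), 1))
print(round(balanced_score(balanced), 1))
```

In this sketch the lopsided configuration averages roughly 86.7 but scores about 60 under the harmonic mean, so it ranks below the evenly performing configuration at 80. A weighted or thresholded scheme would show the same effect with different trade-offs.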

Detailed formulas and implementation mechanics live in the methodology document; the website keeps the scoring model readable for users and LLM agents.