2026-06-28 00:33 UTCIn-site rewrite4 min readUpdated: 2026-06-28 01:21 UTC

Toolkit for Your AI Scientists – Rigorous, Auditable and Verifiable

ARA is a protocol and skill bundle for AI scientists that makes autonomous research verifiable, observable, and structured, addressing the bottleneck of verifying AI-generated scientific results.

SourceHacker News AIAuthor: amberjcjj

Uh oh!

There was an error while loading. Please reload this page.

Notifications You must be signed in to change notification settings

Fork 41

Star 406

BranchesTags

Open more actions menu

Folders and files

NameName

Last commit message

Last commit date

Latest commit

History

69 Commits

.github/workflows

docs

examples

packages

skills

.gitignore

CONTRIBUTING.md

LICENSE

README.md

Repository files navigation

The ecosystem layer for AI scientists. A protocol and skill bundle that makes autoresearch verifiable, crystallized, and observable — so trust scales with speed instead of collapsing under it.

The new bottleneck in science

AI scientists can now generate hypotheses, execute experiments, and produce results at near-infinite speed. But this acceleration has created a new fundamental bottleneck: How do we verify it? And how do we effectively guardrail the process?

When an AI generates thousands of exploratory steps, human researchers cannot manually untangle the logs to ensure empirical rigor. We need a fundamental shift in how research is documented and supervised.

Publishing compiles a rich research process into a lossy narrative (left). ARA preserves it as a structured, machine-executable knowledge package the AI scientist writes and the human reads (right).

ARA is a bundle of agent skills and protocols built to solve this bottleneck. It provides a rigorous, structured way to document research knowledge, strategically crystallize insights over time, and make autonomous scientific processes entirely observable and verifiable. Jump to how to use it ↓

Core Design Principles

Instead of leading with layers, the bundle maps directly to how it solves the bottleneck through three core design principles:

🛡️ Guardrailing & Verification

AI agents require precise constraint boundaries to prevent hallucinated conclusions. The system acts as a strict epistemic anchor, automatically applying formal verification principles to ensure every scientific claim is directly wired to ground-truth execution and falsifiable results.

🧠 Crystallizing Insights

Research is rarely a straight line; it is a messy graph of pivots and dead ends. The system forces AI scientists to systematically document their trajectory, crystallizing fleeting, unstructured logs into highly structured, reliable research knowledge that builds compounding value over time.

👁️ Total Observability

Supervising AI scientists shouldn't require reading endless terminal outputs. The system translates complex agent behaviors and exploration graphs into a clean, minimalist interface. It lets human researchers maintain high-level oversight, seamlessly stepping in to course-correct or guide the AI's behavior with zero friction.

🛠️ Quickstart: The Four Core Skills

To operationalize these design principles, ARA provides four specialized agent skills. You can install them via:

npx @ara-commons/ara-skills

Auto-detects Claude Code, Cursor, Gemini CLI, OpenCode, Codex, and Hermes, then prompts for skills, agents, and install scope (global vs. local). Full CLI reference: packages/ara-skills/.

Then reach for a skill by what you need:

If you want to… Skill Invoke

Capture research faithfully as you work — decisions, ablations, dead ends, configs research-manager /research-manager (or wire it to run automatically)

Compile an existing paper, repo, or notes into a structured ARA compiler /compiler

Verify an artifact's epistemic rigor before you trust, publish, or submit it rigor-reviewer /rigor-reviewer

Observe the full research trajectory in an interactive process map research-visualizer /research-visualizer

Make capture automatic. Append this to your agent's system-prompt file (CLAUDE.md, AGENTS.md, .cursorrules, or GEMINI.md) so the record fills itself in every session:

ARA: end-of-session research capture

At the END of every coding session, invoke the /research-manager skill to record decisions, experiments, dead ends, and claims into the ara/ artifact.

See each skill's SKILL.md for the full specification: research-manager · compiler · rigor-reviewer · research-visualizer

Under the hood — the artifact anatomy

The four pillars all read and write one structure. An ARA organizes research into four interlocking layers:

example_artifact/ PAPER.md # Root manifest + layer index (~200 tokens) logic/ # Cognitive layer — What & Why claims.md # Falsifiable assertions with proof refs experiments.md # Declarative experiment plans solution/ architecture.md # System design + component graph algorithm.md # Math + pseudocode constraints.md # Boundary conditions related_work.md # Typed dependency graph src/ # Physical layer — How configs/ # Hyperparameters with rationale environment.md # Dependencies, hardware, seeds trace/ # Exploration graph — Journey exploration_tree.yaml # Research DAG with typed nodes + dead ends evidence/ # Raw proof tables/ # Exact result tables figures/ # Extracted data points

Cross-layer forensic bindings thread claims in /logic to code in /src and evidence in /evidence. Dead-end nodes (×) in the exploration graph preserve failure modes so no agent re-walks them.

Key structural principles

Progressive disclosure — PAPER.md (~200 tokens) tells an agent whether the artifact is relevant; deeper files load on demand.

Cross-layer binding — claims reference experiments, experiments reference evidence, heuristics reference code. Everything resolves.

Dead ends preserved — failed approaches and rejected alternatives are first-class nodes in the exploration graph, not noise to drop.

Provenance tracking — every entry is tagged (user, ai-suggested, ai-executed, user-revised), distinguishing human-confirmed facts from AI inferences.

Why it works

The supervision gap is not hand-waving — it shows up as measurable cost. Across benchmarks, an ARA beats a strong PDF + repo baseline on the three things agents do with research (understand, reproduce, extend), most dramatically on recovering the failure knowledge a narrative drops. For the full argument — the two structural taxes, the benchmark results, and the case for agent-native research — read the writeup:

→ The Last Human-Written Paper: Agent-Native Research Artifacts

Compatibility

These skills follow the Agent Skills open standard and work with:

Claude Code (Anthropic)

Codex CLI (OpenAI)

GitHub Copilot

Cursor

Any agent supporting the Agent Skills specification

Citation

If you use ARA in your research, please cite:

@misc{liu2026humanwrittenpaperagentnativeresearch, title={The Last Human-Written Paper: Agent-Native Research Artifacts}, author={Jiachen Liu and Jiaxin Pei and Jintao Huang and Chenglei Si and Ao Qu and Xiangru Tang and Runyu Lu and Lichang Chen and Xiaoyan Bai and Haizhong Zheng and Carl Chen and Zhiyang Chen and Haojie Ye and Yujuan Fu and Zexue He and Zijian Jin and Zhenyu Zhang and Shangquan Sun and Maestro Harmon and John Dianzhuo Wang and Jianqiao Zeng and Jiachen Sun and Mingyuan Wu and Baoyu Zhou and Chenyu You and Shijian Lu and Yiming Qiu and Fan Lai and Yuan Yuan and Yao Li and Junyuan Hong and Ruihao Zhu and Beidi Chen and Alex Pentland and Ang Chen and Mosharaf Chowdhury and Zechen Zhang}, year={2026}, eprint={2604.24658}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2604.24658}, }

Contributing

See CONTRIBUTING.md for how to add or improve skills.

License

MIT

About

Research Ecosystem for AI Scientists

Resources

Readme

License

MIT license

Contributing

Uh oh!

There was an error while loading. Please reload this page.

Activity

Custom properties

Stars

406 stars

Watchers

2 watching

Forks

41 forks

Report repository

Releases

7 tags

Packages 0

Uh oh!

There was an error while loading. Please reload this page.

Contributors

Uh oh!

There was an error while loading. Please reload this page.

Languages

HTML 40.0%

JavaScript 36.0%

Python 14.7%

CSS 9.3%