OpenDevOps – An open-source AI agent that investigates AWS/Azure incidents

OpenDevOps is an open-source, self-hosted multi-cloud DevOps agent that supports AWS and Azure, integrates any LLM via LiteLLM, and investigates cloud incidents to find root causes and provide mitigation plans. By the project's own benchmark it is ~10x cheaper than AWS DevOps Agent; it is also auditable and customizable.

Source: Hacker News AI · Author: ahmadhammad01

Open-source multi-cloud DevOps agent (AWS + Azure). Bring any LLM via LiteLLM — OpenAI, Anthropic, OpenRouter, Groq, Gemini, Mistral, Ollama for air-gapped / regulated environments, or reuse your existing Claude Code subscription (auto-detected). Investigates incidents, finds root causes, and gives actionable mitigation plans — without the cloud-vendor DevOps-agent price tag.

📊 Benchmarked, not just demoed

On a reproducible 10-incident suite (real AWS + Azure resources, scored against ground truth), running on a commodity open model (gpt-oss-120b — no frontier model required):

| Root causes found | Median time | Cost / investigation | vs. AWS DevOps Agent | vs. manual triage |
| --- | --- | --- | --- | --- |
| 9 / 10 (90%) | ~52 s | ~$0.03 | ~10× cheaper¹ (~$0.03 vs ~$0.43) | ~1,000× cheaper² |

~$0.03 of compute replaces ~$50 of engineer toil and costs a fraction of a managed cloud DevOps agent, while returning the answer in under a minute instead of half an hour. Reproduce it with make eval; the full benchmark and methodology are documented in the repository.

¹ vs. AWS DevOps Agent: its per-second rate applied to the same wall-clock time (~$0.43/investigation); verify against AWS's published pricing.

² Illustrative unit economics vs. ~20–40 min of on-call triage. Cost shown is the provider-dashboard actual; see caveats.
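The headline ratios can be rechecked from the quoted figures alone. The sketch below is plain arithmetic on the numbers stated above (~$0.03, ~$0.43, ~52 s, ~$50). The per-second rate it prints is derived from those figures and is not AWS's published price.

```python
# Figures quoted in the benchmark summary (all approximate).
OPENDEVOPS_COST = 0.03      # USD per investigation
AWS_AGENT_COST = 0.43       # USD per investigation (footnote 1)
MEDIAN_SECONDS = 52         # median wall-clock time per investigation
MANUAL_TRIAGE_COST = 50.0   # USD of engineer time (footnote 2, illustrative)


def ratio(baseline: float, ours: float) -> float:
    """Return how many times cheaper `ours` is than `baseline`."""
    return baseline / ours


vs_aws = ratio(AWS_AGENT_COST, OPENDEVOPS_COST)
vs_manual = ratio(MANUAL_TRIAGE_COST, OPENDEVOPS_COST)
# Rate implied by $0.43 over 52 s; derived here, not an official price.
implied_rate = AWS_AGENT_COST / MEDIAN_SECONDS

print(f"vs AWS DevOps Agent: ~{vs_aws:.0f}x cheaper")
print(f"vs manual triage:    ~{vs_manual:.0f}x cheaper")
print(f"implied per-second rate: ~${implied_rate:.4f}/s")
```

With these rounded inputs both ratios land somewhat above the headline ~10× and ~1,000×, so the published figures read as conservative order-of-magnitude claims rather than exact multiples.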

Cloud setup: AWS (IAM) · Azure (service principal / login)

Demo

Autonomous incident detection — a crashing Lambda is caught automatically, the agent reads the traceback from CloudWatch Logs, finds the root cause, surfaces it on the Monitoring dashboard, and posts the mitigation to Slack. No human in the loop.

Why OpenDevOps?

Amazon Q Developer and the AWS DevOps Agent are excellent if you live entirely inside the AWS Console with Bedrock-managed models. OpenDevOps is the open-source alternative for everyone else:

Any LLM, not just Bedrock. LiteLLM-compatible — OpenAI, Anthropic direct, OpenRouter, Groq, Gemini, Mistral, or run Ollama locally for air-gapped / regulated environments. Auto-detects your existing Claude Code subscription so you pay zero incremental LLM cost if you're already on a Max/Pro plan.

Multi-cloud out of the box. AWS + Azure investigations in the same chat (one organization can connect both clouds at once). AWS-only agents stop at the AWS perimeter.

Your data stays in your database. Investigations, prompts, and tool outputs persist in your Postgres or SQLite — your VPC, your retention, your encryption. Matters for HIPAA, PCI, FedRAMP, and EU AI Act audits.

Fully auditable. Every prompt, tool call (args + result), and token is open and streamed live to the UI; nothing is hidden. The AWS DevOps Agent, by contrast, is a closed black box.

Customizable. Add tools as plain Python functions, add runbooks by dropping a SKILL.md file, modify the system prompt. Fork it if you need to.

Investigate from anywhere. Built-in MCP server makes it usable from Claude Desktop, Cursor, or any MCP client — not just the AWS Console.
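To make the "tools as plain Python functions" point concrete, here is a minimal sketch of how a tool schema can be derived from a function's signature and docstring using only the standard library. The tool `get_alarm_state` and the helper `infer_schema` are hypothetical illustrations and are not taken from the OpenDevOps codebase. The project presumably delegates this step to its agent framework.

```python
import inspect
from typing import Any, get_type_hints

# Map Python annotations to JSON-schema type names (small illustrative subset).
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def get_alarm_state(alarm_name: str, region: str = "us-east-1") -> dict:
    """Return the current state of a CloudWatch alarm (hypothetical tool)."""
    # A real tool would call the cloud API here; this stub returns fixed data.
    return {"alarm": alarm_name, "region": region, "state": "UNKNOWN"}


def infer_schema(func) -> dict[str, Any]:
    """Build a tool schema from a function's signature and docstring."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        # Unknown or missing annotations fall back to "string".
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value means the argument is mandatory
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


schema = infer_schema(get_alarm_state)
print(schema["name"], schema["parameters"]["required"])
```

The appeal of this style is that adding a tool means writing one typed, documented function, with no separate schema file to keep in sync.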

| | OpenDevOps | AWS DevOps Agent / Q Developer |
| --- | --- | --- |
| LLM | Any (LiteLLM, Claude Code, Ollama) | Bedrock-managed only |
| Cloud coverage | AWS + Azure (more coming) | AWS only |
| Data location | Your DB / VPC | AWS-managed, not portable |
| Customization | Open source; modify anything | Closed product |
| Pricing | LLM at retail (or $0 via Ollama / Claude Code) | Per-investigation + Bedrock markup |
| Self-host | Docker / Railway / on-prem / air-gapped | No |

When AWS is the better pick: if you're 100% AWS, never plan to leave, and want zero infrastructure to run, Amazon Q Developer's native Console integration and AWS-only signals (Trusted Advisor, AWS Config, Compute Optimizer) are hard to beat. OpenDevOps is for everyone else.

What's inside

LangChain DeepAgents as the agent framework — planning, tool orchestration, and session memory out of the box

21 read-only AWS tools across CloudWatch (6), CloudTrail (2), ECS (4), Lambda (4), EC2 (2), RDS (2), IAM (1), plus bash escape hatch, cross-session history analytics, skills, and submit_investigation — plain Python functions, schemas inferred automatically

Azure support (CLI-first) — investigates Azure through the read-only az CLI + kubectl (for AKS) and a set of Azure runbook skills (AKS debugging, App Service errors, Azure Monitor/KQL, VM diagnostics) — no separate SDK tools needed. Read-only; connect via a service principal or az login — see apps/documentation/azure_setup.md

Sandboxed bash execution tool — agent can run whitelisted read-only AWS CLI (aws), Azure CLI (az), kubectl, and docker commands as a last resort when the structured tools fall short; every command validated against an allowlist before execution; never uses shell=True; hard 30-second timeout

Includes CloudWatch Logs Insights (query_logs_insights) — full query language support: fields, filter, stats, sort, limit; results include scanned MB

Streaming responses — FastAPI SSE endpoint streams agent tokens in real time as the LLM reasons; tool calls appear as they complete

Event-driven incident detection — EventBridge → SQS → long-poll consumer; 9 EventBridge rules cover CloudWatch alarms, ECS task failures, Lambda async errors, RDS events, EC2 state changes, CodePipeline failures, and AWS Health events; uses a DLQ plus database-backed incident claims to avoid duplicate investigations; runs alongside the metric poller — see apps/documentation/event_detection.md

Context enrichment — before the LLM runs, deterministic boto3 calls fetch facts about the affected resource (alarm details, recent logs, function config, etc.) to reduce tool call count and speed up investigations

Monitoring dashboard — live incident feed showing all event-driven investigations: confidence level (or FAILED badge), affected service, root cause summary; each alert links back to its original investigation session via View investigation so you can follow up without losing context; real-time SSE push keeps the page live without polling — see apps/documentation/monitoring.md

AWS Configuration settings tab — admin-only editable tab in Settings for SQS Queue URL and AWS Region; shared org-wide via database-backed app config; includes an inline IAM permission checker per service

Web UI — React + Vite SPA served by FastAPI:

Chat page — streaming responses, collapsible tool call inspector, cost/latency card, stop button; supports ?prompt= deeplink for pre-seeded investigations from the Monitoring dashboard

Session history sidebar — lists all past conversations; click any to resume with full tool call inspector and cost card restored; new chat and delete (soft) buttons

Monitoring page — live incident feed from event-driven detection; alert detail with investigate deeplink

Dashboard — session counts, tool call stats, cost/latency, context saved, activity chart, service breakdown, root cause distribution, recent sessions

History page — keyword search across all past sessions

Settings page — AWS Configuration (editable, admin-only), Environment (read-only env vars), Agent config, Integrations

Team page — admin-only user management: add, remove, and change roles

Auth & RBAC — optional password-based auth with admin and user roles; JWT tokens; first registered user auto-becomes admin; disabled by default (set JWT_SECRET to enable) — see apps/documentation/auth.md

Three storage backends — pick one via CHECKPOINT_BACKEND in .env; see apps/documentation/databases.md

memory — zero config, no persistence; great for CI and quick testing; autonomous polling/event monitoring is disabled in this mode

sqlite — local file, no external services; recommended for single-server and personal use

postgres — full production persistence via psycopg3 + AsyncPostgresSaver

Schema: users, sessions, messages, tool_calls, usage_events — see apps/documentation/schema.md

Soft delete: deleted sessions are hidden immediately, but their data is kept until the 30-day cleanup job removes it

Structured logging via Loguru — used consistently across all modules (tools, agent, API, CLI); every request shows agent reasoning, tool calls with args/results, and a done summary with latency + token counts

CLI — devops-agent investigate, ask, and report commands powered by the same agent

Any LLM via LiteLLM — OpenAI, Anthropic, OpenRouter, Groq, Gemini, Mistral, Ollama (local / air-gapped), or any OpenAI-compatible endpoint. Auto-detects local Claude Code subscription (~/.claude OAuth) so a Max/Pro plan can power the agent at zero incremental cost. Swap models via a single env var (LLM_MODEL) — no code changes
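The sandboxed bash execution tool in the list above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the allowlist entries and the names `validate_command` and `run_readonly` are hypothetical. It illustrates only the three properties the README names: validation against an allowlist before execution, no shell=True, and a hard 30-second timeout.

```python
import shlex
import subprocess

# Hypothetical allowlist of permitted argv prefixes (read-only operations only).
ALLOWED_PREFIXES = (
    ("aws", "ec2", "describe-instances"),
    ("aws", "logs", "describe-log-groups"),
    ("az", "vm", "list"),
    ("kubectl", "get"),
    ("kubectl", "describe"),
    ("docker", "ps"),
)
TIMEOUT_SECONDS = 30  # hard timeout stated in the README


def validate_command(command: str) -> list[str]:
    """Split a command into argv and reject anything outside the allowlist."""
    argv = shlex.split(command)
    for prefix in ALLOWED_PREFIXES:
        # The command must start with an allowed prefix, token for token.
        if tuple(argv[: len(prefix)]) == prefix:
            return argv
    raise PermissionError(f"command not in allowlist: {command!r}")


def run_readonly(command: str) -> str:
    """Validate first, then execute without a shell so metacharacters stay inert."""
    argv = validate_command(command)
    result = subprocess.run(
        argv, shell=False, capture_output=True, text=True, timeout=TIMEOUT_SECONDS
    )
    return result.stdout
```

Matching on argv prefixes rather than substrings means an argument value such as `--name list-foo` cannot smuggle a forbidden operation past the check, and running without a shell means `;`, `|`, and `$(...)` are passed through as literal arguments instead of being interpreted.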

Quick Start

  1. Install dependencies

cd apps/backend && uv sync

  2. Configure environment

cp .env.example .env

Edit .env — add your OPENROUTER_API_KEY and set AWS_PROFILE

  3. Set up AWS profile

aws configure --profile devops-agent-readonly

AWS Access Key ID: your_key_id

AWS Secret Access Key: your_secret_key

Default region: us-east-1

Default output format: json

Verify

aws sts get-caller-identity --profile devops-agent-readonly

  4. Choose a storage backend

Three options — pick one and add it to .env. Full details in apps/documentation/databases.md.

Memory (default — zero config, nothing persists on restart)

CHECKPOINT_BACKEND=memory

SQLite (recommended for local dev — persists to a file, no external service needed)

CHECKPOINT_BACKEND=sqlite
SQLITE_PATH=./data/agent.db # created automatically on first start

PostgreSQL (recommended for production)

Start Postgres with Docker

docker run -d --name opendevops-pg \
  -e POSTGRES_DB=opendevops \
  -e POSTGRES_USER=dev \
  -e POSTGRES_PASSWORD=dev \
  -p 5433:5432 \
  postgres:16

Add to .env

CHECKPOINT_BACKEND=postgres
DATABASE_URL=postgresql://dev:dev@localhost:5433/opendevops

Create app tables (safe to re-run)

cd apps/backend && uv run migrate
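The three backend options above amount to a small configuration contract. The sketch below validates that contract the way a startup check might. `StorageConfig` and `load_storage_config` are hypothetical names, and both the SQLite default path and the rule that postgres requires DATABASE_URL are assumptions drawn from the example values above, not from the project's code.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_BACKENDS = ("memory", "sqlite", "postgres")


@dataclass(frozen=True)
class StorageConfig:
    backend: str
    sqlite_path: Optional[str] = None
    database_url: Optional[str] = None


def load_storage_config(env: Mapping[str, str]) -> StorageConfig:
    """Read CHECKPOINT_BACKEND and its companion setting from an env mapping."""
    backend = env.get("CHECKPOINT_BACKEND", "memory")  # README: memory is the default
    if backend not in VALID_BACKENDS:
        raise ValueError(f"CHECKPOINT_BACKEND must be one of {VALID_BACKENDS}, got {backend!r}")
    if backend == "sqlite":
        # Default path mirrors the README example value; assumed, not confirmed.
        return StorageConfig(backend, sqlite_path=env.get("SQLITE_PATH", "./data/agent.db"))
    if backend == "postgres":
        url = env.get("DATABASE_URL")
        if not url:
            raise ValueError("postgres backend needs DATABASE_URL")
        return StorageConfig(backend, database_url=url)
    return StorageConfig(backend)


config = load_storage_config({
    "CHECKPOINT_BACKEND": "postgres",
    "DATABASE_URL": "postgresql://dev:dev@localhost:5433/opendevops",
})
print(config.backend, config.database_url)
```

In practice the mapping passed in would be `os.environ`. Taking it as a parameter keeps the check easy to exercise with plain dictionaries.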

  5. Run

Option A — Docker Compose (recommended, AWS CLI included)

docker compose -f deployment/docker-compose/docker-compose.yml up --build

Backend: http://localhost:8000

Frontend: http://localhost:80

Postgres (host): localhost:5433

The backend image installs AWS CLI v2 automatically — the bash execution tool works out of the box. Host AWS credentials (~/.aws) are mounted read-only into the container. For production on AWS, remove the volume mount and attach an IAM role to the instance/task instead.

Option B — Local dev (two terminals)

Terminal 1 — FastAPI backend with hot reload

cd apps/backend && uv run dev

Terminal 2 — React frontend (Vite dev server with HMR)

cd apps/frontend && npm run dev

Open http://localhost:5173

Note: local dev requires aws CLI installed on your machine for the bash tool to work. Install it from https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

CLI

cd apps/backend

Investigate an incident

uv run devops-agent investigate "high error rate on my payment Lambda"

With alarm and service hints

uv run devops-agent investigate "latency spike" --alarm HighLatencyAlarm --service api-service

Freeform Q&A

uv run devops-agent ask "why would a Lambda function suddenly start throttling?"

Daily ops health report

uv run devops-agent report

AWS IAM Setup

The agent needs read access across your AWS account, plus optional write access scoped to opendevops-* resources if you use the event-driven monitoring setup wizard. Two least-privilege policies (Operational + Setup) and full step-by-step instructions are in apps/documentation/iam_setup.md.
