OpenDevOps – An open-source AI agent that investigates AWS/Azure incidents
OpenDevOps is an open-source multi-cloud DevOps agent that supports AWS and Azure, integrates any LLM via LiteLLM, and investigates cloud incidents to find root causes and provide mitigation plans. It is ~10x cheaper than AWS DevOps Agent, self-hosted, auditable, and customizable.
Open-source multi-cloud DevOps agent (AWS + Azure). Bring any LLM via LiteLLM — OpenAI, Anthropic, OpenRouter, Groq, Gemini, Mistral, Ollama for air-gapped / regulated environments, or reuse your existing Claude Code subscription (auto-detected). Investigates incidents, finds root causes, and gives actionable mitigation plans — without the cloud-vendor DevOps-agent price tag.
📊 Benchmarked, not just demoed
On a reproducible 10-incident suite (real AWS + Azure resources, scored against ground truth), running on a commodity open model (gpt-oss-120b — no frontier model required):
| Root causes found | Median time | Cost / investigation | vs. AWS DevOps Agent | vs. manual triage |
| --- | --- | --- | --- | --- |
| 9 / 10 (90%) | ~52 s | ~$0.03 | ~10× cheaper¹ (~$0.03 vs ~$0.43) | ~1,000× cheaper² |
~$0.03 of compute replaces ~$50 of engineer toil — and costs a fraction of a managed cloud DevOps agent — while returning the answer in under a minute instead of half an hour. Reproduce it with make eval → full benchmark & methodology.
¹ vs. AWS DevOps Agent — its per-second rate applied to the same wall-clock time (~$0.43/investigation); verify against AWS's published pricing. ² Illustrative unit economics vs. ~20–40 min of on-call triage. Cost shown is the provider-dashboard actual; see caveats.
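The ratios quoted above can be re-derived from the rounded figures in the table. The sketch below is plain arithmetic on those published approximations. The per-second rate it prints is only the rate implied by ~$0.43 over ~52 s, not a figure taken from AWS's price list, and the variable names are illustrative.

```python
# Rounded figures copied from the benchmark table and footnotes above.
cost_opendevops = 0.03      # USD per investigation
cost_aws_agent = 0.43       # USD per investigation (footnote 1)
cost_manual = 50.0          # USD of engineer time (footnote 2, illustrative)
median_seconds = 52         # median wall-clock time per investigation

ratio_vs_aws = cost_aws_agent / cost_opendevops      # roughly 14
ratio_vs_manual = cost_manual / cost_opendevops      # roughly 1,667

# Rate implied by the two rounded numbers above; verify against AWS pricing.
implied_aws_rate_per_second = cost_aws_agent / median_seconds

# Hourly engineer cost implied by ~$50 for 20 to 40 minutes of triage.
implied_hourly_low = cost_manual / (40 / 60)         # 75 USD/hour
implied_hourly_high = cost_manual / (20 / 60)        # 150 USD/hour

print(f"vs. AWS DevOps Agent: {ratio_vs_aws:.1f}x")
print(f"vs. manual triage:    {ratio_vs_manual:.0f}x")
print(f"implied AWS rate:     ${implied_aws_rate_per_second:.4f}/s")
```

On these inputs the headline "~10×" and "~1,000×" are rounded down from the computed ratios, so they are conservative readings of the same data.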
Cloud setup: AWS (IAM) · Azure (service principal / login)
Demo
Autonomous incident detection — a crashing Lambda is caught automatically, the agent reads the traceback from CloudWatch Logs, finds the root cause, surfaces it on the Monitoring dashboard, and posts the mitigation to Slack. No human in the loop.
Why OpenDevOps?
Amazon Q Developer and the AWS DevOps Agent are excellent if you live entirely inside the AWS Console with Bedrock-managed models. OpenDevOps is the open-source alternative for everyone else:
Any LLM, not just Bedrock. LiteLLM-compatible — OpenAI, Anthropic direct, OpenRouter, Groq, Gemini, Mistral, or run Ollama locally for air-gapped / regulated environments. Auto-detects your existing Claude Code subscription so you pay zero incremental LLM cost if you're already on a Max/Pro plan.
Multi-cloud out of the box. AWS + Azure investigations in the same chat (one organization can connect both clouds at once). AWS-only agents stop at the AWS perimeter.
Your data stays in your database. Investigations, prompts, and tool outputs persist in your Postgres or SQLite — your VPC, your retention, your encryption. Matters for HIPAA, PCI, FedRAMP, and EU AI Act audits.
Fully auditable. Every prompt, tool call (args + result), and token is open and streamed live to the UI; nothing is hidden. The AWS DevOps Agent, by contrast, is a closed black box.
Customizable. Add tools as plain Python functions, add runbooks by dropping a SKILL.md file, modify the system prompt. Fork it if you need to.
Investigate from anywhere. Built-in MCP server makes it usable from Claude Desktop, Cursor, or any MCP client — not just the AWS Console.
| | OpenDevOps | AWS DevOps Agent / Q Developer |
| --- | --- | --- |
| LLM | Any (LiteLLM, Claude Code, Ollama) | Bedrock-managed only |
| Cloud coverage | AWS + Azure (more coming) | AWS only |
| Data location | Your DB / VPC | AWS-managed, not portable |
| Customization | Open source, modify anything | Closed product |
| Pricing | LLM at retail (or $0 via Ollama / Claude Code) | Per-investigation + Bedrock markup |
| Self-host | Docker / Railway / on-prem / air-gapped | No |
When AWS is the better pick: if you're 100% AWS, never plan to leave, and want zero infrastructure to run, Amazon Q Developer's native Console integration and AWS-only signals (Trusted Advisor, AWS Config, Compute Optimizer) are hard to beat. OpenDevOps is for everyone else.
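To make the "add tools as plain Python functions" point concrete, here is a minimal sketch of what such a tool might look like and how a schema can be inferred from its signature. The tool `get_queue_depth` and the helper `infer_schema` are hypothetical names invented for illustration. In the project itself the agent framework does the schema inference, and its actual mechanism may differ from this stdlib-only version.

```python
import inspect
from typing import get_type_hints

# Map Python annotations to JSON-schema-style type names (illustrative subset).
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def get_queue_depth(queue_name: str, region: str = "us-east-1") -> int:
    """Return the approximate number of messages waiting in a queue.

    Hypothetical tool: a real one would make a read-only cloud API call.
    """
    return 0  # placeholder so the sketch runs without credentials


def infer_schema(func) -> dict:
    """Build a tool description from a function's signature and docstring."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        # Unknown or missing annotations fall back to "string".
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    doc = inspect.getdoc(func) or ""
    return {
        "name": func.__name__,
        "description": doc.splitlines()[0] if doc else "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


schema = infer_schema(get_queue_depth)
```

The design point is that type hints and the docstring carry everything the LLM needs, so adding a tool does not require a separate schema file.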
What's inside
LangChain DeepAgents as the agent framework — planning, tool orchestration, and session memory out of the box
21 read-only AWS tools across CloudWatch (6), CloudTrail (2), ECS (4), Lambda (4), EC2 (2), RDS (2), IAM (1), plus bash escape hatch, cross-session history analytics, skills, and submit_investigation — plain Python functions, schemas inferred automatically
Azure support (CLI-first) — investigates Azure through the read-only az CLI + kubectl (for AKS) and a set of Azure runbook skills (AKS debugging, App Service errors, Azure Monitor/KQL, VM diagnostics) — no separate SDK tools needed. Read-only; connect via a service principal or az login — see apps/documentation/azure_setup.md
Sandboxed bash execution tool — agent can run whitelisted read-only AWS CLI (aws), Azure CLI (az), kubectl, and docker commands as a last resort when the structured tools fall short; every command validated against an allowlist before execution; never uses shell=True; hard 30-second timeout
Includes CloudWatch Logs Insights (query_logs_insights) — full query language support: fields, filter, stats, sort, limit; results include scanned MB
Streaming responses — FastAPI SSE endpoint streams agent tokens in real time as the LLM reasons; tool calls appear as they complete
Event-driven incident detection — EventBridge → SQS → long-poll consumer; 9 EventBridge rules cover CloudWatch alarms, ECS task failures, Lambda async errors, RDS events, EC2 state changes, CodePipeline failures, and AWS Health events; uses a DLQ plus database-backed incident claims to avoid duplicate investigations; runs alongside the metric poller — see apps/documentation/event_detection.md
Context enrichment — before the LLM runs, deterministic boto3 calls fetch facts about the affected resource (alarm details, recent logs, function config, etc.) to reduce tool call count and speed up investigations
Monitoring dashboard — live incident feed showing all event-driven investigations: confidence level (or FAILED badge), affected service, root cause summary; each alert links back to its original investigation session via View investigation so you can follow up without losing context; real-time SSE push keeps the page live without polling — see apps/documentation/monitoring.md
AWS Configuration settings tab — admin-only editable tab in Settings for SQS Queue URL and AWS Region; shared org-wide via database-backed app config; includes an inline IAM permission checker per service
Web UI — React + Vite SPA served by FastAPI:
Chat page — streaming responses, collapsible tool call inspector, cost/latency card, stop button; supports ?prompt= deeplink for pre-seeded investigations from the Monitoring dashboard
Session history sidebar — lists all past conversations; click any to resume with full tool call inspector and cost card restored; new chat and delete (soft) buttons
Monitoring page — live incident feed from event-driven detection; alert detail with investigate deeplink
Dashboard — session counts, tool call stats, cost/latency, context saved, activity chart, service breakdown, root cause distribution, recent sessions
History page — keyword search across all past sessions
Settings page — AWS Configuration (editable, admin-only), Environment (read-only env vars), Agent config, Integrations
Team page — admin-only user management: add, remove, and change roles
Auth & RBAC — optional password-based auth with admin and user roles; JWT tokens; first registered user auto-becomes admin; disabled by default (set JWT_SECRET to enable) — see apps/documentation/auth.md
Three storage backends — pick one via CHECKPOINT_BACKEND in .env; see apps/documentation/databases.md
memory — zero config, no persistence; great for CI and quick testing; autonomous polling/event monitoring is disabled in this mode
sqlite — local file, no external services; recommended for single-server and personal use
postgres — full production persistence via psycopg3 + AsyncPostgresSaver
Schema: users, sessions, messages, tool_calls, usage_events — see apps/documentation/schema.md
Soft delete: deleted sessions are hidden immediately, but their data is preserved until the 30-day cleanup job removes it
Structured logging via Loguru — used consistently across all modules (tools, agent, API, CLI); every request shows agent reasoning, tool calls with args/results, and a done summary with latency + token counts
CLI — devops-agent investigate, ask, and report commands powered by the same agent
Any LLM via LiteLLM — OpenAI, Anthropic, OpenRouter, Groq, Gemini, Mistral, Ollama (local / air-gapped), or any OpenAI-compatible endpoint. Auto-detects local Claude Code subscription (~/.claude OAuth) so a Max/Pro plan can power the agent at zero incremental cost. Swap models via a single env var (LLM_MODEL) — no code changes
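The sandboxed bash execution tool in the list above combines three safeguards: an allowlist check before execution, no `shell=True`, and a hard 30-second timeout. The sketch below shows one way those pieces can fit together using only the standard library. The allowlist entries, the `validate` and `run_readonly` names, and the metacharacter filter are assumptions made for illustration, not the project's actual rules.

```python
import shlex
import subprocess

# Hypothetical allowlist: a command must start with one of these prefixes.
ALLOWED_PREFIXES = [
    ("aws", "sts", "get-caller-identity"),
    ("aws", "logs", "describe-log-groups"),
    ("az", "vm", "list"),
    ("kubectl", "get"),
    ("docker", "ps"),
]
# Characters that only matter to a shell; rejected as defense in depth.
FORBIDDEN = set(";|&><`$\n")
TIMEOUT_SECONDS = 30


def validate(command: str) -> list[str]:
    """Return the parsed argv if the command is allowed, else raise ValueError."""
    if FORBIDDEN & set(command):
        raise ValueError("shell metacharacters are not allowed")
    argv = shlex.split(command)  # raises ValueError on unbalanced quotes
    if not any(tuple(argv[: len(prefix)]) == prefix for prefix in ALLOWED_PREFIXES):
        raise ValueError("command is not on the allowlist")
    return argv


def run_readonly(command: str) -> str:
    """Validate, then execute without a shell and with a hard timeout."""
    argv = validate(command)
    # Passing a list with the default shell=False means no shell parses the input.
    # subprocess.TimeoutExpired is raised if the command exceeds the timeout.
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=TIMEOUT_SECONDS
    )
    return result.stdout if result.returncode == 0 else result.stderr
```

Matching on full prefixes rather than on the binary name alone matters here: allowing every `aws ecs ...` command would also allow mutating calls. The blanket metacharacter filter is deliberately strict and would also reject some legitimate quoted query expressions, which is a trade-off a real implementation may handle differently.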
Quick Start
- Install dependencies
cd apps/backend && uv sync
- Configure environment
cp .env.example .env
Edit .env — add your OPENROUTER_API_KEY and set AWS_PROFILE
- Set up AWS profile
aws configure --profile devops-agent-readonly
AWS Access Key ID: your_key_id
AWS Secret Access Key: your_secret_key
Default region: us-east-1
Default output format: json
Verify
aws sts get-caller-identity --profile devops-agent-readonly
- Choose a storage backend
Three options — pick one and add it to .env. Full details in apps/documentation/databases.md.
Memory (default — zero config, nothing persists on restart)
CHECKPOINT_BACKEND=memory
SQLite (recommended for local dev — persists to a file, no external service needed)
CHECKPOINT_BACKEND=sqlite
SQLITE_PATH=./data/agent.db # created automatically on first start
PostgreSQL (recommended for production)
Start Postgres with Docker
docker run -d --name opendevops-pg \
  -e POSTGRES_DB=opendevops \
  -e POSTGRES_USER=dev \
  -e POSTGRES_PASSWORD=dev \
  -p 5433:5432 \
  postgres:16
Add to .env
CHECKPOINT_BACKEND=postgres
DATABASE_URL=postgresql://dev:dev@localhost:5433/opendevops
Create app tables (safe to re-run)
cd apps/backend && uv run migrate
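A quick way to catch a misconfigured `.env` before starting the server is to check the three storage variables together. The sketch below is a hypothetical pre-flight check, not the backend's own loader. `resolve_backend` is an invented name, and the fallback SQLite path simply mirrors the example value above rather than a documented default.

```python
import os
from typing import Mapping

VALID_BACKENDS = ("memory", "sqlite", "postgres")


def resolve_backend(env: Mapping[str, str] = os.environ) -> dict:
    """Validate the storage settings and return them as a small dict."""
    # The README describes memory as the default backend.
    backend = env.get("CHECKPOINT_BACKEND", "memory").strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(
            f"CHECKPOINT_BACKEND must be one of {VALID_BACKENDS}, got {backend!r}"
        )
    if backend == "sqlite":
        # Assumed fallback path, taken from the example value in this section.
        return {"backend": backend, "path": env.get("SQLITE_PATH", "./data/agent.db")}
    if backend == "postgres":
        url = env.get("DATABASE_URL")
        if not url:
            raise ValueError("DATABASE_URL is required when CHECKPOINT_BACKEND=postgres")
        return {"backend": backend, "url": url}
    return {"backend": backend}  # memory: nothing persists across restarts
```

Calling `resolve_backend()` with no arguments reads the current process environment; passing a plain dict makes the check easy to unit-test.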
- Run
Option A — Docker Compose (recommended, AWS CLI included)
docker compose -f deployment/docker-compose/docker-compose.yml up --build
Backend: http://localhost:8000
Frontend: http://localhost:80
Postgres (host): localhost:5433
The backend image installs AWS CLI v2 automatically — the bash execution tool works out of the box. Host AWS credentials (~/.aws) are mounted read-only into the container. For production on AWS, remove the volume mount and attach an IAM role to the instance/task instead.
Option B — Local dev (two terminals)
Terminal 1 — FastAPI backend with hot reload
cd apps/backend && uv run dev
Terminal 2 — React frontend (Vite dev server with HMR)
cd apps/frontend && npm run dev
Open http://localhost:5173
Note: local dev requires aws CLI installed on your machine for the bash tool to work. Install it from https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
CLI
cd apps/backend
Investigate an incident
uv run devops-agent investigate "high error rate on my payment Lambda"
With alarm and service hints
uv run devops-agent investigate "latency spike" --alarm HighLatencyAlarm --service api-service
Freeform Q&A
uv run devops-agent ask "why would a Lambda function suddenly start throttling?"
Daily ops health report
uv run devops-agent report
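If you want to trigger investigations from a script (a cron job or a CI step, for example), the CLI can be wrapped from Python. This is a minimal sketch: the helper names and the `backend_dir` default are assumptions, only the `investigate` command and its `--alarm` and `--service` flags come from the examples above, and the returned text is simply whatever the CLI writes to stdout.

```python
import subprocess
from typing import Optional


def build_investigate_argv(
    description: str,
    alarm: Optional[str] = None,
    service: Optional[str] = None,
) -> list[str]:
    """Assemble the argument list for `devops-agent investigate`."""
    argv = ["uv", "run", "devops-agent", "investigate", description]
    if alarm:
        argv += ["--alarm", alarm]
    if service:
        argv += ["--service", service]
    return argv


def investigate(
    description: str,
    alarm: Optional[str] = None,
    service: Optional[str] = None,
    backend_dir: str = "apps/backend",  # assumed relative to the repo root
) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        build_investigate_argv(description, alarm, service),
        cwd=backend_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the description as a single list element avoids any shell quoting issues with spaces or punctuation in the incident text.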
AWS IAM Setup
The agent needs read access across your AWS account, plus optional write access scoped to opendevops-* resources if you use the event-driven monitoring setup wizard. Two least-privilege policies (Operational + Setup) and full step-by-step instructions are in apps/documentation/iam_setup.md.