AI & Automation

DevOps & Incident Response Bot

Slack-native agent that triages alerts, runs runbooks, and opens PRs for known fixes.

Client: Financial Services
Duration: 10 weeks
Team: 4
Year: 2025
MTTR: -48%
Pages: -50%
Audited: 100%

The Challenge

On-call engineers were paged constantly for known issues with documented runbooks, and MTTR was creeping up as the system grew.

Our Solution

We built a Slack-native LangGraph agent that subscribes to Datadog alerts, matches each alert to a runbook, executes safe remediation steps, and opens GitHub PRs for code-level fixes. It escalates to a human only when its confidence in the match is low.
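The match-then-escalate decision can be sketched in plain Python. This is an illustrative sketch only, not the production agent: the keyword-overlap score, the `ESCALATION_THRESHOLD` value, and the names `Runbook`, `Decision`, and `triage` are all assumptions made for the example, and the case study does not say how confidence was actually computed.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold; the case study does not state the real value.
ESCALATION_THRESHOLD = 0.8


@dataclass(frozen=True)
class Runbook:
    name: str
    keywords: frozenset  # lowercase terms expected in a matching alert title


@dataclass(frozen=True)
class Decision:
    action: str               # "remediate" or "escalate"
    runbook: Optional[str]    # best candidate, if any
    confidence: float


def match_confidence(alert_title: str, runbook: Runbook) -> float:
    """Fraction of the runbook's keywords that appear in the alert title."""
    if not runbook.keywords:
        return 0.0
    words = set(alert_title.lower().split())
    return len(words & runbook.keywords) / len(runbook.keywords)


def triage(alert_title: str, runbooks: list) -> Decision:
    """Pick the best-matching runbook, or escalate when confidence is low."""
    best, best_score = None, 0.0
    for rb in runbooks:
        score = match_confidence(alert_title, rb)
        if score > best_score:
            best, best_score = rb, score
    if best is None or best_score < ESCALATION_THRESHOLD:
        # Low confidence: hand off to a human, passing along the best guess.
        return Decision("escalate", best.name if best else None, best_score)
    return Decision("remediate", best.name, best_score)
```

With two runbooks keyed on `{"disk", "full"}` and `{"pod", "crashloop"}`, an alert titled "Disk full on db-1" is routed to remediation, while "disk latency high" matches only half of the first runbook's keywords and is escalated with that runbook as a hint.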

Key Features

  • Slack-native triage
  • Datadog alert subscription
  • Runbook execution engine
  • GitHub PR drafting
  • Confidence-based escalation
  • Full audit trail
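One way to picture the audit trail feature is an append-only log with one record per automated action. The sketch below is a hypothetical illustration: the field names, the `AuditLog` class, and the JSON Lines export are assumptions, and a real deployment would write to durable, tamper-resistant storage rather than memory.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    timestamp: str   # UTC, ISO 8601
    alert_id: str
    runbook: str
    action: str
    outcome: str


class AuditLog:
    """Append-only, in-memory log of automated actions (illustrative only)."""

    def __init__(self):
        self._records = []

    def record(self, alert_id: str, runbook: str, action: str, outcome: str) -> AuditRecord:
        """Store one immutable record and return it."""
        entry = AuditRecord(
            timestamp=datetime.now(timezone.utc).isoformat(),
            alert_id=alert_id,
            runbook=runbook,
            action=action,
            outcome=outcome,
        )
        self._records.append(entry)
        return entry

    def export_jsonl(self) -> str:
        """Serialize all records as JSON Lines, one action per line."""
        return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in self._records)
```

Because records are frozen dataclasses and the log exposes no update or delete method, each automated step leaves exactly one entry that reviewers can export and inspect.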

Our Process

  1. Runbook capture

    Catalogued 80+ existing runbooks.

  2. Agent design

    LangGraph state machine per incident type.

  3. Safety

    Read-only first, then guarded write actions.

  4. Rollout

    Enabled per service after dry-run validation.
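The safety and rollout steps above can be sketched as a small state machine in plain Python. This does not use the LangGraph API and is not the project's code: the state names, the `LIVE_SERVICES` allow-list, the service names, and the `approve` callback are all hypothetical, chosen only to show read-only steps running freely while write steps are either dry-run or gated.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical allow-list: services where write actions run for real.
# Every other service stays in dry-run mode.
LIVE_SERVICES = {"payments-api"}


@dataclass
class Step:
    name: str
    run: Callable[[], str]
    writes: bool = False  # True if the step changes system state


@dataclass
class IncidentRun:
    service: str
    steps: list
    state: str = "triage"
    log: list = field(default_factory=list)


def execute(run: IncidentRun, approve: Callable[[Step], bool]) -> IncidentRun:
    """Walk the runbook: read-only steps always run, write steps are guarded."""
    run.state = "remediating"
    skipped_writes = False
    for step in run.steps:
        if step.writes and run.service not in LIVE_SERVICES:
            # Service not yet enabled: record what would happen, change nothing.
            run.log.append(f"DRY-RUN {step.name}")
            skipped_writes = True
            continue
        if step.writes and not approve(step):
            # Guard rejected the write: stop and hand off to a human.
            run.log.append(f"BLOCKED {step.name}")
            run.state = "escalated"
            return run
        run.log.append(f"{step.name}: {step.run()}")
    run.state = "dry_run_complete" if skipped_writes else "resolved"
    return run
```

In this sketch, enabling a service after dry-run validation amounts to adding it to the allow-list, and the `approve` callback stands in for whatever policy check guards write actions.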

Results

  • MTTR down 48%
  • On-call paging volume cut in half
  • Auditable trail for every automated action

"On-call is humane again. The bot handles the boring 80% and only wakes us for the real ones."
SRE Lead, Financial Services
