# Incident Triage

> Using OpenHands to investigate and resolve production incidents

<Card title="View Example Workflow" icon="github" href="https://github.com/OpenHands/software-agent-sdk/tree/main/examples/03_github_workflows/04_datadog_debugging">
  Check out the complete Datadog debugging workflow with ready-to-use code and configuration.
</Card>

When production incidents occur, speed matters. OpenHands can help you investigate issues, analyze logs and errors, identify root causes, and generate fixes, which reduces your mean time to resolution (MTTR).

<Note>
  This guide is based on our blog post [Debugging Production Issues with AI Agents: Automating Datadog Error Analysis](https://openhands.dev/blog/debugging-production-issues-with-ai-agents-automating-datadog-error-analysis).
</Note>

## Overview

Running a production service is **hard**. Errors and bugs crop up due to product updates, infrastructure changes, or unexpected user behavior. When these issues arise, it's critical to identify and fix them quickly to minimize downtime and maintain user trust, but doing so is challenging, especially at scale.

What if AI agents could handle the initial investigation automatically? Engineers would then start from a detailed report of the issue, including root cause analysis and specific recommendations for fixes, instead of starting from a raw alert.

OpenHands accelerates incident response by:

* **Automated error analysis**: Investigate errors and produce detailed reports
* **Root cause identification**: Connect symptoms to underlying issues in your codebase
* **Fix recommendations**: Generate specific, actionable recommendations for resolving issues
* **Integration with monitoring tools**: Work directly with platforms like Datadog

## Automated Datadog Error Analysis

The [OpenHands Software Agent SDK](https://github.com/OpenHands/software-agent-sdk) lets you build autonomous AI agents that integrate with monitoring platforms like Datadog. A ready-to-use [GitHub Actions workflow](https://github.com/OpenHands/software-agent-sdk/tree/main/examples/03_github_workflows/04_datadog_debugging) demonstrates how to automate error analysis.

### How It Works

[Datadog](https://www.datadoghq.com/) is a popular monitoring and analytics platform that provides comprehensive error tracking capabilities. It aggregates logs, metrics, and traces from your applications, making it easier to identify and investigate issues in production.

[Datadog's Error Tracking](https://www.datadoghq.com/error-tracking/) groups similar errors together and shows their occurrences, stack traces, and affected services. OpenHands can automatically analyze these grouped errors and produce an investigation report for each one.

### Triggering Automated Debugging

The GitHub Actions workflow can be triggered in two ways:

1. **Search Query**: Provide a search query (e.g., "JSONDecodeError") to find all recent errors matching that pattern. This is useful for investigating categories of errors.

2. **Specific Error ID**: Provide a specific Datadog Error Tracking ID to deep-dive into a known issue. You can copy the error ID from Datadog's Error Tracking UI using the "Actions" button.
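As a rough illustration of the two trigger modes, the sketch below builds a `workflow_dispatch` request against GitHub's REST API using only the standard library. The input names `query` and `error_id`, the workflow file name `datadog-debugging.yml`, and the `my-org/my-repo` repository are assumptions made for this example; check the `inputs:` block of the workflow file you copied for the real names.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner, repo, workflow_file, token, *,
                           query=None, error_id=None, ref="main"):
    """Build (but do not send) a workflow_dispatch request for one trigger mode."""
    if (query is None) == (error_id is None):
        raise ValueError("provide exactly one of query or error_id")
    # Hypothetical input names; match them to your workflow's `inputs:` block.
    inputs = {"query": query} if query is not None else {"error_id": error_id}
    url = (f"{GITHUB_API}/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow_file}/dispatches")
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Search-query mode; pass error_id="..." instead for a specific error.
request = build_dispatch_request(
    "my-org", "my-repo", "datadog-debugging.yml", "<token>",
    query="JSONDecodeError",
)
# urllib.request.urlopen(request) would send it.
```

Building the request separately from sending it keeps the example safe to run and makes the payload easy to inspect before you trigger a real run.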

### Automated Investigation Process

When the workflow runs, it automatically performs the following steps:

1. Fetch detailed error information from the Datadog API
2. Create or find an existing GitHub issue to track the error
3. Clone all relevant repositories to get full code context
4. Run an OpenHands agent to analyze the error and investigate the code
5. Post the findings as a comment on the GitHub issue

The agent identifies the exact file and line number where errors originate, determines root causes, and provides specific recommendations for fixes.
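The five steps above can be sketched as a small pipeline. This is not the code from the example workflow: `Issue`, `find_or_create_issue`, and `triage` are hypothetical names, and the Datadog, Git, and agent calls are passed in as plain callables so the control flow can be read (and run) without any external service.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    title: str
    comments: list = field(default_factory=list)


def find_or_create_issue(issues, error_id):
    """Step 2: reuse the issue that already tracks this error, or open a new one."""
    marker = f"[{error_id}]"
    for issue in issues:
        if marker in issue.title:
            return issue
    issue = Issue(title=f"Datadog error {marker}")
    issues.append(issue)
    return issue


def triage(error_id, issues, fetch_error, clone_repos, run_agent):
    details = fetch_error(error_id)                  # 1. get error details
    issue = find_or_create_issue(issues, error_id)   # 2. find or create the issue
    workspace = clone_repos()                        # 3. clone repositories
    report = run_agent(details, workspace)           # 4. run the agent
    issue.comments.append(report)                    # 5. post findings as a comment
    return issue


issues = []
for _ in range(2):  # the second run reuses the issue opened by the first
    triage(
        "abc123",
        issues,
        fetch_error=lambda error_id: {"id": error_id, "message": "JSONDecodeError"},
        clone_repos=lambda: "/tmp/workspace",
        run_agent=lambda details, workspace: f"Report for {details['id']}",
    )
```

Keying the issue lookup on the error ID is what keeps repeated runs from opening duplicate issues; each run only adds a new comment.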

<Note>
  The workflow posts findings to GitHub issues for human review before any code changes are made. If you want the agent to create a fix, you can follow up using the [OpenHands GitHub integration](https://docs.openhands.dev/openhands/usage/cloud/github-installation#github-integration) and say `@openhands go ahead and create a pull request to fix this issue based on your analysis`.
</Note>

## Setting Up the Workflow

To set up automated Datadog debugging in your own repository:

1. Copy the workflow file to `.github/workflows/` in your repository
2. Configure the required secrets (Datadog API keys, LLM API key)
3. Customize the default queries and repository lists for your needs
4. Run the workflow manually or set up scheduled runs
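Before the first run, it can help to fail fast when a secret is missing. The sketch below assumes the secrets reach the script as the environment variables `DD_API_KEY`, `DD_APP_KEY`, and `LLM_API_KEY`; those names are an assumption for this example, so substitute whatever names your copy of the workflow actually reads.

```python
import os

# Assumed secret names; adjust to the variables your workflow exposes.
REQUIRED_SECRETS = ("DD_API_KEY", "DD_APP_KEY", "LLM_API_KEY")


def missing_secrets(env=None):
    """Return the required secret names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


missing = missing_secrets({"DD_API_KEY": "example"})
if missing:
    print("Missing secrets:", ", ".join(missing))
```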

The workflow is fully customizable. You can modify the prompts to focus on specific types of analysis, adjust the agent's tools to fit your workflow, or extend it to integrate with other services beyond GitHub and Datadog.

Find the [full implementation on GitHub](https://github.com/OpenHands/software-agent-sdk/tree/main/examples/03_github_workflows/04_datadog_debugging), including the workflow YAML file, Python script, and prompt template.

## Manual Incident Investigation

You can also use OpenHands directly to investigate incidents without the automated workflow.

### Log Analysis

OpenHands can analyze logs to identify patterns and anomalies:

```
Analyze these application logs for the incident that occurred at 14:32 UTC:

1. Identify the first error or warning that appeared
2. Trace the sequence of events leading to the failure
3. Find any correlated errors across services
4. Identify the user or request that triggered the issue
5. Summarize the timeline of events
```

**Log analysis capabilities:**

| Log Type         | Analysis Capabilities                               |
| ---------------- | --------------------------------------------------- |
| Application logs | Error patterns, exception traces, timing anomalies  |
| Access logs      | Traffic patterns, slow requests, error responses    |
| System logs      | Resource exhaustion, process crashes, system errors |
| Database logs    | Slow queries, deadlocks, connection issues          |
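Large log files rarely fit in a prompt, so it is often worth trimming them to the incident window first. The following is a minimal sketch, assuming each log entry starts with an ISO 8601 timestamp followed by a space; the function name, window sizes, and sample lines are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def logs_in_window(lines, incident_time,
                   before=timedelta(minutes=10), after=timedelta(minutes=5)):
    """Keep entries whose leading timestamp falls inside the incident window."""
    start, end = incident_time - before, incident_time + after
    kept, keep = [], False
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            # No timestamp: treat the line as a continuation (for example a
            # stack trace line) of the previous entry and keep it with that entry.
            if keep:
                kept.append(line)
            continue
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)  # assume UTC when unspecified
        keep = start <= when <= end
        if keep:
            kept.append(line)
    return kept


sample = [
    "2024-05-01T14:10:00+00:00 INFO startup complete",
    "2024-05-01T14:25:03+00:00 WARN pool at 90% capacity",
    "2024-05-01T14:32:10+00:00 ERROR request failed",
    "    at handler (app.js:10)",
    "2024-05-01T14:50:00+00:00 INFO recovered",
]
incident = datetime(2024, 5, 1, 14, 32, tzinfo=timezone.utc)
window = logs_in_window(sample, incident)
```

Keeping a few minutes before the incident matters because the first warning often appears before the first error.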

### Stack Trace Analysis

Ask OpenHands to dig into a stack trace:

```
Analyze this stack trace from our production error:

[paste full stack trace]

1. Identify the exception type and message
2. Trace back to our code (not framework code)
3. Identify the likely cause
4. Check if this code path has changed recently
5. Suggest a fix
```

**Multi-language support:**

<Tabs>
  <Tab title="Java">
    ```
    Analyze this Java exception:

    java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3210)
        at java.util.ArrayList.grow(ArrayList.java:265)
        at com.myapp.DataProcessor.loadAllRecords(DataProcessor.java:142)

    Identify:
    1. What operation is consuming memory?
    2. Is there a memory leak or just too much data?
    3. What's the fix?
    ```
  </Tab>

  <Tab title="Python">
    ```
    Analyze this Python traceback:

    Traceback (most recent call last):
      File "app/api/orders.py", line 45, in create_order
        order = OrderService.create(data)
      File "app/services/order.py", line 89, in create
        inventory.reserve(item_id, quantity)
    AttributeError: 'NoneType' object has no attribute 'reserve'

    What's None and why?
    ```
  </Tab>

  <Tab title="JavaScript">
    ```
    Analyze this Node.js error:

    TypeError: Cannot read property 'map' of undefined
        at processItems (/app/src/handlers/items.js:23:15)
        at async handleRequest (/app/src/api/router.js:45:12)

    What's undefined and how should we handle it?
    ```
  </Tab>
</Tabs>
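The "trace back to our code" step can also be done mechanically before you hand the trace to the agent. The sketch below parses a Python traceback and keeps only frames under an application prefix. The `app/` prefix matches the Python example above, while the `site-packages` frame is an invented framework frame added to show the filtering.

```python
import re

FRAME = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def application_frames(traceback_text, app_prefixes=("app/",)):
    """Return (path, line, function) for frames that belong to application code."""
    frames = [
        (m["path"], int(m["line"]), m["func"])
        for m in FRAME.finditer(traceback_text)
    ]
    return [frame for frame in frames if frame[0].startswith(app_prefixes)]


trace = '''Traceback (most recent call last):
  File "/usr/lib/python3/site-packages/framework/routing.py", line 210, in dispatch
    return handler(request)
  File "app/api/orders.py", line 45, in create_order
    order = OrderService.create(data)
  File "app/services/order.py", line 89, in create
    inventory.reserve(item_id, quantity)
AttributeError: 'NoneType' object has no attribute 'reserve'
'''

frames = application_frames(trace)
innermost = frames[-1]  # the last application frame is closest to the failure
```

Python lists the most recent call last, so the final application frame is usually the best place to start reading.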

### Root Cause Analysis

Identify the underlying cause of an incident:

```
Perform root cause analysis for this incident:

Symptoms:
- API response times increased 5x at 14:00
- Error rate jumped from 0.1% to 15%
- Database CPU spiked to 100%

Available data:
- Application metrics (Grafana dashboard attached)
- Recent deployments: v2.3.1 deployed at 13:45
- Database slow query log (attached)

Identify the root cause using the 5 Whys technique.
```
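One quick check worth running alongside such a prompt is whether a deployment landed shortly before the symptoms began. This is a minimal sketch: the `v2.3.1` deployment at 13:45 and the 14:00 onset come from the prompt above, while the date, the earlier `v2.3.0` deployment, and the one-hour lookback are assumptions for illustration.

```python
from datetime import datetime, timedelta


def suspect_deployments(deployments, onset, lookback=timedelta(hours=1)):
    """Return deployments that landed within `lookback` before onset, newest first."""
    candidates = [
        (name, at) for name, at in deployments
        if onset - lookback <= at <= onset
    ]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


deployments = [
    ("v2.3.0", datetime(2024, 5, 1, 9, 10)),   # hypothetical earlier release
    ("v2.3.1", datetime(2024, 5, 1, 13, 45)),  # from the prompt above
]
onset = datetime(2024, 5, 1, 14, 0)
suspects = suspect_deployments(deployments, onset)
```

A deployment in the window is a lead, not a verdict; it tells you which diff to read first while the 5 Whys analysis continues.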

## Common Incident Patterns

OpenHands can recognize and help diagnose these common patterns:

* **Connection pool exhaustion**: Increasing connection errors followed by complete failure
* **Memory leaks**: Gradual memory increase leading to OOM
* **Cascading failures**: One service failure triggering others
* **Thundering herd**: Simultaneous requests overwhelming a service
* **Split brain**: Inconsistent state across distributed components
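As one concrete example, the "gradual memory increase" signature of a leak can be approximated with a least-squares trend over evenly spaced memory samples. The function names and the 1 MB-per-sample threshold below are arbitrary assumptions; real services with sawtooth garbage-collection patterns would need a longer window and a tuned threshold.

```python
def slope_per_sample(values):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def looks_like_leak(memory_mb, min_slope_mb=1.0):
    """Flag a series whose fitted trend rises by at least `min_slope_mb` per sample."""
    return slope_per_sample(memory_mb) >= min_slope_mb


growing = [500, 510, 520, 530]  # steady climb
stable = [500, 502, 499, 501]   # noise around a flat baseline
```

Here `looks_like_leak(growing)` is true and `looks_like_leak(stable)` is false, because the fitted slopes are 10 MB and 0 MB per sample respectively.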

## Quick Fix Generation

Once the root cause is identified, generate fixes:

```
We've identified the root cause: a missing null check in OrderProcessor.java line 156.

Generate a fix that:
1. Adds proper null checking
2. Logs when null is encountered
3. Returns an appropriate error response
4. Includes a unit test for the edge case
5. Is minimally invasive for a hotfix
```
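For a sense of what a hotfix of this shape looks like, here is a sketch in Python rather than Java, modeled on the `'NoneType' object has no attribute 'reserve'` traceback shown earlier. The function name, logger name, status codes, and response shape are all hypothetical; a real fix would follow your service's own error conventions.

```python
import logging

logger = logging.getLogger("orders")


def reserve_inventory(inventory, item_id, quantity):
    """Reserve stock; return an error payload instead of raising when inventory is missing."""
    if inventory is None:
        # Log the unexpected state so it shows up in monitoring.
        logger.warning("inventory unavailable for item %s", item_id)
        return {"ok": False, "status": 503, "error": "inventory service unavailable"}
    inventory.reserve(item_id, quantity)
    return {"ok": True, "status": 200}


class FakeInventory:
    """Test double that records reservations."""

    def __init__(self):
        self.reserved = []

    def reserve(self, item_id, quantity):
        self.reserved.append((item_id, quantity))


def test_missing_inventory_returns_error():
    assert reserve_inventory(None, "sku-1", 2)["status"] == 503


def test_reserves_when_inventory_present():
    fake = FakeInventory()
    assert reserve_inventory(fake, "sku-1", 2) == {"ok": True, "status": 200}
    assert fake.reserved == [("sku-1", 2)]
```

The change is confined to a single guard clause, which keeps the diff small enough to review quickly during an incident.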

## Best Practices

### Investigation Checklist

Use this checklist when investigating:

1. **Scope the impact**
   * How many users are affected?
   * What functionality is broken?
   * What's the business impact?

2. **Establish timeline**
   * When did it start?
   * What changed around that time?
   * Is it getting worse or stable?

3. **Gather data**
   * Application logs
   * Infrastructure metrics
   * Recent deployments
   * Configuration changes

4. **Form hypotheses**
   * List possible causes
   * Rank by likelihood
   * Test systematically

5. **Implement fix**
   * Choose safest fix
   * Test before deploying
   * Monitor after deployment

### Common Pitfalls

<Warning>
  Avoid these common incident response mistakes:

  * **Jumping to conclusions**: Gather data before assuming the cause
  * **Changing multiple things**: Make one change at a time to isolate effects
  * **Not documenting**: Record all actions for the post-mortem
  * **Ignoring rollback**: Always have a rollback plan before deploying fixes
</Warning>

<Note>
  For production incidents, always follow your organization's incident response procedures. OpenHands is a tool to assist your investigation, not a replacement for proper incident management.
</Note>

## Automate This

You can set up continuous health monitoring using [OpenHands Automations](/openhands/usage/automations/overview).
Copy this prompt into a new conversation to set one up:

```
Create an automation called "API Health Monitor" that runs every 30 minutes.

It should check https://api.example.com/health and:
- If the response is not 200 OK, send an alert to #alerts with the status code and response body
- If healthy, just log success without alerting anyone

Learn more at https://docs.openhands.dev/openhands/usage/use-cases/incident-triage
```
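The check that prompt describes boils down to a small decision. The sketch below is not how OpenHands Automations implements it; it is a standard-library illustration with hypothetical function names, and it leaves the actual delivery to `#alerts` out of scope.

```python
import urllib.error
import urllib.request


def check_health(url, timeout=10):
    """Return (status_code, body); status_code is None when the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a status code and body worth reporting.
        return err.code, err.read().decode("utf-8", "replace")
    except OSError as err:
        # Covers DNS failures, refused connections, and timeouts.
        return None, str(err)


def alert_message(status, body, limit=500):
    """Return the alert text for an unhealthy response, or None when the check passed."""
    if status == 200:
        return None
    label = "no response" if status is None else f"HTTP {status}"
    return f"Health check failed ({label}): {body[:limit]}"


# Example wiring (not executed here, since it makes a network call):
# status, body = check_health("https://api.example.com/health")
# message = alert_message(status, body)
```

Truncating the body keeps an alert readable when a failing endpoint returns a full HTML error page.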

For deeper error analysis with Datadog integration, see the
[Datadog debugging workflow](https://github.com/OpenHands/software-agent-sdk/tree/main/examples/03_github_workflows/04_datadog_debugging).

## Related Resources

* [OpenHands SDK Repository](https://github.com/OpenHands/software-agent-sdk) - Build custom AI agents
* [Datadog Debugging Workflow](https://github.com/OpenHands/software-agent-sdk/tree/main/examples/03_github_workflows/04_datadog_debugging) - Ready-to-use GitHub Actions workflow
* [Prompting Best Practices](/openhands/usage/tips/prompting-best-practices) - Write effective prompts
