Debugging Production Issues at 3 AM: A Developer's Survival Guide
The PagerDuty alert woke me at 3:17 AM. "Critical: API response time > 5s." My heart rate spiked before I was fully awake.
This is the reality of running production systems. Here's what I've learned from too many 3 AM incidents.
The First 60 Seconds
When you get paged, you're groggy and panicked. Having a checklist helps:
- Acknowledge the alert (stops it from escalating)
- Check the dashboard (is the site actually down?)
- Look at recent deployments (did we just ship something?)
- Check external dependencies (is AWS having issues?)
These 60 seconds give you context before you start debugging.
The Incident That Taught Me Everything
Last year, our API went down at 2:47 AM. Response times went from 200ms to 30 seconds. Then timeouts. Then complete failure.
I was on-call. Here's how it went down.
Minute 1-5: Panic
My first instinct was to restart everything. Bad idea.
Restarting without understanding the problem often makes things worse. You lose logs, you lose state, and you might restart into the same failure mode.
I took a breath and started investigating.
Minute 5-15: Gather Data
I checked:
- Logs: Nothing obvious
- Metrics: CPU and memory looked normal
- Database: Queries were slow but not failing
- External APIs: All responding normally
The smoking gun: Database connection pool was exhausted. We had 100 connections (our limit) and all were in use.
Minute 15-30: Find the Root Cause
Why were connections not being released?
I checked recent deployments. We'd shipped a new feature 6 hours earlier. It made database calls but... oh no.
def get_user_data(user_id):
    conn = db.get_connection()
    result = conn.execute(query)
    return result  # Never released the connection!
We forgot to close the connection. Each request leaked a connection. After 6 hours, we'd exhausted the pool.
Minute 30-45: Fix It
I had two options:
- Roll back the deployment
- Hotfix the bug
At 3 AM, rollback is usually safer. But our deployment process takes 15 minutes. The hotfix was one line:
def get_user_data(user_id):
    conn = db.get_connection()
    try:
        result = conn.execute(query)
        return result
    finally:
        conn.close()  # Always release
I deployed the hotfix. Response times recovered in 2 minutes.
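The try/finally pattern works, but it relies on every caller remembering it. A context manager bakes the release into one place. Here's a sketch with a toy pool (FakePool and pooled_connection are illustrative names, not our real driver):

```python
from contextlib import contextmanager

class FakePool:
    """Toy connection pool for illustration; a real driver provides its own."""
    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0

    def get_connection(self):
        if self.in_use >= self.limit:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()  # stand-in for a real connection

    def release(self, conn):
        self.in_use -= 1

pool = FakePool(limit=2)

@contextmanager
def pooled_connection(pool):
    conn = pool.get_connection()
    try:
        yield conn
    finally:
        pool.release(conn)  # released even if the query raises

# Far more calls than the pool limit, yet nothing leaks.
for _ in range(10):
    with pooled_connection(pool):
        pass  # run queries here

print(pool.in_use)  # 0
```

With this pattern, forgetting to release a connection becomes impossible at the call site rather than something code review has to catch.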
Minute 45-60: Verify and Document
I watched metrics for 15 minutes to ensure stability. Then I wrote a quick incident report while it was fresh.
Total downtime: 43 minutes.
Lessons from That Night
1. Have a Runbook
We now have runbooks for common issues:
- High response times
- Database connection issues
- Memory leaks
- External API failures
Each runbook has:
- Symptoms
- Likely causes
- Investigation steps
- Fix procedures
At 3 AM, you don't want to think. You want to follow a checklist.
2. Logs Are Your Best Friend
But only if they're good logs. We now log:
- Request IDs (to trace requests across services)
- Timing information
- Database query times
- External API calls
Structured logging helps:
logger.info(
    'database_query',
    query_time=0.234,
    query_type='SELECT',
    table='users',
    request_id=request_id,
)
This is searchable and aggregatable.
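Logger calls with keyword fields like the one above need library support. If you only have the standard library, a small JSON helper gets you most of the way. A sketch (log_event is a hypothetical helper, not a stdlib function):

```python
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, format="%(message)s", level=logging.INFO)
logger = logging.getLogger("api")

def log_event(event, **fields):
    """Emit one JSON object per line so aggregators can parse every field."""
    line = json.dumps({"event": event, **fields})
    logger.info(line)
    return line

log_event(
    "database_query",
    query_time=0.234,
    query_type="SELECT",
    table="users",
    request_id="req-42",  # assumption: example request ID format
)
```

One JSON object per line is the key property: every field becomes queryable in whatever aggregator you ship logs to.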
3. Metrics Beat Logs
Logs tell you what happened. Metrics tell you when it started happening.
We track:
- Request rate
- Response time (p50, p95, p99)
- Error rate
- Database connection pool usage
- Memory and CPU
When something breaks, metrics show you exactly when it started.
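As a refresher, those percentiles are cheap to compute from raw samples. A sketch using only the standard library (the simulated latencies are made up):

```python
import random
from statistics import quantiles

# Simulated response times in milliseconds (assumption: ~200ms mean load).
random.seed(0)
samples = [random.expovariate(1 / 200) for _ in range(1000)]

# quantiles(n=100) returns the 99 cut points p1..p99.
cuts = quantiles(samples, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

The spread between p50 and p99 is often the interesting signal: a healthy p50 with a climbing p99 usually means a subset of requests is hitting something slow.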
4. Alerts Should Be Actionable
Bad alert: "CPU usage high"
Good alert: "API response time p95 > 1s for 5 minutes"
The good alert tells you:
- What's wrong (response time)
- How wrong (p95 > 1s)
- For how long (5 minutes)
You can act on this.
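Evaluating a rule like "p95 > 1s for 5 minutes" is a small sliding-window check. A sketch (should_alert and the one-sample-per-minute cadence are assumptions):

```python
from collections import deque

def should_alert(p95_history, threshold_s=1.0, window=5):
    """Fire only if the last `window` samples (one per minute) all exceed the threshold."""
    recent = list(p95_history)[-window:]
    return len(recent) == window and all(v > threshold_s for v in recent)

history = deque(maxlen=60)
for v in [0.4, 0.5, 1.2, 1.3, 1.4, 1.5, 1.6]:
    history.append(v)

print(should_alert(history))  # True: five consecutive minutes above 1s
```

Requiring the full window to breach is what keeps a single slow minute from paging anyone.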
5. Practice Debugging
We do "chaos engineering" exercises. Once a month, we intentionally break something in staging and practice debugging it.
This builds muscle memory. When a real incident happens, you're not learning the tools—you're using them.
The 3 AM Debugging Toolkit
1. Centralized Logging
We use Loki. All logs from all services in one place. Searchable.
{service="api"} |= "error" | json | response_time > 1000
This finds all API errors with response time over 1 second.
2. Distributed Tracing
Jaeger shows us the path of a request through our system. When something is slow, we can see exactly where.
3. Metrics Dashboard
Grafana dashboard with:
- Request rate
- Error rate
- Response time
- Database metrics
- External API health
One screen tells me the system's health.
4. Database Query Analyzer
We log slow queries automatically. If the database is the problem, we know which query.
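If your database or ORM doesn't log slow queries for you, a thin timing wrapper at the application layer works. A sketch (execute_query and the 0.5s threshold are assumptions):

```python
import functools
import time

SLOW_QUERY_THRESHOLD_S = 0.5  # assumption: tune per workload
SLOW_QUERIES = []             # in practice this would go to your logger

def log_slow_queries(fn):
    """Record any query that runs longer than the threshold, even if it raises."""
    @functools.wraps(fn)
    def wrapper(query, *args, **kwargs):
        start = time.monotonic()
        try:
            return fn(query, *args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > SLOW_QUERY_THRESHOLD_S:
                SLOW_QUERIES.append((round(elapsed, 3), query))
    return wrapper

@log_slow_queries
def execute_query(query):
    time.sleep(0.6)  # simulate a query that is too slow
    return []

execute_query("SELECT * FROM users")
print(SLOW_QUERIES)
```

The try/finally means the timing is recorded even when the query fails, which is exactly when you most want the data.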
The Mental Game
Stay Calm
Panic makes you stupid. Take 30 seconds to breathe before you start.
Don't Guess
Gather data first. Form hypotheses. Test them. Don't randomly restart things.
Communicate
Post in Slack: "Investigating API slowness. Will update in 10 minutes."
This keeps stakeholders informed and buys you time to focus.
Know When to Escalate
If you're stuck after 20 minutes, wake up someone else. Two brains are better than one.
Prevention Is Better Than Cure
Code Review
That connection leak? Should've been caught in code review. Now we have a checklist:
- Are resources properly closed?
- Are errors handled?
- Are timeouts set?
Staging Environment
Our staging environment mirrors production. We load test every deployment.
Gradual Rollouts
We deploy to 10% of servers first. If metrics look good after 30 minutes, we deploy to 100%.
This catches issues before they affect everyone.
Automatic Rollback
If the error rate spikes after a deployment, we automatically roll back. No human intervention needed.
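The trigger behind that automation can be simple. A sketch of the decision function (the thresholds here are assumptions, not our production values):

```python
def should_rollback(baseline_error_rate, current_error_rate,
                    absolute_floor=0.01, spike_factor=3.0):
    """Roll back only if errors exceed an absolute floor AND spike well above baseline."""
    return (current_error_rate > absolute_floor and
            current_error_rate > baseline_error_rate * spike_factor)

print(should_rollback(0.002, 0.004))  # False: noisy, but under the 1% floor
print(should_rollback(0.002, 0.08))   # True: 40x baseline and well over the floor
```

The two conditions guard against opposite failure modes: the absolute floor stops tiny baselines from triggering on noise, and the spike factor stops a chronically high baseline from masking a real regression.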
The Aftermath
After every incident, we do a blameless post-mortem:
- What happened?
- Why did it happen?
- How did we detect it?
- How did we fix it?
- How do we prevent it?
We focus on systems, not people. The goal is to learn, not to blame.
The Reality
You will get paged at 3 AM. It's part of the job. But you can make it less painful:
- Have good observability
- Write runbooks
- Practice debugging
- Stay calm
- Learn from incidents
The first few times are terrifying. After a while, you develop a system. You know what to check. You know how to investigate.
It still sucks to be woken up at 3 AM. But at least you know what to do.
And that makes all the difference.