Debugging Production Issues at 3 AM: A Developer's Survival Guide
The PagerDuty alert woke me at 3:17 AM. "Critical: API response time > 5s." My heart rate spiked before I was fully awake.
This is the reality of running production systems. Here's what I've learned from too many 3 AM incidents.
The First 60 Seconds
When you get paged, you're groggy and panicked. Having a checklist helps:
- Acknowledge the alert (stops it from escalating)
- Check the dashboard (is the site actually down?)
- Look at recent deployments (did we just ship something?)
- Check external dependencies (is AWS having issues?)
These 60 seconds give you context before you start debugging.
The Incident That Taught Me Everything
Last year, our API went down at 2:47 AM. Response times went from 200ms to 30 seconds. Then timeouts. Then complete failure.
I was on-call. Here's how it went down.
Minute 1-5: Panic
My first instinct was to restart everything. Bad idea.
Restarting without understanding the problem often makes things worse. You lose logs, you lose state, and you might restart into the same failure mode.
I took a breath and started investigating.
Minute 5-15: Gather Data
I checked:
- Logs: Nothing obvious
- Metrics: CPU and memory looked normal
- Database: Queries were slow but not failing
- External APIs: All responding normally
The smoking gun: Database connection pool was exhausted. We had 100 connections (our limit) and all were in use.
Minute 15-30: Find the Root Cause
Why were connections not being released?
I checked recent deployments. We'd shipped a new feature 6 hours earlier. It made database calls but... oh no.
def get_user_data(user_id):
    conn = db.get_connection()
    result = conn.execute(query)
    return result  # Never released the connection!
We forgot to close the connection. Each request leaked a connection. After 6 hours, we'd exhausted the pool.
Minute 30-45: Fix It
I had two options:
- Roll back the deployment
- Hotfix the bug
At 3 AM, rollback is usually safer. But our deployment process takes 15 minutes. The hotfix was one line:
def get_user_data(user_id):
    conn = db.get_connection()
    try:
        result = conn.execute(query)
        return result
    finally:
        conn.close()  # Always release
I deployed the hotfix. Response times recovered in 2 minutes.
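The try/finally pattern works, but it relies on every caller remembering it. A context manager bakes the release into one place. Here's a sketch with a toy pool (FakePool and pooled_connection are illustrative names, not our real driver):

```python
from contextlib import contextmanager

class FakePool:
    """Toy connection pool for illustration; a real driver provides its own."""
    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0

    def get_connection(self):
        if self.in_use >= self.limit:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()  # stand-in for a real connection

    def release(self, conn):
        self.in_use -= 1

pool = FakePool(limit=2)

@contextmanager
def pooled_connection(pool):
    conn = pool.get_connection()
    try:
        yield conn
    finally:
        pool.release(conn)  # released even if the query raises

# Far more calls than the pool limit, yet nothing leaks.
for _ in range(10):
    with pooled_connection(pool):
        pass  # run queries here

print(pool.in_use)  # 0
```

With this pattern, forgetting to release a connection becomes impossible at the call site rather than something code review has to catch.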
Minute 45-60: Verify and Document
I watched metrics for 15 minutes to ensure stability. Then I wrote a quick incident report while it was fresh.
Total downtime: 43 minutes.
Lessons from That Night
1. Have a Runbook
We now have runbooks for common issues:
- High response times
- Database connection issues
- Memory leaks
- External API failures
Each runbook has:
- Symptoms
- Likely causes
- Investigation steps
- Fix procedures
At 3 AM, you don't want to think. You want to follow a checklist.
2. Logs Are Your Best Friend
But only if they're good logs. We now log:
- Request IDs (to trace requests across services)
- Timing information
- Database query times
- External API calls
Structured logging helps:
logger.info(
    'database_query',
    query_time=0.234,
    query_type='SELECT',
    table='users',
    request_id=request_id,
)
This is searchable and aggregatable.
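Logger calls with keyword fields like the one above need library support. If you only have the standard library, a small JSON helper gets you most of the way. A sketch (log_event is a hypothetical helper, not a stdlib function):

```python
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, format="%(message)s", level=logging.INFO)
logger = logging.getLogger("api")

def log_event(event, **fields):
    """Emit one JSON object per line so aggregators can parse every field."""
    line = json.dumps({"event": event, **fields})
    logger.info(line)
    return line

log_event(
    "database_query",
    query_time=0.234,
    query_type="SELECT",
    table="users",
    request_id="req-42",  # assumption: example request ID format
)
```

One JSON object per line is the key property: every field becomes queryable in whatever aggregator you ship logs to.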
3. Metrics Beat Logs
Logs tell you what happened. Metrics tell you when it started happening.
We track:
- Request rate
- Response time (p50, p95, p99)
- Error rate
- Database connection pool usage
- Memory and CPU
When something breaks, metrics show you exactly when it started.
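As a refresher, those percentiles are cheap to compute from raw samples. A sketch using only the standard library (the simulated latencies are made up):

```python
import random
from statistics import quantiles

# Simulated response times in milliseconds (assumption: ~200ms mean load).
random.seed(0)
samples = [random.expovariate(1 / 200) for _ in range(1000)]

# quantiles(n=100) returns the 99 cut points p1..p99.
cuts = quantiles(samples, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

The spread between p50 and p99 is often the interesting signal: a healthy p50 with a climbing p99 usually means a subset of requests is hitting something slow.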
4. Alerts Should Be Actionable
Bad alert: "CPU usage high"
Good alert: "API response time p95 > 1s for 5 minutes"
The good alert tells you:
- What's wrong (response time)
- How wrong (p95 > 1s)
- For how long (5 minutes)
You can act on this.
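Evaluating a rule like "p95 > 1s for 5 minutes" is a small sliding-window check. A sketch (should_alert and the one-sample-per-minute cadence are assumptions):

```python
from collections import deque

def should_alert(p95_history, threshold_s=1.0, window=5):
    """Fire only if the last `window` samples (one per minute) all exceed the threshold."""
    recent = list(p95_history)[-window:]
    return len(recent) == window and all(v > threshold_s for v in recent)

history = deque(maxlen=60)
for v in [0.4, 0.5, 1.2, 1.3, 1.4, 1.5, 1.6]:
    history.append(v)

print(should_alert(history))  # True: five consecutive minutes above 1s
```

Requiring the full window to breach is what keeps a single slow minute from paging anyone.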
5. Practice Debugging
We do "chaos engineering" exercises. Once a month, we intentionally break something in staging and practice debugging it.
This builds muscle memory. When a real incident happens, you're not learning the tools—you're using them.
The 3 AM Debugging Toolkit
1. Centralized Logging
We use Loki. All logs from all services in one place. Searchable.
{service="api"} |= "error" | json | response_time > 1000
This finds all API errors with response time over 1 second.
2. Distributed Tracing
Jaeger shows us the path of a request through our system. When something is slow, we can see exactly where.
3. Metrics Dashboard
Grafana dashboard with:
- Request rate
- Error rate
- Response time
- Database metrics
- External API health
One screen tells me the system's health.
4. Database Query Analyzer
We log slow queries automatically. If the database is the problem, we know which query.
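If your database or ORM doesn't log slow queries for you, a thin timing wrapper at the application layer works. A sketch (execute_query and the 0.5s threshold are assumptions):

```python
import functools
import time

SLOW_QUERY_THRESHOLD_S = 0.5  # assumption: tune per workload
SLOW_QUERIES = []             # in practice this would go to your logger

def log_slow_queries(fn):
    """Record any query that runs longer than the threshold, even if it raises."""
    @functools.wraps(fn)
    def wrapper(query, *args, **kwargs):
        start = time.monotonic()
        try:
            return fn(query, *args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > SLOW_QUERY_THRESHOLD_S:
                SLOW_QUERIES.append((round(elapsed, 3), query))
    return wrapper

@log_slow_queries
def execute_query(query):
    time.sleep(0.6)  # simulate a query that is too slow
    return []

execute_query("SELECT * FROM users")
print(SLOW_QUERIES)
```

The try/finally means the timing is recorded even when the query fails, which is exactly when you most want the data.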
The Mental Game
Stay Calm
Panic makes you stupid. Take 30 seconds to breathe before you start.
Don't Guess
Gather data first. Form hypotheses. Test them. Don't randomly restart things.
Communicate
Post in Slack: "Investigating API slowness. Will update in 10 minutes."
This keeps stakeholders informed and buys you time to focus.
Know When to Escalate
If you're stuck after 20 minutes, wake up someone else. Two brains are better than one.
Prevention Is Better Than Cure
Code Review
That connection leak? Should've been caught in code review. Now we have a checklist:
- Are resources properly closed?
- Are errors handled?
- Are timeouts set?
Staging Environment
Our staging environment mirrors production. We load test every deployment.
Gradual Rollouts
We deploy to 10% of servers first. If metrics look good after 30 minutes, we deploy to 100%.
This catches issues before they affect everyone.
Automatic Rollback
If the error rate spikes after a deployment, we automatically roll back. No human intervention needed.
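The trigger behind that automation can be simple. A sketch of the decision function (the thresholds here are assumptions, not our production values):

```python
def should_rollback(baseline_error_rate, current_error_rate,
                    absolute_floor=0.01, spike_factor=3.0):
    """Roll back only if errors exceed an absolute floor AND spike well above baseline."""
    return (current_error_rate > absolute_floor and
            current_error_rate > baseline_error_rate * spike_factor)

print(should_rollback(0.002, 0.004))  # False: noisy, but under the 1% floor
print(should_rollback(0.002, 0.08))   # True: 40x baseline and well over the floor
```

The two conditions guard against opposite failure modes: the absolute floor stops tiny baselines from triggering on noise, and the spike factor stops a chronically high baseline from masking a real regression.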
The Aftermath
After every incident, we do a blameless post-mortem:
- What happened?
- Why did it happen?
- How did we detect it?
- How did we fix it?
- How do we prevent it?
We focus on systems, not people. The goal is to learn, not to blame.
The Reality
You will get paged at 3 AM. It's part of the job. But you can make it less painful:
- Have good observability
- Write runbooks
- Practice debugging
- Stay calm
- Learn from incidents
The first few times are terrifying. After a while, you develop a system. You know what to check. You know how to investigate.
It still sucks to be woken up at 3 AM. But at least you know what to do.
And that makes all the difference.