GiftWish | SaaS | 1 day incident + 1 week follow-up | Site Reliability Engineer

Production Incident Postmortem: GiftWish Latency Spike

Led incident response and postmortem analysis for a production latency spike that caused 504 errors during peak traffic. Identified root causes, documented learnings, and proposed concrete reliability improvements.

Incident Response | Monitoring | PostgreSQL | Performance | SRE

The Challenge

During peak evening traffic, API response times spiked from 200 ms to 3,000 ms, and 15% of users received 504 Gateway Timeout errors. The incident lasted 45 minutes before the team applied emergency mitigations. No monitoring alert fired early enough to prevent user impact. The team needed to understand why this happened and how to prevent a recurrence.

The Solution

Led a structured incident response process: established incident command, gathered logs and metrics, identified database connection pool exhaustion as the root cause, and applied immediate mitigation by scaling connection pools and adding read replicas. After restoring service, conducted a blameless postmortem that documented the timeline, root causes, contributing factors, and action items. Proposed concrete improvements to monitoring, capacity planning, and database architecture.
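The capacity reasoning behind the pool change can be sketched with Little's Law: the average number of connections in use is roughly the request rate multiplied by the time each request holds a connection. The snippet below is a minimal illustration under that assumption. The function name, the headroom factor, and all traffic figures are hypothetical and are not values from the incident.

```python
import math


def required_connections(requests_per_sec: float,
                         avg_hold_time_sec: float,
                         headroom: float = 0.25) -> int:
    """Estimate a pool size from Little's Law (L = arrival rate * hold time).

    `headroom` is an illustrative safety margin added on top of the
    steady-state estimate; it is an assumption, not a measured value.
    """
    if requests_per_sec < 0 or avg_hold_time_sec < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    concurrent = requests_per_sec * avg_hold_time_sec
    return max(1, math.ceil(concurrent * (1 + headroom)))


if __name__ == "__main__":
    # Hypothetical peak load: 100 requests/s that each need a connection.
    print("fast queries (0.25 s hold):", required_connections(100, 0.25))
    print("slow queries (1.0 s hold):", required_connections(100, 1.0))
```

With the same request rate, a fourfold increase in hold time raises the estimate roughly fourfold, which is why slow queries and a low pool limit compound each other.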

Technical Implementation

The postmortem followed a blameless SRE format, with a detailed timeline reconstructed from logs and metrics. Root cause analysis revealed two compounding problems: database connection pool limits were too low for peak traffic, and slow queries were holding connections longer than expected, so requests queued for a free connection until they timed out. Contributing factors included missing connection pool monitoring, no alerting on API latency percentiles, and a lack of load testing against realistic traffic patterns. Action items included implementing connection pool metrics and alerts, optimizing the slow queries identified in logs, setting up p95/p99 latency alerts, conducting monthly load tests, and documenting runbooks for common incidents.
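To make the alerting action items concrete, here is a minimal sketch of the kind of check a p95/p99 latency alert and a connection pool alert would perform. It uses a nearest-rank percentile over a window of latency samples. The function names and every threshold (500 ms, 1,000 ms, 80% pool utilization) are hypothetical placeholders, not the values used in production.

```python
import math
from typing import List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of `samples` (0 < pct <= 100)."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def evaluate_alerts(latencies_ms: Sequence[float],
                    pool_in_use: int,
                    pool_max: int,
                    p95_limit_ms: float = 500.0,    # hypothetical threshold
                    p99_limit_ms: float = 1000.0,   # hypothetical threshold
                    pool_util_limit: float = 0.8    # hypothetical threshold
                    ) -> List[str]:
    """Return a human-readable message for each threshold that is breached."""
    alerts = []
    p95 = percentile(latencies_ms, 95)
    p99 = percentile(latencies_ms, 99)
    if p95 > p95_limit_ms:
        alerts.append(f"p95 latency {p95:.0f} ms exceeds {p95_limit_ms:.0f} ms")
    if p99 > p99_limit_ms:
        alerts.append(f"p99 latency {p99:.0f} ms exceeds {p99_limit_ms:.0f} ms")
    utilization = pool_in_use / pool_max
    if utilization > pool_util_limit:
        alerts.append(f"connection pool at {utilization:.0%} of capacity")
    return alerts


if __name__ == "__main__":
    # Hypothetical window: 90 fast requests and 10 slow ones, pool nearly full.
    window = [200.0] * 90 + [3000.0] * 10
    for message in evaluate_alerts(window, pool_in_use=19, pool_max=20):
        print(message)
```

Alerting on percentiles rather than the mean matters here: in the hypothetical window above the mean latency is 480 ms, which would sit under a 500 ms threshold even though one request in ten takes 3 seconds.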

Results & Impact

MTTD (mean time to detect): reduced 70%
Runbooks: 4 documented
Load Testing: now monthly
Alerts Created: 8 new
