“Have you ever stared at a dashboard, drowning in hundreds of metrics, wondering which ones actually matter?”
The Pina Colada Incident
A real story: the entire team was away at a company meetup, mingling across time zones, while I was home with a newborn, covering the EMEA on-call shift. It started calm, with no alerts. I even let myself believe I might actually get some rest.
But just as my shift ended and the next on-call engineer picked up, chaos struck.
The alerts were relentless:
- High CPU utilization!
- Memory exhaustion!
- Database nearing saturation!
At first, it felt like a false alarm. There were no customer complaints, no tweets screaming, “Your service is down!” Just a barrage of resource-level metrics firing off like a rogue fireworks show.
Then came the kicker: most of our customers were in the US, and while things seemed fine during my shift, the real trouble started when they logged in hours later. Multi-tenant workspaces began to struggle, and complaints trickled in. The problem was real - but we’d had no early clues about customer impact.
This wasn’t just an observability failure - it was a misalignment of priorities. Our metrics told us about machine performance, but not about user experience. And when it comes to reliability, customer experience is the only metric that truly matters.
Why resource metrics alone don’t help
Metrics like CPU utilization or memory consumption are tempting to monitor—they’re easy to measure and can feel reassuringly precise. But the truth is, they don’t always reflect what users experience.
Think about it:
- High CPU utilization might mean your system is working efficiently under load.
- Memory saturation could simply be an artifact of smart caching strategies.
Neither of these necessarily spells disaster. But when users can’t log in, complete transactions, or access key features, that’s when the alarms should go off.
During the Pina Colada incident, we were flooded with warnings about CPU usage but had no way of knowing that users in the US were starting to experience real issues. This was a wake-up call: it’s not enough to monitor what your machines are doing - you need to measure what your users are experiencing.
Golden Signals: metrics that matter for customer experience
The solution lies in focusing on the Golden Signals - latency, traffic, errors, and saturation - metrics that reveal how well your service is delivering for users. Introduced in Google’s SRE book, the Golden Signals shift the focus from machine-centric to user-centric observability.
Here’s how Golden Signals come into play:
- Latency: In e-commerce, high latency during the checkout process can lead to abandoned carts and lost revenue. But minor delays in order history retrieval might be acceptable.
- Traffic: For a streaming service, traffic spikes during a new season release are critical to monitor to ensure smooth playback.
- Errors: Booking systems need to track payment processing errors closely, as they’re directly tied to revenue and customer trust.
- Saturation: Gaming platforms must watch system saturation to avoid lag that disrupts player experience, even if backend batch jobs suffer temporarily.
Golden Signals aren’t just technical - they’re business metrics. They reflect how well your service supports user interactions that matter most.
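To make this concrete, here is a minimal sketch of what instrumenting the four Golden Signals could look like in a Python service using the `prometheus_client` library. The metric names, labels, and the `handle_checkout` function are illustrative assumptions, not details from the incident above.

```python
# A minimal sketch: exposing the four Golden Signals from a Python service
# using prometheus_client. Metric names, labels, and numbers are illustrative.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Traffic (and errors): how many requests hit the user-facing endpoint, by outcome.
REQUESTS = Counter("checkout_requests_total", "Checkout requests", ["status"])

# Latency: how long the user waits, bucketed so percentiles can be queried later.
LATENCY = Histogram(
    "checkout_latency_seconds",
    "Checkout request latency",
    buckets=(0.1, 0.25, 0.5, 1, 2, 5),
)

# Saturation: how full the part of the system the user depends on is,
# e.g. the share of checkout worker slots currently busy.
WORKER_SATURATION = Gauge("checkout_worker_saturation_ratio", "Busy worker share")


def handle_checkout() -> None:
    """Simulated user-facing request, instrumented with the Golden Signals."""
    start = time.monotonic()
    failed = random.random() < 0.02          # pretend ~2% of checkouts fail
    time.sleep(random.uniform(0.05, 0.3))    # pretend to do real work

    # Errors are recorded as a labelled slice of traffic.
    REQUESTS.labels(status="error" if failed else "ok").inc()
    LATENCY.observe(time.monotonic() - start)


if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        WORKER_SATURATION.set(random.uniform(0.2, 0.9))
        handle_checkout()
```

Once a service exposes metrics like these, alerts can be written against what the user feels - slow or failing checkouts - rather than against what the host happens to be doing.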
Building a user-centric reliability strategy
The Pina Colada incident wasn’t just a lesson in observability - it was a lesson in collaboration. SREs don’t just keep systems running; they align teams around a shared understanding of what reliability means for the business and its users.
Here’s how to build a user-centric reliability strategy:
1. Focus on What Matters: From Alerts to Golden Signals
Start by separating signal from noise. Every alert should answer one critical question: Does this directly impact the customer experience? Prune unnecessary alerts that don’t drive action or reflect user-facing issues.
For instance:
- An alert for high CPU on a reporting server might not warrant a response.
- Conversely, latency in the checkout flow is critical, as it directly affects sales.
Once alerts are cleaned up, prioritize Golden Signals—latency, traffic, errors, and saturation. These metrics provide a focused lens on how users interact with your service. However, their weight depends on the business context:
- In e-commerce, latency and errors in the checkout process take precedence over search result performance.
- In food delivery, traffic spikes during meal times should trigger capacity checks to avoid app slowdowns.
Tailor your metrics to reflect the critical parts of the user journey that align with business priorities.
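One way to encode the “does this affect the customer?” question is to route alerts by the signal and user journey they represent rather than by the resource that fired them. The sketch below is a hypothetical triage rule with made-up signal names and journeys; it illustrates the idea and is not a drop-in replacement for a real alerting pipeline.

```python
# A hypothetical triage sketch: page only on user-facing Golden Signals tied to
# critical user journeys; ticket everything else. Names and sets are assumptions.
from dataclasses import dataclass

USER_FACING_SIGNALS = {"latency", "traffic", "errors", "saturation"}
CRITICAL_JOURNEYS = {"checkout", "login", "payment"}  # business-defined


@dataclass
class Alert:
    name: str
    signal: str        # e.g. "latency", "cpu", "memory"
    user_journey: str  # e.g. "checkout", "reporting"


def route(alert: Alert) -> str:
    """Return 'page' for customer-impacting alerts, 'ticket' for the rest."""
    if alert.signal in USER_FACING_SIGNALS and alert.user_journey in CRITICAL_JOURNEYS:
        return "page"
    return "ticket"  # high CPU on a reporting server lands here


if __name__ == "__main__":
    print(route(Alert("HighCheckoutLatency", "latency", "checkout")))  # page
    print(route(Alert("HighCPUReporting", "cpu", "reporting")))        # ticket
```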
2. Align Metrics with Business and User Goals
Not all parts of your system need to be equally reliable—just the parts that matter most to users and revenue. This is where Service Level Objectives (SLOs) and Service Level Indicators (SLIs) shine.
- SLOs set clear targets for the most critical workflows:
  - For an e-commerce platform, aim for a 99.95% success rate on checkout transactions, while allowing relaxed targets for analytics reports.
  - For a ride-sharing app, prioritize the reliability of ride booking over trip history.
- SLIs measure specific aspects of user experience, guiding observability efforts:
  - In streaming services, track playback failure rates and time-to-first-frame latency.
  - For e-commerce platforms, monitor the percentage of successful purchases and page load times during peak events.
SLOs and SLIs work together to ensure that reliability targets are rooted in the most critical user workflows, not just what’s easiest to measure.
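As a rough illustration of how an SLI feeds an SLO, the sketch below computes a checkout success-rate SLI from request counts and compares it against a 99.95% target, including how much error budget remains. The counts, window, and target are assumed values for the example.

```python
# A minimal sketch: compute a success-rate SLI over a window and compare it to
# an SLO target via a simple error-budget calculation. Numbers are illustrative.
from dataclasses import dataclass


@dataclass
class WindowCounts:
    good_events: int   # e.g. successful checkout transactions
    total_events: int  # all checkout attempts in the window


def sli(counts: WindowCounts) -> float:
    """Success-rate SLI: good events divided by all events."""
    if counts.total_events == 0:
        return 1.0
    return counts.good_events / counts.total_events


def error_budget_remaining(counts: WindowCounts, slo_target: float) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, 0.0 = exhausted)."""
    allowed_failures = (1 - slo_target) * counts.total_events
    actual_failures = counts.total_events - counts.good_events
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return max(0.0, 1 - actual_failures / allowed_failures)


if __name__ == "__main__":
    window = WindowCounts(good_events=999_620, total_events=1_000_000)
    target = 0.9995  # 99.95% checkout success SLO
    print(f"SLI: {sli(window):.4%}")                                   # 99.9620%
    print(f"Error budget left: {error_budget_remaining(window, target):.1%}")  # 24.0%
```

The same pattern extends to latency SLIs: count requests completed under a threshold as the “good events” and reuse the budget calculation unchanged.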
3. Collaborate Across Teams: Shared Understanding of Reliability
Reliability isn’t just a technical concern—it’s a shared responsibility. Product, engineering, and business teams must align on what reliability means for users and the business.
For example:
- In a flight booking platform, product teams prioritize conversion rates during checkout, while engineering focuses on latency and error rates in payment APIs. Collaboration ensures both perspectives are covered.
- In a SaaS product, product managers might prioritize API uptime for third-party integrations, while engineers ensure database query performance is robust.
Collaboration fosters alignment, ensuring metrics and goals reflect user needs and business priorities across the organization.
The Takeaway
Reliability is about more than keeping systems running - it’s about delivering seamless, delightful user experiences. Metrics like CPU or memory usage might keep machines happy, but they don’t guarantee customer satisfaction.
The next time you’re flooded with alerts, ask yourself: Does this metric reflect customer pain, or is it just noise? The answer could change how you approach reliability forever.