“Have you ever stared at a dashboard, drowning in hundreds of metrics, wondering which ones actually matter?”
The Pina Colada Incident
A real story: the entire team was off at a company meetup, mingling across time zones, while I was home with a newborn, holding down the EMEA on-call fort. The shift started calm, with no alerts. I even let myself believe I might actually get some rest.
But just as my shift ended and the next on-call engineer picked up, chaos struck.
The alerts were relentless:
At first, it felt like a false alarm. There were no customer complaints, no tweets screaming, “Your service is down!” Just a barrage of resource-level metrics firing off like a rogue fireworks show.
Then came the kicker: most of our customers were in the US, and while things seemed fine during my shift, the real trouble started when they logged in hours later. Multi-tenant workspaces began to struggle, and complaints trickled in. The problem was real - but we’d had no early clues about customer impact.
This wasn’t just an observability failure - it was a misalignment of priorities. Our metrics told us about machine performance, but not about user experience. And when it comes to reliability, customer experience is the only metric that truly matters.
Metrics like CPU utilization or memory consumption are tempting to monitor: they’re easy to measure and can feel reassuringly precise. But they don’t always reflect what users experience.
Think about it:
Neither of these necessarily spells disaster. But when users can’t log in, complete transactions, or access key features, that’s when the alarms should go off.
During the Pina Colada incident, we were flooded with warnings about CPU usage but had no way of knowing that users in the US were starting to experience real issues. This was a wake-up call: it’s not enough to monitor what your machines are doing - you need to measure what your users are experiencing.
The solution lies in focusing on the four Golden Signals: latency, traffic, errors, and saturation. These metrics reveal how well your service is delivering for users. Introduced in Google’s SRE book, Golden Signals shift the focus from machine-centric to user-centric observability.
Here’s how Golden Signals come into play:
Golden Signals aren’t just technical - they’re business metrics. They reflect how well your service supports user interactions that matter most.
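As a rough illustration, the four signals can be derived from a window of request records. This is a minimal sketch, not a production collector: the `Request` fields, the `golden_signals` function, and the idea of measuring saturation as in-flight work over an assumed capacity limit are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    """One served request; the field names are illustrative assumptions."""
    duration_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarise one time window of requests as the four Golden Signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    # Count server-side failures only; 4xx responses are treated as client errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile([r.duration_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        # Assumption: saturation is in-flight work over a known capacity limit.
        "saturation": in_flight / capacity,
    }
```

Each value describes what users of the service are getting (how fast, how much, how often it fails, how close to full), which is the point of preferring these signals over raw CPU or memory readings.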
The Pina Colada incident wasn’t just a lesson in observability - it was a lesson in collaboration. SREs don’t just keep systems running; they align teams around a shared understanding of what reliability means for the business and its users.
Here’s how to build a user-centric reliability strategy:
Start by separating signal from noise. Every alert should answer one critical question: Does this directly impact the customer experience? Prune unnecessary alerts that don’t drive action or reflect user-facing issues.
For instance:
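One way to run that pruning pass is to tag each alert with two yes-or-no answers and demote anything that fails either. The sketch below assumes a hypothetical `Alert` record and made-up alert names; it is not tied to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    user_facing: bool  # does it fire on a symptom customers can feel?
    actionable: bool   # is there a concrete step the responder can take?


def triage(alerts):
    """Split alert names into those that should page and those to demote."""
    page, demote = [], []
    for alert in alerts:
        if alert.user_facing and alert.actionable:
            page.append(alert.name)
        else:
            demote.append(alert.name)
    return page, demote


# Hypothetical alert inventory, for illustration only.
alerts = [
    Alert("login_error_rate_high", user_facing=True, actionable=True),
    Alert("cpu_above_80_percent", user_facing=False, actionable=False),
    Alert("checkout_latency_p95_slow", user_facing=True, actionable=True),
]
page, demote = triage(alerts)
print("page:", page)
print("demote:", demote)
```

Demoted alerts do not have to be deleted. They can live on a dashboard or feed a ticket queue, where they support diagnosis without waking anyone up.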
Once alerts are cleaned up, prioritize the Golden Signals. They provide a focused lens on how users interact with your service. However, their weight depends on the business context:
Tailor your metrics to reflect the critical parts of the user journey that align with business priorities.
Not every part of your system needs the same level of reliability. Hold the parts that matter most to users and revenue to the highest standard. This is where Service Level Objectives (SLOs) and Service Level Indicators (SLIs) shine.
SLOs and SLIs work together to ensure that reliability targets are rooted in the most critical user workflows, not just what’s easiest to measure.
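To make the relationship concrete: an SLI is a measured ratio of good events to all events for a workflow, and an SLO is the target that ratio should meet. The gap between the target and 100% is the error budget. The sketch below assumes a simple availability-style SLI; the function names and the numbers in the usage lines are hypothetical.

```python
def sli(good_events, total_events):
    """Service Level Indicator: share of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(measured_sli, slo_target):
    """Fraction of the error budget left: 1.0 is untouched, 0 or below is spent."""
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0")
    actual_failure = 1.0 - measured_sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical month of login attempts measured against a 99.9% SLO.
login_sli = sli(good_events=99_950, total_events=100_000)
remaining = error_budget_remaining(login_sli, slo_target=0.999)
print(f"SLI: {login_sli:.4%}, error budget remaining: {remaining:.0%}")
```

A shrinking budget on a critical workflow such as login is a stronger reason to page or pause releases than any single machine-level threshold.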
Reliability isn’t just a technical concern. It’s a shared responsibility: product, engineering, and business teams must align on what reliability means for users and the business.
For example:
Collaboration fosters alignment, ensuring metrics and goals reflect user needs and business priorities across the organization.
Reliability is about more than keeping systems running - it’s about delivering seamless, delightful user experiences. Metrics like CPU or memory usage might keep machines happy, but they don’t guarantee customer satisfaction.
The next time you’re flooded with alerts, ask yourself: Does this metric reflect customer pain, or is it just noise? The answer could change how you approach reliability forever.