Building a customer-focused Observability Maturity Model

Andrew Macdonald
Published in Xero Developer
10 min read · Mar 25, 2024

A huge amount is required of engineers, from writing code to operating services and everything in between. It’s easy to overlook important customer issues, because it’s difficult to know everything that should be done to make problems visible.

Below, we’ll cover the capabilities included in our maturity model, how observability fits into our engineering standards and finally, each maturity level with some key points. The concept of maturity levels can be applied to a lot of different areas, but in this context the levels refer to a progression of characteristics that relate to observability.

If you’re unsure of what observability is, it’s the ability to understand a system’s internal state by analysing the data it generates, such as logs, metrics, and traces. Observability helps teams analyse what’s happening in context across various interconnected systems so they can detect and resolve the underlying causes of issues.

The suggested model aims to make observability adoption easier, with supporting training, examples and templates for your environment, making problems more visible. Unlike most Observability Maturity Models published to date, this model isn’t tied to a particular vendor.

Background

In 2017, Xero started to change from a central operations team to product ‘build and run’ teams. One advantage of having a central operations team is that they are deeply invested in observability. There are a range of reasons why operations engineers are invested in reaching a high level of maturity, not least because it streamlines their day-to-day work. Once product teams are expected to build and also run their products, observability expertise cannot be assumed and often needs to be built from the ground up. There are a lot of benefits to the build and run team model which we won’t discuss in this blog, but the increase in cognitive load is significant. Previously, the specific ‘run’ expertise for things like observability was only required of operations/system engineers.

Capabilities

[Image: icons representing the metrics, logs and traces capabilities]

Metrics, logs and traces are the three pillars of observability, so it’s probably not too much of a surprise that they are capabilities in the maturity model. But why are Service Level Objectives (SLOs) included too?

We have included SLOs in the maturity model because they are closely related to the three pillars but with one critical difference: SLOs are focused on the customer. When we think about observability as purely technical work, metrics and systems can become the focus, rather than the customers using our product.

To help us measure our progress towards our SLOs, we use Service Level Indicators (SLIs). The main SLIs we use at Xero are Latency, Error Rate and Synthetic Availability. These metrics are central to our monitoring and alerting strategy. An SLI is simply a metric, while an SLO is a target for that metric. See the Google SRE book section on SLOs for more information.
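
To make the distinction concrete, here’s a quick worked example of how an SLO target translates into an error budget (the 99.9% target and 30-day window are illustrative numbers, not our actual SLOs):

```python
# Worked example: turning an SLO target into an error budget.
# The 99.9% target and 30-day window are hypothetical, not Xero's SLOs.

SLO_TARGET = 0.999          # 99.9% of requests should succeed
WINDOW_DAYS = 30            # rolling evaluation window

window_minutes = WINDOW_DAYS * 24 * 60
error_budget_minutes = window_minutes * (1 - SLO_TARGET)
print(f"Error budget: {error_budget_minutes:.1f} minutes of full outage per {WINDOW_DAYS} days")
# -> roughly 43.2 minutes

# Request-based view: how many failed requests the budget allows
total_requests = 10_000_000
allowed_failures = total_requests * (1 - SLO_TARGET)
print(f"Allowed failures: {allowed_failures:.0f} of {total_requests} requests")
```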

Engineering standards

Xero has internal engineering standards which contain requirements (must) and considerations (should/could) that teams follow while preparing and making changes to production. We thought we’d share a few examples which you might want to consider if you are incorporating some observability-related standards.

Deployments must only proceed to production when the current operating state is known

Are there any on-going incidents that may be relevant for your service? Is your service healthy, allowing you to compare before and after metrics? Do you have any outstanding alerts that would not trigger as you expect if there is a deviation in performance?

These are some of the questions that engineers consider before pushing changes to production.

Teams must have established SLO-based alerting on defined SLOs.

SLO-based alerts are one of the main ways our teams configure alerts. Is latency above the threshold that your customers expect, or are error rates higher than normal? Your on-call engineer should be alerted to this right away. Host metrics such as CPU or memory are often poor signals to alert on, because the system can still be operating fine when a host or two have high resource utilisation. Poor customer experience can happen for a range of different reasons.
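
To illustrate what an SLO-based alert evaluates under the hood, here’s a minimal burn-rate sketch. The thresholds, window and the page/ticket split are illustrative assumptions rather than our production configuration:

```python
# Minimal sketch of an error-budget burn-rate check.
# Thresholds, window and the page/ticket split are illustrative assumptions.

SLO_TARGET = 0.999                      # hypothetical availability SLO

def burn_rate(good: int, total: int) -> float:
    """How fast the error budget is being consumed.
    1.0 means exactly on budget; 14.4 sustained over an hour roughly
    means a 30-day budget would be gone in about 2 days."""
    if total == 0:
        return 0.0
    error_ratio = 1 - (good / total)
    budget = 1 - SLO_TARGET
    return error_ratio / budget

def alert_decision(good_1h: int, total_1h: int) -> str:
    rate = burn_rate(good_1h, total_1h)
    if rate >= 14.4:          # fast burn: page the on-call engineer
        return "page"
    if rate >= 3.0:           # slow burn: raise a ticket / channel message
        return "ticket"
    return "ok"

print(alert_decision(good_1h=98_000, total_1h=100_000))
# 2% errors -> burn rate ~20 -> "page"
```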

The team owning the product must ensure the impact of a release is monitored for a minimum of 2 hours after a release if they do not meet all minimum engineering standards.

That is, if you don’t have well observed systems, don’t finish a release at 5pm and go home, even if it’s just a ‘small change’.

No engineer likes activities filled with toil (repeated work that can be automated). This standard encourages teams to invest in their release process and observability so that this two-hour post-release activity is not required. Why two hours?

Some key reasons include:

  1. Symptoms of degradation can take time to manifest
  2. It can take time for alerts belonging to consumers of your service to go off
  3. A change to our distributed system can have unintended consequences

These standards help us achieve a basic objective: we want to know about issues before our customers tell us. This is a great principle for all engineers to keep in mind.

Maturity levels

The rest of this blog will focus on how we have defined each maturity level at Xero, both in simple terms and highlighting some of the key aspects.

Base level

Customers often tell us about issues, prompting a delayed, reactive response. There are gaps in our tooling that hinder investigation

Key features of the base maturity level

It often takes multiple iterations to tune SLOs to customer impact. Similarly, if SLOs are not used regularly after being implemented, they can lose their usefulness because systems change and evolve over time. Xero built a custom SLO tool a few years ago (before there were market offerings) that serves us well and makes it quick to create and modify SLOs.

For the two standard compute platforms we use at Xero, you can simply include a YAML configuration file to get application and infrastructure monitoring plus logging sent to our logging platform. SLOs similarly have templates that use the metrics available from our monitoring platform.

Alerting is not fit for purpose: it is often noisy and not tuned to customer experience. Teams can suffer from alert fatigue, and are notified by the customer experience team when there are issues in their systems, meaning our customers are telling us about problems rather than our alerting.

When setting up new alerts, ensure that the thresholds are appropriate by sending them to your team’s on-call channel for a week or two. Alerts should not stay there permanently for scenarios where an on-call engineer needs to address something now; those should page.

Beginner level

We have a high-level awareness of production health and adequate tooling to investigate issues

Key features of the beginner maturity level

SLOs aren’t useful if you don’t know when they are unhealthy. Paging your on-call engineer when your SLO error budget has run out is a step towards SLO maturity.

The more complete a trace you can construct, the faster you can pinpoint where a failure might be happening. Enabling distributed tracing is the next step beyond what standard APM agents will provide, allowing traces through multiple services to be stitched together.
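
As a rough sketch of what that stitching relies on, here’s how trace context can be propagated across an HTTP boundary using the OpenTelemetry Python API (the service and span names are made up for illustration):

```python
# Illustrative sketch of propagating trace context between services
# using the OpenTelemetry Python API (opentelemetry-api + opentelemetry-sdk).
# Service names, span names and endpoints are hypothetical.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def call_downstream():
    # Caller side: start a span and inject its context into the outgoing
    # request headers (W3C traceparent by default).
    with tracer.start_as_current_span("call-invoicing-api"):
        headers = {}
        inject(headers)
        # e.g. requests.post("https://invoicing.internal/api", headers=headers)
        return headers

def handle_request(incoming_headers):
    # Callee side: extract the caller's context so this span becomes
    # a child in the same distributed trace.
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("create-invoice", context=ctx):
        pass  # handle the request

handle_request(call_downstream())
```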

Synthetic tests cover the basic system state. One of the challenges with error and latency metrics is covering the case where an app is completely down due to some systemic failure, like the network being broken, meaning there are no metrics coming through to trigger alerts. This can be handled in a number of ways; synthetic tests from locations outside our network are a good signal for this situation.
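
A bare-bones version of such an external probe might look like the following; the URL, timeout and latency threshold are placeholder assumptions:

```python
# Minimal sketch of an external synthetic availability check.
# The URL, timeout and latency threshold are placeholder assumptions.
import time
import urllib.request

URL = "https://status.example.com/healthz"   # hypothetical endpoint
TIMEOUT_SECONDS = 5
LATENCY_THRESHOLD_SECONDS = 2.0

def probe() -> dict:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=TIMEOUT_SECONDS) as resp:
            healthy = resp.status == 200
    except Exception:
        healthy = False
    latency = time.monotonic() - start
    return {"healthy": healthy, "latency_seconds": round(latency, 3)}

result = probe()
if not result["healthy"] or result["latency_seconds"] > LATENCY_THRESHOLD_SECONDS:
    # In a real setup this would raise an alert via the monitoring platform.
    print("ALERT: synthetic check failed", result)
```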

Building a service dashboard with basic metrics like SRE Golden Signals — latency, error rate, saturation, throughput — and specific metrics that are relevant to your service helps to speed up troubleshooting when something inevitably goes wrong.
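
For illustration, the kind of numbers a golden-signals dashboard surfaces could be derived from raw request samples roughly like this (the data shape and the saturation figure are simplifying assumptions; in practice your monitoring platform computes these):

```python
# Rough illustration of the golden signals a service dashboard shows.
# The request-record shape and saturation figure are simplifying assumptions.

requests = [  # (duration_ms, http_status) samples over one minute
    (120, 200), (95, 200), (310, 200), (80, 500), (150, 200), (2200, 504),
]

durations = sorted(d for d, _ in requests)
errors = [s for _, s in requests if s >= 500]

throughput_rpm = len(requests)                              # traffic
error_rate = len(errors) / len(requests)                    # errors
p95_latency_ms = durations[int(0.95 * (len(durations) - 1))]  # crude nearest-rank p95
saturation = 0.72                                           # e.g. worker-pool utilisation (placeholder)

print(f"throughput={throughput_rpm}/min error_rate={error_rate:.1%} "
      f"p95={p95_latency_ms}ms saturation={saturation:.0%}")
```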

Intermediate level

We are already investigating issues before our customers are significantly affected and actively trying to mitigate impact

Key features of the intermediate maturity level

Reliability reviews are a practice that we strongly encourage. Teams can add their own flavour to the reviews but the agenda often starts with deep-diving into the SLOs which have the worst performance. Looking at the SLO trend over the last month, quarter and year helps to identify degradation that might not be picked up by alerting, allowing the team to reorient themselves to the current system performance. Changes to the system are often the trigger for trends improving and/or getting worse. Identifying incidents that occurred during the month is a key part of determining if SLOs are fit-for-purpose and accurately reflect customer experience.

Instrumenting core aspects of your code that don’t automatically get picked up by the APM agent will help with troubleshooting when you need it. Using tracing in development and pre-production is important for exploring how rich your traces are. Vendors often provide an SDK for custom instrumentation, but OpenTelemetry is gaining adoption as the industry standard and helps to reduce vendor lock-in. Some of our teams are starting to experiment with OpenTelemetry but it is not widely adopted at present.
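
As a hedged example, custom instrumentation with the OpenTelemetry Python API can look like this; the service, span and attribute names are made up and aren’t Xero conventions:

```python
# Illustrative custom instrumentation with the OpenTelemetry Python API.
# Service, span and attribute names are made up for this example.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("invoice-service")

def reconcile_invoices(batch):
    # Wrap a core code path the APM agent would not see on its own,
    # and record business-relevant attributes on the span.
    with tracer.start_as_current_span("reconcile_invoices") as span:
        span.set_attribute("reconcile.batch_size", len(batch))
        matched = [item for item in batch if item.get("matched")]
        span.set_attribute("reconcile.matched_count", len(matched))
        return matched

reconcile_invoices([{"matched": True}, {"matched": False}])
```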

Browser-based synthetic tests act like a user on the site, creating a fast feedback loop if anything is unhealthy, and provide useful information such as which locations are failing to speed up investigation.

A lot of system context can be captured on a dashboard. Is a queue filling up and not draining? Are database writes taking longer than normal? Does a dependency have high latency? When a team has these metrics on a dashboard, a lot of time is saved with metric discovery in a high pressure situation. Meaningful dashboards are also a useful tool for on-boarding new engineers to your on-call rotation.

Advanced level

When our services are not at full health, we know about it and are working to restore excellent service

Key features of the advanced maturity level

Does your Product Manager frame the expected customer experience for SLOs and lead the team conversation when they are trending in the wrong direction? Having a shared understanding when you are discussing the priority of new feature vs reliability work is a core goal of SLOs. Often, SLOs are driven by engineers.

The first part of a chaos experiment is thinking about the various failure modes of your system. This, in and of itself, is a good step towards considering your system’s edge cases. Running experiments to validate or disprove how you think your system will respond under certain conditions is a hugely valuable way to update your team’s mental model of the system.
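
A minimal sketch of that experiment structure is shown below; the steady-state check and the injected failure are placeholders for whatever your chaos tooling and monitoring provide:

```python
# Minimal sketch of a chaos experiment: state a hypothesis about steady
# state, inject a failure, then verify the hypothesis held.
# The checks and the injected failure are placeholders for real tooling.

def current_error_rate() -> float:
    # Placeholder: in practice, query your monitoring platform's SLIs.
    return 0.0004

def steady_state_ok() -> bool:
    return current_error_rate() < 0.001

def inject_failure():
    # Placeholder: e.g. terminate an instance or add network latency
    # via your chaos tooling.
    print("injecting failure: kill one instance in the pool")

def run_experiment() -> str:
    assert steady_state_ok(), "system not healthy; abort experiment"
    hypothesis = "error rate stays under 0.1% when one instance is lost"
    inject_failure()
    survived = steady_state_ok()
    return f"hypothesis '{hypothesis}' {'held' if survived else 'was disproved'}"

print(run_experiment())
```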

Distributed tracing is good, but when sampling happens independently at each component and you are only sampling 5% of requests, the chance of getting complete traces is low when you are talking about two components, let alone five. Tail-based sampling sends all trace data to an edge collector, which assembles complete traces before deciding which ones to keep.
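
The core idea can be sketched as a collector that buffers spans by trace ID and only makes the keep/drop decision once the trace is complete. This is a toy illustration, not a real collector implementation:

```python
# Toy illustration of tail-based sampling: buffer spans per trace and
# decide keep/drop only once the trace is complete. Real systems do this
# in a collector at the edge; names and thresholds here are made up.
from collections import defaultdict
import random

traces = defaultdict(list)   # trace_id -> spans buffered at the collector

def record_span(trace_id, name, is_root, duration_ms, error=False):
    traces[trace_id].append(
        {"name": name, "root": is_root, "duration_ms": duration_ms, "error": error}
    )

def trace_complete(spans) -> bool:
    # Toy completeness check: the root span has arrived.
    return any(s["root"] for s in spans)

def keep_trace(spans) -> bool:
    # Keep anything slow or erroring in full; sample the rest at 5%.
    interesting = any(s["error"] or s["duration_ms"] > 1000 for s in spans)
    return interesting or random.random() < 0.05

record_span("trace-1", "db.query", is_root=False, duration_ms=40)
record_span("trace-1", "GET /invoices", is_root=True, duration_ms=1500)

for trace_id, spans in traces.items():
    if trace_complete(spans):
        print(trace_id, "keep" if keep_trace(spans) else "drop")
```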

Alerts should be reserved for high-urgency issues that need real-time intervention. If a disk is filling up but won’t cause issues for a week, or memory is growing but can wait a couple of days, the on-call engineer doesn’t need to be woken up at 2am to deal with it. What other cases could be dealt with during work hours using pre-emptive messages to your on-call channel via your internal messaging system (e.g. Slack)?

Expert level

We have excellent engineering practices, actively making our systems safer with highly automated processes

Key features of the expert maturity level

When you are designing a new system or modernising, are SLOs part of the requirements for the system and validated throughout the build process? Often, SLOs are created retrospectively based on the historical performance of your system. If your system is performing to your users’ expectations, this can be a valid approach. However, confirming the design decisions along the way with data can reduce surprises and expensive rework once the system is closer to go-live.

Chaos engineering in production is the gold standard for production operations. To get there, you’ll need to be confident enough in your system that instances can be randomly terminated without your users noticing. Your system needs to be highly automated to ensure that service remains consistent and edge cases can be recovered from without waking up your on-call engineer.

Can you trace a user’s actions from the user interface right through to the database? This provides a high level of detail almost instantly; otherwise it would be a very time-consuming task to stitch together logs from multiple systems or join partial traces and logs. A system that has a single, complete trace makes a complex troubleshooting exercise simple!

When you have high SLO targets, informing other teams and/or third parties that their service is degraded is important to maintain a high bar. What does it look like when your dependencies are degraded? Do you have mitigations in place, like circuit breakers, to gracefully deal with a situation that could otherwise cause cascading failure?
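
As one example of such a mitigation, a minimal circuit breaker might look like the sketch below; the failure threshold and cool-down period are arbitrary placeholders:

```python
# Minimal circuit breaker sketch: stop calling a degraded dependency
# after repeated failures, then probe again after a cool-down.
# The threshold and timings are arbitrary placeholders.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast, dependency degraded")
            self.opened_at = None          # half-open: let one probe through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # success closes the circuit again
        return result

breaker = CircuitBreaker()

def flaky_dependency():
    raise TimeoutError("dependency timed out")   # simulated degradation

for _ in range(6):
    try:
        breaker.call(flaky_dependency)
    except Exception as exc:
        print(type(exc).__name__, exc)
```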

Conclusion

Observability can be hard! There is a wide variety of tooling in the industry and no one tool that does everything well. Each vendor has their own way of doing things and specific selling points for their tool, so being an expert in one tool doesn’t automatically translate to another. Add to that the sheer volume of things required of an engineer, and it is easy to see why consistently high levels of maturity are hard to achieve.

The emergence of OpenTelemetry is starting to improve industry standardisation, allowing companies to be less tied to a specific vendor. Our experience has shown that internal tooling and templates significantly streamline an engineer’s ability to improve observability.

Does this resonate with you? Are you interested in follow-up posts about things we mentioned but didn’t cover in enough detail? Comment below about what you’d like to see more of.
