Scaling Log Management with Xero’s Rapid Business Growth

Ethan Motion
Xero Developer
6 min read · Apr 30, 2020

When Xero’s Reliability teams were tasked with reducing platform cost, my team and I explored the cost of the tools that we managed on behalf of Xero’s Product and Platform teams. We identified a surprising area where cost was rising faster than the business was growing: log management. Like many SaaS companies, Xero manages logging through a third-party service that facilitates the collection, transmission, aggregation, storage, and disposal of log data.

Predicted business growth vs logging growth
This graph shows how logging was expected to grow at a higher rate than the business. The graph does not use real data and is for demonstration purposes only.

We made it our mission to explore Xero’s logging behaviours. We wanted to reduce the cost of our log management tools while maintaining first-class observability through logging. In this blog, I’ll be discussing how we managed to shift our logging growth from superlinear to sublinear, and achieved a logging model that successfully scales with Xero’s rapid growth.

Quick Wins

Our first step was to create a Priority Matrix to compare the effort and impact of the actions we could take to reduce cost. Our immediate priority was to quell the exponential logging growth as quickly as possible, so we began with some quick wins to buy ourselves time while we worked on more robust long-term solutions.

We wanted to identify sources of logging that were high in volume, but low in value. We defined ‘low value’ as a lack of searches against the data and/or observing that the content of the logs did not appear to be of value. We identified around 10 major logging sources that we believed could be reduced or stopped completely.
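As an illustration only, the "high volume, low value" check could be sketched like this in Python. The `LogSource` fields, the 30-day search window, and the thresholds are hypothetical and not taken from Xero's tooling; in practice the numbers would come from whatever usage reports your log management service exposes.

```python
from dataclasses import dataclass


@dataclass
class LogSource:
    # Hypothetical fields; real values would come from your log vendor's usage reports.
    name: str
    daily_gb: float
    searches_last_30_days: int


def low_value_candidates(sources, min_daily_gb=50.0, max_searches=0, limit=10):
    """Return the largest sources that are rarely or never searched."""
    candidates = [
        s for s in sources
        if s.daily_gb >= min_daily_gb and s.searches_last_30_days <= max_searches
    ]
    # Biggest sources first, since they offer the largest potential saving.
    return sorted(candidates, key=lambda s: s.daily_gb, reverse=True)[:limit]


sources = [
    LogSource("checkout-api", daily_gb=120.0, searches_last_30_days=0),
    LogSource("billing-worker", daily_gb=80.0, searches_last_30_days=42),
    LogSource("legacy-batch", daily_gb=300.0, searches_last_30_days=0),
    LogSource("tiny-service", daily_gb=0.5, searches_last_30_days=0),
]
for source in low_value_candidates(sources):
    print(source.name, source.daily_gb)
```

A list like this is only a starting point for a conversation with the log owner; a lack of searches alone does not prove that the logs are unwanted.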

All it took from here were some conversations with the log owners. We found that for almost every logging source that we identified as ‘low value’, the logging was either unused, unwanted, or sometimes not even known about. Once we notified the owners about these logging sources they were quickly removed.

Logging Standards

Through these conversations with log owners, we learned that a key reason teams were logging high volumes of unnecessary data was that there weren’t clear guidelines around what should be logged. Our internal logging documentation had gone stale. Because of this, the default behaviour was to play it safe and “log everything”.

Log all the things meme

This feedback validated that we should focus on giving our internal logging documentation a complete revamp. We updated our docs to be simpler and to use consistent formatting for an improved UX. During this cleanup we also produced Xero’s internal Logging Standards: a set of requirements and recommendations that any team at Xero could refer to when reviewing their existing logging and when developing new features and services.

One of the factors contributing to our increasing rate of logging was an inconsistent use of log levels. A substantial portion of our logging contained log levels, but there seemed to be little reasoning behind how they were applied. Often the content was left over from the software’s development, or was information that was simply no longer needed. We also found high volumes of ‘debug’ and ‘trace’ logs coming from our Production environment.

It was clear that we needed to define logging levels and specify which levels should be logged in which environment. The following chart is what we produced:

Logging level recommendations chart
The Logging Level Recommendations we produced for development teams to understand when they should use each log level and what each level’s purpose was.

As you can see, we recommended disabling debug and trace logging on all of our Production systems, for these reasons:

  • The increased volume of logging can cause disk space to deplete much more quickly.
  • Since debug and trace logging is often very verbose, it can use a high amount of resources. This can require larger and more expensive instances.
  • The long-term cost of processing and storing the high volumes of logging produced by enabling debug and trace logs in production is expensive and unnecessary.

We didn’t make this a requirement because we cannot be sure that these recommendations will fit all scenarios (e.g. these logs can be helpful in diagnosing issues in production), but they should fit most scenarios.
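A minimal sketch of how a recommendation like this might be applied in code, using Python's standard `logging` module. The environment names, the `MIN_LEVEL_BY_ENV` mapping, and the `configure_logger` helper are assumptions made for illustration and are not Xero's actual standard. Python has no built-in trace level, so debug stands in for both here.

```python
import logging

# Hypothetical mapping from environment to the lowest level that is emitted.
# Only the production row reflects the recommendation above; the rest is assumed.
MIN_LEVEL_BY_ENV = {
    "development": logging.DEBUG,
    "test": logging.DEBUG,
    "production": logging.INFO,
}


def configure_logger(name, environment, override_level=None):
    """Set a logger's level from its environment, with an explicit escape hatch."""
    logger = logging.getLogger(name)
    if override_level is not None:
        # A recommendation rather than a requirement: teams can opt back in,
        # for example while diagnosing an issue in production.
        level = override_level
    else:
        # Unknown environments fall back to the production-style default.
        level = MIN_LEVEL_BY_ENV.get(environment, logging.INFO)
    logger.setLevel(level)
    return logger


prod_logger = configure_logger("payments", "production")
print(prod_logger.isEnabledFor(logging.DEBUG))
print(prod_logger.isEnabledFor(logging.INFO))
```

Keeping the override explicit means debug logging in production is a deliberate, visible choice rather than a leftover default.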

With these standards available, we were now able to engage more confidently with logging owners. This resulted in more logging reductions, while maintaining first-class logging observability. We also communicated these standards to the business so that logging owners could take their own initiatives in reducing their logging.

Automation

Since these wins proved worthwhile, we saw value in adding a level of automation to this process.

We built dashboards and alerting which monitor the volume of all of our logging sources. When there is a significant increase, our alerting notifies us. This meant that as Administrators, we eliminated time previously spent on manually reviewing logging source growth, and could instead spend more of our time engaging with teams.
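The kind of check that sits behind such an alert can be sketched as follows. This is an assumption-laden illustration rather than our actual alerting: the seven-day baseline, the 1.5x threshold, and the `sources_to_alert` function name are all hypothetical, and a real setup would normally use the alerting features of the log management service itself.

```python
from statistics import mean


def sources_to_alert(daily_bytes_by_source, baseline_days=7, threshold=1.5):
    """Flag sources whose latest daily volume exceeds threshold x their recent average."""
    alerts = []
    for source, series in daily_bytes_by_source.items():
        if len(series) < baseline_days + 1:
            continue  # Not enough history to form a baseline.
        # Average of the days immediately before the latest one.
        baseline = mean(series[-(baseline_days + 1):-1])
        latest = series[-1]
        if baseline > 0 and latest > baseline * threshold:
            alerts.append((source, latest / baseline))
    # Largest relative increase first.
    return sorted(alerts, key=lambda alert: alert[1], reverse=True)


volumes = {
    "api-gateway": [100, 100, 100, 100, 100, 100, 100, 300],
    "web-frontend": [100, 100, 100, 100, 100, 100, 100, 100],
    "new-service": [50, 60],
}
for source, ratio in sources_to_alert(volumes):
    print(f"{source}: {ratio:.1f}x its recent average")
```

Comparing each source against its own recent history, rather than against a fixed limit, lets small and large sources share one rule.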

Sublinear!

It’s at this point we are able to confirm that our logging growth rate, and therefore cost, is now increasing at a slower rate than the growth of the business.

Actual business growth vs logging growth
This graph shows how Xero’s logging is now growing at a slower rate than the business. The graph does not use real data and is for demonstration purposes only.

While this is awesome, this isn’t the end. We expect the logging growth rate to continue to increase as we enter the next phase of scaling our log management.

Empowerment and the Future of Scaling Log Management

So what are our next steps? How do we continue to scale first-class logging into the future?

With Xero continuing to grow at pace, it’s important that my team and I step back from juggling Xero’s many logging sources ourselves. Instead, we should give each team the information and support they need to maintain their own sublinear logging growth rate.

We’ve started doing this by exposing more granular ingestion rates to teams directly, by providing them with dashboards and alerting. We’re also rolling out sublinear logging growth rate as a standard that must be met by all teams.

Two methods that we want to implement in order to remain sublinear are sampling and dynamic filtering of logging.

Sampling

For some info-level log streams, there will be a point in time when there’s sufficient logging volume that alerting can be configured, dashboards can be built, and additional logging simply becomes extra noise. This is the point when log sampling becomes a viable option. Log sampling is when we only ingest a portion of the log stream (say, half of the logs). We can then estimate actual values by extrapolating from the sampled data using the sample rate.
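To make the extrapolation concrete, here is a minimal sketch of random sampling at a fixed rate. The `LogSampler` class is hypothetical and only illustrates the idea; it is not a description of how we, or any particular vendor, implement sampling.

```python
import random


class LogSampler:
    """Keep roughly `sample_rate` of log events and estimate the true total."""

    def __init__(self, sample_rate, rng=None):
        if not 0 < sample_rate <= 1:
            raise ValueError("sample_rate must be in the range (0, 1]")
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()
        self.kept = 0

    def should_ingest(self):
        # Each event is kept independently with probability sample_rate.
        keep = self._rng.random() < self.sample_rate
        if keep:
            self.kept += 1
        return keep

    def estimated_total(self):
        # Extrapolate: every kept event stands for 1 / sample_rate real events.
        return self.kept / self.sample_rate


sampler = LogSampler(sample_rate=0.5, rng=random.Random(42))
for _ in range(10_000):
    sampler.should_ingest()
print(sampler.kept, round(sampler.estimated_total()))
```

The estimate is approximate, and its relative error shrinks as volume grows, which is why sampling suits high-volume streams better than sparse ones.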

Filtering

For other log streams, dynamic filtering is also a possibility, albeit a little more complex to implement.

Dynamic filtering involves sampling or filtering logs based on a range of factors (e.g. time of day, whether a release window is in progress, etc).
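One way to picture dynamic filtering is as a function that picks a sample rate from the current context. The rules below (keep everything at warning level and above, keep everything during a release window, sample info logs more heavily during assumed busy hours) are invented purely for illustration, as are the function name and the specific rates.

```python
from datetime import datetime, time

# Hypothetical business hours during which info-level volume is assumed to peak.
BUSY_START = time(9, 0)
BUSY_END = time(17, 0)


def sample_rate_for(level, now, in_release_window):
    """Return the fraction of logs to ingest for the given context."""
    if level in ("warning", "error", "critical"):
        return 1.0  # Never drop logs that signal a problem.
    if in_release_window:
        return 1.0  # Full detail while a release is rolling out.
    if BUSY_START <= now.time() < BUSY_END:
        return 0.1  # High volume: a small sample is still representative.
    return 0.5  # Quieter periods: keep a larger share.


print(sample_rate_for("info", datetime(2020, 4, 30, 10, 0), in_release_window=False))
print(sample_rate_for("info", datetime(2020, 4, 30, 10, 0), in_release_window=True))
```

The returned rate could then be fed into whatever sampling mechanism is already in place, so the filtering rules stay separate from the sampling itself.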

Key Learnings

  1. Good logging is essential for first-class observability. In a rapidly growing software company that frequently adds high-traffic services, logging volume does not scale without good log management administration.
  2. To understand the root causes of Xero’s logging growth, it was critical that we engaged with our customers, listened, and understood their pain points and their resulting logging behaviours.
  3. Good observability of your log management systems is critical for identifying and addressing problems early.
  4. Providing clear and concise logging standards and recommendations is a key baseline to refer to in training sessions and conversations with teams about their logging practices.

That brings me to the end of this chapter. Hopefully you’ve found our journey so far useful. I’m looking forward to the next chapter of Xero’s log management journey.
