Data teams love to calculate and track everything with metrics. We already have the infrastructure in place to do so… yet we often fail to apply the same strategy to our own work.
To fix that, let’s take a look at a few metrics useful for measuring our own performance.
In my old office, there was a shelf that held tchotchkes and stuffed animals collected over the years.
When people visited me (I was then serving as CTO of SimilarWeb), they would sometimes fiddle with these items. Before leaving work on most days, I made sure everything was neatly returned to its proper place.
I remember hurrying through this therapeutic ritual one afternoon to make it to an evening performance of The Book of Mormon… when I was interrupted by a regional sales director, Adia, asking if we could speak.
She was struggling to convert new trial users to enterprise plans, and she explained that issues with the data insights my team provided in the product were contributing to the drop in conversion.
Situations like these arise all the time in companies and can go one of two ways.
The first is characterized by time-consuming investigations, anecdotal decision-making, and endless meetings. The second features quick resolutions driven by constructive, pointed conversations.
What determines which way a given situation will go?
Metrics as the Differentiator
In my case, the presence of internal metrics tracking the performance of my team anchored the ensuing discussion towards the second path.
I swiveled my monitor towards Adia and pulled up a dashboard showing an indicator of Data Quality over time. The metric was both stable over the last several months—even when isolating for her region—and above the internal target we set at the beginning of the year.
“Either this metric isn’t reflective of reality,” I said, pointing to the screen, “or there is a disconnect between what you believe is hurting conversion rates and what actually is. Let’s work together to figure out which it is.”
From this objective terra firma, we were able to have a constructive discussion on what was more likely responsible for the decline. Satisfied, we both left the office. I hailed a cab and made it to my seat right as Elder Price came on stage…
Picking Meaningful Metrics
This (only partially!) fictionalized situation illustrates one of the benefits you’ll get from measuring the output of a data engineering team. While most people agree this is something they should do, many get stuck on which metrics to track, where to track them, and when to review them.
Data Quality was in fact one of three primary internal metrics we used to measure the data engineering unit. Along with quality, we looked at Uptime and Velocity as additional metrics to gauge ourselves.
Why these three and how did we define them? Read on to find out!
Metric #1: Data Quality
Data quality can be measured by the number of incidents triggered by data issues. An incident can be defined to include internal alerts, failed tests, and, if relevant, issues raised by external consumers.
At SimilarWeb, the data engineering team’s output was surfaced to external users, since we were primarily a data product. Therefore, any data issues reported to customer support counted against our data quality metric.
You can also weight incidents by severity and resolution time, but doing so adds complexity.
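To make the incident-based definition concrete, here is a minimal Python sketch that turns a list of incidents into a percentage of incident-free days (the kind of goal discussed below) and, optionally, a severity-weighted incident load. The `Incident` record, its source labels, and the `SEVERITY_WEIGHTS` scale are hypothetical choices for illustration, not a description of what we ran at SimilarWeb.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical severity weights: adjust to whatever scale your team agrees on.
SEVERITY_WEIGHTS = {1: 1.0, 2: 3.0, 3: 10.0}


@dataclass(frozen=True)
class Incident:
    day: date
    source: str        # hypothetical labels: "alert", "failed_test", "customer_report"
    severity: int = 1  # hypothetical scale: 1 = minor, 2 = major, 3 = critical


def incident_free_day_ratio(incidents: list[Incident], start: date, end: date) -> float:
    """Share of days in [start, end] (inclusive) with no incident from any source."""
    total_days = (end - start).days + 1
    # Several incidents on the same day count as a single bad day.
    bad_days = {i.day for i in incidents if start <= i.day <= end}
    return (total_days - len(bad_days)) / total_days


def weighted_incident_load(incidents: list[Incident]) -> float:
    """Optional refinement: total incident count weighted by severity."""
    return sum(SEVERITY_WEIGHTS[i.severity] for i in incidents)


if __name__ == "__main__":
    june = [
        Incident(date(2024, 6, 3), "alert"),
        Incident(date(2024, 6, 3), "failed_test"),
        Incident(date(2024, 6, 10), "customer_report", severity=3),
    ]
    # Two distinct bad days, so 28 of 30 days are clean.
    print(incident_free_day_ratio(june, date(2024, 6, 1), date(2024, 6, 30)))
    print(weighted_incident_load(june))  # 1 + 1 + 10
```

Counting bad days rather than raw incidents keeps one noisy outage from dominating the number; resolution time could be folded into the weights the same way, at the cost of the extra complexity mentioned above.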
Higher Is Not Always Better
It is tempting to pick a really high number to set as your data quality goal.
Goals like “We want the data to be accurate 99% of the time” or “There should be no incidents on 95% of days” may sound good on the surface. However, aiming high isn’t always worth it, because your team will pay an opportunity cost to maintain the target.
That cost takes the form of time spent investigating data quality problems that prove rare or shouldn’t be prioritized. As a result, velocity, another important metric that we’ll discuss later, will slow down.
Instead of maximizing one metric in a vacuum, it’s important to be mindful of the trade-offs between them. Choosing a data quality goal of, say, 80% may be right for you.
The exact number isn’t as important as finding a monthly or quarterly cadence to review and tune it!
Metric #2: Data Uptime
Data uptime is the percentage of time a dataset is delivered on time, relative to its expected frequency or SLA requirements.
Let’s look more closely at these two variables:
- Expected Frequency: How often a dataset is expected to be updated. Most commonly daily, hourly, or real-time.
- SLA (Service Level Agreement) Requirement: The clause in an SLA between a data producer and consumer that stipulates when data must be updated. For example, an SLA might state an absolute time like 7:00am or a relative interval like every 3 hours.
If multiple datasets are delivered, we should measure uptime per dataset. To combine the uptime stats from all our datasets into one metric, we can use a weighted average. Just as with application uptime, we can use the “nines” convention (99%, 99.9%, 99.99%, and so on) to define what our data uptime should be.
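As a rough sketch of that roll-up, the Python below computes a weighted average of per-dataset uptime and shows how many late deliveries a given “nines” target actually allows. The dataset names and the weights (here, a hypothetical count of downstream consumers per dataset) are assumptions; use whatever reflects each dataset’s importance to you.

```python
def combined_uptime(uptimes: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dataset uptime (each value between 0 and 1)."""
    total_weight = sum(weights[name] for name in uptimes)
    return sum(uptimes[name] * weights[name] for name in uptimes) / total_weight


def allowed_late_deliveries(target: float, deliveries_per_period: int) -> float:
    """How many late deliveries an uptime target tolerates in one period."""
    return (1 - target) * deliveries_per_period


if __name__ == "__main__":
    # Hypothetical weights, e.g. number of downstream consumers per dataset.
    uptimes = {"events": 0.99, "accounts": 0.95}
    weights = {"events": 3, "accounts": 1}
    print(combined_uptime(uptimes, weights))  # roughly 0.98

    # An hourly dataset has 24 * 30 = 720 deliveries in a 30-day month.
    for target in (0.99, 0.999, 0.9999):
        print(target, allowed_late_deliveries(target, 720))
```

For an hourly dataset, a 99% target tolerates about seven late deliveries in a 30-day month, while 99.9% tolerates fewer than one, which is a quick way to see the opportunity cost of each extra nine.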
It’s fine to use only the expected frequency or only the SLA to determine uptime, but I recommend using both to paint a full picture of your data delivery performance.
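One way to apply both checks is to compute two uptime figures for each dataset and report them side by side. This minimal Python sketch assumes each scheduled update is recorded with its expected arrival time (derived from the expected frequency), its SLA deadline, and its actual delivery time; the `Delivery` record and the 15-minute `grace` window for the frequency check are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Delivery:
    expected_at: datetime             # when the expected frequency says the update should land
    sla_deadline: datetime            # latest time the SLA allows
    delivered_at: Optional[datetime]  # None means the update never arrived


def uptime_report(deliveries: list[Delivery],
                  grace: timedelta = timedelta(minutes=15)) -> dict[str, float]:
    """Frequency-based and SLA-based uptime for one dataset, side by side."""

    def on_schedule(d: Delivery) -> bool:
        # Hypothetical rule: on schedule if it lands within `grace` of the expected time.
        return d.delivered_at is not None and d.delivered_at <= d.expected_at + grace

    def within_sla(d: Delivery) -> bool:
        return d.delivered_at is not None and d.delivered_at <= d.sla_deadline

    n = len(deliveries)
    return {
        "frequency_uptime": sum(on_schedule(d) for d in deliveries) / n,
        "sla_uptime": sum(within_sla(d) for d in deliveries) / n,
    }


def at(day: int, hour: int, minute: int = 0) -> datetime:
    """Small helper for the example below."""
    return datetime(2024, 6, day, hour, minute)


if __name__ == "__main__":
    # A daily dataset expected at 06:00 with a 07:00 SLA deadline.
    deliveries = [
        Delivery(at(1, 6), at(1, 7), at(1, 6, 5)),   # on schedule and within SLA
        Delivery(at(2, 6), at(2, 7), at(2, 6, 40)),  # behind schedule, but within SLA
        Delivery(at(3, 6), at(3, 7), at(3, 7, 30)),  # misses both
        Delivery(at(4, 6), at(4, 7), None),          # never delivered
    ]
    print(uptime_report(deliveries))  # {'frequency_uptime': 0.25, 'sla_uptime': 0.5}
```

Day two in this example shows why both numbers are useful: the update ran 40 minutes behind schedule, which the frequency check flags, yet it still met the SLA, so consumers were not affected.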
Metric #3: Development Velocity
Just like in app development, we can look at velocity – the number of user story points completed per iteration – when measuring data engineering teams. Why mention it here if it’s the same metric for app development?
I’ve heard the argument that data engineers cannot work in an agile environment because data research and exploration doesn’t fit the framework. My experience managing data orgs for the past decade leads me to a different conclusion.
Research can absolutely be expressed in small, testable user stories and epics. Furthermore, I’ve seen that working with agile methodology and measuring velocity, even when research is part of the work, improves the efficiency of data engineering and data science teams.
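Velocity itself is simple arithmetic: sum the points of completed stories per sprint and watch the trend. The Python sketch below assumes a hypothetical `Story` record with a `kind` field, so that research stories count toward the same velocity as feature work; the point values are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Story:
    sprint: int
    points: int
    done: bool
    kind: str = "feature"  # hypothetical labels, e.g. "feature" or "research"


def velocity_by_sprint(stories: list[Story]) -> dict[int, int]:
    """Story points completed per sprint; unfinished stories contribute nothing."""
    totals: dict[int, int] = defaultdict(int)
    for s in stories:
        totals[s.sprint] += s.points if s.done else 0
    return dict(sorted(totals.items()))


def rolling_velocity(velocities: dict[int, int], last_n: int = 3) -> float:
    """Average velocity over the most recent `last_n` sprints."""
    recent = list(velocities.values())[-last_n:]
    return mean(recent)


if __name__ == "__main__":
    stories = [
        Story(1, 5, True), Story(1, 3, True), Story(1, 8, False),
        Story(2, 5, True), Story(2, 8, True),
        Story(3, 3, True),
        Story(3, 2, True, kind="research"),   # a small, testable research story
        Story(3, 5, False, kind="research"),
    ]
    velocities = velocity_by_sprint(stories)
    print(velocities)                    # {1: 8, 2: 13, 3: 5}
    print(rolling_velocity(velocities))  # average of 8, 13 and 5
```

Averaging over the last few sprints smooths out the natural noise in any single iteration, which matters when some sprints are research-heavy.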
If you feel like you have no way to quantify your data engineering team’s performance, tracking data quality, data uptime, and velocity is a great place to start. Insights from these metrics will allow your team to make informed technical decisions and better communicate the impact of those decisions outside the org.
One of the hardest things to get organizational buy-in for is an initiative aimed at reducing technical debt. Explaining the benefit in terms of delivering more accurate data faster (and having numbers to back it up!) is an effective way to win that support.
Ultimately, you can measure any number of KPIs, but my recommendation is to pick three or four that are focused and simple to understand; otherwise, you’ll end up optimizing for too many things.
By the way, LakeFS, the open-source project I created with my co-founder Oz Katz, influences all three metrics by simplifying workflow management for data environments.