Einat Orr, PhD.
August 17, 2022

A few weeks ago, I was looking at a dashboard in our internal BI system. It’s a simple system: Redash over a PostgreSQL database that holds just a few hundred thousand rows.

I noticed a change in one of my favourite metrics, which counts the number of new installations since the beginning of the quarter. Naturally, it’s a non-decreasing metric.

And yet, when I looked at it, it showed 499, a smaller number than the 500 I saw the day before. Our BI analyst tried to debug the issue by running the same query with yesterday’s timeframe, expecting 500. She got 499. So we have a problem: a non-decreasing metric that is decreasing, and an issue that does not reproduce.

And we’re not talking about a big data environment here. This error didn’t come from Spark being poorly distributed or configured. It was just as simple as SQL over PostgreSQL. 

I would’ve been really upset about this 25 years ago, when I started my career as a data practitioner. But I’ve learned over time that when working with data, events like this are just another day at the office. 

The data world is riddled with challenges. Luckily, we now have tools to tackle the two issues we find at the core of every data practitioner’s experience. Keep on reading to find out how recently developed tools help solve the two intrinsic data challenges we know so well. 

What is data manageability? 

Data Manageability is the set of processes and tools that ensure an organisation is capable of controlling and tracking the data it holds.

It means that questions such as these are easily answered: What data sets do we have? How do they relate to one another? How do they come about and why? Who owns them? How do they evolve over time?

Two intrinsic challenges of working with data

Let’s start with a quick overview of the two roadblocks intrinsic to data:

Challenge 1: Data discovery, or knowing your data 

What data do I have in front of me? Where is it? Does it mean what I think it means? What is the context in which it’s collected, stored, or calculated? Who owns it? This is all essential information we need to get relevant and correct insights from the data.

Challenge 2: Data is transient and changes over time

We like thinking of our data as static. Once a set of events has been reported to us, we assume the data we received is correct, complete, and consistent. In reality, for most data sets, none of those assumptions hold. We may have data arriving late, we may have bugs in previous calculations that should be fixed, and we may have additional data sources that shed new light on old insights. Whatever the reason may be, our single source of truth is only true for the time we looked at it. While transiency is a problem we already had with small data, it worsens with scale.

These two problems aren’t anything new

These issues were present even 50 years ago with traditional data warehouses. The large data warehouse providers introduced solutions, for example using the warehouse schema as the data catalog. Open source solutions such as PostgreSQL offered partial solutions for cataloging, but neither enterprise software nor open source solutions offered a real solution to transiency. If these problems are so intrinsic, why weren’t solutions developed?

Because the ecosystem was trying to survive the exponential growth in scale. 

We prioritized being able to ingest more data, run compute, analyze these huge amounts of data, and serve it to our consumers in a way they could digest. 

About 4 years ago, once we had figured out how to deal with massive data sets, we started looking at these intrinsic problems again. And tons of new tools appeared on the market!

Some of them emerged earlier but gained traction in the last 3-4 years, and others are being brought to life right now. Let’s dive deeper into those tools and see how data practitioners can use them to solve these two data challenges.

Source: lakeFS

Data discovery? Discovery Platforms come to the rescue 

The what, where, and why questions are hard to answer in the data world. The best solution so far has been a data catalog.

In the past, we had database schemas to which we added forms of metadata. Advanced organizations had test systems in place, ensuring that metadata got saved together with the schema. When working with several databases, you’d just add another layer. Large enterprises, working with services companies, built tools based on manual documentation that covered all the relationships, not only within a schema but also between the schemas of different databases.

Next came semi-manual catalogs. Some data was brought in automatically – for example, from existing database schemas or other metadata. And some of the work was done using mechanisms like crawling. The rest was added manually. 

This has now evolved into a solution that aspires to be an entirely automated catalog, relying on APIs to data sources and data analysis tools. This phase is still under construction and attracts a lot of interest. A data discovery platform would be the Holy Grail, giving us an interface that would answer the fundamental questions of where, what, and why is my data?

Source: lakeFS 

Currently, a tool that offers all that in one package doesn’t exist. But this is where the future of data catalogs is going. 

In the future, we’re bound to see tools with functionalities like:

  • Holding all the metadata for present data
  • Assisting in the querying process of data via its own data discovery platform logic 
  • Advanced search features
  • Extensive information about metrics – with information about the metric itself, where it exists, and whether it exists in several places, showing who owns which parts of a metric
  • A lineage between different datasets and how they evolve from one to the other 
  • Preview of the data – advanced visualization for a better understanding of the data
  • And most boldly: a querying interface agnostic to the data source itself

This wide range of functionality gives easy access to observability and knowledge of our data. I dare say we never had that before – not even for the most straightforward data scenarios.
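To make the list above more concrete, here is a minimal sketch of what a catalog entry with ownership, search, and lineage could look like. It is a toy in-memory model, not the design of any existing discovery platform: the class names `DatasetEntry` and `Catalog`, their fields, and the example data set names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Catalog metadata for one data set (all fields are illustrative)."""
    name: str
    owner: str
    description: str
    upstream: list[str] = field(default_factory=list)  # direct parents (lineage)


class Catalog:
    """Hypothetical in-memory catalog: register, search, and trace lineage."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[str]:
        """Names of data sets whose name or description mentions the term."""
        term = term.lower()
        return sorted(
            e.name for e in self._entries.values()
            if term in e.name.lower() or term in e.description.lower()
        )

    def lineage(self, name: str) -> list[str]:
        """All ancestors of a data set, nearest first, without duplicates."""
        seen: list[str] = []
        queue = list(self._entries[name].upstream)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self._entries:
                queue.extend(self._entries[parent].upstream)
        return seen


catalog = Catalog()
catalog.register(DatasetEntry("raw_events", "platform-team", "Raw installation events"))
catalog.register(DatasetEntry("installs_daily", "analytics",
                              "Daily installation counts", ["raw_events"]))
catalog.register(DatasetEntry("installs_qtd", "analytics",
                              "New installations since the start of the quarter",
                              ["installs_daily"]))

print(catalog.search("quarter"))
print(catalog.lineage("installs_qtd"))
```

A real platform would populate such entries automatically from data sources rather than by hand, which is exactly the part that is still under construction.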

Example solutions in data discovery

Source: lakeFS

There is a growing number of players in this space, each focusing on different parts of the data discovery platform. This family of tools started with enterprise offerings in 2014 and is now expanding to SMBs, including open-source-based solutions.

Most open source solutions emerged from large organizations, such as Netflix, Uber, Meta, and LinkedIn. Some are now backed by commercial companies, founded by the project creators, that provide them as a service, allowing even the smallest startups to use catalogs.

Solving the transient nature of data 

Let me first convince you that data changes over time by sharing some scenarios. 

Scenario 1: Missing data

Some of the data didn’t arrive or was late due to some issue with the operational system. But when we analyze the data tomorrow, we will already have all of the data there. Although we’re reporting the same timeframe, we now have more data than before. 

Source: lakeFS
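The late-data scenario can be sketched in a few lines. This is a toy illustration with made-up events, assuming each record carries both the time it happened and the time it arrived; the function name `count_events` and the dates are hypothetical.

```python
from datetime import date, datetime

# Each event: (day it happened, moment it reached the database).
events = [
    (date(2022, 7, 1), datetime(2022, 7, 1, 9, 0)),
    (date(2022, 7, 2), datetime(2022, 7, 2, 10, 0)),
    (date(2022, 7, 2), datetime(2022, 7, 4, 8, 0)),  # arrived two days late
]


def count_events(start: date, end: date, as_of: datetime) -> int:
    """Count events in [start, end] using only rows that had arrived by as_of."""
    return sum(
        1 for happened, arrived in events
        if start <= happened <= end and arrived <= as_of
    )


timeframe = (date(2022, 7, 1), date(2022, 7, 2))

# Same timeframe, queried on two different days.
seen_on_july_3 = count_events(*timeframe, as_of=datetime(2022, 7, 3, 12, 0))
seen_on_july_5 = count_events(*timeframe, as_of=datetime(2022, 7, 5, 12, 0))

print(seen_on_july_3, seen_on_july_5)
```

The report for the same two days differs depending on when it is run, which is why yesterday’s number cannot be reproduced unless the arrival time (or a version of the data) is kept.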

Scenario 2: Corrected data

Someone entered data into an operational system manually and made a mistake; the error was later discovered and the data was corrected. The most severe change would be replacing the entire data set – for example, if we decided to upgrade it.

Source: lakeFS

What can we do when the past changes? 

We need to understand that the data we saw yesterday is the “truth” of yesterday, and the way the same time frame looks today is today’s “truth.” Our single source of truth is time dependent, or in other words, transient. Next-generation data warehouses and lakehouses have some capabilities to support data transiency.

Here’s an example from Snowflake, which provides two layers of versioning. The first is for the table itself: you can access the table’s historical states by querying it as of an earlier point in time. The second is the history of queries, where you can see the different results over time. Still, the amount of time you can hold this history is limited, and the service costs money.

Source: lakeFS

But what if you are using a data lake, or a lakehouse such as Databricks or Dremio?

Open table formats

Open table formats allow us to support the mutability of data in a world that is essentially immutable – the world of files in the object storage that constitutes our data lake. 

How do open table formats work? Suppose you have a specific version of a file. If something changes (data is deleted or edited), you get a new small file that represents the delta. In addition, you have a log that holds the metadata of these delta files, keeping track of the history of changes. To learn what the data looked like at a certain point in time, you need to compact that information to get a new version of the files that includes that data. This process requires computation.

This logic was first meant to support concurrency: the need for us to look at one version/schema of the data, until we decide to expose another. But it allows you to go back in time to the file version before any changes were applied.
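The mechanism described above can be sketched as a toy model: a base file, an ordered log of small deltas, and a replay step that reconstructs the table as of a given version. This is only an illustration of the idea; it does not reflect the actual file layout or API of Hudi, Iceberg, or Delta Lake, and the names `Delta`, `ToyTable`, `commit`, and `snapshot` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Delta:
    """One small change file: rows to insert or update, and keys to delete."""
    upserts: dict
    deletes: tuple = ()


class ToyTable:
    def __init__(self, base_rows: dict) -> None:
        self._base = dict(base_rows)  # version 0: the original file
        self._log: list[Delta] = []   # ordered log of delta files

    def commit(self, delta: Delta) -> int:
        """Append a delta to the log and return the new version number."""
        self._log.append(delta)
        return len(self._log)

    def snapshot(self, version: Optional[int] = None) -> dict:
        """Replay the log up to `version` (default: latest) over the base file."""
        if version is None:
            version = len(self._log)
        rows = dict(self._base)
        for delta in self._log[:version]:
            rows.update(delta.upserts)
            for key in delta.deletes:
                rows.pop(key, None)
        return rows


table = ToyTable({1: "alice", 2: "bob"})
v1 = table.commit(Delta(upserts={3: "carol"}))
v2 = table.commit(Delta(upserts={2: "robert"}, deletes=(1,)))

print(table.snapshot())    # latest state
print(table.snapshot(v1))  # state after the first change only
print(table.snapshot(0))   # the original file, before any changes
```

The replay loop in `snapshot` is the computation the text refers to: the more deltas accumulate, the more work it takes to materialize a given version.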

There are three common open source open table formats: Apache Hudi, Apache Iceberg, and Delta Lake. They all provide similar functionality, each with its specific strengths and weaknesses (and slightly different agendas). We created a full comparison of the three.

All three are now backed by commercial companies. Data practitioners can use Tabular to avoid managing Iceberg on their own, Onehouse for Hudi, and of course, Databricks stands behind Delta Lake.

Source: lakeFS

But is that enough? While these formats are great for allowing mutability at the table level, they leave a few problems unsolved. How do we deal with consistency across multiple tables? How do we introduce quality checks into the process of modifying them? And of course, how do we handle unstructured data that cannot be represented as a table at all?

Git-like approach to data

The first data practitioners to require repository-level time travel for data are data scientists, since reproducibility is a critical part of their work. To be honest, it is a critical part of the work of any data practitioner, but since data scientists are scientists, they are educated to expect reproducibility as a basis for good science, and hence they are the first to look for it in their work.

While the vast majority of MLOps tools focus on managing the model itself, we now see a trend around “data-centric ML” that focuses on the data’s evolution within the life cycle of managing ML, rather than the model. The idea is to manage all data sets – structured and unstructured – that are part of our modeling, testing, staging and production, in a way that serves the quality of our results.

This is the Git-like approach to data, which may cover input data, tags and ground truth, intermediate results, and models. When we look at it all as a repository, we can finally manage data throughout the ML life cycle, get better results, and cope with the intrinsic challenges of data.

This approach is useful when dealing with data-centric AI, but also when managing data operations with a lot of data sets, or a lot of data practitioners, dependent on one another.

What we look for here is not only reproducibility, but also good practices that ensure quality and resilience. Leaning on engineering best practices such as CI/CD, we look for CI and CD for data. What does it look like? It allows us to open a branch of the data and develop or test in isolation, and it allows us to automate quality checks with pre-commit and pre-merge hooks.

Source: lakeFS
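As a rough sketch of the branch-and-hook idea, the toy repository below lets us work on an isolated branch and only promotes it to main when every pre-merge check passes. It is an in-memory illustration, not the lakeFS API: the names `ToyRepo`, `branch`, `write`, `merge`, and `no_negative_counts` are hypothetical, only pre-merge hooks are modeled, and merging simply replaces the target with the source.

```python
from typing import Callable

Snapshot = dict[str, list[int]]    # data set name -> rows (toy data)
Hook = Callable[[Snapshot], bool]  # returns True when the data passes the check


def _copy(snapshot: Snapshot) -> Snapshot:
    return {name: list(rows) for name, rows in snapshot.items()}


class ToyRepo:
    def __init__(self, initial: Snapshot) -> None:
        self.branches: dict[str, Snapshot] = {"main": _copy(initial)}
        self.pre_merge_hooks: list[Hook] = []

    def branch(self, name: str, source: str = "main") -> None:
        """Create an isolated copy of `source` to develop or test on."""
        self.branches[name] = _copy(self.branches[source])

    def write(self, branch: str, dataset: str, rows: list[int]) -> None:
        self.branches[branch][dataset] = list(rows)

    def merge(self, source: str, target: str = "main") -> bool:
        """Promote `source` into `target` only if every pre-merge hook passes."""
        candidate = self.branches[source]
        if not all(hook(candidate) for hook in self.pre_merge_hooks):
            return False
        self.branches[target] = _copy(candidate)
        return True


def no_negative_counts(snapshot: Snapshot) -> bool:
    """Example quality check: counts must never be negative."""
    return all(value >= 0 for rows in snapshot.values() for value in rows)


repo = ToyRepo({"installs": [120, 180, 200]})
repo.pre_merge_hooks.append(no_negative_counts)

repo.branch("fix-july")
repo.write("fix-july", "installs", [120, 180, -1])
rejected = repo.merge("fix-july")  # hook fails, so main is left untouched
main_after_reject = list(repo.branches["main"]["installs"])

repo.write("fix-july", "installs", [120, 180, 199])
accepted = repo.merge("fix-july")  # hook passes, so main now has the fix

print(rejected, main_after_reject, accepted, repo.branches["main"]["installs"])
```

Consumers reading main never see the bad intermediate state, because the failed check stops it at the branch.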

Wrap up

Data discovery and transiency are some of the essential layers one can bring up when talking about the world of data. The two are linked because metadata is transient as well. To support good metadata, we also need to support transiency. 

As a data practitioner, I’d aim to have the ability to support transiency and have Git-like operations over data. Transiency tools need excellent integration with the discovery platforms, so that we have both the current metadata and the metadata as it was at any point in time when it was correct. Only the two layers together, transient data and transient metadata, can provide that.

This post is based on a keynote session I delivered at Hayadata summit – the biggest community-led summit for data professionals in Israel. If you’d like to watch it – here is the link.
