Einat Orr, PhD.
February 28, 2022
As data practitioners, we use many different terms to talk about what we do – we call it business intelligence, analytics, data pipelines, or insights. But there’s one term that captures what we do really well: delivering products.

When I was leading a 200-person engineering team at SimilarWeb, I couldn’t help but notice the gap between the best practices used by teams delivering software applications and those used by teams delivering data-intensive applications.

It seemed to me that data products were struggling with quality and a high cost of error, while the software folks were reaping the benefits of engineering best practices such as agile development methodology and ALM tooling.

Viewed from this perspective, it became clear that those working on data products had much to gain from adopting engineering best practices and application lifecycle management (ALM) principles. Doing this required both a mindset change and a tooling change.

This is the story of how I have been working to make this possible.


Why Application Lifecycle Management?

Figure: software development lifecycle circle chart. Image from synotive.com

Application lifecycle management transformed the world of software applications, shifting the focus from the hardship of delivery to how well the product answers user needs. This shift was possible when a set of tools came together for developers to go through the entire process of coding, testing, deploying, and maintaining production deployments with high observability, quality, and extremely low cost of error. 

In the past several years, the data world has become more aware of this need. The community’s conversations about data-intensive applications, data products, and observability into data pipelines, along with terms such as “data downtime” and data mesh, all draw their power from the analogy to software engineering best practices.

All of these concepts find a lot of backing from an emerging set of tools. Here’s a selection of open-source projects that support this way of understanding the work of data scientists and engineers:

  • Analyst workflows – these tools bring software engineering best practices into the work of data analysts. Example: dbt.
  • Orchestration tools – these tools handle the complex process of running hundreds, or even thousands, of distributed systems and organizing them, helping to manage our pipelines. Examples: Airflow, Dagster, Prefect.
  • Observability tools – this is a growing category where we can find testing tools that allow us to verify data quality. Examples: Deequ, Griffin, Great Expectations.

All of those tools are stepping stones in our journey towards ALM for data products, but something fundamental is still missing – what manages the data itself?

Case in point, it is difficult to come up with simple solutions to some important questions. Questions like:

  • How can I easily create a development environment where I can test new pipelines?
  • How can I easily run my data pipeline parallel to production in a way that allows me to stage it and see that it brings quality results over time? 
  • How do I make sure that data is introduced to my consumers only if it’s of high quality?
  • In case of a quality issue in production, how can I ensure that it’s possible to revert data to the last consistent, high-quality snapshot?

When we encounter data quality problems, making all of these things work turns out to be pretty difficult.

Why not manage data like code?

I realized that the solution to these problems, and much of what makes ALM work for software, comes down in large part to the Git source control model for code.

This inspired a logical next question: What if you could manage data just like you manage code?

As a thought experiment, let’s see how this could look.

Desire | Solution
Quality issue with production data | $ git revert main^1
Working in isolation | $ git branch create my-branch
Reproducing results | $ git commit XXXXXXX
Ensuring changes are safe and atomic | $ git merge my-change main # CI hooks now run
Experimenting with large, potentially destructive changes | $ git branch create experiment-spark-3
 | $ ./run_crazy_job_that_might_delete_everything.sh
 | $ git branch reset experiment-spark-3

Let’s explain each of these desired behaviors in a bit more detail.

Quality issue in production data – If we encounter a quality issue in production, we’d want to be able to simply revert the data to the last commit.

Working in isolation – We might want to work in isolation, so we should be able to branch our data repository and get an isolated environment for our data. We can either develop on it or use it as a staging environment. 

Reproducing results – To reproduce results, we need to return to a commit of data in the repository. It’s a consistent snapshot of our repository from the past, so it has all of our results.

Ensuring that changes are safe and atomic – To make sure that all the changes are safe and atomic, we can use the concept of merging. For example, we would carry out some work in a branch, test it, and – once we ensure that it’s high-quality – merge it into our main branch and expose it to our users.

Experimenting with large and potentially destructive changes – If we want to do some pretty drastic things like chaos data engineering or testing something really big, we can create a new branch, delete stuff, try a new version of Spark, and so on. Once we’re done, we can discard the branch, confident that nothing happened to our production data.
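
To make these behaviors concrete, here is a minimal in-memory sketch in Python. It is a toy model under stated assumptions, not how Git or any real tool is implemented: the `ToyDataRepo` class, its method names, and the file path are all hypothetical, invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataRepo:
    """In-memory stand-in for a versioned data repository (illustrative only)."""

    # Each branch maps to a list of commits. A commit is a snapshot of
    # {path: content}; the last commit in the list is the branch head.
    branches: dict = field(default_factory=lambda: {"main": [{}]})

    def head(self, branch):
        return self.branches[branch][-1]

    def create_branch(self, name, source="main"):
        # The new branch starts from the source history. Snapshots are shared,
        # which is safe because commit() never mutates an existing snapshot.
        self.branches[name] = list(self.branches[source])

    def commit(self, branch, changes):
        # Build a new snapshot; a value of None marks a deletion.
        snapshot = dict(self.head(branch))
        for path, content in changes.items():
            if content is None:
                snapshot.pop(path, None)
            else:
                snapshot[path] = content
        self.branches[branch].append(snapshot)

    def delete_branch(self, name):
        del self.branches[name]


repo = ToyDataRepo()
repo.commit("main", {"events/2022-02-27.parquet": "v1"})

# Work in isolation: run a destructive experiment on its own branch.
repo.create_branch("experiment-spark-3")
repo.commit("experiment-spark-3", {"events/2022-02-27.parquet": None})
experiment_head = repo.head("experiment-spark-3")

# Discard the experiment; main still holds the original file.
repo.delete_branch("experiment-spark-3")
print(repo.head("main"))
```

The point of the sketch is that every commit is a new snapshot, so deleting a file on the experiment branch never touches what `main` points to.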

Streamlining these behaviors is the key to unlocking ALM best practices for data practitioners.

Let’s now see the effect these behaviors have on common phases of the data lifecycle.

Application Lifecycle Management for data in development, deployment, and production

Development

Experimenting – If we want to start experimenting, all it takes is creating a new branch from main. Ideally, this is a quick, atomic action. Now we have a branch of our repository where we can run ML models, make changes to the code, and ingest new data to the product. We can do all of that in isolation, so the main branch isn’t affected.

Debugging – We can even debug problems there. For example, if we spot an issue in the main branch, we can create a branch and debug it there. Why is that so valuable? Because it gives us the exact snapshot of data at the time of the failure, which is extremely important in data environments. 

Collaboration – And if we want to collaborate, we can easily do that by opening a branch for our team, working together on a static snapshot of the data, so that we all agree on the results and compare them. Once we’re done collaborating, we can either discard the results or use them in our product. 

Deployment

Version control – What would happen in deployment? This is where the concept of the commit allows us to have data versions – a very clear version history from a lineage perspective.

Which versions of the input created which versions of output? Answering this question is easy because they’re both parts of the same commit. We get a simple method for following versions of our repository or datasets and can point consumers to newly deployed data. 

Testing – We can automate testing with pre-merge hooks. If the tests succeed, the data is merged into main. If they fail, we get a consistent snapshot of the data at the time of the failure, which makes debugging easier. This is how we prevent breaking changes from entering our production environment.
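
A pre-merge hook can be pictured as a gate in front of the merge. The following Python sketch is only an illustration under simple assumptions: datasets are plain lists of rows, and `no_null_user_ids` and `merge_if_valid` are hypothetical names, not part of any real tool.

```python
def no_null_user_ids(rows):
    """Hypothetical quality check: every row must carry a user_id."""
    return all(row.get("user_id") is not None for row in rows)


def merge_if_valid(main, branch, checks):
    """Return the new state of main and the names of any failed checks.

    If every check passes, main takes the branch data. If any check fails,
    main is returned unchanged and the branch is left as-is for debugging.
    """
    failed = [check.__name__ for check in checks if not check(branch)]
    if failed:
        return main, failed
    return list(branch), []


main = [{"user_id": 1, "visits": 10}]
good_branch = main + [{"user_id": 2, "visits": 3}]
bad_branch = main + [{"user_id": None, "visits": 7}]

main_after_bad, failures = merge_if_valid(main, bad_branch, [no_null_user_ids])
main_after_good, _ = merge_if_valid(main, good_branch, [no_null_user_ids])
```

Here the bad branch never reaches `main`, while the failing data stays available on the branch for inspection.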

Production

Still, no testing system or CI/CD pipeline is perfect. We need to be prepared for any issues that might arise along the way.

Rollback – If we’re in production and have exposed our users to bad data, we can roll back in one atomic action. We also have our latest commit – a consistent snapshot of all the data sets that were introduced to our users.

Troubleshooting – After rolling back, we can open a branch from the problematic data and troubleshoot the production issue on the side. The users will see a delay in data, not wrong data.

Cross-collection consistency – Another important advantage we would get in a system like that is cross-collection consistency. We can make sure that we expose our consumers to all the data sets they need only after verifying that those data sets are consistent with one another. When the customer sees the data, they get a single source of truth.
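
One way to picture rollback and cross-collection consistency together is a single pointer that all consumers read through. The Python sketch below is a toy model under that assumption; `ToyCatalog`, the dataset names, and the `@day1`/`@day2` labels are hypothetical.

```python
class ToyCatalog:
    """Consumers read every dataset through one pointer to a snapshot."""

    def __init__(self, initial):
        # Each commit is (parent index, snapshot of all datasets).
        self._commits = [(None, dict(initial))]
        self._main = 0

    def publish(self, updates):
        # Build the next snapshot off to the side, then switch the pointer once,
        # so readers see either all of the old datasets or all of the new ones.
        snapshot = {**self._commits[self._main][1], **updates}
        self._commits.append((self._main, snapshot))
        self._main = len(self._commits) - 1

    def rollback(self):
        # Move the pointer back to the parent commit; no data is rewritten.
        parent, _ = self._commits[self._main]
        if parent is None:
            raise ValueError("no earlier commit to roll back to")
        self._main = parent

    def read(self, dataset):
        return self._commits[self._main][1][dataset]


catalog = ToyCatalog({"orders": "orders@day1", "customers": "customers@day1"})
catalog.publish({"orders": "orders@day2", "customers": "customers@day2"})
after_publish = (catalog.read("orders"), catalog.read("customers"))

catalog.rollback()
after_rollback = (catalog.read("orders"), catalog.read("customers"))
```

Because both datasets change in the same commit, a consumer can never observe day-2 orders alongside day-1 customers, and rolling back restores both at once.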

This isn’t a vision of some distant future. Some of the tools that make this possible are already available, and others are joining the party. Sooner rather than later, we will be able to build CI/CD and application lifecycle management for data that reduces our cost of error and delivers high quality.

Let’s now dive into our data lake and see an example of such a tool and how it can be used from a technical perspective. 

lakeFS: Git-like interface for scalable data – an open-source tool


The reference architecture is a data lake based on object storage, such as S3, GCS, Azure Blob Storage, MinIO, VAST Data, etc.

In this scenario, you can use lakeFS and transform the data lake into a Git-like repository to quickly implement parallel pipelines for experimentation, reproducibility, and CI/CD for data.

lakeFS supports AWS S3, Azure Blob Storage, and Google Cloud Storage (GCS) as its underlying storage service. It has an API compatible with S3 and works seamlessly with all modern data frameworks such as Spark, Hive, AWS Athena, Presto, and others.

Once you install it, you can start using all of the operations I described earlier.

Diagram of data model used by lakeFS.

lakeFS lets you manage data lakes like code because it sits on top of the object storage and provides Git-like operations on top of the data via an API. So, you can keep on working with your data tools. 

You’ll have a production branch and working branches that you can keep for the long term, use for the short term, or discard. You can protect these branches and use commits, commit IDs, tags, and merges. When you want to use data, instead of accessing the object storage directly, you access it through lakeFS, specifying a branch or commit ID.
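
In practice, specifying a branch or commit ID mostly means changing the path you read from. With the S3-compatible interface, the repository name sits where the bucket name would be and the branch or commit ID is the first element of the object path. The Python helper below is a small sketch of that convention; the function name `lakefs_object_uri`, the repository `example-repo`, the branch `debug-customer-bug`, and the file path are hypothetical, and the exact URI scheme depends on the client you use.

```python
def lakefs_object_uri(repository, ref, path, scheme="s3a"):
    """Build an object URI of the form <scheme>://<repository>/<ref>/<path>.

    `ref` may be a branch name, a tag, or a commit ID.
    """
    return f"{scheme}://{repository}/{ref}/{path.lstrip('/')}"


# Same file, read from production and from an isolated branch.
prod_uri = lakefs_object_uri("example-repo", "main", "events/day=1/part-0.parquet")
debug_uri = lakefs_object_uri(
    "example-repo", "debug-customer-bug", "events/day=1/part-0.parquet"
)
print(prod_uri)
print(debug_uri)
```

A tool such as Spark would then read from the resulting URI the same way it reads from any S3 path, provided it is configured to talk to the lakeFS endpoint.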

For example, if a customer calls you and says they have a bug, no problem – you can create a branch right now to get the snapshot of the data as it appeared to that customer and quickly troubleshoot the error.

If you want to try a new algorithm that may delete files, you can create a branch for experimentation and develop in isolation. Once you’re done, you can merge this data back to production or delete the branch.

lakeFS manages metadata – every commit is a collection of metadata pointing to the objects in the managed bucket. This makes all of the commit, merge, and diff operations very efficient. 
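
To see why working on metadata is cheap, consider a diff between two commits. If each commit is a mapping from path to object address, the diff is a comparison of two mappings and never has to read the data files themselves. The sketch below assumes that simplified representation; `diff_commits`, the paths, and the `obj-…` addresses are hypothetical and do not reflect the actual lakeFS metadata format.

```python
def diff_commits(old, new):
    """Compare two commits given as {path: object_address} mappings."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return {"added": added, "removed": removed, "changed": changed}


commit_a = {
    "users/part-0.parquet": "obj-111",
    "events/part-0.parquet": "obj-222",
}
commit_b = {
    "users/part-0.parquet": "obj-111",    # unchanged: same address
    "events/part-0.parquet": "obj-333",   # rewritten: new address
    "events/part-1.parquet": "obj-444",   # new file
}

result = diff_commits(commit_a, commit_b)
print(result)
```

The cost of the comparison depends on the number of paths, not on the size of the objects they point to.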

The solution uses copy on write – creating a new branch is a zero-copy operation. Because branches share unchanged objects instead of duplicating them, this can reduce storage usage on your S3 bucket by 20% to 60%.
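
The following Python sketch illustrates the copy-on-write idea with a toy object store. It is a simplified model, not the lakeFS implementation: the three dictionaries, the helper names, and the file names are assumptions made for the example.

```python
import hashlib

objects = {}    # object address -> bytes (stands in for the bucket)
commits = {}    # commit id -> {path: object address}
branches = {}   # branch name -> commit id


def put_object(data):
    # Address objects by content hash, so identical content is stored once.
    address = hashlib.sha256(data).hexdigest()
    objects[address] = data
    return address


def commit(branch, changes):
    """Store the changed files and record a new commit on the branch."""
    parent = commits.get(branches.get(branch), {})
    mapping = {**parent, **{p: put_object(d) for p, d in changes.items()}}
    commit_id = f"c{len(commits)}"
    commits[commit_id] = mapping
    branches[branch] = commit_id


def stored_bytes():
    return sum(len(data) for data in objects.values())


commit("main", {"a.parquet": b"x" * 1000, "b.parquet": b"y" * 1000})
before_branch = stored_bytes()

# Creating a branch only copies a pointer to the current commit.
branches["experiment"] = branches["main"]
after_branch = stored_bytes()

# Writing on the branch stores only the object that actually changed.
commit("experiment", {"b.parquet": b"z" * 1000})
after_write = stored_bytes()

print(before_branch, after_branch, after_write)
```

In this model the branch costs nothing until it diverges, and even then only the rewritten file takes additional space; the untouched file is shared by both branches.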

As for security, the open-source lakeFS is installed inside your data system, so no data leaves your premises. All the data and metadata are saved in your bucket. Moreover, our SaaS product, lakeFS Cloud (now in beta), is SOC 2 compliant.

About lakeFS

The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.

Our mission is to maximize the manageability of open source data analytics solutions that scale.
