In the last 30 years, agile development methodology has played a significant part in the digital transformation the world is undergoing. The methodology rests on the ability to iterate quickly on product features, using the shortest possible feedback loop from ideation to user feedback. This short feedback loop allows us to develop the right software solution for the needs of our users.
The engineering infrastructure that allows this short feedback loop is based on source version control, which enables collaboration and testability. It integrates with automation servers that provide CI/CD, and with testing platforms that help maintain high-quality software throughout the process. Another critical part of version control is the ability to roll back production code when quality issues appear, which dramatically reduces the cost of error and the potential impact on end users.
But what happens when our application is data-intensive? Can we use this methodology and tool stack for our data pipelines? Do we want to apply the same best practices that have proven so effective for application development? And if so, how can we do it?
I believe that we do need to follow these best practices, because doing so gives us the same advantage: moving fast with a low cost of error.
Why is data engineering lagging behind on engineering best practices?
We would all really like to live in a world where we update a data pipeline and move on to our next task, knowing that everything will go well.
In reality, it’s always a fire drill. 🧨🔥🧯
Clearly, it’s not because of the data engineers themselves, but rather due to some deep historical reasons.
Let’s go back in time.
Initially there were relational databases and life was fairly simple because data was small and things were changing rather slowly.
Bigger organizations gathering larger amounts of data brought data warehouses to the world. These were still relational databases, but better, and accommodated business intelligence use cases.
But data kept growing, and the need for a distributed system to handle it became pressing. We had to change the basic technology to cope with the amount of data, and this is how the Hadoop ecosystem was created.
Data engineering became closely coupled with distributed infrastructure, which is something completely different from working with a database.
So, while software development was building tools for CI/CD and quality assurance, data engineering was struggling with the scale of the data, trying to tame the infrastructure just to deliver data.
Introducing high-quality data to the organization became a challenge due to the sheer scale of data and nature of the infrastructure at hand.
It took us some 10 years to really tame the beast called Hadoop and arrive at a scalable, managed Spark that can work in any organization (thank you, Databricks!) or at a distributed database technology (thank you, Snowflake!). Today, the three major cloud vendors give us managed Spark out of the box, alongside Presto and distributed databases such as Snowflake or BigQuery.
Now that we’ve reached this stability, we’re looking into the future, where we want an infrastructure for our data architectures that will allow data mutability, scale, version control, testability, and monitoring. This allows us, the data engineers, to finally get to the same point our fellow software engineers have reached: short feedback loops, high quality, and high development velocity with a low cost of error.
Look around at where software engineers are effectively iterating on products, and then create the same ecosystem of processes and tools. That ecosystem creates a short feedback loop, helping us get the data out there and, ultimately, improve our data products.
How to achieve engineering best practices in data engineering
Start by treating your data operations as a product
Step 1 is a shift in mindset. Instead of calling data operations “data operations,” or talking about Business Intelligence, dashboards, or data pipelines, we should call it what it really is: a data product.
If we treat our data operations as a product, then the best practices for product development will naturally unfold. There are also some implications for roles and organizational structure that we will not discuss here, but those complement the approach of continuously delivering high-quality data products from development to production using engineering best practices.
The building blocks of a data product
What does a data product look like today? Here are the building blocks of a data product:
First, there’s the data itself, which we ingest from the relevant data sources into our data lake. Then we have the code that we run to analyze that data. This is the code that makes up our ETLs. We keep this code in a source control system such as Git.
In addition, we have the infrastructure that supports running the code over the data. What does the infrastructure include?
- Storage – A storage layer to store the data, usually an object store such as S3, GCS, Azure Blob, or MinIO.
- Compute – A distributed compute engine that allows running the logic of our code on scalable data sets, such as Spark or Presto.
- Orchestration – An orchestration tool that allows us to orchestrate workflows / pipelines in a way that is data aware. A pipeline allows us to create a derivative of the data, and then using that derivative – maybe together with another data set – create another derivative, and so on. It is usually represented using a DAG.
This infrastructure allows us to define our data pipelines and run them efficiently over cost-effective storage.
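To make the DAG idea concrete, here is a minimal sketch of how an orchestrator could derive a valid execution order from data dependencies. It uses only Python's standard library; the dataset names are hypothetical and chosen purely for illustration, and a real orchestration tool does far more than this.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical datasets. Each key is a derived dataset; its value is the
# set of datasets it is built from.
dependencies = {
    "raw_events": set(),
    "raw_users": set(),
    "clean_events": {"raw_events"},
    "events_by_user": {"clean_events", "raw_users"},
    "daily_report": {"events_by_user"},
}


def execution_order(deps):
    """Return one valid build order: every dataset comes after its inputs."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Each derivative is built only after the data sets it depends on, which is exactly the property that makes the pipeline "data aware."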
Apply engineering best practices for data pipelines
So how do we implement the engineering best practices for such a product? What we aim to do here is manage the data products from development to production the same way that we manage software products.
Here are two best practices that help you do just that.
1. Best practice #1: Develop in isolation
When you create new functionality, you naturally want to test it and release it to users only if it’s of high quality. Nothing helps you more here than an isolated dev environment, on which you can play around and test things. As mentioned above, the data development environment is composed of 3 major elements:
- The code
- The infrastructure
- The data itself.
How can we create a development environment that will allow us to work safely and productively on all 3 elements in isolation?
The code – The first step is to isolate the code. We have the code of the data analysis itself, which runs within the compute engine, for example Spark. We also have the code that runs within the orchestration tool and orchestrates the jobs we run on the compute engine. Both kinds of code can be isolated with Git, simply by branching.
The infrastructure – We know how to manage infrastructure as code, using K8S and Terraform. This means we can create an isolated version of our infrastructure. In our case, this automation is created by the DevOps team, spinning up an isolated instance of the orchestration tool and the compute engine.
The data – You can use a data versioning tool to create a branch for your data as well (open-source examples include DVC, Git LFS, Dolt, and lakeFS). That branch is your golden ticket to working in isolation.
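As a rough illustration of what a data branch buys you, the sketch below uses a toy in-memory repository. `ToyDataRepo`, its methods, and the branch and table names are all hypothetical; this is not the API of DVC, Git LFS, Dolt, or lakeFS, and it copies data on branching only to keep the example small.

```python
import copy


class ToyDataRepo:
    """In-memory stand-in for a data versioning tool (hypothetical, not a real API)."""

    def __init__(self):
        # Each branch maps a table name to a list of records.
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # Assumption: a deep copy is good enough for a toy example.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def write(self, branch, table, records):
        self.branches[branch][table] = list(records)

    def read(self, branch, table):
        return self.branches[branch].get(table, [])


repo = ToyDataRepo()
repo.write("main", "orders", [{"id": 1, "amount": 20.0}])

# Branch out and experiment without touching production data.
repo.create_branch("dev-new-etl")
repo.write(
    "dev-new-etl",
    "orders",
    [{"id": 1, "amount": 20.0}, {"id": 2, "amount": -5.0}],
)

main_orders = repo.read("main", "orders")
dev_orders = repo.read("dev-new-etl", "orders")
print(len(main_orders), len(dev_orders))
```

Whatever happens on the development branch, readers of `main` keep seeing the same data.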
Last but not least, you need a data quality testing framework. You can develop it on your own or use one of the commercial or open-source tools out there like great_expectations or pyvaru. This lets you test your branch thoroughly before merging it into your main branch.
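If you go the "develop it on your own" route, a quality test can start as simply as the sketch below. The check functions, column names, and thresholds are hypothetical and only illustrate the idea; dedicated frameworks offer much richer checks and reporting.

```python
def check_not_null(records, column):
    """Return the rows where `column` is missing or None."""
    return [r for r in records if r.get(column) is None]


def check_in_range(records, column, low, high):
    """Return the rows whose `column` value falls outside [low, high]."""
    return [
        r
        for r in records
        if r.get(column) is not None and not (low <= r[column] <= high)
    ]


def run_quality_checks(records):
    """Run all checks and return {check name: failing rows} for failed checks only."""
    results = {
        "id_not_null": check_not_null(records, "id"),
        "amount_in_range": check_in_range(records, "amount", 0, 10_000),
    }
    return {name: rows for name, rows in results.items() if rows}


# Hypothetical records read from a development branch.
branch_records = [
    {"id": 1, "amount": 20.0},
    {"id": None, "amount": 15.0},
    {"id": 3, "amount": -5.0},
]

failures = run_quality_checks(branch_records)
print(sorted(failures))
```

An empty result means the branch is safe to merge; anything else tells you which rows to look at first.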
2. Best practice #2: Build CD for data products
What is continuous deployment (CD) for data? Well, we are in production, so the code is production code. It didn’t change, so you don’t need to branch out from it. The production orchestration code is still the same, and you’re not interested in changing the pipeline orchestration.
You’re still using the production orchestration and production compute engine because you’re near or in production.
Nevertheless, you still need an isolated branch of the data.
Because although everything stays the same, the data is new. Maybe the characteristics of your data changed because the schema changed. Maybe the change was introduced by the people who sent us the data. Or maybe something went wrong in the way they collected it.
All of that might make the model running in production inaccurate. Remember, data constantly changes.
This motivates data engineers to check the results produced by the production code, orchestration, and compute before showing them to users. It also gives you an opportunity to find out whether there’s anything strange in the results before exposing them.
The way out? Continuous deployment of new data!
The way to do that is simply by ingesting new data into a data branch using a data version control tool. This enables you to run all your production systems on the branch rather than in production.
When the run finishes, you can test the results to make sure everything is as expected, and if the quality tests pass, the branch can be merged. However, if you spot an issue, you can start debugging; you have the exact snapshot of the data at the time of failure on your branch to help you.
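The flow described here (ingest to a branch, run, test, then merge or debug) can be sketched in a few lines. Everything below is a simplified assumption for illustration: branches are plain lists in a dictionary, and `deploy_new_data`, `total_per_user`, and the check name are hypothetical, not part of any real tool.

```python
def deploy_new_data(branches, new_records, pipeline, checks, branch_name="ingest"):
    """Ingest into a branch, run the pipeline there, merge only if all checks pass.

    Returns (merged, names_of_failed_checks).
    """
    # 1. Branch from production and ingest the new data in isolation.
    branches[branch_name] = list(branches["main"]) + list(new_records)
    # 2. Run the unchanged production logic on the branch.
    results = pipeline(branches[branch_name])
    # 3. Test the results before anyone sees them.
    failed = [name for name, check in checks.items() if not check(results)]
    if failed:
        # Keep the branch: it is the exact snapshot needed for debugging.
        return False, failed
    # 4. All checks passed: merge the branch into production.
    branches["main"] = branches.pop(branch_name)
    return True, []


def total_per_user(records):
    """Toy production pipeline: sum amounts per user."""
    totals = {}
    for r in records:
        totals[r["user"]] = totals.get(r["user"], 0) + r["amount"]
    return totals


checks = {"no_negative_totals": lambda totals: all(v >= 0 for v in totals.values())}
branches = {"main": [{"user": "a", "amount": 10}]}

good = deploy_new_data(branches, [{"user": "b", "amount": 5}], total_per_user, checks)
bad = deploy_new_data(branches, [{"user": "c", "amount": -50}], total_per_user, checks)
print(good, bad)
```

In the failing case production stays untouched, and the `ingest` branch still holds the problematic records for investigation.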
In this article, we explored some of the challenges in applying engineering best practices to data lakes and then looked at two best practices: developing in isolation and building CD for data products. By implementing these two practices, inspired directly by what software engineers are doing so well right now, you can work on improving your product instead of endlessly tinkering with the plumbing or putting out fires.