If you work with a smaller dataset or do one-off jobs, the way you manage backfills isn’t that crucial. But what if you face constantly growing datasets with billions to trillions of records? Your backfilling data strategy will have a much bigger impact.
When dealing with modern data pipelines on such a scale, it’s key that data teams try to make the most of their limited resources – from cost and time to processing and storage. Luckily, there’s one approach to data backfilling that enables this efficiency and helps you keep costs at bay as you work with massive datasets.
Data version control helps teams manage modern, constantly-changing data. You can build a solid backfilling strategy by implementing the lakeFS open-source project.
Table of contents
- What is data backfilling?
- 4 examples of data backfilling
- Common challenges of backfilling historical data
- Best practices to manage data backfills today
- How can you solve backfilling challenges with lakeFS
- Backfilling data with lakeFS
- Wrap up
What is data backfilling?
Data backfilling is the process of retroactively processing historical data or replacing old records with new ones as part of an update.
Most of the time, data practitioners do data backfilling when they experience an anomaly or quality incident resulting from bad data entering their systems (like data warehouses).
As you can tell, data backfilling is a time-consuming and tedious process. The only thing worse is having to do it twice due to a mistake.
Many data teams create standard operating procedures (SOPs) for data backfilling to avoid dealing with the mess.
4 examples of data backfilling
1. Missing data
Sometimes, a part of the data may be missing when the first calculation is done. Check out the column below for NULL values – this data arrived later than others. After a few days, we can fill in the column and have proper values there (on the right). One glance at the two columns below shows how the data has changed from the version we had previously.
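A minimal sketch of this kind of backfill, using plain Python dicts (the table, column, and id names are invented for illustration): records are first loaded with NULLs, and the late-arriving values are filled in afterwards.

```python
# Backfill late-arriving values into records that were first loaded with NULLs.

def backfill_missing(rows, late_values, column):
    """Fill `column` in `rows` (keyed by 'id') from `late_values` where it is None."""
    for row in rows:
        if row[column] is None and row["id"] in late_values:
            row[column] = late_values[row["id"]]
    return rows

rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},  # value had not arrived at first load
    {"id": 3, "amount": None},
]
late = {2: 7.5, 3: 12.0}        # the late-arriving data

backfill_missing(rows, late, "amount")
```

After the backfill, the previously NULL rows carry real values while the rest of the table is untouched, which is exactly the before/after difference described above.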
2. Fixing a mistake in data
Another example is fixing an error in data. The tables below contain values, but they’re wrong because of a human error, a wrong calculation, or flawed collection logic.
Once you fix the problem, the values that were incorrect will also be fixed. Check out the tables below: some values stayed the same (the correct ones), while others were replaced (because they were incorrect).
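In code, this kind of backfill often amounts to applying a set of known corrections while leaving the correct rows untouched. A toy sketch (record ids and values are made up):

```python
# Replace known-bad values; rows not listed in `corrections` stay as they are.

def apply_corrections(rows, corrections, column):
    for row in rows:
        if row["id"] in corrections:
            row[column] = corrections[row["id"]]
    return rows

prices = [
    {"id": "a", "price": 19.99},
    {"id": "b", "price": 1999.0},  # decimal point slipped during entry
]
apply_corrections(prices, {"b": 19.99}, "price")
```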
3. Data from calculations
Another example of changing data is when it doesn’t come from collection, but calculation. For example, you can have a model that does price estimation for you and fills tables with data based on the calculation.
But what if at some point, you discover that you could be using a better logic to accomplish that? You can implement it on historical data to get better estimations of the past. If you do that, the whole dataset will change.
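This is the case where the whole derived dataset changes: every historical row is recomputed with the new logic. A sketch with two invented pricing functions (the formulas are placeholders, not a real model):

```python
# Re-run an improved estimation function over historical rows; every derived
# value is overwritten, so the entire dataset changes.

def estimate_v1(sqm):
    return sqm * 100           # old: flat rate per square meter

def estimate_v2(sqm):
    return sqm * 100 + 5000    # improved: adds a base price

history = [{"listing": i, "sqm": s} for i, s in enumerate([50, 80, 120])]

# Backfill: recompute the derived column for every historical row.
for row in history:
    row["estimate"] = estimate_v2(row["sqm"])
```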
4. Work with unstructured data
Teams spend a lot of time working with semi-structured and unstructured data. Imagine that you have a training set for a computer vision algorithm and are using images that serve as a good sample of the world for which you want to develop your ML model.
You choose to substitute some of those photographs with better ones that depict the same locations. As a result, some photos remain unchanged while others are replaced.
Another way is to approach the problem from a different angle, such as by examining the distinct properties of your photographs. All of these changes in the data make even the most basic tasks difficult.
Common challenges of backfilling historical data
You need a data backfilling strategy for any data pipeline that is growing dynamically. In such pipelines, new data is ingested every day, posing some serious challenges to team members around collaboration, consistency, and reproducibility.
Collaborating on datasets
If you modify historical data, the team needs to know what the source of truth is. To work concurrently on a rapidly changing dataset, team members need a version (snapshot) of the data that is isolated for their use. Collaboration requires this kind of isolation: it lets people experiment smoothly, gives the team a shared language for discussing data versions, and provides a way to merge changes from an isolated version back to the team.
Lineage
Lineage is the metadata that reveals the interdependence of our datasets – for instance, which datasets were used as inputs to create a new one.
Obtaining lineage has never been easy, and it’s even more difficult since each data set has several versions over time, which must be included in the lineage metadata (or it won’t be usable). This, as well as quick access to the data, is necessary for audits in many sectors.
Reproducibility
Data practitioners expect to receive the same result if they run the same code over the same data; in other words, they want their results to be reproducible. To achieve this, you need to know which versions of the input dataset and code were used.
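One common way to make lineage and reproducibility concrete is to record, for every run, which dataset versions and which code revision produced the result. A minimal sketch (the dataset names, version ids, and commit id are all hypothetical):

```python
import hashlib
import json

# A stable fingerprint for (input versions, code revision): identical inputs
# and code always produce the identical id, regardless of dict key order.

def run_fingerprint(input_versions, code_revision):
    payload = json.dumps(
        {"inputs": input_versions, "code": code_revision}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

a = run_fingerprint({"raw_events": "v42", "dim_users": "v7"}, "commit-abc123")
b = run_fingerprint({"dim_users": "v7", "raw_events": "v42"}, "commit-abc123")
assert a == b  # same inputs and code, so the run is reproducible
```

Storing such a manifest alongside each output dataset gives you both the lineage (which versions fed in) and a way to verify that a rerun used exactly the same inputs.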
Multi-table transactions
No dataset is an island. If you update a table and produce a new version of it, you should update all datasets that rely on it to reflect the new version. Otherwise, the data lake will be inconsistent: one dataset has changed while its derivatives haven’t.
What you need here are tools that let you change all mutually dependent datasets in a single atomic action while avoiding inconsistencies.
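The idea behind such an atomic action can be sketched in a few lines: stage new versions of every dependent dataset, then publish them with a single swap so readers never see a mix of old and new. This is a toy in-memory illustration with invented dataset names, not a real transaction manager:

```python
# Readers look up dataset versions through one mapping; publishing replaces
# the whole mapping at once instead of one table at a time.

datasets = {"orders": "orders_v1", "daily_revenue": "daily_revenue_v1"}

def publish_atomically(current, staged):
    """Build the new mapping off to the side, then swap it in as one unit."""
    updated = dict(current)
    updated.update(staged)
    return updated  # a single reference swap: all tables move together

staged = {"orders": "orders_v2", "daily_revenue": "daily_revenue_v2"}
datasets = publish_atomically(datasets, staged)
```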
Best practices to manage data backfills today
How can data practitioners address all these challenges and find a good strategy for backfilling data?
Here’s some good news: we already have a working solution for dealing with datasets that continuously change. It’s known as version control.
Version control was designed for code, not data. If you tried to use Git over data right now, you’d likely fail due to performance issues.
However, the idea of versioning data the same way software engineers version their code is enticing. It would be a massive help in managing modern, constantly changing data.
Curious to see what data version control looks like in practice? Jump to the next section, where we show how to solve data backfilling challenges with the open-source project lakeFS.
How can you solve backfilling challenges with lakeFS
Consider a data operation with several data sources saving data to object storage (S3, MinIO, Azure Blob, GCS, and so on), with ETLs operating on distributed computing platforms such as Apache Spark or Presto.
Those ETLs could be made up of tens, hundreds, or thousands of tiny jobs organized by orchestration DAGs. The data is consumed by a variety of consumers, including ML developers, BI analysts, and the next person to write an ETL over the data for some new use case.
Most of us are already executing such data operations or will be doing so shortly.
When managing backfills in such an architecture, a backfill should trigger the jobs that update all derivatives of the backfilled data, so that the entire data lake ends up updated and fully consistent.
When handling data backfilling requirements for such cases, you’re probably looking at the following:
- Perform the backfill in isolation and test the quality of the update, so you avoid exposing consumers to low-quality data.
- In isolation, run the jobs to update all data that depends on the backfilled data set to ensure consistency between all data sets in your data lake.
- Expose all changes to the data lake in one atomic action to ensure consistency and concurrency control.
To achieve that, the open-source project lakeFS is a good fit.
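The three steps above can be sketched with a toy in-memory model of branch-isolate-merge semantics. To be clear, this is only an illustration of the concept, not the lakeFS API; the class, branch names, and data are all invented:

```python
import copy

# Toy model of: branch off production -> backfill in isolation -> merge back
# in one atomic action.

class Lake:
    def __init__(self, data):
        self.branches = {"main": data}

    def branch(self, name, source="main"):
        # An isolated snapshot of the source branch.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def merge(self, source, dest="main"):
        # One atomic pointer swap: consumers of `dest` see all changes at once.
        self.branches[dest] = self.branches[source]

lake = Lake({"events": [1, None, 3]})
lake.branch("backfill-2024")                       # step 1: isolate
lake.branches["backfill-2024"]["events"][1] = 2    # step 2: backfill + test here

assert lake.branches["main"]["events"][1] is None  # production still untouched
lake.merge("backfill-2024")                        # step 3: promote atomically
```

In a real deployment the branch holds the full data lake (raw data plus all derivatives), so running the dependent jobs on the branch before merging is what keeps every dataset consistent.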
Backfilling data with lakeFS
lakeFS is a version control system for data lakes that can help with data backfill in a few ways:
Data versioning:
lakeFS allows you to version your data, which means you can easily roll back to previous versions of data if something goes wrong during backfilling. This ensures that you don’t lose any data during the process and can easily recover from any issues that may arise.
Branching:
With lakeFS, you can create branches of your data lake, which allows you to experiment with different backfilling strategies without affecting the main branch. This can be especially useful when dealing with large datasets or complex data pipelines, where you may need to try out different approaches to find the most efficient way to backfill.
Atomic commits & merges:
Using lakeFS, data is typically ingested & transformed on separate branches. Then, once you want to promote the data to production, you atomically merge the changes into the production branch. This helps to ensure data consistency during the backfilling process and prevents data corruption or loss.
Collaboration:
lakeFS allows for collaboration among teams, which can be helpful when coordinating a backfilling effort. Different teams can work on different parts of the backfilling process, and changes can be easily tracked and managed through lakeFS’s version control system.
A typical environment can look like this:
Using lakeFS, if you look at the production branch, you can see all the historical commits across branches that led to this data state:
Next, there are two strategies you can take:
- Rollback commits with data that needs to be backfilled and reprocessed.
- Branch out of the last “good” commit in the raw data, reprocess, and rebase it.
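The two strategies can be sketched over a toy linear commit history (commit ids and values are invented for illustration; this is not the lakeFS API):

```python
# Toy commit history; each entry is (commit_id, snapshot).
history = [("c1", {"x": 1}), ("c2-bad", {"x": 999}), ("c3", {"x": 999})]

def truncate_after(history, commit_id):
    """Keep commits up to and including `commit_id`."""
    idx = next(i for i, (cid, _) in enumerate(history) if cid == commit_id)
    return history[: idx + 1]

# Strategy 1: roll back to the last good commit, then reprocess on main.
main = truncate_after(history, "c1")

# Strategy 2: branch from the last good commit, reprocess on the branch,
# then bring the fixed commits back.
branch = truncate_after(history, "c1") + [("c2-fixed", {"x": 2})]
```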
Wrap up
Every data practitioner who deals with large datasets needs to have a good backfilling method.
You’ll have an easier time managing data backfills if you use a data version control platform that supports the scale, complexity, and constantly changing nature of data today. Data versioning opens the door to transforming a chaotic environment into one that is controllable, where you know where the data comes from, what has changed, and why.
Want to try this on your own? Take lakeFS for a spin by running it locally.