9 Data Version Control Best Practices: Ensuring Accurate and Reliable Data Quality
Table of Contents
What do Word documents, codebases, and software artifacts we find all around us have in common? They all rely on some form of version control. Versioning creates an essential record of incremental modifications made and logs when these changes occur.
Data practitioners can now rely on this capability as well, thanks to data version control tools. This ability allows them to easily travel in time and switch between versions, experiment in isolation, and troubleshoot issues quickly.
But how do you maximize the impact of data versioning on your data pipeline’s entire lifecycle? Here’s a selection of best practices for data version control to help teams ensure high data quality and validity at every step of the way.
Understanding data version control and its benefits
Data version control is based on the version management method used for application source code.
Data not only accumulates but also evolves over time. Data may change due to historical updates, filling in missing data, or correcting errors in data we already have. In addition, when different users experiment on the same datasets, it easily leads to many versions of the same dataset, which is far from a unified source of truth.
Furthermore, in machine learning environments, teams may have many versions of the same model trained on different versions of the same dataset. If these machine learning models and data are not adequately validated and versioned, the outcome might be a tangled web of datasets and experiments that will not be reproducible.
Data version control systems assist teams in tracking datasets by recording changes to a dataset repository. Version control comes with two major advantages:
- Visibility into the evolution of the project’s data over time, including what was added, updated, and removed.
- Version control systems offer notions such as branching, committing, and reverting that allow us to manage the full lifecycle of the data pipelines from development to production safely, using concepts such as an isolated dev/test environment and CI/CD for the data.
Why is it necessary to version data?
This is a common problem for data producers and consumers of data warehouses and data lakes: the world is always changing, and so is data.
Reproducibility, auditing, and sheer manageability become difficult when the data changes over time. The cost of error for data practitioners is very high due to a lack of reproducibility, and the velocity of delivering new insights such as analytics or updated models is impaired.
This is the issue that data versioning addresses. Beyond typical data versioning methods, advanced data versioning aids in the establishment of secure data ingest and analysis procedures.
For example, in machine learning experiments, data scientists can use version control to test different model versions to increase efficiency and easily make changes to the dataset. Using this type of versioning, teams may simply capture the versions of their data and model files in Git commits, providing a method to move between these unique data contents.
Team members may now browse a common history of data, code, and models. This enables you to use alternate storage systems for your data and models in any cloud or on-premise solution while maintaining project consistency using logical file names.
Data versioning also aids in data governance and compliance by allowing teams to examine data changes using audit tools.
Different types of data version control systems
There are multiple approaches to data versioning, each with unique advantages and drawbacks.
Let’s dive into each approach!
Versioning approach 1: Full data duplication
Do you have a dataset that you want to examine and see how it changes over time? One way to do that is to save a full duplicate of it in a different location each time you want to change something and create a new version.
Naturally, this method is only viable for smaller datasets with a reasonable versioning frequency. Just imagine how resource-consuming ML model duplication would be!
This method generates versioned data, but it does so in the least storage-efficient manner. A section that hasn’t changed will be duplicated across all versions, no matter what.
Versioning approach 2: Using metadata for data relevance in tables
A more storage-efficient and gradual versioning method involves adding and maintaining two metadata columns, designated valid_from and valid_to, in a tabular dataset.
The idea is that you don’t ever replace an existing record while changing a
record in this dataset. Instead, you add new records and update the valid_to field to the current timestamp for any overwritten entries.
This method allows you to time travel, but you can only interact with the versions by adding filters to queries on the metadata columns. When tables grow, this becomes a performance limitation for accessing versions.
Needless to say, this method is irrelevant for data that is not saved in a table, such as semistructured and unstructured data.
Versioning approach 3: Full data version control
This method considers versioning a first-class citizen of the data environment. It covers all the data you enter into the system.
To make this happen, data version control tools need to overcome several challenges:
- Reduce the data storage footprint by avoiding making duplicates of data items that are unaltered across versions.
- Enable operations that allow you to deal with the versions directly – like creating a version, deleting a version, and comparing versions.
- Works well no matter the data scale, format, or where you store it (on-premises vs. cloud solutions like Google Cloud Storage).
Luckily, there are solutions on the market that address all three data version control challenges.
9 data version control best practices
1. Pick the right data versioning tool
Not all data version control solutions are designed with the same user in mind.
Here are a few factors to consider before selecting a data versioning solution:
- Use case – Some systems are built to serve a single persona, such as a data scientist, researcher, or analyst, while others provide an organizational foundation that may be used by any data practitioners in the business.
- Where the data is kept – Can the data remain where you manage it, or must it be lifted and shifted to the data version control system?
- Ease of use – how simple is it to incorporate this product into your workflow? Is there an easy-to-use data-versioning interface, as well as decent documentation and tutorials?
- Supported data types – Can it handle different sorts of data and withstand changes in data structures? Is it capable of versioning structured and unstructured data as well as handling a range of file types such as CSV, JSON, and binary files?
- Integration – how well does it interact with your existing stack? Can you easily connect to your infrastructure, platform, or model training workflow? Is it compatible with popular frameworks like TensorFlow, PyTorch, and Scikit-learn?
- Scalability – can the solution be scaled? Can it handle your project’s increasing data load while retaining top performance?
- Collaboration – Does the solution support collaboration, allowing several users to work on the same project at the same time? Is role-based access control, version history, and the ability to add metadata to multiple versions included?
- Open source vs. closed source – A developer community (which may include a commercial enterprise) creates and maintains open-source data versioning tools that are free to use. Since users may update and enhance the codebase to fit their own needs, these solutions are more versatile and configurable than proprietary software. If you want the system to better fit your needs, you can contribute code to the community.
2. Define your data repository
Once you have a data versioning system in place, you need to define your data repository.
The concept of a data repository is similar to a Git repository or a cloud storage solution like an Amazon S3 Bucket.
A well-defined repository will include data sets that are used together for some logical purpose of analysis and hence need to be consistent over time – for example, all data sets related to your sales funnel optimization or to the ML model used to predict your customer churn.
To give you an implementation example: in the open-source version control solution lakeFS, a repository is a logical namespace used to bring together data, branches, and commits.
Before setting out to version your data, you need to clarify which data will be part of your repository.
3. Commit changes to allow time travel on your data repository
In a data version control system that is similar to distributed version control systems like Git, commits are immutable “checkpoints” that include a complete snapshot of a repository at a specific moment in time. Sounds familiar? That’s because it’s exactly like Git commits.
An example commit includes metadata such as who applied the change, a timestamp, a commit message, and arbitrary key/value pairs that you can add as additional metadata.
This allows you to examine your data lake at a specific moment in its history using commits, and you can be positive that the data you see is precisely as it was at the time of committing it.
Since commits are logical, one should use them to create the checkpoints that would serve them best for their use case. For example, commit after each experiment or after each data pipeline step is completed.
Some data versioning solutions allow multiple users to access various branches (or even commits) on the same repository at the same time. For instance, in lakeFS all the live branches and commits are instantly available to all users, unless otherwise specified in RBAC.
4. Branch to achieve a dev/test environment for data
In data version control, branches are conceptually comparable to Git branches, but only in data lakes or warehouses. When a user creates a new branch, it works like a consistent snapshot of the whole repository that is independent of any previous branch (previous version) and its modifications.
This is highly useful for creating an isolated version of the data for the development of new or changed ETLs or for experimentation during the modeling phase in ML. Branches are also common in creating an isolated test environment for new data ingested to ensure it is of high quality or testing changes to ETLs, pipelines, or models.
Another way to view a branch is to consider it an extremely long-lived database transaction that provides Snapshot Isolation.
Users can merge an isolated branch back to the branch from which they’ve forked once they’re done applying changes to the data. In data versioning tools, this action is atomic; readers will either see all of the committed modifications or none at all.
Branching is essential for achieving isolation and atomicity, which come in handy for dev/test environments.
These two features allow users to accomplish things that would otherwise be extremely difficult to get right, such as replacing data in place, adding or updating many objects and collections as a single piece, running tests and validations before exposing data to others, etc.
5. Merge to achieve atomic changes to your production branch
Merging is the process of integrating modifications from one branch into another. A merge produces a new commit in which the destination is the first parent and the source is the second.
Similar to Git’s merge capabilities, the lakeFS merge command allows you to join data branches. After you commit data, you can examine it before merging it into the target branch. A merge creates a commit on the target branch that includes all of your modifications.
What’s more, data version control tools that use this logic guarantee fast and atomic merges since they don’t involve any data copying – in lakeFS, branching is a zero-copy action.
6. Automating data version control processes using hook
Another helpful borrowing from Git is Git hooks, shell scripts you can find in the Git repository’s hidden. git/hooks directory. These scripts perform actions in reaction to particular circumstances, allowing you to automate your development lifecycle.
Similarly, tools for versioning data based on Git offer hooks that allow you to automate and ensure that a certain set of checks and validations occur prior to crucial life-cycle events. Unlike Git, they’re executed remotely on a server and are guaranteed to occur when the right event is triggered.
For example, lakeFS enables hooks to be executed when two types of events:
- pre-commit events – before a commit is acknowledged,
- pre-merge events – just before a merging process.
Returning an error for any event type will force the solution to halt the process and return the failure to the requesting user. This is a very strong guarantee: you can now codify and automate the rules and procedures that all data lake participants must follow.
7. Smart management of data and version expiration
One challenge we all have with data is deciding on its retention policy. Do we keep everything forever? That translates to high storage costs and difficulty in manageability.
How do we decide when and how to hard delete data or different versions?
A data version control system offers you the ability to retain data on the basis of business logic. Since data version control systems hold information on which data is no longer pointed to from any active branch or version of the data, they can automatically set those data sets for deletion.
It’s possible that there’s no need to keep different versions older than 30 days or a year. If you’re confident that for a certain period of time, certain versions of data are no longer relevant, a TTL (time to live) policy is an excellent approach to having previous copies of data destroyed automatically.
Once those versions are deleted, any data that was only used by them is automatically deleted from the storage.
8. Version conceptually, not chronologically
There’s nothing wrong with creating fresh copies of a dataset on a daily or hourly basis. In many circumstances, finding the version with the closest creation time to when you want it is sufficient to discover what you’re searching for.
Tying versioning to the start and/or completion of data pipeline jobs makes data versions even more significant.
And the ETL script was completed? Make a new version. Are you about to send an email to your “highly engaged” subscribers? First, make a backup of the dataset.
This allows you to incorporate more useful metadata about your versions beyond the time they were generated, which allows you to figure out what went wrong much faster if something goes wrong.
9. Use versioning to improve teamwork
One of the difficulties in data settings is avoiding stepping on the toes of your colleagues. Data assets are frequently considered a kind of shared folder that anybody may access, write to, or modify.
This creates issues with concurrency, i.e.: What data are you looking at? I see something different.
Version control systems allow teams to share a branch, and collaborate over data while still having commits to ensure concurrency and avoid the mess.
Different teams can communicate over the data by specifying the commit their branch was taken from. This works best if it’s tagged with the logical information on the commit, such as “Results of 31/07 ingest pipeline”.
It’s crucial to surround whatever tooling you choose to use for data versioning with a strong culture and processes. By sharing these best practices with your team members, you’ll maximize the value of data version control at your organization and make every data practitioner’s life easier.
At lakeFS, we’ve created a series of articles, with each one delving into a unique aspect of data version control.
Data Version Control: What Is It and How Does It Work?
This guide will provide you with a comprehensive overview of data version control, explaining what it is, how it functions, and why it’s essential for all data practitioners.
Learn more in our guide to Data Version Control.
Best Data Version Control Tools
As datasets grow and become more complex, data version control tools become essential in managing changes, preventing inconsistencies, and maintaining accuracy. This article introduces five leading solutions that practitioners can rely on to handle these daily challenges.
Learn more in our detailed guide to Best Data Version Control Tools.
Data Version Control With Python:
A Comprehensive Guide for Data Scientists Explore the essentials of data version control using Python. This guide covers isolation, reproducibility, and collaboration techniques, equipping you with the knowledge to manage data with precision.
Learn more in our guide to Data Version Control with Python.
Table of Contents
Table of Contents