Data Version Control: What Is It and How Does It Work?
Table of Contents
Imagine the software engineering world before distributed version control systems like Git became widespread. That is where the data world stands today.
The explosion in the volume of generated data forced organizations to move away from relational databases and instead store data in object storage. This escalated manageability challenges that teams need to address before realizing the full potential of their data.
This is where data versioning comes in. Versioning data matters because it lets data teams move faster while reducing the cost of errors.
Read this article to learn everything you need to know about data version control – what it is, how it works, and why it’s so important for every data practitioner out there. We will also show you a few data versioning tools on the market and outline their key advantages and drawbacks.
What is version control?
Production-level systems require some form of versioning and a single source of truth. Any resource that gets continuously updated – especially by multiple people – needs some kind of audit trail to keep track of all changes.
In software engineering, the solution to this is Git, which allows engineers to commit changes, create branches from a source, and merge branches back into the original, to name a few operations.
So, what is Data Version Control?
Data version control applies the same paradigm to datasets instead of source code. Live data systems constantly ingest new data while different users experiment on the same datasets. This can easily lead to multiple versions of the same dataset, which is definitely nothing like a single source of truth.
Additionally, in machine learning environments, teams may have several versions of the same model trained on different versions of the same dataset. If these models aren’t properly audited and versioned, you might end up with a tangled web of datasets and experiments.
Data version control is all about tracking datasets by registering changes on a particular dataset. Version control gives you two primary benefits:
- Visibility into the project’s development over time – showing what has been added, modified, and removed.
- Risk management – you can easily switch to an older version of work if an unexpected problem occurs with the current version. A document that describes each change lets you see the differences between versions, helping you to manage issues faster.
What pain does a data version control system solve for data scientists and engineers?
Both administrators and users of databases, data warehouses, and data lakes often face this common problem:
The data they have represents only the current state of the world.
Since the world is always changing, this data is also subject to constant change. If you want to return to or inspect an older state of the data, you can dig into log files and restore it. That sounds fine, but the method isn’t practical for data analytics purposes.
This is the pain that data versioning solves. Beyond standard approaches to versioning data, more advanced data versioning helps users set up secure data storage operations.
For example, in the context of machine learning, data scientists can experiment with their models and change the underlying dataset as they go. With this type of versioning, teams can easily capture the versions of their data and models in Git commits, which gives them a mechanism to switch between different data contents.
The result is a single history for data, code, and machine learning models that team members can traverse. This keeps projects consistent with logical file names and allows you to use different storage solutions for your data and models in any cloud or on-premises solution.
Data versioning also improves data compliance by letting teams use audit functions to review data modifications, which the system meticulously tracks.
How does a data version control system work?
Data versioning is based on storing successive versions of data created or changed over time. Versioning makes it possible to save changes to a file or a certain data row in a database, for instance. If you apply a change, it will be saved, but the initial version of the file will remain as well.
That way, you can always roll back to an earlier version if there are problems with the current version. This is essential for people working in data integration processes because incorrect data can be fixed by restoring an earlier, correct state.
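In miniature, the mechanism looks like this: every change is saved as a new snapshot, and earlier snapshots stay available for rollback. The class below is a purely illustrative sketch, not any tool's actual API:

```python
from copy import deepcopy

class VersionedStore:
    """Append-only history: each commit stores a new snapshot,
    so any earlier state can be read back for rollback."""

    def __init__(self):
        self._versions = []

    def commit(self, data):
        self._versions.append(deepcopy(data))
        return len(self._versions) - 1  # version id

    def get(self, version=-1):
        return deepcopy(self._versions[version])

store = VersionedStore()
v0 = store.commit({"rows": [1, 2, 3]})
store.commit({"rows": [1, 2, 3, 999]})  # a bad load slips in

# Roll back by reading the earlier, correct state.
print(store.get(v0))  # {'rows': [1, 2, 3]}
```

Real systems store versions far more efficiently, by sharing unchanged data between snapshots, but the contract is the same: committing never destroys an earlier state.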
➡️ If you’re wondering what makes data version control systems different from open table formats (OTFs), take a look at this guide: Open Table Formats vs. Data Version Control Systems
What data versioning tools are out there?
The data versioning space includes a few handy tools that have their advantages and limitations. Here’s a short overview of five such solutions.
lakeFS
lakeFS is a version control system that sits on top of the data lake and provides Git-like semantics. Data engineers can use it to create isolated versions of the data, share them with other team members, and effortlessly merge changes back into the main branch.
lakeFS supports managing data in AWS S3, Azure Blob Storage, Google Cloud Storage, and any other object storage with an S3 interface. The platform smoothly integrates with popular data frameworks such as Spark, Hive Metastore, dbt, Trino, Presto, and others.
lakeFS brings all data sources in your data pipelines, from analytics databases to key-value stores, behind a single API that lets you easily manage the underlying data in every store.
How to create a branch in lakeFS
lakeFS uses branches in a similar way to Git. Data practitioners can use them to isolate changes for experimentation and then re-integrate them once ready.
Note: Branching is a zero-copy procedure in lakeFS.
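Conceptually, a zero-copy branch is possible because a branch is just a named pointer to a commit, and the commit references data objects through metadata. Creating a branch copies the pointer, not the data. A toy sketch (the names are illustrative, not lakeFS internals):

```python
# Commits reference data objects through metadata only.
commits = {"c1": {"sales.parquet": "s3://bucket/obj-9f2a"}}

# A branch is just a name pointing at a commit.
branches = {"main": "c1"}

def create_branch(name, source):
    # Zero-copy: duplicate the pointer, never the underlying objects.
    branches[name] = branches[source]

create_branch("experiment", "main")
print(branches["experiment"])  # c1 -- same commit as main, no data moved
```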
To create a branch, you need the lakectl CLI tool, which you first have to configure with your credentials.
Run this command in a new terminal window:
docker exec -it lakefs lakectl config
Follow the prompts and enter the credentials that you obtained in the first step. Leave the server endpoint URL as http://127.0.0.1:8000.
Now that lakectl is configured, you can use it to create a branch. Run the following:
docker exec lakefs lakectl branch create lakefs://quickstart/denmark-lakes --source lakefs://quickstart/main
You should get a confirmation message like this:
Source ref: lakefs://quickstart/main
created branch 'denmark-lakes' 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8f1ef2d956646816
You have a new branch to experiment on your data! Once done, you can either merge it into the main branch (integrate the changes you’ve applied) or chuck it and start a new experiment.
Dolt
This open-source project is a versioned SQL database built on top of the Noms storage engine that allows for Git-like operations on data. If you use a relational database and want to keep using one while gaining version control capabilities, Dolt is a good pick.
How does Dolt work? It relies on a data structure called a Prolly tree, a block-oriented search tree that combines the properties of a B-tree and a Merkle tree. The combination works well: B-trees are what relational databases use for indices, giving a balanced structure and good performance when reading from or writing to the database, while the Merkle-tree hashing makes versions cheap to store and compare.
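A simplified Merkle-tree sketch shows the property that makes this design attractive: identical chunks hash identically, so two versions of a table share most of their hashes and can be compared cheaply. Real Prolly trees also chunk data content-dependently and keep the tree balanced; this toy version only demonstrates the hashing idea:

```python
import hashlib

def h(data) -> str:
    """Hash bytes or a string to a hex digest."""
    if isinstance(data, str):
        data = data.encode()
    return hashlib.sha256(data).hexdigest()

def merkle_root(chunks):
    """Hash each chunk, then pairwise-combine hashes up to a single root."""
    level = [h(c) for c in chunks]
    while len(level) > 1:
        level = [h(level[i] + (level[i + 1] if i + 1 < len(level) else ""))
                 for i in range(0, len(level), 2)]
    return level[0]

a = [b"rows 1-100", b"rows 101-200", b"rows 201-300", b"rows 301-400"]
b = [b"rows 1-100", b"rows 101-200", b"rows 201-300", b"rows 301-400 EDITED"]

# Only the edited chunk (and its ancestors) hash differently, so a diff
# between the two versions only has to walk the changed path.
print(merkle_root(a) != merkle_root(b))              # True
print([h(x) for x in a[:3]] == [h(x) for x in b[:3]])  # True: shared chunks, shared hashes
```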
However, Dolt isn’t a good solution if your data isn’t in a relational database or if you want to keep your data in place. Managing data at petabyte scale would be impossible. If speed is a concern, this structure is also less efficient. And if you rely heavily on unstructured data, it’s time to look for another solution.
Git LFS
The problem with Git is that it cannot scale to large data files. However, engineers can use an extension called Git LFS (Large File Storage) to manage both data and code.
The idea behind this solution derives from game development: game developers usually deal with game code but also tons of artifacts – mostly binaries that impact the game’s look.
Game developers managed those assets together with code, which made their repositories extremely heavy and confusing. So, they built an add-on to Git that allows them to avoid doing that if there’s no need.
The logic behind Git LFS is simple and relies on managing metadata: the repository tracks small pointer files, while the large payloads live in a separate store. This use case caught on with data practitioners in machine learning and research because they also deal with files that aren’t code and are larger than what Git handles well, such as trained models and datasets. These are best kept together with the code because of the connection between a model and the data it was trained on.
Git LFS integrates seamlessly with every Git repository. But if you decide to use it, expect your code and data files to live together. This means you have to lift and shift your data to coexist with your code.
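Under the hood, Git LFS replaces each large file in the repository with a small pointer file and stores the payload elsewhere. The sketch below mimics the real LFS pointer format (a version line, an oid, and a size); the helper function and storage layout are illustrative:

```python
import hashlib
import os
import tempfile

def store_large_file(payload: bytes, lfs_dir: str) -> str:
    """Write the payload to a content-addressed store and return the
    small pointer text that Git would version instead of the payload."""
    oid = hashlib.sha256(payload).hexdigest()
    with open(os.path.join(lfs_dir, oid), "wb") as f:
        f.write(payload)
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(payload)}\n"
    )

lfs_dir = tempfile.mkdtemp()
pointer = store_large_file(b"x" * 10_000_000, lfs_dir)  # a ~10 MB artifact
print(len(pointer))  # the repo versions only this tiny pointer, not 10 MB
```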
DVC
DVC was designed to work with version control systems like Git. When you add data to a project using DVC commands, DVC uploads the data to a remote storage service and generates a metadata file that points to that location.
Next, the metadata file will be added to a Git repository for version control. When data files are modified or added/removed, the metadata file is updated, and new data is uploaded. That way, you can keep track of data and share it with collaborators without actually storing it in the repository by using the metadata files.
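That workflow can be sketched end to end in a few lines; the `add` function and the metadata file name below are illustrative stand-ins for DVC's real `dvc add` command and `.dvc` files:

```python
import hashlib
import json
import os
import shutil
import tempfile

def add(path: str, remote: str) -> dict:
    """Hash the data, push it to remote storage under that hash,
    and write a small metadata file for Git to track."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    shutil.copy(path, os.path.join(remote, digest))  # the "push"
    meta = {"md5": digest, "path": os.path.basename(path)}
    with open(path + ".meta.json", "w") as f:  # stand-in for the .dvc file
        json.dump(meta, f)
    return meta

workdir, remote = tempfile.mkdtemp(), tempfile.mkdtemp()
data = os.path.join(workdir, "data.csv")
with open(data, "w") as f:
    f.write("a,b\n1,2\n")

m1 = add(data, remote)
with open(data, "a") as f:       # the dataset changes...
    f.write("3,4\n")
m2 = add(data, remote)
print(m1["md5"] != m2["md5"])    # True: the new version gets a new hash
```

Committing the metadata file to Git is what makes the data version reproducible: anyone with the repository and access to the remote can fetch exactly the bytes that hash refers to.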
However, DVC is missing important relational database features. If you’re a relational database person, it’s probably not the best choice. Also, caching becomes unrealistic when you’re operating on a petabyte scale and using hundreds of millions of objects.
Want to learn more? Here’s a detailed comparison of DVC, Git LFS, and lakeFS.
The Nessie Project
The open-source project Nessie provides a new level of control and consistency around data too. Nessie draws inspiration from GitHub, a platform where programmers create, test, release, and update software versions.
By extending analogous development methodologies and concepts to data, Nessie enables data engineers and analysts to update, restructure, and repair datasets while maintaining a consistent version of truth.
To provide consistent data version control, Nessie leverages CI/CD. Data engineers and analysts can generate a virtual clone of a data set, update it, and then merge it back into the original data set.
This means they can consume, alter, and experiment with data in isolation, across several users and processing engines, without fear of jeopardizing the core data, which represents the single version of truth. They don’t apply any modifications to the core data collection until they have a reliable version.
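The isolation guarantee can be pictured as pointer bookkeeping: readers of the main branch see nothing until the branch is merged, at which point main advances in one step. This is a conceptual sketch, not Nessie's actual data model:

```python
# Commits are immutable; refs are movable names pointing at commits.
commits = {"c1": {"t1_rows": 100}}
refs = {"main": "c1"}

refs["etl"] = refs["main"]            # branch off: no table data copied
commits["c2"] = {"t1_rows": 150}      # repair the dataset on the branch
refs["etl"] = "c2"

print(commits[refs["main"]])          # {'t1_rows': 100} -- main is untouched
refs["main"] = refs["etl"]            # merge: main advances in one step
print(commits[refs["main"]])          # {'t1_rows': 150}
```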
Data practitioners who use the right data version control tools to handle the scale, complexity, and constantly changing nature of modern data can transform a chaotic environment into a manageable one.
They gain full control of their data, enabling them to know where it comes from, what has changed, and why. This way, data practitioners can go back to being that person in the room who corrects the manager when they’re about to make a decision based on inaccurate data.
At lakeFS, we’ve created a series of articles, with each one delving into a unique aspect of data version control.
Data Version Control Best Practices
This article outlines a range of best practices for data version control, designed to help teams maintain the highest quality and validity of data throughout every stage of the process.
Learn more in our detailed guide to Data Version Control Best Practices.
Best Data Version Control Tools
As datasets grow and become more complex, data version control tools become essential in managing changes, preventing inconsistencies, and maintaining accuracy. This article introduces five leading solutions that practitioners can rely on to handle these daily challenges.
Learn more in our detailed guide to Best Data Version Control Tools.
Data Version Control With Python: A Comprehensive Guide for Data Scientists
Explore the essentials of data version control using Python. This guide covers isolation, reproducibility, and collaboration techniques, equipping you with the knowledge to manage data with precision.
Learn more in our guide to Data Version Control with Python.