Einat Orr, PhD.
December 1, 2022

Imagine the software engineering world before distributed version control systems like Git became widespread. That is where the data world is today.

The explosion in the volume of generated data forced organizations to move away from relational databases and instead store data in object storage. This escalated manageability challenges that teams need to address before realizing the full potential of their data. 

This is where we come back to data versioning. Versioning of data is important because it enables higher velocity of data teams while reducing the cost of errors. 

Read this article to learn everything you need to know about data version control – what it is, how it works, and why it’s so important for every data practitioner out there. We will also show you a few data versioning tools on the market and outline their key advantages and drawbacks. 

What is Version Control?

Production-level systems require some form of versioning and a single source of truth. Any resource that gets continuously updated, especially by multiple people, needs some kind of audit trail to keep track of all changes. The solution to this in software engineering is Git, which allows engineers to commit changes, create branches from a source, and merge those branches back into the original, to name a few operations.

So, what is Data Version Control?

Data version control is the same paradigm for datasets instead of source code. Live data systems constantly ingest new data while different users experiment on the same datasets. This can easily lead to multiple versions of the same dataset, which is definitely nothing like a single source of truth.

Additionally, in machine learning environments, teams may have several versions of the same model trained on different versions of the same dataset. If these models aren’t properly audited and versioned, you might end up with a tangled web of datasets and experiments.

Data version control is all about tracking datasets by recording every change made to them. Version control gives you two primary benefits:

  • Visibility into the project’s development over time – showing what has been added, modified, and removed. 
  • Risk management – you can easily switch to an older version of work if an unexpected problem occurs with the current version. A document that describes each change lets you see the differences between versions, helping you to manage issues faster.
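The visibility benefit above can be illustrated with a minimal sketch. It assumes each dataset version is summarized as a mapping from file name to a content fingerprint; the file names and fingerprints are made up for illustration and do not come from any particular tool.

```python
from typing import Dict, List


def diff_snapshots(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two snapshots that map a file name to a content fingerprint."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    modified = sorted(k for k in old if k in new and old[k] != new[k])
    return {"added": added, "modified": modified, "removed": removed}


# Hypothetical snapshots of the same dataset at two points in time.
v1 = {"users.csv": "a1", "orders.csv": "b2", "legacy.csv": "c3"}
v2 = {"users.csv": "a1", "orders.csv": "b9", "events.csv": "d4"}

changes = diff_snapshots(v1, v2)
print(changes)
```

A report like this is the kind of "document that describes each change" mentioned above: it tells you what to inspect before deciding whether to switch back to the older version.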

What pain does a Data Version Control system solve?

Both administrators and users of databases, data warehouses, and data lakes often face this common problem: 

The data they have represents only the current state of the world. 

Since the world is always changing, this data is also subject to constant change. If you want to restore or inspect an older state of the data, you can dig through a log file and rebuild it, but this method isn’t practical for data analytics purposes.

This is the pain data versioning solves. Beyond standard approaches to versioning data, more advanced data versioning helps users operate their data storage safely. For example, in machine learning, data scientists often change a dataset while testing ways to make their models more efficient. With this type of versioning, teams can capture the versions of their data and models in Git commits, which provides a mechanism to switch between these different data contents.

The result is a single history for data, code, and machine learning models that team members can traverse. This keeps projects consistent with logical file names and allows you to use different storage solutions for your data and models in any cloud or on-premise solutions. Data versioning also improves data compliance by letting teams use audit functions to review data modifications. 

How does a Data Version Control system work?

Data versioning is based on storing successive versions of data created or changed over time. Versioning makes it possible to save changes to a file or a certain data row in a database, for instance. If you apply a change, it will be saved, but the initial version of the file will remain as well. 

That way, you can always roll back to an earlier version if there are problems with the current version. This is essential for people working in data integration processes because incorrect data can be fixed by restoring an earlier correct state. 
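As a rough illustration of the idea, here is a toy sketch of a store that keeps successive versions and supports rollback. The `VersionedTable` class and its methods are hypothetical names, and keeping a full copy per version is a simplification: real systems typically share unchanged data between versions.

```python
import copy
from typing import Any, Dict, List


class VersionedTable:
    """Toy append-only store: every commit keeps a full copy of the rows."""

    def __init__(self) -> None:
        self._versions: List[Dict[str, Any]] = [{}]  # version 0 is the empty table

    @property
    def current(self) -> Dict[str, Any]:
        return copy.deepcopy(self._versions[-1])

    def commit(self, changes: Dict[str, Any]) -> int:
        """Apply row changes on top of the latest version; return the new version id."""
        snapshot = copy.deepcopy(self._versions[-1])
        snapshot.update(changes)
        self._versions.append(snapshot)
        return len(self._versions) - 1

    def read(self, version: int) -> Dict[str, Any]:
        """Return the rows exactly as they were at the given version."""
        return copy.deepcopy(self._versions[version])

    def rollback(self, version: int) -> int:
        """Restore an earlier state by appending it again as the newest version."""
        self._versions.append(copy.deepcopy(self._versions[version]))
        return len(self._versions) - 1


table = VersionedTable()
good = table.commit({"row-1": {"price": 10}})
bad = table.commit({"row-1": {"price": -999}})  # an incorrect update
restored = table.rollback(good)
```

Note that the rollback does not erase the bad version. It stays in the history, so the audit trail remains complete.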

What Data Version Control systems are out there?

The data versioning space includes a few handy tools that have their advantages and limitations. Here’s a short overview of four such solutions.

Dolt

This open-source project is a versioned database built on top of the Noms storage engine, and it allows for Git-like operations for data. If you use a relational database and want to continue using it while also having version control capabilities, Dolt is a good pick.

How does Dolt work? It relies on a data structure called a Prolly tree, a block-oriented search tree that brings together the properties of a B-tree and a Merkle tree. This combination works well because B-trees are used to hold indices in relational databases: their balanced structure provides good performance when reading from or writing to a database. The Merkle tree side adds content hashes for each block, so two versions of the data can be compared by comparing hashes.
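To show why content hashing helps with versioning, here is a small sketch of the Merkle-tree half of the idea only. It is not a Prolly tree and not Dolt's implementation: it assumes a table already split into a fixed number of chunks, hashes each chunk, and hashes those hashes into a root. The function names are hypothetical.

```python
import hashlib
from typing import List


def chunk_hash(rows: List[str]) -> str:
    """Hash the contents of one chunk of rows."""
    return hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()


def root_hash(chunk_hashes: List[str]) -> str:
    """Hash the chunk hashes together into a single root fingerprint."""
    return hashlib.sha256("".join(chunk_hashes).encode("utf-8")).hexdigest()


def changed_chunks(old: List[List[str]], new: List[List[str]]) -> List[int]:
    """Return indexes of chunks whose hashes differ (same chunk count assumed)."""
    old_hashes = [chunk_hash(c) for c in old]
    new_hashes = [chunk_hash(c) for c in new]
    if root_hash(old_hashes) == root_hash(new_hashes):
        return []  # identical roots: nothing changed, no need to look further
    return [i for i, (a, b) in enumerate(zip(old_hashes, new_hashes)) if a != b]


# Two hypothetical versions of a table, each split into two chunks.
table_v1 = [["1,alice", "2,bob"], ["3,carol", "4,dave"]]
table_v2 = [["1,alice", "2,bob"], ["3,carol", "4,david"]]

print(changed_chunks(table_v1, table_v2))
```

If the roots match, the versions are identical and no rows need to be read. If they differ, only the chunks with different hashes need a closer look.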

However, Dolt isn’t a good solution if your data isn’t in a relational database or if you wish to keep your data in place. Managing data at petabyte scale would be impractical. If speed is a concern, this structure is also less efficient. And if you rely heavily on unstructured data, then it’s time to look for another solution.

Git LFS

The problem with Git is that it does not scale well for large data files. But engineers can use an add-on called Git LFS to manage both data and code.

The idea behind this solution derives from game development: game developers usually deal with game code but also tons of artifacts, mostly binaries that determine the game’s look. Game devs managed those assets together with code, which made their repositories extremely heavy and confusing. So, they built an add-on to Git that keeps those large assets out of the repository itself and fetches them only when they are needed.

The logic behind Git LFS is simple and relies on managing metadata: the repository keeps small pointer files, while the large files they describe are stored outside it. The approach caught on with people who do machine learning and research because they also deal with files that aren’t code and are much larger than typical source files, alongside the files that are code. These are best kept together because of the connection between the model and the data it ran on.
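To make the metadata idea concrete, the sketch below builds the kind of small text pointer that stands in for a large file in the repository. The three-line layout (version, oid, size) follows the Git LFS pointer file format as I understand it from the specification. The `lfs_pointer` helper is hypothetical: Git LFS itself generates pointers through Git filters, not through a function like this.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build a Git LFS style pointer: a spec version, a SHA-256 object id, and a byte size."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Stand-in for a large binary asset such as model weights (about 5 MB of zeros).
model_weights = b"\x00" * 5_000_000
pointer = lfs_pointer(model_weights)
print(pointer)
```

The pointer is a few lines long regardless of the size of the file it describes, which is what keeps the Git repository itself small.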

Git LFS integrates seamlessly with every Git repository. But if you decide to use it, expect your data files to live alongside your code. This means that you have to lift and shift your data to coexist with your code.

DVC

DVC was designed to work with version control systems like Git. When you add data to a project using DVC commands, it will upload the data to a remote storage service and generate a metadata file that points to that location.

Next, the metadata file will be added to a Git repository for version control. When data files are modified, added, or removed, the metadata file is updated, and new data is uploaded. That way, you can keep track of data and share it with collaborators through the metadata files, without actually storing the data in the repository.
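The workflow described above can be sketched as a simplified model. This is not DVC's code or file format: the `track` and `restore` helpers, the `.meta.json` suffix, and the JSON field names are all assumptions made for illustration, and a local directory stands in for remote storage.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, remote_dir: Path) -> Path:
    """Copy a data file to 'remote' storage keyed by its hash; write a small pointer file."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    remote_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, remote_dir / digest)
    meta_file = data_file.parent / (data_file.name + ".meta.json")
    meta_file.write_text(json.dumps({"path": data_file.name, "sha256": digest}))
    return meta_file


def restore(meta_file: Path, remote_dir: Path) -> Path:
    """Rebuild the data file that a pointer file describes."""
    meta = json.loads(meta_file.read_text())
    target = meta_file.parent / meta["path"]
    shutil.copyfile(remote_dir / meta["sha256"], target)
    return target


workspace = Path(tempfile.mkdtemp())
remote = workspace / "remote"  # stand-in for a remote storage service
data = workspace / "train.csv"

data.write_text("id,label\n1,cat\n")
meta_v1 = track(data, remote).read_text()  # the small file Git would commit

data.write_text("id,label\n1,cat\n2,dog\n")
meta_path = track(data, remote)  # pointer now describes the second version

# Simulate checking out the older commit: put the old pointer back, then restore.
meta_path.write_text(meta_v1)
restore(meta_path, remote)
print(data.read_text())
```

Only the pointer file ever needs to go into Git. Switching between commits changes the pointer, and the matching data is then pulled from storage.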

However, DVC is missing important relational database features. If you’re a relational database person, it’s probably not the best choice. Also, caching becomes unrealistic when you’re operating on a petabyte scale and using hundreds of millions of objects.

lakeFS

lakeFS is a version control system that sits on top of the data lake and is based on Git-like semantics. Engineers can use it to create isolated versions of the data, share them with other team members, and merge changes into the main branch effortlessly.
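The Git-like semantics can be illustrated with a toy model. This is not the lakeFS API or its internal design: `ToyRepo` is a hypothetical class in which each branch is a mapping from path to object id, so creating a branch copies only that mapping and never the objects. The merge is deliberately naive, with no conflict detection and no handling of deletions.

```python
from typing import Dict


class ToyRepo:
    """Toy model: each branch is a {path: object_id} mapping over shared objects."""

    def __init__(self) -> None:
        self.branches: Dict[str, Dict[str, str]] = {"main": {}}

    def create_branch(self, name: str, source: str) -> None:
        # Copies only the path-to-object mapping, never the objects themselves.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch: str, path: str, object_id: str) -> None:
        """Point a path on one branch at an object; other branches are unaffected."""
        self.branches[branch][path] = object_id

    def merge(self, source: str, target: str) -> None:
        """Naive merge: entries from the source branch overwrite the target's."""
        self.branches[target].update(self.branches[source])


repo = ToyRepo()
repo.write("main", "events/2022-11.parquet", "obj-1")

repo.create_branch("experiment", "main")
repo.write("experiment", "events/2022-12.parquet", "obj-2")

main_before_merge = dict(repo.branches["main"])  # main is isolated from the experiment
repo.merge("experiment", "main")
print(repo.branches["main"])
```

Until the merge, readers of `main` never see the experimental file, which is the isolation the paragraph above describes.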

lakeFS supports managing data in AWS S3, Azure Blob Storage, Google Cloud Storage, and any other object storage with an S3 interface. The platform smoothly integrates with popular data frameworks such as Spark, Hive Metastore, dbt, Trino, Presto, and others. 

lakeFS brings together all data sources in data pipelines, from analytics databases to key-value stores, behind a single API that lets you easily manage the underlying data in all data stores.

Want to learn more? Here’s a detailed comparison of DVC, Git LFS, and lakeFS.

Wrap up

Data practitioners who use the right version control tools to handle the scale, complexity, and constantly changing nature of modern data can transform a chaotic environment into a manageable one.

They gain full control of their data, enabling them to know where it comes from, what has changed, and why. This way, data practitioners can go back to being that person in the room who corrects the manager when they’re about to make a decision based on inaccurate data.
