Best Data Version Control Tools 
As the volume of datasets rises and data practitioners use them at an increasing pace due to advances in machine learning, keeping track of the changes applied to the data becomes more challenging. As a result, teams experience problems like inconsistencies, inaccuracies, and a lack of reproducibility in their data products. Luckily, data version control tools are here to help.
In this article, we review five of the top data versioning technologies that data practitioners use to solve their daily challenges.
What is data version control?
Data versioning is based on the approach of version control applied to application source code.
As various users experiment on the same datasets, live data systems continually absorb fresh data. This may easily result in several versions of the same dataset, which is far from a single source of truth.
Furthermore, teams in machine learning settings may have multiple versions of the same ML model trained on various versions of the same dataset. Experiment tracking is a key capability teams need. If these ML models and data aren’t properly verified and versioned, the result might be a tangled web of datasets and trials.
Data versioning tools help teams keep track of datasets by recording changes to a repository of datasets. Version control provides three key advantages:
- Visibility into how the project’s data evolves over time, including what was added, changed, and removed.
- Since the data is managed as a repository, concepts like a branch, commit and revert allow us to tag certain data versions and effortlessly go back to using them by using the branch or commit ID.
- The concept of branching allows for the creation of an isolated environment of data that can be used to develop in isolation, test changes, and promote only quality data to production. This process can also be automated, bringing the CI/CD processes we know from code to data pipelines.
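To make the repository concepts above concrete, here is a minimal in-memory sketch of commits, branches, promotion, and revert. The `DataRepo` class and its method names are hypothetical and exist only for illustration; real tools store data in object storage and handle merges far more carefully than this fast-forward-only toy.

```python
import hashlib
import json


class DataRepo:
    """Toy repository: commits are immutable snapshots, branches are named pointers."""

    def __init__(self):
        self.commits = {}               # commit ID -> {"parent": ID or None, "files": {...}}
        self.branches = {"main": None}  # branch name -> commit ID at its tip

    def commit(self, branch, files):
        """Record a full snapshot of `files` on `branch` and return the new commit ID."""
        parent = self.branches[branch]
        payload = json.dumps({"parent": parent, "files": files}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.commits[commit_id] = {"parent": parent, "files": dict(files)}
        self.branches[branch] = commit_id
        return commit_id

    def branch(self, name, source):
        """Create an isolated branch that starts at the tip of `source`."""
        self.branches[name] = self.branches[source]

    def read(self, ref):
        """Return the snapshot for a branch name or a commit ID."""
        commit_id = self.branches.get(ref, ref)
        return dict(self.commits[commit_id]["files"])

    def merge(self, source, target):
        """Promote `source` to `target` (fast-forward only, for simplicity)."""
        self.branches[target] = self.branches[source]

    def revert(self, branch, commit_id):
        """Point `branch` back at an earlier commit."""
        self.branches[branch] = commit_id


repo = DataRepo()
v1 = repo.commit("main", {"users.csv": "id,name\n1,Ada\n"})
repo.branch("experiment", "main")
repo.commit("experiment", {"users.csv": "id,name\n1,Ada\n2,Grace\n"})

main_before_merge = repo.read("main")  # main is untouched by the experiment
repo.merge("experiment", "main")       # promote the tested change
main_after_merge = repo.read("main")
repo.revert("main", v1)                # go back using the commit ID
main_after_revert = repo.read("main")
```

The key design point is that a commit is never modified once written, so a branch or a commit ID always resolves to exactly the same data.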
Why is there a need to version data?
Administrators, providers, and consumers of databases, data warehouses, and data lakes all face the same prevalent issue:
The world is always changing, and data is changing with it.
This is the problem that data versioning solves. Beyond traditional ways of versioning, advanced data versioning assists in establishing a safe data storage process.
In the context of machine learning, for example, data practitioners may test their models to improve efficiency and make modifications to the dataset. Teams can simply capture the data versions and model versions in Git commits using this form of versioning, and this gives a means to transition between these distinct data contents.
Thanks to this, team members can navigate a single history of data, code, and models. You can keep your data and models in any cloud or on-premises storage system while keeping projects consistent through logical file names.
Note that data versioning also helps with data compliance by allowing teams to evaluate data alterations using audit tools.
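As a rough illustration of how two data versions can be compared for an audit, the sketch below fingerprints each file with SHA-256 and reports what was added, removed, or changed. The function names and sample files are assumptions made for this example and do not reflect any particular tool's API.

```python
import hashlib


def fingerprint(files):
    """Map each path to a SHA-256 digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def diff_versions(old, new):
    """Compare two {path: digest} manifests and report what changed."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }


v1 = fingerprint({"train.csv": b"a,b\n1,2\n", "labels.csv": b"y\n0\n"})
v2 = fingerprint({"train.csv": b"a,b\n1,2\n3,4\n", "test.csv": b"a,b\n5,6\n"})

report = diff_versions(v1, v2)
print(report)
# {'added': ['test.csv'], 'removed': ['labels.csv'], 'changed': ['train.csv']}
```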
How do you pick the right data version control tool?
Consider the following factors when selecting a data versioning technology for your workflow:
- Use case – Data version control systems are not all created with the same user in mind. Some systems are designed to support a specific persona, such as a data scientist, researcher, or analyst, while others provide an organizational infrastructure that all data practitioners in the organization can benefit from (including data science and engineering experts).
- Where is the data stored – Can the data stay where you manage it, or do you need to lift and shift it to the data version control system?
- Flexibility – can it handle various types of data and tolerate changes in data structures? Is it able to version a variety of file types, including CSV, JSON, and binary files? Or is it as inflexible as an SQL database is with unstructured data?
- Ease of use – how easy is it to include this solution in your workflow? Does it have an easy-to-use interface for versioning data, as well as good documentation and tutorials?
- Integration – how effectively does it integrate into your stack? Can you connect to your infrastructure, platform, or model training workflow without any trouble? Does it include integrations with common frameworks such as TensorFlow, PyTorch, and Scikit-learn?
- Scalability – is the solution scalable? Can it meet the rising volume of data generated by your project while maintaining peak performance? And does it open the door to scalable metadata management as well?
- Collaboration capabilities – does the solution allow multiple users to work on the same project at the same time? Does it include role-based access control, a version history, and the ability to add metadata to versions?
- Open-source vs. proprietary – Open-source data versioning tools are free to use and are created and maintained by a developer community (including a commercial company). They’re more adaptable and configurable than proprietary software since users may change and expand the codebase to meet their own requirements. You can contribute code to the community in cases where you wish the system to better suit your needs.
Best data version control tools for data practitioners in 2023
As the volume of datasets rises and data practitioners increasingly utilize them due to advances in machine learning, keeping track of the changes applied to the data becomes a more challenging endeavor. Data version control tools are emerging as a vital solution.
Below, you will find a review of the top data version control solutions in 2023, embraced by data practitioners to address daily challenges:
|Tool|Open Source|Release Date|All Data Formats|Data Stays in Place|Scalability|Performance|All Use Cases|Integrates with Git|
|Project Nessie| |2020| | | | | | |
Under "All Use Cases", two of the tools are marked: Research / Data Science Only
*Note: DVC requires copying the data to a local file system for it to be consumed, while Git LFS and XetHub require copying the data to their managed storage in addition to copying it locally to be consumed.
Let’s dive into the details of these leading tools, exploring the unique features that have made them popular choices among data practitioners.
1. lakeFS
lakeFS is an open-source version management system based on Git-like semantics that works on top of an existing data lake. Data engineers and data scientists can use it to version control their data while creating and maintaining data pipelines and machine learning models to ensure reproducibility, collaboration, and high-quality outcomes.
lakeFS manages data across AWS S3, Azure Blob Storage, Google Cloud Storage, and any other object storage that has an S3 interface. The platform interfaces seamlessly with prominent data engineering and data science frameworks, including:
- Orchestration tools such as Airflow, Dagster, Prefect, and Kubeflow,
- Labelling systems such as Labelbox and Dataloop,
- Compute engines such as Spark, Trino, Ray and Polars
- Ingest systems such as Kafka, Airbyte and Memphis
- Data quality platforms such as Great Expectations, and more.
lakeFS supports any data format, including open table formats, and semi-structured and unstructured data. Its versioning engine is agnostic to the data format it manages but was designed to provide deeper support for open table formats such as Delta Lake, Apache Iceberg, and Apache Hudi.
The system is highly scalable and can support petabytes of data and billions of objects while having a negligible impact on critical-path data operations such as reading or writing data. It comes with a command line interface and a GUI for easy access.
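Because lakeFS sits behind an S3-compatible interface, existing S3 code can usually select a data version just by changing the object path. The sketch below assumes the addressing convention in which the repository name stands in for the bucket and the branch (or commit) is the first segment of the key; the helper name, repository, and paths are hypothetical, so check the lakeFS documentation for your deployment before relying on it.

```python
def lakefs_s3_location(repository, ref, path):
    """Return (bucket, key) for an object addressed through an S3-compatible endpoint.

    Assumption: repository -> bucket, and the branch or commit `ref`
    is the first segment of the object key.
    """
    return repository, f"{ref}/{path.lstrip('/')}"


# The same dataset on two branches differs only in the key prefix.
bucket, key = lakefs_s3_location("example-repo", "main", "datasets/users.parquet")
exp_bucket, exp_key = lakefs_s3_location("example-repo", "experiment-1", "datasets/users.parquet")

print(f"s3://{bucket}/{key}")
print(f"s3://{exp_bucket}/{exp_key}")
```

An S3 client whose endpoint URL points at the lakeFS server could then read either version with an ordinary get-object call on that bucket and key.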
2. Project Nessie
Project Nessie is an open-source solution that offers a new degree of control and consistency. It takes a cue from GitHub, the platform where developers build, test, release, and upgrade software versions.
Nessie lets data engineers and analysts update, rearrange, and rectify datasets while keeping a consistent version of the truth by applying comparable development techniques and principles to data. This improves the data operations (DataOps) approach, which aims to assure the supply of timely, correct data to the business.
Nessie supports data management of the open table format Apache Iceberg. It implements a catalog that allows it to version control the Apache Iceberg tables using the metadata the format is saving when managing changes to data per table.
Because Nessie relies on its own catalog, it is harder to integrate into data architectures where a catalog such as HMS, Databricks Unity Catalog, or Tabular already exists.
3. DVC
DVC is an open-source tool that data scientists can use to version data and models, track experiments, and compare any data, code, parameters, models, or performance graphical displays.
The tool works with all major cloud platforms and storage types and can handle huge files and machine learning models effectively. Git, the source code versioning tool that many developers use, and the concept of Git repository were the basis for the development of DVC.
However, it was not designed to handle large data files and datasets (over 10,000 objects), and its reliance on Git in the background affects the performance of critical-path data operations.
Collaboration with others requires several setup steps: setting up remote storage, establishing roles, and granting access to each contributor. This may quickly become time-consuming if you’re expecting multiple users.
4. Git LFS
Git LFS is an open-source project that extends Git’s capacity to store huge binary files such as audio samples, movies, and massive datasets while maintaining its lightweight design and performance.
Large files are stored in Git LFS storage on the remote server, while the Git repository itself holds only lightweight pointers to them. The tool can hold any sort of file, independent of format, making it adaptable and useful for Git versioning of huge files. Developers can easily migrate huge files to Git LFS without disrupting their current workflow.
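The pointer idea can be sketched in a few lines. A Git LFS pointer is a small text file that records the spec version, the SHA-256 of the real content, and its size in bytes. The helper below only builds that text for illustration; the function name is hypothetical, and in practice the `git lfs` client creates and resolves pointers for you.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS-style pointer for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


large_file = b"\x00" * 5_000_000  # stand-in for a 5 MB binary asset
pointer = lfs_pointer(large_file)
print(pointer)
```

Git versions only this short pointer, which is why the repository stays lightweight no matter how large the tracked file is.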
However, Git LFS requires a remote Git server with LFS support, making it a one-way door. This is a drawback for users who may prefer to use vanilla Git. It also needs an LFS server to function. Not every Git hosting service offers such a server, so in some cases, you will need to build one yourself or switch to another Git provider.
5. XetHub
XetHub is a closed-source storage solution that brings fast access and version control to ML data and models. Machine learning teams can develop with familiar tools such as Python, an S3-like CLI, or Git in a single repository, while XetHub ensures that all changes are committed in the background for guaranteed reproducibility and easier data management.
Benefits of XetHub include:
- Instant read-only streaming access to repositories for efficient data or model loading on remote training machines and easy local data exploration.
- Flexible interfaces for working with files such as fsspec and Git access patterns.
- Seamless collaboration with Git-style pull requests and access controls.
- Automatic block-level deduplication across your repository for free copies and faster development iterations.
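To show why block-level deduplication makes copies nearly free, here is a simplified sketch that splits data into fixed-size blocks and stores each unique block once, keyed by its hash. This is not XetHub's actual algorithm (which is not described in this article); the `BlockStore` class, the fixed block size, and the sample data are all assumptions chosen to keep the example short.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow


class BlockStore:
    """Stores each unique block once and describes files as lists of block digests."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes

    def put(self, data: bytes) -> list:
        """Split `data` into blocks, store unseen blocks, and return the file's recipe."""
        recipe = []
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # no-op if the block already exists
            recipe.append(digest)
        return recipe

    def get(self, recipe) -> bytes:
        """Reassemble a file from its recipe."""
        return b"".join(self.blocks[digest] for digest in recipe)


store = BlockStore()
v1 = store.put(b"AAAABBBBCCCC")
v2 = store.put(b"AAAABBBBDDDD")  # a new version with only the last block changed

print(len(store.blocks))  # 4 unique blocks stored instead of 6
```

Only the changed block adds storage, so a new version of a large file costs roughly the size of its modifications.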
XetHub currently supports repositories of up to 1TB, with the intention of growing this scale in the future.
Each data version control system demonstrates a variety of pros and cons that you should consider when picking the right one for your organization. While some solutions are more user-friendly and excel in terms of speed and simplicity, others provide more extensive capabilities and scalability.
When making a decision, evaluate the unique needs of your project and weigh the advantages and disadvantages of each solution. Keep in mind that many open-source solutions are backed by commercial organizations that provide the support and SLAs your company requires.
At lakeFS, we’ve created a series of articles, with each one delving into a unique aspect of data version control.
Data Version Control: What Is It and How Does It Work?
This guide will provide you with a comprehensive overview of data version control, explaining what it is, how it functions, and why it’s essential for all data practitioners.
Learn more in our guide to Data Version Control.
Data Version Control Best Practices
This article outlines a range of best practices for data version control, designed to help teams maintain the highest quality and validity of data throughout every stage of the process.
Learn more in our detailed guide to Data Version Control Best Practices.
Data Version Control With Python: A Comprehensive Guide for Data Scientists
Explore the essentials of data version control using Python. This guide covers isolation, reproducibility, and collaboration techniques, equipping you with the knowledge to manage data with precision.
Learn more in our guide to Data Version Control with Python.