Git, the Standard Tool for Code Version Control
When we wish for “Git for Data”, most of us already know what code version control is and that Git is its standard tool. For the sake of those who have just joined us, let’s define those terms.
Back in the 1960s, when people writing code wanted to collaborate with one another, the need to manage the files containing that code together became apparent, and systems that took care of the manageability of code started to emerge, under the name “Version Control” tools. Why version control? Because collaborating on a set of files containing code, where every programmer can change any file at any time, is done by versioning the code files while making sure they stay consistent with one another to form a working application. In other words, the tools version control a repository of code files.
After many version control tools were tried, one won the hearts of developers, and it is now the standard tool for code version control. It’s called Git.
Git conquered the world of code because it best supports the engineering best practices developers require, mainly:
- Collaborate during development
- Develop and Test in isolation
- Revert code repository to a stable version in case of an error
- Reproduce and troubleshoot issues with a given version of the code
- Continuously integrate and deploy new code (CI/CD)
Why Git for Data? Or, Why Version Control Data?
Now that we are on the same page as to why one would version control code, we can see whether those same needs apply to data.
Today, data practitioners of all types, including analysts, data engineers, machine learning engineers, and researchers, work on ever-growing amounts of data. The work of a data practitioner comprises:
- The data, which serves as the input, intermediate results, and output of their work
- The code created to analyze the data, from simple SQL queries, through complex ETLs in a distributed computation engine such as Spark, Trino, or Presto, to ML models or other algorithms that solve complex prediction problems
- The infrastructure: the computation engines used to extract insights, often distributed systems such as Spark or Presto/Trino, and the data storage or databases used to save the data. These are complemented by orchestration systems such as Airflow and data catalogs such as DataHub or Alation.
Source code is well managed in Git, and infrastructure is now managed as code as well, thanks to a paradigm shift in DevOps: tools such as Terraform codify the world of infrastructure, and the code created for it is managed in Git.
The data is what makes data-intensive applications unique, and managing it alongside the other two adds significant complexity. To ensure the engineering best practices mentioned above, we now need to go through the same paradigm shift in DataOps that we went through in DevOps, and manage data the way we manage code. In other words, we need Git for data!
The benefits of “Git for Data”
We have established that implementing engineering best practices in data calls for Git-like operations on data, the same way we have already implemented them for infrastructure. Let’s dive into the specifics that git-for-data will allow us to accomplish. Fair warning: this list is about to look really familiar to you.
1. Collaborate during development
In order to collaborate on developing a data-intensive application, everyone involved must look at the same version of the data. Since data accumulates and updates over time, the way to ensure that is to version control the data and use a commit ID or a branch to synchronize everyone involved on that version.
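The idea can be sketched with plain Git on a toy file (repository contents and file names are illustrative; real data lakes need a data-native tool, as discussed below). A shared commit ID pins every collaborator to an identical snapshot:

```shell
# Demo: pin all collaborators to one exact snapshot via a commit ID.
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main
git config user.email demo@example.com && git config user.name demo

echo "id,value" > events.csv             # the shared dataset (illustrative)
git add events.csv && git commit -qm "snapshot for the team"
pin=$(git rev-parse HEAD)                # share this commit ID with the team

echo "id,value,extra" > events.csv       # later, local drift appears...
git checkout -q "$pin" -- events.csv     # ...anyone can restore the pinned version
```

Everyone who checks out `$pin` reads exactly the same bytes, which is what makes joint debugging and development possible.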
2. Develop and Test in isolation
Branching is the way to achieve isolation. A branch is a snapshot of the data repository at the time of its creation. From that point on, any changes made to the data of the branch are recorded only on that branch, and changes made on other branches do not affect it. This provides an isolated work environment for development and testing. With data, a common use case is to create a branch of production data, allowing development and testing over production data with no additional costs and with no risk to production.
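Again in plain Git, with toy data standing in for a production dataset, the isolation property looks like this: changes committed on the branch never touch the view seen from `main`.

```shell
# Demo: a branch is an isolated snapshot; branch changes don't leak out.
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main
git config user.email demo@example.com && git config user.name demo

echo "2024-01-01,100" > sales.csv        # "production" data (illustrative)
git add sales.csv && git commit -qm "production data"

git switch -qc dev-experiment            # snapshot taken at branch creation
echo "2024-01-02,999" >> sales.csv       # recorded only on this branch
git commit -qam "experimental rows"

git switch -q main                       # the production view is untouched:
wc -l < sales.csv                        # still a single row
```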
3. Revert code repository to a stable version in case of an error
Revert is a Git operation that allows you to time travel within your repository, back to any point in time you have marked with a commit, a branch, or a merge. Once an issue with the quality of your data is revealed, you can revert your data to the last stable snapshot of the repository, in one atomic action that takes milliseconds and ensures consistency across all datasets managed within the repository.
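A minimal Git sketch of recovering from a bad load (file contents are illustrative; a git-for-data tool performs the same operation over the whole lake):

```shell
# Demo: revert a bad data load in one atomic operation.
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main
git config user.email demo@example.com && git config user.name demo

echo "good rows" > table.csv
git add table.csv && git commit -qm "stable snapshot"

echo "corrupt rows" > table.csv
git commit -qam "bad load"               # a data quality issue slips in

git revert --no-edit HEAD                # one atomic action restores stability
cat table.csv                            # back to the stable content
```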
4. Reproduce and troubleshoot issues with a given version of the code
Assuming you have data consumers that rely on different versions of your data, you can support them and troubleshoot any issue they hit by reading the data from the exact version they are consuming. This is how reproducibility over the data is achieved.
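In Git terms (toy file, illustrative names), reproducing a consumer’s view means reading the data at the commit they consume, without disturbing the current state:

```shell
# Demo: read the data exactly as a given consumer saw it.
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main
git config user.email demo@example.com && git config user.name demo

echo "v1" > report.csv
git add report.csv && git commit -qm "version consumed last week"
v1=$(git rev-parse HEAD)

echo "v2" > report.csv
git commit -qam "current version"

git show "$v1:report.csv"                # prints the old content, "v1"
```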
5. CI/CD for Data – Continuously integrate and deploy new data
Combining the power of the Git operations mentioned above allows us to automate the testing of new data before we expose it to our consumers. We call this process continuous integration/deployment of data. Assuming we have a set of tests we run to ensure the quality of the data, we can create/ingest the data on a branch exposed only to us, and use a webhook, similar to a GitHub Action, to run the tests on the data. The hook allows the data to be merged into a public branch if the tests pass, or blocks the merge if they fail, leaving us with the version of the data at the time of the failure for debugging.
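The gating logic can be sketched in plain shell, with Git standing in for the data repository and a trivial `awk` check standing in for a real data quality test suite (all names and checks are illustrative):

```shell
# Demo: ingest on a private branch, test, and merge only if the tests pass.
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main
git config user.email demo@example.com && git config user.name demo

echo "1,ok" > data.csv
git add data.csv && git commit -qm "published data"

git switch -qc ingest                    # new batch lands on a private branch
echo "2,ok" >> data.csv
git commit -qam "new batch"

# Quality gate (stand-in for a real test suite): every row has two fields.
if awk -F, 'NF != 2 { bad = 1 } END { exit bad }' data.csv; then
  git switch -q main
  git merge -q --no-edit ingest          # expose the new data to consumers
else
  echo "quality checks failed; branch kept for debugging"
fi
```

On failure, the `ingest` branch is left untouched, preserving the exact failing version for debugging, which is the behavior the hook described above provides.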
Should I use Git itself for Data?
By now we all want to apply Git concepts to our data, since we know the value of the engineering best practices presented above. So why not use Git itself? Git is designed to work on files and can version control them, so at first glance it may seem like a plausible solution…
A deeper look will show the opposite. Git is inadequate for managing versions of data lakes. The main reasons are:
- Git was built for human scale. The number of Git operations per minute is expected to be in the thousands, while data pipelines perform operations at machine scale, orders of magnitude beyond that.
- Git requires the data to be managed within its own storage, while a data lake typically lives in object storage; lifting and shifting it into your Git implementation, cloud or on-prem, is not realistic.
- Git supports source code formats, while data lakes may include binaries, unstructured and semi-structured data, and tabular data, with file sizes that span from very small to extremely large (terabytes).
In light of the reasons above, it is recommended to use a tool that is designed for data in order to achieve the benefits of git for data. Luckily, several tools provide these capabilities.
Tools Providing ‘Git for Data’
Several publications cover the tools out there that provide git-for-data capabilities. All of them have benefits and challenges, and should be selected according to the user’s specific use case and what they are optimizing for:
- Dolt – an open-source project building a versioned database, based on an open-source storage engine called Noms. It provides Git-like operations for data. This is an optimal solution if your data is mostly tabular and stored in, or migrated from, a relational database. However, if you are dealing with unstructured data and therefore need high performance and scalability, it is not the optimal solution.
- DVC – a project inspired by Git LFS that gives data scientists and researchers Git LFS-like functionality with additional capabilities suited to data science use cases. This is an optimal solution if your data needs to stay in place. However, since it is designed mainly to cater to data science needs, it is limited in both data retrieval performance and scalability.
- lakeFS – optimized for data operations with many data sources saving data into an object store such as S3, MinIO, Azure Blob, or GCS, and ETLs running on distributed compute systems like Apache Spark or Presto. Those ETLs might be built out of tens, hundreds, or even thousands of small jobs orchestrated by Airflow DAGs, with numerous consumers of the data: ML engineers, BI analysts, or the next person to write an ETL over the data for a new use case. lakeFS is designed to cater to the needs of such operations and of all the producers and consumers involved. It is optimal for format-agnostic data that stays in place, and with its high performance and scalability it is a widely adopted and recommended git-for-data solution.