Einat Orr, PhD.
December 15, 2020

The demand for infrastructure that contributes to the collection, storage, and analysis of data is growing with the increasing amounts of data managed by organizations. Every organization that manages data pipelines to extract insights from data encounters the need for reproducibility, safe experimentation, and means to ensure data quality. The path to answering these needs goes through the ability to version the data and travel between snapshots of the data at different points in time. 

The aspiration would be to treat the data lake as a Git repository. Alas, Git doesn’t scale to petabytes of data and billions of objects. This implies that Git-like operations over data should be provided as infrastructure, rather than as a feature of a single application. In other words, data versioning is an essential part of the infrastructure of your data lake.

Why you want data versioning as infrastructure:

Power all your applications with versioning capabilities

An infrastructure with simple integration allows every application that writes or reads data (Kafka, Spark, Flink, Presto, RStudio, etc.) to enjoy the power of versioning and Git-like operations. Even applications that have their own data versioning capabilities can run on top of the versioning infrastructure. This allows using several applications (e.g. MLOps tools) in parallel.

Infrastructure ensures standardization

  • The API and terminology for versioning and Git-like operations are standardized by using a unified infrastructure.
  • A reference to a certain version of the data is universal within the organization.
  • Several applications can use a given version of the data, or branch from it for different experiments.
  • Permissions and retention of versions can be monitored globally to optimize costs and increase manageability.

Scalability and high performance

A dedicated infrastructure is committed to the performance and scalability of data versioning. It is built to scale with the growth of data across the organization.

Optimizing storage costs

Data versioning as infrastructure is implemented by managing metadata. If an object exists in several branches, it is stored physically only once, and the rest of the logical structure is expressed with pointers. This deduplicates objects and keeps storage costs down.
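To make the pointer idea concrete, here is a minimal sketch in Python. It is a toy model, not how any particular product is implemented: the `VersionedStore` class, its method names, and the file paths are all hypothetical, and content hashing is just one assumed way to detect identical objects.

```python
import hashlib


class VersionedStore:
    """Toy model: each object is stored once; branches only hold pointers."""

    def __init__(self):
        self.objects = {}              # content hash -> bytes (physical storage)
        self.branches = {"main": {}}   # branch name -> {path: content hash}

    def put(self, branch, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(digest, data)   # write physically only if new
        self.branches[branch][path] = digest    # the branch keeps a pointer

    def create_branch(self, name, source):
        # Branching copies pointers (metadata) only, never the data itself.
        self.branches[name] = dict(self.branches[source])

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]

    def physical_bytes(self):
        return sum(len(data) for data in self.objects.values())

    def logical_bytes(self):
        return sum(
            len(self.objects[digest])
            for pointers in self.branches.values()
            for digest in pointers.values()
        )


store = VersionedStore()
store.put("main", "events/day1.parquet", b"x" * 1000)
store.put("main", "events/day2.parquet", b"y" * 1000)
store.create_branch("experiment", "main")
# Only the file that changes on the branch costs additional storage.
store.put("experiment", "events/day2.parquet", b"z" * 1000)

print(store.logical_bytes(), store.physical_bytes())
```

In this sketch the two branches together expose four files, but only three distinct objects are ever stored, because the unchanged file is shared through pointers.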

Change management of applications is possible

Easily replace an application, test a competitor, or evaluate an upgrade. Because the versioning infrastructure is independent of any application, you can run the evaluation in isolation on the data, including (but not limited to) a specific commit of all data sets related to a model.

Visibility and Retention

Data versioning infrastructure provides visibility into the status of all branches, and allows metadata and data retention rules that are based on business logic rather than on how data is ordered within buckets.
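As an illustration of retention driven by business logic, the sketch below decides which objects may be deleted based on per-branch retention rules. Everything here is an assumption made for the example: the `Commit` structure, the prefix-based policy format, and the 30-day default are hypothetical and do not describe any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    branch: str
    age_days: int
    objects: frozenset   # identifiers of the objects this commit points to


def retention_days(branch, policy, default):
    # The first matching branch prefix wins; otherwise use the default.
    for prefix, days in policy:
        if branch.startswith(prefix):
            return days
    return default


def collectable(commits, policy, default=30):
    """Return objects referenced only by commits past their retention period."""
    kept, expired = set(), set()
    for commit in commits:
        if commit.age_days <= retention_days(commit.branch, policy, default):
            kept |= commit.objects
        else:
            expired |= commit.objects
    # An object is safe to delete only if no retained commit still points to it.
    return expired - kept


# Hypothetical policy: keep main for a year, experiments for a week.
policy = [("main", 365), ("experiment/", 7)]
commits = [
    Commit("main", 200, frozenset({"a", "b"})),
    Commit("experiment/new-model", 10, frozenset({"b", "scratch-1"})),
    Commit("experiment/new-model", 3, frozenset({"b", "scratch-2"})),
]

print(collectable(commits, policy))
```

Here the ten-day-old experiment commit has expired, but object `b` survives because `main` still references it; only `scratch-1` is eligible for deletion.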

Functionality

Providing versioning and Git-like operations over data is complex. When done as a by-product of another functionality (MLOps, data mutability, or collaboration), it’s implemented in a very narrow form, which limits the manageability, scalability, and cost effectiveness of the data versioning.

Existing solutions for data versioning as an infrastructure  

There are two different approaches to providing data versioning as infrastructure: one provides the storage layer as well as the Git-like engine, and the other adds Git-like operations to existing object storage.

  • DoltHub (based on the open source project Dolt) provides a versioned database with Git-like operations. Splitgraph migrates your data from its original location to PostgreSQL when it’s used.
  • The open source Project Nessie provides Git-like operations over the Apache Iceberg table format (originally developed at Netflix) by utilizing Iceberg’s metadata for time travel. So while it works over the existing data lake, it requires using a specific data format.
  • lakeFS provides Git-like operations over your existing data lake if it’s on object storage such as S3, GCS, MinIO, or Ceph. It is format agnostic and highly scalable, and you can manage several data repositories, just like you manage several Git repositories for code.

Learn more about how to branch and merge data, and of course, feel free to check out lakeFS on GitHub!
