Last updated on April 26, 2024

Git for data may sound odd at first. But using the Git logic and mechanisms for data lakes makes a lot of sense. After all, software developers figured out long ago how to efficiently collaborate on constantly changing source code with Git. 

We created lakeFS to provide similar capabilities to data folks. It provides data version control capabilities based on Git semantics, only scaled to the size of a typical S3 bucket. 

To help you understand the value of Git for data, here’s an overview of the tools that are part of every lakeFS repository. They’re bound to help with your data strategy and ensure that your organization meets its compliance, quality, and safety requirements.

Commits, branches and tags

Let’s start with the basic versioning mechanisms you can find in lakeFS: making commits, creating branches, adding tags, and merging branches back into main.

Here’s a quick snapshot of how these mechanisms work: 

Commits let you capture a specific state of all the data managed in the repository. By using commits, you can browse a repository at a certain point in its history, knowing that the data you see is precisely as it was when it was committed. 
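For instance, reading an object exactly as it existed at a past commit might look like the sketch below, using the high-level lakeFS Python SDK (the `lakefs` package). The repository name, commit ID, and object path are placeholders, and the exact method signatures should be checked against the SDK docs:

```python
import lakefs

# Open an existing repository (hypothetical name).
repo = lakefs.repository("example-repo")

# Pin a reference to a specific commit ID (placeholder value) and read an
# object exactly as it was when that commit was made.
historical = repo.ref("c1a2b3d4")
with historical.object("datasets/events.parquet").reader(mode="rb") as f:
    data = f.read()
```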

Branches allow you to create development, staging, and production environments without copying any data. This opens the door to building resilient data pipelines using the sandboxed pipeline pattern.

Tags come in handy for building reproducible ML experiments and models: they provide a reliable and efficient way to snapshot your datasets, so you can trace every experiment back to the exact state of the data it was run on. Another good use case for tags is collaborating with other data scientists and engineers, since consistent use of tags ensures that everyone is looking at the same version of the data.
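As a sketch of that workflow (same Python SDK, hypothetical names; treat exact signatures as assumptions), tagging an experiment and reading back the data it used could look like this:

```python
import lakefs

repo = lakefs.repository("example-repo")

# Tag the current state of the experiments branch so the exact dataset used
# for training can always be recovered later.
repo.tag("experiment-2024-04-26").create("experiments")

# Anyone on the team can later read from the tag and see the same data.
tagged = repo.ref("experiment-2024-04-26")
print([obj.path for obj in tagged.objects(prefix="training-data/")])
```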

How do they all come together? Here’s an example flow:

  1. Create a new branch from main to get an immediate “copy” of your production data.
  2. Apply changes on the isolated branch to understand their impact before exposing them to other users or data consumers.
  3. Finally, merge the feature branch into the main branch to atomically promote the improvements to production.

lakeFS makes this pattern easy to follow, enabling a more efficient data distribution approach that reliably delivers data assets you can trust.
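A minimal sketch of this flow with the lakeFS Python SDK might look as follows; repository, branch, and path names are made up, and the exact method signatures should be verified against the SDK documentation:

```python
import lakefs

repo = lakefs.repository("example-repo")
main = repo.branch("main")

# 1. Create an isolated branch from main (metadata only, no data is copied).
feature = repo.branch("new-ingest-logic").create(source_reference="main")

# 2. Apply changes on the isolated branch and commit them.
feature.object("daily/2024-04-26/events.parquet").upload(data=b"...")
feature.commit(message="Re-ingest daily events with the new logic")

# 3. Atomically promote the validated changes to production.
feature.merge_into(main)
```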

Hooks

The merge flow described above often includes an additional step called a pre-merge hook. What is a hook, and why should you care?

lakeFS hooks let you automate a set of checks and validations and ensure they’re performed prior to crucial lifecycle events.

lakeFS hooks are conceptually similar to Git hooks, but unlike Git, they run remotely on the lakeFS server, ensuring they are executed whenever the relevant event is triggered.

lakeFS hooks have two powerful use cases:

  1. Quality gates – by adding pre-merge hooks, you only allow high-quality data to be merged into the production branch.
  2. Automating common data operations – you can use post-commit or post-merge hooks to register new data in a data catalog.

For all event types, returning an error will force lakeFS to halt the operation and communicate the failure to the requesting user.

This is a powerful guarantee: 

You can now codify and automate the rules and procedures that all data lake participants must follow.
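To make this concrete, one common way to implement such a rule is a pre-merge webhook: lakeFS calls a URL you configure before completing the merge, and any non-2xx response fails the action and blocks the merge. Below is a minimal sketch of such a receiver written in Python with Flask; the payload fields and the validation logic are simplified assumptions rather than the exact webhook schema:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhooks/validate", methods=["POST"])
def validate_pre_merge():
    event = request.get_json(force=True)

    # Hypothetical check: refuse merges into main from branches whose name
    # contains "tmp". A real hook would run schema or data-quality checks here.
    source_ref = event.get("source_ref", "")
    if "tmp" in source_ref:
        # Returning a non-2xx status causes lakeFS to fail the pre-merge
        # action and block the merge.
        return jsonify({"error": "temporary branches may not be merged"}), 400

    return jsonify({"status": "ok"}), 200

if __name__ == "__main__":
    app.run(port=8080)
```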

lakeFS comes with a Lua Virtual Machine that lets you run hooks directly in lakeFS without having to rely on any external components.

Access control policies

Data lake platforms often lack straightforward data governance enforcement. Data governance rules are demanding to begin with, let alone with the added complications of maintaining data in a data lake. As a result, implementing them is a costly, time-consuming, ongoing activity that demands constant monitoring. Typically, this comes at the expense of data engineering or other business-enhancing DevOps efforts.

lakeFS includes a few handy features that help to establish governance at scale in a simple, rapid, and transparent manner. 

The first is Role-Based Access Control (RBAC), which allows you to specify exactly what users can do and where.

For example, data engineers who maintain the data platform, including lakeFS, can be allowed to manage the repository and its branches while being restricted from reading any sensitive data stored inside it.

Granting access to a role rather than a user is efficient and flexible. If you have a dozen data scientists who need to be able to create their own branches from production data to run experiments against, making that change in one Data Scientist role configuration will be faster than getting a list of all the data scientists and changing the permissions for each one individually.

In lakeFS, RBAC allows for applying granular access controls at the repository, path, and specific object levels. 
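For illustration, a lakeFS policy is a JSON document of allow/deny statements, similar in spirit to IAM policies. The sketch below (written as a Python dict) shows roughly what a read-and-branch policy scoped to a single repository could look like; the action names and resource ARN follow the patterns in the lakeFS authorization docs, but treat the exact strings as assumptions:

```python
# A hypothetical policy for a "DataScientist" role: it may list and read
# objects and create branches in one repository, but nothing else.
data_scientist_policy = {
    "id": "DataScientistExampleRepo",
    "statement": [
        {
            "action": ["fs:ReadObject", "fs:ListObjects", "fs:CreateBranch"],
            "effect": "allow",
            "resource": "arn:lakefs:fs:::repository/example-repo/*",
        }
    ],
}
```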

Branch protection rules

Another lakeFS feature that comes in handy for data governance is branch protection.

Branch protection allows for enforcing quality control and governance with hooks. Once a branch is protected, no one can write directly to it; users can only merge other branches into it. This ensures hooks can’t be bypassed. It’s common to define your “production” or “main” branch as protected.

Using pre-merge hooks, you can validate your data before it reaches your critical branches and is exposed to customers.

Garbage Collection

Finally, there’s garbage collection. 

When it comes to data management, teams must balance the ability to restore and recover previous data with the ability to discard outdated data. Data deletion may be a key step for cost reduction. 

Furthermore, to comply with regulatory obligations (such as GDPR, FINRA, and HIPAA), companies must dispose of data in line with privacy rights and data retention policies and standards. 

The Garbage Collection feature in lakeFS helps to achieve the following goals:

  1. Keep costs in check by retiring old commits that should no longer be consumed, hard-deleting their data.
  2. Meet regulatory and compliance needs such as GDPR by ensuring PII and other sensitive data are effectively deleted.

Note: Garbage Collection is offered as a managed service in lakeFS Cloud. Users of the open-source version can run and manage Garbage Collection periodically themselves.
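In either case, retention is driven by garbage collection rules defined per repository: a default retention period plus optional per-branch overrides. A sketch of such rules, expressed here as a Python dict mirroring the JSON shape described in the docs (field names should be verified), might look like this:

```python
# Hypothetical garbage collection rules: data referenced only by old commits
# is kept for 21 days by default, 30 days on main, and 7 days on dev.
gc_rules = {
    "default_retention_days": 21,
    "branches": [
        {"branch_id": "main", "retention_days": 30},
        {"branch_id": "dev", "retention_days": 7},
    ],
}
```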

Data imports

All these Git mechanisms and data governance features look great, but how do you get started?

All it takes is importing data to lakeFS. But trust me, it’s not the usual kind of import you have in mind.

lakeFS import allows external data to be brought into a lakeFS repository without moving the data at all. lakeFS uses copy-on-write to achieve this.

The usual way to get started with lakeFS is by importing existing data and then using branches to create a development or test environment.
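A sketch of such an import with the Python SDK might look like the following; the `import_data` flow and the bucket and prefix names are illustrative assumptions and should be verified against the current SDK documentation:

```python
import lakefs

repo = lakefs.repository("example-repo")
ingest = repo.branch("ingest")

# Import existing objects from an external bucket into the branch without
# copying the data itself; lakeFS only records metadata (copy-on-write).
importer = ingest.import_data(commit_message="Import historical raw data")
importer.prefix("s3://company-data-lake/raw/", destination="raw/")
importer.run()  # blocks until the import commit is created
```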

Have a look at this video to learn more about the process of importing data from a data lake into lakeFS. And check out this guide to importing multiple data buckets into lakeFS to get started.

Wrap up

By using these features in lakeFS, you can build a data strategy that helps you meet your organization’s quality, governance, and resilience requirements.

Features such as Role-Based Access Control, data versioning, and managed garbage collection enable efficient data lake governance. This enhances data quality and boosts data consumption for decision-making, resulting in operational improvements, better-informed business strategies, and greater financial performance.
