Oz Katz

June 2, 2021

One of the most common questions we receive from existing and potential users of lakeFS is “Can it work with Delta Tables?”

It’s understandable why we hear this question given Delta’s rapid adoption and advanced capabilities including:

  • Table-level ACID operations
  • Data mutations including deletes and “in-place” updates
  • Advanced partitioning and indexing abilities (with z-order)

While the above features are powerful, combining Delta tables with a lakeFS repository is even more powerful. With Delta Lake and lakeFS together you can enable additional data safety guarantees while simplifying operations.

For example:

  • ACID operations can span across multiple Delta tables
  • CI/CD hooks can be used to validate data quality and even ensure referential integrity
  • Tables can be cloned in zero-copy fashion, without duplicating data

lakeFS & Delta in Action

To prove this point, we’ll demonstrate how to guarantee data quality in a Delta table by integrating lakeFS branches and hooks into the workflow for adding new data.

We’ll start by creating two Delta tables, representing loans and loan payments on top of data stored in a lakeFS repository. To do this we’ll use a Databricks notebook, configured to use lakeFS as the underlying storage:

[Screenshot: creating the Delta tables in a Databricks notebook]
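A minimal sketch of what these commands can look like; the repository name loans-repo, the source files and the table locations are assumptions:

    # Databricks notebook -- `spark` is provided by the runtime. lakeFS is configured
    # as the S3 endpoint, so object paths take the form s3a://<repository>/<branch>/<path>.
    loans_df = spark.read.option("header", True).csv("s3a://loans-repo/main/raw/loans.csv")
    payments_df = spark.read.option("header", True).csv("s3a://loans-repo/main/raw/loan_payments.csv")

    # Write both datasets as Delta tables inside the lakeFS repository, on the main branch
    loans_df.write.format("delta").save("s3a://loans-repo/main/tables/loans")
    payments_df.write.format("delta").save("s3a://loans-repo/main/tables/loan_payments")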

Once we run the above commands, we will see the new Delta files (the data files and the _delta_log transaction log) added to the repository as uncommitted changes. In lakeFS, we can commit them on the active branch through the UI, the lakectl CLI, or the API.
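For example, a rough sketch of the commit step through the lakeFS API from Python; the endpoint, credentials and repository name are placeholders:

    import requests

    LAKEFS = "https://lakefs.example.com"               # placeholder endpoint
    AUTH = ("<access_key_id>", "<secret_access_key>")   # lakeFS credentials

    # Commit the uncommitted Delta files (data + _delta_log) on the main branch
    resp = requests.post(
        f"{LAKEFS}/api/v1/repositories/loans-repo/branches/main/commits",
        auth=AUTH,
        json={"message": "Create loans and loan_payments Delta tables"},
    )
    resp.raise_for_status()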

Great! Now let’s create a second notebook containing a set of validation rules between the two loan tables that will serve as the data quality pre-merge hook.

[Screenshot: validation notebook for the loan tables]
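Roughly, the checks can look like this; the table locations, column names and the branch variable are assumptions (in practice the branch to validate would be passed to the notebook as a parameter):

    # The branch to validate -- hard-coded here, but in practice passed in as a
    # notebook parameter (the ingestion branch introduced below)
    branch = "dev-reports"
    base = f"s3a://loans-repo/{branch}/tables"

    spark.read.format("delta").load(f"{base}/loans").createOrReplaceTempView("loans")
    spark.read.format("delta").load(f"{base}/loan_payments").createOrReplaceTempView("loan_payments")

    # Check 1: every payment must reference an existing loan
    orphans = spark.sql("""
        SELECT COUNT(*) AS cnt
        FROM loan_payments p LEFT JOIN loans l ON p.loan_id = l.loan_id
        WHERE l.loan_id IS NULL
    """).first()["cnt"]

    # Check 2: no payment may exceed the total amount of its loan
    oversized = spark.sql("""
        SELECT COUNT(*) AS cnt
        FROM loan_payments p JOIN loans l ON p.loan_id = l.loan_id
        WHERE p.payment_amount > l.loan_amount
    """).first()["cnt"]

    # Failing the notebook fails the Databricks job, and in turn the lakeFS hook
    assert orphans == 0, f"{orphans} payments reference a missing loan"
    assert oversized == 0, f"{oversized} payments exceed their loan amount"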

As you can see above, we have created two data validation checks in the form of SQL queries:

  1. Check for referential integrity between the loan_payments foreign key and the loans primary key
  2. Check that no payment is ever higher than the total amount of the loan

Since branching and merging in lakeFS are zero-copy metadata operations, we can use a separate branch from main for ingesting new files. This way, new data is added in isolation, and a lakeFS hook can run the validation before the data is merged back to main.

The first step is to create the lakeFS branch, which we will call dev-reports. We can create it using the API, CLI or the lakeFS UI:
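For example, through the API from Python (same placeholder endpoint and credentials as in the commit sketch above):

    import requests

    LAKEFS = "https://lakefs.example.com"               # placeholder endpoint
    AUTH = ("<access_key_id>", "<secret_access_key>")   # lakeFS credentials

    # Create the dev-reports branch from main -- a zero-copy metadata operation
    resp = requests.post(
        f"{LAKEFS}/api/v1/repositories/loans-repo/branches",
        auth=AUTH,
        json={"name": "dev-reports", "source": "main"},
    )
    resp.raise_for_status()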

In the proposed branching scheme, we’ll have a main and a dev-reports branch. Most consumers should read from the main branch, where data is guaranteed to be tested and validated.

A consumer that is OK reading “dirty” data in order to see the absolute latest can do so from the dev-reports branch:

[Diagram: the main and dev-reports branches in lakeFS]
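Concretely, the branch is just part of the object path, so the same table can be read from either branch (paths are the assumed locations from the sketches above):

    # Validated, production-grade data: read the Delta table from main
    validated = spark.read.format("delta").load("s3a://loans-repo/main/tables/loan_payments")

    # Freshest (possibly unvalidated) data: read the same table from dev-reports
    latest = spark.read.format("delta").load("s3a://loans-repo/dev-reports/tables/loan_payments")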

Automating Data Deployment with lakeFS Hooks

To provide this guarantee, we’ll configure the tests we created to run automatically before we expose new data to consumers.

To do this, let’s first create a Databricks Job that executes the validation notebook created earlier:

[Screenshot: creating the Databricks Job for the validation notebook]
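If you’d rather script the job than click through the UI, here is a rough sketch using the Databricks Jobs 2.0 API; the notebook path, cluster settings, host and token are placeholders:

    import requests

    DATABRICKS_HOST = "https://<workspace>.cloud.databricks.com"   # placeholder
    HEADERS = {"Authorization": "Bearer <databricks-token>"}

    # Create a job that runs the validation notebook on a small job cluster
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/jobs/create",
        headers=HEADERS,
        json={
            "name": "validate-loan-tables",
            "new_cluster": {
                "spark_version": "8.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 1,
            },
            "notebook_task": {"notebook_path": "/Users/me/validate-loan-records"},
        },
    )
    resp.raise_for_status()
    job_id = resp.json()["job_id"]   # the webhook below will run this job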

We can define a lakeFS webhook by uploading a config file (shown below) under the _lakefs_actions/ prefix on the main branch. This will automatically execute the job as a pre-merge hook on main:
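The action file looks along these lines; the action name, hook id and webhook URL are placeholders to adapt:

    name: loan quality checks
    description: Run the Databricks validation job before merging into main
    on:
      pre-merge:
        branches:
          - main
    hooks:
      - id: validate_loan_tables
        type: webhook
        properties:
          url: "http://<webhook-host>:5000/webhooks/validate"

Committing this file under the _lakefs_actions/ prefix on main registers the hook.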

Let’s save this file and deploy the following Flask webhook to execute the Databricks job:
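A minimal sketch of such a webhook, assuming the Databricks workspace URL, an API token and the job ID are provided via environment variables:

    import os
    import time

    import requests
    from flask import Flask, jsonify

    app = Flask(__name__)

    # Assumed environment: Databricks workspace URL, an API token, and the ID of
    # the validation job created above.
    DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]
    HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
    JOB_ID = int(os.environ["VALIDATION_JOB_ID"])


    @app.route("/webhooks/validate", methods=["POST"])
    def validate():
        # lakeFS posts the event details (repository, branch, commit, ...) as JSON;
        # this sketch doesn't need them -- it simply runs the validation job.

        # Trigger the validation job via the Databricks Jobs 2.0 API
        run = requests.post(
            f"{DATABRICKS_HOST}/api/2.0/jobs/run-now",
            headers=HEADERS,
            json={"job_id": JOB_ID},
        )
        run.raise_for_status()
        run_id = run.json()["run_id"]

        # Poll until the run reaches a terminal state
        while True:
            state = requests.get(
                f"{DATABRICKS_HOST}/api/2.0/jobs/runs/get",
                headers=HEADERS,
                params={"run_id": run_id},
            ).json()["state"]
            if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
                break
            time.sleep(10)

        # lakeFS treats any non-2xx response as a hook failure, which blocks the merge
        if state.get("result_state") == "SUCCESS":
            return jsonify({"detail": "validation passed"}), 200
        return jsonify({"detail": f"validation failed: {state}"}), 412

Deploy it anywhere the lakeFS server can reach, and point the url in the action file above at it.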

For reusable webhooks that you could simply deploy and use, check out the examples in the lakeFS-hooks repository!

Now, let’s insert a “bad” record into our loan_payments table: this record refers to a loan that doesn’t exist.

[Screenshot: inserting the bad record into loan_payments]
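With the assumed schema from the validation sketch, appending the bad record to the table on the dev-reports branch might look like this:

    from pyspark.sql import Row

    # A payment that references loan_id 99999, which does not exist in the loans table
    bad_payment = spark.createDataFrame(
        [Row(payment_id="pmt-0001", loan_id=99999, payment_amount=1500.00)]
    )

    # Append it to the loan_payments table on the ingestion branch (dev-reports)
    bad_payment.write.format("delta").mode("append").save(
        "s3a://loans-repo/dev-reports/tables/loan_payments"
    )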

Let’s attempt to commit and merge this change into our main branch:
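Using the lakeFS API again (same placeholder endpoint and credentials as before), the attempt looks roughly like this:

    import requests

    LAKEFS = "https://lakefs.example.com"               # placeholder endpoint
    AUTH = ("<access_key_id>", "<secret_access_key>")   # lakeFS credentials

    # Commit the new data on the ingestion branch
    requests.post(
        f"{LAKEFS}/api/v1/repositories/loans-repo/branches/dev-reports/commits",
        auth=AUTH,
        json={"message": "Add new loan payments"},
    ).raise_for_status()

    # Try to merge dev-reports into main -- this is what triggers the pre-merge hook.
    # If the validation job fails, lakeFS rejects the merge and returns an error status.
    merge = requests.post(
        f"{LAKEFS}/api/v1/repositories/loans-repo/refs/dev-reports/merge/main",
        auth=AUTH,
        json={"message": "Merge validated loan payments into main"},
    )
    print(merge.status_code, merge.text)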

Hurray! The pre-merge hook on main failed, the merge was rejected, and consumers of that branch will never see this record in the dataset.

Wrapping Up

When using lakeFS together with Delta, we can introduce changes to data and schema safely, providing powerful guarantees about the data contained within. 

In this architecture, each technology is responsible for what it was designed for: Delta Lake for scalable, transaction-friendly tables, and lakeFS for managing the data lifecycle.
