One of the most common questions we receive from existing and potential users of lakeFS is “Can it work with Delta Tables?”
It’s understandable why we hear this question given Delta’s rapid adoption and advanced capabilities including:
- Table-level ACID operations
- Data mutations including deletes and “in-place” updates
- Advanced partitioning and indexing abilities (with z-order)
The answer is yes, and running Delta on top of lakeFS adds further capabilities:
- ACID operations can span across multiple Delta tables
- CI/CD hooks can be used to validate data quality and even ensure referential integrity
- Tables can be cloned in zero-copy fashion, without duplicating data
lakeFS & Delta in Action
To prove this point, we’ll demonstrate how to guarantee data quality in a Delta table by building lakeFS branches and hooks into the workflow for adding new data.
We’ll start by creating two Delta tables, representing loans and loan payments on top of data stored in a lakeFS repository. To do this we’ll use a Databricks notebook, configured to use lakeFS as the underlying storage:
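As a rough sketch of what such a notebook could contain, the function below writes two small Delta tables to a lakeFS branch through `s3a://<repository>/<branch>/<path>` URIs. The repository name `example-repo`, the column names, and the sample rows are all hypothetical. It also assumes the cluster's S3A settings (`fs.s3a.endpoint` and the access and secret keys) already point at the lakeFS installation.

```python
def lakefs_path(repo, branch, table):
    """Build the s3a:// URI of a table stored on a lakeFS branch."""
    return f"s3a://{repo}/{branch}/{table}"


def create_loan_tables(spark, repo="example-repo", branch="main"):
    """Write the loans and loan_payments Delta tables (sample rows are made up)."""
    loans = spark.createDataFrame(
        [(1, 10000.0), (2, 2500.0)],
        schema="loan_id INT, amount DOUBLE",
    )
    payments = spark.createDataFrame(
        [(100, 1, 500.0), (101, 2, 250.0)],
        schema="payment_id INT, loan_id INT, amount DOUBLE",
    )
    # Assumes the cluster's S3A endpoint and credentials are set to lakeFS.
    loans.write.format("delta").mode("overwrite").save(
        lakefs_path(repo, branch, "loans")
    )
    payments.write.format("delta").mode("overwrite").save(
        lakefs_path(repo, branch, "loan_payments")
    )
```

In a Databricks notebook, where `spark` is already defined, this would be called as `create_loan_tables(spark)`.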
Once the notebook has run, the Delta metadata files (the _delta_log directory of each table) are added to the repository. In lakeFS, we can then commit them on the active branch.
Great! Now let’s create a second notebook containing a set of validation rules between the two loan tables that will serve as the data quality pre-merge hook.
The notebook defines two data validation checks in the form of SQL queries:
- Check referential integrity between the loan_payments foreign key and the loans primary key
- Check that no payment is ever higher than the total amount of the loan
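The two checks could be expressed roughly as follows. This is a sketch, not the original notebook: the column names (`loan_id`, `payment_id`, `amount`) are assumptions, and each query is written to return the offending rows, so an empty result means the check passes.

```python
# Each query selects the rows that violate a rule; no rows means the rule holds.
CHECKS = {
    "payments_reference_existing_loans": """
        SELECT p.payment_id
        FROM loan_payments p
        LEFT JOIN loans l ON p.loan_id = l.loan_id
        WHERE l.loan_id IS NULL
    """,
    "payments_do_not_exceed_loan_amount": """
        SELECT p.payment_id
        FROM loan_payments p
        JOIN loans l ON p.loan_id = l.loan_id
        WHERE p.amount > l.amount
    """,
}


def failed_checks(run_query):
    """Return the names of the checks whose query produced at least one row.

    run_query is any callable that takes SQL text and returns a list of rows.
    """
    return [name for name, sql in CHECKS.items() if len(run_query(sql)) > 0]


def validate(run_query):
    """Raise if any check fails, so the notebook run (and the job) fails too."""
    failures = failed_checks(run_query)
    if failures:
        raise ValueError(f"data quality checks failed: {failures}")
```

In a notebook, assuming both tables are registered under the names `loans` and `loan_payments`, this could be invoked as `validate(lambda sql: spark.sql(sql).collect())`.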
Since branching and merging in lakeFS are zero-copy metadata operations, we can ingest new files on a branch separate from main. This way, new data is added in isolation and can be tested by a lakeFS hook that runs the validation before the data is merged back to main.
The first step is to create the lakeFS branch, which we will call dev-reports. We can create it using the API, CLI or the lakeFS UI:
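For the API route, a minimal sketch using only the Python standard library might look like this. The endpoint URL, repository name, and credentials are placeholders. The request targets the lakeFS branch-creation endpoint and authenticates with the access key pair over HTTP Basic auth.

```python
import base64
import json
import urllib.request


def create_branch_request(endpoint, repo, name, source, access_key, secret_key):
    """Build the POST request that creates branch `name` from `source`."""
    credentials = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()
    return urllib.request.Request(
        f"{endpoint}/api/v1/repositories/{repo}/branches",
        data=json.dumps({"name": name, "source": source}).encode(),
        method="POST",
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
    )


# CLI equivalent:
#   lakectl branch create lakefs://example-repo/dev-reports \
#       --source lakefs://example-repo/main
```

Passing the returned object to `urllib.request.urlopen` sends the request; building it separately keeps the example easy to inspect before anything touches the server.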
In the proposed branching scheme, we’ll have a main and a dev-reports branch. Most consumers should read from the main branch, where data is guaranteed to be tested and validated.
A consumer that is OK with reading “dirty” data in order to see the absolute latest can do so from the dev-reports branch:
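Because the branch name is part of the path, switching between validated and latest data is a one-word change. The helper below is a sketch with a hypothetical repository name (`example-repo`); it assumes Spark is configured to reach lakeFS through `s3a://`.

```python
def read_table(spark, table, branch="main", repo="example-repo"):
    """Load a Delta table from the given lakeFS branch (main by default)."""
    return spark.read.format("delta").load(f"s3a://{repo}/{branch}/{table}")
```

For example, `read_table(spark, "loan_payments")` reads the validated data, while `read_table(spark, "loan_payments", branch="dev-reports")` reads the latest, unvalidated data.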
Automating Data Deployment with lakeFS Hooks
To provide this guarantee, we’ll configure the tests we created to run automatically before we expose new data to consumers.
To do this, let’s first create a Databricks Job that executes the validation notebook created earlier:
We can define a lakeFS webhook by uploading a config file under the _lakefs_actions/ prefix to the main branch. This will automatically execute the job as a pre-merge hook on main:
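To give a sense of the file's shape, the helper below renders a lakeFS action definition with a single pre-merge webhook. The action name, hook ID, description, and URL are illustrative placeholders rather than the values used in the original setup.

```python
def render_action(branch, hook_id, webhook_url):
    """Render a lakeFS action file that calls a webhook before merges into `branch`."""
    return (
        "name: Validate Delta tables before merge\n"
        "on:\n"
        "  pre-merge:\n"
        "    branches:\n"
        f"      - {branch}\n"
        "hooks:\n"
        f"  - id: {hook_id}\n"
        "    type: webhook\n"
        "    description: Run the Databricks validation job\n"
        "    properties:\n"
        f'      url: "{webhook_url}"\n'
    )


# Hypothetical values; the URL must point at wherever the webhook is deployed.
print(render_action("main", "validate_loans",
                    "http://hooks.example.com/webhooks/validate-loans"))
```

The printed text would then be saved as a YAML file (for example `_lakefs_actions/validate_loans.yaml`, a name chosen here for illustration) and committed.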
Let’s save this file and deploy the following Flask webhook to execute the Databricks job:
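A webhook along these lines could look like the sketch below; it is not the original implementation. It triggers the job through the Databricks Jobs API (`run-now`), polls `runs/get` until the run finishes, and answers with a non-2xx status when validation fails, which is what makes lakeFS reject the merge. The environment variable names, route, and job ID are assumptions.

```python
import json
import os
import time
import urllib.request

# Hypothetical settings; replace with your workspace URL, token, and job ID.
DATABRICKS_HOST = os.environ.get("DATABRICKS_HOST", "https://example.cloud.databricks.com")
DATABRICKS_TOKEN = os.environ.get("DATABRICKS_TOKEN", "")
VALIDATION_JOB_ID = int(os.environ.get("VALIDATION_JOB_ID", "0"))


def databricks_api(method, path, payload=None):
    """Call the Databricks REST API and return the decoded JSON response."""
    data = json.dumps(payload).encode() if payload is not None else None
    request = urllib.request.Request(
        DATABRICKS_HOST + path,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {DATABRICKS_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def validation_passed(job_id, call_api=databricks_api, poll_seconds=10):
    """Trigger the job, wait for it to finish, and report whether it succeeded."""
    run_id = call_api("POST", "/api/2.1/jobs/run-now", {"job_id": job_id})["run_id"]
    while True:
        state = call_api("GET", f"/api/2.1/jobs/runs/get?run_id={run_id}")["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return state.get("result_state") == "SUCCESS"
        time.sleep(poll_seconds)


def create_app():
    """Build the Flask app that lakeFS calls before each merge into main."""
    # Imported here so the polling helper stays usable without Flask installed.
    from flask import Flask

    app = Flask(__name__)

    @app.route("/webhooks/validate-loans", methods=["POST"])
    def validate_loans():
        if validation_passed(VALIDATION_JOB_ID):
            return "validation passed", 200
        return "validation failed", 412  # any non-2xx response fails the hook

    return app
```

Running it locally would be `create_app().run(port=8080)`. Since the handler blocks until the job ends, the hook's timeout has to be long enough to cover the job's runtime.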
For reusable webhooks that you could simply deploy and use, check out the examples in the lakeFS-hooks repository!
Now, let’s insert a “bad” record into our loan_payments table: one that refers to a loan that doesn’t exist.
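One way to produce such a record is to append a payment whose `loan_id` matches no row in loans, writing to the dev-reports branch so main stays untouched. The IDs, amount, column names, and repository name below are invented for illustration.

```python
def insert_bad_payment(spark, repo="example-repo", branch="dev-reports"):
    """Append a payment that points at a non-existent loan (loan_id 99999)."""
    bad_payment = spark.createDataFrame(
        [(9001, 99999, 100.0)],  # hypothetical row: no loan has ID 99999
        schema="payment_id INT, loan_id INT, amount DOUBLE",
    )
    # Appending on the dev-reports branch leaves readers of main unaffected.
    bad_payment.write.format("delta").mode("append").save(
        f"s3a://{repo}/{branch}/loan_payments"
    )
```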
Let’s attempt to commit and merge this change into our main branch:
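Both steps can be done from the UI or with `lakectl commit` and `lakectl merge`. Through the API, a sketch might look like the following, with the endpoint, repository name, and credentials as placeholders. A merge blocked by a failing pre-merge hook is expected to come back as an HTTP error status rather than a 2xx response.

```python
import base64
import json
import urllib.error
import urllib.request


def lakefs_post(endpoint, path, payload, access_key, secret_key):
    """Build an authenticated POST request for the lakeFS API."""
    token = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()
    return urllib.request.Request(
        f"{endpoint}/api/v1{path}",
        data=json.dumps(payload).encode(),
        method="POST",
        headers={"Authorization": f"Basic {token}", "Content-Type": "application/json"},
    )


def commit_request(endpoint, repo, branch, message, access_key, secret_key):
    """Request that commits the uncommitted changes on `branch`."""
    path = f"/repositories/{repo}/branches/{branch}/commits"
    return lakefs_post(endpoint, path, {"message": message}, access_key, secret_key)


def merge_request(endpoint, repo, source, destination, access_key, secret_key):
    """Request that merges `source` into `destination`."""
    path = f"/repositories/{repo}/refs/{source}/merge/{destination}"
    return lakefs_post(endpoint, path, {}, access_key, secret_key)


def send(request):
    """Send a request and return the HTTP status, including error statuses."""
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Calling `send(commit_request(...))` followed by `send(merge_request(...))` would then show the merge being refused while the bad record is present.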
Hurray! The pre-merge hook on main failed, so consumers of that branch will never see this record in the dataset.
When using lakeFS together with Delta, we can introduce changes to data and schema safely, providing powerful guarantees about the data contained within.
In this architecture, each technology is responsible for what it was designed to do: Delta Lake for scalable, transaction-friendly tables, and lakeFS for managing the data lifecycle.
Want to learn more?
Check out our GitHub repo and say “Hi” in our Slack group!
Further reading:
- lakeFS Hooks: Implementing CI/CD for Data using Pre-merge Hooks
- 3 Data lake Anti-Patterns to Avoid
- Hudi, Iceberg and Delta Lake: Data Lake Table Formats Compared