By Amit Kesarwani and Jonathan Rosenberg

Last updated on March 18, 2024

Efficient data management is a critical component of any modern organization. 

As data volumes grow and data sources become more diverse, the need for robust data catalog solutions becomes increasingly evident. Recognizing this need, lakeFS, an open-source data lake management platform, has integrated with Unity Catalog, a comprehensive data catalog solution by Databricks.

In this blog post, we will explore the exciting features and benefits of this integration and how it simplifies data management workflows.

Unity Catalog by Databricks

Unity Catalog is a unified governance solution for all data and AI assets including files, tables, machine learning models, and dashboards in your lakehouse on any cloud.

It provides a centralized solution for cataloging, organizing, and managing diverse data sources, making it easier for data engineers, data scientists, and analysts to find and utilize the data they need. With features such as data discovery, data lineage, and governance capabilities, Unity Catalog enables teams to unlock the true potential of their data.

Seamless Integration with lakeFS

The integration between lakeFS and Unity Catalog brings a range of benefits to organizations working with large-scale, complex data.

Full Data Versioning for Unity tables

By integrating lakeFS with Unity Catalog, organizations gain powerful data versioning capabilities. 

lakeFS allows users to version their data assets, capturing changes over time. This feature enables teams to track modifications, compare different versions, and easily revert to previous states if necessary. Users can now query tables as they appear in different branches or tags in lakeFS.

Collaboration and Teamwork

Using this integration, it’s now possible to expose changes to stakeholders using standard SQL tables. Using the isolated, zero-copy branching provided by lakeFS, users can modify tables and automatically expose their changes as Unity tables. All consumers have to do is select the name of the lakeFS branch and use it as the Unity schema name.
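To make this concrete, here is a sketch of what cross-branch querying could look like in a SQL cell once tables are exported. The catalog, schema, and table names are illustrative placeholders, not part of the tutorial below:

%sql
-- Hypothetical names: "my_catalog" is the Unity catalog; "main" and "dev" are
-- lakeFS branches that the exporter has surfaced as Unity schemas.
-- Rows present on main but not on dev:
SELECT * FROM my_catalog.main.my_table
EXCEPT
SELECT * FROM my_catalog.dev.my_table;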

Enhanced Data Governance

With the integration, organizations can establish robust data governance practices. lakeFS comes with a powerful hook system allowing users to control exactly which changes are permitted, validating both data and metadata, while Unity allows defining fine-grained access controls at the table and even column level. This combination makes it easy for security teams to define controls and guardrails to protect their most important asset – their data.
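On the Unity side, those fine-grained controls are just standard SQL grants on the branch-backed schemas. A hedged example follows (the catalog, schema, and the analysts group are assumptions; the lakeFS actions/hook mechanism itself is demonstrated in Step 6 of the tutorial below):

%sql
-- Hypothetical principal "analysts": allow read-only access to every table in
-- the schema the exporter created for the "main" branch.
GRANT USE SCHEMA ON SCHEMA my_catalog.main TO `analysts`;
GRANT SELECT ON SCHEMA my_catalog.main TO `analysts`;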

Unlock the full power (and cost benefits) of serverless data warehousing

Enigma uses lakeFS to produce verified data. Their processes write data to a production table, and the lakeFS Unity Catalog export feature is configured to expose these tables on the production branch as well as on several other branches.

Data scientists can then use Databricks Unity Catalog to examine the tables, inspect their schema, and query their data. Unity supports SQL and serverless queries, meaning Enigma’s data scientists can work without managing Spark clusters.

Unity Catalog Integration: How does it work?

Leveraging the external tables feature in Unity Catalog, lakeFS registers Delta Lake tables exported from lakeFS as external tables in Unity Catalog, where they can be accessed like any other table. The step-by-step tutorial below walks through configuring a Lua hook that exports Delta Lake tables from lakeFS and registers them in Unity Catalog.

Tables are defined in the lakeFS repository using a short YAML file that maps Delta Lake tables in the repository to table names in Unity Catalog. Once a Delta Lake table is exported from lakeFS, the lakeFS branch it belongs to becomes visible in Unity Catalog as a schema.
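As a sketch, such a table descriptor (stored under _lakefs_tables/ in the repository, here as _lakefs_tables/famous_people.yaml) contains the same fields that Step 4 of the tutorial writes programmatically:

name: famous_people
type: delta
path: tables/famous_people
catalog: lakefs_unity_catalog_demo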

lakeFS Unity Catalog Integration: How it Works

Prerequisites

1. AWS Credentials with S3 access
2. lakeFS Server (you can deploy one independently or use the hosted lakeFS Cloud solution, free for 30 days).
3. lakeFS Credentials (Key & Secret) or the permissions to create those credentials.
4. Databricks Service Principal, e.g. create a “unity-exporter-hook” service principal:
Databricks Service Principal

5. The service principal has the “Service principal: Manager” privilege over itself:
lakeFS Unity Catalog integration: Service principal details

6. The service principal has “Databricks SQL access” and “Workspace access” entitlements:
lakeFS Unity Catalog integration: unity exporter hook

7. The service principal has token usage permissions and an associated personal access token configured.
Access personal token

8. A Databricks SQL Warehouse. Go to the “Overview” tab of your SQL Warehouse and note its ID; you will need it later.
9. Your Databricks SQL warehouse allows the service principal to use it (SQL Warehouses -> <SQL warehouse> -> Permissions -> <service principal>: Can use):
lakeFS Unity Catalog integration manage permissions

10. Create a Databricks Unity Catalog if you don’t already have one:
%sql
CREATE CATALOG IF NOT EXISTS lakefs_unity_catalog_demo;
11. Grant the “USE CATALOG”, “USE SCHEMA”, “CREATE SCHEMA” and “SELECT” permissions on the catalog to the service principal (using either SQL commands or the UI):
%sql
GRANT USE CATALOG ON CATALOG lakefs_unity_catalog_demo TO `unity-exporter-hook`;
GRANT USE SCHEMA ON CATALOG lakefs_unity_catalog_demo TO `unity-exporter-hook`;
GRANT CREATE SCHEMA ON CATALOG lakefs_unity_catalog_demo TO `unity-exporter-hook`;
GRANT SELECT ON CATALOG lakefs_unity_catalog_demo TO `unity-exporter-hook`;
lakeFS Unity Catalog demo

12. Connect Unity Catalog to your cloud object storage by creating an external location:
%sql
CREATE EXTERNAL LOCATION [IF NOT EXISTS] `<location-name>`
URL 's3://<bucket-name>/'
WITH ([STORAGE] CREDENTIAL `<storage-credential-name>`)
[COMMENT '<comment-string>'];
  • The service principal has the “CREATE EXTERNAL TABLE” permission on the external location:
%sql
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION `<location-name>` TO `unity-exporter-hook`
13. A Databricks all-purpose compute cluster with lakeFS configured, and the lakeFS Python library installed on the cluster (see the sketch below).
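A minimal sketch of what that cluster setup might look like. The exact Spark configuration depends on your environment, so treat the keys and library names below as assumptions and follow the lakeFS documentation for Databricks:

# Install the lakeFS high-level Python SDK (or add it via the cluster's Libraries tab)
%pip install lakefs

# Example cluster Spark configuration for reading/writing lakefs:// paths through the
# lakeFS Hadoop FileSystem (also requires the io.lakefs:hadoop-lakefs-assembly library):
# spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
# spark.hadoop.fs.lakefs.access.key <lakeFS Access Key>
# spark.hadoop.fs.lakefs.secret.key <lakeFS Secret Key>
# spark.hadoop.fs.lakefs.endpoint <lakeFS Endpoint URL>/api/v1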

Demo Notebook

You will run the demo notebook in a Databricks workspace. You can either download the demo notebook from this git repository & import it into your Databricks workspace or create the demo notebook from scratch.

Step 1: Acquire lakeFS Access Key and Secret

In this step, you will acquire the lakeFS Access Key and Secret that will be used in the following steps. If you already have a key and secret, you can skip this section.

Note: To create a new access key, you need either the AuthManageOwnCredentials policy or the AuthFullAccess policy attached to your user.

Log in to lakeFS and click Administration -> Create Access Key

Acquire lakeFS Access Key and Secret

A new key will be generated:

Create access key

As instructed, copy the Secret Access Key and store it somewhere safe. You will not be able to access it again (but you will be able to create new ones).

Step 2: Create lakeFS Python client

Open the demo notebook or create a new notebook in Databricks. Create the lakeFS Python client in a notebook cell (replace the lakeFS Endpoint URL, Access Key, and Secret Key with your own values):

import lakefs
from lakefs.client import Client

lakefsEndPoint = '<lakeFS Endpoint URL>'
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'

clt = Client(
    host=lakefsEndPoint,
    username=lakefsAccessKey,
    password=lakefsSecretKey,
)
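Optionally, sanity-check the connection before proceeding, for example by printing the server version (assuming the version property is available in your SDK release):

# Optional: verify the client can reach the lakeFS server
print("lakeFS version:", clt.version)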

Step 3: Create lakeFS Repository

Change the S3 bucket name:

repositoryName = "unity-catalog-demo"
storageNamespace = 's3://<bucket-name>/' + repositoryName
sourceBranch = "main"

repo = lakefs.Repository(
    repositoryName,
    client=clt).create(
        storage_namespace=storageNamespace,
        default_branch=sourceBranch,
        exist_ok=True)
branchMain = repo.branch(sourceBranch)

Step 4: Table descriptor definition

import yaml
table_name = "famous_people"
unity_catalog_name = 'lakefs_unity_catalog_demo'

table_descriptor = {
    'name': table_name,
    'type': 'delta',
    'path': f'tables/{table_name}',
    'catalog': unity_catalog_name,
}

# Write table descriptor to lakeFS
with branchMain.object(path=f'_lakefs_tables/{table_name}.yaml').writer() as out:
    yaml.safe_dump(table_descriptor, out)

Step 5: The Unity Catalog exporter script

luaScriptName = "scripts/unity_export.lua"

lua_script = """

local aws = require("aws")
local formats = require("formats")
local databricks = require("databricks")
local delta_export = require("lakefs/catalogexport/delta_exporter")
local unity_export = require("lakefs/catalogexport/unity_exporter")

local sc = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)

-- Export the Delta Lake tables:
local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key, args.aws.region)
local delta_table_details = delta_export.export_delta_log(action, args.table_defs, sc.put_object, delta_client, "_lakefs_tables")

-- Register the exported table in Unity Catalog:
local databricks_client = databricks.client(args.databricks_host, args.databricks_token)
local registration_statuses = unity_export.register_tables(action, "_lakefs_tables", delta_table_details, databricks_client, args.warehouse_id)
for t, status in pairs(registration_statuses) do
    print("Unity catalog registration for table \\"" .. t .. "\\" completed with commit schema status : " .. status .. "\\n")
end

"""

branchMain.object(path=luaScriptName).upload(data=lua_script, mode='wb')

Step 6: lakeFS Action configuration

Change your Databricks workspace URL (with the format https://<instance-name>.cloud.databricks.com), Databricks personal access token, Databricks SQL Warehouse ID, AWS Region, AWS Access Key and AWS Secret Key:

newBranch = "dev"

databricks_host = 'https://<instance-name>.cloud.databricks.com'
databricks_token = '<Databricks personal access token>'
warehouse_id = '<Databricks SQL Warehouse ID>'

aws_region = '<AWS Region>'
aws_access_key_id = '<AWS Access Key>'
aws_secret_access_key = '<AWS Secret Key>'

hook_definition = {
    'name': 'unity_exporter',
    'on': {
        'post-commit': {
            'branches': [sourceBranch, newBranch+'*']
        },
        'post-create-branch': {
            'branches': [newBranch+'*']
        }
    },
    'hooks': [
        {
            'id': 'Unity-Registration',
            'type': 'lua',
            'properties': {
                'script_path': luaScriptName,
                'args': {
                    'aws': {
                        'access_key_id': aws_access_key_id,
                        'secret_access_key': aws_secret_access_key,
                        'region': aws_region
                    },
                    'lakefs': {
                        'access_key_id': lakefsAccessKey,
                        'secret_access_key': lakefsSecretKey 
                    },
                    'table_defs': [table_name],
                    'databricks_host': databricks_host,
                    'databricks_token': databricks_token,
                    'warehouse_id': warehouse_id
                }
            }
        }
    ]
}

with branchMain.object(path='_lakefs_actions/unity_exporter_action.yaml').writer() as out:
    yaml.safe_dump(hook_definition, out)

Step 7: Create a Delta Table in the source branch

data = [
    ('James','Bond','England','intelligence'),
    ('Robbie','Williams','England','music'),
    ('Hulk','Hogan','USA','entertainment'),
    ('Mister','T','USA','entertainment'),
    ('Rafael','Nadal','Spain','professional athlete'),
    ('Paul','Haver','Belgium','music'),
]
columns = ["firstname","lastname","country","category"]
df = spark.createDataFrame(data=data, schema = columns)
df.write.format("delta").mode("overwrite") \
    .partitionBy("category", "country") \
    .save(f"lakefs://{repositoryName}/{sourceBranch}/tables/{table_name}")
df.show()
lakeFS Unity Catalog integration Delta Table

Step 8: Commit your changes

branchMain.commit(message='Added configuration files and Delta table!', 
        metadata={'using': 'python_api'})

Step 9: lakeFS Action runs

Once the previous commit step finishes, the lakeFS action will start running since we’ve configured it to run on “post-commit” events on the “main” branch.

The action will export the “famous_people” Delta Lake table to the repository’s storage namespace and register it as an external table in Unity Catalog under the catalog “lakefs_unity_catalog_demo”, the schema “main” (the branch name), and the table name “famous_people”.

Go to the “Actions” tab in lakeFS UI:

Action runs main

Click on the hyperlink under the “Run ID”:

Run ID

Click on the “unity_exporter” Action and expand “Unity-Registration” hook to view the results:

lakeFS Unity Catalog integration Unity Registration

Step 10: Access the table through Unity Catalog

After the table is registered in Unity Catalog, view it in the Databricks Catalog Explorer:

Catalog Explorer

You can leverage your preferred method to query the data from the exported table under “lakefs_unity_catalog_demo.main.famous_people”:

df = spark.sql(f"SELECT * FROM `{unity_catalog_name}`.`{sourceBranch}`.`{table_name}`")
df.show()
Query the data
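Since Unity Catalog supports SQL (including serverless SQL warehouses), the same table can also be queried directly from a SQL cell, for example:

%sql
-- The lakeFS branch name ("main") appears as the Unity schema name
SELECT * FROM lakefs_unity_catalog_demo.main.famous_people LIMIT 5;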

Step 11: Create a new branch in lakeFS

newBranch = "dev1"
branchDev = repo.branch(newBranch).create(source_reference=sourceBranch)

Step 12: lakeFS Action runs again

Once the new branch creation finishes, the lakeFS action will run again, since we’ve configured it to run on “post-create-branch” events for branches matching “dev*”.

The action will export the “famous_people” Delta Lake table again to the repository’s storage namespace and register it as an external table in Unity Catalog under the catalog “lakefs_unity_catalog_demo”, the schema “dev1” (the branch name), and the table name “famous_people”.

Go to the “Actions” tab in lakeFS UI:

lakeFS Action runs in UI

Click on the hyperlink under the “Run ID”:

Run ID dev

Click on the “unity_exporter” Action and expand “Unity-Registration” hook to view the results:

lakeFS Unity Catalog integration registration

Step 13: Access the table in the dev branch through Unity Catalog

After the table is registered in Unity Catalog under the “dev1” schema, view it in the Databricks Catalog Explorer:

Access dev branch Unity Catalog

You can query the exported table under “lakefs_unity_catalog_demo.dev1.famous_people”:

df = spark.sql(f"SELECT * FROM `{unity_catalog_name}`.`{newBranch}`.`{table_name}`")
df.show()
Query the data

Step 14: Update the table in isolated dev branch

Now you can update the table in the “dev1” schema without impacting the table in the “main” schema:

from pyspark.sql.functions import col

df_us = df.filter(col("country") == "USA")
df_us.write.format("delta").mode("overwrite").save(f"lakefs://{repositoryName}/{newBranch}/tables/{table_name}")
df_us.show()
Update table isolated dev branch

Commit your changes:

branchDev.commit(message='Updated delta table!',
        metadata={'using': 'python_api'})

Step 15: lakeFS Action runs again

Once the previous commit finishes, the lakeFS action will run again, since we’ve configured it to run on “post-commit” events for branches matching “dev*”.

The action will export the “famous_people” Delta Lake table again to the repository’s storage namespace and re-register it as an external table in Unity Catalog under the catalog “lakefs_unity_catalog_demo”, the schema “dev1” (the branch name), and the table name “famous_people”.

lakeFS Action runs in UI

Step 16: Access the tables through Unity Catalog

Query the exported table under the “dev1” schema:

df = spark.sql(f"SELECT * FROM `{unity_catalog_name}`.`{newBranch}`.`{table_name}`")
df.show()
Update table isolated dev branch

While the data in the “main” schema is not impacted:

df = spark.sql(f"SELECT * FROM `{unity_catalog_name}`.`{sourceBranch}`.`{table_name}`")
df.show()
Query the data

Summary

You integrated Databricks Unity Catalog with lakeFS and performed Git-like version control operations on your data via the lakeFS Python API.

lakeFS accelerates your team and simplifies version control for analytical workloads:

  • Unique zero-copy operation to create test/dev environments for your data
  • Allows you to clone Unity schemas without copying any data
  • Integrates easily with Unity Catalog so users can seamlessly access production as well as test/dev data

Want to learn more?

If you have questions about lakeFS, then drop us a line at hello@treeverse.io or join the conversation on the lakeFS Slack channel!
