By Amit Kesarwani and Jonathan Rosenberg

Last updated on March 18, 2024

Efficient data management is a critical component of any modern organization. 

As data volumes grow and data sources become more diverse, the need for robust data catalog solutions becomes increasingly evident. Recognizing this need, lakeFS, an open-source data lake management platform, has integrated with Unity Catalog, a comprehensive data catalog solution by Databricks.

In this blog post, we will explore the exciting features and benefits of this integration and how it simplifies data management workflows.

Unity Catalog by Databricks

Unity Catalog is a unified governance solution for all data and AI assets including files, tables, machine learning models, and dashboards in your lakehouse on any cloud.

It provides a centralized solution for cataloging, organizing, and managing diverse data sources, making it easier for data engineers, data scientists, and analysts to find and utilize the data they need. With features such as data discovery, data lineage, and governance capabilities, Unity Catalog enables teams to unlock the true potential of their data.

Seamless Integration with lakeFS

The integration between lakeFS and Unity Catalog brings a range of benefits to organizations working with large-scale, complex data.

Full Data Versioning for Unity tables

By integrating lakeFS with Unity Catalog, organizations gain powerful data versioning capabilities. 

lakeFS allows users to version their data assets, capturing changes over time. This feature enables teams to track modifications, compare different versions, and easily revert to previous states if necessary. Users can now query tables as they appear in different branches or tags in lakeFS.

Collaboration and Teamwork

Using this integration, it’s now possible to expose changes to stakeholders using standard SQL tables. Using the isolated, zero-copy branching provided by lakeFS, users can modify tables and automatically expose their changes as Unity tables. All consumers have to do is select the name of the lakeFS branch and use it as the Unity schema name.
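To make this concrete, here is a sketch of what cross-branch querying could look like in a SQL cell once tables are exported. The catalog, schema, and table names are illustrative placeholders, not part of the tutorial below:

%sql
-- Hypothetical names: "my_catalog" is the Unity catalog; "main" and "dev" are
-- lakeFS branches that the exporter has surfaced as Unity schemas.
-- Rows present on main but not on dev:
SELECT * FROM my_catalog.main.my_table
EXCEPT
SELECT * FROM my_catalog.dev.my_table;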

Enhanced Data Governance

With the integration, organizations can establish robust data governance practices. lakeFS comes with a powerful hook system allowing users to control exactly which changes are permitted, validating both data and metadata, while Unity allows defining fine-grained access controls at the table and even column level. This combination makes it easy for security teams to define controls and guardrails to protect their most important asset – their data.
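On the Unity side, those fine-grained controls are just standard SQL grants on the branch-backed schemas. A hedged example follows (the catalog, schema, and the analysts group are assumptions; the lakeFS actions/hook mechanism itself is demonstrated in Step 6 of the tutorial below):

%sql
-- Hypothetical principal "analysts": allow read-only access to every table in
-- the schema the exporter created for the "main" branch.
GRANT USE SCHEMA ON SCHEMA my_catalog.main TO `analysts`;
GRANT SELECT ON SCHEMA my_catalog.main TO `analysts`;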

Unlock the full power (and cost benefits) of serverless data warehousing

Enigma uses lakeFS to produce verified data. Their processes write data to a production table, and the lakeFS Unity Catalog export feature is configured to expose these tables on the production branch as well as on several other branches.

Data scientists can then use Databricks Unity Catalog to examine the tables, inspect their schema, and query their data. Unity supports SQL and serverless queries, meaning Enigma’s data scientists can work without managing Spark clusters.

Unity Catalog Integration: How does it work?

Leveraging the external tables feature in Unity Catalog, lakeFS registers Delta Lake tables exported from lakeFS as external tables in Unity Catalog, where they can be accessed like any other table. The step-by-step tutorial below walks through configuring a Lua hook that exports Delta Lake tables from lakeFS and registers them in Unity Catalog.

Tables are defined in the lakeFS repository using a short YAML file that maps Delta Lake tables in the repository to table names in Unity Catalog. Once a Delta Lake table is exported from lakeFS, the lakeFS branch it belongs to becomes visible in Unity Catalog as a schema.
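As a sketch, such a table descriptor (stored under _lakefs_tables/ in the repository, here as _lakefs_tables/famous_people.yaml) contains the same fields that Step 4 of the tutorial writes programmatically:

name: famous_people
type: delta
path: tables/famous_people
catalog: lakefs_unity_catalog_demo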

lakeFS Unity Catalog Integration: How it Works

Prerequisites

1. AWS Credentials with S3 access
2. lakeFS Server (you can deploy one independently or use the hosted lakeFS Cloud solution, free for 30 days).
3. lakeFS Credentials (Key & Secret) or the permissions to create those credentials.
4. Databricks Service Principal, e.g. create a “unity-exporter-hook” service principal:
Databricks Service Principal

5. The service principal has the “Service principal: Manager” privilege over itself:
lakeFS Unity Catalog integration: Service principal details

6. The service principal has “Databricks SQL access” and “Workspace access” entitlements:
lakeFS Unity Catalog integration: unity exporter hook

7. The service principal has token usage permissions and an associated personal access token configured.
Access personal token

8. A Databricks SQL Warehouse. Go to the “Overview” tab of your SQL Warehouse and note its ID; you will need it later.
9. Your Databricks SQL warehouse allows the service principal to use it (SQL Warehouses -> <SQL warehouse> -> Permissions -> <service principal>: Can use):
lakeFS Unity Catalog integration manage permissions

10. Create a Databricks Unity Catalog if you don’t already have one:
%sql
CREATE CATALOG IF NOT EXISTS lakefs_unity_catalog_demo;
11. Grant the “USE CATALOG”, “USE SCHEMA”, “CREATE SCHEMA” and “SELECT” permissions on the catalog to the service principal (using either SQL commands or the UI):
%sql
GRANT USE CATALOG ON CATALOG lakefs_unity_catalog_demo TO `unity-exporter-hook`;
GRANT USE SCHEMA ON CATALOG lakefs_unity_catalog_demo TO `unity-exporter-hook`;
GRANT CREATE SCHEMA ON CATALOG lakefs_unity_catalog_demo TO `unity-exporter-hook`;
GRANT SELECT ON CATALOG lakefs_unity_catalog_demo TO `unity-exporter-hook`;
lakeFS Unity Catalog demo

12. Connect Unity Catalog to your cloud object storage by creating an external location:
%sql
CREATE EXTERNAL LOCATION [IF NOT EXISTS] `<location-name>`
URL 's3://<bucket-name>/'
WITH ([STORAGE] CREDENTIAL `<storage-credential-name>`)
[COMMENT '<comment-string>'];
  • The service principal has the “CREATE EXTERNAL TABLE” permission on the external location:
%sql
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION `<location-name>` TO `unity-exporter-hook`
13. A Databricks all-purpose compute cluster with lakeFS configured, and the lakeFS Python library installed on the cluster (see the sketch below).
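A minimal sketch of what that cluster setup might look like. The exact Spark configuration depends on your environment, so treat the keys and library names below as assumptions and follow the lakeFS documentation for Databricks:

# Install the lakeFS high-level Python SDK (or add it via the cluster's Libraries tab)
%pip install lakefs

# Example cluster Spark configuration for reading/writing lakefs:// paths through the
# lakeFS Hadoop FileSystem (also requires the io.lakefs:hadoop-lakefs-assembly library):
# spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
# spark.hadoop.fs.lakefs.access.key <lakeFS Access Key>
# spark.hadoop.fs.lakefs.secret.key <lakeFS Secret Key>
# spark.hadoop.fs.lakefs.endpoint <lakeFS Endpoint URL>/api/v1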

Demo Notebook

You will run the demo notebook in a Databricks workspace. You can either download the demo notebook from this git repository & import it into your Databricks workspace or create the demo notebook from scratch.

Step 1: Acquire lakeFS Access Key and Secret

In this step, you will acquire the lakeFS Access Key and Secret that will be used in the following steps. If you already have a key and secret, you can skip this section.

Note: To create a new access key, you need either the AuthManageOwnCredentials policy or the AuthFullAccess policy attached to your user.

Log in to lakeFS and click Administration -> Create Access Key

Acquire lakeFS Access Key and Secret

A new key will be generated:

Create access key

As instructed, copy the Secret Access Key and store it somewhere safe. You will not be able to access it again (but you will be able to create new ones).

Step 2: Create lakeFS Python client

Open the demo notebook or create a new notebook in Databricks. Create the lakeFS Python client in a notebook cell (replace the lakeFS Endpoint URL, Access Key, and Secret Key with your own values):

import lakefs
from lakefs.client import Client

lakefsEndPoint = '<lakeFS Endpoint URL>'
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'

clt = Client(
    host=lakefsEndPoint,
    username=lakefsAccessKey,
    password=lakefsSecretKey,
)
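Optionally, sanity-check the connection before proceeding, for example by printing the server version (assuming the version property is available in your SDK release):

# Optional: verify the client can reach the lakeFS server
print("lakeFS version:", clt.version)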

Step 3: Create lakeFS Repository

Change the S3 bucket name:

repositoryName = "unity-catalog-demo"
storageNamespace = 's3://<bucket-name>/' + repositoryName
sourceBranch = "main"

repo = lakefs.Repository(
    repositoryName,
    client=clt).create(
        storage_namespace=storageNamespace,
        default_branch=sourceBranch,
        exist_ok=True)
branchMain = repo.branch(sourceBranch)

Step 4: Table descriptor definition

import yaml
table_name = "famous_people"
unity_catalog_name = 'lakefs_unity_catalog_demo'

table_descriptor = {
    'name': table_name,
    'type': 'delta',
    'path': f'tables/{table_name}',
    'catalog': unity_catalog_name,
}

# Write table descriptor to lakeFS
with branchMain.object(path=f'_lakefs_tables/{table_name}.yaml').writer() as out:
    yaml.safe_dump(table_descriptor, out)

Step 5: The Unity Catalog exporter script

luaScriptName = "scripts/unity_export.lua"

lua_script = """

local aws = require("aws")
local formats = require("formats")
local databricks = require("databricks")
local delta_export = require("lakefs/catalogexport/delta_exporter")
local unity_export = require("lakefs/catalogexport/unity_exporter")

local sc = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)

-- Export the Delta Lake tables:
local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key, args.aws.region)
local delta_table_details = delta_export.export_delta_log(action, args.table_defs, sc.put_object, delta_client, "_lakefs_tables")

-- Register the exported table in Unity Catalog:
local databricks_client = databricks.client(args.databricks_host, args.databricks_token)
local registration_statuses = unity_export.register_tables(action, "_lakefs_tables", delta_table_details, databricks_client, args.warehouse_id)
for t, status in pairs(registration_statuses) do
    print("Unity catalog registration for table \\"" .. t .. "\\" completed with commit schema status : " .. status .. "\\n")
end

"""

branchMain.object(path=luaScriptName).upload(data=lua_script, mode='wb')

Step 6: lakeFS Action configuration

Change your Databricks workspace URL (with the format https://<instance-name>.cloud.databricks.com), Databricks personal access token, Databricks SQL Warehouse ID, AWS Region, AWS Access Key and AWS Secret Key:

newBranch = "dev"

databricks_host = 'https://<instance-name>.cloud.databricks.com'
databricks_token = '<Databricks personal access token>'
warehouse_id = '<Databricks SQL Warehouse ID>'

aws_region = '<AWS Region>'
aws_access_key_id = '<AWS Access Key>'
aws_secret_access_key = '<AWS Secret Key>'

hook_definition = {
    'name': 'unity_exporter',
    'on': {
        'post-commit': {
            'branches': [sourceBranch, newBranch+'*']
        },
        'post-create-branch': {
            'branches': [newBranch+'*']
        }
    },
    'hooks': [
        {
            'id': 'Unity-Registration',
            'type': 'lua',
            'properties': {
                'script_path': luaScriptName,
                'args': {
                    'aws': {
                        'access_key_id': aws_access_key_id,
                        'secret_access_key': aws_secret_access_key,
                        'region': aws_region
                    },
                    'lakefs': {
                        'access_key_id': lakefsAccessKey,
                        'secret_access_key': lakefsSecretKey 
                    },
                    'table_defs': [table_name],
                    'databricks_host': databricks_host,
                    'databricks_token': databricks_token,
                    'warehouse_id': warehouse_id
                }
            }
        }
    ]
}

with branchMain.object(path='_lakefs_actions/unity_exporter_action.yaml').writer() as out:
    yaml.safe_dump(hook_definition, out)

Step 7: Create a Delta Table in the source branch

data = [
    ('James','Bond','England','intelligence'),
    ('Robbie','Williams','England','music'),
    ('Hulk','Hogan','USA','entertainment'),
    ('Mister','T','USA','entertainment'),
    ('Rafael','Nadal','Spain','professional athlete'),
    ('Paul','Haver','Belgium','music'),
]
columns = ["firstname","lastname","country","category"]
df = spark.createDataFrame(data=data, schema = columns)
df.write.format("delta").mode("overwrite") \
    .partitionBy("category", "country") \
    .save(f"lakefs://{repositoryName}/{sourceBranch}/tables/{table_name}")
df.show()
lakeFS Unity Catalog integration Delta Table

Step 8: Commit your changes

branchMain.commit(message='Added configuration files and Delta table!', 
        metadata={'using': 'python_api'})

Step 9: lakeFS Action runs

Once the previous commit step finishes, the lakeFS action will start running since we’ve configured it to run on “post-commit” events on the “main” branch.

The action will export the “famous_people” Delta Lake table to the repository’s storage namespace and register it as an external table in Unity Catalog under the catalog “lakefs_unity_catalog_demo”, the schema “main” (the branch name), and the table name “famous_people”.

Go to the “Actions” tab in lakeFS UI:

Action runs main

Click on the hyperlink under the “Run ID”:

Run ID

Click on the “unity_exporter” Action and expand “Unity-Registration” hook to view the results:

lakeFS Unity Catalog integration Unity Registration

Step 10: Access the table through Unity Catalog

After the table is registered in Unity Catalog, view it in the Databricks Catalog Explorer:

Catalog Explorer

You can leverage your preferred method to query the data from the exported table under “lakefs_unity_catalog_demo.main.famous_people”:

df = spark.sql(f"SELECT * FROM `{unity_catalog_name}`.`{sourceBranch}`.`{table_name}`")
df.show()
Query the data
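Since Unity Catalog supports SQL (including serverless SQL warehouses), the same table can also be queried directly from a SQL cell, for example:

%sql
-- The lakeFS branch name ("main") appears as the Unity schema name
SELECT * FROM lakefs_unity_catalog_demo.main.famous_people LIMIT 5;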

Step 11: Create a new branch in lakeFS

newBranch = "dev1"
branchDev = repo.branch(newBranch).create(source_reference=sourceBranch)

Step 12: lakeFS Action runs again

Once the new branch creation finishes, the lakeFS action will run again, since we’ve configured it to run on “post-create-branch” events for branches matching “dev*”.

The action will export the “famous_people” Delta Lake table again to the repository’s storage namespace and register it as an external table in Unity Catalog under the catalog “lakefs_unity_catalog_demo”, the schema “dev1” (the branch name), and the table name “famous_people”.

Go to the “Actions” tab in lakeFS UI:

lakeFS Action runs in UI

Click on the hyperlink under the “Run ID”:

Run ID dev

Click on the “unity_exporter” Action and expand “Unity-Registration” hook to view the results:

lakeFS Unity Catalog integration registration

Step 13: Access the table in the dev branch through Unity Catalog

After the table is registered in Unity Catalog under the “dev1” schema, view it in the Databricks Catalog Explorer:

Access dev branch Unity Catalog

You can query the exported table under “lakefs_unity_catalog_demo.dev1.famous_people”:

df = spark.sql(f"SELECT * FROM `{unity_catalog_name}`.`{newBranch}`.`{table_name}`")
df.show()
Query the data

Step 14: Update the table in isolated dev branch

Now you can update the table in the “dev1” schema without impacting the table in the “main” schema:

from pyspark.sql.functions import col

df_us = df.filter(col("country") == "USA")
df_us.write.format("delta").mode("overwrite").save(f"lakefs://{repositoryName}/{newBranch}/tables/{table_name}")
df_us.show()
Update table isolated dev branch

Commit your changes:

branchDev.commit(message='Updated delta table!',
        metadata={'using': 'python_api'})

Step 15: lakeFS Action runs again

Once the previous commit finishes, the lakeFS action will run again, since we’ve configured it to run on “post-commit” events for branches matching “dev*”.

The action will export the “famous_people” Delta Lake table again to the repository’s storage namespace and re-register it as an external table in Unity Catalog under the catalog “lakefs_unity_catalog_demo”, the schema “dev1” (the branch name), and the table name “famous_people”.

lakeFS Action runs in UI

Step 16: Access the tables through Unity Catalog

Query the exported table under the “dev1” schema:

df = spark.sql(f"SELECT * FROM `{unity_catalog_name}`.`{newBranch}`.`{table_name}`")
df.show()
Update table isolated dev branch

While the data in the “main” schema is not impacted:

df = spark.sql(f"SELECT * FROM `{unity_catalog_name}`.`{sourceBranch}`.`{table_name}`")
df.show()
Query the data

Summary

You integrated Databricks Unity Catalog with lakeFS and performed Git-like version control operations on your data via the lakeFS Python API.

lakeFS accelerates your team and simplifies version control for analytical workloads:

  • Unique zero-copy operation to create test/dev environments for your data
  • Allows you to clone Unity schemas without copying any data
  • Integrates easily with Unity Catalog so users can seamlessly access production as well as test/dev data

Want to learn more?

If you have questions about lakeFS, then drop us a line at hello@treeverse.io or join the conversation on the lakeFS Slack channel!
