Metadata Management for the Modern Data Lakehouse
Efficient data management is critical to any modern organization.
As data volumes grow and data sources become more diverse, the need for robust data catalog solutions becomes increasingly evident. Recognizing this need, lakeFS, an open-source data lake management platform, has recently integrated with Unity Catalog, a comprehensive data catalog solution by Databricks.
In this blog post, we will explore the exciting features and benefits of this integration and how it simplifies data management workflows.
Unity Catalog by Databricks
Unity Catalog, developed by Databricks, is a unified governance solution for all data and AI assets, including files, tables, machine learning models, and dashboards, in your lakehouse on any cloud.
It provides a centralized solution for cataloging, organizing, and managing diverse data sources, making it easier for data engineers, data scientists, and analysts to find and utilize the data they need. With features such as data discovery, data lineage, and governance capabilities, Unity Catalog enables teams to unlock the true potential of their data.
Seamless Integration with lakeFS
The integration between lakeFS and Unity Catalog brings a range of benefits to organizations working with large-scale, complex data.
Full data versioning for Unity tables
By integrating lakeFS with Unity Catalog, organizations gain powerful data versioning capabilities.
lakeFS allows users to version their data assets, capturing changes over time. This feature enables teams to track modifications, compare different versions, and easily revert to previous states if necessary. Users can now query tables as they appear in different lakeFS branches or tags.
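For example, assuming a shared catalog named lakefs and a table named user_events (both hypothetical names for illustration), querying a specific version is plain SQL:

```sql
-- Query the table as it appears on the main branch
SELECT * FROM lakefs.main.user_events LIMIT 10;

-- Query the same table as it appears on an experiment branch
-- (hyphenated schema names need backquotes in Databricks SQL)
SELECT * FROM lakefs.`experiment-1`.user_events LIMIT 10;
```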
Collaboration and teamwork
Using this integration, it’s now possible to expose changes to stakeholders as standard SQL tables. With the isolated, zero-copy branching provided by lakeFS, users can modify tables and automatically expose their changes as Unity tables. All consumers have to do is use the name of the lakeFS branch as the Unity schema name.
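As a sketch, with the same hypothetical lakefs catalog and table names as above, a reviewer could sanity-check a change on an in-progress branch before it is merged:

```sql
-- Compare row counts between the production branch and a development branch
SELECT
  (SELECT COUNT(*) FROM lakefs.main.user_events)           AS main_rows,
  (SELECT COUNT(*) FROM lakefs.`new-pipeline`.user_events) AS branch_rows;
```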
Enhanced data governance
With the integration, organizations can establish robust data governance practices. lakeFS comes with a powerful hook system allowing users to control exactly which changes are permitted, validating both data and metadata, while Unity allows defining fine-grained access controls at the table and even column level. This combination makes it easy for security teams to define controls and guardrails to protect their most important asset: their data.
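As a sketch of what such a guardrail looks like on the lakeFS side, a pre-merge webhook that validates changes before they reach main is defined in a YAML file under the repository’s _lakefs_actions/ directory (the endpoint URL here is hypothetical):

```yaml
# _lakefs_actions/validate-schema.yaml (illustrative example)
name: validate table schema
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: schema_check
    type: webhook
    properties:
      url: https://hooks.example.com/validate-schema
```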
Unlock the full power (and cost benefits) of serverless data warehousing
Enigma uses lakeFS to produce verified data. Their pipelines write data tables to a production branch, and the lakeFS Delta Sharing service is configured to expose these tables on that branch as well as on several others. Data scientists can then use Databricks Unity to examine the tables and their schemas, and to query their data across all exported branches. Because Unity supports SQL and serverless queries, Enigma’s data scientists can work without managing Spark clusters.
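In practice, each exported branch shows up as a schema in the shared catalog, so discovering what is available is standard SQL as well (catalog name hypothetical):

```sql
-- List the branches exposed as schemas, then the tables on one of them
SHOW SCHEMAS IN lakefs;
SHOW TABLES IN lakefs.main;
```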
How does it work?
Databricks Unity Catalog allows third-party services to share data using the Delta Sharing protocol. The provider publishes schemas and tables as a share, and Unity can define a catalog to provide access to these tables.
The lakeFS Delta Sharing service is configured to export lakeFS repositories as Delta Sharing shares. Once a repository is exported, its branches may be configured to be visible in Unity as schemas. Tables are defined in the lakeFS repository using a short YAML file that maps Delta Lake tables in the repository to Delta Sharing table names. The Delta Sharing service can also export partitioned directories of Parquet files (known as “Hive Metastore” tables) by configuring the YAML file that holds their Hive Metastore schema.
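A sketch of such a definition for a partitioned Parquet directory, following the _lakefs_tables format (the table name, path, and columns are hypothetical):

```yaml
# _lakefs_tables/raw_events.yaml
name: raw_events
type: hive
path: tables/raw_events
partition_columns: ['event_date']
schema:
  type: struct
  fields:
    - name: event_id
      type: string
      nullable: false
    - name: event_date
      type: date
      nullable: false
```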
Getting Started with lakeFS Unity Catalog Integration
Getting started with lakeFS and Unity Catalog is easy!
1. Sign up for lakeFS Cloud at http://lakefs.cloud/
2. Set up a cluster in one of the supported regions.
3. Optionally, import your existing data into lakeFS using zero-copy import.
4. Follow the instructions on creating a config.share.json object and a user for the lakeFS Delta Sharing service to use (a sketch of the profile format appears after this list).
5. Ask Treeverse Customer Success to enable the lakeFS Delta Sharing service on your account, and share the lakeFS URL of your config.share.json object along with the access key credentials for the user you created.
6. Using the Databricks CLI, register the Delta Sharing server with the file you just created (see the sketch after this list).
7. This creates a provider in your Databricks account. You can now use the Databricks UI to create a catalog that uses this provider.
8. Define tables in your lakeFS repository. For a Delta table, this is as simple as uploading a three-line YAML file to lakefs://repo/main/_lakefs_tables/table.yaml (an example follows this list).
9. Congratulations! You have successfully connected Unity with lakeFS! You’ll now see your lakeFS repository appear as a share. Go ahead and create a catalog from it, and you’ll have full access to your lakeFS branches as configured in step 4.
10. As you create branches and tags, they will automatically appear in your catalog as schemas 🎉
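For reference, config.share.json from step 4 follows the standard Delta Sharing profile format. A minimal sketch (the endpoint shown is a placeholder; use the one provided for your installation):

```json
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://<your-lakefs-installation>/service/delta-sharing/v1",
  "bearerToken": "<token>"
}
```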
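Registering the provider in step 6 looks roughly like this. Treat it as a sketch: the exact command and flag names vary across Databricks CLI versions, so check the CLI help for your version.

```sh
# Register the lakeFS Delta Sharing endpoint as a provider
# (flag names may differ across Databricks CLI versions)
databricks unity-catalog providers create \
  --name lakefs \
  --recipient-profile-json-file config.share.json
```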
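And the three-line table definition from step 8, assuming a Delta Lake table stored under tables/my-table in the repository (names are hypothetical):

```yaml
# _lakefs_tables/table.yaml
name: my_table
type: delta
path: tables/my-table
```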
For more information, follow the detailed instructions in the lakeFS documentation.