Databricks Unity Catalog: A Comprehensive Guide to Streamlining Your Data Assets

Oz Katz

Last updated on April 26, 2024

As data volumes grow and data sources diversify, teams are under pressure to implement comprehensive data catalog solutions. Databricks Unity Catalog is a unified governance solution for all data and AI assets in your lakehouse on any cloud, including files, tables, machine learning models, and dashboards.

It provides a consolidated way to categorize, organize, and manage heterogeneous data sources, making it easier for data engineers, data scientists, and analysts to access and use the data they require.

With features such as data discovery, data lineage, and governance capabilities, Unity Catalog enables teams to realize the full potential of their data.

What other features does Unity Catalog offer, and how does it integrate with the lakehouse architecture? Keep reading to find out.

What Is Databricks Unity Catalog?

Databricks Unity Catalog is the first unified governance solution for data and AI in lakehouses. Teams can use Unity Catalog to manage their structured and unstructured data, machine learning models, notebooks, dashboards, and files across any cloud or platform.

Unity Catalog also enables data practitioners to securely search, access, and collaborate on trustworthy data and AI assets, harnessing AI to increase productivity and unlock the full potential of the lakehouse environment.

The single governance framework accelerates data and AI endeavors while simplifying regulatory compliance.

Why Did Databricks Create Unity Catalog?

Databricks created Unity Catalog as a follow-up to the data storage and processing layers of its data platform. With those in place, the company started developing components for two underserved areas: discovery and governance.

The goal was to develop a data cataloging, discovery, and governance solution that would integrate seamlessly with the Databricks ecosystem, particularly when dealing with the various asset types that are part of the lakehouse architecture.

Unity Catalog was introduced in mid-2021 to address the data governance challenge. In its original release blog post, Databricks mentioned a lack of granular security controls for data lakes in existing technologies. In April 2022, the solution saw a limited release for Azure and AWS, and in August 2022, a GA release.

Benefits of Using Databricks Unity Catalog

Databricks aims to combine the best of both worlds in its lakehouse architecture: a data warehouse and a data lake. It works with both structured and unstructured data, supports a variety of workloads, and may benefit any member of the data team, from the data engineer to the data analyst to the machine learning engineer.

Improved data governance

Unity Catalog works as a data governance layer with a sophisticated user interface for data search and discovery. It breaks down data silos and democratizes data across the organization. It helps data specialists find relevant assets for a variety of use cases, including BI, analytics, and machine learning.

Easier metadata management

Unity Catalog provides a single metadata management and data governance layer for all Databricks data assets, including tables, files, dashboards, and machine learning models. The catalog includes fine-grained access control, built-in data search, and automated data lineage (tracking data flows to identify their sources).

Improving security across Databricks

Unity Catalog offers centralized, fine-grained access controls and lets you restrict specific rows and columns to specified groups. This further strengthens the security already built into Databricks’ two-plane (control plane and data plane) architecture.

In-depth monitoring for compliance

The platform has auditing functions to monitor user behavior as well as controls to ensure compliance with standards such as HIPAA for medical data and PCI for payment card data.

Databricks Lakehouse Framework: Quick Recap

A well-architected lakehouse is built on seven pillars, each describing an area of concern for implementing a data lakehouse in the cloud:

Data governance – ensuring that data adds value and supports your business strategy.
Interoperability and usability – the lakehouse’s ability to interact with users and other systems.
Operational excellence – all the operations processes that keep the lakehouse running.
Security, privacy, and compliance – protecting the Databricks application, customer workloads, and customer data.
Reliability – the system’s ability to recover from failures and continue to function.
Performance efficiency – the system’s ability to adapt to changes in load.
Cost optimization – managing costs to maximize the value delivered.

Key Features of Databricks Unity Catalog

Data discovery

Databricks Unity Catalog key features — Source: https://docs.gcp.databricks.com/data-governance/unity-catalog/best-practices.html

Unity Catalog combines well-structured metadata organization with a sophisticated search interface. It exposes metadata to search but restricts access to that metadata based on the privileges and permissions of the logged-in user, which enforces security at the metadata level.

We’ll go over data lineage in greater depth later, but lineage metadata also aids search and discovery by showing the links between entities and layers of data. Together, these data discovery features deliver a unified and secure search experience.
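
As a rough illustration of permission-aware search, the toy model below filters search hits by the groups the caller belongs to. It is a sketch of the idea only, not how Unity Catalog is implemented: the CatalogEntry shape, the readers field, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """A searchable metadata record (hypothetical shape, for illustration only)."""
    full_name: str
    comment: str
    readers: set = field(default_factory=set)  # groups allowed to see this entry


def search(entries, term, user_groups):
    """Return names of entries that match `term` AND are visible to the user's groups."""
    term = term.lower()
    groups = set(user_groups)
    return [
        e.full_name
        for e in entries
        if (term in e.full_name.lower() or term in e.comment.lower())
        and e.readers & groups  # non-empty intersection means the user may see it
    ]


# Made-up sample metadata.
ENTRIES = [
    CatalogEntry("main.sales.orders", "Customer orders", {"analysts", "finance"}),
    CatalogEntry("main.hr.salaries", "Payroll, joined to orders for bonuses", {"hr"}),
]

print(search(ENTRIES, "orders", ["analysts"]))
```

Both entries match the term "orders", but each caller only sees the entries their groups are allowed to read, so the search index itself never leaks the existence of restricted assets.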

Data management

A centralized repository of all data assets, such as files, tables, views, dashboards, and so forth, makes it possible for the Databricks Unity Catalog to offer a search and discovery experience. This, combined with a data governance structure and a detailed audit log of all actions performed on Databricks data, makes Unity Catalog very appealing to enterprises.

For identity and access control, Databricks principals can be users, service principals, or groups. You can establish trust between these principals and Databricks workspaces; the resulting relationship is known as identity federation.

You can use pure SQL to control access at the table, row, and column level in Unity Catalog. Additionally, there’s planned support for attribute-based access control, which lets you tag multiple objects and apply access controls to those tags (for example, “PII data”).
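
One common way to express row- and column-level rules in SQL is a dynamic view that checks group membership with the is_account_group_member function. The Python helper below only assembles such a statement as a string; it is a minimal sketch, and the table, view, column, and group names are hypothetical. Because it uses plain string formatting, it should only ever be fed trusted identifiers.

```python
def dynamic_view_sql(view, table, columns, masked_column, privileged_group, row_predicate):
    """Build a CREATE VIEW statement that masks one column and filters rows
    for anyone who is not a member of `privileged_group`."""
    select_list = []
    for col in columns:
        if col == masked_column:
            # Column-level rule: only the privileged group sees the real value.
            select_list.append(
                f"CASE WHEN is_account_group_member('{privileged_group}') "
                f"THEN {col} ELSE 'REDACTED' END AS {col}"
            )
        else:
            select_list.append(col)
    # Row-level rule: everyone else only sees rows matching `row_predicate`.
    return (
        f"CREATE OR REPLACE VIEW {view} AS\n"
        f"SELECT {', '.join(select_list)}\n"
        f"FROM {table}\n"
        f"WHERE is_account_group_member('{privileged_group}') OR {row_predicate}"
    )


# Hypothetical names, for illustration only.
sql = dynamic_view_sql(
    view="main.sales.orders_redacted",
    table="main.sales.orders",
    columns=["order_id", "email", "region"],
    masked_column="email",
    privileged_group="pii_readers",
    row_predicate="region = 'EU'",
)
print(sql)
```

You would then grant SELECT on the view rather than on the underlying table, so that consumers can only reach the data through the filtered projection.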

Data lineage

Data lineage is becoming increasingly relevant for a variety of data engineering use cases, including task tracking and monitoring, diagnosing problems, understanding complex workflows, tracing transformation rules, and so on.

Unity Catalog uses the SQL parser, as well as external tools such as dbt and Airflow, to extract lineage metadata from queries. Lineage is available in Unity Catalog for whatever code you write in your workspace, not just SQL.

Because lineage data contains vital information about your company’s data flow, Unity Catalog has taken the same approach to protecting your data from bad actors, employing a governance model that restricts access to data lineage depending on the privileges of the logged-in users.
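
To make the idea of lineage concrete, here is a toy sketch that treats lineage as a set of (source, target) edges and walks them backwards to find everything a table depends on. It is an illustration of the concept rather than Unity Catalog’s implementation, and the table names are made up.

```python
from collections import defaultdict


def upstream_sources(edges, target):
    """Return every table that `target` depends on, directly or transitively."""
    parents = defaultdict(set)
    for source, dest in edges:
        parents[dest].add(source)

    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:  # guard against revisiting shared ancestors
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical table-level lineage edges: (source, target).
EDGES = [
    ("main.raw.orders", "main.silver.orders"),
    ("main.raw.customers", "main.silver.orders"),
    ("main.silver.orders", "main.gold.revenue"),
]

print(sorted(upstream_sources(EDGES, "main.gold.revenue")))
```

The same traversal run in the opposite direction answers the impact-analysis question: which downstream tables are affected if a source changes.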

Databricks data sharing — Source: https://docs.databricks.com/en/data-sharing/index.html

Finally, one of the most welcome advances in data engineering has been the ability to access and share data directly from the platform. It gives organizations greater control over how, why, and what data is shared with whom. Without such a system in place, people with access to the data end up downloading it manually and passing it around the team via email, Slack, Teams, and so on.

The built-in, closely integrated approach to sharing data in Unity Catalog alleviates the pain of maintaining data access rights across an organization. It’s based on Delta Sharing, a platform-agnostic open protocol for data sharing.

This transparent method of exchanging data not only reduces your data team’s workload but also lets them clearly monitor and regulate data access.

Standards-compliant security model

Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to issue permissions at the catalog, database, table, and view levels in their current data lake using familiar syntax.
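
As a sketch of what that familiar syntax looks like, the helper below builds the GRANT statements a principal typically needs to read a single table: USE CATALOG on the catalog, USE SCHEMA on the schema, and SELECT on the table itself. The table and group names are hypothetical, and the helper only produces strings; you would run them in a SQL editor or notebook.

```python
def grant(privilege, securable_type, securable_name, principal):
    """Format one GRANT statement; the principal is backtick-quoted."""
    return f"GRANT {privilege} ON {securable_type} {securable_name} TO `{principal}`"


def read_access(table_name, principal):
    """Return the grants needed to read `catalog.schema.table`."""
    catalog, schema, _table = table_name.split(".")
    return [
        grant("USE CATALOG", "CATALOG", catalog, principal),
        grant("USE SCHEMA", "SCHEMA", f"{catalog}.{schema}", principal),
        grant("SELECT", "TABLE", table_name, principal),
    ]


# Hypothetical table and group.
for statement in read_access("main.sales.orders", "analysts"):
    print(statement + ";")
```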

Databricks Unity Catalog Components

Unity Catalog’s exact implementation details are not published because it is a closed-source, proprietary data platform. Still, its object model is well described in the documentation.

Unity Catalog manages its primary data objects in the following hierarchy:

Metastore – the top-level metadata container. Each metastore exposes a three-level namespace that organizes your data (catalog.schema.table).
Catalog – the first layer of the object hierarchy, used to organize your data assets.
Schemas (databases) – the second layer of the object hierarchy, containing tables and views.
Volumes, tables, and views – the lowest level of the object hierarchy.

The Unity Catalog object model is notable for its use of a three-level namespace to address the various types of data assets in the catalog. Most databases and data warehouses only let you address a data asset using the two-level schema_name.table_name format; Unity Catalog adds the catalog as a level above the schema.
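
The small sketch below shows how a one-, two-, or three-part name can be resolved into a full catalog.schema.table reference. The default catalog and schema values are assumptions made for the example; in a real session they depend on your current catalog and schema.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def resolve(name, default_catalog="main", default_schema="default"):
    """Expand a partial name to three levels using the (assumed) session defaults."""
    parts = name.split(".")
    if len(parts) == 3:
        return TableName(*parts)
    if len(parts) == 2:  # classic schema_name.table_name
        return TableName(default_catalog, *parts)
    if len(parts) == 1:
        return TableName(default_catalog, default_schema, parts[0])
    raise ValueError(f"expected at most three name parts, got {name!r}")


print(resolve("sales.orders"))
print(resolve("dev.sales.orders"))
```

The two-part form keeps working, which is why existing schema_name.table_name queries can usually run unchanged once a default catalog is set.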

Let’s dive into the details of these and other components:

Metastore

The Unity Catalog has a metastore that is similar to or compatible with the Hive or Hive-like metastores used by cloud-platform-specific data catalogs such as AWS Glue Catalog. It offers one more abstraction layer to allow users to better categorize data assets.

The metastore serves as a container for all of your data assets, which are organized into numerous catalogs, schemas, and entities like tables, views, functions, and so on. Unity Catalog uses its own metastore, but it’s largely compatible with Hive.

Lineage data storage

Based on your queries, workflows, CTAS statements, and so on, the Unity Catalog internally retains both table-level and column-level lineage data.

All of this data is kept in the metastore, which is why a bespoke metastore was required. However, if you use an external Hive metastore, you will be able to customize the lineage metadata.

Audit trail

Audit logs, by contrast, are sent to a separate storage location (a different S3 bucket if you’re using AWS). This means that even if a metastore is deleted, the audit logs remain available for compliance purposes.

The audit logs record all events associated with Unity Catalog. This includes creating, deleting, and modifying all metastore components, as well as the metastore itself. These events also cover operations such as credential storage and retrieval, access control lists, data-sharing requests, and so on.

Access management

Unity Catalog’s identity and access management strategy is built with bespoke privileges that work on different levels of the metastore’s three-level namespace. In Unity Catalog, privileges are passed down the namespace hierarchy.
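
Privilege inheritance can be pictured with a toy model in which a grant on a catalog or schema also applies to everything beneath it. This is a simplified sketch, not the real permission engine: it ignores the separate usage privileges needed on parent objects, and the grants shown are hypothetical.

```python
def ancestors(securable):
    """Yield the securable itself, then each parent up to the catalog."""
    parts = securable.split(".")
    for depth in range(len(parts), 0, -1):
        yield ".".join(parts[:depth])


def has_privilege(grants, principal, privilege, securable):
    """True if the privilege was granted on the securable or on any of its parents."""
    return any(
        (principal, privilege, level) in grants for level in ancestors(securable)
    )


# Hypothetical grant: SELECT on a whole schema.
GRANTS = {("analysts", "SELECT", "main.sales")}

print(has_privilege(GRANTS, "analysts", "SELECT", "main.sales.orders"))
print(has_privilege(GRANTS, "analysts", "SELECT", "main.hr.salaries"))
```

Granting at the schema level means new tables created in that schema are covered automatically, which is usually what makes higher-level grants attractive.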

Databricks also features a workspace-level permission mechanism that lets you limit access to workspace objects such as DLT pipelines, SQL warehouses, notebooks, and so on using access control lists (ACLs). Both admin users and users with ACL management privileges are responsible for maintaining these ACLs.

Challenges With Databricks Unity Catalog

Unity Catalog has some limitations:

Scala, R, and Databricks Runtime for Machine Learning workloads are only supported on clusters with Single User access mode. Workloads in these languages don’t support dynamic views for row- or column-level security.

Shallow clones can be used in Databricks Runtime 13.1 and later to create Unity Catalog managed tables from existing Unity Catalog managed tables. Databricks Runtime 13.0 and earlier has no support for shallow clones in Unity Catalog.

Unity Catalog tables don’t support bucketing. If you run a command in Unity Catalog that attempts to create a bucketed table, you’ll get an error.

If some clusters access Unity Catalog while others do not, writing to the same path or Delta Lake table from workspaces in different regions can result in unreliable performance.

Tables in Unity Catalog don’t support custom partition schemes created with commands like ALTER TABLE ADD PARTITION. Unity Catalog can, however, access tables that use directory-style partitioning.

Overwrite mode for DataFrame write operations into Unity Catalog is only supported for Delta tables, not for other file formats. The user must have the CREATE privilege on the parent schema and either own the existing object or have the MODIFY privilege on it.

Spark-submit jobs are available on single-user access clusters but not on shared clusters.

Python UDFs, including UDAFs, UDTFs, and Pandas on Spark (applyInPandas and mapInPandas), are not supported on Databricks Runtime 13.1 and lower. Python UDFs are supported in Databricks Runtime 13.2 and later.

You can’t use workspace-level groups (groups previously created in a workspace) in Unity Catalog GRANT statements. This ensures that groups are seen consistently across workspaces.

Thread pools in Scala aren’t supported.

Note: If your cluster is operating on a Databricks Runtime version lower than 11.3 LTS, you may encounter additional restrictions that are not stated here. Unity Catalog requires Databricks Runtime 11.3 LTS or above.
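
Several of the limitations above depend on the Databricks Runtime version. The sketch below encodes only the version thresholds stated in this list and reports which capabilities a given runtime has; the feature labels are names invented for this example, not official identifiers.

```python
# Minimum runtime per capability, taken from the limitations listed above.
MIN_RUNTIME = {
    "unity_catalog": (11, 3),
    "shallow_clone_managed_tables": (13, 1),
    "python_udfs": (13, 2),
}


def parse_runtime(version):
    """Turn a version label such as '13.3 LTS' into a comparable tuple (13, 3)."""
    major, minor = version.split()[0].split(".")[:2]
    return int(major), int(minor)


def supported_features(version):
    """Return the capabilities whose minimum runtime is met, sorted by name."""
    runtime = parse_runtime(version)
    return sorted(
        name for name, minimum in MIN_RUNTIME.items() if runtime >= minimum
    )


print(supported_features("13.1"))
```

A check like this could run at the start of a job so that unsupported operations fail early with a clear message instead of partway through a pipeline.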

Getting Started With Databricks Unity Catalog

Here’s how you get Unity Catalog up and running on AWS:

Configure an S3 bucket and an IAM role in your AWS account that Unity Catalog can use to store and access data.
Create a metastore for each region where your company operates.
Connect workspaces to a metastore. Each attached workspace will have the same view of the data you manage in Unity Catalog.
Add users, groups, and service principals to your Databricks account if you have a new account.

For details and other examples, check out the documentation.

How lakeFS Integrates with Unity Catalog

lakeFS Cloud leverages Unity Catalog’s support for the Delta Sharing protocol by exposing versioned tables via a Delta Sharing server. This allows data consumers to query versioned data by specifying the version identifier (branch, tag, or commit ID) as the schema name.
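
To illustrate the naming convention, the sketch below builds fully qualified table names in which the lakeFS ref takes the place of the schema, and then a query that compares the same table on two refs. The catalog name lakefs_repo and the branch names are hypothetical; check the integration guide for how the catalog is actually named in your setup.

```python
def quote(identifier):
    """Backtick-quote an identifier, since refs such as 'fix-nulls' or 'v1.0'
    contain characters that are not valid in bare SQL names."""
    return "`" + identifier.replace("`", "``") + "`"


def versioned_table(catalog, ref, table):
    """Address `table` as it exists on a lakeFS ref (branch, tag, or commit ID)."""
    return ".".join(quote(part) for part in (catalog, ref, table))


def rows_only_in(catalog, ref, base_ref, table):
    """Build a query returning rows present on `ref` but not on `base_ref`."""
    return (
        f"SELECT * FROM {versioned_table(catalog, ref, table)}\n"
        f"EXCEPT\n"
        f"SELECT * FROM {versioned_table(catalog, base_ref, table)}"
    )


# Hypothetical catalog and branch names.
print(rows_only_in("lakefs_repo", "fix-nulls", "main", "orders"))
```

Because each ref is just a schema name to the consumer, comparing a branch against main needs nothing more than ordinary SQL.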

The integration of lakeFS with Unity Catalog provides numerous benefits to businesses working with large amounts of complicated data:

Versioning of all data for Unity tables

Teams get powerful data versioning capabilities by integrating lakeFS with Unity Catalog. Users can version data assets with lakeFS, documenting changes over time.

This allows teams to monitor changes, compare versions, and easily roll back to earlier states if necessary. Users can now query data in lakeFS tables as it appears on different branches or tags.

Teamwork and collaboration

Thanks to this integration, it’s now possible to expose changes to stakeholders using regular SQL tables. Users can change tables on lakeFS’s isolated, zero-copy branches and automatically expose their changes as Unity tables. All you need to do is take the lakeFS branch name and use it as the Unity schema name to access the data.

Improved data governance

lakeFS includes a powerful hook system that lets users regulate which modifications are permitted by validating both data and metadata, while Unity allows users to define fine-grained access controls at the table and even column level.

This combination makes it simple for security teams to define controls and safeguards to protect their most valuable asset: data.

Learn more about how lakeFS integrates with Unity Catalog, or check out the lakeFS Unity Catalog integration guide in the lakeFS documentation.