This article is the continuation of Building A Management Layer For Your Data Lake: 3 Architecture Components.
In this part, we explore open table formats, metastores, and data version control across three practical examples showing how to build a management layer for data lakes using tools in the Databricks, AWS, and Snowflake ecosystems.
Databricks ecosystem and data management

Databricks is an extremely powerful product suite available on all major cloud providers: Azure, AWS, and GCP (the last of which has the smallest footprint of Databricks users).
It provides all three components required for a management layer. Because Databricks relies on a proprietary Spark implementation as its computation engine, it’s suitable for structured, semi-structured, and unstructured data.
Databricks is an open platform, and there are many options you can use, including the good old Hive Metastore over ORC, Parquet, or Avro files.
That said, the serverless Databricks offering is already focused on a very specific and performant management layer, built on Delta Lake as the open table format, Unity Catalog as the metastore, and, of course, Databricks as the computation layer.
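To make the open table format idea concrete: under the hood, a Delta Lake table is a directory of Parquet data files plus a `_delta_log` directory of ordered JSON commit files that readers replay to reconstruct the current table state. Below is a minimal, stdlib-only sketch of that replay logic; the file names and the two-action schema are illustrative simplifications, as real Delta logs contain richer actions and checkpoints.

```python
import json
import os
import tempfile

def commit(log_dir: str, version: int, actions: list) -> None:
    """Write one Delta-style commit file, e.g. 00000000000000000000.json."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def live_files(log_dir: str) -> set:
    """Replay the log in order: 'add' registers a data file, 'remove' drops it."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files

log = tempfile.mkdtemp()
commit(log, 0, [{"add": {"path": "part-0000.parquet"}}])
commit(log, 1, [{"add": {"path": "part-0001.parquet"}},
                {"remove": {"path": "part-0000.parquet"}}])
print(live_files(log))  # {'part-0001.parquet'}
```

Because table state lives in this append-only log rather than in the file listing itself, a format like Delta can offer atomic commits and time travel on top of plain object storage.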
Open Table Format
Delta Lake was introduced by Databricks in 2019 as an open table format, open-sourced under delta.io, while Databricks maintained a proprietary version with additional features. Over time, it drew criticism from the Iceberg community as being “not really open source,” but in 2022, Databricks released Delta Lake 2.0 as fully open source and retired its closed-source twin.
The use of Delta Lake tables within Databricks improves the performance of Databricks’ computation engine by up to 8 times.
Following Databricks’ acquisition of Tabular in June 2024, which brought in a team of Apache Iceberg core contributors, including the format’s creator, Ryan Blue, Apache Iceberg is expected to become a first-class citizen within the Databricks platform in the next couple of years. If you have to choose now, however, Delta Lake is your safest bet with Databricks.
Metastore
Unity Catalog is a main component of the Databricks platform, and with the move towards serverless, it has become pivotal: it essentially provides both a data catalog (such as Alation, for example) and a metastore.
In the context of our conversation, it’s the metastore that we care about the most, and Unity Catalog was built to serve as a metastore for Delta tables. So while other formats are supported within Unity Catalog, it performs best when using Delta Lake.
For unstructured data, Unity Catalog offers a capability called Volumes, which lets you manage the files themselves, or at least their metadata.
This is extremely useful for machine learning projects. For example, if you’re working on a computer vision model, you can select all the images labeled with cats, together with all the metadata associated with those images.
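The value here is that the selection runs against tracked metadata rather than against the raw files. In Databricks you would express this as a query over a Volume; the plain-Python stand-in below only illustrates the idea, and the paths, field names, and labels are hypothetical, not the real Unity Catalog schema.

```python
# Hypothetical per-file metadata, as a catalog might track it for a Volume
# of training images (field names are illustrative, not a real schema).
files = [
    {"path": "/Volumes/ml/raw/img_001.jpg", "labels": ["cat", "sofa"], "width": 640},
    {"path": "/Volumes/ml/raw/img_002.jpg", "labels": ["dog"], "width": 800},
    {"path": "/Volumes/ml/raw/img_003.jpg", "labels": ["cat"], "width": 1024},
]

# Select every image labeled "cat", keeping the associated metadata.
cats = [f for f in files if "cat" in f["labels"]]
print([f["path"] for f in cats])
```

The same filter could just as easily combine labels with technical metadata (image size, capture date), which is what makes metadata-driven selection useful for building training sets.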
Computation Engine
This is Databricks’s core capability. Once “The Spark Company” and today the “proprietary, highly performant Spark” company, the platform provides distributed computation capabilities with SQL, Java, Scala, Python, and other interfaces.
Data Version Control
Since Databricks is an organization-wide data platform, its users naturally gain significant data management value from data version control.
lakeFS provides value for each of the personas using Databricks:
Data engineers
Isolated development/test environments
By leveraging lakeFS branches to create separate dev/test environments, you can cut testing time by 80%. Clean and organize your data, address outliers, fill in missing values, and perform other tasks to ensure your data pre-processing pipelines are reliable and deliver high-quality results.
Promoting only high-quality data to production
By using lakeFS hooks to implement the Write-Audit-Publish pattern for data, quality validation checks can be automated.
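The Write-Audit-Publish flow can be sketched as: write new data to an isolated branch, audit it with a validation check, and publish (merge to main) only if the check passes. In lakeFS, the audit step would run as a pre-merge hook; the in-memory sketch below only mimics that control flow, and all function names here are hypothetical rather than the lakeFS API.

```python
# In-memory sketch of Write-Audit-Publish. A failed audit blocks the
# "merge", so bad records never reach the production branch.
branches = {"main": []}  # branch name -> list of records

def write(branch: str, records: list) -> None:
    """Write to an isolated branch, starting from the current state of main."""
    branches.setdefault(branch, list(branches["main"])).extend(records)

def audit(records: list) -> bool:
    """Example quality check: no record may have a null 'id'."""
    return all(r.get("id") is not None for r in records)

def publish(branch: str) -> bool:
    """Merge the branch into main only if the audit passes (a pre-merge hook)."""
    if not audit(branches[branch]):
        return False          # hook fails -> merge blocked, main untouched
    branches["main"] = branches[branch]
    return True

write("staging", [{"id": 1}, {"id": None}])
assert publish("staging") is False      # the bad record blocks the merge
write("fixed", [{"id": 1}, {"id": 2}])
assert publish("fixed") is True         # the clean branch reaches main
```

The key property is that consumers reading main never observe the unaudited data, because the write happens on a branch that only becomes visible at publish time.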
Correct erroneous data using production rollback
Commits let you save complete, consistent snapshots of your data and enable you to roll back to earlier commits if data quality issues arise.
Data scientists
Execute local data checkouts
Maintain synchronization between remote and local locations by cloning selected parts of a lakeFS repository to your local environment.
Experiment duplication
To efficiently evaluate and choose the best experiment, use lakeFS branches to run experiments in parallel with zero-copy clones in a completely deduplicated data lake.
Reproducible model training & feature engineering
Submit your experiment findings, and then use the lakeFS Git integration to replicate any experiment using the appropriate model weights, code, and data versions.
Data Ops
Data exchange
Give your staff the resources they need to work together and communicate easily about the data they use. Use Git-like semantics to share a commit ID or a branch of a data repository to indicate the data version being used or shared.
Data auditing
See who made modifications to the data and when. With a thorough audit of all data-related operations across all environments, you will be able to trace back any delivered result or experiment.
Reduced storage costs
Keep your data lake from turning into a data swamp. Making physical copies raises storage costs, not to mention contaminates the data lake. Using lakeFS, data practitioners get an isolated view of the lake through a zero-copy branch, reducing both cost and contamination.
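A zero-copy branch is cheap because, in a Git-like data version control system, a branch is only a mapping from logical paths to stored objects, not a copy of the objects themselves. The sketch below shows that idea in plain Python under that assumption; the path and object names are made up for illustration.

```python
# A "repository": objects are stored once, branches are tiny path->object maps.
objects = {"obj-a": b"<large parquet file>", "obj-b": b"<another file>"}

main = {"sales/2024.parquet": "obj-a", "sales/2023.parquet": "obj-b"}

# Branching copies only the small mapping, never the underlying objects.
dev = dict(main)

# A write on the branch adds one new object; main still points at the old one.
objects["obj-c"] = b"<reprocessed file>"
dev["sales/2024.parquet"] = "obj-c"

assert main["sales/2024.parquet"] == "obj-a"   # production is unaffected
assert len(objects) == 3                        # only the changed file is stored
```

Storage grows only with the data that actually changes on the branch, which is why isolated environments stop being a cost problem.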
lakeFS is the only data version control system to support all flavors of the Databricks platform, including serverless and Unity Catalog.
AWS infrastructure for data lake management

The AWS ecosystem is huge, so we selected a specific case but will also mention a bunch of other technologies as we go along.
Open Table Format
When it comes to open table formats, in 2020, AWS bet on Apache Hudi and then realized that this choice was somewhat limited. Over time, AWS supported Delta Lake and Apache Iceberg in all its data services.
As one would expect from a cloud provider, AWS supports everything, including unstructured data. You can choose a format without expecting it to limit your ability to select the rest of your stack.
Metastore
The metastore provided by AWS is AWS Glue. It provides metastore capabilities and supports all the open table formats mentioned above, as well as Hive tables.
The service is limited in both scale and performance; it’s relatively hard to scale, even if you’re willing to pay a lot.
For some computation engines on AWS, such as Athena, AWS Glue is a must, while for others, one can use other metastores or no metastore at all; for example, while running Spark on AWS EMR.
If you use EMR, you can also use any Iceberg catalog or Hive Metastore. However, Amazon Athena, which behaves like a closed database engine, works only with Glue. To see your Iceberg tables or query them through Athena, you must register them in the Glue catalog.
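Registering an Iceberg table for Athena typically amounts to running a `CREATE TABLE` statement with `'table_type' = 'ICEBERG'` in its `TBLPROPERTIES`, which creates the entry in the Glue catalog. The helper below only builds such a DDL string; the database, table, and S3 location are placeholders, and in practice you would submit the statement through the Athena console or an API client.

```python
def athena_iceberg_ddl(database: str, table: str, columns: dict, location: str) -> str:
    """Build Athena DDL that registers an Iceberg table in the Glue catalog.
    (All names and the bucket path passed in are placeholders, not real resources.)"""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
    return (
        f"CREATE TABLE {database}.{table} ({cols}) "
        f"LOCATION '{location}' "
        "TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )

ddl = athena_iceberg_ddl(
    "analytics", "events",
    {"event_id": "bigint", "ts": "timestamp"},
    "s3://example-bucket/warehouse/events/",
)
print(ddl)
```

Once the Glue entry exists, Athena can both read and write the Iceberg table, and other Glue-aware engines see the same table definition.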
Computation Engine
The computation engine we picked is Amazon Athena. Other AWS compute engines will also work with the Glue catalog.
Other AWS tools are more open, while Athena relies on the Glue catalog as its schema catalog. Glue ETL is another option for a computation engine on AWS that relies on Glue as its metastore. It is friendlier to beginner data engineers than EMR.
Data Version Control
In any data architecture you choose to build using AWS services, you’ll need to manage your data lake data the way you manage the code of the data/ML/AI pipelines that you run on top of it. Using lakeFS you can work with a versioned Glue catalog and expose data repositories to your data consumers using zero copy clones.
Building a management layer with Snowflake

Snowflake, as we know it, is a database. Its architecture splits storage and computation, gaining a lot in the process. It’s a powerful, and costly, compute engine, making it the first analytics database that operates at data lake scale. As we explained, this means it manages a schema in its own catalog. So why is it here?
Just as Databricks pushes its serverless offering that relies on Unity Catalog to sell a database experience, Snowflake would like its compute engine to run over the data lake using another metastore, to compete with Databricks in large-scale data lake use cases that don’t require all the guarantees of an analytics database.
Snowflake endorsed Apache Iceberg a couple of years ago, essentially supporting Apache Iceberg tables in its catalog. More recently, it released a new Apache Iceberg catalog, Polaris, as open source.
In an architecture that relies on Apache Iceberg, Snowflake becomes a compute engine working over the Polaris catalog, allowing you to read and write Iceberg tables that are not managed within Snowflake.
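Concretely, exposing an externally managed Iceberg table to Snowflake involves a `CREATE ICEBERG TABLE` statement that points at a catalog integration rather than at Snowflake-managed storage. The helper below only assembles such a statement as a string; the parameter names follow Snowflake’s externally-managed Iceberg syntax as we understand it, and both the integration name and table names are hypothetical, so verify the exact clauses against Snowflake’s documentation.

```python
def snowflake_external_iceberg_ddl(table: str, catalog_integration: str,
                                   catalog_table: str) -> str:
    """Sketch of Snowflake DDL for an Iceberg table managed by an external
    catalog such as Polaris. Clause names are assumptions; check the docs."""
    return (
        f"CREATE ICEBERG TABLE {table} "
        f"CATALOG = '{catalog_integration}' "
        f"CATALOG_TABLE_NAME = '{catalog_table}'"
    )

ddl = snowflake_external_iceberg_ddl("events", "polaris_int", "events")
print(ddl)
```

After such a table is defined, Snowflake queries run against data whose source of truth stays in the external Iceberg catalog, which is exactly the decoupling this architecture is after.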
Open Table Format
Apache Iceberg is probably the most popular open table format outside the Databricks ecosystem. It was created by Ryan Blue and his team at Netflix in 2017 and donated to the Apache Software Foundation soon after.
Many large organizations contributed to its code base. It was adopted first by enterprises with large data operations and slowly and surely by smaller organizations, and by 2022, it had become the de facto standard in the eyes of many.
Metastore
Polaris is an exciting new development. Announced by Snowflake and subsequently released as open source, it will be interesting to see how useful the catalog becomes, perhaps to the point that teams who have only been considering Databricks will now be open to using Snowflake.
Computation Engine
Snowflake!
Data Version Control
While Snowflake allows time travel at the level of a single table or a full schema, it does NOT support Git-like operations, such as branch, commit, or merge, over a user-defined repository.
Since Snowflake users are data professionals of all types, they benefit from the additional capabilities provided by a data version control system, as elaborated in the use cases discussed above.
lakeFS supports Apache Iceberg catalogs and is best suited to serve as the data version control system for this architecture.
Wrap up
You can add lakeFS to the architecture of your choice and make your data lake manageable. If you’d like to learn more about data version control using a practical example, take a look at this article, which shows how to achieve data lineage for your data lake with lakeFS step by step: How Data Version Control Provides Data Lineage for Data Lakes.


