A data lake is often implemented using object storage, and our data resides in objects (files) that we can access through the storage API. Data professionals in analytics and ML are used to consuming data from a database, where there is an abstraction of a table one can access using SQL. They will find it uncomfortable, to say the least, to use a storage API. Here is where a metastore comes to the rescue. It allows us to expose a dataset in the object store as a table. Access to the data goes through the metastore as if the data were saved in tables, while the metastore translates each request into the required actions on the storage.
Metastores are sometimes referred to as catalogs (e.g. the Iceberg catalog), not to be confused with the cross-organizational data catalogs used for discoverability and governance of all organizational data.
For the rest of this blog post, we will call them all metastores.
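To make the idea concrete, here is a minimal sketch of what a metastore does conceptually. `ToyMetastore`, `TableEntry` and `plan_scan` are hypothetical names invented for illustration, not the API of any real metastore, and the bucket and table names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """What a metastore records about one table (heavily simplified)."""
    location: str     # object-store prefix that holds the data files
    file_format: str  # e.g. "parquet"
    columns: tuple    # (name, type) pairs describing the schema


class ToyMetastore:
    """Hypothetical in-memory stand-in for a metastore, for illustration only."""

    def __init__(self):
        self._tables = {}

    def register(self, name, entry):
        self._tables[name] = entry

    def resolve(self, name):
        # A query engine asks the metastore where and how a table is stored.
        return self._tables[name]


def plan_scan(metastore, table_name, object_uris):
    """Translate 'read this table' into the list of objects to fetch."""
    entry = metastore.resolve(table_name)
    suffix = "." + entry.file_format
    return sorted(
        uri for uri in object_uris
        if uri.startswith(entry.location) and uri.endswith(suffix)
    )


metastore = ToyMetastore()
metastore.register(
    "sales.orders",
    TableEntry(
        location="s3://example-bucket/warehouse/orders/",
        file_format="parquet",
        columns=(("order_id", "bigint"), ("amount", "double")),
    ),
)

# Objects as the storage API would list them (made-up names).
objects = [
    "s3://example-bucket/warehouse/orders/part-0000.parquet",
    "s3://example-bucket/warehouse/orders/part-0001.parquet",
    "s3://example-bucket/warehouse/orders/_SUCCESS",
    "s3://example-bucket/warehouse/customers/part-0000.parquet",
]

files_to_read = plan_scan(metastore, "sales.orders", objects)
```

The user only ever names `sales.orders`; the mapping from that name to objects under a storage prefix is the part the metastore owns.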
When we last shared our thoughts on the Hive Metastore two years ago, we observed it was the last component standing from the glorious Hadoop ecosystem. Even YARN didn’t make it past 2020 🙂. We also observed it didn’t age well. We speculated on what might replace it, but we certainly didn’t anticipate the vendor lock-in that was coming.
In this blog we will consider in what sense Hive’s Metastore is “open” and why we believe the leading candidates to replace it are closed, in a way that is meant to limit us to using a specific vendor’s data ecosystem.
Hive Metastore – In what sense is it open?
Granted, asking in what sense an OSS Apache project is open is strange. But from the standpoint of an architect who needs to design a data / ML / AI architecture, it is not necessarily the code that has to be open but rather the interface it exposes.
There are three main things that allow flexibility in using a metastore, that make it compatible anywhere:
- Standard table formats are supported
- The interface to the metastore is a standard
- The storage layer can be configured
Let’s dive into each one and explain how it allows us a healthy separation of concerns in a data architecture.
Standard table formats are supported
Once upon a time, Hive Metastore (HMS) was the only metastore for the Hadoop universe, and it exposed Hive-style tables, as Hive was the analytics engine of that universe. The universe has greatly changed since, but contributors adapted HMS to work with Delta tables, Apache Hudi tables and Apache Iceberg tables.
Open table formats (OTF) are now a standard, and any replacement for HMS should support these three to remain unopinionated about which one you use.
Since all three are backed by commercial companies who provide metastores, any opinion here is essentially a vendor bias.
The interface to the metastore is a standard
Hive Metastore set the standard. Since it (was) widely used in data architectures, any metastore that aims to replace it supports its interface. This allows the technologies vying for the throne to be adopted with little change on the part of the users who consume tables by accessing the metastore.
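As a rough illustration of why a shared interface matters, here is a sketch of the client-side configuration for a Spark application that talks to an HMS-compatible service. `hms_client_conf` is a helper we made up and both host names are hypothetical; the two Spark settings are the usual ones for pointing Spark at a Hive Metastore over Thrift, but check them against your Spark version.

```python
def hms_client_conf(metastore_uri):
    """Spark settings for reading table metadata from an HMS-compatible service."""
    if not metastore_uri.startswith("thrift://"):
        raise ValueError("expected a Thrift URI such as thrift://host:9083")
    return {
        # Use the Hive catalog implementation rather than the in-memory one.
        "spark.sql.catalogImplementation": "hive",
        # Where the metastore service listens (9083 is the usual HMS port).
        "spark.hadoop.hive.metastore.uris": metastore_uri,
    }


# Hypothetical endpoints: the current metastore and a candidate replacement.
current = hms_client_conf("thrift://hive-metastore.internal.example:9083")
replacement = hms_client_conf("thrift://other-metastore.internal.example:9083")

# Only the endpoint differs; the rest of the client setup is untouched.
changed_keys = sorted(k for k in current if current[k] != replacement[k])
```

If the replacement really speaks the HMS interface, the migration on the client side comes down to that one URI.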
The storage layer can be configured
A metastore provides an abstraction of a table over storage.
It represents files from the storage as tables, and allows its users to mostly forget they are running on top of files and objects, and feel like they are managing tables.
Should it be opinionated about the protocol it supports to access the storage? Maybe.
Supporting an S3 interface to object storage is probably aligned with supporting what is by now a standard. Should it be opinionated about the vendor providing the object storage? Not really. An S3 interface should leave us free to choose, for example, between Amazon S3, MinIO, Ceph, GCS, or storage wrappers such as Alluxio or lakeFS.
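Here is a small sketch of what that freedom looks like from the client side, assuming a Spark application that reaches storage through the Hadoop S3A connector. `s3a_conf` is a made-up helper and the MinIO address is hypothetical; `fs.s3a.endpoint` and `fs.s3a.path.style.access` are S3A options, but verify them against your Hadoop version.

```python
def s3a_conf(endpoint=None, path_style=False):
    """S3A settings for an S3-compatible store, passed to Spark as Hadoop options."""
    conf = {
        # Many self-hosted S3-compatible stores expect path-style requests.
        "spark.hadoop.fs.s3a.path.style.access": str(path_style).lower(),
    }
    if endpoint is not None:
        # Leave unset to use the connector's default (Amazon S3).
        conf["spark.hadoop.fs.s3a.endpoint"] = endpoint
    return conf


# Same table location in both cases; only the storage behind it changes.
table_location = "s3a://example-bucket/warehouse/orders/"

amazon_s3 = s3a_conf()
minio = s3a_conf(endpoint="http://minio.internal.example:9000", path_style=True)
```

A metastore that stays unopinionated about the storage vendor only needs to store `table_location`; which service answers the S3 calls is a deployment setting.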
Now, let’s put the main contestants to the test
There are several metastores available. Sometimes they provide more functionality than the original HMS, for example around authentication and authorization.
Glue Data Catalog By AWS
What is Glue Data Catalog?
According to the AWS documentation:
“The AWS Glue Data Catalog is a centralized metadata repository for all your data assets across various data sources. It provides a unified interface to store and query information about data formats, schemas, and sources.”
How “open” is it?
The good news is that the first two criteria are met by the Glue Data Catalog.
It took a while, but all standard table formats are supported, including Hive tables, Apache Hudi, Apache Iceberg and Delta Lake.
Glue supports the standard HMS interface, so if you wish to migrate in either direction, no changes are required on the application side when accessing the metastore.
Where the Glue Data Catalog fails to allow freedom of choice is the storage layer. Glue works only over Amazon S3. If you choose to use it, you cannot use other S3-compatible storage such as MinIO, Alluxio or lakeFS.
Unity Catalog by Databricks
What is Unity Catalog?
Quoting the Databricks documentation:
“Databricks Unity Catalog is the industry’s first unified governance solution for data and AI on the lakehouse. With Unity Catalog, organizations can seamlessly govern their structured and unstructured data, machine learning models, notebooks, dashboards and files on any cloud or platform. Data scientists, analysts and engineers can use Unity Catalog to securely discover, access and collaborate on trusted data and AI assets, leveraging AI to boost productivity and unlock the full potential of the lakehouse environment. This unified approach to governance accelerates data and AI initiatives while ensuring regulatory compliance in a simplified manner.”
So it’s much more than just a metastore. But is it a good metastore?
How “open” is it?
Unity Catalog supports only Delta tables; other open table formats cannot be used when choosing Unity.
There is some interoperability with Apache Iceberg through Databricks’ UniForm layer, which translates between Delta Lake and Iceberg metadata, but directly managing Iceberg tables is not supported by Unity.
Unity Catalog supports the HMS interface, so no changes are required to clients accessing the data when moving in or out of Unity.
As with the Glue Data Catalog, when using Unity Catalog you will not be able to choose your storage. Depending on the cloud provider your Databricks deployment sits on, you’ll get the storage supported by that cloud provider. Read-only federation is supported for specific external sources, in line with the federation promises published at Data+AI 2023.
Snowflake
Snowflake is a data warehouse. It is the first vendor to offer a distributed compute layer separated from the storage layer.
One cannot argue with the success of this approach, nor would one expect the catalog of a closed application such as a data warehouse to be open. But Snowflake took a turn toward the open ecosystem by supporting Apache Iceberg tables. This is an important step in managing pipelines that run Spark over the storage and then register the results as Snowflake tables, but these are still external tables. The Snowflake Catalog is therefore closed on all three criteria we defined.
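The verdicts from the three reviews can be collected into a small scorecard. This is only a restatement of the conclusions reached in this post, not an independent benchmark, and the `Openness` class and its field names are our own invention.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Openness:
    """The three criteria used in this post, as yes/no answers."""
    table_formats: bool   # are all standard open table formats supported?
    hms_interface: bool   # is the standard HMS interface exposed?
    storage_choice: bool  # can the storage layer be chosen freely?

    def closed_on(self):
        # Names of the criteria this metastore does not meet.
        return [name for name, met in asdict(self).items() if not met]


# Verdicts as stated in the reviews in this post.
REVIEWS = {
    "Glue Data Catalog": Openness(True, True, False),  # first two criteria met
    "Unity Catalog": Openness(False, True, False),     # Delta tables only
    "Snowflake catalog": Openness(False, False, False),
}

lock_in = {name: review.closed_on() for name, review in REVIEWS.items()}
```

None of the three comes out with an empty list.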
Where is the vendor lock?
As we can see from the review of metastores above, the vendor lock-in is in the storage or in the table format. However, there is another form of vendor lock-in that dictates the choice of metastore: the compute engine.
If you wish to use Snowflake as your compute engine, you cannot use an external metastore. For example, if you choose to manage your Apache Iceberg tables in Tabular as the metastore, you will not be able to write to those tables from Snowflake.
You can read from them by exporting them from Tabular to become external tables in Snowflake. Same goes for the Nessie Project. If you want to manage reads and writes to Snowflake in a Metastore/Catalog, you’ll have to use Snowflake’s catalog.
Databricks’ ecosystem is more open, unless you go serverless. Databricks provides an Apache Spark-based ecosystem that is compatible with HMS on all its products and can therefore work with any compatible Metastore.
That said, its serverless products, such as Databricks SQL serverless, are strongly coupled to Unity Catalog. They are therefore subject to the limitations mentioned above and cannot be used with an external metastore.
It is a bit surprising to find that AWS turns out to be the most open when it comes to the selection of metastore and compute engine. EMR, Glue ETL, SageMaker and Redshift Spectrum can use any metastore that is HMS compatible. Serverless services such as Athena are biased towards the Glue Data Catalog as a native metastore, but other metastores can develop a connector and support Athena.
When designing a data architecture, the selection of a compute engine will limit the choice of metastore, and vice versa. Selecting a metastore will limit the choice of compute engines. The limitation may be within the same vendor, but is more dramatic across vendors.
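The coupling described in this section can be sketched as a simple compatibility check. The table below encodes only the claims made in this post. Athena is left out because its support depends on whether a connector exists, and `blocked_engines` and the metastore name `external-metastore` are hypothetical.

```python
ANY_HMS_COMPATIBLE = "any HMS-compatible metastore"

# Which metastore each compute engine can work with, per this post.
ENGINE_REQUIREMENTS = {
    "Snowflake": "Snowflake catalog",
    "Databricks SQL serverless": "Unity Catalog",
    "Databricks (Spark-based)": ANY_HMS_COMPATIBLE,
    "EMR": ANY_HMS_COMPATIBLE,
    "Glue ETL": ANY_HMS_COMPATIBLE,
    "SageMaker": ANY_HMS_COMPATIBLE,
    "Redshift Spectrum": ANY_HMS_COMPATIBLE,
}


def blocked_engines(engines, metastore, hms_compatible):
    """Return the engines that cannot be used with the chosen metastore."""
    blocked = []
    for engine in engines:
        required = ENGINE_REQUIREMENTS[engine]
        if required == ANY_HMS_COMPATIBLE:
            usable = hms_compatible
        else:
            # The engine is tied to one specific metastore.
            usable = metastore == required
        if not usable:
            blocked.append(engine)
    return blocked


# A hypothetical architecture built around an external, HMS-compatible metastore.
blocked = blocked_engines(
    ["EMR", "Snowflake", "Databricks SQL serverless"],
    metastore="external-metastore",
    hms_compatible=True,
)
```

Here `blocked` contains Snowflake and Databricks SQL serverless, which is the cross-vendor limitation described above.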
Since its inception, Hive Metastore held the promise of an unopinionated catalog for the Hadoop ecosystem. The separation of concerns between the catalog and the tools used for data analysis allowed the data ecosystem to evolve and new technologies to emerge, while preserving aspects of the existing architecture.
Contestants to replace HMS no longer hold this promise, nor do the data analysis tools offered by cloud providers and large data technology players. These players aren’t playing to collaborate and standardize, but rather to own as large a part of the data architecture as possible in a “winner takes all” approach, based on their own metastores. Ultimately, HMS is becoming a thing of the past, and with it, our freedom to avoid vendor lock-in.