In this article we discuss two prominent metadata catalogs: the popular Hive Metastore, and its Amazon counterpart, AWS Glue. We will start by introducing the concept of metadata catalogs, and then explain the benefits and pains of using each tool.
Introduction to Metadata Catalogs
Distributed storage systems like HDFS gained massive popularity in the 2010s. They were designed to run on low-cost hardware, which led companies to use them for storing large datasets. Hive was subsequently developed to allow SQL queries on HDFS-stored files. As suggested by its name, HDFS is just a file system, so Hive needed an additional layer of information to be able to query the data. That is, it required a metadata layer, containing the information that identifies and describes the data. Let’s look at an example.
Consider the following CSV file with a list of paintings (spaces were added for readability):
"Frida Kahlo", "The Two Fridas", 1939
"Diego Velázquez", "Las Meninas", 1656
"Élisabeth Vigée Le Brun", "Self-portrait in a Straw Hat", 1782
"Vincent van Gogh", "The Starry Night", 1889
"Pablo Picasso", "Guernica", 1937
Suppose we need to respond to an SQL query, asking for the list of painters with paintings from after 1900:
SELECT painter FROM paintings WHERE year > 1900
For a machine to be able to understand this simple query, we need to introduce it to the concepts of tables, columns and data types. It also needs some more information:
- The file represents a group of records from a table named paintings.
- The table has three columns: painter, name, and year.
- The columns are ordered.
Now, to answer the query, the code can go over the file. For each record, it looks at the third column and checks whether the value is greater than 1900. If it is, it returns the value of the first column.
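The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Hive works internally: the TABLE_METADATA dictionary and the function names are made up for this example, and the file content is inlined as a string (without the extra spaces) instead of being read from storage.

```python
import csv
import io

# Hypothetical metadata for the file: the table name plus its ordered,
# typed columns. This is the information a metadata catalog would hold.
TABLE_METADATA = {
    "name": "paintings",
    "columns": [("painter", str), ("name", str), ("year", int)],
}

# The raw file content. On its own it is just text with commas.
CSV_DATA = '''"Frida Kahlo","The Two Fridas",1939
"Diego Velázquez","Las Meninas",1656
"Élisabeth Vigée Le Brun","Self-portrait in a Straw Hat",1782
"Vincent van Gogh","The Starry Night",1889
"Pablo Picasso","Guernica",1937
'''


def read_table(raw, metadata):
    """Yield each record as a dict, using the metadata to name and type the fields."""
    for row in csv.reader(io.StringIO(raw)):
        yield {
            column: cast(value)
            for (column, cast), value in zip(metadata["columns"], row)
        }


def painters_after(raw, metadata, year):
    """Rough equivalent of: SELECT painter FROM paintings WHERE year > <year>"""
    return [
        record["painter"]
        for record in read_table(raw, metadata)
        if record["year"] > year
    ]


print(painters_after(CSV_DATA, TABLE_METADATA, 1900))
```

Without TABLE_METADATA, the code would have no way of knowing that the third field is a number called year, or that the first one is the painter.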
Although this example is very simplistic, it makes it clear that programs that consume data stored in files also need access to metadata. In Hive, the component that provides this is Hive Metastore. It is the prism through which Hive looks at a group of files and sees it more like a database. It brings the notions of tables, columns and relations to the world of distributed storage. Many tools apart from Hive now use Hive Metastore to discover data before processing it. Examples include the widely used Spark and Presto.
Today, with cloud-based object stores (like AWS S3, Azure Blob and Google Cloud Storage) becoming affordable, companies are shifting to using them instead of HDFS. But the problem remains: in the end, all that object stores give you is a file system. So, while many organizations stopped using Hadoop for storage, they still need Hive Metastore to be able to query the data.
In 2017, Amazon launched AWS Glue, which offers a metadata catalog among other data management services. It has all the basic functionality of Hive Metastore, such as tables, columns and partitions, and on top of that it is fully managed. Sounds perfect, right? Well, like all things AWS, Glue makes your life easier in some ways, but adds uncertainties in others.
While developing lakeFS, we worked closely with many design partners to adapt our product to their metadata management use cases. The following sections provide a clear view of what to expect when using Hive Metastore or AWS Glue.
Hive Metastore vs AWS Glue comparison: Which is right for you?
Let’s start with the obvious. Hive Metastore is a service that needs to be deployed. It also needs to be backed by a relational database. AWS Glue takes this infrastructure off your plate, and provides a serverless solution with an API compatible with Hive Metastore. It also offers a simple user-interface where you can see, add and edit tables. If you only need the basics, this is a major advantage for Glue.
This section comes down to the question of how deep you are buried in each of the ecosystems – AWS and Hadoop.
Tools like Hive, Spark and Presto have all been used extensively with the Hive Metastore. They are open-source products with strong communities, and many of them are under the wing of Apache. If your organization uses a traditional Hadoop stack, consider sticking with Hive Metastore.
Conversely, if you are planning to use AWS Athena, Amazon’s serverless query service, you should know that it is much easier to use with Glue. In fact, support for Hive Metastore in Athena has only recently been added, so using them together is new territory. One can only assume that in the future, additional AWS products will rely on Glue as their catalog.
Hive Metastore has a longer history and an active community, so it has gathered lots of features along the way. Some of these features are not implemented in Glue, making it unusable for organizations that rely on them. Here are some examples of missing features:
- Column statistics allow you to get insight regarding your data without actually having to read it, for example: min/max/average values, number of distinct values and other useful information. Many organizations rely on column statistics in order to optimize their queries.
- Hive temporary tables are a nice way to store intermediate results of complex calculations. These tables are deleted automatically by Hive at the end of the session.
- You can use Hive constraints when creating a table to improve query performance. For example, you can declare that a field is a primary key. This way, the Hive engine saves time by not looking for duplicates on this field. It’s important to note that constraints are not enforced: they are assumed to be true, and this assumption is used for optimization. Hive constraints are not supported in AWS Glue.
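To make the column statistics example more concrete, here is a toy sketch of how min/max values let a query engine skip data it does not need to read. The ColumnStats class, the partition names and the numbers are all hypothetical, and a real optimizer does considerably more than this.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Hypothetical statistics the catalog keeps for the `year` column."""
    partition: str
    min_value: int
    max_value: int
    distinct_values: int


# Made-up statistics for two partitions of the paintings table.
YEAR_STATS = [
    ColumnStats("batch=1", min_value=1656, max_value=1782, distinct_values=2),
    ColumnStats("batch=2", min_value=1889, max_value=1939, distinct_values=3),
]


def partitions_to_read(stats, lower_bound):
    """For a filter `year > lower_bound`, keep only partitions that can match."""
    return [s.partition for s in stats if s.max_value > lower_bound]


print(partitions_to_read(YEAR_STATS, 1900))
```

For the filter year > 1900, the first partition is ruled out from its statistics alone, so none of its files have to be opened.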
Additional features that are currently not supported by Glue are transactions and authorization. If you think you can benefit from one of these features, then Glue is not yet a mature enough product for you.
One feature that stands out in AWS Glue allows you to launch crawlers that will scan your data and create tables and metadata for you. While a few companies mentioned performance issues when crawling large datasets, it’s a very strong feature: creating the metadata manually can be tedious work, and this may save you precious time getting started. After the crawler has finished, your tables will be ready to use. It seems that not all formats are supported by the crawlers: when we tested them, they worked great with regular Parquet files, but couldn’t make sense of Hudi’s format.
A core concept in metadata catalogs is partitions. They provide a way to divide a table according to the value of a specific column. A typical column to partition by would be the date, so that records from the same day are stored under the same path. Partitions allow you to answer questions like “where is the data for Saturday two weeks ago?” without having to do a full scan of your storage (which is probably cost-prohibitive).
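As a rough sketch of this idea, the snippet below keeps a mapping from partition values to storage paths and uses it to find one specific day. The bucket name, the dates and the function names are hypothetical. The date=... directory naming follows the layout commonly used for Hive-style partitions, and "Saturday two weeks ago" is interpreted here as the closest Saturday on or before the day 14 days back.

```python
from datetime import date, timedelta

TABLE_ROOT = "s3://example-bucket/events"  # hypothetical table location


def partition_path(table_root, day):
    """Hive-style layout: one directory per value of the partition column."""
    return f"{table_root}/date={day.isoformat()}/"


def build_catalog(table_root, first_day, number_of_days):
    """Map each partition value to the path holding that day's records."""
    days = (first_day + timedelta(days=i) for i in range(number_of_days))
    return {day: partition_path(table_root, day) for day in days}


def saturday_two_weeks_ago(today):
    """Go back 14 days, then step back to the closest Saturday on or before it."""
    day = today - timedelta(days=14)
    # date.weekday() returns 0 for Monday and 5 for Saturday.
    return day - timedelta(days=(day.weekday() - 5) % 7)


catalog = build_catalog(TABLE_ROOT, date(2021, 2, 1), 60)
target = saturday_two_weeks_ago(date(2021, 3, 17))
print(target, "->", catalog[target])
```

The lookup touches a single dictionary entry. The query engine then reads only the files under that one path, instead of listing and scanning the whole table.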
There is a limit to how many partitions catalogs can handle. A ballpark often mentioned as the maximum number of partitions for Hive Metastore is 10,000. This number makes sense if you partition by date as mentioned above: you would be able to hold over 27 years of data. However, engineers can’t help their scientific nature and always try to push the boundaries. Companies we work with often have many more than 10,000 partitions. Some have about half a million of them, some even more. It may not be a problem when backing Hive Metastore with a PostgreSQL database, but a few of our design partners struggled with performance issues when trying the same with Glue. So, before using Glue in production, make sure to test it with a number of partitions close to the real thing.
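The arithmetic behind these numbers is easy to check. In the sketch below, only the 10,000 ballpark comes from the text above. The second scenario (three years of data partitioned by day, hour and 20 regions) is a made-up example of how a table can reach half a million partitions.

```python
import math

PARTITION_BALLPARK = 10_000  # the often-mentioned ballpark for Hive Metastore


def years_covered(partition_limit, partitions_per_day=1):
    """How many years of data fit within a given number of partitions."""
    return partition_limit / (partitions_per_day * 365)


def partition_count(days, *other_cardinalities):
    """Total partitions when partitioning by day and, optionally, more columns."""
    return days * math.prod(other_cardinalities)


# One partition per day: the ballpark lasts a bit over 27 years.
print(round(years_covered(PARTITION_BALLPARK), 1))

# Hypothetical table: 3 years of data, partitioned by day, hour and 20 regions.
print(partition_count(3 * 365, 24, 20))
```

Each additional partition column multiplies the total, which is why tables partitioned by more than just the date pass the ballpark so quickly.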
On the other hand, a high load on the Hive Metastore itself (as opposed to the underlying database) can also cause issues. This was described by Netflix as one of their motivations to introduce Metacat, a metadata service that talks directly to the Metastore’s database, rather than its API.
Being much simpler to spin up and use, AWS Glue is probably the way to go for small startups that need to bootstrap quickly. For larger companies with substantial data operations and many partitions, Hive Metastore may be the more suitable option, considering its performance has been better tested. Other factors, like which other products you are planning to integrate and which catalog features are required, also come into play.
A Word About Data Discovery
A problem closely related to metadata management is that of data discovery. As data grows, it becomes harder for everyone to know which tables are available, what kind of information they contain and how they interact with each other. While metadata catalogs face the producer side, i.e. the side that prepares the data, data discovery tools face the consumer side, i.e. data scientists, business users, and other people who can benefit from the data. These tools allow the organization to leverage big data by collecting it from different sources and exposing it to everyone who needs access to it. Datasets are exposed through tools that allow search, navigation and visualization.
Data discovery is becoming an increasingly common concern, and many companies have developed their own tools to tackle this challenge. Those include Netflix’s Metacat, LinkedIn’s WhereHows and later DataHub, WeWork’s Marquez, and countless others. It’s entirely possible that in the future AWS Glue and other cloud-based metadata catalogs will also include data discovery services.