This post is the second in a series about Hive Metastore. In the last post, Einat covered its history, the problems that it solves – and questioned whether or not it will survive the next revolution.
For this post, I’d like to take a closer look at Hive Metastore as it stands today: specifically, how it solves a problem (one that it sort of created itself, many years ago), and where it falls short amid the shifting tides in computing — an explosion in the amount of data stored and analyzed, and the ubiquity of cloud computing.
Designing a SQL Table in a Filesystem Directory
Eleven years ago, Apache Hive was born with an ambitious mission: Allow querying massive amounts of data using plain SQL instead of writing complex MapReduce jobs.
For this to work efficiently, Hive needed a way to structure SQL tables on top of a Filesystem(-like) interface. It was built on top of Hadoop, with HDFS being the most widely adopted distributed file system at the time.
Traditional RDBMSs and data warehouses are able to achieve good performance using elaborate indexing schemes: clustered or non-clustered, primary and secondary, partial or complete. These are highly configurable, allowing developers and DBAs to tune data systems for high-throughput and low-latency needs.
The challenge faced by Hive’s creators was that they had to provide similar capabilities, but on top of a very simplistic set of primitives: flat files inside a directory tree.
To address this, Hive’s developers came up with two solutions: partitioning and bucketing.
For this post, I will focus on the former, as it has the biggest implications for Hive’s scalability on modern cloud-native architectures.
Hive's Partitioning Scheme
Partitioning is a simple concept: let’s build an index by encoding column values as directory names. Imagine the following table structure:
```sql
CREATE TABLE sales (
    user_id              INTEGER NOT NULL,
    date                 DATETIME NOT NULL,
    purchased_item_id    INTEGER NOT NULL,
    purchased_item_price DECIMAL NOT NULL
);
```
```
sales/
    date=2021-08-13/
        data-file-0001.csv
        data-file-0002.csv
        data-file-0003.csv
        ...
    date=2021-08-14/
        data-file-0001.csv
        data-file-0002.csv
        data-file-0003.csv
        ...
    date=2021-08-15/
        ...
```
By creating this hierarchy, Hive itself (and compatible systems) can prune partitions they know won’t contain relevant data. For example:
```sql
SELECT sum(purchased_item_price)
FROM sales
WHERE date IN ('2021-08-13', '2021-08-14');
```
Hive would then know to visit only the date directories named in the query. This is, in essence, a clustered index: the table’s data is laid out inside the index itself.
On the surface, this design sounds like a reasonable solution given the constraints, but looking closely reveals a few important limitations.
Limitation #1: Data Duplication
Hadoop favors large files for efficiency (typically, 128MB or more). A file belongs to a directory and all rows in it must match its stated predicate.
This means we can only optimize for a set of specific access patterns. There are no secondary indices. This made it common for Hive-structured tables to be written multiple times with different partitioning schemes for different consumers or use cases.
Limitation #2: Flexibility
Since index values are encoded directly in the directory name, we can only index primitive types. Values must be encoded as strings, with a few weird quirks to keep in mind when doing so.
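As an illustration of the kind of quirks involved, characters that are meaningful to the filesystem (path separators, `=`, and so on) can’t appear literally in a directory name and must be escaped. The sketch below is a simplified stand-in for this idea; Hive’s real escaping rules differ in their details, and the character set and function here are my own assumptions.

```python
# Illustrative sketch only: percent-escape characters that are unsafe
# inside a directory name, similar in spirit to (but not identical with)
# how Hive encodes partition values.
UNSAFE = set('\\/:=%#"\'')

def encode_partition_value(column, value):
    escaped = "".join(
        "%{:02X}".format(ord(ch)) if ch in UNSAFE else ch
        for ch in str(value)
    )
    return f"{column}={escaped}"
```

A value like `a/b=c` thus becomes `a%2Fb%3Dc` on disk — readable to the engine, but not exactly friendly to anyone browsing the directory tree by hand.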
Limitation #3: Performance
Making this partitioning scheme behave predictably in terms of performance is tricky:
If we choose to partition by a high cardinality field (for example, user_id), we end up with many small directories.
Even a simple operation such as enumerating the list of values during query planning puts a lot of pressure on the filesystem and takes a relatively long time. This is one of the reasons the task was delegated to the Hive Metastore which, as we’ll discuss in a bit, doesn’t truly solve the performance problem.
On the other hand, choosing columns that aren’t selective enough leaves us with either very big files, or too many small files in each partition. For example, if we ingest 4TB of sales information per day and split it into files of the recommended 128MB size, we’d end up with over 30k(!) files in each partition.
While HDFS did an OK job of handling this, listing on modern cloud-based object stores can be slow, further aggravating the problem.
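For completeness, the arithmetic behind the 30k figure above:

```python
# Back-of-the-envelope: files per daily partition at the recommended size
daily_ingest_bytes = 4 * 1024**4   # 4TB ingested per day
target_file_bytes = 128 * 1024**2  # recommended ~128MB per file
files_per_partition = daily_ingest_bytes // target_file_bytes
print(files_per_partition)         # 32768 files in a single date partition
```

32,768 files that every query touching that day must enumerate before reading a single byte of data.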
Limitation #4: Not optimized for Object Storage
On AWS S3 (by far the most used object store for Data Lakes), this hierarchical structure is actually the exact opposite of how S3 scales.
On S3, more throughput is gained by spreading data across prefixes: each prefix supports roughly 5,500 read requests per second. When dates are used for partitioning (a common pattern), all data from the last day/month/year (naturally the most interesting, and most often read, data — an inherent performance skew) ends up sharing a single long prefix.
People employ a variety of hacks — like prepending a hash value to the partition path, an approach Amazon itself suggests — but this isn’t currently supported by the Hive storage layout.
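To make the hack concrete, here is a hedged sketch of spreading a date-partitioned layout across synthetic hash prefixes. The path layout, function name, and fanout are assumptions of mine for illustration — this is precisely what the Hive layout does *not* support:

```python
import hashlib

# Illustrative sketch: prepend a short hash-derived prefix so hot
# (recent) dates no longer share a single S3 prefix. This path layout
# is hypothetical, not a Hive feature.
def hashed_key(table, date, filename, fanout=16):
    digest = hashlib.md5(f"{date}/{filename}".encode()).hexdigest()
    bucket_prefix = int(digest, 16) % fanout
    return f"{table}/{bucket_prefix:02d}/date={date}/{filename}"
```

With a fanout of 16, a single hot day’s reads are spread across 16 prefixes instead of one — but the resulting paths no longer match the `table/partition/file` hierarchy Hive expects, which is why this remains a hack rather than a supported layout.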
Now that we understand how Hive (and the metastore) define a table, let’s talk about how all this metadata is managed. The Hive Metastore was introduced a couple of years after Hive itself, with its initial release in January 2013.
To put this in perspective, this is 2 months before Docker was announced to the world, and more than a year before Spark emerged from Berkeley’s AMPLab in May 2014.
The cloud landscape looked very different back then, with most companies not even considering moving their big, expensive Hadoop clusters away from their own data centers. The architecture was a simple Java-based client-server application:
While the diagram looks straightforward, it turns out the architecture was not well-suited for large-scale, cloud-native deployments. Let’s break down why:
Problem #1: Scaling Thrift
Scaling a Thrift-based service is hard on most public clouds. Compare it to HTTP, where load balancing, metrics collection, health checking, logging, and other operational requirements come basically for free: public clouds lean heavily towards HTTP.
In addition, as an industry, the tooling and knowledge available around HTTP far surpass those around Thrift. Operations teams typically know how to troubleshoot HTTP problems and can introduce middleware for things like service discovery and high availability. The Thrift ecosystem, in comparison, is orders of magnitude smaller.
This made maintaining the metastore — a production critical component — a pain for operations teams.
Problem #2: A thin layer over RDBMS
MySQL and PostgreSQL are great, and are certainly easier to scale in the cloud where they are offered as managed services — but they aren’t necessarily a good fit for the metastore’s data model.
Very big Hive tables with many partitions can put a lot of strain on the database when doing things such as enumerating partitions, copying tables, or other heavy operations that translate into many RDBMS rows.
The metastore tries to remain somewhat ANSI SQL compatible, so it aims for the lowest common denominator, forgoing a schema tailored to each database type. HMS performance issues are a common problem in larger deployments.
Problem #3: Leaky abstractions
For an already performance-sensitive system, the Thrift API added more complexity and higher latency. This was especially noticeable for “heavier” queries such as partition enumeration. Hive supports another mode of operation, in which the server logic is embedded in the client application, which then interacts directly with the underlying RDBMS, reducing the number of network hops.
In reality, this made metastore clients much more aware of its internal workings, coupling consumers to the internal table layout. This made upgrades, schema management, and implementation changes more complicated, increasing maintenance overhead.
So, What's Next?
There are a few ongoing efforts aiming to replace some of the functionality provided by the metastore.
Google Cloud and AWS both provide a managed service to make operations less of a burden. The AWS managed service, named Glue Data Catalog, goes a step further and replaces the communication layer, moving from Thrift to HTTP. Unfortunately, while this is a desirable change for many, it currently ties users to Hive clients provided by AWS or specifically patched ones (as done, for example, by Databricks), since the issue to add this upstream has been open for 6 years.
Even if merged, this is a drop-in replacement that doesn’t address the inherent problems with the storage layout.
Another approach is to replace the whole concept entirely. This is where next-generation data formats like Delta Lake, Hudi and Iceberg come into play (see our comparison of the three). While they do alleviate some of the problems (most notably around the storage layout), they don’t fill all the gaps, leaving users to find separate solutions for virtualization and data discovery.
It seems that, for now, companies have a hard time replacing this aging de facto standard. It is deeply ingrained in the data engineering software stack and, due to some of its design choices, has made itself really hard to tear out.
Modern replacements that solve just some (or even most!) of the problems will leave users running both their metastore and a newer solution side by side. This leads to a situation XKCD summarizes best:
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.