Overview of Hive's Metastore
Let’s get right into it. This is not an objective recap of every topic covered at the Future of Metadata After Hive Roundtable last week. But it is a summary of what I found most interesting from the discussion between panelists Lior Ebel, Ryan Blue, Seshu Adunuthula and host Oz Katz.
Takeaway #1: The Future of Metadata is… Heterogeneous
Ryan used the word heterogeneous to predict the future of catalogues: “There will be a space of catalogues. Where you have one for Databricks if you want to use Delta, something built up around Iceberg, … maybe Hudi. But I think that layer will be specific to tables and get a lot more complex.”
Of course, the existence of multiple catalogues doesn’t mean that an individual data lake will have to use more than one of them. But it does portend a complicated future for anyone building the kind of functionality on top of catalogues that other panelists spoke about. Things like data access policies, compute engine integrations, and “maintaining data structure” between tools will have to support multiple technologies or risk becoming isolated.
Whether this “data lake dream” of BYO tools wins out over more vertically integrated solutions (from an increasingly testy Snowflake and Databricks) is up in the air, at least beneath the top layer of the market occupied by the Netflixes and Apples of the world.
Takeaway #2: Table Formats Are Where It's At
At the moment, the phrase table formats refers to the triumvirate of Hudi, Iceberg, and Delta. I didn’t take an official count, but around a third of the conversation revolved around the functionality these formats offer. That isn’t too surprising, since they are the main drivers of turning data lakes from “a collection of files in S3” into table objects in the post-Hive world.
Nearly all the panelists agreed these formats will continue to evolve to replace Hive functionality and the Hive concept of a table, which is “tracking partitions and listing what files are in that partition.” They will also add better support for indices and other query optimizations, and push the boundaries of structure for data in object stores.
Any discussion of metastores or catalogues can’t ignore the table formats because of the tight coupling between the two.
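To make the quoted Hive notion of a table more concrete, here is a minimal Python sketch. It is not from the roundtable. It models a Hive-style table as a mapping from partitions to directories whose files are discovered by listing, next to a simplified manifest-style table that records its files explicitly per snapshot. All names (`HiveStyleTable`, `ManifestStyleTable`, the bucket paths) are hypothetical, storage is faked with a plain dict, and neither class mirrors the real Hive Metastore API or any specific table format.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List


@dataclass
class HiveStyleTable:
    """Hypothetical model: the catalog tracks partitions; files come from listing."""
    partitions: Dict[str, str] = field(default_factory=dict)  # partition -> directory

    def add_partition(self, value: str, directory: str) -> None:
        self.partitions[value] = directory

    def files(self, value: str, storage: Dict[str, List[str]]) -> List[str]:
        # The table's contents are whatever the directory listing returns right now.
        return sorted(storage.get(self.partitions[value], []))


@dataclass
class ManifestStyleTable:
    """Hypothetical model: every commit records the exact set of data files."""
    snapshots: List[FrozenSet[str]] = field(default_factory=list)

    def commit(self, added: Iterable[str] = (), removed: Iterable[str] = ()) -> int:
        current = self.snapshots[-1] if self.snapshots else frozenset()
        self.snapshots.append((current - frozenset(removed)) | frozenset(added))
        return len(self.snapshots) - 1  # id of the new snapshot

    def files(self, snapshot_id: int = -1) -> List[str]:
        # Default is the latest snapshot; older ids remain readable.
        return sorted(self.snapshots[snapshot_id]) if self.snapshots else []


if __name__ == "__main__":
    directory = "s3://example-bucket/events/dt=2022-01-01"  # made-up path
    storage = {directory: ["part-0.parquet"]}

    hive = HiveStyleTable()
    hive.add_partition("dt=2022-01-01", directory)
    storage[directory].append("part-1.parquet")  # a writer drops a file in place
    print(hive.files("dt=2022-01-01", storage))

    table = ManifestStyleTable()
    first = table.commit(added=["a.parquet"])
    table.commit(added=["b.parquet"], removed=["a.parquet"])
    print(table.files(first), table.files())
```

In this toy model, the Hive-style table changes as soon as a file lands in the directory, while the manifest-style table changes only on an explicit commit and keeps earlier snapshots addressable. That is one simplified way to picture why catalogues and table formats end up so tightly coupled.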
Takeaway #3: Keep the Team In Mind When Making Architectural Decisions
It can be tempting to pick the shiny new toy off the shelf. It can be tempting to design a data lake to be super general and support a myriad of use cases. It can be tempting to go with fully open-source, or fully proprietary technology for ideological reasons.
I loved Lior’s answer to the question of “what tip would you give someone designing a data lake today?”, which was to keep the skills, motivations, and interests of the team in mind. An engaged team will deliver better insights and data products than an unengaged one regardless of the tech used.
- Do you see this [a multi-metastore future] more as query engines just supporting a bunch of metastores and being able to coordinate between them, or do you expect some sort of metastore aggregation technology (i.e., a single point of metadata)?
- Do you think cloud providers will adopt one of the open source solutions (Hudi, Iceberg, Delta) in a native manner, similar to Glue in AWS?
- Do we have an open source solution mature enough to run the Hive metastore on Kubernetes? Ideally it should be a Helm chart. If it doesn’t exist, where do you think it could be created as a project?
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.