
Paul Singman

November 16, 2021

Overview of Hive's Metastore

Let’s get right into it. This is not an objective recap of every topic covered at the Future of Metadata After Hive Roundtable last week. But it is a summary of what I found most interesting from the discussion between panelists Lior Ebel, Ryan Blue, and Seshu Adunuthula, and host Oz Katz.

Watch the full talk below!

Takeaway #1: The Future of Metadata is… Heterogeneous

The word heterogeneous was used by Ryan to predict the future of catalogues. “There will be a space of catalogues. Where you have one for Databricks if you want to use Delta, something built up around Iceberg, … maybe Hudi. But I think that layer will be specific to tables and get a lot more complex.”

Of course, just because multiple catalogues exist doesn’t mean that an individual data lake will have to use more than one of them. But it does portend a complicated future for anyone building the kind of functionality on top of catalogues that other panelists spoke about. Things like data access policies, compute engine integrations, and “maintaining data structure” between tools will have to support multiple technologies or risk becoming isolated.
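To make that a bit more concrete, here is a minimal sketch of what a multi-catalogue world can look like from a single compute engine’s point of view, assuming PySpark with the Iceberg and Delta connectors available on the classpath. The catalogue name analytics_iceberg and the table names are made up for illustration.

```python
# Minimal sketch: one Spark session wired to more than one catalogue.
# Catalogue and table names are illustrative, not from the talk.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multi-catalogue-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "io.delta.sql.DeltaSparkSessionExtension")
    # An Iceberg catalogue backed by the existing Hive Metastore
    .config("spark.sql.catalog.analytics_iceberg",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.analytics_iceberg.type", "hive")
    # Delta routed through the default spark_catalog
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Each table name resolves against its own catalogue, which is exactly
# what access policies and engine integrations have to cope with.
spark.sql("SELECT * FROM analytics_iceberg.db.events LIMIT 10")
spark.sql("SELECT * FROM spark_catalog.default.events_delta LIMIT 10")
```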

Whether this sort of “data lake dream” of BYO tools wins out over more vertically integrated solutions (from an increasingly testy Snowflake and Databricks) is still up in the air, at least beneath the top layer of the market occupied by the Netflixes and Apples of the world.

Takeaway #2: Table Formats Are Where It's At

At the moment, the phrase table formats refers to the triumvirate of Hudi, Iceberg, and Delta. I didn’t take an official count, but around a third of the conversation revolved around the functionality these formats offer, which isn’t too surprising since they are the main drivers of turning data lakes from “a collection of files in S3” into table objects in the post-Hive world.

Nearly all the panelists agreed these formats will continue to evolve to replace Hive functionality and the Hive concept of a table, which is “tracking partitions and listing what files are in that partition.” Additionally, they will add better support for indices and other query optimizations, and push the boundaries of structure for data in object stores.
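A rough sketch of that shift, assuming the same illustrative Spark session and catalogue names as above: the Hive-era table is essentially a partition directory layout the metastore has to track, while an Iceberg table carries its own file-level metadata and derives (“hidden”) partitions from a column.

```python
from pyspark.sql import SparkSession

# Assumes a session configured with the Iceberg catalogue from the earlier sketch.
spark = SparkSession.builder.getOrCreate()

# Hive-era style: partitions are directories, and the metastore answers
# "which files live in partition dt=2021-11-16?"
spark.sql("""
    CREATE TABLE IF NOT EXISTS spark_catalog.default.events_hive (
        user_id BIGINT, ts TIMESTAMP, dt STRING
    ) USING parquet
    PARTITIONED BY (dt)
""")

# Table-format style: Iceberg tracks its data files (and their stats) itself,
# and partitioning is derived from the ts column rather than a path layout.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics_iceberg.db.events (
        user_id BIGINT, ts TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# This filter can be pruned using Iceberg's own metadata; no directory
# listing and no explicit partition column are needed in the query.
spark.sql("""
    SELECT count(*) FROM analytics_iceberg.db.events
    WHERE ts >= TIMESTAMP '2021-11-01 00:00:00'
""").show()
```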

Any discussion of metastores or catalogues can’t ignore the table formats because of the tight coupling between the two.

Takeaway #3: Keep the Team In Mind When Making Architectural Decisions

It can be tempting to pick the shiny new toy off the shelf. It can be tempting to design a data lake to be super general and support a myriad of use cases. It can be tempting to go with fully open-source or fully proprietary technology for ideological reasons.

I loved Lior’s answer to the question of “what tip would you give someone designing a data lake today?”, which was to keep the skills, motivations, and interests of the team in mind. An engaged team will deliver better insights and data products than an unengaged one regardless of the tech used.

Amen!

Wrapping Up

These were just a few of the topics covered in the hour-long discussion. You can view the full recording on YouTube and continue the conversation in the lakeFS Forum, where threads are open for viewer questions we didn’t get to during the live event.

Questions like:

  • Do you see this [a multi-metastore future] more as query engines just supporting a bunch of metastores and being able to coordinate between them, or do you expect some sort of metastore aggregation technology (i.e. a single point of metadata)?
  • Do you think cloud providers will adopt one of the open source solutions (Hudi, Iceberg, Delta) in a native manner, similar to Glue in AWS?
  • Do we have an open-source and mature enough solution for running Hive Metastore on Kubernetes, ideally as a Helm chart? Where do you think such a project could be created if it doesn’t exist?

About lakeFS

The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.

Our mission is to maximize the manageability of open source data analytics solutions that scale.
