Apache Iceberg is the most popular open table format. It originated at Netflix out of the need to represent data saved in files as tables and to let teams work with those tables as if they were managed in a relational database.
In broad terms, Apache Iceberg is constructed of three main layers:
- The data files, typically stored in Parquet format (ORC and Avro are also supported)
- The manifest files holding metadata to allow the table representation
- The catalog, which lists those tables and provides access to them using SQL
While Apache Iceberg prescribes the structure of its data and manifest files, it gives you much freedom in selecting the catalog you use. Continue reading to learn about the different options and how to choose the right one for you.
What Is an Iceberg Catalog?
Apache Iceberg is a data lakehouse table format that is redefining data architecture with innovative features like sophisticated partitioning, ACID guarantees, schema evolution, time travel, and more.
The catalog mechanism of Apache Iceberg tables is essential to their operation and significantly impacts how their features are created and used over time.
In the past, the data lakehouse's "use the tools you want" philosophy was hampered by the need to build catalog support separately for each language that supports Iceberg (Java, Python, Rust, and Go). This led to inconsistent catalog support.
The Apache Iceberg project created the REST Catalog specification to solve this. By defining the server endpoints a catalog service must implement, this open API specification establishes a standard any catalog service can follow.
Modern catalogs seek to provide uniform table experiences across the range of Iceberg-supporting technologies. They do so not only by making tables more discoverable but also by guaranteeing the portability of governance rules and other information.
Different Uses of Catalogs
Using a top-level metadata file called metadata.json, Apache Iceberg gives a query engine access to crucial information, including a table's schema, snapshot history, and partition specs.
A new metadata.json file is created each time an Apache Iceberg table is modified, leading to the accumulation of many versions (such as v1.metadata.json and v2.metadata.json) in your data lake.
Even though cleanup procedures regularly remove old versions, a question remains: how do query engines like Dremio, Snowflake, and Apache Spark identify which metadata.json file is the "current" one?
This is where the catalog comes into play.
The most common use cases of an Iceberg catalog are:
- Keeping track of the current Iceberg table list.
- Maintaining a pointer to the "current" metadata.json for each table.
Catalogs serve as the single source of truth, guaranteeing that multiple query engines accessing the same table see a consistent version of the data.
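The catalog contract described above can be sketched in a few lines: map each table identifier to the location of its current metadata.json, and only swap that pointer atomically on commit. This is a minimal illustration, not a real catalog API; all names here are invented for the example.

```python
# Minimal sketch of the catalog contract: one pointer per table to the
# current metadata.json, updated with a compare-and-swap on commit.
# Class and method names are illustrative, not a real Iceberg API.

class InMemoryCatalog:
    def __init__(self):
        self._tables = {}  # "namespace.table" -> metadata.json location

    def register_table(self, identifier, metadata_location):
        self._tables[identifier] = metadata_location

    def current_metadata(self, identifier):
        return self._tables[identifier]

    def commit(self, identifier, expected_location, new_location):
        # The commit only succeeds if the writer saw the latest metadata;
        # this is what gives every reader a consistent view of the table.
        if self._tables.get(identifier) != expected_location:
            raise RuntimeError("concurrent update detected; refresh and retry")
        self._tables[identifier] = new_location

catalog = InMemoryCatalog()
catalog.register_table("sales.orders", "s3://lake/orders/metadata/v1.metadata.json")
catalog.commit("sales.orders",
               "s3://lake/orders/metadata/v1.metadata.json",
               "s3://lake/orders/metadata/v2.metadata.json")
print(catalog.current_metadata("sales.orders"))
```

A commit that passes a stale `expected_location` fails, which is how real catalogs prevent two engines from silently overwriting each other's changes.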
Types of Iceberg Catalogs
Iceberg catalogs fall into two categories: file-based and service-based. File-based catalogs track the pointer to a table's current metadata file in a file stored alongside the table, while service-based catalogs rely on a running service to track these references.
Let’s explore each type of Apache Iceberg catalog together with examples currently found on the market.
File-Based Catalogs
File-based catalogs keep a file called version-hint.text that references the most recent metadata.json. Whenever a query engine reads a directory containing an Iceberg table, it looks for this file to find the right metadata.json.
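The resolution step above is simple enough to sketch end to end. The sketch below builds a throwaway table directory, then resolves the current metadata file the way a file-based (Hadoop-style) catalog would; the directory layout follows the convention just described, but the helper function is invented for illustration.

```python
# Sketch of file-based catalog resolution: read version-hint.text,
# then load the matching vN.metadata.json.
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def current_metadata_path(table_dir: Path) -> Path:
    """Illustrative helper: resolve the current metadata file for a table."""
    version = (table_dir / "metadata" / "version-hint.text").read_text().strip()
    return table_dir / "metadata" / f"v{version}.metadata.json"

with TemporaryDirectory() as d:
    table = Path(d) / "orders"
    (table / "metadata").mkdir(parents=True)
    # Two commits have produced two metadata versions.
    (table / "metadata" / "v1.metadata.json").write_text(json.dumps({"snapshot": 1}))
    (table / "metadata" / "v2.metadata.json").write_text(json.dumps({"snapshot": 2}))
    (table / "metadata" / "version-hint.text").write_text("2")
    path = current_metadata_path(table)
    print(path.name)  # v2.metadata.json
```

Because the pointer is just a file, no extra infrastructure is needed, but atomic updates depend on the storage system, which is the main weakness of this approach on object stores.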
Since Iceberg catalogs are adaptable, you can use them with nearly any backend system. Any Iceberg runtime can connect to them, and any Iceberg-compatible processing engine can use them to load the tracked Iceberg tables. Apache Iceberg also ships with several catalog implementations that are ready to use out of the box:
| Catalog Type | Description |
|---|---|
| REST | A server-side catalog accessible via a RESTful API |
| Hive Metastore | Employs a Hive metastore to track namespaces and tables |
| JDBC | Tracks namespaces and tables in a relational database accessed via JDBC |
| Nessie | A transactional catalog that uses version control similar to Git to track namespaces and tables in a database (see more info below) |
| lakeFS | A version-controlled catalog that tracks datasets and metadata at the file level within a data lake (see more info below) |
REST Catalog
The REST catalog is a RESTful implementation of the Iceberg catalog specification. The client sends REST requests to a server-side catalog, which applies the commits and updates the snapshot pointers. Catalog-specific logic and dependencies live only on the server: the REST catalog server can wrap any existing Iceberg catalog solution and provide extra server-side processing.
Hive Metastore
Hive Metastore is an essential part of many data lake systems because it offers a central repository of metadata that can be readily queried to make data-driven decisions. Hive is built on Apache Hadoop, and the underlying table data can live on HDFS or in cloud object stores such as S3, ADLS, and GCS. Hive users can use SQL to read, write, and manage petabytes of data.
JDBC Catalog
Iceberg's JDBC catalog tracks namespaces and tables in a relational database accessed via JDBC. Any JDBC-compatible database, such as PostgreSQL or MySQL, can serve as the backing store, making this a lightweight option when you already operate a relational database.
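What a JDBC catalog actually stores is small: essentially one row per table pointing at its current metadata file. The sketch below uses Python's built-in sqlite3 as a stand-in database; the table and column names are modeled on the schema Iceberg's JDBC catalog uses, but this is an illustration, not the real client.

```python
# Simplified sketch of a JDBC catalog's backing table: one row per
# Iceberg table, pointing at the current metadata.json. Column names
# are modeled on Iceberg's JDBC catalog schema; this is not a real client.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE iceberg_tables (
        catalog_name               TEXT NOT NULL,
        table_namespace            TEXT NOT NULL,
        table_name                 TEXT NOT NULL,
        metadata_location          TEXT,
        previous_metadata_location TEXT,
        PRIMARY KEY (catalog_name, table_namespace, table_name)
    )
""")
conn.execute(
    "INSERT INTO iceberg_tables VALUES (?, ?, ?, ?, ?)",
    ("lake", "sales", "orders",
     "s3://lake/orders/metadata/v2.metadata.json",
     "s3://lake/orders/metadata/v1.metadata.json"),
)

# A query engine "loads" a table by looking up its current metadata pointer.
row = conn.execute(
    "SELECT metadata_location FROM iceberg_tables "
    "WHERE table_namespace = ? AND table_name = ?",
    ("sales", "orders"),
).fetchone()
print(row[0])
```

Because commits reduce to a single-row update, the database's transactional guarantees give the catalog its atomicity for free.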
Nessie Catalog
Because Nessie is implemented as a custom Iceberg catalog, it supports every Iceberg client, including Presto, Trino, Flink, Hive, and Spark Structured Streaming.
Key features of Nessie include:
- Commits that let users accurately manage data lake changes
- Branches that isolate changes from the primary data collection, preventing botched experiments or jobs from affecting production
- Tags that label specific commits for later reference
- Automated data management
lakeFS Catalog
You can use the lakeFS implementation of the Iceberg catalog to add lakeFS data versioning features to your Iceberg tables. The integration lets you query your Iceberg tables using lakeFS references, including branches, tags, and commit hashes.
Service-Based Catalogs
Service-based catalogs rely on an active server or service. This service receives requests from query engines and responds with the location of the table's current metadata.json.
To make these catalogs suitable for production use, they typically implement locking mechanisms, or rely on those of a backing database, to guarantee atomic commits.
Here are a few examples of service-based catalogs you can use with Iceberg:
AWS Glue Data Catalog
AWS Glue Data Catalog centrally stores the metadata about your data sets. It serves as an index of your data sources' locations, schemas, and runtime metrics. The metadata is organized into tables, each representing a single data store. You can populate the Data Catalog with a crawler that automatically scans your data sources and gathers metadata; crawlers can connect to both AWS-hosted and external data sources.
Google Cloud Data Catalog
Google Cloud's catalog for Iceberg is BigLake Metastore. It is the preferred solution on Google Cloud because it allows tables to be synchronized between Spark and BigQuery workloads. You can achieve this by creating an Iceberg BigLake table and initializing BigLake Metastore using an Apache Spark stored procedure in BigQuery. However, you must still execute an update query in BigQuery to update a schema.
Databricks Unity Catalog
Databricks Unity Catalog lets users easily manage files, notebooks, dashboards, machine learning models, and structured and unstructured data in any format across any cloud or platform.
Azure Purview
Azure Purview is a unified data governance solution for managing and governing software-as-a-service (SaaS), multi-cloud, and on-premises data. With automated data discovery, sensitive data classification, and end-to-end data lineage, you can quickly and simply generate an accurate, comprehensive map of your data environment.
Snowflake Catalog
Snowflake unveiled Polaris, an open-source data catalog compatible with Apache Iceberg's REST protocol, which lets you easily plug in multiple processing engines for data management, such as Apache Flink, Apache Spark, and Trino.
REST Catalog Specification
The REST catalog was added in Iceberg 0.14.0 to offer more flexibility to Iceberg catalogs. The implementation details of a REST catalog live on the catalog server rather than in technology-specific logic in the catalog clients.
This is comparable to the Hive Thrift service, which enables single-port access to a Hive server. As long as the server-side functionality adheres to the Iceberg REST OpenAPI specification, it can be built in any language and use any underlying technology.
One of the main advantages of the REST catalog is its ability to communicate with any catalog backend using a single client. This flexibility makes it simpler to create custom catalogs that work with engines like Athena or Starburst without adding a JAR to the classpath.
As long as a service complies with the specification (all required endpoints are present, accept the right inputs, and return the right outputs), its internal implementation logic can vary freely. The standard RESTCatalog class is therefore sufficient for any compliant catalog, eliminating the need for a distinct catalog class in each language.
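To make the "required endpoints" idea concrete, the sketch below builds a few request paths in the shape the REST Catalog OpenAPI specification defines. The prefix is server-defined, and the helper functions are invented for illustration; only the path shapes reflect the spec.

```python
# Sketch of a few request paths defined by the Iceberg REST Catalog
# OpenAPI spec. Any server exposing these endpoints can back the
# standard REST catalog client. Helper functions are illustrative.

def load_table_path(prefix: str, namespace: str, table: str) -> str:
    # Multi-level namespaces are joined with a special separator in the
    # real spec; a single-level namespace is shown here for simplicity.
    return f"/v1/{prefix}/namespaces/{namespace}/tables/{table}"

def list_namespaces_path(prefix: str) -> str:
    return f"/v1/{prefix}/namespaces"

# Clients call this first to discover server settings and defaults.
CONFIG_PATH = "/v1/config"

print(load_table_path("lake", "sales", "orders"))
# /v1/lake/namespaces/sales/tables/orders
```

Because every compliant server answers the same paths with the same payloads, one generic client can talk to Polaris, Nessie's REST endpoint, or an in-house catalog interchangeably.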
This strategy has several benefits:
- When developers build a compliant catalog service, most Apache Iceberg-compatible tools can use it immediately
- Conforming catalog implementations can be written in any language and can communicate with any client in any language
- Updates can be managed centrally in the catalog since operations are handled by the server rather than the client
- Tools that work with Apache Iceberg can support a wider variety of catalogs with ease, including internal catalogs built by individual organizations. Because many tools are reluctant to build new connectors for every catalog, this flexibility is very helpful
Challenges with Iceberg Catalogs
1. Integration and maintainability
Your choice of Apache Iceberg catalog should be based on several important considerations, including compatibility with your existing tools, additional features or integrations offered by the catalog, and maintainability.
2. Managing all tables using a single catalog
Because only the active catalog is updated when a transaction completes, you must manage each table through a single catalog. When multiple catalogs are used, some may reference out-of-date table states, causing consistency problems.
3. Need for bespoke synchronization mechanisms
If more than one catalog is necessary, one can be used for write operations and the others only for reading. This structure requires bespoke mechanisms that synchronize the read-only catalogs with the primary, guaranteeing they reflect the most recent table state and preserve consistency.
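At its core, such a sync mechanism copies the primary catalog's current metadata pointers into the read-only replicas. The sketch below models both catalogs as plain dicts of table-to-pointer mappings; the `sync` function and all names are invented for illustration.

```python
# Sketch of a bespoke sync job for a read-only secondary catalog: copy
# the primary's current metadata pointers so readers on the secondary
# see the latest committed state. Catalogs are modeled as plain dicts.

primary = {
    "sales.orders": "s3://lake/orders/metadata/v3.metadata.json",
    "sales.customers": "s3://lake/customers/metadata/v7.metadata.json",
}
read_replica = {
    "sales.orders": "s3://lake/orders/metadata/v2.metadata.json",  # stale
}

def sync(source: dict, target: dict) -> list:
    """Update the target's pointers and report which tables changed."""
    changed = [t for t, loc in source.items() if target.get(t) != loc]
    target.update(source)
    return changed

print(sorted(sync(primary, read_replica)))
```

A real implementation would run on a schedule or subscribe to commit events, but the invariant is the same: after each sync, every replica pointer matches the primary.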
How to Choose the Right Catalog
When deciding which catalog to select, the following important factors will help you make the best decision:
- Deployment requirements – Evaluate what it takes to deploy and operate the catalog yourself. Does the project provide Helm charts or Docker images to make deployments simple to manage?
- Documentation – Verify whether the documentation offers thorough directions for using the catalog. Are there blogs, guides, and case studies your organization can use to get started?
- Governance and security – Analyze the catalog's security features. How well do the rules that safeguard the catalog apply across different tools and users?
- Scalability – Analyze how the catalog handles distributed functionality in multi-region, large-scale data lakehouse situations.
- Special features – Do you need any particular capabilities? lakeFS, for instance, provides Git-like semantics for tables, allowing branching, merging, tagging, rolling back changes, and more.
- Support for the REST Catalog – Although all of these catalogs adhere to the REST Catalog specification, some operations, such as creating or registering a table, may not be supported by every implementation.
Considering these aspects, it's smart to experiment with several catalogs to see which works best for your company. Enterprise cloud-managed versions of several of these catalogs are also available for those who would rather not administer them themselves. Additionally, moving between catalogs is not too difficult, thanks to the Nessie project's catalog migration tool.
Conclusion
The ecosystem of Apache Iceberg catalogs is vast and filled with various offerings. When reviewing a given Iceberg catalog for your team, consider aspects such as simplicity of setup, thorough documentation, strong security and governance features, scalability, catalog properties, and support for the REST Catalog specification.
Take your time evaluating these factors in depth and experimenting with several catalogs to find the one that best suits you. The right catalog will improve Iceberg tables' portability, discoverability, and governance, whether administered internally or through a cloud-managed service.
A decision that fits the company's long-term data strategy requires careful thought and practical testing. But the work is worthwhile: taking the time to navigate the catalog landscape confidently is one of the most critical steps in ensuring a data lakehouse project succeeds.


