This article focuses on how to work with Nessie Catalog. Please note that since its first publication, fundamental support for Iceberg REST Catalog has been added to lakeFS. Visit the lakeFS Iceberg REST Catalog article to learn more about this integration.
Data is easily one of the most important assets in every organization, serving as the foundation for driving innovation and strategic decision-making. This makes efficient data management critical.
Traditional data systems have failed to keep up with the data volume explosion, changing data formats, and the rapid migration to data lakes and cloud-based storage solutions. Teams often struggle to maintain massive datasets efficiently.
Luckily, the market quickly filled up with tooling to address these challenges. One of them is the Nessie Catalog, which offers a novel approach to lakehouse catalogs by applying the battle-tested concepts of version control to data management.
What exactly is Nessie, how does it work, and how can you get started? Read this article to get all the essentials and best practices to find out if Project Nessie is the right tool for your unique use case and requirements.
What is Project Nessie?
Project Nessie is similar to Git for data, allowing you to apply version control techniques to data catalogs such as the Apache Iceberg catalog. Data engineers, scientists, and analysts can use this open-source project to handle and maintain data like developers do with code.
Nessie enables users to branch, tag, and commit changes to data catalogs, treating them as transactions. This makes data evolution manageable, auditable, and reversible, laying a solid foundation for data governance and security.
Since it decouples data and metadata management from the underlying storage system, it supports a wide range of storage backends, from HDFS to cloud storage options, making it a versatile tool in the data engineer’s toolset.
Nessie and Git-for-Data
Git-for-Data refers to bringing Git’s version control techniques to data management. Just as Git transformed software development by allowing developers to monitor changes, branch off, merge updates, and investigate the history of their codebases, data versioning intends to transform data management technologies by applying the same ideas to data catalogs.
Nessie allows data teams to create branches for testing with data without affecting the main branch, commit changes to track data evolution, and integrate adjustments across all tables in your catalog when they’re ready to be shared or published.
This Git-like feature improves collaboration among data teams while also providing flexibility and safety not previously available in data management. Branching and merging allow teams to test new data models, algorithms, or transformations in isolation, ensuring that only validated changes make it into production.
What is Nessie Catalog?
The Nessie Catalog is a robust data catalog system that keeps the current metadata position of your Iceberg tables and maintains a commit history of the whole catalog. Like in Git, this history can be branched, tagged, and merged.
This, in turn, makes it possible to change many tables on a branch and publish all of those changes at once.
Users can also roll back the entire catalog in case of an error or tag the specific state of the entire catalog.
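The idea of a catalog with a commit history can be sketched with a small in-memory model. This is a toy illustration only, not Nessie's API or storage model: the class names, method names, and metadata paths below are all hypothetical. It assumes that each commit is an immutable mapping from table name to the current Iceberg metadata location, and that branches and tags are simply named pointers to commits.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """One immutable catalog state: table name -> metadata file location."""
    tables: Dict[str, str]
    message: str
    parent: Optional["Commit"] = None


class ToyCatalog:
    """Hypothetical model of a versioned catalog (not Nessie's real API)."""

    def __init__(self) -> None:
        root = Commit(tables={}, message="initial empty catalog")
        self.branches: Dict[str, Commit] = {"main": root}
        self.tags: Dict[str, Commit] = {}

    def commit(self, branch: str, table: str, metadata: str, message: str) -> Commit:
        # A commit records a new catalog state on top of the branch head.
        head = self.branches[branch]
        new_head = Commit({**head.tables, table: metadata}, message, parent=head)
        self.branches[branch] = new_head
        return new_head

    def create_branch(self, name: str, source: str = "main") -> None:
        # Branching copies a pointer, not data.
        self.branches[name] = self.branches[source]

    def tag(self, name: str, branch: str = "main") -> None:
        # A tag is a fixed name for one state of the whole catalog.
        self.tags[name] = self.branches[branch]

    def rollback(self, branch: str, target: Commit) -> None:
        # Rolling back moves the branch pointer to an earlier commit.
        self.branches[branch] = target


catalog = ToyCatalog()
good = catalog.commit("main", "db.orders", "orders/v1.metadata.json", "add orders")
catalog.tag("release-1")
catalog.create_branch("etl")
catalog.commit("etl", "db.orders", "orders/v2.metadata.json", "rewrite orders")
catalog.commit("main", "db.orders", "orders/bad.metadata.json", "bad load")
catalog.rollback("main", good)
print(catalog.branches["main"].tables, catalog.branches["etl"].tables)
```

In this sketch the work on the hypothetical etl branch never touches main, and the rollback restores every table in the catalog in one step because it moves a single pointer.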
Nessie Catalog vs. Traditional Data Catalogs
Traditionally, data engineers relied on self-managed catalog choices such as Hive Metastore and JDBC catalogs (MySQL, PostgreSQL, and so on). While these systems have played an important role in the advancement of data management, they also came with a number of issues, particularly when combined with Apache Iceberg tables.
Complex implementation
Preparing and implementing traditional catalogs can be a time-consuming and complex procedure, particularly in dynamic and scalable cloud systems. This complexity can result in additional operational overhead and a higher risk of misconfiguration, both of which are harmful to fast-paced data operations.
Nessie is easier to configure and implement, helping teams save time and resources.
Not leveraging Iceberg’s latest features
Furthermore, Hive and JDBC catalogs may not fully utilize Apache Iceberg’s latest features. Many additional capabilities are now exclusively available via the “REST catalog” OpenAPI specification, which doesn’t have an open-source or self-managed implementation.
Nessie allows adding new functionality to Apache Iceberg tables.
Demand for self-managed catalogs
The demand for self-managed catalog infrastructure is more pressing than ever, driven mostly by regulatory and security concerns. Many companies face strict governance and compliance standards. These requirements often call for precise data handling, storage, and processing protocols, which might be difficult to follow with third-party managed services.
Self-managed catalogs give teams more control over their data, allowing them to apply tailored security measures, comply with relevant requirements, and maintain data sovereignty. This management is critical for companies that handle sensitive data or operate in highly regulated industries such as financial services, healthcare, and government.
Nessie was designed to open up new possibilities over older systems such as Hive Metastore and JDBC catalogs. It’s open-source and self-managed, and it opens the door to catalog management that is more in line with the modern requirements of big data analytics.
Key Features of Nessie Catalog
Open-Source and Self-Managed
Nessie is an open-source project, which means you can adjust it to meet the needs of your team. This feature is especially appealing for companies looking for a self-managed infrastructure since it allows them to customize the catalog to meet their regulatory and security requirements.
Compatibility with Leading Data Tools
One of Nessie’s key advantages is its compatibility with diverse data processing tools. It works flawlessly with popular engines such as Apache Spark, Apache Flink, Presto, and Trino. This compatibility ensures that enterprises can continue to use their preferred tools without concern for catalog compatibility issues.
Branching and Merging
Nessie, like Git, allows teams to work on multiple versions of data at the same time by supporting the creation of data catalog branches. This is handy for experimenting with data models or running analyses without jeopardizing the integrity of the primary data set. When the work on a branch is finished and verified, it can be merged back into the main branch, ensuring that only accurate and validated modifications are included.
Time Travel and Rollback
One of the most appealing aspects of Nessie is the ability to go back in time and retrieve prior versions of your catalog, ensuring reliable data snapshots. This functionality is extremely useful for auditing, reproducing analysis, debugging data errors, and more. Time travel ensures that no data update is ever truly lost, providing a safeguard for data engineers and scientists.
Enabling Advanced Features
Nessie comes with functionalities that standard catalogs don’t offer, such as catalog versioning, zero-copy clones, catalog-level rollbacks, and multi-table transactions. These features improve data management efficiency and capabilities, opening the door for more complicated and sophisticated data processes.
Simplifying Data Operations
Nessie makes managing massive amounts of data much easier across several environments. Its catalog-level versioning and rollback capabilities simplify data governance and compliance, which is critical for many businesses.
Future-Proofing Data Management
Nessie’s design is naturally scalable and adaptive, allowing it to handle the increasing needs of big data analytics and the ever-changing landscape of data management technology.
Use Cases for Catalog Versioning with Nessie
Data Experimentation and Rollbacks
Data scientists and engineers often experiment with data transformations, schema modifications, and model training while maintaining continuous operations. Nessie streamlines this experimentation by letting users create branches that they can modify freely.
If an experiment doesn’t produce the intended results, it’s simple to revert to a prior state or delete the branch without affecting the production data.
Collaborative Data Management
In large businesses, multiple teams may be required to work on the same dataset simultaneously. Nessie’s branching and merging capabilities make collaborative workflows easier, allowing teams to work independently and then integrate their changes when ready. This method minimizes conflicts while ensuring the primary dataset remains consistent and dependable.
Audit and Compliance
Businesses must keep accurate records of data changes, access, and lineage to comply with regulations and pass audits. Nessie’s immutable history and time-travel features create an auditable trail of all data changes, making compliance efforts easier and improving data governance.
Continuous Integration/Continuous Deployment (Write-Audit-Publish) for Data Pipelines
Adopting Write-Audit-Publish principles in data pipelines can dramatically increase the reliability and efficiency of data operations. Nessie facilitates Write-Audit-Publish (WAP) by managing versions of data pipelines and datasets, enabling teams to automate testing and deployment processes. This means changes can be made fast and safely, leading to a more agile and responsive data architecture.
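A Write-Audit-Publish run can be sketched as a short Python function. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes a Spark session with the Nessie SQL extensions enabled, where run_sql would be spark.sql, and it assumes the CREATE BRANCH, USE REFERENCE, MERGE BRANCH, and DROP BRANCH statements provided by those extensions. The function name, the etl_staging branch name, and the audit callback are hypothetical.

```python
from typing import Callable, Iterable


def write_audit_publish(
    run_sql: Callable[[str], object],
    writes: Iterable[str],
    audit: Callable[[], bool],
    catalog: str = "nessie",
    branch: str = "etl_staging",  # hypothetical branch name
    target: str = "main",
) -> bool:
    """Stage writes on a branch, audit them, and merge only if the audit passes."""
    # Write: stage all changes on an isolated branch.
    run_sql(f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {target}")
    run_sql(f"USE REFERENCE {branch} IN {catalog}")
    for statement in writes:
        run_sql(statement)

    # Audit: the caller decides what "valid" means (row counts, null checks, ...).
    if not audit():
        # Leave the branch in place for inspection and switch back to the target.
        run_sql(f"USE REFERENCE {target} IN {catalog}")
        return False

    # Publish: merge the audited branch, then clean it up.
    run_sql(f"MERGE BRANCH {branch} INTO {target} IN {catalog}")
    run_sql(f"USE REFERENCE {target} IN {catalog}")
    run_sql(f"DROP BRANCH {branch} IN {catalog}")
    return True


if __name__ == "__main__":
    executed = []  # stand-in for spark.sql so the sketch runs without Spark
    published = write_audit_publish(
        executed.append,
        writes=["INSERT INTO nessie.db.table VALUES ('bird')"],
        audit=lambda: True,
    )
    print(published)
    print("\n".join(executed))
```

Passing the SQL runner in as a parameter keeps the sketch runnable without a cluster. In a real pipeline the audit callback would query the staged tables on the branch before anything reaches the target branch.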
How to Use Nessie Catalog
The CLI Commands
The Nessie CLI provides a simple way to get started, allowing you to handle many branches and tags. Nessie CLI is intended as an interactive REPL with auto-completion, highlighting where applicable, and built-in assistance. Long outputs, such as a commit log, are automatically paged using the Unix less command.
Nessie CLI is available as a standalone uber jar or a Docker image.
To connect to a Nessie instance running locally using Iceberg REST, use the command CONNECT TO http://127.0.0.1:19120/iceberg
To access Nessie’s native REST API instead, use the command CONNECT TO http://127.0.0.1:19120/api/v2
What does the Nessie CLI let you do? Here’s a list of commands with actions performed:
| Command | Action |
|---|---|
| nessie branch | List all branches |
| nessie branch <new_branch> | Create a new branch from the main branch |
| nessie branch <new_branch> <old_branch> | Create a branch from the specified branch |
| nessie branch <new_branch> <hash> | Create a new branch from the specified hash |
| nessie content list -r <branch> | List your content in a branch |
| nessie content view -r <branch> <key> | View a specific content item in a branch |
| nessie content commit -m <message> -r <branch> <key> | Commit the particular content to the specified branch with the specified commit message |
Adding Tables to the Catalog
The CLI lets you work with the Nessie server to create catalog-level commits and branches; however, this won’t have the desired results if the catalog contains no tables. You may create Iceberg tables with Dremio Sonar, Spark, Flink, or Hive.
To add tables to Nessie with Spark, shut down the Docker container running with Nessie, create an empty folder on your computer, and create a docker-compose.yml file with the following contents:
#### Nessie + Iceberg Playground Environment
services:
  spark-iceberg:
    image: alexmerced/nessie-sandbox-072722
    ports:
      - "8080:8080"
      - "7077:7077"
      - "8081:8081"
  nessie:
    image: projectnessie/nessie
    ports:
      - "19120:19120"
After creating the file, run the following command in the terminal from that directory to spin it up:
docker-compose up
Then, open the terminal within the Spark container using the following command:
docker-compose run spark-iceberg /bin/bash
From within your Docker container, you can execute the following command:
source nessie-init2.bash
This command launches SparkSQL with Iceberg tables and a Nessie catalog. The underlying command and the configurations it will run are shown below.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,org.projectnessie:nessie-spark-3.2-extensions:0.40.1 \
  --conf spark.sql.extensions="org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSpark32SessionExtensions" \
  --conf spark.sql.catalog.nessie.uri="http://nessie:19120/api/v1" \
  --conf spark.sql.catalog.nessie.ref=main \
  --conf spark.sql.catalog.nessie.authentication.type=NONE \
  --conf spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog \
  --conf spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.nessie.warehouse=$PWD/warehouse
The configuration flags that are passed do the following:
- Ensure Spark uses the Nessie and Iceberg packages
- Enable the Nessie and Iceberg extensions
- Set the URL for the Nessie server
- Set the default branch for SQL commands
- Set the authentication type to NONE
- Create a Spark catalog and configure it for the Nessie implementation
- Identify where the files will be stored
- Enable sophisticated data operations
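To make the mapping between these flags and their purpose easier to follow, the same command line can be assembled from a small Python helper. This is a sketch for illustration: the function name and its parameters are hypothetical, while the package coordinates and configuration keys are copied from the spark-sql command shown above and assume that same Spark 3.2 sandbox.

```python
import shlex
from typing import List

# Package coordinates taken from the spark-sql command in this article.
PACKAGES = (
    "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,"
    "org.projectnessie:nessie-spark-3.2-extensions:0.40.1"
)
EXTENSIONS = (
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
    "org.projectnessie.spark.extensions.NessieSpark32SessionExtensions"
)


def nessie_spark_sql_args(
    uri: str, ref: str = "main", warehouse: str = "warehouse", catalog: str = "nessie"
) -> List[str]:
    """Build the spark-sql argument list for a Nessie-backed Iceberg catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    conf = {
        "spark.sql.extensions": EXTENSIONS,                      # enable the extensions
        f"{prefix}.uri": uri,                                    # Nessie server URL
        f"{prefix}.ref": ref,                                    # default branch
        f"{prefix}.authentication.type": "NONE",                 # no authentication
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        prefix: "org.apache.iceberg.spark.SparkCatalog",         # register the catalog
        f"{prefix}.warehouse": warehouse,                        # where files are stored
    }
    args = ["spark-sql", "--packages", PACKAGES]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print a shell-quoted command that could be pasted into the container.
    print(shlex.join(nessie_spark_sql_args("http://nessie:19120/api/v1")))
```

Keeping the settings in one dictionary makes it straightforward to point the same session at a different branch or server by changing a single argument.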
To create a new table, use the following command:
CREATE TABLE nessie.db.table (name STRING) USING iceberg;
And then this one to insert data into the new table:
INSERT INTO nessie.db.table (name) VALUES ('cat'), ('dog');
To create a new branch, use this command:
CREATE BRANCH IF NOT EXISTS my_branch IN nessie;
You can add records to the branch with this:
INSERT INTO nessie.db.`table@my_branch` VALUES ('bird'),('fish');
You can now query the table’s “main” and “my_branch” branches.
SELECT * FROM nessie.db.table;
SELECT * FROM nessie.db.`table@my_branch`;
Both queries show cat and dog, but only “my_branch” will show bird and fish. This clearly illustrates how your changes are being isolated on the branch. You can now update, delete, and insert new data without disrupting queries on the primary branch.
Benefits of Using Nessie Catalog
The Git-like commits and branches the Nessie Catalog comes with offer various advantages:
- Work in isolation – If you wish to work on a new feature or a bug fix, start a separate branch to isolate your changes and then merge them once they’ve been tested.
- Rollback and time travel – If bad data reaches production, restoring it is as simple as reverting to the most recent working commit.
- Prior data state – If you need to refer to data that is no longer part of the current state, you can look through prior commits.
- Troubleshooting – When examining data to be added to the data lake, you can generate a list of differences, which greatly simplifies review and QA.
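The list of differences mentioned in the last point can be illustrated with a small Python function that compares two catalog states. This is a toy sketch, not how Nessie computes diffs: it assumes each branch state is represented as a plain dictionary from table name to metadata location, and the function name and sample paths are hypothetical.

```python
from typing import Dict, List, Tuple


def diff_catalog(base: Dict[str, str], other: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (change, table) pairs describing how `other` differs from `base`."""
    changes: List[Tuple[str, str]] = []
    for table in sorted(base.keys() | other.keys()):
        if table not in base:
            changes.append(("added", table))
        elif table not in other:
            changes.append(("removed", table))
        elif base[table] != other[table]:
            # Same table, different metadata pointer: the table was modified.
            changes.append(("changed", table))
    return changes


if __name__ == "__main__":
    main_state = {"db.orders": "orders/v1.metadata.json", "db.users": "users/v1.metadata.json"}
    branch_state = {"db.orders": "orders/v2.metadata.json", "db.events": "events/v1.metadata.json"}
    for change, table in diff_catalog(main_state, branch_state):
        print(change, table)
```

Reviewing a compact list like this before a merge shows at a glance which tables a branch touched.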
Going Beyond Nessie Catalog Management with lakeFS
Nessie is one of the top data catalog tools on the market. But when it comes to data version control, it comes with one significant limitation:
Nessie only supports the Iceberg format.
To bring data versioning to data in other formats, you need a more comprehensive solution.
lakeFS is an open-source data version control system that provides environment-wide version control for structured, unstructured or semi-structured data. It enables isolated feature branches for data changes that span multiple tables or formats, allowing you to test modifications safely and merge atomically. This approach enables multiple teams to develop in parallel without interference, collaborating effectively on data changes using pull requests that streamline the development process.
Beyond collaboration, lakeFS automates data contracts enforcement with hooks that catch issues early before they affect production, validating schemas, referential integrity, and data correctness. The system provides data reproducibility at scale by anchoring ML experiments to concrete data versions of your entire environment, creating a reliable foundation for reproducible experiments, pipelines and models. Additionally, lakeFS helps manage and govern access to data through its commit log that tracks who changed what and when, while enforcing fine-grained access control with RBAC policies.
With the introduction of lakeFS’s Iceberg REST catalog, teams can now seamlessly integrate Apache Iceberg table format management directly into their version-controlled data development workflow, providing native support for schema evolution, time travel queries, and metadata management while maintaining all the collaborative and governance benefits of lakeFS branching and merging capabilities.
| Feature | lakeFS | Nessie |
|---|---|---|
| License | Open Source Apache 2.0 + Proprietary lakeFS Enterprise | Open Source Apache 2.0 + Proprietary Dremio |
| GitHub Stars | 4.7k | 1.2k |
| Project Status | 16 committers in last 6m with >5 commits, actively maintained | 1 committer in last 6m with >5 commits |
| Data Formats | Structured, semi-structured, unstructured | Iceberg only |
| Versioning Capabilities | Branch, commit, tag, diff & merge + hooks for validation, pull requests, branch protection | Branch, commit, tag + limited merge capabilities* |
| Security & Access Control | OIDC, SAML, SCIM; built-in user and group management | OIDC; policies directly attached to role/user (no groups) |
| Data Management | Full-featured UI, built-in mirroring, GC, auditing, meaningful commit log* | Basic UI (browse only), GC |
*Use cases:
- Traceability – see who changed what, when, and why
- Disaster recovery – access consistent data snapshots in other data centers or geographical regions
- Data discovery and exploration using a friendly user interface
- Manageability – a single pane of glass to manage access and maintenance of all types of data (structured and unstructured)
- Quality assurance of data and metadata (using actions, pull requests, branch protection)
Conclusion
Nessie offers a good solution for the challenges of modern data management. By offering data version control at the catalog level, it promotes cooperation and experimentation and supports data governance, compliance, and overall data team agility.
However, it’s currently available only to teams working with Iceberg. If you use a different solution but would still like to benefit from data version control, check out lakeFS – its list of integrations is bound to include your tooling of choice for managing complex data environments.



