Hudi vs Iceberg vs Delta Lake: Data Lake Table Formats Compared

Oz Katz

Last updated on March 10, 2025

Home > Blog > Hudi vs Iceberg vs Delta Lake: Data Lake Table Formats Compared

Introduction to Data Lakehouse

When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The outcome will have a direct effect on its performance, usability, and compatibility.

It is inspiring that by simply changing the format data is stored in, we can unlock new functionality and improve the performance of the overall system.

Apache Hudi, Apache Iceberg, and Delta Lake are the current best-in-breed formats designed for data lakes. All three formats solve some of the most pressing issues with data lakes:

Atomic Transactions — Guaranteeing that update or append operations to the lake don’t fail midway and leave data in a corrupted state.
Consistent Updates — Preventing reads from failing or returning incomplete results during writes. Also handling potentially concurrent writes that conflict.
Data & Metadata Scalability — Avoiding bottlenecks with object store APIs and related metadata when tables grow to the size of thousands of partitions and billions of files.

Let’s take a closer look at the approaches of each format with respect to update performance, concurrency, and compatibility with other tools. Finally, we’ll provide a recommendation on which format makes the most sense for your data lake.

What is Apache Hudi?

Apache Hudi is an open data lakehouse platform that uses a high-performance open table format to provide database capabilities to data lakes. Hudi replaces slow batch data processing with a strong new incremental processing framework that enables low-latency, minute-level analytics.

Here are some major benefits of adopting Hudi in your data lake:

Hudi lets you efficiently update and remove records in your data lake. This eliminates the need for intricate data pipelines that rewrite entire datasets for changes
Hudi keeps track of all data versions, allowing you to go back in time and view previous versions of your data. This is critical for auditing, rolling back, or studying historical trends
The solution optimizes data organization for faster querying using engines such as Apache Spark, Presto, and Hive. This minimizes query latency and increases overall data lake usability
Hudi works perfectly with stream processing frameworks such as Apache Flink and Kafka Streams, allowing you to ingest and handle real-time data effectively

What is Apache Iceberg?

Apache Iceberg is a table format that eliminates the problems associated with old catalogs and is quickly becoming the industry standard for managing data lakes. Iceberg adds new capabilities that allow many applications to collaborate on the same data in a transactionally consistent manner while also defining extra information about the status of datasets as they evolve and change over time.

The Iceberg table format provides features and capabilities similar to SQL tables in traditional databases. Still, it’s open and accessible, allowing other engines (Dremio, Spark, etc.) to operate on the same information. Iceberg offers numerous features, such as:

Transactional consistency between several applications: files can be added, removed, or updated atomically, with full read isolation and numerous concurrent writes
Full schema evolution for tracking changes to a table over time
Time travel to query old data and confirm changes between updates
Partition layout and evolution allow for modifications to partition schemes as queries and data volumes change without relying on hidden partitions or physical directories
Rollback to previous versions to quickly fix errors and restore tables to a known good condition.
Advanced planning and filtering skills for high performance on large datasets

Iceberg achieves these capabilities for a table by tracking metadata files (aka manifests) through point-in-time snapshots and retaining all deltas when the table is modified over time. Each snapshot describes the table’s schema, partitions, and files in detail and provides perfect isolation and consistency.

Furthermore, Iceberg intelligently arranges snapshot metadata in a hierarchical fashion. This allows for quick and efficient modifications to tables without redefining all dataset files, resulting in optimal efficiency while working at a data lake scale.

Hudi vs Iceberg vs Delta Lake Feature Comparison

Feature	Apache Hudi	Apache Iceberg	Delta Lake
Write Modes	Copy-on-write available	Copy-on-write available	Copy-on-write available
ACID Transactions	Available	Available	Available
Time Travel	Available	Available	Available
Managed Ingestion	Availalble (via Hudi DeltaStreamer)	Not available	Not available
Concurrency Control	Optimistic Concurrency Control (OCC) with Non-blocking table services	Optimistic Concurrency Control (OCC) only	Optimistic Concurrency Control (OCC) only
Partition Management	Not available (coarse-grained partitions and fine-grained clustering)	Not available (considered an antipattern)	Available
Tool Integration	Greatest level of integration	Moderate integration with ecosystem tooling	Moderate integration with ecosystem tooling

Platform Compatibility

Hudi

Originally open-sourced by Uber, Hudi was designed to support incremental updates over columnar data formats. It supports ingesting data from multiple sources, primarily Apache Spark and Apache Flink. It also provides a Spark based utility to read from external sources such as Apache Kafka.

Reading data from Apache Hive, Apache Impala, and PrestoDB is supported. There’s also a dedicated tool to sync Hudi table schema into Hive Metastore.

Iceberg

Initially released by Netflix, Iceberg was designed to tackle the performance, scalability and manageability challenges that arise when storing large Hive-Partitioned datasets on S3.

Iceberg supports Apache Spark for both reads and writes, including Spark’s structured streaming. Trino (PrestoSQL) is also supported for reads with limited support for deletions. Apache Flink is supported for both reading and writing. Finally, Iceberg offers read support for Apache Hive.

Delta Lake

Delta Lake is maintained as an open-source project by Databricks (creators of Apache Spark) and not surprisingly provides deep integration with Spark for both reading and writing.

Read support is available for Presto, AWS Athena, AWS Redshift Spectrum, and Snowflake using Hive’s SymlinkTextInputFormat. Though this requires exporting a symlink.txt file for every Delta table partition, and as you might suspect, becomes expensive to maintain for larger tables.

Update Performance & Throughput

Support for row-level updates over large, immutable objects can be done in several ways, each with its unique trade-off regarding performance and throughput.

Let’s look at the strategy each data format employs for UPSERT operations. We’ll also touch on additional optimizations related to read performance.

Hudi

Hudi tables are flexible (and explicit) in the performance trade-offs they offer when handling UPSERTS. The trade-offs differ between the two different types of Hudi tables:

Copy on Write Table — Updates are written exclusively in columnar parquet files, creating new objects. This increases the cost of writes, but reduces the read amplification down to zero, making it ideal for read-heavy workloads.
Merge on Read Table — Updates are written immediately in row-based log files and periodically merged into columnar parquet. Interestingly, queries can either include the latest log file data or not, providing a useful knob for users to choose between data latency and query efficiency.

Hudi further optimizes compactions by utilizing Key Indexing to efficiently keep track of which files contain stale records.

Diagram illustrating a hierarchical structure with states (s0, s1), manifest lists, manifests, and corresponding data files.

Iceberg

With the release of Spark 3.0 last summer, Iceberg supports upserts via MERGE INTO queries. They work using the straightforward copy-on-write approach in which files with records that require an update get immediately rewritten.

Where Iceberg excels is on read performance with tables containing a large number of partitions. By maintaining manifest files that map object to partitions and keep column-level statistics, Iceberg avoids expensive object-store directory listings or the need to fetch partition data from Hive.

Additionally, Iceberg’s manifests allow a single file to be assigned to multiple partitions simultaneously. This makes Iceberg tables efficient at partition pruning and improves the latency of highly selective queries.

Delta Lake

During a MERGE operation, Delta uses metadata-informed data skipping to categorize files as either needing data inserted, updated, or deleted. It then performs these operations and records them as “commits” in a JSON log file called the Delta Log. These log files are rewritten every 10 commits as a Parquet “checkpoint” file that save the entire state of the table to prevent costly log file traversals.

To stay performant, Delta tables need to undergo periodic compaction processes that take many small parquet files and combine them into fewer, larger files (optimally ~1GB, but at least 128MB in size). Delta Engine, Databricks’ proprietary version, supports Auto-Compaction where this process is triggered automatically, as well as other behind-the-scenes write optimizations.

The Delta Engine further enhances performance over its open-source counterpart by offering key indexing using Bloom Filters, Z-Ordering for better file pruning at read time, local caching, and more.

Concurrency Guarantees

Allowing in-place updates for data tables means dealing with concurrency.

What happens if someone reads a table while it is being updated? And what happens when multiple writers make conflicting changes at the same time?

MVCC: Typically databases solve this problem via Multiversion Concurrency Control (MVCC), a method that utilizes a logical transaction log where all changes are appended.

OCC: Another approach called Optimistic Concurrency Control (OCC) allows multiple writes to occur simultaneously, only checking for conflicts before a final commit. If conflict is detected, one of the transactions is retried until it succeeds.

Hudi

True to form, Hudi offers both MVCC and OCC concurrency controls.

MVCC with Hudi means all writes must be totally ordered in its central log. To offer this guarantee, Hudi limits write concurrency to 1, meaning there can only be a single writer to a table at a given point in time.

To prevent that limitation, Hudi now also offers OCC experimentally. This feature requires either Apache Zookeeper or Hive Metastore to lock individual files and provide isolation.

Iceberg

Iceberg tables support Optimistic Concurrency (OCC) by performing an atomic swapping operation on metadata files during updates.

The way this works is every write creates a new table “snapshot”. Writers then attempt a Compare-And-swap (CAS) operation on a special value holding the current snapshot ID. If no other writer replaces the snapshot during a commit, the operation succeeds. If another writer makes a commit in the meantime, the other writer will have to retry until successful.

On distributed file systems such as HDFS, this can be done natively. For S3, an additional component is required to store pointers (currently only Hive Metastore is supported).

Delta Lake

The Delta documentation explains that it uses Optimistic Control to handle concurrency, since the majority of data lake operations append data to a time-ordered partition and won’t conflict.

In the situation where two processes add commits to a Delta Log file, Delta will “silently and seamlessly” check if the file changes overlap and allow both to succeed if possible.

The implication though, is that the underlying object store needs a way of providing either a CAS operation or a way of failing a write when multiple writers start to overwrite each other’s log entries.

Similar as with Iceberg, this functionality is possible out-of-the-box on HDFS, but is not supported by S3. As a result, writing from multiple Spark clusters is not supported in Delta on AWS with true transactional guarantees.

Note: The proprietary Delta Engine version supports multi-cluster writes on S3 using an external synchronization server managed by Databricks itself.

Challenges and Operational Considerations

Complexity of setup

Apache Hudi requires more manual configuration when interacting with platforms such as Apache Spark, Hive, or Presto. Iceberg is quite straightforward to set up but advanced capabilities, such as partition evolution and hidden partitioning, require further configuration. Delta Lake is strongly linked with Apache Spark, making it easy to set up if you currently use the Spark environment. However, running Delta Lake outside of Spark (for example, with Presto or Trino) adds complexity.

Maintenance overhead

Hudi requires active maintenance, particularly for tuning for use scenarios like batch or streaming intake. Iceberg provides a simpler operating model than Hudi because it doesn’t require maintenance-intensive features like compaction. However, managing table formats and guaranteeing compatibility across different versions and engines (Spark, Flink, and Trino) can add operational complexity. Delta Lake’s maintenance overhead is generally lower because of its strong connection with Spark and simplified design focused on transaction logs and metadata management.

Community support

Hudi has a growing community focused on real-time data input use cases. Iceberg’s community has grown rapidly, thanks to strong support from organizations such as Netflix, AWS, and others. Delta Lake benefits from its partnership with Databricks, which offers a big and active community, clear documentation, and comprehensive corporate support.

Hudi vs Iceberg vs Delta Lake: Which One to Choose?

If you’ve made it this far, we’ve learned about some of the important similarities and differences between Apache Hudi, Delta Lake, and Apache Iceberg.

The time has come to decide which format makes the most sense for your use case! My recommendation is guided by which scenario is most applicable:

Choose Iceberg

Your primary pain point is not changes to existing records, but the metadata burden of managing huge tables on an object store (more than 10k partitions). Adopting Iceberg will relieve performance issues related to S3 object listing or Hive Metastore partition enumeration.

On the contrary, support for deletions and mutations is still preliminary, and there is operational overhead involved with data retention.

Choose Hudi

You use a variety of query engines and require flexibility for how to manage mutating datasets. Be mindful that supporting tooling and the overall developer experience can be rough around the edges. Though possible, there is also operational overhead required to install and tune Hudi for real, large-scale production workloads.

If you’re using AWS managed services such as Athena, Glue or EMR — Hudi already comes pre-installed and configured, and is supported by AWS.

Choose Delta Lake

You’re primarily a Spark shop and expect relatively low write throughput. If you are also already a Databricks customer, Delta Engine brings significant improvements to both read and write performance and concurrency, and it can make sense to double down on their ecosystem.

For other Apache Spark distributions, it is important to understand that Delta Lake, while open source, will likely always lag behind Delta Engine to act as a product differentiator.

Integration with lakeFS

If you were wondering “Can I use lakeFS in conjunction with these data formats?”… The answer is Yes!

lakeFS can works symbiotically with any of Delta, Iceberg, or Hudi, providing the ability to branch, merge, rollback, etc. across any number of tables.

Since these formats operate at the table level, they provide no guarantees for operations that span multiple. With lakeFS, it is possible to modify multiple tables on a single isolated branch, then merge those changes atomically into a main branch, achieving cross-table consistency.

This article was contributed to by Paul Singman.

To learn more...

Read Related Articles.

Idan Novogroder
April 28, 2026

Best Practices | Data Engineering | Machine Learning

Center of Excellence for Enterprise AI: Models & Best Practices

Einat Orr, PhD
April 20, 2026

Best Practices | Machine Learning

How to Build Infrastructure for AI-Ready Data That Supports Scalable AI Workloads

Tal Sofer
April 13, 2026

Hudi vs Iceberg vs Delta Lake: Data Lake Table Formats Compared

Table of Contents

Introduction to Data Lakehouse

What is Apache Hudi?

What is Apache Iceberg?

Hudi vs Iceberg vs Delta Lake Feature Comparison

Platform Compatibility

Hudi

Iceberg

Delta Lake

Update Performance & Throughput

Hudi

Iceberg

Delta Lake

Concurrency Guarantees

Hudi

Iceberg

Delta Lake

Challenges and Operational Considerations

Complexity of setup

Maintenance overhead

Community support

Hudi vs Iceberg vs Delta Lake: Which One to Choose?

Choose Iceberg

Choose Hudi

Choose Delta Lake

Integration with lakeFS

To learn more...

Read Related Articles.

Iceberg Branching Best Practices for Reliable Data Operations

Computer Vision in Healthcare: Incorporating Data Version Control Into Your ML Workflow

How to Build Infrastructure for AI-Ready Data That Supports Scalable AI Workloads

Data Management for AI Projects: Strategies, Tools & Best Practices

Center of Excellence for Enterprise AI: Models & Best Practices

How to Build Infrastructure for AI-Ready Data That Supports Scalable AI Workloads

Hudi vs Iceberg vs Delta Lake: Data Lake Table Formats Compared

Table of Contents

Introduction to Data Lakehouse

What is Apache Hudi?

What is Apache Iceberg?

Hudi vs Iceberg vs Delta Lake Feature Comparison

Platform Compatibility

Hudi

Iceberg

Delta Lake

Update Performance & Throughput

Hudi

Iceberg

Delta Lake

Concurrency Guarantees

Hudi

Iceberg

Delta Lake

Challenges and Operational Considerations

Complexity of setup

Maintenance overhead

Community support

Hudi vs Iceberg vs Delta Lake: Which One to Choose?

Choose Iceberg

Choose Hudi

Choose Delta Lake

Integration with lakeFS

To learn more...

Read Related Articles.

Iceberg Branching Best Practices for Reliable Data Operations

Computer Vision in Healthcare: Incorporating Data Version Control Into Your ML Workflow

How to Build Infrastructure for AI-Ready Data That Supports Scalable AI Workloads

Related Articles

Data Management for AI Projects: Strategies, Tools & Best Practices

Center of Excellence for Enterprise AI: Models & Best Practices

How to Build Infrastructure for AI-Ready Data That Supports Scalable AI Workloads

Pick up the Slack with lakeFS