I often get asked what the difference is between lakeFS and open table formats (OTFs), namely Apache Iceberg, Delta Lake, and Apache Hudi. The short answer is, “Those are different technologies solving for different use cases.” But if this answer is so clear-cut, why does the question keep coming up?
It’s time for a detailed answer; let’s start from the very beginning.
What is an Open Table Format?
When moving structured data from a relational database to object storage, many traditional database guarantees are lost. For example, databases provide CRUD operations with guaranteed transactionality. Object storage is different: objects are immutable by design.
Basically, if you wish to change or expand a data file in object storage, you have to rewrite the whole file, and no transactionality is guaranteed. This becomes harder when your table consists of multiple files on disk (think sharding and partitions).
Open table formats (OTFs) close this gap by providing a table abstraction that enables data practitioners to create, append, update, and delete records. They also help with managing table schema evolution and allow concurrency and some level of transactionality.
How does it work?
Disclaimer: The description below is highly simplified for the sake of the argument.
While each format works slightly differently, the main concept is similar. I’ll use the Delta Lake terminology for this discussion.
While the data of the table is held in Parquet files that are immutable, changes are saved in additional data files called delta files, and the information telling us how to use those delta files is kept in log files called delta logs.
When we access the data, we can get its latest status by reading data files and delta files, and calculating (using Spark) a version of the table. We can do that for any time window where the delta files are still available.
This introduces a constraint: the log must be ordered, and that order has to be agreed upon by all writers to avoid corruption or inconsistencies.
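To make the ordering constraint concrete, here is a minimal sketch in Python. It is a toy model, not the Delta Lake log format: the class `ToyTableLog`, its `commit` method, and the file names are all hypothetical. Each writer states which version it expects to write next, and the log rejects any writer whose view is stale. A real implementation also has to make this check-and-append step atomic, which the single-threaded sketch does not attempt.

```python
class ToyTableLog:
    """Toy ordered log for one table (hypothetical; not the Delta Lake format)."""

    def __init__(self):
        self.entries = []  # entries[i] holds the actions committed as version i

    def commit(self, expected_version, actions):
        """Append actions as the next version, or fail if the log has moved on."""
        if expected_version != len(self.entries):
            raise RuntimeError(
                f"conflict: next version is {len(self.entries)}, "
                f"writer expected {expected_version}"
            )
        self.entries.append(list(actions))
        return expected_version


log = ToyTableLog()

# Two writers read the log at the same time and both plan to write version 0.
writer_a_view = len(log.entries)
writer_b_view = len(log.entries)

log.commit(writer_a_view, [("add", "part-0.parquet")])  # writer A wins version 0

try:
    log.commit(writer_b_view, [("add", "part-1.parquet")])  # writer B is stale
    stale_writer_rejected = False
except RuntimeError:
    stale_writer_rejected = True
    # Writer B re-reads the log and retries on top of writer A's commit.
    log.commit(len(log.entries), [("add", "part-1.parquet")])
```

Because the stale writer is forced to retry, both writers end up agreeing on a single order of entries.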
A by-product: time travel
By iterating over the ordered log described above, we can look at different versions, representing changes over time.
We can time travel between different versions of a table by only reading up to a given log location. This might incur a performance penalty, and the time frame available for this action may be limited, but this functionality is indeed time travel, for a single table.
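As a rough illustration of reading up to a given log location, the sketch below replays a toy log and returns the data files that are live at a chosen version. The function `files_at_version`, the `("add", path)` / `("remove", path)` action tuples, and the file names are assumptions made for this example; they do not reproduce any format’s real log layout.

```python
def files_at_version(log, version):
    """Replay log entries 0..version and return the set of live data files."""
    if not 0 <= version < len(log):
        raise ValueError(f"version {version} is not available in the log")
    live = set()
    for actions in log[: version + 1]:
        for op, path in actions:
            if op == "add":
                live.add(path)
            elif op == "remove":
                live.discard(path)
    return live


# Hypothetical log: entry i describes the change that produced version i.
log = [
    [("add", "part-0.parquet")],
    [("add", "part-1.parquet")],
    # A rewrite replaces part-0 with a new file in a single entry.
    [("remove", "part-0.parquet"), ("add", "part-2.parquet")],
]

version_1 = files_at_version(log, 1)          # the table as it was at version 1
latest = files_at_version(log, len(log) - 1)  # the current table
```

Here `version_1` contains `part-0.parquet` and `part-1.parquet`, while `latest` contains `part-1.parquet` and `part-2.parquet`. The old version stays readable only as long as the files it refers to are still stored.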
Open table formats offer soft/shallow copy
There are two ways to copy a table: a hard copy and a soft copy. A hard copy is an actual, physical copy, while a soft copy of a table is a metadata operation that allows read-only access to the table at a point in time.
The term “branch” is used in Iceberg for a soft copy, and since it’s a term borrowed from the world of version control, one may deduce that it acts like a branch in Git. In practice, it is a read-only soft copy.
OTFs provide two functionalities that resemble version control at first glance:
- Per table time travel
- Per table branch/soft copy
But would these qualify as version control? I don’t think so.
What is Data Version Control and what is it for?
lakeFS is a data version control system that allows lifecycle management of data from development to production, just like one would manage the application code lifecycle.
The main use cases for data version control are:
- Developing data pipelines in isolation
- Testing changes to pipelines in isolation
- Automating data quality tests
- Gatekeepers to data promotion (also referred to as CI/CD for data)
- Reproducibility of data sets
- Transactionality for multiple tables
- Rollback of data quality issues in production
Here’s how lakeFS addresses all of the use cases above:
- lakeFS performs actions over a repository of data sets. The repository is logically defined by the user to hold a collection of datasets that take part in a pipeline or that represent a certain aspect of reality. For example, all data sets that include information about your customers.
- The data sets may be structured, unstructured, or managed by an open table format. lakeFS supports any format.
- You can perform Git-like operations over the repository, such as branching, committing, merging, and so on.
- You can time travel to any existing branch, commit, or merge recorded by the system. Each of these points in time is a state of the whole repository, NOT of a single table or data set. Branches can be both read from and written to, and can also be defined as restricted.
- The Git-like concepts are implemented through metadata, so all of these capabilities are metadata operations, and the data lake stays deduplicated.
- The versions managed by the system are determined by the commit and branch operations that users choose to perform for application-level reasons, not by parameters set for the format, as in OTFs.
- lakeFS operates on a data repository of any format, rather than one table in a given open table format.
- It provides full Git-like operations while reducing storage costs.
- It allows the implementation of engineering best practices in data pipeline development: during development, testing, staging, and production.
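To illustrate the points above, here is a deliberately small Python sketch of repository-level versioning. It is a toy model and not the lakeFS API or its internal design: `ToyRepo`, its methods, the paths, and the object addresses are all hypothetical. Commits map paths to object addresses, branches are pointers to commits, and creating a branch copies no data.

```python
class ToyRepo:
    """Toy versioned repository (hypothetical; not the lakeFS API)."""

    def __init__(self):
        self.commits = [{}]          # commit id -> {path: object address}
        self.branches = {"main": 0}  # branch name -> commit id
        self.staged = {"main": {}}   # uncommitted writes per branch

    def create_branch(self, name, source):
        # Metadata only: the new branch points at the source's commit.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def write(self, branch, path, address):
        self.staged[branch][path] = address

    def commit(self, branch):
        # One commit captures every staged path, across all tables.
        snapshot = dict(self.commits[self.branches[branch]])
        snapshot.update(self.staged[branch])
        self.staged[branch] = {}
        self.commits.append(snapshot)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def merge(self, source, dest):
        # Simplified: source's entries win; no conflict or delete handling.
        merged = dict(self.commits[self.branches[dest]])
        merged.update(self.commits[self.branches[source]])
        self.commits.append(merged)
        self.branches[dest] = len(self.commits) - 1
        return self.branches[dest]

    def read(self, ref, path):
        # ref is either a branch name or a commit id.
        commit_id = self.branches.get(ref, ref)
        return self.commits[commit_id].get(path)


repo = ToyRepo()
repo.write("main", "customers/part-0.parquet", "obj-a")
repo.write("main", "orders/part-0.parquet", "obj-b")
base = repo.commit("main")

# Develop in isolation: change two tables on a branch, then commit them together.
repo.create_branch("dev", "main")
repo.write("dev", "customers/part-0.parquet", "obj-c")
repo.write("dev", "orders/part-0.parquet", "obj-d")
repo.commit("dev")

main_before_merge = repo.read("main", "orders/part-0.parquet")
repo.merge("dev", "main")  # both tables change on main in one step
main_after_merge = (
    repo.read("main", "customers/part-0.parquet"),
    repo.read("main", "orders/part-0.parquet"),
)
at_base_commit = repo.read(base, "customers/part-0.parquet")
```

Until the merge, readers of `main` still see `obj-b`; after it, both tables move to the new objects at once, and the `base` commit still resolves to `obj-a`. Rolling back would amount to pointing `main` at an earlier commit.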
While lakeFS can provide data version control for data in any format, and supports managing repositories of data sets saved in OTF, it has even deeper support for open table formats.
Since OTFs record table-level changes, lakeFS can use this knowledge to provide detailed diff operations and smart merge capabilities.
Support for Delta Lake diff is already available, and support for Iceberg is coming soon.