When Databricks users first hear about lakeFS, a common response is, “I already have time travel in Delta Tables.” This raises an important question: how is lakeFS better, and how can it complement Delta Tables? Let’s explore the key differences and the use cases where lakeFS shines, and why thousands of organizations, including many large enterprises, choose to manage their production data with lakeFS.
Single table time travel vs data version control: what’s the difference?
While Delta Tables allow time travel for a single table, a data version control system lets you manage your data as code. lakeFS manages a repository of Delta Tables (potentially hundreds of thousands of them) and lets you time travel across all of those tables at once: when you move back to a specific point in time, you see a consistent snapshot of every table in the repository as it was at that moment.
As a data version control system, lakeFS lets you commit changes to your data, creating snapshots you can always return to. You can open a branch in a repository to get an isolated data environment to work in, and you can merge your changes back into your main production branch.
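For illustration, here is a minimal sketch of that branch–commit–merge workflow using the high-level `lakefs` Python SDK. The repository and branch names are made up, and the method names shown (`repository`, `branch`, `create`, `commit`, `merge_into`) reflect one reading of the SDK and may differ slightly in your version:

```python
import lakefs  # high-level lakeFS Python SDK; credentials come from lakectl config or env vars

# Assumed repository name, for illustration only
repo = lakefs.repository("analytics")

# Open an isolated branch from production -- a metadata-only operation
branch = repo.branch("fix-revenue-etl").create(source_reference="main")

# ... write or modify objects/tables on the branch here ...

# Commit the changes: a snapshot of *all* tables in the repository you can always return to
branch.commit(message="Recompute revenue facts with corrected FX rates")

# Merge the reviewed changes back into the production branch
branch.merge_into(repo.branch("main"))
```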
It’s also important to note that Delta Table compaction relies on discarding history: once old files are removed, the history available to Delta time travel becomes more limited. With a data version control system, you never lose history unless you choose to. Even if you compact Delta Tables managed by lakeFS, only the commit on which you ran the compaction is affected; all earlier commits still preserve the full history of those tables.
Let’s take a look at some use cases that differentiate Delta time travel from data version control.
Use Case Differences
lakeFS works with any format
One of the most significant distinctions is that lakeFS is format agnostic: it can run on top of Delta Tables, Iceberg, or even unstructured data such as videos and images. This flexibility lets you manage a wide variety of data formats within the same system, providing a more versatile solution for data management. It also shortens time to market for data and AI products, because teams are not constrained by data formats and can leverage their existing infrastructure more effectively.
Creating multiple isolated dev/test environments
Zero-Copy Clones
With lakeFS branches, you can create any number of zero-copy clones of your environment; creating a branch is a metadata-only operation, so no data is copied. This means any data engineer working on an ETL pipeline can create their own isolated copy of the environment and work without stepping on anyone else’s toes. Similarly, any data scientist training a model can run preprocessing in isolation. This isolation improves data quality by preventing unintended changes and conflicts, making collaboration over data safe for all data practitioners.
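As a concrete, hypothetical example, a data engineer could point Spark at the lakeFS S3-compatible gateway and work against their own branch of the same Delta Table. The endpoint, repository, branch, and table paths below are illustrative assumptions, and the branch is assumed to already exist (created as in the earlier sketch); Delta Lake is assumed to be configured on the cluster:

```python
from pyspark.sql import SparkSession

# Spark talks to lakeFS through its S3-compatible gateway; paths are s3a://<repo>/<branch>/<path>
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")  # assumed lakeFS endpoint
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_KEY>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the production version of a Delta table from the main branch ...
orders = spark.read.format("delta").load("s3a://analytics/main/tables/orders")

# ... and write experimental output to an isolated branch, leaving main untouched
cleaned = orders.dropDuplicates(["order_id"])
cleaned.write.format("delta").mode("overwrite").save(
    "s3a://analytics/etl-dev-alice/tables/orders_clean"
)
```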
Write-Audit-Publish for your data
Secure Data Promotion
Using a combination of lakeFS merges and hooks, you can promote data to production securely. For example, a write-audit-publish workflow lets you verify data integrity and compliance before the data is made available for production use. This structured promotion process addresses a common pain point: slow and error-prone development and testing of data and AI pipelines and models.
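Below is a hedged sketch of a write-audit-publish flow driven from Python. In practice the audit step is often enforced as a pre-merge hook defined in the repository’s actions configuration, but the same idea can be expressed inline. The repository and branch names are made up, the validation is a placeholder, and the SDK method names are assumptions that may differ by version:

```python
import lakefs

repo = lakefs.repository("analytics")            # assumed repository name
staging = repo.branch("daily-load-2024-06-01")   # WRITE: the day's data was loaded onto this branch
main = repo.branch("main")

def audit_passed(branch) -> bool:
    # AUDIT: hypothetical placeholder -- replace with real validation logic,
    # e.g. row-count comparisons, schema checks, or a data-quality framework run.
    return True

if audit_passed(staging):
    staging.commit(message="Daily load 2024-06-01, validated")
    # PUBLISH: an atomic, metadata-only merge into production
    staging.merge_into(main)
else:
    raise RuntimeError("Audit failed; data stays quarantined on the staging branch")
```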
Troubleshooting and reproducibility
Logical Set of Data
Since lakeFS manages repositories, time travel (i.e., accessing historical commits) operates on a logical set of datasets rather than on a single table. You can open a branch from the specific merge or commit that introduced a change to production, reproduce every aspect of the environment, and troubleshoot and debug the issue on that branch. Meanwhile, you can revert the main branch to a previous point in time or keep it as is, depending on the use case. This capability enhances data reproducibility, a crucial requirement for auditing and AI/ML modeling.
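For example, to debug an incident you could open a branch pinned to the exact commit that changed production. The repository name and commit ID below are hypothetical, and the SDK call signatures are assumptions:

```python
import lakefs

repo = lakefs.repository("analytics")

# The commit that introduced the suspect change
# (hypothetical ID, e.g. found via lakectl log or the lakeFS UI)
bad_commit = "<commit-id>"

# Open a debugging branch pinned to that exact snapshot of *all* tables;
# production (main) is untouched while you troubleshoot here.
debug = repo.branch("debug-incident-1234").create(source_reference=bad_commit)

# If needed, main itself can separately be reverted to an earlier commit
# (e.g. with lakectl's revert command), or left as is.
```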
ML reproducibility
Beyond Time Travel
Time travel has one dimension: time. Machine learning, however, is a non-linear, iterative process: each data scientist typically runs their own preprocessing steps to prepare data for training their models. To achieve ML reproducibility, you need to understand the lineage of all these concurrent changes. With lakeFS, you can trace the data for each experiment and transformation back to the raw dataset, ensuring you can reproduce and verify any model’s results. This ability to trace and reproduce data transformations underpins the quality and reliability of AI/ML products.
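One common pattern, sketched below under the same assumptions about repository layout and SDK method names, is to resolve the experiment branch to an immutable commit ID at training time and record that ID alongside the model’s metadata, so the exact training data can always be re-read later:

```python
import lakefs
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()     # lakeFS S3 gateway configured as shown earlier

repo = lakefs.repository("analytics")
branch = repo.branch("experiment-churn-v7")    # assumed experiment branch

# Pin the experiment to an immutable commit rather than a moving branch head.
# (The exact call for reading the branch head may differ between SDK versions.)
commit_id = branch.head.id

# Commit IDs are valid refs in lakeFS paths, so this training set can be
# re-read byte-for-byte at any point in the future.
train_df = spark.read.format("delta").load(
    f"s3a://analytics/{commit_id}/tables/churn_features"
)

# Store the data reference with the model so results can be reproduced and audited
model_metadata = {"repo": "analytics", "data_commit": commit_id}
```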
Conclusion
While Delta Tables offer robust per-table time travel capabilities, lakeFS provides a more comprehensive and flexible solution for data management. Its format-agnostic nature and powerful features for isolated environments, Write-Audit-Publish, troubleshooting, and ML reproducibility make it an invaluable system for modern data engineering and data science workflows. By enabling isolated environments, secure data promotion, and reproducible workflows, lakeFS accelerates the development and deployment of data/AI products. This ensures high standards of data integrity and quality, and fosters safe and efficient collaboration.


