Best of 2023: Advancing Data Version Control Best Practices with lakeFS

Keren Shohet

Last updated on January 5, 2026

As 2023 nears its final stretch, let’s take a moment to reflect on how the year unfolded. One thing that especially stood out is that data version control is no longer an emerging tool; it is now a category of its own. This is a major milestone that reflects where data versioning is heading. As a leading player in this category, we worked hard together with our community to provide more features and integrations that set lakeFS users up for success.

With that in mind, let’s revisit some of our most memorable moments of 2023, including open source updates, noteworthy integrations and partnerships, and Cloud/Enterprise features. Finally, I’ll leave you with some extra reading material, rounding up the most popular content among our community in 2023.

Open Source Updates

We have an active community that continuously contributes to the evolution of lakeFS, and 2023 was no different: your feedback is our North Star. We rolled out features, integrations, and partnerships to make the experience as seamless and productive as possible.

Since we’re summarizing 2023, I thought it was only fair to ask our community:

What features do you think are the most notable?
What projects did you have the most fun working on?

As expected, a looooong list of favorites was shared. The answers to why were even longer. However, a common theme emerged. Let’s dive into five monumental changes that best reflect the evolution of lakeFS in 2023:

lakectl local
High Level Python SDK 2.0
S3 Express One Zone Integration
Pre-signed URLs Support
Iceberg Support

Bonus: lakeFS 1.0

lakectl local

I’ll state the obvious: as products evolve, you’ll run across features that can (and should) be optimized further. The team identified one such feature: although lakeFS handles data at an extremely high scale, sometimes a local checkout is needed.

That’s why this year, to better support ML use cases, we decided to invest in lakectl and introduce local checkouts, giving you the ability to work with the objects of a lakeFS repository locally and keep them in sync.

Using local checkouts, you can “clone” data stored in lakeFS to any machine, track what version you were using in Git, and create a reproducible local workflow that not only scales but is much easier to use.

Note that this is available from v0.106.1. Oz Katz even compiled a complete lakectl local tutorial on how to implement this including in-depth code examples.
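As a rough illustration of that workflow, here is a minimal Python sketch that assembles lakectl local command lines. It assumes lakectl is installed and already configured with credentials; the repository name, branch, prefix, and target directory are hypothetical placeholders, and the helper functions are ours rather than part of lakeFS.

```python
import subprocess


def lakefs_uri(repo: str, ref: str, prefix: str = "") -> str:
    """Build a lakefs:// URI pointing at a prefix on a given ref (branch, tag, or commit)."""
    return f"lakefs://{repo}/{ref}/{prefix.lstrip('/')}"


def clone_command(repo: str, ref: str, prefix: str, target_dir: str) -> list:
    """Argument list for cloning a lakeFS prefix into a local directory."""
    return ["lakectl", "local", "clone", lakefs_uri(repo, ref, prefix), target_dir]


def commit_command(target_dir: str, message: str) -> list:
    """Argument list for committing local changes back to the remote branch."""
    return ["lakectl", "local", "commit", target_dir, "-m", message]


def run(argv: list) -> None:
    """Execute a command and raise if lakectl exits with a non-zero status."""
    subprocess.run(argv, check=True)


# Hypothetical names, for illustration only.
clone_argv = clone_command("example-repo", "main", "datasets/images/", "./images")
commit_argv = commit_command("./images", "Add newly labeled images")
# run(clone_argv)   # uncomment on a machine where lakectl is configured
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems when a path or commit message contains spaces.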

High Level Python SDK

This is another example of how product evolution calls for improvements. Working with a full-featured Python SDK was a given for us at lakeFS. But until recently, our automatically generated code wasn’t always as Pythonic as it could be.

In 2023, we decided to change that and released a High-Level Python SDK. We also built a much nicer abstraction layer on top of the generated code.

The new and improved Python SDK is easier for Python programmers to use: it is more Pythonic, object-oriented, and documented by humans. It provides optimal client behavior for the most common user requirements, and you can also use fsspec and skip all that: “Pandas just works” (more on that in our partnership with appliedAI). Explore all the quality-of-life improvements now available with these hands-on examples compiled by Oz Katz and Nir Ozeri.

lakeFS and Amazon S3 Express One Zone Integration

When versioning data, the tools you use must deliver on speed, performance, and cost. This year’s opening keynote at AWS re:Invent announced S3 Express One Zone. As an AWS technology partner, we were thrilled to work with the S3 team as design partners to get this exciting feature out of the gate!

S3 Express One Zone can improve data access speed by 10x, reduce request costs by 50%, and give you the option to co-locate your storage and compute resources for even lower latency.

Let’s drill down into the two primary benefits lakeFS users gain from S3 Express One Zone:

Data versioned in a lakeFS repository built on S3 Express One Zone can leverage very low latency and 10x better performance with no overhead. This is possible thanks to lakeFS’ support for pre-signed URLs, allowing users to access the storage layer directly with no additional network hops between the compute clusters and the object store.
Teams that use lakeFS and S3 Express One Zone together enjoy up to 5x faster merge, diff, and commit operations. Since lakeFS stores its own metadata in the underlying object stores, any lakeFS repository running on top of S3 Express One Zone will automatically enjoy these performance enhancements. Data version control at scale just got even faster!

Pre-signed URLs Support

Wrapping up the list of features our team most enjoyed working on in 2023 is the configuration of lakeFS to use pre-signed URLs.

As we generate and store more data, it becomes all the more critical that it is both accessible and protected. We took this into account when we developed this feature: with pre-signed URLs, clients read and write objects directly against the object store, so you can manage your data through lakeFS while we are not able to access it. Note that this is available on all common object stores (S3, Google Cloud, Azure, MinIO).
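To make the idea concrete, here is a toy sketch of how a pre-signed URL works in general: the holder of a secret key signs the object path and an expiry time, and anyone holding the resulting URL can use it until it expires. This is a simplified illustration using an HMAC, not the actual signing scheme of S3 or any other object store, and the host name, key, and function names are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # hypothetical key, known only to the storage service


def _sign(path: str, expires_at: int, key: bytes) -> str:
    """HMAC-SHA256 over the method, object path, and expiry timestamp."""
    message = f"GET\n{path}\n{expires_at}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def presign(path: str, expires_at: int, key: bytes = SECRET_KEY) -> str:
    """Return a URL that grants read access to `path` until `expires_at` (Unix seconds)."""
    query = urlencode({"expires": expires_at, "signature": _sign(path, expires_at, key)})
    return f"https://storage.example.com{path}?{query}"


def verify(url: str, now: int, key: bytes = SECRET_KEY) -> bool:
    """Check that the URL is unexpired and that its signature matches its path."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _sign(parsed.path, expires_at, key)
    return now <= expires_at and hmac.compare_digest(signature, expected)


# Example: a URL valid for 15 minutes from a fixed point in time.
issued_at = 1_700_000_000
url = presign("/bucket/data/file.parquet", expires_at=issued_at + 900)
```

Because the signature covers the path and the expiry, a client cannot reuse the URL for a different object or extend its lifetime, which is what lets the data flow directly between the client and the object store.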

Iceberg Support

Apache Iceberg is one of the most popular open table formats and when you add a data version control tool into the mix, you get a winning combination: Iceberg integration with lakeFS. The Iceberg catalog maps lakeFS branches, tags, and version refs to schemas in Iceberg.

We will continue to grow our Iceberg support and have already rolled out an experimental feature allowing you to use a Spark SQL engine to compute “data diffs” between versions of an Iceberg table that are stored in two different lakeFS versions.

Ariel Shaqed (Scolnicov), Lynn Rozen, and Jonathan Rosenberg had a lot of fun developing and working on this and you can learn more about it here, as well as check out our Iceberg documentation here.

Bonus: lakeFS 1.0 generally available

By now, you’ve probably noticed that most of these features are “upgrades” to existing features available in the core lakeFS system. If you plan to try them out, we suggest you upgrade to lakeFS 1.0 or later.

With 1.0, forthcoming lakeFS releases are guaranteed to maintain backwards compatibility for the life of the 1.x version, meaning existing code and functionality will continue to be fully operational even as we release more features.

lakeFS Partnerships

In addition to all the work we did on improving lakeFS built-in features, some of our biggest highlights can be traced back to the partnerships and integrations available with lakeFS.

In my opinion, one of the biggest advantages lakeFS brings to the table is the fact that you don’t need to give up on specific tools in order to manage your data versions at scale. Since we use metadata to manage data versions, you can essentially work with most platforms when using lakeFS.

This is why we are continually working on more integrations to support your data versioning and in 2023, we added a number of integrations and partnerships worth mentioning:

Unity Catalog
DuckDB
Airflow
Prefect
Dagster
appliedAI lakeFS-spec

Unity Catalog

Our integration with Unity Catalog, a comprehensive catalog solution by Databricks, helps ensure that data management remains efficient and robust.

Our integration includes full data versioning for Unity tables, collaboration and teamwork, along with enhanced data governance. You can explore the full list of benefits available with Unity Catalog, as well as review our documentation for more information on this integration.

DuckDB

In the engineering world, one might argue that 2023 wasn’t the year of the rabbit; it was the year of the duck. Specifically, DuckDB. Their SQL OLAP database management system exploded in 2023, and DuckDB was (and still is) in high demand.

lakeFS added support for DuckDB towards the end of 2022 but in 2023, we made some critical updates allowing you to access data in lakeFS from DuckDB as well as use DuckDB from within the web interface of lakeFS.

Airflow

lakeFS has inherently supported working with Apache Airflow but in 2023, we announced the official release of the lakeFS Airflow provider. This essentially gives you the power to integrate lakeFS functionality into your Airflow DAGs.

Since the library is published on PyPI, you can easily install it in your project, then import and use the lakeFS-specific operators in your Airflow scripts the same way you import any other operator.

Prefect

Troubleshooting pipelines can be cumbersome, especially if you don’t know where the workflows failed and what data went through them. When managing these pipelines in dev, test, and production environments, it’s key to identify what failed.

Our integration with the Prefect workflow orchestration tool is what helps ensure that you will no longer operate your pipelines with a blindfold on. In this very detailed guide, Amit Kesarwani provides step-by-step instructions on how to use this integration with your pipeline management.

Dagster

Another orchestration tool we were excited to integrate with this year is Dagster. It offers several key components to help users build and run their data pipelines.

When using lakeFS with Dagster, you’re able to create isolated environments of the data to run production pipelines, in a way that ensures only quality data will be exposed to production. You can get a full overview of the integration in this guide.

appliedAI lakeFS-spec

Rounding out our partnerships and integrations throughout 2023, let’s close the year with an open source library for using lakeFS with Python. This initiative by Jan Willem Kleinrouweler and Max Mynter of appliedAI consists of a library implementing an fsspec backend for the lakeFS data lake, giving you the flexibility to add data version control to your ML projects.

Find out how to use lakeFS-spec in this step-by-step guide, making it as simple as writing files, be it from your favorite data science libraries or low-level file operations.

Cloud/Enterprise Features

Just like in open source, I wanted to also understand from our team what they most enjoyed working on when it comes to our Enterprise and Cloud solutions.

Without further ado, let’s take a look at some of the top picks:

Azure lakeFS Cloud
RBAC
SSO
Managed Garbage Collection

Azure lakeFS Cloud

In addition to everything you get with lakeFS open source, lakeFS Cloud provides a fully managed lakeFS platform, and a new addition is our support for Azure’s stack of technologies.

This includes Synapse and Databricks, along with integrations that enable you to use Spark on Azure with lakeFS. This gives you the ability to adopt best practices with data, including reproducibility, CI/CD of your data pipelines, isolated data environments for development, and reverting changes made to data.

Role-Based Access Control (RBAC)

Ensuring that you have control over which authentication best fits your security needs led us to decouple the security authentication and access control features.

As of 2023, lakeFS enables you to plug in your own authentication and security mechanism with RBAC. The lakeFS RBAC model is based on five components of the system: users, actions, resources, policies, and groups.
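A minimal sketch of how these five components can fit together is shown below. It is an illustration of a generic policy-based model, not the lakeFS implementation: the class names, the wildcard matching, the rule that an explicit deny overrides any allow, and the action and resource strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Statement:
    effect: str     # "allow" or "deny"
    actions: tuple  # action patterns, e.g. ("fs:Read*",)
    resource: str   # resource pattern, e.g. "repository/prod/*"


@dataclass
class Policy:
    name: str
    statements: list


@dataclass
class Group:
    name: str
    policies: list = field(default_factory=list)


@dataclass
class User:
    name: str
    groups: list = field(default_factory=list)
    policies: list = field(default_factory=list)


def is_allowed(user: User, action: str, resource: str) -> bool:
    """Allow only if some statement allows the request and none denies it."""
    policies = list(user.policies) + [p for g in user.groups for p in g.policies]
    allowed = False
    for policy in policies:
        for statement in policy.statements:
            if not fnmatchcase(resource, statement.resource):
                continue
            if not any(fnmatchcase(action, pattern) for pattern in statement.actions):
                continue
            if statement.effect == "deny":
                return False  # assumption: an explicit deny always wins
            allowed = True
    return allowed


# Illustrative policies, groups, and users (all names are hypothetical).
read_only = Policy("ReadOnly", [Statement("allow", ("fs:Read*", "fs:List*"), "*")])
write = Policy("Write", [Statement("allow", ("fs:WriteObject",), "*")])
protect_prod = Policy("ProtectProd", [Statement("deny", ("fs:WriteObject",), "repository/prod/*")])

analysts = Group("Analysts", [read_only])
engineers = Group("Engineers", [read_only, write, protect_prod])

alice = User("alice", groups=[analysts])
bob = User("bob", groups=[engineers])
```

Attaching policies to groups rather than to individual users keeps permissions manageable as a team grows: changing one group policy updates access for every member.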

SSO

As part of our lakeFS Cloud and Enterprise solutions, we strive to maintain the highest level of security, making it a necessity to add a single sign-on (SSO) mechanism. In the case of Cloud, we use Auth0 for authentication and thus support the same identity providers as Auth0. In lakeFS Enterprise, authentication is handled by a secondary service which runs side by side with lakeFS.

Managed Garbage Collection

Going back to my earlier reflection, here’s an example of why data version control merits a category of its own. Managing vast amounts of data efficiently requires a robust data versioning tool. It also calls for the ability to dispose of data in a manner that is consistent with privacy rights and data retention policies and guidelines.

Simply put, garbage collection can be difficult for big data lakes. To do it at scale, you need the ability to distinguish between committed and uncommitted data.
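The distinction can be sketched as a simplified mark-and-sweep pass. This is not the lakeFS garbage collection algorithm; it is a toy model that assumes an object stays live if it is referenced by a branch head, by a commit still inside the retention window, or by uncommitted (staged) changes. All names and data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Commit:
    id: str
    committed_at: datetime
    objects: frozenset           # physical addresses referenced by this commit
    is_branch_head: bool = False


def gc_candidates(stored, commits, staged, now, retention):
    """Return the stored objects that no retained commit or staging area references."""
    live = set(staged)  # uncommitted but still in use on some branch
    for commit in commits:
        # Mark: branch heads and commits inside the retention window keep their objects.
        if commit.is_branch_head or now - commit.committed_at <= retention:
            live |= commit.objects
    # Sweep: whatever is stored but not marked live can be deleted.
    return set(stored) - live


now = datetime(2023, 12, 31)
commits = [
    Commit("c1", datetime(2023, 1, 1), frozenset({"a", "b"})),
    Commit("c2", datetime(2023, 12, 30), frozenset({"b", "c"}), is_branch_head=True),
]
stored = {"a", "b", "c", "d", "e"}
staged = {"d"}

candidates = gc_candidates(stored, commits, staged, now, retention=timedelta(days=30))
```

In this example, "a" is only referenced by an expired commit and "e" is referenced by nothing at all, so both become deletion candidates, while the staged object "d" is preserved even though it has never been committed.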

lakeFS Managed Garbage Collection (available as part of our Cloud offering) automates the process of identifying and removing redundant data versions and optimizes storage utilization, improving data access performance and enabling data governance.

What content did data practitioners love?

Besides all the product updates and rollouts we focused on in 2023, we also learned a lot from our community and tried to share these lessons via content, tutorials and guides.

When analyzing our traffic, we noticed some articles and searches that spurred interest and discussions.

Here’s our compilation of the most popular content published in 2023:

Conclusion

During 2023, we were busy working on exciting updates and releases, and if there’s one thing that I can say for sure about the team at lakeFS, it’s that this is only the cusp of what’s to come! We will continue to strive for perfection (or at least the preservation of engineering best practices in everything we do). Stay tuned for more in 2024!