As 2023 nears its final stretch, let’s take a moment to reflect on how the year unfolded. One thing that especially stood out is that data version control is no longer an emerging tool; it is now a category of its own. This is a major milestone that reflects where data versioning is heading, and as a leading player in this category, we worked hard together with our community to deliver features and integrations that set lakeFS users up for success.
With that in mind, let’s revisit some of our most memorable moments of 2023, including open source updates, noteworthy integrations and partnerships, and Cloud/Enterprise features. Finally, I’ll leave you with some extra reading material, rounding up the most popular content among our community in 2023.
Open Source Updates
We have an active community that continuously contributes to the evolution of lakeFS, and 2023 was no different: your feedback is our North Star. We rolled out features, integrations, and partnerships to make the experience as seamless and productive as possible.
Since we’re summarizing 2023, I thought it was only fair to ask our community:
- What features do you think are the most notable?
- What projects did you have the most fun working on?
As expected, a looooong list of favorites was shared. The answers to why were even longer. 😉 However, a common theme emerged. Let’s dive into five monumental changes that best reflect the evolution of lakeFS in 2023:
- lakectl local
- High Level Python SDK 2.0
- S3 Express One Zone Integration
- Pre-signed URLs Support
- Iceberg Support
🌟🎉Bonus: lakeFS 1.0🎉🌟
lakectl local
I’ll state the obvious: as products evolve, you’ll run across features that can (and should) be optimized further. The team identified one such need: although lakeFS handles data at an extremely high scale, sometimes a local checkout is needed.
That’s why this year, to better support ML use cases, we decided to invest in lakectl and introduce local checkouts, giving you the ability to work locally with objects from a lakeFS repository and sync them back.
Using local checkouts, you can “clone” data stored in lakeFS to any machine, track what version you were using in Git, and create a reproducible local workflow that not only scales but is much easier to use.
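For a sense of the workflow, here’s a hedged sketch assuming lakectl is installed and configured against a running lakeFS server; the repository, branch, and paths are placeholders:

```shell
# Check out a lakeFS path into a local directory
lakectl local clone lakefs://example-repo/main/datasets/images ./images

# ...work on the files locally (train a model, clean data, etc.)...

# See which local files changed relative to the checked-out version
lakectl local status ./images

# Upload the changes and record them as a new commit on the branch
lakectl local commit ./images -m "clean corrupt images"
```

Because the checkout records which lakeFS commit it came from, the directory can also be tracked in Git alongside your code for full reproducibility.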
High Level Python SDK
This is another example of how product evolution calls for improvements. Working with a full-featured Python SDK was a given for us at lakeFS. But until recently, our automatically generated client code didn’t always follow Pythonic conventions.
In 2023, we decided to change that and released a High-Level Python SDK. We also built a much nicer abstraction layer on top of the generated code.
The new and improved Python SDK is easier for Python programmers to use: it is more Pythonic, object-oriented, and documented by humans. It provides sensible client behavior for the most common workflows, and through fsspec support lets you skip the plumbing entirely: “Pandas just works” (more on that in our partnership with appliedAI). Explore all the quality-of-life improvements in these hands-on examples compiled by Oz Katz and Nir Ozeri.
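To give a feel for the new ergonomics, here’s a minimal sketch using the high-level `lakefs` package. The repository, branch, and object names are placeholders, a running lakeFS server is assumed, and the exact method surface should be checked against the SDK reference:

```python
# Hedged sketch of the high-level lakeFS Python SDK (package `lakefs`).
# Assumes a running lakeFS server and an existing repository; all names
# below are placeholders for illustration only.

def upload_and_commit(repo="example-repo", branch_name="main"):
    import lakefs  # pip install lakefs

    branch = lakefs.repository(repo).branch(branch_name)

    # Object-oriented upload: no manual client plumbing required
    branch.object("data/hello.txt").upload(data="hello, lakeFS\n")

    # Commit the staged change with a human-readable message
    return branch.commit(message="add hello.txt")
```

Compare this with the generated low-level client, where the same flow takes several API calls and explicit request objects.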
lakeFS and Amazon S3 Express One Zone Integration
When versioning data, the tools you use must deliver on speed, performance, and cost. This year’s opening keynote at AWS re:Invent announced S3 Express One Zone. As an AWS technology partner, we were thrilled to work with the S3 team as design partners to get this exciting feature out of the gate!
S3 Express One Zone can improve data access speed by 10x, reduce request costs by 50%, and give you the option to co-locate your storage and compute resources for even lower latency.
Let’s drill down into the two primary benefits lakeFS users gain from S3 Express One Zone:
- Data versioned in a lakeFS repository built on S3 Express One Zone can leverage very low latency and 10x better performance with no overhead. This is possible thanks to lakeFS’ support for pre-signed URLs, allowing users to access the storage layer directly, with no additional network hops between the compute clusters and the object store.
- Teams that use lakeFS and S3 Express One Zone together enjoy up to 5x faster merge, diff, and commit operations. Since lakeFS stores its own metadata in the underlying object stores, any lakeFS repository running on top of S3 Express One Zone will automatically enjoy these performance enhancements. Data version control at scale just got even faster!
Pre-signed URLs Support
Next on the list of features our team most enjoyed working on in 2023 is the ability to configure lakeFS to use pre-signed URLs.
As we generate and store more data, it becomes all the more critical that it is both accessible and protected. We took this into account when we developed this feature: with pre-signed URLs, clients talk to the underlying object store directly, so you stay in control of your data and we never gain access to it. Note that this is available on all common object stores (S3, Google Cloud Storage, Azure Blob Storage, MinIO).
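To make the trust model concrete, here’s a toy, stdlib-only illustration of how pre-signed URLs work in general. This is not lakeFS’s implementation; it just shows the idea that a server signs a path and expiry with a secret, so the storage layer can verify the grant without another round trip:

```python
# Toy illustration of the pre-signed URL mechanism (NOT lakeFS code):
# the signer embeds an expiry and an HMAC over method + path + expiry,
# so whoever holds the secret can later verify the grant statelessly.
import hashlib
import hmac
import time
from urllib.parse import urlencode, parse_qs, urlparse

SECRET = b"server-side-secret"  # hypothetical signing key


def presign(path, ttl_seconds=300, now=None):
    expires = (now or int(time.time())) + ttl_seconds
    msg = f"GET\n{path}\n{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'signature': sig})}"


def verify(url, now=None):
    parsed = urlparse(url)
    qs = parse_qs(parsed.query)
    expires = int(qs["expires"][0])
    if (now or int(time.time())) > expires:
        return False  # grant expired
    msg = f"GET\n{parsed.path}\n{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, qs["signature"][0])


url = presign("/example-repo/main/data/train.parquet")
assert verify(url)                                   # valid within the TTL
assert not verify(url, now=int(time.time()) + 600)   # rejected after expiry
```

The real object stores use their own signature schemes (e.g. AWS SigV4), but the property is the same: the data path bypasses the lakeFS server entirely.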
Iceberg Support
Apache Iceberg is one of the most popular open table formats and when you add a data version control tool into the mix, you get a winning combination: Iceberg integration with lakeFS. The Iceberg catalog maps lakeFS branches, tags, and version refs to schemas in Iceberg.
We will continue to grow our Iceberg support and have already rolled out an experimental feature allowing you to use a Spark SQL engine to compute “data diffs” between versions of an Iceberg table that are stored in two different lakeFS versions.
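To illustrate the ref-to-schema mapping, here’s one hand-rolled way to compute a simple row-level diff of an Iceberg table across two lakeFS refs with Spark SQL. The `lakefs` catalog name, refs, and `db.events`-style table path are placeholders, and this generic EXCEPT query is not the experimental data-diff feature itself:

```python
# Hypothetical sketch: with the lakeFS Iceberg catalog, a ref becomes part
# of the table identifier, so comparing versions is plain SQL across two
# schemas. All identifiers below are placeholders.

def rows_added(spark, table="db.events", base_ref="main", compare_ref="dev"):
    """Return rows present on `compare_ref` but not on `base_ref`."""
    return spark.sql(
        f"SELECT * FROM lakefs.{compare_ref}.{table} "
        f"EXCEPT SELECT * FROM lakefs.{base_ref}.{table}"
    )
```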
🌟🎉 Bonus: lakeFS 1.0 generally available 🎉🌟
By now, you’ve probably noticed that most of these features are “upgrades” to capabilities already available in the core lakeFS system. If you plan to try them out, we suggest upgrading to lakeFS 1.0 or any later version.
This guarantees that forthcoming lakeFS releases will maintain backwards compatibility for the life of the 1.x line, meaning existing code and functionality will continue to be fully operational even as we release more features.
Partnerships and Integrations
In addition to all the work we did on improving lakeFS’s built-in features, some of our biggest highlights can be traced back to the partnerships and integrations available with lakeFS.
In my opinion, one of the biggest advantages lakeFS brings to the table is the fact that you don’t need to give up on specific tools in order to manage your data versions at scale. Since we use metadata to manage data versions, you can essentially work with most platforms when using lakeFS.
This is why we are continually working on more integrations to support your data versioning. In 2023, we added a number of integrations and partnerships worth mentioning:
- Unity Catalog
- DuckDB
- Apache Airflow
- Prefect
- Dagster
- appliedAI lakeFS-spec
Unity Catalog
Our integration includes full data versioning for Unity tables, collaboration and teamwork, and enhanced data governance. You can explore the full list of benefits available with Unity Catalog, and review our documentation for more information on this integration.
DuckDB
In the engineering world, one might argue that 2023 wasn’t the year of the rabbit; it was the year of the duck. Specifically, DuckDB. The in-process SQL OLAP database management system exploded in 2023, and DuckDB was (and still is) in high demand.
lakeFS added support for DuckDB towards the end of 2022, but in 2023 we made some critical updates allowing you to access data in lakeFS from DuckDB, as well as use DuckDB from within the lakeFS web interface.
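A minimal sketch of pointing DuckDB at lakeFS through its S3-compatible gateway follows; the endpoint, credentials, repository, and object paths are all placeholders for your own configuration:

```python
# Hedged sketch: DuckDB can read lakeFS-versioned objects through the lakeFS
# S3 gateway by configuring the httpfs extension. The branch is simply part
# of the s3:// path, so switching versions is a one-character change.

def count_rows(endpoint, access_key, secret_key):
    import duckdb  # pip install duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    con.execute(f"SET s3_endpoint='{endpoint}';")   # e.g. lakefs.example.com
    con.execute("SET s3_url_style='path';")         # lakeFS uses path-style URLs
    con.execute(f"SET s3_access_key_id='{access_key}';")
    con.execute(f"SET s3_secret_access_key='{secret_key}';")

    # s3://<repository>/<ref>/<path> -- placeholders throughout
    return con.execute(
        "SELECT count(*) FROM read_parquet('s3://example-repo/main/data/*.parquet')"
    ).fetchone()[0]
```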
Apache Airflow
lakeFS has always worked with Apache Airflow, but in 2023 we announced the official release of the lakeFS Airflow provider. This gives you the power to integrate lakeFS functionality directly into your Airflow DAGs.
Since the library is published on PyPI, you can easily install it in your project and import the lakeFS-specific operators into your Airflow DAGs the same way you import any other operator.
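As a rough sketch, a DAG using the provider might look like the following. The connection id, repository, branch naming, and the exact operator import paths and parameters are assumptions to verify against the provider’s README:

```python
# Hedged sketch of a DAG built with the lakeFS Airflow provider
# (pip install airflow-provider-lakefs). Import paths, parameter names,
# connection id, and repo/branch names are assumptions; check the README.

def build_dag():
    from airflow import DAG
    from lakefs_provider.operators.create_branch_operator import (
        LakeFSCreateBranchOperator,
    )
    from lakefs_provider.operators.commit_operator import LakeFSCommitOperator

    with DAG(dag_id="lakefs_etl", schedule=None) as dag:
        create = LakeFSCreateBranchOperator(
            task_id="create_branch",
            lakefs_conn_id="lakefs_default",
            repo="example-repo",
            branch="etl-{{ ds }}",       # isolated branch per run
            source_branch="main",
        )
        commit = LakeFSCommitOperator(
            task_id="commit_results",
            lakefs_conn_id="lakefs_default",
            repo="example-repo",
            branch="etl-{{ ds }}",
            msg="ETL results for {{ ds }}",
        )
        create >> commit  # run the pipeline on its own branch, then commit
    return dag
```

The pattern to notice is the per-run branch: each DAG run works in isolation and only merges validated results back.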
Prefect
Troubleshooting pipelines can be cumbersome, especially if you don’t know where a workflow failed and what data went through it. When managing pipelines across dev, test, and production environments, identifying what failed is key.
Our integration with the Prefect workflow orchestration tool helps ensure that you no longer operate your pipelines with a blindfold on. In this detailed guide, Amit Kesarwani provides step-by-step instructions on how to use the integration in your pipeline management.
Dagster
Another orchestration tool we were excited to integrate with this year is Dagster, which offers several key components to help users build and run their data pipelines.
When using lakeFS with Dagster, you can run production pipelines against isolated environments of your data, ensuring that only quality data is exposed to production. You can get a full overview of the integration in this guide.
appliedAI lakeFS-spec
Rounding out our partnerships and integrations in 2023, let’s close the year with an open source library for using lakeFS from Python. This initiative by Jan Willem Kleinrouweler and Max Mynter of appliedAI is a library implementing an fsspec backend for lakeFS, giving you the flexibility to add data version control to your ML projects.
Find out how to use lakeFS-spec in this step-by-step guide; it makes versioned data access as simple as writing files, whether from your favorite data science libraries or through low-level file operations.
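The following sketch shows the two access styles; the repository, branch, and file names are placeholders, and a running lakeFS server with credentials (e.g. in `~/.lakectl.yaml`, which lakeFS-spec can read) is assumed:

```python
# Hedged sketch of lakeFS-spec (pip install lakefs-spec): once the fsspec
# backend is registered, both raw file handles and pandas URIs work.
# Repo/branch/path names below are placeholders.

def read_training_data():
    import pandas as pd
    from lakefs_spec import LakeFSFileSystem

    fs = LakeFSFileSystem()  # host/credentials resolved from configuration

    # Low-level file access through the filesystem object...
    with fs.open("example-repo/main/data/train.csv", "rt") as f:
        header = f.readline()

    # ...or let pandas resolve the lakefs:// URI through fsspec:
    # "Pandas just works"
    df = pd.read_csv("lakefs://example-repo/main/data/train.csv")
    return header, df
```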
Cloud and Enterprise Updates
Just as with open source, I wanted to hear from our team what they most enjoyed working on in our Enterprise and Cloud solutions.
Without further ado, let’s take a look at some of the top picks:
- Azure lakeFS Cloud
- Role-Based Access Control (RBAC)
- Single Sign-On (SSO)
- Managed Garbage Collection
Azure lakeFS Cloud
In addition to everything you get with lakeFS open source, lakeFS Cloud provides a fully managed lakeFS platform, and a new addition this year is first-class support for Azure’s stack of technologies.
This includes Synapse, Databricks, and integrations that enable you to use Spark on Azure with lakeFS. It lets you adopt best practices with data, including reproducibility, CI/CD for your data pipelines, isolated data environments for development, and reverting changes made to data.
Role-Based Access Control (RBAC)
Ensuring that you have control over the authentication that best fits your security needs led us to decouple authentication from access control.
As of 2023, lakeFS enables you to plug in your own authentication and security mechanism with RBAC. The lakeFS RBAC model is based on five components of the system: users, actions, resources, policies, and groups.
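As a hedged sketch of how those components fit together, here’s what a policy document attached to a group might look like. The action and ARN strings mirror the documented lakeFS policy format, but treat the specific values as assumptions and check the RBAC reference:

```python
# Hedged sketch of a lakeFS RBAC policy: a named list of statements that
# allow or deny actions on resources, later attached to a group of users.
# Policy id, repo name, and exact action/ARN strings are illustrative.
import json

developers_policy = {
    "id": "DevelopersReadWrite",  # hypothetical policy name
    "statement": [
        {
            "effect": "allow",
            "action": ["fs:ReadObject", "fs:WriteObject", "fs:CreateBranch"],
            # Limit the grant to a single repository
            "resource": "arn:lakefs:fs:::repository/example-repo/*",
        },
        {
            "effect": "deny",
            # Nobody in this group may delete the repository itself
            "action": ["fs:DeleteRepository"],
            "resource": "arn:lakefs:fs:::repository/example-repo",
        },
    ],
}

print(json.dumps(developers_policy, indent=2))
```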
Single Sign-On (SSO)
As part of our lakeFS Cloud and Enterprise solutions, we strive to maintain the highest level of security, making it a necessity to add a single sign-on (SSO) mechanism. In the case of Cloud, we use Auth0 for authentication and thus support the same identity providers as Auth0. In lakeFS Enterprise, authentication is handled by a secondary service which runs side by side with lakeFS.
Managed Garbage Collection
Going back to my earlier reflection, here’s an example of why data version control merits a category of its own. Managing vast amounts of data efficiently requires a robust data versioning tool. It also calls for the ability to dispose of data in a manner that is consistent with privacy rights and data retention policies and guidelines.
Simply put, garbage collection can be difficult for big data lakes. To do it at scale, you need the ability to distinguish between committed and uncommitted data.
lakeFS Managed Garbage Collection (available as part of our Cloud offering) automates the process of identifying and removing redundant data versions, optimizing storage utilization, improving data access performance, and enabling data governance.
What content did data practitioners love?
Besides all the product updates and rollouts we focused on in 2023, we also learned a lot from our community and tried to share these lessons via content, tutorials and guides.
When analyzing our traffic, we noticed some articles and searches that spurred interest and discussions.
Here’s our compilation of the most popular content published in 2023:
- Best Vector Databases (by now for 2024!)
- Data Orchestration Tools: Ultimate Review
- Guide: Data Warehouse vs. Data Lake
- Jupyter Notebook & 10 Alternatives: Data Notebook Review
- Data Engineering Patterns: Write-Audit-Publish (WAP)
- lakeFS ❤️ DuckDB: Embedding an OLAP database in the lakeFS UI
- Guide: Acceptance Testing For Data Pipelines
- Backfilling Data: A Foolproof Guide to Managing Historical Data
- 9 Data Version Control Best Practices
During 2023 we were busy working on exciting updates and releases, and if there’s one thing I can say for sure about the team at lakeFS, it’s that this is only a taste of what’s to come! We will continue to strive for perfection (or at least to preserve engineering best practices in everything we do). Stay tuned for what’s ahead in 2024!