Adi Polak
January 4, 2023

2022 has been an incredible year, in the same way that roller coasters are thrilling. Our industry has seen many shifts and rapid changes, which we experienced together. In 2022, we witnessed data engineering teams become increasingly central to any data-driven organization, along with the growing significance of roles such as Analytics Engineers, ML Engineers and DataOps, which together build the next generation of modern data platforms. With these evolving roles and opportunities, many of us invest more time in learning and connecting with our peers. We attend conferences, meetups, and events to get together, network, and learn all the things, all while creating world-wide industry standards and pushing the boundaries of what data can do for us 🎉.

So, let’s take a moment to reflect on the incredible successes and contributions we’ve made together with you, the lakeFS community, over the past year as we close out 2022. Here are a few highlights and what we are excited about in 2023. 

The lakeFS open source project is experiencing constant growth. By using a data version control system, any data practitioner can apply engineering best practices in their team.

The lakeFS project had a great 2022

This year lakeFS took a giant leap forward. It’s more secure, scalable and usable than ever before. 

We’ve had over 1,000 PRs merged into the product – bug fixes, new features, scalability and security enhancements and documentation improvements. These PRs culminated in 37 releases of lakeFS in 2022. That’s an average of one release every 10 days!

But as always, this data is more interesting when put into context – let’s explore some of the notable things that changed for lakeFS:

lakeFS is much easier to get started with.

We worked hard on improving the initial installation. You can spin up a new lakeFS instance by simply running `docker run` – no dependencies, no database to manage and maintain, and no tricky network setup required.

Additionally, running a production-grade lakeFS installation on AWS is a `helm install` away – lakeFS will take care of provisioning its own DynamoDB table and other associated resources.

Perhaps more importantly – it’s much easier to get your existing data into that lakeFS installation. lakeFS now allows you to import existing data without copying it, all with the click of a button (or a CLI command, whichever you prefer 😀).

lakeFS is more flexible than ever

While lakeFS was always agnostic to which types of data are managed, this year we went a step further and gave users more ways to control how their data is versioned. We introduced merge strategies that let users decide what happens when conflicting changes occur. We also introduced Lua-based hooks that let users customize what happens on commit, merge and branch actions: validation, downstream notifications and triggering external systems are all possible without having to manage a webhook server.
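To make the idea of a merge strategy concrete, here is a minimal Python sketch of a three-way merge over path-to-checksum mappings. It is not lakeFS code and does not reflect how lakeFS implements merges; the function `merge`, the `MergeConflict` exception and the strategy labels `"source-wins"` and `"dest-wins"` are illustrative assumptions chosen for this example.

```python
from typing import Dict, Optional

# Hypothetical model: a branch is a mapping of object path -> content checksum.
Tree = Dict[str, str]


class MergeConflict(Exception):
    """Raised when both sides changed a path and no strategy was given."""


def merge(base: Tree, source: Tree, dest: Tree,
          strategy: Optional[str] = None) -> Tree:
    """Three-way merge of `source` into `dest`, relative to common ancestor `base`.

    `strategy` is an illustrative label: "source-wins", "dest-wins", or None
    (fail on conflict). These names are assumptions made for this sketch.
    """
    if strategy not in (None, "source-wins", "dest-wins"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    result = dict(dest)
    conflicts = []
    for path in set(base) | set(source) | set(dest):
        b, s, d = base.get(path), source.get(path), dest.get(path)
        if s == d or s == b:
            # Both sides agree, or only dest changed: keep dest as-is.
            continue
        if d != b:
            # Both sides changed the path in different ways: a conflict.
            if strategy is None:
                conflicts.append(path)
                continue
            if strategy == "dest-wins":
                continue
        # Only source changed, or "source-wins" resolved the conflict.
        if s is None:
            result.pop(path, None)  # source deleted the object
        else:
            result[path] = s
    if conflicts:
        raise MergeConflict(sorted(conflicts))
    return result


base = {"users.parquet": "v1", "events.parquet": "v1"}
source = {"users.parquet": "v2", "events.parquet": "v1", "new.parquet": "v1"}
dest = {"users.parquet": "v3", "events.parquet": "v5"}

merged_source_wins = merge(base, source, dest, strategy="source-wins")
merged_dest_wins = merge(base, source, dest, strategy="dest-wins")
```

In this sketch only `users.parquet` conflicts, because both branches changed it differently. Without a strategy the merge fails and reports that path; with a strategy, the chosen side's version is kept while non-conflicting changes from both sides are preserved.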

Security, scalability, usability

We’ve implemented over a dozen optimizations to reduce latency and increase efficiency of all core components: the Hadoop Filesystem used by Spark applications, lakeFS API server and the UI have all been vastly improved. 

Additionally, we’ve put quite a bit of focus on making the lakeFS UI intuitive and friendly, allowing users to explore and manage the largest of datasets with ease.

We’ve even embedded DuckDB – a full-featured, high-performance OLAP database – right in the lakeFS UI to allow users to explore tabular data objects from the comfort of their web browser.

We felt your love and support through various metrics, including installations and community.

Monthly active repositories growth

High (exponential!) increase in our monthly active repositories, which indicates the high adoption of our OSS product.

Growth in new installations:

Both open source installations and our lakeFS Playground installations are growing exponentially.

What we are grateful for this year:

3,100 stars on GitHub! Help us reach 4K 😊

Growing Slack Community – Crossing 1k Slack users!

More than 40 Events: we participated and spoke at more than 40 events this year! One of the most exciting presentations was by our community member and open-source contributor – Holden Karau. In her presentation she demonstrated using lakeFS at Netflix to safely upgrade her Apache Spark version by creating an isolated dev / test environment for the migration.

Our O’Reilly course exploded: we collaborated with O’Reilly on a technical course and had more than 1,600 people register and learn about CI/CD for data! What a joy developing best practices and designing industry standards together.

Community content: our community has created 25 articles on using lakeFS! One of the most interesting is From Data Warehouse to Data Lake to Data Lakehouse by Alex Punnen. Another is an extremely useful tutorial, Fun with lakeFS and Terraform by Bjorn Olsen, who also contributed a GitHub repo with Terraform code to install lakeFS on AWS. There is also a very cool success story from Enigma, whose engineers shared how moving to a data branching solution increased their velocity by reducing testing time by 80%. The full story is in their blog post.

Courtesy of Alex Punnen – his system architecture on AWS

Our community members won innovator awards! We have many thought leaders and innovators in our community. Together, we push the boundaries of data systems to radically improve the lives of data practitioners. Read about Leonard Aukea – who drives the ML platform at Volvo – and his nomination for Machine Learning professional of the year!

Our first in-person lakeFS meetup: As many of us want to meet in person, we launched our first in-person lakeFS meetup of the year! We had fun and benefited from great presentations and hallway conversations. We have started organizing a few more in-person meetups in different areas! (Ping @Adi Polak in Slack if you want to organize one in your city!)

in-person lakeFS meetup in San Francisco

We had Axolotl fun! After meetups and events, many of us enjoy continuing the fun at a nearby brewery or an old-fashioned pub, where we often carry on our discussion of data best practices in a casual setting. If you want to get involved, join our Slack and meetups for updates.

lakeFS casual get-together in SF after a successful Data & AI conference

lakeFS Cloud – We launched our first paid offering, lakeFS Cloud! We all got excited, and the launch announcement received lots of community love.

Our team at the Data & AI Summit booth in SF

Blog exploded! Thanks to the insights and thought-provoking conversations in our community, we have seen an explosion of excitement around our blog, led by the 2022 state of data engineering (more than 50K readers). Other successful blog posts include 5 Painful mistakes data engineers make, and how to avoid them and 7 Winning Habits of Effective Data Engineers. We also published our definitive guide to data products, which gained a lot of readers, quotes and shares.

Forbes featured our articles 🤯

Investing in the developer and data practitioner community is something we’re passionate about. In her article, Einat Orr, CEO and co-founder of Treeverse, explained why developer experience deserves a core role in the management team of any company that builds products for developers.

Other featured articles cover how to build a data strategy that delivers value fast amid an ever-growing complexity of tools and frameworks, and how better engineering processes can unlock the full business value of data.

Looking into the future:

Our plans for the lakeFS project:

Some highlights of the very packed roadmap for 2023:

  1. Increased support for open table formats: Apache Iceberg, Apache Hudi and Delta Lake. Many of you asked for this and believe it is the future of data systems, so table formats will become first-class citizens in lakeFS.
  2. Azure as a first-class citizen: On top of Azure Blob, Databricks and Synapse, which are already supported, we’re adding ADLS, CosmosDB, and a full integration with Azure Active Directory.
  3. Continue making sure lakeFS brings value to your data stack.

Slack:

We’ll experiment with a variety of channels tailored to your needs:

  1. Job seekers – We understand that the market is what it is, so we decided to take an active approach and create dedicated channels to offer advice, act as a sounding board and open doors for job-seeking community members. We also collaborate with recruiters and hiring managers. Our first networking event for 2023 is on Jan 24, in SF. Join us.
  2. Data-discussion – bring industry news and highlights on a daily basis. It can be difficult to keep up with industry trends and updates, so we bring you the tl;dr of everything new and foster authentic, safe conversation about platforms and tools.

Meetups & other events:

In 2023, we will continue to get together and meet you where you are, seeking every opportunity to engage in person and virtually!

Content:

We will continue creating more content on data best practices, practical tools, tutorials and guides for data practitioners and lakeFS users. And with our team growing and Robin Moffatt joining us in 2023, it will become even greater!

Kris Jenkins tweet, wishing Robin good luck on his new journey

With your support, we are cultivating and growing the world’s largest data source control community. Thanks for being a part of this incredible journey, and here’s to 2023! And if you are not there yet – join our Slack community today.
