The lakeFS open source project saw tremendous growth in 2021, our first full year of existence. We started the year with high expectations and many questions: Will we see engagement in our community? What about the adoption rate? Will our team of core developers grow? What will our community look like?
Now that the calendar has turned to 2022, it’s a good time to review last year’s journey. We’re proud of the great work by our team and by our enthusiastic community, who help us maximize the potential of lakeFS.
We measure our progress using many metrics. In this article we’re happy to share how 2021 looked for lakeFS in terms of the following: GitHub stars, Audience & Slack members, Number of Monthly Installations, and Active Repositories.
#1: GitHub Stars
Every repository on GitHub can be starred. You can think of it as a like button for Facebook pages. Some people star a project because they love the idea, some want to bookmark it for future updates, and others are inspired and want to show their support. What they all have in common is their appreciation for the project, which is why GitHub stars are an important reach metric for us.
We began the year with a little over 500 stars and by the end of the year had passed the 2k mark. We thank everyone who supported our project with a star!
#2: Audience & Slack Members
We’re fostering a community around lakeFS. A growing community is a strong indicator of a project’s popularity and its demand among users.
Let’s look at two metrics that measure community:
The Audience metric follows the growth of our audience: the number of members of our community across our social media channels (Twitter, LinkedIn, YouTube). These channels provide a platform where we share thoughts and discussions about technology as well as highlights from our journey. We appreciate everyone who has been a part of it.
Over the course of 2021, the Audience metric quadrupled.
We believe in accessible, fast, helpful communication within the community, and our Slack workspace makes that possible. It is a safe place to learn and engage, get assistance from one another, catch up on the latest updates, and learn about the fast-changing world of scalable data.
Our Slack community is mainly a community of users discussing lakeFS use cases and implementation, features, bugs, and integrations such as Airflow, dbt, Spark, Hive metastore, Databricks Delta Lake and more. It is less a community of contributors, which is to be expected: lakeFS is a distributed system developed in Go, while most of our users are data engineers who are highly skilled in Scala, Java and Python. However, we do see contributions from the community around integrations with other data tools and data sources.
Our Slack community grew at a fast, linear pace throughout the year. Measuring the growth of this metric is about more than just measuring popularity. Thanks to the Slack community, we are able to connect with users and get a deeper sense of their needs, their challenges, and the value lakeFS provides them.
#3: New Monthly lakeFS Installations
Number of installations is an adoption metric that follows the trend of new installations per month. More installations of lakeFS means more users benefiting from the value that lakeFS has to offer. The significant growth of this metric validates the need for a solution like lakeFS: it solves a real pain.
The wide usage of lakeFS allows us to quickly improve its quality, scalability, and usability as it matures with every new installation. The more users explore lakeFS and share their questions and thoughts in the community, the more accurately we understand the use cases for lakeFS. This gives us the information we need to evolve lakeFS so that it best serves those use cases. Since lakeFS provides a set of operations for data lifecycle management, it is quickly adopted not only for the initial pain it solves, but also for the best practices users now understand they can adopt.
While the number of new installations per month is clearly growing exponentially, the growth rate varied throughout the year. Growth slowed during Q3, while the other three quarters show more evident exponential growth, with the metric reaching the 400 mark by the end of the year.
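As a rough illustration of what "exponential growth" means for a monthly metric like this one, the sketch below fits a straight line to the logarithm of a series and reports the implied month-over-month growth rate. The function name `monthly_growth_rate` and the sample numbers are hypothetical and are not lakeFS data.

```python
import math


def monthly_growth_rate(counts):
    """Estimate the average month-over-month growth rate of a series.

    Fits log(count) = a + b * month by ordinary least squares and
    returns exp(b) - 1, e.g. 0.25 for 25% growth per month.
    """
    if len(counts) < 2 or any(c <= 0 for c in counts):
        raise ValueError("need at least two positive counts")

    n = len(counts)
    xs = range(n)                       # month index: 0, 1, 2, ...
    ys = [math.log(c) for c in counts]  # exponential growth is linear in log space

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    return math.exp(slope) - 1


# Made-up illustrative series (NOT lakeFS data): doubles every month.
example = [10, 20, 40, 80, 160]
rate = monthly_growth_rate(example)
print(f"estimated growth per month: {rate:.0%}")
```

A series that grows exponentially yields a roughly constant rate when the fit is repeated over sub-periods, so a slowdown like the one described for Q3 would appear as a lower fitted rate for those months.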
#4: Repositories Managed by lakeFS
Repositories is a metric that captures a more advanced phase of lakeFS adoption. After users install lakeFS, we look for them to remain active and engaged. One of the ways users do so is by creating repositories within lakeFS to better manage their data.
The graph below shows the number of weekly active repositories. We expect it to grow exponentially, in correlation with the number of installations.
The graph clearly shows exponential growth, with an unexpected bump for three weeks in August. The bump is the result of an experiment by one of our design partners, who tried managing their data in an extremely high number of independent repositories. As you can see, the experiment did not yield good results, and they returned to their original usage of a handful of repositories.
2021 was full of great challenges that helped shape lakeFS into what it is today. We could not have gotten here without our incredible community.
Our community has grown throughout the past year and brought in many partners with innovative perspectives to walk this journey with us. The brilliant thoughts and requests they share with us make a great difference. Beyond being our main motivator, our community brings us closer to establishing lakeFS as a basic part of every scalable data architecture, providing data lifecycle management.
Grateful for 2021 and for our community, we expect 2022 to be even greater. We look forward to launching our SaaS this year, learning what changes it will bring, and overcoming the challenges ahead.
We wish you all a great 2022!
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.