About two years ago, we left SimilarWeb, where we had the privilege of managing some of the most interesting and complex data architecture projects in the world.
The architecture was centered around a Data Lake – 7 petabytes of data on Amazon S3. On top of it ran hundreds of hourly, daily, and monthly jobs forming complex DAGs of dependencies, alongside data streaming, ad-hoc exploration, and experimentation.
While it was a great architecture, the pain of managing a data lake is one we felt deeply.
From that frustration, the sentiment of “I wish I had Git over the object storage” was born.
We then went on to ask dozens of data teams in organizations spanning many business verticals, from SMBs to very large enterprises: they all sympathised with this sentiment. The feedback ranged from “If it was available, we would use it” to “we kind of built it ourselves, but we’d really rather use something someone else maintains”.
With a prototype and some external validation of the concept, we raised capital and got to work.
Building and Releasing lakeFS
We made our initial release on August 1st, 2020 – exactly one year ago, in the midst of a new global pandemic, with a small, experienced team we had put together.
As veterans in practising the principles of agile development, we understood the tradeoff between releasing a project that brings value and shows promise of high quality and professionalism, and getting feedback from the community as soon as possible. The power of the community is something you want to harness as early as possible.
Early on, we decided that coding conventions and test coverage were things we would insist on from day one. One casualty of this decision was building lakeFS to the level of scalability we would have been satisfied with. We figured that high quality was important to inspire trust, whereas scaling issues could be solved once we had further proven our value.
Another aspect we insisted on for our initial release was usability. We chose a friendly interaction model, familiar to any GitHub user. The aim was to make the value proposition of lakeFS as intuitive as possible.
Did we nail it as far as the tradeoff goes?
Hard to say.
The first users who appreciated the value pointed to our use of PostgreSQL as a potential scalability bottleneck and an operational burden.
Our initial data model did not scale well either, and actual performance issues were reported. It looked a lot like MVCC, with branching added on top. This shifted a lot of the logic into the database, lowering our development velocity – and the tooling for debugging and troubleshooting complex PostgreSQL queries was not great.
The way we managed this gap was to expose our community to the scalable architecture we planned to implement and gather their feedback. This built trust with our first users and allowed us to use their experience to refine the architecture. We created a timeline for delivering the scalable architecture and progressed transparently toward it until all major issues were resolved. Some items of this plan are still part of our roadmap.
Rethinking the data model to move most of the logic out of the database was a bigger challenge, though, as it wasn’t tied to a specific user request – it was an impediment to our own development velocity.
With the initial model, we spent a lot of time crafting and honing a set of very complex SQL queries, sometimes having to work around the database’s query planner to get predictable performance. Small changes would then take a very long time to implement and validate – so we set out to change the data model completely.
This is not an easy decision to make – basically throwing away working software that took a lot of effort to bring to high quality. We did it anyway. It took weeks, but resulted in a much more coherent system: one that is easier to onboard new developers onto, easier to debug when issues arise, and most importantly, a solid base to build features on, to extend, and to modify.
With the benefit of hindsight, we feel we chose the right approach. We are now able to provide the community with high-quality, stable software, while addressing bugs and releasing new features faster than before.
Building a community
We knew that the real feedback would come from early adopters currently experiencing the pain that lakeFS solves. While we had a group of data professionals interested in lakeFS from ideation, we wanted to reach as many potential users as possible and build a community around the project.
The first step was choosing a cute logo. Don’t underestimate the importance of this part.
Next, we made sure our project’s documentation was excellent. As users of open source ourselves, we know that the first friction in adopting an open source project is the documentation describing its value, installation, and use. We took the best examples we appreciated and made sure to follow that standard. We are proud of the positive feedback our documentation has received, starting from the first release.
In addition, we released a weekly blog post featuring quality content on engineering best practices and our point of view on trends within the data ecosystem. The entire team took part, contributing from their vast engineering experience.
All of these steps, together with lakeFS being a great project with real value, allowed us to build the community we now have, which keeps growing with the wider adoption of lakeFS.
What’s ahead for lakeFS
Our vision is to allow git-like operations over data, no matter where it is stored. The data pipelines most organizations manage span several data stores, including object storage, an analytics database, and perhaps a key-value store for OLAP operations. We see lakeFS providing consistency and transactionality by treating data spanning several stores as one synchronized repository that can be branched, committed to, and reverted when needed.
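To make those git-like operations concrete, here is a rough sketch of what a branch-commit-merge workflow looks like with the lakectl CLI over an object-store-backed repository. The repository name, branch name, and paths are illustrative, and exact flags may differ between lakeFS versions; this assumes a running lakeFS installation.

```shell
# Create an isolated branch to experiment on – metadata-only,
# so no data is copied in the underlying object store.
lakectl branch create lakefs://my-repo/experiment --source lakefs://my-repo/main

# Write a new object to the branch, then commit it atomically.
lakectl fs upload lakefs://my-repo/experiment/events/2021-08-01.parquet --source ./events.parquet
lakectl commit lakefs://my-repo/experiment -m "add August events partition"

# If validation passes, merge the branch back into main;
# otherwise, discard it and the bad data never reaches consumers.
lakectl merge lakefs://my-repo/experiment lakefs://my-repo/main
```

Consumers reading from `lakefs://my-repo/main` see either all of the committed changes or none of them, which is the transactionality described above.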
We have our work cut out for us. Here is what you’ll see before lakeFS turns two:

- A full Metastore integration to version data and metadata together: roll back schema changes, easily view tables as they existed in the past, and support user flows that are completely SQL-based – including diffing and merging of open data formats.
- lakeFS becoming a first-class citizen in the data landscape by deeply integrating with Spark, Trino, and other common data analysis and machine learning frameworks. We recently released a first iteration of our Hadoop Filesystem with support for Spark users, as well as native integrations with Airflow, and SDKs for Java, Scala, and Python. This is just the beginning: we plan to improve these massively and to support additional frameworks and tools in the coming months.
- Further architecture improvements to make managing lakeFS installations at scale easier.
- Eventually, moving toward our vision, support for data not only in the lake, but also in the warehouse and even in the database.
We believe versioning is a core functionality that allows introducing engineering best practices to data, wherever it may be stored.
We would like to take this opportunity to thank our amazing team for this journey together. lakeFS is not only about a great way to collaborate on data; it’s also about a great way to collaborate with people, in an amazing culture of teamwork and accountability. We feel the same spirit within our community of users and contributors, who continuously help us make lakeFS the best way to manage data workflows. Thank you!