Who needs a role model?
When we first launched lakeFS in August of 2020, we asked ourselves two simple questions: What does success look like? And how will we know we’re doing the right things to get there?
Of course with thousands of installations, a thriving user community, active developer contributions, and exponential growth on all metrics, we could claim success quite easily. But how could we evaluate our progress while in the middle of the journey?
One thing we did to gain this perspective was take a look at other companies, and learn from what we know about their journey. For the learnings to be valuable, we had to consider the similarities and differences between us and those companies. We hence characterized both lakeFS the project and Treeverse the company, and came up with the following four characteristics:
- We operate over big data – Though we work with data stacks of all shapes and sizes, lakeFS is designed to handle problems that become acute when analyzing large-scale datasets, non-trivial data pipelines and/or organizational complexity of the data operation.
- We are a VC backed startup that released an open source project – Other companies may have started as internal projects inside FAANG-like companies, academic labs or OSS foundations and only later become commercial companies, or may have started out as consultancies.
- We fall into the sparsely-populated version control for data category – Our chief concern is growing the category and educating the market, not fighting for space in a crowded field.
- We sit at the infrastructure level – Adoption follows a different trajectory for infrastructure that needs to integrate with a rich ecosystem of applications.
With these in mind, let’s see if we can find ourselves a role model. Note: I will ignore database companies. I may write a post one day why 🙂
Open Source First, Company Second
Many companies base an open core model on an already successful open source project. In most cases, the founding team of the company played a significant role in the creation of the open source tech.
First Generation: Managed Hadoop
In the early 2010s consulting businesses made a fortune helping organizations implement and maintain complicated Hadoop installations. The promise of monetizing Hadoop brought Cloudera (founded in 2008), MapR (2009) and Hortonworks (2011) hundreds of millions in funding.
These companies each maintained their own distribution of Hadoop, improving usability and reducing devops friction, all open source. The business model focused on services and enterprise features related primarily to manageability and security. This was back in the day when most large enterprises were not yet using public clouds and public clouds were not yet offering managed Hadoop services.
Despite the lofty initial funding, this generation of companies didn’t see great success. The only one left standing is Cloudera.
A lot has been said about why. My take is that between 2015 and 2019 the big data market underwent two monumental changes: 1) a shift to Spark and 2) a shift to cloud-based solutions. Fatally, the Hadoop-based players didn’t react to either change in time.
Second Generation: Replacing Hadoop
In the last few years, open source projects that replace parts of the Hadoop ecosystem have come into play. Projects such as Spark, Airflow, Presto, and Kafka were created in academic labs or large enterprises in response to the shortcomings of Hadoop, and, in Hadoop’s spirit, their creators contributed this internal infrastructure to open source.
After 2-4 years spent growing their community and user bases, these projects launched commercial companies. Compared to the first generation, their business models differ.
Confluent, the company behind Kafka, and Astronomer (Airflow) started with enterprise adoption through premium features and support. Databricks (Spark) and later Starburst Data (Trino, formerly PrestoSQL), by contrast, started with a SaaS offering over the open source, using the major cloud providers. Confluent has since followed suit with its own SaaS platform in early 2020, after establishing itself on large enterprises’ on-prem installations.
The second generation companies are doing very (or very very 🙂 ) well, and are competing successfully with similar offerings from public cloud providers.
Consulting Company First, Open Source Second
A software company that collects BI data to steer its business may find itself managing big data. Big data architectures are challenging, and since there are many potential architectures and tools, consulting firms thrive.
Like any operation, consulting firms create their own software tools in an attempt to reduce the cost per customer. In some cases those tools become the center of the company, and the company turns from a consulting company into a software company. Some examples from the data domain are Vision BI (Rivery), Superconductive (Great Expectations), and Fishtown Analytics (dbt).
In some cases, the tool developed by the consulting company is open sourced and the business model of the software company is open core. Two main examples in the data domain are Fishtown Analytics with its project dbt, and Superconductive with its project Great Expectations.
In a sense, the journey to product-market fit happens while the business model is still consulting on a much wider engineering challenge. Although the company already exists, it doesn’t shift to an open core model, or to VC funding, until the open source project has proven product-market fit by securing a large community and a large install base.
In the former category, there was no company to finance before the project succeeded; here, the funding comes from a different business model. There is also some advantage in the well-rounded familiarity the company gains with many of its open source users through the consulting services it provides to them.
Company First, Open Source Second
The success of open core companies in the last decade – combined with an increase in available funds – has created a new breed of companies that first raise capital, and then start developing an open source project with an open core model in mind.
I believe the reason we see more and more of those is twofold. First, there is a proven case for open core provided by companies founded 7-10 years ago. Second, there is a lot of money in the market looking to back good initiatives in the long term, and willing to finance the adoption phase of the open source project as well as its commercial journey.
An open core business model is now a penetration strategy: a form of freemium that is meant to provide product-led growth.
Here we should probably differentiate between the red ocean swimmers, going into a domain that already exists, and the blue ocean swimmers, innovating by defining new categories.
Red Ocean Swimmers
In this category we find open source projects offering a remedy for a pain already addressed by either commercial or other open source projects.
Let’s look at orchestration tools. We have already mentioned Astronomer, who rallied the large community of Airflow users around the launch of Airflow 2.0. This made Airflow the open source gorilla of this space, which other players such as Prefect and Dagster need to unseat.
Another, slightly different group would be the discovery and catalog domain. Here, alongside commercial enterprise software solutions such as Alation and Collibra, over 10 open source projects have been released in the last 18 months, including Amundsen, Nemo, Metacat, Apache Atlas, and DataHub.
Most were developed by large corporations to manage access to their various data sources, and were later contributed to the community as open source. Three of the projects also have a commercial company founded to monetize them: Stemma, built over Lyft’s Amundsen; Acryl Data, over LinkedIn’s DataHub; and Datakin, based on WeWork’s Marquez and OpenLineage.
In these cases the creators of the projects are behind the companies. However the projects – while mature as software projects since they served within the companies they were built in – are not yet mature open source projects in terms of adoption and community.
Even if the company didn’t strictly come first, open source adoption wasn’t there yet when the commercial entity was established. So these companies still face the challenge of creating a successful open source community while being VC backed.
Other examples worth mentioning are MinIO in the object storage domain, Airbyte in the SaaS ingest domain and Preset in data visualization. In Preset’s case, it may also be an ideology, as its founder, Maxime Beauchemin, is the creator of Airflow.
Blue Ocean Swimmers
We (lakeFS) are a blue ocean swimmer, going into a domain that is not yet established. We need to educate the market on the pain we solve, and on the value of solving it.
Among blue ocean swimmers are dbt Labs (formerly Fishtown Analytics) and Superconductive (mentioned above). Both were the first to identify an unanswered pain and address it. What followed is somewhat different at this point in time. The observability domain Superconductive is in has quickly gained traction, with over 10 companies offering closed source solutions to the same pain, in a sense validating the need and the market.
dbt Labs remains the only offering in its domain, and a very successful one. Its own success is the validation of the relevance of its offering. Needless to say, Hadoop was a blue ocean solution when it was launched: a novel approach to data that aimed to solve big data analytics for good.
Blue ocean swimmers carry the burden of market education, but also have the opportunity to lead the market they create, by being first, and by shaping the messaging of the domain and its identity with their brand.
An Infrastructure Open Source Project
It’s one thing for a person to download an open source project to use locally, or to choose a library to link to her project. It’s quite another to adopt infrastructure that influences many users and interacts with other infrastructure components of the architecture. The adoption curve of infrastructure differs from that of other tools because of its different risk/reward pattern.
In the data domain there is a large share of infrastructure projects, starting with Hadoop and Spark, and continuing to Airflow and Kafka. However, dbt and Great Expectations are not infrastructure, but tools that provide value to a single user locally, at least as an initial penetration.
When considering infrastructure, HashiCorp is the ultimate role model. It is responsible for several extremely successful open source projects, Vault, Terraform and Consul to name a few, with different levels of success in monetizing them. HashiCorp also shares lessons learned regarding its adoption and monetization efforts. The biggest lesson in my opinion is that even after you’ve done it a few times, it’s a challenge to build the open core correctly.
We Can't Avoid the Git Analogy
Git is probably the most successful open source project after Linux. Both GitHub and GitLab managed to successfully monetize Git, with differing approaches:
GitHub initially gained traction by allowing developers to easily collaborate using its Pull Requests feature. This proved valuable for both open source projects and enterprise applications.
By positioning itself in the center of the development lifecycle, it was later able to grow by allowing developers to define a fully managed lifecycle, from coding to production, with very little to no devops efforts.
GitLab’s approach was a bit different – provide an open source platform, capable of basically everything available to GitHub users, where users are free to deploy and modify it to their liking. They monetize by offering tiered “enterprise” levels, each one exposing better management, security and compliance levels.
Both are interesting examples to learn from, as we believe lakeFS will have the same goal: Reduce dependency on DataOps/MLOps when creating lifecycle management for data.
We have something to learn from each and every one of the companies we mentioned above. There is something we can benchmark against with all of them.
We all make our own path in the world, and when starting an innovative open source project, we are all unique in some way. Our need to constantly look around and place ourselves within the ecosystem comes from the desire to gather as much information as we can to help us navigate our own journey. In many cases, understanding why a data point is not relevant to our status or nature can help us as much as an extremely relevant data point.
Most importantly, while constantly looking for data, we never give up on our convictions and gut feeling as a compass to guide us, since each journey is different.
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.