Let’s start with the obvious: the lakeFS project doesn’t exist in isolation. It belongs to a larger ecosystem of data engineering tools and technologies adjacent and complementary to the problems we are solving. What better way to visualize our place in this ecosystem, I thought, than by creating a cross-sectional LUMAscape to depict it?
What’s more, I believe it is critical to understand where lakeFS resides to identify opportunities where we can bring additional value to users by addressing pain points in today’s practices.
With that said, I’m excited to share what we created (and continuously maintain) internally with the larger community! At the end, I conclude with a few thoughts and predictions about the space in general.
Without further ado…
The section of the data ecosystem lakeFS inhabits can be described as open, flexible analytic platforms capable of supporting the core functions of modern data teams:
- Data-Intensive applications & APIs
- AI & Machine Learning projects
- Warehouse-based BI & reporting
What we see in the LUMAscape above are the major components typical of these platforms. We’ll cover each in more detail, starting from the bottom and working upwards.
The first step is getting data into the system and there are three main strategies for doing so.
1. Batch Ingest
The first strategy is basic uploading or dumping of data files in batch style. This is easy enough to implement yourself using core functions available in most programming languages or common data-transformation libraries like Pandas and Spark.
For example, it can be as simple as:
with open('file.csv', 'rb') as f:
    data = f.read()  # illustrative completion: raw bytes, ready to write to the lake
2. Streaming Ingest
The second approach, streaming data, requires more advanced technologies that combine high-throughput messaging systems with compute capabilities within their consumers. Most prevalent is the open-source project Kafka and its many managed offerings. Competing with Kafka are public cloud services such as AWS Kinesis and Google Pub/Sub. Finally, other open source options such as Flink and Spark Streaming are also available.
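To make the idea of compute living inside the consumer a little more concrete, here is a minimal sketch of micro-batching an event stream before writing it out. It uses only the Python standard library; the `micro_batches` helper and the click events are hypothetical stand-ins for messages read from a real topic, not the API of Kafka or any other system.

```python
from typing import Dict, Iterable, Iterator, List


def micro_batches(events: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Group a (possibly unbounded) event stream into fixed-size batches."""
    buffer: List[Dict] = []
    for event in events:
        buffer.append(event)
        if len(buffer) >= batch_size:
            yield buffer
            buffer = []
    if buffer:  # flush whatever is left when the stream ends
        yield buffer


# Hypothetical click events standing in for messages consumed from a topic.
clicks = ({"user_id": i % 3, "page": f"/p/{i}"} for i in range(7))
batches = list(micro_batches(clicks, batch_size=3))
```

In a real consumer, each batch would be written to the lake as one file, which keeps file counts manageable compared to writing one object per event.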
3. Managed SaaS Ingest
It is common to want to ingest data from operational systems like a Salesforce CRM, HubSpot account, or Zuora subscription containing financial information, as well as other internal databases. Rather than implement the fetching of data from these tools yourself, it is increasingly common to employ a managed approach by using any of the five tools in this section (Segment, Stitch, Fivetran, Snowplow, and Matillion) for their pre-built data connectors.
No reason to reinvent the wheel, unless your data volumes grow large enough that it becomes prohibitively expensive.
The Data Lake
The exact definition of a data lake can change depending on who you ask. It is easier to know one when you see it. And there are two main architectures you’ll encounter:
- The first is characterized by a separate object store and analytics engine.
- The second employs technology that combines both functions in one system.
Let’s look at examples of each.
In the first type of lake architecture, object stores hold any type of data in a cost-effective way with a rich ecosystem of applications that can consume data directly from it.
Analytics engines provide an SQL interface to tabular, relational datasets. Some engines like Snowflake, Druid, Firebolt, Redshift (and other tools commonly referred to as “data warehouses”) integrate proprietary storage services with the analytics engine, creating self-contained data lake functionality.
Where there’s data, there is metadata. Metadata is used to define schema, data types, data versions, relations to other datasets and so on. It is useful for improving discoverability, manageability, and enforcement of good practices.
The following sections of the LUMAscape all leverage metadata to achieve these aims.
Open Table Formats
One of the empowering characteristics of data lakes in an object store is that we choose the format data is stored in. This is also one of the most influential decisions, as it impacts the performance and functionality of the lake directly.
Open table formats like Hudi, Iceberg, and Delta are designed to meet mutability requirements (think GDPR) and maximize performance of even the largest tables. They achieve this through managing metadata files over the dataset, allowing for fast access and mutations during read or write operations.
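As a rough illustration of what "managing metadata files over the dataset" can mean, the toy model below tracks a table as a list of snapshots, each naming the data files that make up the table at that point. The `ToyTable` class and file names are invented for this sketch; the actual on-disk layouts of Hudi, Iceberg, and Delta differ and are considerably richer.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class ToyTable:
    """Toy table: the metadata is a list of snapshots, each a set of data files."""

    snapshots: List[FrozenSet[str]] = field(default_factory=lambda: [frozenset()])

    def current_files(self) -> FrozenSet[str]:
        return self.snapshots[-1]

    def append(self, new_file: str) -> None:
        # Adding data only records a new snapshot; existing files are untouched.
        self.snapshots.append(self.current_files() | {new_file})

    def replace(self, old_file: str, new_file: str) -> None:
        # An update or delete (e.g. a GDPR erasure) rewrites one file and swaps
        # it in the metadata, instead of rewriting the whole table.
        self.snapshots.append((self.current_files() - {old_file}) | {new_file})


table = ToyTable()
table.append("part-0001.parquet")
table.append("part-0002.parquet")
table.replace("part-0001.parquet", "part-0001-rewritten.parquet")
```

Readers only need the latest snapshot to know which files to scan, and older snapshots remain available for looking at a previous state of the table.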
Metastores play the important role of abstracting files in object storage into the familiar construct of a query-able table.
One relic of the Hadoop ecosystem that is most likely to survive is Hive Metastore, a virtualization layer that provides tabular access to the content of the object storage. It also plays a role in managing schema, aiding in the discovery of data lake content, and improving read performance through partition management.
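The partition-management point can be sketched in a few lines. The dictionary below is a hypothetical stand-in for a metastore entry (it is not the Hive Metastore API, and the bucket name is made up); it maps a table name to a schema and to the object-store prefix of each partition, so a query over a date range only has to scan the matching prefixes.

```python
from typing import Dict, List

# Hypothetical metastore entry: table name -> schema, plus partition -> path.
METASTORE: Dict[str, dict] = {
    "events": {
        "schema": {"user_id": "bigint", "page": "string"},
        "partitions": {
            "dt=2021-01-01": "s3://example-bucket/events/dt=2021-01-01/",
            "dt=2021-01-02": "s3://example-bucket/events/dt=2021-01-02/",
            "dt=2021-01-03": "s3://example-bucket/events/dt=2021-01-03/",
        },
    }
}


def paths_to_scan(table: str, dt_from: str, dt_to: str) -> List[str]:
    """Return only the object-store prefixes whose partition date is in range."""
    partitions = METASTORE[table]["partitions"]
    return [
        path
        for key, path in sorted(partitions.items())
        if dt_from <= key.split("=", 1)[1] <= dt_to
    ]


selected = paths_to_scan("events", "2021-01-02", "2021-01-03")
```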
Hive is the sole metastore on the market, with managed or compatible versions available on all public clouds.
It will be interesting to see if a new player emerges to overtake Hive, as has occurred with most other Hadoop-era technologies. Or perhaps an existing tool, or a combination of several (potentially format + discovery), will make Hive redundant.
Data Lifecycle Management
The data in data-intensive applications should have a lifecycle similar to the ones used to manage code. Lifecycle management tools allow for this through CI/CD operations and isolated data development environments (instead of shared buckets).
Both lakeFS and the Nessie project approach the problem by enabling git-like operations over collections of data. Notably, Nessie leverages (and depends on) the metadata created by the Iceberg data format, whereas lakeFS employs a general data model that is format agnostic.
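To give a feel for what git-like operations over a collection of data look like, here is a deliberately simplified toy model. The `ToyRepo` class and its methods are hypothetical and do not reflect the lakeFS or Nessie APIs; the sketch only shows the general idea that a branch is a set of pointers to object versions, so work on a branch stays isolated until it is merged.

```python
from typing import Dict


class ToyRepo:
    """Toy model: a branch maps each object path to a content version."""

    def __init__(self) -> None:
        self.branches: Dict[str, Dict[str, str]] = {"main": {}}

    def create_branch(self, name: str, source: str) -> None:
        # Branching copies pointers only, never the underlying data.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch: str, path: str, version: str) -> None:
        self.branches[branch][path] = version

    def merge(self, source: str, target: str) -> None:
        # Naive merge: the source wins on every path. Real tools detect conflicts.
        self.branches[target].update(self.branches[source])


repo = ToyRepo()
repo.write("main", "sales/2021.parquet", "v1")
repo.create_branch("experiment", source="main")
repo.write("experiment", "sales/2021.parquet", "v2")
isolated = repo.branches["main"]["sales/2021.parquet"]  # main is unchanged so far
repo.merge("experiment", "main")
```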
Data pipelines that run over the data lake require orchestration of tasks. A data pipeline may include the execution of hundreds or even thousands of jobs represented by a DAG, where the input of one job may depend on the output of several upstream jobs.
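The dependency structure described above can be sketched with Python's standard-library `graphlib` module (available from Python 3.9). The job names are made up for the example, and a real orchestrator adds scheduling, retries, and monitoring on top of this ordering step.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
dag = {
    "ingest_orders": set(),
    "ingest_users": set(),
    "join_orders_users": {"ingest_orders", "ingest_users"},
    "daily_report": {"join_orders_users"},
}

# static_order() yields every job after all of its upstream jobs.
run_order = list(TopologicalSorter(dag).static_order())
```

Any order that respects the dependencies is valid, so the two ingest jobs may appear in either order, but the join always runs after both and the report runs last.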
At this point, data is not only present in the system, but flowing smoothly thanks to metadata tools. Now it’s time to crunch it!
When working with data volumes common to most lakes, distributed compute engines are a must to handle the load. When it burst onto the scene in 2006, Hadoop was a significant improvement (and open source, no less). Since then the category has only continued to improve to the point of allowing near real-time computation via both SQL and code interfaces.
Distributed compute today is dominated by Spark, which is offered as an open source technology and as a managed service by the major cloud providers and other vendors.
This category of data virtualization aims high. Regardless of data’s location, it aspires to provide access to it via a single endpoint.
Trino (formerly PrestoSQL) is the first open source project to offer such federated capabilities. Today, all public clouds offer their managed version of Trino, and other virtualization technologies like Denodo are entering the market.
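The single-endpoint idea can be illustrated at a very small scale with SQLite, which lets one connection attach several databases and join across them. This is only an analogy under stated assumptions: it is not Trino, the `crm` and `billing` databases are invented for the example, and real federation spans different storage systems rather than two in-memory databases.

```python
import sqlite3

# One connection acts as the single endpoint; two attached databases stand in
# for two separate source systems (names are made up for the example).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.accounts (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (account_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.accounts VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)", [(1, 100.0), (1, 50.0), (2, 75.0)]
)

# A single query joins data that lives in two different "systems".
rows = conn.execute(
    """
    SELECT a.name, SUM(i.amount)
    FROM crm.accounts AS a
    JOIN billing.invoices AS i ON i.account_id = a.id
    GROUP BY a.name
    ORDER BY a.name
    """
).fetchall()
```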
Data Science + Analytics Usability
Not all users of the platform will be engineers, hence tools are required to close the gap between the technology and user capabilities. Here we cover a few of the areas where tools exist to improve the experience for BI and DS functions on analytic platforms.
The processes involved in the development and maintenance of machine learning models have received much attention in the last few years. This category includes dozens of different tools, including ones built in-house at large enterprises and later released as open source, like Metaflow (Netflix), Disdat (Intuit), and Kubeflow (Google). Also relevant are commercial companies with an open source strategy such as Pachyderm, DVC, and Clear.
It will be interesting to see whether an end-to-end solution can win out in managing the ML model lifecycle, or whether architectures will consist of multiple tools with a more specialized focus.
An in-depth analysis of this category can be found here.
The organization and execution of transformational queries pose challenges for analysts. As a result, tools like dbt and Dataform have exploded in popularity, providing the equivalent of an IDE for running data-intensive code and SQL.
Notebook environments like Jupyter came onto the scene and mostly made code tutorials in blogs a bit nicer looking. Since then notebook environments have become preferred interfaces for everything from exploratory analysis, to ML model training, and even production ETL jobs.
The second metadata layer doesn’t describe the data itself, but rather contains organizational metadata. The tools in this final, topmost section aim to enhance the usability of data platforms in organizational settings.
In the last 18 months, 10 new open source projects were released from large companies (see the Love Letters section below) that offer an organizational data catalog. These discovery tools allow a user to easily find datasets, visualize connections between them, contact the creators, and see how they are used.
As an org scales, it is important to make this information easy to find in order to sustain a data-driven culture that is efficient and consistent.
Lineage, Manageability and Governance
Enterprises in many verticals are committed to data auditing, reproducibility and regulation. The tools in this category simplify data management for these purposes, sometimes involving custom solutions per customer.
Quality & Observability
Finally, the quality and observability category offers rule-based or machine-learning-based data quality monitoring and testing. In an ideal world, tests would cover all data sources and be implemented at every stage of the data lifecycle.
Errors and anomalies are accepted as a given in complex data systems. The idea is to identify them before your consumers do. Though perhaps as this category matures, our expectations around data quality will rise with it.
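A rule-based check of the kind this category offers can be sketched in a few lines. The rule names, the `failed_rules` helper, and the sample rows are all hypothetical; dedicated tools add scheduling, alerting, and learned anomaly detection on top of this basic idea.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical rules: each maps a rule name to a per-row predicate.
RULES: Dict[str, Callable[[Row], bool]] = {
    "user_id is present": lambda row: row.get("user_id") is not None,
    "amount is non-negative": lambda row: (
        isinstance(row.get("amount"), (int, float)) and row["amount"] >= 0
    ),
}


def failed_rules(rows: List[Row]) -> Dict[str, int]:
    """Count how many rows violate each rule."""
    return {
        name: sum(1 for row in rows if not check(row))
        for name, check in RULES.items()
    }


sample = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": None, "amount": 5.0},
    {"user_id": 3, "amount": -2.0},
]
report = failed_rules(sample)
```

Running such checks right after ingest, and again after each transformation, is one way to catch a bad row before a downstream consumer does.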
Observations and Predictions
Phew! We made it through the analytics gauntlet! Now that we’ve touched on the major components of modern analytics platforms, I’d like to share a few trends.
1. Manageability As a First-Order Problem
The first problem we faced with big data was the feasibility of processing data at such a high scale. In solving the scale problem, people developed technologies we know today like Kafka, Spark, Presto, Snowflake, etc.
Now the problem people face is one of manageability. They no longer ask if they can handle a dataset but rather: How can I move faster when developing data-intensive applications? How do I utilize all of my data (Discoverability) and ensure it is high-quality (Quality, Observability)? Or, how do I ensure reproducibility, auditability, and governance of my data?
This context explains the explosion of tools in the categories along the topmost row of the LUMAscape: Data Discovery, Quality & Observability, and Lineage, Management & Governance. As people want to do more with their data (run more analyses, put more models into production, and so on), effective use of these tools will play an important role in enabling this. It is this domain of metadata management that I expect to see growing in the next few years.
2. Love Letters From the Future
Sophisticated data organizations like Netflix and Uber were the first to encounter the problems related to large-scale analytics. In response, they developed their own internal solutions like the Iceberg and Hudi data formats respectively to address these issues. Years later, the rest of the world is catching up and one example is the adoption of these data formats, now open-sourced.
We see the same pattern in the Orchestration space, where Airflow, originally developed at Airbnb, is now an open source product with huge adoption, and competitors like Prefect and Dagster are emerging.
A final category worth highlighting is Discovery, where it seems every notable company developed an internal data catalog tool that is now available as an open-source or paid service. Some examples are Amundsen (Lyft), DataHub (LinkedIn), Metacat (Netflix), Databook (Uber), and Dataportal (Airbnb).
3. Means of Consolidation
One thing the LUMAscape highlights is the fractured nature of the data engineering ecosystem. And it stands to reason we’ll see a degree of consolidation in the future. The question is what type of consolidation?
One option is for an end-to-end solution to emerge, more in line with the closed Snowflake platform. The other option is consolidation around an ecosystem based on open standards, aligned with the Databricks approach.
My belief is that if a consolidated solution emerges, it will form by allowing organizations to pick and choose the pieces of the puzzle that make sense for them. The result is a platform whose total value is greater than the sum of its individual parts.
From a vendor perspective, it means my chips are placed on the Databricks approach (though there is certainly room for both companies to succeed).
As one final note, I see a parallel situation in the MLOps space. Vying for market share are products offering closed, end-to-end solutions to model management. My bet, however, is that in the mid-term the ecosystem will remain fractured, with tools that satisfy their niche remaining successful.
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.