What is the Basic Data Lake?
A data lake is primarily two things: an object store and the objects being stored. It might look something like this:
Even with this basic setup, your data is in a good position to support all three of the main use cases for data: 1. BI Analytics, 2. Data-Intensive APIs, and 3. Machine Learning Algorithms.
The fact that this architecture is flexible enough to support all three speaks to the strength of object stores, particularly their flexibility in integrating with a diverse set of data processing engines.
Level 1: Modern Table Formats
As data lakes exploded in adoption, a number of improvements were made to this basic architecture. The first and most obvious improvement is to replace those pesky CSV files.
A popular improvement over CSV was, and still is, the columnar Parquet file format. Parquet is a great fit for analytic use cases because it is:
- Columnar, so queries can read only the columns they need.
- Efficiently compressed and encoded, reducing storage and I/O costs.
- Able to support complex, nested data types.
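To see why a columnar layout matters for analytics, here is a minimal, stdlib-only sketch (with invented toy data, not Parquet's actual on-disk encoding): row-oriented storage forces a scan over whole records, while a columnar layout lets a query touch only the one column it needs.

```python
# Toy illustration of row-oriented vs. column-oriented access.
# Parquet's real format adds encoding, compression, and statistics on top.

rows = [
    {"user_id": 1, "country": "US", "amount": 9.99},
    {"user_id": 2, "country": "DE", "amount": 4.50},
    {"user_id": 3, "country": "US", "amount": 12.00},
]

# Row-oriented (CSV-like): summing `amount` still iterates whole records.
row_total = sum(r["amount"] for r in rows)

# Column-oriented: pivot once, then read only the column the query needs.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_total = sum(columns["amount"])

print(row_total == col_total)  # True: same answer, far less data touched at scale
```

At data-lake scale, skipping unneeded columns translates directly into less I/O and faster scans.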
While these are key improvements, an object store's objects, however optimized they may be, remain nothing more than a loose collection of files unless you adopt a separate metastore service.
What’s missing from these collections of objects, people realized, is the abstraction of a table. In databases, tables are everywhere, and all the benefits they provide are equally valid in object stores.
This is where table formats come in: Apache Iceberg, Apache Hudi, and Delta Lake. Saving data in one of these formats makes it far easier to create tables within the object store itself, complete with a defined schema, a versioned history, and the ability to be updated atomically.
This greatly enhances the performance and usability of a data lake. And soon enough our basic data lake will look something more like this:
How do these table formats work? Well, the idea behind them is to maintain a transaction log of objects added to (and removed from) certain prefixes in the lake. This provides the important guarantee of atomicity for write operations, and lets us avoid query errors when reading and writing data simultaneously.
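The transaction-log idea can be sketched in a few lines of Python. This is a toy model only: real formats like Iceberg and Delta Lake store the log as files in the object store itself and layer on schemas, statistics, and concurrency control. All names here are invented for illustration.

```python
# Toy sketch of a table-format transaction log.

class TransactionLog:
    """Append-only log of add/remove actions under one table prefix."""

    def __init__(self):
        self._log = []  # each entry is one committed list of actions

    def commit(self, actions):
        # A commit appends all of its actions at once, so readers see
        # either the whole write or none of it (atomicity).
        self._log.append(list(actions))

    def snapshot(self, version=None):
        # Replay the log (up to `version`) to find the live set of objects.
        entries = self._log if version is None else self._log[:version]
        live = set()
        for op, path in (a for actions in entries for a in actions):
            if op == "add":
                live.add(path)
            else:  # "remove"
                live.discard(path)
        return live


log = TransactionLog()
log.commit([("add", "data/part-0.parquet"), ("add", "data/part-1.parquet")])
log.commit([("remove", "data/part-0.parquet"), ("add", "data/part-2.parquet")])

print(sorted(log.snapshot()))   # current table state
print(sorted(log.snapshot(1)))  # time travel back to version 1
```

Because readers only ever replay committed entries, an in-flight write is invisible until its commit lands, which is exactly the guarantee that prevents read/write races.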
For more details, here are two articles that dive into more of the specifics:
Level 2: Source Control for Data
While table formats made our data lake much more impressive, we're not done improving it. All of the benefits that table formats provide at the table level can be extended even further to encompass the entire data lake!
How, you ask? With a data source control tool like lakeFS, which turns an object store's bucket into a data repository in which we can track multiple datasets.
While the previous architecture is still fresh in your mind, here’s what our data lake looks like at this level:
A new layer is added to the folder hierarchy, which corresponds to the name of a branch. lakeFS lets us create, alter, and merge as many branches as we want, making it possible to do things like:
- Create multiple copies of all tables (without duplicating objects!)
- Save cross-collection snapshots of tables as commits and time-travel between them
For example, it is possible to synchronize updates to two Iceberg tables (or even a Hudi and Iceberg table) in the same lakeFS repository via a merge operation from one branch to another.
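The mechanics behind this can be sketched with a toy model of branch-based versioning, in the spirit of lakeFS: branches are just pointers to commits, and commits share object references, so "copying" a dataset creates no new objects, and one commit or merge can cover several tables at once. The class and method names here are invented for illustration and do not reflect the real lakeFS API.

```python
# Toy sketch of branches, commits, and merges over an object store.

class Repo:
    def __init__(self):
        self._commits = {0: {}}       # commit id -> {logical path: object ref}
        self._branches = {"main": 0}  # branch name -> commit id
        self._next = 1

    def branch(self, name, source="main"):
        # A new branch points at the same commit: zero objects copied.
        self._branches[name] = self._branches[source]

    def commit(self, branch, changes):
        # Start from the parent commit's mapping and apply the changes
        # as one atomic unit, covering any number of tables at once.
        snapshot = dict(self._commits[self._branches[branch]])
        snapshot.update(changes)
        self._commits[self._next] = snapshot
        self._branches[branch] = self._next
        self._next += 1

    def merge(self, source, dest):
        # Fast-forward merge: dest now sees every table updated on source.
        self._branches[dest] = self._branches[source]

    def ls(self, branch):
        return self._commits[self._branches[branch]]


repo = Repo()
repo.commit("main", {"orders/part-0": "obj-a", "users/part-0": "obj-b"})

repo.branch("experiment")                             # instant, no duplication
repo.commit("experiment", {"orders/part-0": "obj-c",  # update two tables
                           "users/part-0": "obj-d"})  # in a single commit

repo.merge("experiment", "main")  # main gets both table updates together
print(repo.ls("main"))
```

The key design point is that datasets are never physically copied: every branch and commit is just metadata over the same immutable objects, which is why snapshots and cross-table merges stay cheap even on very large lakes.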
When it comes to reproducing the state of training data for an ML experiment, or updating the data assets that power critical APIs, having your data be this nimble makes it possible to work efficiently over even the largest data lakes.
For more information on these git-inspired workflows, see the following articles:
If you are just getting started in your data lake journey, I hope this article has provided inspiration to take advantage of the latest enhancements to data lakes.
Progress in the data lake space is far from over; in fact, most would argue it’s just getting started! Getting to the cutting edge puts you in the best position to take advantage of all the innovation underway and yet to come.
Move over turtles, turns out it is objects all the way down.
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.