What is the Basic Data Lake?
A data lake is primarily two things: an object store and the objects being stored. It might look something like this:
Even with this basic setup, your data is in a good position to support all three of the main use cases for data: BI analytics, data-intensive APIs, and machine learning algorithms.
The fact that this architecture is flexible enough to support all three speaks to the strength of object stores, particularly their flexibility in integrating with a diverse set of data processing engines.
In-memory distributed processing with Spark? No problem. A columnar data warehouse like Snowflake? Piece of cake. A distributed query engine like Trino? Go for it.
Level 1: Modern Table Formats
As data lakes exploded in adoption, a number of improvements were made upon this basic architecture. The first and most obvious improvement to make is to replace those pesky CSV files.
A popular improvement over CSVs was, and still is, the columnar Parquet file format. Parquet is well suited to analytic use cases because it is:
- Columnar, so queries can read only the columns they need.
- Efficiently compressed and encoded, reducing storage and I/O.
- Able to support complex, nested data types.
While these are key improvements, an object store’s objects — however optimized they may be — remain a loose collection of files (unless you adopt a separate metastore service).
What’s missing from these collections of objects, people realized, is the abstraction of a table. In databases, tables are everywhere, and all the benefits they provide are equally valid in object stores.
This is where the table formats come in: Apache Iceberg, Apache Hudi, and Delta Lake. When data is saved in these formats, it becomes far easier to create tables within the object store itself — with a defined schema, with versioning history, and with the ability to be updated atomically.
This greatly enhances the performance and usability of a data lake. And soon enough our basic data lake will look something more like this:
How do these table formats work? The core idea is to maintain a transaction log of objects added to (and removed from) certain prefixes in the lake. This provides the important guarantee of atomicity for write operations, and lets us avoid query errors when reading and writing data simultaneously.
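The mechanism can be illustrated with a toy transaction log. This is a deliberate simplification — real formats like Iceberg, Hudi, and Delta Lake track far richer metadata — but it shows how appending a single log entry makes a multi-file write atomic, and how readers pinned to a version are unaffected by concurrent writes:

```python
# Toy transaction log: each committed version is an immutable set of
# object paths. A simplification of what Iceberg/Hudi/Delta do.

class TableLog:
    def __init__(self):
        self._versions = [frozenset()]  # version 0: empty table

    def commit(self, add=(), remove=()):
        """Atomically publish a new version by appending one log entry."""
        current = self._versions[-1]
        new = (current - frozenset(remove)) | frozenset(add)
        self._versions.append(new)
        return len(self._versions) - 1

    def snapshot(self, version=None):
        """Readers pin a version; later commits never affect them."""
        if version is None:
            version = len(self._versions) - 1
        return set(self._versions[version])

log = TableLog()
log.commit(add=["data/part-0.parquet", "data/part-1.parquet"])
reader_view = log.snapshot()  # a reader pins version 1
log.commit(add=["data/part-2.parquet"], remove=["data/part-0.parquet"])

print(sorted(reader_view))     # still sees version 1's objects
print(sorted(log.snapshot()))  # latest version's objects
```

Because a commit is a single append, a query never observes a half-written set of files — it sees the table either entirely before or entirely after the write.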
For more details, here are two articles that dive into more of the specifics:
Level 2: Source Control for Data
While table formats made our data lake much more impressive, we’re not done improving it. All of the benefits modern table formats provide at the table level can be extended even further to encompass our entire data lake!
How, you ask? With a data source control tool like lakeFS, which turns an object store’s bucket into a data repository in which we can track multiple datasets.
While the previous architecture is still fresh in your mind, here’s what our data lake looks like at this level:
A new layer is added to the folder hierarchy, which corresponds to the name of a branch. lakeFS lets us create, alter, and merge as many branches as we want, making it possible to do things like:
- Create multiple copies of all tables (without duplicating objects!)
- Save cross-collection snapshots of tables as commits and time-travel between them
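To make the zero-copy idea concrete, here is a toy model — a sketch only, not lakeFS’s actual implementation — in which commits are immutable snapshots of object references and branches are movable pointers. “Copying” every table onto a new branch copies no data objects at all:

```python
# Toy git-like versioning over an object store: branches point to
# commits, commits map logical paths to object references.
# (A sketch under simplified assumptions -- not lakeFS internals.)

class Repo:
    def __init__(self):
        self.commits = {}             # commit_id -> {path: object_ref}
        self.branches = {"main": None}
        self._next = 0

    def commit(self, branch, changes):
        parent = self.branches[branch]
        snapshot = dict(self.commits.get(parent, {}))
        snapshot.update(changes)      # new refs; old objects untouched
        cid = f"c{self._next}"
        self._next += 1
        self.commits[cid] = snapshot
        self.branches[branch] = cid
        return cid

    def branch(self, name, source):
        # "Copy" every table on `source`: O(1), zero objects duplicated.
        self.branches[name] = self.branches[source]

    def read(self, ref):
        # `ref` may be a branch name or a commit id (time travel).
        cid = self.branches.get(ref, ref)
        return self.commits.get(cid, {})

repo = Repo()
first = repo.commit("main", {"tables/users/part-0.parquet": "obj-a1"})
repo.branch("experiment", "main")   # instant, zero-copy snapshot
repo.commit("experiment", {"tables/users/part-0.parquet": "obj-b7"})

print(repo.read("main"))            # unchanged by the experiment
print(repo.read("experiment"))      # sees the new object reference
```

Reading by commit id (`repo.read(first)`) is the time-travel case: any historical snapshot stays reachable as long as its commit is retained.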
For example, it is possible to synchronize updates to two Iceberg tables (or even a Hudi table and an Iceberg table) in the same lakeFS repository via a merge operation from one branch to another.
Whether you are reproducing the state of training data for an ML experiment or updating data assets that power critical APIs, having your data versioned in this way makes it possible to work efficiently over even the largest data lakes.
For more information on these git-inspired workflows, see the following articles:
If you are just getting started in your data lake journey, I hope this article has provided inspiration to take advantage of the latest enhancements to data lakes.
Progress in the data lake space is far from over; in fact, most would argue it’s just getting started! Getting to the cutting edge puts you in the best position to take advantage of all the innovation underway and yet to come.
Move over turtles, turns out it is objects all the way down.