Oz Katz Author

Oz Katz is the CTO and Co-founder of lakeFS, an...

Published on February 6, 2024

This article was originally published on Datanami and is republished here with permission.

As we officially kick off 2024, I realized I have a few thoughts on the direction of the data landscape that might be of interest to others. 

This is a recap of my “predictions.” 

I will admit that it’s a mix of what I believe will happen with what I’d like to see happen, but being human, it’s sometimes hard to separate between the two.

To make the list easier to consume, I’m dividing it into two distinct parts: the Data Lake and the Serving Layer.

The Data Lake

Here’s what I believe we will see this year in the area of analytics, OLAP and data engineering:

Moving on from Hadoop

In 2023, tools such as DuckDB (C++), Polars (Rust) and Apache Arrow (Go, Rust, JavaScript, ...) became very popular, exposing cracks in the complete dominance of the JVM and C/Python in the analytics space. I believe the pace of innovation outside the JVM will accelerate, sending the existing Hadoop-based architectures into the legacy drawer.

While most companies already aren’t using Hadoop directly, much of the current state of the art is still built on Hadoop’s scaffolding: Apache Spark completely relies on Hadoop’s I/O implementation to access its underlying data. Many lakehouse architectures are based either on Apache Hive-style tables, or even more directly, on the Hive Metastore and its interface to create a tabular abstraction on top of their storage layer.
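To make "Hive-style tables" concrete: in this layout, partition column values are encoded in directory names as key=value segments, and query engines derive the partition columns from the path itself. The following is a minimal sketch of reading such a path, not code from Hive or Spark; the bucket, table and column names are made up for illustration.

```python
from urllib.parse import unquote


def parse_hive_partitions(path: str) -> dict[str, str]:
    """Extract partition columns from the key=value directory segments of a path."""
    partitions = {}
    # The last segment is the file name, so only directory segments are inspected.
    for segment in path.strip("/").split("/")[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            # Assumption: special characters in values are percent-encoded.
            partitions[key] = unquote(value)
    return partitions


# Hypothetical object path; bucket, table and column names are illustrative only.
example = "s3://example-bucket/warehouse/events/year=2024/month=02/part-00000.parquet"
print(parse_hive_partitions(example))
```

Because the table's structure lives in the directory listing rather than in a separate metadata file, discovering what a table contains means listing paths, which is one reason newer table formats track files explicitly instead.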

The modern digital infrastructure
Source: XKCD, under Creative Commons Attribution-NonCommercial 2.5 License.
Slightly modified by Oz Katz. Cloudera’s logo is trademarked to Cloudera Inc.

While Hadoop and Hive aren’t inherently bad, they no longer represent the state of the art. For one, they are completely based on the JVM, which is incredibly performant nowadays, but still not the best choice if you’re looking to get the absolute best out of CPUs that are simply not getting any faster.

Additionally, Apache Hive, which marked a huge step forward in big data processing by abstracting away the underlying distributed nature of Hadoop and exposing a familiar SQL(-ish) table abstraction on top of a distributed file system, is really starting to show its age and limitations: lack of transactionality and concurrency control, lack of separation between metadata and data, and other shortcomings we’ve come to know over its more than 15 years of existence.

I believe this year we’ll see Apache Spark moving on from these roots: Databricks already have a JVM-free implementation of Apache Spark (see: Photon), while new table formats such as Apache Iceberg are also stepping away from our collective Hive roots by implementing an open specification for table catalogs, as well as providing a more modern approach to the I/O layer. 

Battle of the meta-stores

With Hive slowly but steadily becoming a thing of the past and open table formats such as Delta Lake and Iceberg becoming ubiquitous, a central component in any data architecture is also being displaced: the “meta-store”, the layer of indirection between files on an object store or filesystem and the tables and entities that they represent.
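As a rough illustration of what that layer of indirection does, here is a toy sketch of a catalog that maps a table name to the set of files currently making up the table, and only accepts a commit if the writer saw the latest version. This is not the API of the Hive Metastore, Unity Catalog, Glue or any real catalog; `ToyCatalog` and its methods are hypothetical names invented for this example.

```python
import threading


class ToyCatalog:
    """Maps a table name to the files that currently make up the table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tables = {}  # table name -> (version, tuple of file paths)

    def files(self, table):
        """Return the current version and file list of a table."""
        version, files = self._tables[table]
        return version, list(files)

    def commit(self, table, expected_version, new_files):
        """Swap in a new file list only if nobody committed in between."""
        with self._lock:
            current_version, _ = self._tables.get(table, (0, ()))
            if current_version != expected_version:
                return False  # the writer worked from a stale view of the table
            self._tables[table] = (current_version + 1, tuple(new_files))
            return True


# Hypothetical paths, for illustration only.
catalog = ToyCatalog()
created = catalog.commit("events", 0, ["s3://example-bucket/events/a.parquet"])
version, files = catalog.files("events")
appended = catalog.commit("events", version, files + ["s3://example-bucket/events/b.parquet"])
stale = catalog.commit("events", version, ["s3://example-bucket/events/c.parquet"])
print(created, appended, stale)
```

Whoever controls this mapping controls what readers consider to be "the table", which is why the choice of catalog matters even when the file format underneath is open.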

While the table formats are open, it seems that their meta-stores are growing increasingly proprietary and locked down.

Databricks are aggressively pushing users to their own Unity Catalog, AWS has Glue, and Snowflake has its own catalog implementation too. These are not interoperable, and in many ways they become a means of vendor lock-in for users looking to leverage the openness afforded by the new table formats. I believe that at some point the pendulum will swing back, as users push towards more standardization and flexibility.

My co-founder, Dr. Einat Orr, and I wrote a thorough analysis on the subject of metastore vendor lock-in, which I highly recommend reading.

Big Data Engineering as a practice will mature

As analytics and data engineering become more prevalent, the body of collective knowledge is growing and best practices are beginning to emerge.

In 2023 we saw tools that promote a structured dev-test-release approach to data engineering becoming more mainstream. dbt is vastly popular and established by now. Observability and monitoring are now also seen as more than just nice-to-haves, judging by the success of tools such as Great Expectations, Monte Carlo and other quality and observability platforms. lakeFS (whose blog you’re currently reading) advocates for versioning of the data itself to allow git-like branching and merging, making it possible to build robust, repeatable dev-test-release pipelines.

Additionally, we are now seeing patterns such as the Data Mesh and Data Products being promoted by everyone, from Snowflake and Databricks to startups popping up to fill the gaps in tooling that still exist around these patterns.

I believe in 2024 we’ll see a surge of tools that aim to help us achieve these goals. From data-centric monitoring and logging to testing harnesses and better CI/CD options – there’s a lot of catching up to do with software engineering practices, and this is the right time to close these gaps.

The Serving Layer

Cloud native applications will move a larger share of their state to object storage

At the end of 2023, AWS announced one of the biggest features to come to S3, its core storage service, pretty much since its inception in 2006.

That feature, named “S3 Express One-Zone”, allows users to use the same* standard object store API as provided by S3, but with consistent single-digit millisecond latency to access data, at roughly half the cost for API calls.

To me, this marks a dramatic change. Until now, the use cases for object storage were somewhat narrow: while object stores allow storing pretty much infinite amounts of data, you’d have to settle for longer access times, even if you’re only looking to read small amounts of data.

This trade-off obviously made them very popular for analytics and big data processing, where latency is often less important than overall throughput, but it meant that low latency systems such as databases, HPC and user-facing applications couldn’t really rely on them as part of their critical path. If they made any use of the object store, it would typically be in the form of an archival or backup storage tier. If you wanted fast access, you had to opt for a block device, attached to your instance in some form, and forgo the benefits of scalability and durability that object stores provide.

I believe S3 Express One-Zone is the first step towards changing that.

With consistent, low latency reads, it is now theoretically possible to build fully object-store-backed databases that don’t rely on block storage at all. S3 is the new disk drive.

With that in mind, I predict that in 2024 we will see more operational databases starting to adopt that concept in practice: allowing databases to run on completely ephemeral compute environments, relying solely on the object store for persistence.

[Comparison table: Object Stores (S3) vs. Block Storage (EBS) vs. S3 Express One-Zone, rated on high throughput needs, high throughput writes, fast random reads, fast random writes, durability, price and scalability. The ratings were shown as icons.]

Operational databases will begin to disaggregate 

With the previous prediction in mind, we can take this approach a step further: what if we standardized the storage layer for OLTP the same way we standardized it for OLAP?

One of the biggest promises of the Data Lake is the ability to separate storage and compute, so that data being written by one technology could be read by another.

This gives developers the freedom to choose the best-of-breed stack that best fits their use case. It took us a while to get there, but with technologies such as Apache Parquet, Delta Lake and Apache Iceberg, this is now doable.

What if we manage to standardize the formats used for operational data access as well? Let’s imagine a key/value abstraction (perhaps similar to LSM sstables?) that allows storing sorted key value pairs, optimally laid out for object storage. 
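To sketch what such an abstraction could look like: the toy example below serializes sorted key/value pairs into blocks and keeps a sparse index of each block's first key, so a point lookup needs a single ranged read. It is an illustration under simplifying assumptions, not the layout of any real database; `build_sstable`, `lookup` and `read_range` are hypothetical names, and an in-memory byte string stands in for an object in the store.

```python
import bisect
import json


def build_sstable(pairs, block_size=2):
    """Serialize sorted key/value pairs into blocks plus a sparse index.

    Returns (blob, index), where index is a list of (first_key, offset, length).
    """
    items = sorted(pairs.items())
    blob = bytearray()
    index = []
    for i in range(0, len(items), block_size):
        block = json.dumps(items[i:i + block_size]).encode("utf-8")
        index.append((items[i][0], len(blob), len(block)))
        blob += block
    return bytes(blob), index


def lookup(key, index, read_range):
    """Find a key with at most one ranged read; read_range(offset, length) -> bytes."""
    first_keys = [entry[0] for entry in index]
    # The candidate block is the last one whose first key is <= the wanted key.
    pos = bisect.bisect_right(first_keys, key) - 1
    if pos < 0:
        return None  # the key sorts before the first block
    _, offset, length = index[pos]
    block = json.loads(read_range(offset, length).decode("utf-8"))
    return dict(block).get(key)


blob, index = build_sstable(
    {"banana": "2", "apple": "1", "cherry": "3", "date": "4", "elder": "5"}
)
reads = []


def read_range(offset, length):
    """Stand-in for a ranged GET against an object store."""
    reads.append((offset, length))
    return blob[offset:offset + length]


print(lookup("cherry", index, read_range), reads)
```

The index is small enough to cache in memory, so the cost of a lookup is dominated by that one ranged read, which is exactly the operation that low-latency object storage makes cheap.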

We can deploy a stateless RDBMS to provide query parsing/planning/execution capabilities on top, even as an on-demand lambda function. Another system might use that same storage abstraction to store an inverted index for search, or a vector similarity index for a cool generative AI application.

While I don’t believe a year from now we’ll be running all our databases as lambda functions, I do think we will see a shift from “object stores as an archive tier,” to more “object store as the system of record” happening in operational databases as well.

Object store as the system of record

Summary 

Overall, I’m optimistic that 2024 will continue to evolve the data landscape in mostly the right directions: better abstractions, improved interfaces between different parts of the stack and new capabilities unlocked by technological evolution. 

Yes, it won’t always be perfect, and ease of use will sometimes come at the cost of flexibility, but having seen this ecosystem grow over the last two decades, I think we’re in better shape than we’ve ever been. We have more choice, better protocols and tools, and a lower barrier to entry than ever before. And I don’t think that is likely to change.
