By Einat Orr, PhD, CEO and Co-founder of lakeFS

Last updated on June 10, 2024

Since 2021 we’ve been releasing the annual State of Data Engineering Report, a compilation of all the relevant categories that have a direct impact on data engineering infrastructure.

In 2024, we see three primary trends influencing the categories covered in this report.

To access a PDF version of this report, with links to companies and a list view of all the categories, click here (note that we will require some contact info).

Trend #1: GenAI influence on software infrastructure

As predicted in the 2023 State of Data Engineering Report, the 2024 edition is heavily influenced by the rise of Generative AI. In the 2024 report, we discuss this influence on storage, computation engines, MLOps, and observability tools, but you can also find its footprints in features across almost every category. Since we are not discussing hardware, you'll see Nvidia mentioned only once: to say that GPUs are out of scope for this report.

Trend #2: Expansion of product offerings

Another theme of the 2023 report was the economic downturn's impact on companies' growth. While market indexes are moving up, the gains are driven by a small number of companies leading the GenAI revolution in hardware and foundation models. For the rest of the market the struggle is still ongoing, making it harder for tech companies in the data domain to grow. As a result, we continue to see companies evolving and expanding their products into adjacent domains in an effort to increase revenue.

Trend #3: Open table formats and their catalogs creating closed gardens

Open table formats and their catalogs are making waves in the data lake, and may become the center of the war between the giants Databricks and Snowflake. The catalogs are becoming the technology that turns an open environment into a closed garden, and Databricks and AWS seem to be pushing in that direction. The February Apache Iceberg community call revolved around excluding all catalogs except one from the codebase, which resulted in some interesting open source releases of catalogs to counter what seemed to be a move by Tabular to take over the community. Oh, the drama!

Now that we've got you curious, let's dive into the details.

Ingestion

This layer includes streaming technologies and SaaS services that provide pipelines from operational systems to data storage.

Kafka is so much the standard that its interface has effectively become a protocol. Other players trying to enter this space, such as WarpStream, Redpanda, and Apache Pulsar, expose the same interface on top of their own innovative technologies. Read more about this observation here.
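Because these engines speak the Kafka wire protocol, a standard Kafka client works against them unchanged. Here is a minimal sketch using the confluent-kafka Python client; the broker address and topic are placeholders you would swap for your own Kafka, Redpanda, or WarpStream endpoint.

```python
from confluent_kafka import Producer

# Any Kafka-protocol-compatible broker works here: Apache Kafka,
# Redpanda, WarpStream, etc. The address below is a placeholder.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Called once the broker acknowledges (or rejects) the message.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}]")

producer.produce("events", key=b"user-42", value=b'{"action": "click"}',
                 callback=on_delivery)
producer.flush()  # block until all queued messages are delivered
```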

Data Lakes

This layer includes object storage technologies used as data lakes. 

The rise of deep learning has placed new requirements on storage and architecture. Data retrieval must be fast, to reduce the time, and hence the cost, of model training as much as possible; storage performance becomes critical. The data lake category, i.e. the object storage providers, rose to the challenge. MinIO has always focused on performance and has benchmarks to show for it. Amazon released S3 Express One Zone, a new storage tier offering an order-of-magnitude improvement in data retrieval performance. VAST Data released the VAST Data Platform, a product aimed at AI use cases that delivers the performance needed for deep learning.

The need to get the data closer to the computation has led to the release of several open source tools that allow local mounting of storage data, such as Mountpoint for Amazon S3 and Azure ML Data Assets. lakeFS will launch its own mount capability in early June.
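As a minimal sketch of why mounting matters: once a bucket is mounted (for example with Mountpoint for Amazon S3, `mount-s3 <bucket> <dir>`), training code can read objects through ordinary file I/O. The bucket name and paths below are hypothetical.

```python
import os

# Assumes the bucket was mounted beforehand, e.g. with Mountpoint for Amazon S3:
#   mount-s3 my-training-bucket /mnt/data
# Bucket name and paths here are hypothetical.
MOUNT_DIR = "/mnt/data"

def iter_training_files(subdir="images"):
    # Objects appear as regular files, so standard file I/O (and any
    # framework data loader built on it) works unchanged.
    root = os.path.join(MOUNT_DIR, subdir)
    for name in sorted(os.listdir(root)):
        with open(os.path.join(root, name), "rb") as f:
            yield name, f.read()

for name, payload in iter_training_files():
    print(name, len(payload), "bytes")
```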

Metadata Management

Open Table Formats (OTF)

The three players in this category, Apache Hudi, Apache Iceberg, and Delta Lake, are all fully open source under a foundation, and each has a commercial company maintaining it as part of its core business strategy.

In the past year Apache Iceberg declared victory in this category, and judging by the movement in the market towards Apache Iceberg-based environments, this may very well not change. Snowflake declared its Iceberg support last year, AWS focused on Iceberg support in its data products, and Starburst recently announced product support for Apache Iceberg, putting its offering on par with Dremio, which was built on Iceberg from day one. Keep reading for more aspects of the Apache Iceberg takeover of open table formats (OTF).

It seems that Delta Lake is now chosen mainly by Databricks users. Considering Databricks' growing market share, though, that is still a very large base.

Metastores

This category includes metastores for data lakes, which, among other things, provide an SQL interface to the lake.

A direct impact of the Iceberg takeover is the wave of new and advanced Iceberg REST catalogs that compete with Tabular, such as OpenHouse, released to open source by LinkedIn, and Gravitino, released as part of an OSS data platform. While the Iceberg creators are becoming more and more commercial, the enterprises that contributed heavily to Apache Iceberg OSS continue to nurture the OSS ecosystem in and around Iceberg. IMHO, this struggle over the independent and open nature of Apache Iceberg is just beginning.
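Part of what makes the REST catalog contested ground is that it is a specified HTTP API rather than a vendor SDK, so any compliant client can talk to any compliant server. A hedged sketch with PyIceberg, where the catalog URI, warehouse, and table names are placeholders for whichever REST-compliant catalog you run:

```python
from pyiceberg.catalog import load_catalog

# The URI and warehouse below are placeholders; any Iceberg REST-compliant
# catalog (including the OSS ones mentioned above) should work the same way.
catalog = load_catalog(
    "analytics",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",
        "warehouse": "s3://my-warehouse/",
    },
)

table = catalog.load_table("sales.orders")      # namespace.table
scan = table.scan(row_filter="order_date >= '2024-01-01'")
print(scan.to_arrow().num_rows)
```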

We also provided our take on how metastores are becoming the door to closed gardens and vendor lock-in.

Data Version Control, or Git for Data Lakes

This category includes data version control systems that allow engineering best practices in data products.

This year, the category expanded to include XetHub and Oxen, both coming out of stealth, along with Underhive releasing its open source project. This trend shows the growing understanding in the market that data version control is not a feature of an analytics or ML system, but rather an organizational infrastructure over the data. This is the approach we have taken at lakeFS since day one, providing a highly scalable data version control system with the features required to accommodate any data practitioner in data engineering, ML, AI, and analytics.

The 2023 Technology Radar by Thoughtworks listed the OSS project DVC as a technology to adopt, giving strong validation to the AI/ML use case for data version control.

Another source of validation for the AI/ML use case is Hugging Face's long-standing use of Git LFS to version its data. If anyone is fit to teach us how to work with data in AI/ML, Hugging Face would be it. 🤗

For those of you wondering, here's how data version control interacts with open table formats.

And if you're ready to test your data versioning skills in the lake, the best place to start is by spinning up a local environment.
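As a hedged sketch of the versioning model: lakeFS exposes an S3-compatible endpoint in which the repository acts as the bucket and the object key is prefixed by the branch, so the same path can hold different versions on different branches. The endpoint, credentials, and names below are placeholders for a local quickstart environment.

```python
import boto3

# Placeholder endpoint and credentials for a local lakeFS environment.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",   # lakeFS S3 gateway
    aws_access_key_id="<lakefs-access-key>",
    aws_secret_access_key="<lakefs-secret-key>",
)

# Repository name acts as the bucket; the key is prefixed by the branch,
# so "main" and "experiment" can hold different versions of the same path.
with open("users.parquet", "rb") as f:
    s3.put_object(Bucket="quickstart",
                  Key="experiment/datasets/users.parquet", Body=f)

obj = s3.get_object(Bucket="quickstart", Key="main/datasets/users.parquet")
print(len(obj["Body"].read()), "bytes on main")
```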

Compute

Distributed Compute

This category includes technologies for distributed computation.

No news is good news? We think not…

Analytics Engines

This category includes databases that provide analytics capabilities for data analysis.

In the race to be relevant for GenAI, many members of this category offer capabilities in the vector search domain (see the vector databases category). From a user perspective this is a natural expansion, since one database can serve several needs within the data architecture, making it more manageable. Examples are ClickHouse and Elastic, which have a strong natural use case in data management around LLMs in general.

Not really new, but Snowflake is going down the stack with Snowpark and Databricks is going up the stack with its serverless SQL warehouse. Databricks adapted fastest to GenAI needs and hence grew its business much faster than Snowflake over the last year.

As we mentioned above, Starburst released their Apache Iceberg-based warehouse, directly competing with Dremio.

Orchestration and Observability

Orchestration

This category includes tools that design and manage data pipelines.

The trend of investing in AI/ML is evident in the orchestration category, with most tools working to provide functionality specific to AI/ML pipelines. 

We also see the trend of consolidation between categories happening here. For example, Dagster now offers Dagster+: Dagster with a built-in, automatically populated catalog, lineage, data observability capabilities, and an OpenAI integration for observing and managing OpenAI API calls.
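To illustrate the general pattern (this is a generic sketch, not Dagster+'s built-in integration), an orchestrated asset can wrap an OpenAI call so the pipeline run records when and how the API was used. Asset names, the model, and the prompt are illustrative.

```python
from dagster import asset, Definitions
from openai import OpenAI

@asset
def raw_tickets():
    # Stand-in for an ingestion step; in practice this would read from a source.
    return ["Login fails with 500 error after password reset.",
            "Export to CSV times out for large reports."]

@asset
def ticket_summaries(raw_tickets):
    # Calls the OpenAI API from inside an orchestrated asset, so each run
    # of the pipeline is tracked alongside the rest of the data lineage.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    summaries = []
    for ticket in raw_tickets:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user",
                       "content": f"Summarize this support ticket:\n{ticket}"}],
        )
        summaries.append(resp.choices[0].message.content)
    return summaries

defs = Definitions(assets=[raw_tickets, ticket_summaries])
```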

Observability

This category includes tools that provide data quality testing and monitoring, along with data pipeline health monitoring.

While data observability has long been a standby of SMB and mid-market stacks, enterprises are increasingly adopting the technology. More and more, data observability solutions are being used not just to monitor the data sources (both internal and external) but also the infrastructure, pipelines, and post-ingestion systems. Enterprises are also viewing data observability as a foundational priority for their generative AI initiatives. Gartner has identified data observability as a must-have for AI-ready data. 

This may be the reason observability became a popular addition to tools from other categories. It is now provided by end-to-end MLOps tools, the Databricks and Snowflake platforms, orchestration tools, and catalogs. The competition in this domain is no longer between tools with the same offering, but with consolidated platforms that also provide other basic infrastructure.
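To ground the idea, here is a toy sketch of the kinds of checks these tools automate at scale, expressed over a pandas DataFrame; the column names, data, and thresholds are illustrative.

```python
import pandas as pd

def observe(df: pd.DataFrame, ts_col: str = "updated_at") -> dict:
    """Toy versions of common observability checks: freshness, volume, nulls."""
    now = pd.Timestamp.now(tz="UTC")
    return {
        # Freshness: how stale is the newest record?
        "staleness_hours": (now - df[ts_col].max()).total_seconds() / 3600,
        # Volume: did we receive a plausible number of rows?
        "row_count": len(df),
        # Completeness: share of nulls per column.
        "null_rate": df.isna().mean().to_dict(),
    }

df = pd.DataFrame({
    "order_id": [1, 2, 3, None],
    "updated_at": pd.to_datetime(
        ["2024-06-01", "2024-06-02", "2024-06-03", "2024-06-03"], utc=True),
})
print(observe(df))
```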

Similar to 2023, Monte Carlo is paving the way forward in this category, with G2 recognizing it as the #1 Data Observability Platform and data-driven organizations like Cisco, American Airlines, and NASDAQ leveraging Monte Carlo to drive more reliable AI systems.

Data Science + Analytics Usability

End-to-End MLOps tools

This category includes platforms that provide the full MLOps suite from experimentation management to production observability, or a large portion of that functionality.

As one would expect, this category was heavily influenced by GenAI: interfaces to foundation models, prompt engineering tools, RAG and fine-tuning support, observability, and reproducibility are part of the functionality added to these tools to keep them relevant to GenAI users. If in the past we suspected this category was trying to bite off more than it could chew, there is even more to chew now.

Data-Centric AI/ML

This category includes platforms that manage the data throughout the MLOps journey, from experimentation to production and back.

A term coined by Andrew Ng, former head of Google Brain, data-centric ML refers to the need to prioritize the quality and integrity of the data powering your machine learning and AI systems, often over the reliability of the models themselves. In 2024, data-centric ML is now data-centric AI. Retrieval-augmented generation (RAG) and fine-tuning will enable teams to build custom, enterprise-grade AI for use cases beyond experimentation, and the tools in this category are best positioned to make RAG easier and more reliable, together with data version control, data observability, and vector databases.
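For readers new to RAG, here is a compressed sketch of its retrieval step, using plain numpy cosine similarity instead of a vector database; the embedding function is a crude stand-in for a real embedding model, and the documents and query are made up.

```python
import numpy as np

def embed(texts):
    # Stand-in for a real embedding model: hash tokens into a fixed-size vector.
    vecs = np.zeros((len(texts), 64))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, hash(tok) % 64] += 1.0
    return vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-9)

docs = [
    "lakeFS provides git-like branches over object storage.",
    "Iceberg tables track snapshots of data files.",
    "Vector databases index embeddings for similarity search.",
]
doc_vecs = embed(docs)

query = "how do I branch my data lake?"
scores = doc_vecs @ embed([query])[0]          # cosine similarity per document
top = int(np.argmax(scores))

# The retrieved passage is then placed into the LLM prompt as grounding context.
prompt = f"Answer using this context:\n{docs[top]}\n\nQuestion: {query}"
print(prompt)
```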

ML observability and monitoring

This category includes tools that provide model quality metrics and monitor model drift.

While we see ML observability tools offering GenAI-related observability features, we also have new joiners with a focus on GenAI, such as Hunnyhive, NannyML, and AIMon, and the data observability category is leaning towards this domain as well, so this year the competition here is fierce.
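As a hedged sketch of one common check such tools run under the hood, a feature's serving distribution can be compared to its training distribution with a two-sample Kolmogorov–Smirnov test; the data and threshold below are synthetic.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic example: the feature's distribution shifts in production.
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
serving_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)

stat, p_value = ks_2samp(train_feature, serving_feature)
if p_value < 0.01:            # illustrative significance threshold
    print(f"drift detected: KS statistic={stat:.3f}, p={p_value:.2e}")
else:
    print("no significant drift")
```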

For this category, as for data observability, the question remains: are there really two categories of observability, one for data and one for AI/ML, or should organizations use one solution for both? 

Vector Databases

This category includes databases that specialize in managing vectors with high read and write performance for finding neighboring vectors.

Vector databases saw the biggest growth in terms of the number of databases in the category, not to mention making a huge splash in discussions. A vector database is considered a must-have for any AI/ML environment, to ensure the performance of operations over embeddings. This need obviously grows with GenAI models and fine-tuning. We recently reviewed the vector database category extensively and continue to update our top picks regularly.
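To make the core operation concrete, here is a minimal sketch of nearest-neighbor search with FAISS; vector databases wrap this kind of search with ANN indexes, persistence, filtering, and scaling. The vectors are random stand-ins for real embeddings.

```python
import numpy as np
import faiss

dim = 128
rng = np.random.default_rng(42)

# Random stand-ins for document embeddings and a query embedding.
doc_embeddings = rng.random((10_000, dim)).astype("float32")
query = rng.random((1, dim)).astype("float32")

index = faiss.IndexFlatL2(dim)   # exact L2 search over all vectors
index.add(doc_embeddings)

distances, ids = index.search(query, 5)   # top-5 nearest neighbors
print("nearest document ids:", ids[0])
```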

That said, there are some voices who challenge the common assumption that vector databases are inevitable. Here’s one of them.

Notebooks and workflow management

Come to our booth at the Data+AI Summit to get a sticky “notebook”!

Catalogs, permissions, and governance

This category includes platforms for data discoverability, access control and governance.

For GenAI and other AI systems to be performant and effective, they need to be trustworthy, compliant, and secure. We predict that ensuring the ethics and accountability of LLMs will fall to data leaders themselves rather than to GenAI tiger teams and other ML engineering groups. The tools best positioned to assist with that are data governance tools, for example by integrating into RAG flows.

In this category we also see expansion into other domains. An example is Acryl Data, the company behind DataHub, adding data observability capabilities to its catalog.

Conclusion

It is evident that 2024 is shaping up to be about more than just Generative AI. While organizations are adding infrastructure to support their interaction with foundation models such as ChatGPT, and the fine-tuning of such models using GPUs, other fundamental changes are happening that require adopting new infrastructure in data environments. Examples are vector databases, open table formats and their catalogs, and data version control systems.

Take a look back at previous reports and compare the evolution of the State of Data Engineering:
