A lot has happened since 2022, from the rise of Generative AI to the economic slowdown and job losses that hit data practitioners in 2023. In fact, it’s safe to say that the GenAI hype isn’t high enough to counterbalance the struggling economy: demand for data practitioners is now declining, after rising for the last 15 years.
Nevertheless, diving into this year’s report, there’s one “analyst” we can’t ignore: ChatGPT. I asked ChatGPT what the state of data engineering in 2023 looked like and this is the answer I got:
I have to say, it was still surprising that it had no idea about the hype it has caused around generative AI, even when I didn’t limit the output to 50 words. How very modest!
The way the trend and investment focus on GenAI will influence the data engineering space is likely to become a major topic in our 2024 State of Data Engineering review.
At this point, we’re seeing some preliminary LLM-specific features in MLOps platforms, developer experience tools focused on GenAI-based applications such as Continual, and of course LLM libraries (details in the MLOps section below).
Data engineering in the face of a global economic slowdown
Emerging technologies struggle when budgets shrink since they don’t have a budget line to begin with. But those that make it are probably onto something that can grow quickly once the economy picks up.
Technologies based on open source have the opportunity to grow their community, educate the market, and improve their products so that when the economy improves, their paid offerings will have a solid foundation to grow on. Our favorite example is GitHub.
Established domains see greater competition over pricing and need to add more value to maintain their customers’ annual contract value (ACV). What we see on our map is more acquisitions, more consolidation, and fewer new players. Don’t worry; ultimately, this crisis will bring us great products and might clear the clutter around what is really needed in the data domain versus fad trends.
So without further ado, let’s take a look at the 2023 State of Data Engineering.
Table of contents
- Data engineering in the face of a global economic slowdown
- Data Lakes
- Metadata Management
- Data Version Control, or Git for Data lakes
- Analytics Engines
- Orchestration and Observability
- Data Science + Analytics Usability
- Notebooks and workflow management
This layer includes streaming technologies and SaaS services that provide pipelines from operational systems to data storage.
Confluent, by the original creators of Apache Kafka, recently acquired Immerok, a major contributor to Apache Flink. What is the product vision behind this acquisition? It’s all about incorporating real-time processing into Confluent’s data streaming platform.
Another Flink-related innovation comes from Decodable. It gives users the ability to run their own custom and managed Flink jobs on the platform.
To check out a cool new kid on the block, look into the open-source project Memphis. It’s a Kubernetes-native messaging platform that can come in handy if you work with real-time data pipelines. The solution offers schema enforcement, one-click integrations, and out-of-the-box notifications and monitoring. No doubt, a really nice addition to the cloud-native ecosystem!
Data Lakes

This layer includes object storage technologies used as data lakes. This separation is even clearer in 2023: we now see a full split between the storage layer and the querying/compute engine across all major players.
Snowflake is now open to work with any S3-compatible object storage. It has also launched collaborations with on-prem storage providers such as Dell/EMC.
This is part of a broader trend among storage providers this year. They’re looking for ways to move value up the stack toward data practitioners, either via integrations and OSS compatibility or by implementing advanced capabilities in the storage layer itself, as in the case of Vast Data.
This category and its three players, Apache Hudi, Apache Iceberg, and Delta Lake, are all fully open-source under a foundation and have commercial companies maintaining them as part of their core business strategy.
Open Table Formats (OTF)
First things first: we have open table formats! Unless you’re living under a rock, you’ve probably heard of them and read the benchmarks and functional comparisons published by objective parties such as Dremio, AWS, and academics.
Open table formats may have now become a standard, but we still see many organizations using several data formats. That happens even just for tabular data, and you’re bound to see multiple formats in use whenever semi-structured and unstructured data are involved.
At last! Hive Metastore has alternatives
Two years ago, we asked why HMS was still here, and what might replace it. We might finally have some answers.
Tabular launched its Data Catalog, which is a metastore for Iceberg tables that can be used anywhere. It self-optimizes to speed up your queries while reducing your cloud storage costs. The solution also provides granular Role-Based Access Control, which enables safer collaboration – particularly important in enterprise adoption.
Unity Catalog, from Databricks, holds a very similar promise but without the format limitation. Clearly, the performance gains are greatest with Delta tables, but other aspects of the catalog are available for other table formats. One can drop in Unity Catalog as a replacement for an HMS implementation.
Since the metastore world is still dominated by HMS, the above represent two different migration approaches: the former starts by converting tables to Iceberg, while the latter starts by replacing HMS and then optimizing with Delta tables when needed.
Data Version Control, or Git for Data lakes
While you’ll probably notice that we listed some of these tools in the data-centric ML/AI category in our report, Data Version Control is now officially a category!
lakeFS proved its value as the leading player in this category. It’s both scalable and performant while supporting all data formats. In the past year, we made Azure a first-class citizen, added deeper support for open table formats, and made it easier to use lakeFS for both analytics and ML/AI use cases.
Analytics Engines

An interesting development in this category is compute based on Apache Arrow, a language-agnostic software framework for developing data analytics applications that process columnar data. Interest is growing in analytics engines built on it; good examples are Apache Arrow DataFusion and InfluxDB. Dremio’s founders are also behind the Apache Arrow project, and Dremio was the first engine to rely on it.
Ahana was acquired by IBM, probably to serve as its managed PrestoDB service. Ahana delivers a distributed SQL data lake query engine as a service in its Ahana Cloud offering.
This category seems to be growing even now as new players are rapidly gaining traction and old players continue to grow.
Snowflake is deepening its investment in Apache Iceberg and seems to be heading toward its own catalog for it, using the new REST API the community built for catalogs: the very same one Tabular is based on.
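As a rough illustration of what adopting that REST catalog API looks like in practice, here is a minimal Spark configuration pointing at an Iceberg REST catalog. The catalog name, URI, and runtime version below are placeholders, not details taken from this report:

```properties
# Register an Iceberg catalog named "demo" backed by a REST catalog service.
spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type=rest
spark.sql.catalog.demo.uri=http://localhost:8181
# Iceberg's Spark runtime jar (version must match your Spark/Scala build).
spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.0
```

Because the engine only speaks the REST protocol, the catalog implementation behind the URI can be swapped without changing table definitions.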
DuckDB’s popularity exploded this year, proving that ease of use and a great developer experience are still differentiating features, even in a deep red ocean. DuckDB’s support for the Arrow protocol makes it very simple to pick up and use, almost regardless of your stack.
Orchestration and Observability
This is yet another category experiencing growth in 2023. Although dominated by a major player (Airflow), the market keeps on growing and different pipeline management approaches are popping up all over the place.
A great example of this is Shipyard, which shows how a no-code approach can work in this space. There’s also Mage, which offers both batch and real-time pipeline orchestration, with built-in monitoring.
We also see more traditional orchestration tools that are well-established in the DevOps world being used by data practitioners. One example of that is Argo Workflows.
The observability category was established and is still led by Monte Carlo. Last year brought more competition from other companies who are seeking to differentiate themselves by focusing on specific features.
While Monte Carlo monitors both metadata and data, Anomalo, with its enterprise platform approach and focus on the data itself rather than its operational aspects, is one example of advances in the observability category. Another interesting project is Lightup, which offers a no-code approach, similar to Bigeye and Monte Carlo.
On the acquisitions side, we can mention Databand, a pioneer in operational observability for data pipelines, which was acquired by IBM to become part of its DataOps platform.
Data Science + Analytics Usability
This is the category where we see the biggest impact of the GenAI trend on infrastructure technologies. This is just the crest of the wave, but the transformation is already starting.
End-to-End MLOps tools
On the MLOps horizon, we saw an interesting new LLM-specific feature launched by W&B. It allows engineers to check every step of their LLM program for easier debugging and more insights about model behavior.
It should also be noted that we removed all the open source projects from our State of Data Engineering 2022 chart that did not gain any real traction in the past year.
This subcategory can’t escape the end-to-end trap either, but the tools listed here take a different approach to the functionality they provide: they put data and its management at the center of their missions.
Riding on the ChatGPT wave, Activeloop released a ‘ChatGPT Interface’ for exploring ML datasets. More of a marketing stunt than a product feature, but still fun.
This feature touches on a very general point with data scientists and APIs: the usability requirements for this target group are different than those of software engineers. Any tool that provides value to data scientists knows this challenge first-hand.
What’s happening on the acquisitions front? Veteran data version control and pipeline tool, Pachyderm, was acquired by HPE to become part of its MLOps platform and allow reproducibility at scale.
ML observability and monitoring
This subcategory includes tools that deliver monitoring and observability of model quality. Just like the data observability category, this space is growing and gaining momentum.
An interesting example here is Phoenix, an open-source library to monitor LLMs for hallucinations from Arize AI. Products like this one might soon make AI less of a black box than it is today, opening new doors to accelerated adoption.
Notebooks and workflow management
When it comes to Hex, Snowflake acts as both its investor and partner. As part of its investment in Snowpark, the company’s foray into AI/ML, Snowflake is open to integrations. And Hex is here to provide this integration to speed up data workflows over Snowflake.
On a different front, dbt announced it is doubling its price for both new and existing users. We believe the new pricing model better reflects their value, given that the adoption of engineering best practices in data is key to a cost-effective data engineering operation. Recently, the dbt team was downsized; even with a restructured pricing model, and just like other companies we’ve mentioned, no one is immune to the economic slowdown.
Catalogs, permissions, and governance
Databricks is acquiring security companies, creating an additional moat around its platform. A good example is the recent acquisition of Okera, a data governance platform focused on AI.
During the post-pandemic high, the tech industry flourished, but every area is now experiencing a slowdown, and data engineering is no exception. Well, maybe generative AI is the exception.
Our chart tracing the State of Data Engineering in 2023 shows more mergers and acquisitions, fewer new businesses entering the market, and many businesses failing to get enough traction to survive this year.
Ultimately, we believe that the current challenges will translate into a slower, but more deliberate growth of the data engineering space. Tight budgets and less VC funding are bound to help us figure out what we really need in the data world.