Einat Orr, PhD

Last updated on May 22, 2024

TL;DR:

DataOps complexity arises from unclear roles and responsibilities (R&Rs), a lack of standardization in interfaces, the complexities of distributed technology, and difficulties in implementing engineering best practices. The solution is to define clear responsibilities, address missing requirements, and manage data pipelines efficiently using emerging solutions that enhance the manageability and resilience of DataOps.

What makes DataOps so hard is, well, the data itself. 

DataOps is DevOps for data, and because of that data, it’s just a bit more complicated. Data not only adds another layer of complexity but also makes all the other parts of Ops harder.

Keep reading to learn where the complexity of DataOps comes from and what tools are out there to help teams manage it.

What is DataOps?

DataOps is a collection of technical practices, workflows, cultural norms, and architectural patterns that enable:

  1. Rapid innovation and experimentation
  2. Resilience and high quality 
  3. Monitoring and fast recovery in production 
  4. Collaboration across a complex array of people, technologies, and environments

You might notice, like I did, that we already have names for some of these challenges in the Ops world:

  • Rapid innovation and experimentation – We create a development environment to enable this kind of experimentation for engineering.
  • Resilience and high quality – These are related to DevOps activities like CI/CD and testing, as well as setting up staging environments.
  • Monitoring and fast recovery – Once we're in production, these two are self-explanatory, and we have them in the DevOps world as well.
  • Collaboration across a complex array of people, technologies, and environments – Since DevOps involves many different interfaces and people, collaboration is a common challenge.

What makes DataOps different from DevOps?

The difference is, well, the data.

In DataOps, we’re dealing with another layer on top of everything else: the data architecture. In other words, managing data becomes another challenge on top of all the other DevOps challenges.

The fact that we have this additional aspect not only adds to our workload but also makes the other aspects that we’re used to and know from DevOps more complex.

Let’s go over them and see why DataOps is so hard.

1. Data architecture

Why does data architecture present a problem? Basically, because there are no best practices for it yet.

Here’s an example of data lake architecture:

Data Lake Architecture Reference

This architecture is very general. If you already have a data architecture in place or are about to build one, you might not use all these components. But you’re definitely going to have some of them.

Let’s dive into the details to build an understanding of data applications that will come in handy later on:

  • Events data – Data sources such as event data streams or data ingest infrastructure, for example Flink, Kafka, or Spark Streaming.
  • Operations data – Usually database replication into object storage, a data lake, or a single source of truth, whatever we choose to call it.
  • Data processors – Processing engines such as Spark and Presto, which also call for orchestration tools such as Airflow and for SQL access to the data through tools such as Hive. This is, generally speaking, the Hadoop ecosystem.
  • Analytics engines – Once upon a time, this was the only component of a data architecture. Today it’s not necessarily a relational database, but it keeps the logic of one: it presents tabular data in a way that lets analysts query it using SQL.
  • Object storage – The storage layer that holds the data throughout the stages of its evolution, from raw data to the features and metrics that analysts and ML engineers use.
  • Data visualization – Tools that used to be called BI tools, such as Tableau or Looker, which can either communicate directly with the storage or work over an analytics engine.
  • Data exploration – The notebooks researchers use to develop models over the data.
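
As a rough sketch of how these pieces connect, here’s a minimal PySpark job that reads raw events from object storage, aggregates them, and writes a table that an analytics engine or BI tool can query. The bucket names, paths, and event schema are made up for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events-to-metrics").getOrCreate()

# Raw events land in object storage (e.g., from Kafka or Flink sinks).
raw_events = spark.read.json("s3a://example-lake/raw/events/2024-05-01/")

# A data processor (Spark here) turns raw events into an analyst-friendly table.
daily_metrics = (
    raw_events
    .withColumn("event_date", F.to_date("event_time"))
    .groupBy("event_date", "event_type")
    .count()
)

# Write the result back to the lake in a columnar format so that an analytics
# engine or a BI tool (Presto/Trino, Tableau, Looker) can query it.
daily_metrics.write.mode("overwrite").parquet("s3a://example-lake/metrics/daily_events/")
```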

When picking the parts for your architecture, you’ll quickly notice that the tools presented here overlap in their capabilities.

Unfortunately, the industry hasn’t yet come up with the best formula for choosing the tools that match your unique challenges and goals.

The tooling choice is tricky

The ecosystem is based on open-source tools, but there are also paid solutions that provide managed services, which take some of the workload off the Ops team. 

The choice poses another challenge. That’s because we have four very strong players here – AWS, Confluent, Databricks, and Snowflake – with entirely different agendas. The only thing they have in common is the philosophy of conquering all: 

  • AWS wants you to run your entire data pipeline on AWS using their hosting. 
  • Confluent would tell you that for managing Kafka in real time, you don’t need storage and that there’s no such thing as a batch because you can do everything in real time. 
  • Databricks would tell you the entire world is about Spark. 
  • And Snowflake would tell you to run your Spark over Snowflake, which is actually a data warehouse.

Each of them builds on top of open source (or, in Snowflake’s case, closed source) and manages it in line with its own philosophy. As a result, they erode the compatibility that naturally exists between these open-source tools. 

No standardization

When managing data, we’re dealing with several different layers:

Data management missing standardization

Here’s a quick overview of what you’re looking at:

  • Storage – where the data is saved 
  • Format – the file format used to save the data
  • Indexing – how the data is indexed to improve read and/or write performance
  • Data frame within the compute application – how data is represented in the memory of the compute system 
  • Compute systems – computation engines that run analysis logic, written in SQL, Python, Java/Scala, etc., over the data

This is, of course, an approximation that serves to help us understand that any mismatch in these layers could become a mismatch when working with different tools together in the data environment.
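
To make the layering concrete, here’s a small Python sketch that stacks them up: S3 as the storage layer, Parquet as the format, Arrow/pandas as the in-memory data frame, and plain Python as the compute layer. The bucket, path, and column names are hypothetical.

```python
import pyarrow.dataset as ds
import pyarrow.fs as pafs

# Storage layer: an object store (S3 here; the bucket is hypothetical).
s3 = pafs.S3FileSystem(region="us-east-1")

# Format + indexing layer: Parquet files, whose column statistics let the
# reader skip data that doesn't match the filter.
dataset = ds.dataset("example-lake/metrics/daily_events/", filesystem=s3, format="parquet")

# In-memory data frame layer: an Arrow table, convertible to pandas.
arrow_table = dataset.to_table(filter=ds.field("event_type") == "purchase")
df = arrow_table.to_pandas()

# Compute layer: analysis logic written in Python/SQL runs over that frame.
print(df.groupby("event_date")["count"].sum())
```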

You can make them work together, but when you need high throughput and performance, you might find yourself with a lack of compatibility standing in your way.

For example, the Apache Iceberg community has invested a lot of time in making sure Spark performs well when used with Apache Iceberg. Another example is Databricks, which optimized its proprietary version of Spark for Delta Lake. Apache Arrow, meanwhile, has become a standard for in-memory computation.
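
For instance, getting Spark and Iceberg to work together is mostly a matter of wiring the right extensions and catalog into the Spark session. A minimal local sketch follows; the package coordinates, catalog name, and warehouse path are placeholders you’d adjust to your own versions and setup.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    # The Iceberg runtime jar must match your Spark/Scala versions; this
    # coordinate is only an example.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "file:///tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, event_type STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'purchase')")

# Each write creates a snapshot; listing them is the entry point to the
# time travel (VERSION AS OF / TIMESTAMP AS OF) mentioned later in the post.
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
```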

2. Whose role is it? Ownership and R&Rs

The second DataOps challenge is related to who is responsible for what. In the world of data, this is a little more complicated than usual.

Here’s why.

Missing requirements

The first challenge is: who is going to give us the requirements?

When developing a regular software application, we have a product manager who works with an R&D team using agile methodologies. It’s their job to define the requirements and make sure that the business, the customer, and the technology fit together.

Data applications, on the other hand, are often internal applications that serve the organization and the stakeholders within it who need insights about the business. If there’s no product manager, and for some organizations it isn’t intuitive to put a product manager in that position, we may miss all kinds of requirements. 

This has serious consequences. For example, accuracy requirements are critical for understanding how to manage the data.

DataOps: how to manage the data

In this scenario, we don’t always understand the business logic behind the queries running in the data environments or what they mean to the business. 

We also don’t have a budget or a demand for cost-effectiveness to answer questions like: How much does it cost us to get to a certain accuracy, and is it worth it? 

We’re kind of in the dark about that. As a result, we might make decisions without having clarity about their impact on the results or the goal that we’re meant to achieve for the organization.
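
One pragmatic way to surface these missing requirements is to force them into code: make accuracy or completeness thresholds explicit so a stakeholder has to sign off on the numbers. Here’s a minimal sketch; the thresholds and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical thresholds that a product owner or business stakeholder should own.
MAX_NULL_RATE = 0.01       # at most 1% missing customer IDs
MAX_DUPLICATE_RATE = 0.001 # at most 0.1% duplicate orders

def validate_orders(df: pd.DataFrame) -> None:
    """Fail the pipeline run if the dataset misses the agreed accuracy bar."""
    null_rate = df["customer_id"].isna().mean()
    duplicate_rate = df.duplicated(subset=["order_id"]).mean()
    if null_rate > MAX_NULL_RATE:
        raise ValueError(f"customer_id null rate {null_rate:.2%} exceeds {MAX_NULL_RATE:.2%}")
    if duplicate_rate > MAX_DUPLICATE_RATE:
        raise ValueError(f"duplicate order rate {duplicate_rate:.2%} exceeds {MAX_DUPLICATE_RATE:.2%}")

# Called as a gate in the pipeline before publishing the dataset downstream.
```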

DBA – is it Dev or Ops?

This second challenge relates to a very old and important role called Database Administrator (DBA).

Once upon a time, when our data environment was just a great PostgreSQL database running on a single machine and providing the organization with all the data that it needed, we had:

  • Ops running the machine 
  • Users accessing the data using the SQL interface of the DB (data analysts, data scientists, or engineers querying the data) 
  • DBA – the person whose job is to make sure the internals of the database (think indexing, schema, or configuration) optimize the data querying.

Today, the data ecosystem looks entirely different.

We’re dealing with a much more complex environment where we manage a lot of distributed systems that run over data, such as Spark, Kafka, or Presto. None of these are classic relational databases; they’re distributed systems that process and query the data. 

We still have the engineers or users who query the data. And we still have the Ops running VMs and clusters. 

But we no longer have that person in the middle who understands how indexing formats and environment configuration impact the type of queries or code that we run within those environments.
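
To illustrate the kind of knowledge that missing role holds, here’s a small PySpark sketch (paths and columns are hypothetical): whether a dataset is partitioned decides whether a date-filtered query scans everything or prunes down to a single directory, and someone has to own that decision.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()
events = spark.read.json("s3a://example-lake/raw/events/")  # hypothetical source

# Without partitioning, a date-filtered query has to scan every file.
events.write.mode("overwrite").parquet("s3a://example-lake/events_flat/")

# Partitioning by date lets the engine prune whole directories at read time,
# classic "DBA knowledge" that now has no obvious owner.
events.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://example-lake/events_by_date/"
)

one_day = (
    spark.read.parquet("s3a://example-lake/events_by_date/")
    .where("event_date = DATE'2024-05-01'")  # only that partition is read
)
one_day.show()
```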

This is often something we figure out as we go along. So, server developers who started building data environments suddenly found themselves acting as DBAs of a complex set of distributed systems. In other cases, the Ops people find themselves becoming the DBAs of those environments.

In the past few years, organizations have realized that this role is missing and have tried to build the knowledge and create the capabilities. In some organizations, data engineers take on this role. In others, it’s the domain of DataOps or data infrastructure teams.

The important takeaway here is that someone needs to take on this role because it’s critical for the success of a data environment.

3. Data pipelines: are they pets or cattle?

Data pipelines are definitely pets. 

In the classic analogy, a pet is a single machine running a mission-critical database: we nurture it and make sure it stays very stable.

A NoSQL database that runs on a cluster, on the other hand, represents cattle, because if one of the machines fails, we can simply replace it with another one. Until we do, the cluster knows how to handle itself without the missing machine.

Let’s look at this data pipeline example:

DataOps data pipeline example

When we talk about Kafka, Spark, or even the analytical database or storage, we talk about cattle. These environments are distributed systems with the capabilities of cattle.

But when we want to run another version of our production data environment, we’re not looking at a machine that is running a part of the Kafka cluster. We’re looking at duplicating the entire pipeline and running it in parallel. 

What we want to have is the ability to spin up the entire pipeline and run an experiment on some of the code. So, we must set up a variety of distributed systems that we or third parties will manage. And we want to do that efficiently, so this is an Ops challenge.
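
One common way to keep such duplication manageable is to parameterize every job by environment, so the exact same code can target production or an isolated experiment copy of the pipeline. A minimal sketch, with made-up variable names and paths:

```python
import os

# The same job code runs in prod or in an experimental copy of the pipeline;
# only the environment decides which Kafka cluster and storage prefix it uses.
ENV = os.environ.get("PIPELINE_ENV", "prod")                      # e.g. "prod" or "exp-42"
KAFKA_BOOTSTRAP = os.environ.get("KAFKA_BOOTSTRAP", "localhost:9092")
DATA_PREFIX = f"s3a://example-lake/{ENV}"                         # per-environment data root

def output_path(table: str) -> str:
    """Resolve where this environment writes its tables."""
    return f"{DATA_PREFIX}/tables/{table}"

# e.g. s3a://example-lake/prod/tables/daily_events when PIPELINE_ENV is unset
print(output_path("daily_events"))
```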

This becomes more complex because the systems creating our production pipeline are complex. But this is not the only complexity that we’re adding here. We also need to manage data, and we’re talking about production data here. 

So, where do we hold this production data?

We can hold it in object storage, which is what modern architectures do thanks to the cost-effectiveness and high throughput of object storage. But we may also need to save the data in a database. A lot of organizations save the data in several places, according to their requirements.

Production data is the single source of truth that we want users to consume and trust. So, we guard it the way we’d guard any other production component.

This means that if we want to duplicate our data pipelines, we also need a way to duplicate (not necessarily physically) the data that is running over them. 

In the world of databases, this is solved by creating a replication of the database – we physically replicate the data. In the world of object storage, we can also do replication. However, because object storage doesn’t provide the guarantees given by databases, this could be a bit dangerous. 
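
For example, a naive “duplicate the data for the experiment” in object storage is just a loop of copy operations. A sketch with hypothetical bucket names: nothing coordinates the copy, objects written mid-copy may be missed, there are no snapshot semantics, and the storage bill doubles.

```python
import boto3

s3 = boto3.client("s3")
SRC_BUCKET, DST_BUCKET = "prod-data-lake", "experiment-data-lake"  # hypothetical names

# Physically replicate every object under a prefix, one copy at a time.
# There's no snapshot isolation: objects written while this loop runs
# may or may not end up in the copy, and storage costs double.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix="tables/"):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=DST_BUCKET,
            Key=obj["Key"],
            CopySource={"Bucket": SRC_BUCKET, "Key": obj["Key"]},
        )
```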

Application lifecycle management: the complexity of creating development environments and CI/CD for data 

Another best practice that’s very hard to implement in data environments is the development environment and CI/CD for the data. These are critical for creating a resilient environment and high-quality output.

Application lifecycle management

The challenges of object storage

Although I highly recommend using object storage, we need to consider a few things. If the environment has low visibility and is managed manually, it won’t help you work efficiently or build resilient pipelines. 

That’s because, logically, object storage is managed like a big shared folder. Objects are immutable, which means we can replace one object with another, but we can’t change an object in place.

There are also no transactional guarantees and no cross-collection consistency. 
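
As a small illustration (the bucket and keys are invented), the only way to “change” an object is to overwrite it wholesale, and related objects are updated in separate, uncoordinated writes:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name

# Objects are immutable: there is no partial update, only a full overwrite.
new_rows = json.dumps([{"id": 1, "value": 42}])
s3.put_object(Bucket=BUCKET, Key="tables/orders/part-0001.json", Body=new_rows)

# Updating the data and its manifest are two independent writes. If the second
# write fails, readers see an inconsistent state: no transaction spans both objects.
manifest = json.dumps({"files": ["tables/orders/part-0001.json"]})
s3.put_object(Bucket=BUCKET, Key="tables/orders/_manifest.json", Body=manifest)
```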

You might want isolation and the ability to work on your own copy of the data, or you might want a duplicate of your data pipeline to work on its own copy of the data. In both cases, you need to copy the data. 

But if we create a lot of copies in a manually managed environment, we’re creating an error-prone situation.

Technologies that help here

Manageability has attracted a lot of attention in the past few years for a reason. Problems like the performance and throughput of data infrastructure have largely been solved, or are at least good enough for most organizations. So, manageability is what’s left.

Data management tools and technologies

On the left-hand side, you can see a set of tools from the MLOps world that provide solutions for the people who manage and create ML models: they allow replication of pipelines and versioning of data so that data scientists can work efficiently.

But we do so much more with data than just building ML models, so this solves only part of the problem.

On the right-hand side, two sets of solutions aim to give the data itself characteristics that would help manage it no matter what the application on top is:

  • Data formats that run within the object storage and allow the object storage to provide mutability via those formats. They also provide time travel within a table, which is helpful for manageability and recovery from errors.
  • Git-like operations over object storage that let you keep several versions of the data or spin up another data pipeline. From an Ops perspective, you can simply open a branch of the data and run the pipeline over that branch, giving you a fully isolated data pipeline for experimentation, as sketched below. This solution helps solve both parts of the puzzle.
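
Here’s a hedged sketch of that second approach, assuming a lakeFS installation exposing its S3 gateway, a repository named example-repo, and an experiment branch already created from main (for example with lakectl or the UI); the endpoint, credentials, and names are placeholders. Spark addresses data as s3a://&lt;repository&gt;/&lt;branch&gt;/&lt;path&gt;, so pointing a pipeline at an isolated branch is just a path change rather than a physical copy.

```python
from pyspark.sql import SparkSession

# Spark talks to lakeFS through its S3 gateway; endpoint and credentials
# below are placeholders for your own installation.
spark = (
    SparkSession.builder.appName("lakefs-branch-demo")
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    .getOrCreate()
)

# Read production data from the main branch...
prod_df = spark.read.parquet("s3a://example-repo/main/tables/daily_events/")

# ...and write the experimental output to an isolated branch: no physical
# duplication of the production data, just a different branch in the path.
experiment_df = prod_df.withColumnRenamed("count", "events")
experiment_df.write.mode("overwrite").parquet(
    "s3a://example-repo/experiment-1/tables/daily_events/"
)
```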

Wrap up

DataOps is DevOps for data, and it’s more complex because of that very aspect: data. Data doesn’t only add a layer of complexity but also makes all the other Ops aspects more complicated.

On the upside, a lot of good work has been done over the last 10 years to improve the situation, including very strong players that bring viable solutions. And over on the data front, we also see new tools emerge that are helping teams build resilient data pipelines and duplicate data pipelines for experimentation.

Data is a valuable asset for the entire organization. As your organization evolves, a lot of datasets that were once used by only one application for analysis start being used by many applications in different ways. 

So, teams need to adopt data versioning at the infrastructure level, as a horizontal capability that covers the data itself, rather than as a vertical feature used by only one application.

This is what lakeFS brings to the table. To learn more, check out the documentation or go straight to GitHub and give lakeFS a try!
