VC surveys show that the MLOps category has significantly expanded in the past few years, with hundreds of companies defining themselves as part of this dynamically evolving niche.
MLOps systems provide the infrastructure allowing ML practitioners to manage the lifecycle of their work from development to production in a robust and reproducible manner. An MLOps tool may cover end-to-end (E2E) needs or focus on a specific phase or artifact in the process, such as R&D or feature management.
Take a look at the State of Data Engineering 2022 map to see the sheer number of MLOps tools out there.
The world of data involves a continuum of practitioners, from analysts using mainly ad hoc SQL statements to researchers running proprietary algorithms.
Can one DevOps approach rule them all? Or is ML such a unique practice that it requires its own process for best practices and matching infrastructures?
Inspired by DevOps practices, new approaches such as MLOps and DataOps have evolved precisely to help teams face challenges around data and ML operations.
To answer this question, we will look into the basis of DevOps and how DataOps is a natural area of expertise within DevOps that suits data practitioners’ needs. Later, we will examine the requirements of ML projects and try to understand if and how they may differ from DataOps.
Last but not least, we will examine how much infrastructure an ML practitioner should handle themselves. Is it different for other data practitioners? How does it compare to software engineers? This matters for the question at hand, because it drives how much Ops support the practitioner needs.
What is DevOps?
According to Wikipedia:
“DevOps is a methodology in the software development and IT industry. Used as a set of practices and tools, DevOps integrates and automates the work of software development (Dev) and IT operations (Ops) as a means for improving and shortening the systems development life cycle. DevOps is complementary to agile software development; several DevOps aspects came from the agile way of working.”
Let’s break it all down.
Agile methodology goes hand in hand with DevOps. It relies on the team’s ability to maintain a short feedback loop between product designers and users.
To maintain a short feedback loop, you need an efficient software delivery lifecycle from development to production. The infrastructure and tools required to maintain this process are under the responsibility of the DevOps team.
So, EFFICIENCY IS THE NAME OF THE GAME.
In a nutshell, the main components under DevOps responsibility are:
- Development environment: An environment that allows collaboration and testability of new or changed code.
- Continuous integration: The ability to continuously add new/changed code to the code base while maintaining its quality.
- Staging: Ensuring the system’s quality, including new/changed functionality, before deploying it into production, by setting up and running quality tests in an environment similar to production.
- Continuous deployment: The ability to deploy new/changed functionality into production environments.
- Monitoring: Observing the health of production and the ability to quickly recover from issues by rolling back.
- Modularity: The ability to easily add components, such as new services, into production while maintaining production stability and health monitoring.
You will see many job titles for this role, depending on organizational structure (DevOps/SRE/Production Engineering), but the responsibilities stay the same.
This role is basically responsible for providing an infrastructure to move code from development to production. The software engineers who are building the product functionality may participate in choosing some of the tools that are more specific to their expertise, such as aspects of their development environment.
To support this goal and enable agile processes, software engineers are trained in a variety of tools, including source control such as Git, automation tools such as Jenkins, as well as unit and integration testing platforms.
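To make that concrete, here is a minimal, hypothetical sketch of the kind of check a CI server such as Jenkins would run on every change: a small piece of application code and a pytest-style unit test that gates the merge. The function and file names are illustrative only.

```python
# pricing.py — hypothetical application code under test
def apply_discount(price: float, discount_pct: float) -> float:
    """Return the price after applying a percentage discount."""
    if not 0 <= discount_pct <= 100:
        raise ValueError("discount_pct must be between 0 and 100")
    return round(price * (1 - discount_pct / 100), 2)


# test_pricing.py — executed by the CI server (e.g. via `pytest`) on every commit
# (in a real project, apply_discount would be imported from pricing.py)
import pytest

def test_apply_discount_happy_path():
    assert apply_discount(100.0, 25) == 75.0

def test_apply_discount_rejects_invalid_percentage():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)
```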
Any software engineer knows that the most critical on-the-job training they get is around understanding application lifecycle management and working with the tools that support it. Your productivity is much higher once you master that, and it quickly becomes a natural part of your day-to-day work.
Now that we’re clear on the role of DevOps in current software development practices, let’s take a look at how it extends to the world of data and machine learning. Enter DataOps.
What is DataOps?
DataOps is DevOps for data-intensive applications. Such applications rely on data pipelines to produce the data derivatives that are at the heart of the application.
Examples of data-intensive applications include internal BI systems, digital health applications that rely on large data sets of patients to improve the diagnostics and treatment of disease, autonomous driving capabilities in cars, the optimization of manufacturing lines, generative AI engines, and many more.
The goal of a DataOps team is similar to that of a DevOps team, but its stack includes expertise in the technologies that allow data practitioners to achieve a short feedback loop.
Such technologies include distributed storage, distributed compute engines, distributed ingest systems, orchestration tools to manage data pipelines, and data observability tools to allow quality testing and production monitoring of the data aspects of a data-intensive application.
In a nutshell, the DataOps expertise will enable:
- Development environment: An environment that allows collaboration and testing of new or changed data pipelines. The infra will include not only the management of the functional code, but also of the pipeline code and the data.
- Continuous integration of code: The ability to continuously add new/changed code to the code base.
- Continuous integration of data: The ability to continuously add new/changed data to the data set.
- Staging: To ensure the system’s quality with new/changed functionality before deploying it into production. This will include testing of both code and data (a minimal sketch follows below).
- Continuous deployment: The ability to automatically deploy new/changed functionality and/or data into production environments.
- Monitoring: Observing the health of production and the ability to quickly recover from issues. This will include both the application and the data incorporated in it.
- Modularity: The ability to easily add components, such as new services or data artifacts, into production while maintaining production stability and health monitoring.
Drawing from DevOps best practices, DataOps extends them to cover both code and data, allowing smooth operations in any ML project that works with both of them.
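As a minimal illustration of the data side of that extension (a hedged sketch, not a specific product; column names, thresholds, and paths are hypothetical), a staging gate for a data pipeline might validate schema and null rates before a new batch is promoted to production:

```python
import pandas as pd

# Hypothetical expectations for a pipeline's output table
EXPECTED_COLUMNS = {"user_id", "event_time", "purchase_amount"}
MAX_NULL_RATE = 0.01  # allow at most 1% missing values per column


def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality violations for a new data batch."""
    errors = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    for column, null_rate in df.isna().mean().items():
        if null_rate > MAX_NULL_RATE:
            errors.append(f"{column}: null rate {null_rate:.2%} exceeds threshold")
    return errors


if __name__ == "__main__":
    batch = pd.read_parquet("staging/new_batch.parquet")  # hypothetical staging location
    violations = validate_batch(batch)
    if violations:
        raise SystemExit("Blocking promotion to production:\n" + "\n".join(violations))
```

In practice, a check like this runs in CI for the pipeline code and in staging for each new batch of data, mirroring what DevOps already does for code alone.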
So, what can teams gain with MLOps?
DataOps vs. MLOps
In the context of DevOps and DataOps, MLOps is the variant of DataOps that focuses on the ML lifecycle.
It’s time to answer the main question of this article: Is MLOps truly different from DataOps? And if so, how?
Since ML-based applications require code, data, and environment version control, as well as orchestration and the provisioning of data technologies, their needs in those domains are similar to the needs of other data practitioners and fall well within the DataOps scope as defined here.
The same is true for data quality and monitoring tools. While those tools may be ML-specific in some parts of the testing, this is no different than what we see between the tools used by a C++ developer vs. a JavaScript developer. We don’t define those as different categories from DevOps, do we?
Note that the need for such tools is not in question, and the data quality and data monitoring categories will be successful, but they don’t change the DevOps paradigm or make MLOps an actual product category.
This brings us to where the difference really lies: in the development environment.
This difference is known in DevOps, and it’s real. Every practitioner in software engineering has specific requirements for their dev environment. Those needs come on top of the basic code and data version control and the well-configured notebook required by all. For example, ML practitioners require a good experiment management system, a good way of optimizing hyperparameters, and a good way of creating their training sets.
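To give a sense of how thin the core of experiment management is, here is a hedged, library-free Python sketch that records the hyperparameters, metrics, and dataset version of each training run (dedicated tools add a UI, comparisons, and lineage on top of this idea; all names and paths here are hypothetical):

```python
import json
import time
import uuid
from pathlib import Path

EXPERIMENTS_DIR = Path("experiments")  # hypothetical location for run records


def log_run(params: dict, metrics: dict, dataset_version: str) -> Path:
    """Persist a single training run so it can be compared and reproduced later."""
    run = {
        "run_id": uuid.uuid4().hex,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "dataset_version": dataset_version,  # e.g. a data-versioning revision
        "params": params,                    # hyperparameters used for the run
        "metrics": metrics,                  # evaluation results of the run
    }
    EXPERIMENTS_DIR.mkdir(exist_ok=True)
    path = EXPERIMENTS_DIR / f"{run['run_id']}.json"
    path.write_text(json.dumps(run, indent=2))
    return path


# Example usage after a training job:
log_run(
    params={"learning_rate": 0.01, "max_depth": 6},
    metrics={"auc": 0.87, "f1": 0.74},
    dataset_version="v42",
)
```

Hyperparameter optimization and experiment comparison are specialized needs, but they still sit comfortably on top of general Ops infrastructure. The last requirement on that list is a different story.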
The training set. Oh, that is indeed a difference!
Data scientists require infrastructure to store and manage the tagging of the data they rely on to train their models. While some of this infrastructure is general data infrastructure (such as the storage or the database used to save the tagging metadata), some of it is extremely specific to the tagging process itself and to the management of training sets.
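As a hedged sketch of what the tagging-specific part might store, here is a minimal record for a single labeled sample, kept separate from the raw data itself (the schema and field names are hypothetical):

```python
from dataclasses import dataclass, asdict


@dataclass
class LabeledSample:
    """Metadata for one tagged item in a training set (hypothetical schema)."""
    sample_uri: str            # where the raw item lives, e.g. object storage
    label: str                 # the tag assigned by the annotator
    annotator_id: str          # who tagged it, for auditing and quality control
    training_set_version: str  # which training-set snapshot it belongs to


sample = LabeledSample(
    sample_uri="s3://raw-data/images/0001.png",
    label="cat",
    annotator_id="annotator-17",
    training_set_version="v3",
)
print(asdict(sample))  # ready to be written to the tagging metadata store
```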
Does this justify a new Ops category? I don’t think so. It would be the same as saying that an application that requires a graph database would need a new Ops category.
Why did we overfit MLOps?
In many discussions about the infrastructure needs of data scientists, the question of their basic software skills is raised. I’m talking about statements like “Don’t expect a data scientist to understand Git concepts” or “Data scientists cannot create code that has proper logging; they’re not software engineers”, etc.
I resent that line of thinking, and I think it has led us to overfit MLOps.
Data scientists are highly skilled individuals who can quickly grasp the concepts of version control and master the complexities of working with automation servers for CI/CD. As I mentioned above, junior software engineers learn this on the job, and so should data scientists who intend to bring value to a commercial company.
I feel we are creating a category of tools to solve a training gap with data scientists. This survey supports this opinion.
That said, the question of how much Ops software engineers should be exposed to is still open, and different organizations take different views on how DevOps is implemented within the organization.
What is common to all is the understanding that the DevOps team is responsible for providing infrastructure, and the software engineers should understand it, use it, and bring requirements to make sure it continuously improves. When moving to data engineers, the expectation remains the same.
Why should it change for ML engineers and data scientists?
Conclusion
Organizations with vast data operations (soon, all organizations) should make sure their DevOps teams have the data expertise to provide high-quality, general data infrastructure and best practices for all data practitioners. They should also make sure all practitioners are well educated in how to use this infrastructure to optimally manage their pipelines. In enterprises, that could definitely become a dedicated department.
Good practices and internal education may remove the need for overfitted tools that ultimately deprive data practitioners of the much-needed flexibility offered by general Ops tools. This approach also facilitates transitions between domains, opens the door to broader career opportunities, and improves communication between teams thanks to the alignment of core concepts and mindset.
The advantage of an overfitted tool is that in the short term, for a very simple system, it provides a one-stop shop for all needs. Why work with our DevOps team if we can just buy this E2E tool?
But in the long run, needs outgrow such simplified use cases, and the depth of dedicated, expert-grade systems is required. For example, for orchestration, instead of using E2E tools that offer simplified orchestration among other things, companies will move to powerful orchestration systems such as Airflow, Prefect, or Dagster.
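For a flavor of that depth, here is a minimal Airflow DAG sketch (the pipeline and task names are hypothetical, and the schedule uses the Airflow 2.4+ style); a dedicated orchestrator makes dependencies, retries, and scheduling explicit instead of hiding them inside an E2E tool:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling raw data")                     # placeholder for real extraction logic

def build_training_set():
    print("building the training set")            # placeholder for feature/label creation

def train_model():
    print("training and registering the model")   # placeholder for the training job


with DAG(
    dag_id="ml_pipeline_example",   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",              # Airflow 2.4+ style scheduling
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    dataset_task = PythonOperator(task_id="build_training_set", python_callable=build_training_set)
    train_task = PythonOperator(task_id="train_model", python_callable=train_model)

    extract_task >> dataset_task >> train_task  # explicit dependency graph
```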