
Received 20 February 2023, accepted 16 March 2023, date of publication 27 March 2023, date of current version 3 April 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3262138

Machine Learning Operations (MLOps): Overview, Definition, and Architecture
DOMINIK KREUZBERGER 1, NIKLAS KÜHL 1,2, AND SEBASTIAN HIRSCHL 1
1 IBM, 71139 Ehningen, Germany
2 Information Systems and Human-Centric Artificial Intelligence, University of Bayreuth, 95447 Bayreuth, Germany

Corresponding author: Niklas Kühl (kuehl@uni-bayreuth.de)


This work was supported in part by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant 491183248, and
in part by the Open Access Publishing Fund of the University of Bayreuth.

ABSTRACT The final goal of all industrial machine learning (ML) projects is to develop ML products
and rapidly bring them into production. However, it is highly challenging to automate and operationalize
ML products and thus many ML endeavors fail to deliver on their expectations. The paradigm of Machine
Learning Operations (MLOps) addresses this issue. MLOps includes several aspects, such as best practices,
sets of concepts, and development culture. However, MLOps is still a vague term and its consequences
for researchers and professionals are ambiguous. To address this gap, we conduct mixed-method research,
including a literature review, a tool review, and expert interviews. As a result of these investigations,
we contribute to the body of knowledge by providing an aggregated overview of the necessary principles,
components, and roles, as well as the associated architecture and workflows. Furthermore, we provide a
comprehensive definition of MLOps and highlight open challenges in the field. Finally, this work provides
guidance for ML researchers and practitioners who want to automate and operate their ML products with a
designated set of technologies.

INDEX TERMS CI/CD, DevOps, machine learning, MLOps, operations, workflow orchestration.

I. INTRODUCTION

Machine Learning (ML) has become an important technique to leverage the potential of data and allows businesses to be more innovative [1], efficient [2], and sustainable [3]. However, the success of many productive ML applications in real-world settings falls short of expectations [4]. A large number of ML projects fail—with many ML proofs of concept never progressing as far as production [5]. From a research perspective, this does not come as a surprise as the ML community has focused extensively on the building of ML models, but not on (a) building production-ready ML products and (b) providing the necessary coordination of the resulting, often complex ML system components and infrastructure, including the roles required to automate and operate an ML system in a real-world setting [6]. For instance, in many industrial applications, data scientists still manage ML workflows manually to a great extent, resulting in many issues during the operations of the respective ML solution [7].

To address these issues, the goal of this work is to examine how manual ML processes can be automated and operationalized so that more ML proofs of concept can be brought into production. In this work, we explore the emerging ML engineering practice "Machine Learning Operations"—MLOps for short—precisely addressing the issue of designing and maintaining productive ML. We take a holistic perspective to gain a common understanding of the involved components, principles, roles, and architectures. While existing research sheds some light on various specific aspects of MLOps, a holistic conceptualization, generalization, and clarification of ML systems design are still missing. Different perspectives and conceptions of the term "MLOps" might lead to misunderstandings and miscommunication, which, in turn, can lead to errors in the overall setup of the entire ML system. Thus, we ask the research question:

RQ: What is MLOps?

The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.


To answer that question, we conduct a mixed-method


research endeavor to (a) identify important principles of
MLOps, (b) carve out functional core components, (c) high-
light the roles necessary to successfully implement MLOps,
and (d) derive a general architecture for ML systems design.
In combination, these insights result in a definition of
MLOps, which contributes to a common understanding of the
term and related concepts.
Therefore, we hope to positively impact academic and
practical discussions by providing clear guidelines for pro-
fessionals and researchers alike with precise responsibilities.
These insights can assist in allowing more proofs of concept to make it into production by having fewer errors in the system's design and, finally, enabling more robust predictions in real-world environments.
The remainder of this article is structured as follows.
We will first elaborate on the necessary foundations and related work in the field. Next, we will give an overview of the utilized methodology, consisting of a literature review, a tool review, and an interview study. We then present the insights derived from the application of the methodology and conceptualize these by providing a unifying definition. We conclude the paper with a short summary, limitations, and outlook.

II. FOUNDATIONS OF DEVOPS

In the past, different software process models and development methodologies surfaced in the field of software engineering. Prominent examples include waterfall [8] and the agile manifesto [9]. Those methodologies have similar aims, namely to deliver production-ready software products. A concept called "DevOps" emerged in the years 2008/2009 and aims to reduce issues in software development [10], [11]. DevOps is more than a pure methodology and rather represents a paradigm addressing social and technical issues in organizations engaged in software development. It has the goal of eliminating the gap between development and operations and emphasizes collaboration, communication, and knowledge sharing. DevOps promotes automation through the tactic of continuous integration, continuous delivery, and continuous deployment (CI/CD), enabling fast, frequent, and reliable releases. Moreover, it is designed to ensure continuous testing, quality assurance, continuous monitoring, logging, and feedback loops. Due to the commercialization of DevOps, many DevOps tools are emerging, which can be differentiated into six groups [12], [13]: collaboration and knowledge sharing (e.g., Slack, Trello, GitLab wiki), source code management (e.g., GitHub, GitLab), build process (e.g., Maven), continuous integration (e.g., Jenkins, GitLab CI), deployment automation (e.g., Kubernetes, Docker), and monitoring and logging (e.g., Prometheus, Logstash). Cloud environments are increasingly equipped with ready-to-use DevOps tooling that is designed for cloud use, facilitating the efficient generation of value [14]. With this novel shift towards DevOps, developers need to care about what they develop, as they need to operate it as well. As empirical results demonstrate, DevOps ensures better software quality [15]. People in the industry, as well as academics, have gained a wealth of experience in software engineering using DevOps. This experience is now being used to automate and operationalize ML.

III. METHODOLOGY

To derive insights from the academic knowledge base while also drawing upon the expertise of practitioners from the field, we apply a mixed-method approach, as depicted in Figure 1. As a first step, we conduct a structured literature review [16], [17] to obtain an overview of relevant research. Furthermore, we review relevant tooling support in the field of MLOps to gain a better understanding of the technical components involved. Finally, we conduct semi-structured interviews [18], [19] with experts from different domains. On that basis, we conceptualize the term "MLOps" and elaborate on our findings by synthesizing literature and interviews in the next section ("Results").

FIGURE 1. Overview of the methodology.

A. LITERATURE REVIEW

To ensure that our results are based on scientific knowledge, we conduct a systematic literature review according to the method of Webster and Watson [16] and Kitchenham et al. [17]. After an initial exploratory search, we define our search query as follows: ((("DevOps" OR "CICD" OR "Continuous Integration" OR "Continuous Delivery" OR "Continuous Deployment") AND "Machine Learning") OR "MLOps" OR "CD4ML").

We query the scientific databases of Google Scholar, Web of Science, Science Direct, Scopus, and the Association for Information Systems eLibrary. It should be mentioned that the use of DevOps for ML, MLOps, and continuous practices in combination with ML is a relatively new field in academic literature. Thus, only a few peer-reviewed studies were available at the time of this research. Nevertheless, to gain experience in this area, the search included non-peer-reviewed literature as well. The search was performed in May 2021 and resulted in 1,864 retrieved articles. Of those, we screened 194 papers in detail. From that group,


27 articles were selected based on our inclusion and exclusion criteria (e.g., the term MLOps or DevOps and CI/CD in combination with ML was described in detail, the article was written in English, etc.). All 27 of these articles were peer-reviewed.

B. TOOL REVIEW

After going through 27 articles and eight interviews, various open-source tools, frameworks, and commercial cloud ML services were identified. These tools, frameworks, and ML services were reviewed to gain an understanding of the


technical components of which they consist. An overview of the identified tools is depicted in Table 1.

TABLE 1. List of evaluated technologies.

C. INTERVIEW STUDY

To answer the research questions with insights from practice, we conduct semi-structured expert interviews according to Myers and Newman [18]. One major aspect in the research design of expert interviews is choosing an appropriate sample size [20]. We apply a theoretical sampling approach [21], which allows us to choose experienced interview partners to obtain high-quality data. Such data can provide meaningful insights with a limited number of interviews. To get an adequate sample group and reliable insights, we use LinkedIn—a social network for professionals—to identify experienced ML professionals with profound MLOps knowledge on a global level. To gain insights from various perspectives, we choose interview partners from different organizations and industries, different countries and nationalities, as well as different genders. Interviews are conducted until no new categories and concepts emerge in the analysis of the data. According to Glaser and Strauss [21], this stage is called "theoretical saturation." In total, we conduct eight interviews with experts (pseudonymized with α–θ), whose details are depicted in Table 2. All interviews are conducted between June and August 2021.

TABLE 2. List of interview partners.

With regard to the interview design, we prepare a semi-structured guide with several questions, documented as an interview script [18]. During the interviews, "soft laddering" is used with "how" and "why" questions to probe the interviewees' means-end chain [19]. This methodical approach allowed us to gain additional insight into the experiences of the interviewees when required. All interviews are recorded and then transcribed. To evaluate the interview transcripts, we use an open coding scheme [20]. The open coding process allows the data to be broken down in an analytical manner so that conceptually similar topics can be grouped into categories and subcategories. These categories are called "codes." Concepts were identified when they appeared multiple times in different interviews [21].

IV. RESULTS

We apply the described methodology and structure our resulting insights into a presentation of important principles, their resulting instantiation as components, the description of necessary roles, as well as a suggestion for the architecture and workflow resulting from the combination of these aspects. Finally, we derive the conceptualization of the term and provide a definition of MLOps.


FIGURE 2. Implementation of principles within technical components.

A. PRINCIPLES

A principle is viewed as a general or basic truth, a value, or a guide for behavior. In the context of MLOps, a principle is a guide to how things should be realized in MLOps and is closely related to the term "best practices" from the professional sector. Based on the outlined methodology, we identified nine principles required to realize MLOps. Figure 2 provides an illustration of these principles and links them to the components with which they are associated.

P1 CI/CD automation. CI/CD automation provides continuous integration, continuous delivery, and continuous deployment. It carries out the build, test, delivery, and deploy steps. It provides fast feedback to developers regarding the success or failure of certain steps, thus increasing the overall productivity. CI/CD puts ideas of DevOps into practice. Therefore, CI/CD can be seen as a DevOps tactic [6], [7], [22], [23], [α, β, θ].

P2 Workflow orchestration. Workflow orchestration coordinates the tasks of an ML workflow pipeline according to directed acyclic graphs (DAGs). DAGs define the task execution order by considering relationships and dependencies [7], [24], [25], [26], [α, β, γ, δ, ζ, η].

P3 Reproducibility. Reproducibility is the ability to reproduce an ML experiment and obtain the exact same results [23], [27], [α, β, δ, ε, η].

P4 Versioning. Versioning ensures the versioning of data, model, and code to enable not only reproducibility, but also traceability (for compliance and auditing reasons) [23], [27], [α, β, δ, ε, η].

P5 Collaboration. Collaboration ensures the possibility to work collaboratively on data, model, and code. Besides the technical aspect, this principle emphasizes a collaborative and communicative work culture aiming to reduce domain silos between different roles [7], [25], [27], [α, δ, θ].

P6 Continuous ML training & evaluation. Continuous training (CT) means periodic retraining of the ML model based on new feature data. Continuous training is enabled through the support of a monitoring component, a feedback loop, and an automated ML workflow pipeline. Continuous training always includes an evaluation run to assess the change in model quality [23], [24], [28], [29], [β, δ, η, θ]. In general, to manage the costs of retraining, it should be carefully considered which update frequency is necessary for the use case (e.g., daily vs. weekly). A powerful tool to decrease the cost of retraining is the use of online learning in large-scale web applications, which benefits from iterative training steps compared to training on a full data set. This way, the model can also reflect recent impactful events such as catastrophes. A multitude of online learning optimization algorithms is available [30] (see the sketch after this list for a minimal online-learning example).

P7 ML metadata tracking/logging. Metadata is tracked and logged for each orchestrated ML workflow task. Metadata tracking and logging is required for each training job iteration (e.g., training date and time, duration, etc.), including the model-specific metadata—e.g., used parameters and the resulting performance metrics, as well as model lineage (data and code used)—to ensure the full traceability of experiment runs [6], [7], [31], [32], [α, β, δ, ε, ζ, η, θ].

P8 Continuous monitoring. Continuous monitoring implies the periodic assessment of data, model, code, infrastructure resources, and model serving performance (e.g., prediction accuracy) to detect potential errors or changes that influence the product quality [7], [23], [28], [32], [33], [α, β, γ, δ, ε, ζ, η].

P9 Feedback loops. Multiple feedback loops are required to integrate insights from the quality assessment step into the development or engineering process (e.g., a feedback loop from the experimental model engineering stage to the previous feature engineering stage). Another feedback loop is required from the monitoring component (e.g., observing the model serving performance) to the scheduler to enable the retraining [7], [23], [24], [34], [35], [α, β, δ, ζ, η, θ].
B. TECHNICAL COMPONENTS

After identifying the principles that need to be incorporated into MLOps, we now elaborate on the precise components and implement them in the ML systems design. In the following, the components are listed and described in a generic way with their essential functionalities. The references in brackets refer to the respective principles that the technical components are implementing.

C1 CI/CD Component (P1, P6, P9). The CI/CD component ensures continuous integration, continuous delivery, and continuous deployment. It takes care of the build, test, delivery, and deploy steps. It provides rapid feedback to developers regarding the success or failure of certain steps, thus increasing the overall productivity [6], [7], [22], [23], [24], [28], [α, β, γ, ε, ζ, η]. Examples are Jenkins [7], [24] and GitHub Actions [η]. For implementing an MLOps workflow, this means the automated linting, assembly, and registry of training, inference, and application code into a shippable format (e.g., a Python wheel), as well as the execution of unit and integration test cases. This automation should be an idempotent process using dynamically assigned resources from the CI/CD tool that results in a binary or archive-based package.

C2 Source Code Repository (P4, P5). The training, inference, and application source code is versioned in a repository. It allows multiple developers to commit and merge their


code [23], [24], [36], [37], [38], [α, β, γ, ζ, θ]. Examples include Bitbucket [39], [ζ], GitLab [24], [39], [ζ], GitHub [37], [ζ, η], and Gitea [23].

C3 Workflow Orchestration Component (P2, P3, P6). The workflow orchestration component offers task orchestration of an ML workflow via directed acyclic graphs (DAGs). These graphs represent the execution order and artifact usage of the single steps of the workflow. A workflow uses, for example, packaged code artifacts in the respective process step, like extracting data, training, inference, or embedding of a model binary into an application [6], [7], [23], [26], [31], [α, β, γ, δ, ε, ζ, η]. Examples include Apache Airflow [α, ζ], Kubeflow Pipelines [ζ], Watson Studio Pipelines [γ], Luigi [ζ], AWS SageMaker Pipelines [β], and Azure Pipelines [ε]. In theory, CI/CD tools could also be used to schedule the triggering of specific tasks sequentially; however, the complexity of data engineering or ML pipeline tasks has increased the need for a tool specifically designed for the purpose of workflow or task orchestration. These workflow orchestration tools make it easier to efficiently manage interrelated and interdependent tasks, because they are specifically designed to manage complex task chains [40]. The sketch below illustrates the underlying DAG-ordering idea.
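As a tool-agnostic illustration of how an orchestrator derives an execution order from a DAG, here is a minimal sketch using Python's standard-library graphlib; the task names and the run function are hypothetical placeholders, not tied to any of the tools named above.

```python
from graphlib import TopologicalSorter

# Each key maps a task to the set of tasks it depends on (its predecessors).
# This mirrors how orchestration tools declare dependencies between steps.
ml_workflow = {
    "extract_data": set(),
    "prepare_data": {"extract_data"},
    "train_model": {"prepare_data"},
    "evaluate_model": {"train_model"},
    "export_model": {"evaluate_model"},
}

def run(task: str) -> None:
    """Hypothetical placeholder for executing one packaged workflow step."""
    print(f"running {task}")

# static_order() yields tasks so that every dependency runs first.
for task in TopologicalSorter(ml_workflow).static_order():
    run(task)
```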
C4 Feature Store System (P3, P4). A feature store system ensures central storage of commonly used features. It has two databases configured: one database as an offline feature store to serve features with normal latency for experimentation, and one database as an online store to serve features with low latency for predictions in production [25], [28], [α, β, ζ, ε, θ]. Examples include Google Feast [ζ], Amazon AWS Feature Store [β, ζ], Tecton.ai, and Hopsworks.ai [ζ]. This is where most of the data for training ML models will come from. Moreover, data can also come directly from any kind of data store. A feature store poses complex requirements, which are highly dependent on the use case. Its databases can be hosted on on-premises infrastructure or in the cloud. However, scalability is typically realized with cloud infrastructure. Most use cases have a read-heavy workload, combined with a batch- or streaming-based ingestion pattern on (very) large data sets. Such high scalability can be achieved with distributed file systems [41], [42] or distributed databases [43], [44] combined with parallel and distributed data processing algorithms (e.g., MapReduce, or a more high-level API like Spark).
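For illustration, here is a minimal sketch of the offline/online split using the open-source Feast API mentioned above. It assumes an existing Feast feature repository in the working directory with a hypothetical driver_stats feature view keyed by driver_id; the repository, feature names, and entity values are illustrative, and API details vary between Feast versions.

```python
import pandas as pd
from feast import FeatureStore

# Assumes a Feast repository (feature_store.yaml) in the current directory
# defining a hypothetical "driver_stats" feature view keyed by "driver_id".
store = FeatureStore(repo_path=".")

# Offline store: point-in-time-correct features for experimentation/training.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2023-01-01", "2023-01-01"]),
    }
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_stats:avg_daily_trips", "driver_stats:conv_rate"],
).to_df()

# Online store: low-latency feature lookup at prediction time.
online_features = store.get_online_features(
    features=["driver_stats:avg_daily_trips", "driver_stats:conv_rate"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
print(online_features)
```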
C5 Model Training Infrastructure (P6). The model training infrastructure provides the foundational computation resources, e.g., CPUs, RAM, and GPUs. The provided infrastructure can be either distributed or non-distributed. In general, a scalable and distributed infrastructure is recommended [7], [23], [27], [28], [32], [33], [37], [45], [46], [δ, ζ, η, θ]. Examples include local machines (not scalable) or cloud computation [33], [η, θ], as well as non-distributed or distributed computation (several worker nodes) [7], [37]. Frameworks supporting computation are Kubernetes [η, θ] and Red Hat OpenShift [γ]. Typically, deep learning workloads (training and inference) are matrix-multiplication-heavy and therefore computation-bound. GPUs are optimized towards this type of workload and should be the primary focus for compute node specification. In edge devices, where storage and computation power are limited, quantized neural nets with low-precision floating-point operations [47], in combination with pruning and Huffman coding, should be investigated [48].
C6 Model Registry (P3, P4). The model registry centrally stores the trained ML models together with their metadata. It has two main functionalities: storing the ML artifact and storing the ML metadata (see C7) [7], [24], [25], [34], [35], [α, β, γ, ε, ζ, η, θ]. Advanced storage examples include MLflow [α, η, ζ], AWS SageMaker Model Registry [ζ], Microsoft Azure ML Model Registry [ζ], and Neptune.ai [α]. Simple storage examples include Microsoft Azure Storage, Google Cloud Storage, and Amazon AWS S3 [23].
C7 ML Metadata Stores (P4, P7). ML metadata stores allow for the tracking of various kinds of metadata, e.g., for each orchestrated ML workflow pipeline task. Another metadata store can be configured within the model registry for tracking and logging the metadata of each training job (e.g., training date and time, duration, etc.), including the model-specific metadata—e.g., used parameters and the resulting performance metrics, as well as model lineage (data and code used) [7], [25], [31], [37], [49], [α, β, δ, ζ, θ]. Examples include orchestrators with built-in metadata stores tracking each step of experiment pipelines [α], such as Kubeflow Pipelines [α, ζ], AWS SageMaker Pipelines [α, ζ], Azure ML, and IBM Watson Studio [γ]. MLflow provides an advanced metadata store in combination with the model registry [6], [31].
C8 Model Serving Component (P1). The model serving component can be configured for different purposes. Examples are online inference for real-time predictions or batch inference for predictions using large volumes of input data. The serving can be provided, e.g., via a REST API. As a foundational infrastructure layer, a scalable and distributed model serving infrastructure is recommended [23], [27], [33], [37], [39], [46], [α, β, δ, ζ, η, θ]. One example of a model serving component configuration is the use of Kubernetes and Docker technology to containerize the ML model, and leveraging a Python web application framework like Flask [24] with an API for serving [α]. Other Kubernetes-supported frameworks are KServing of Kubeflow [α], TensorFlow Serving, and Seldon.io serving [27]. Inferencing could also be realized with Apache Spark for batch predictions [θ]. Examples of cloud services include Microsoft Azure ML REST API [ε], AWS SageMaker Endpoints [α, β], IBM Watson Studio [γ], and Google Vertex AI prediction service [δ]. The actual deployment of the model depends on the use case and typically falls into one of these categories: real-time, batch, or serverless inference. Real-time inference can be achieved by hosting the model in a RESTful web service, batch inference can be an idempotent MapReduce workflow, and serverless inference is used when cost-efficient and scalable serving is required.
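To make the Flask-based configuration tangible, here is a minimal serving sketch; the model file name, feature layout, and endpoint path are hypothetical choices, and a production deployment would add input validation, logging, and a WSGI server inside the container.

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumes a model serialized by the CD pipeline; the file name is illustrative.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[0.1, 0.2, ...]]},
    # e.g., rows fetched from the online feature store (C4).
    payload = request.get_json(force=True)
    features = np.asarray(payload["features"])
    predictions = model.predict(features).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    # For local testing only; a container would typically run gunicorn instead.
    app.run(host="0.0.0.0", port=8080)
```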


C9 Monitoring Component (P8, P9). The monitoring component takes care of the continuous monitoring of the
model serving performance (e.g., prediction accuracy). Addi-
tionally, monitoring of the ML infrastructure, CI/CD, and
orchestration are required [7], [23], [24], [28], [32], [33],
[50], [α, ζ, η, θ]. Examples include Prometheus with Grafana
[η, ζ], ELK stack (Elasticsearch, Logstash, and Kibana) [α,
η, ζ], and simply TensorBoard [θ]. Examples with built-in
monitoring capabilities are Kubeflow [θ], MLflow [η], and
AWS SageMaker model monitor or cloud watch [ζ].

C. ROLES
After describing the principles and their resulting instanti-
ation of components, we identify necessary roles in order
to realize MLOps in the following. MLOps is an interdis-
ciplinary group process, and the interplay of different roles
is crucial to design, manage, automate, and operate an ML
system in production. In the following, every role, its purpose,
and related tasks are briefly described:

R1 Business Stakeholder (similar roles: Product Owner, Project Manager). The business stakeholder defines the business goal to be achieved with ML and takes care of the communication side of the business, e.g., presenting the return on investment (ROI) generated with an ML product [7], [24], [45], [α, β, δ, θ].

R2 Solution Architect (similar role: IT Architect). The solution architect designs the architecture and defines the technologies to be used, following a thorough evaluation [7], [24], [α, ζ].

R3 Data Scientist (similar roles: ML Specialist, ML Developer). The data scientist translates the business problem into an ML problem and takes care of the model engineering, including the selection of the best-performing algorithm and hyperparameters [7], [25], [32], [33], [α, β, γ, δ, ε, ζ, η, θ].

R4 Data Engineer (similar role: DataOps Engineer). The data engineer builds up and manages data and feature engineering pipelines. Moreover, this role ensures proper data ingestion to the databases of the feature store system [25], [26], [32], [α, β, γ, δ, ε, ζ, η, θ].

R5 Software Engineer. The software engineer applies software design patterns, widely accepted coding guidelines, and best practices to turn the raw ML problem into a well-engineered product [32], [α, γ].

R6 DevOps Engineer. The DevOps engineer bridges the gap between development and operations and ensures proper CI/CD automation, ML workflow orchestration, model deployment to production, and monitoring [7], [22], [25], [51], [α, β, γ, ε, ζ, η, θ].

R7 ML Engineer/MLOps Engineer. The ML engineer or MLOps engineer combines aspects of several roles and thus has cross-domain knowledge. This role incorporates skills from data scientists, data engineers, software engineers, DevOps engineers, and backend engineers (see Figure 3). This cross-domain role builds up and operates the ML infrastructure, manages the automated ML workflow pipelines and model deployment to production, and monitors both the model and the ML infrastructure [7], [24], [25], [32], [α, β, γ, δ, ε, ζ, η, θ].

FIGURE 3. Roles and their intersections contributing to the MLOps paradigm.

V. ARCHITECTURE AND WORKFLOW

On the basis of the identified principles, components, and roles, we derive a generalized MLOps end-to-end architecture to give ML researchers and practitioners proper guidance. It is depicted in Figure 4. Additionally, we depict the workflows, i.e., the sequence in which the different tasks are executed in the different stages. The artifact was designed to be technology-agnostic. Therefore, ML researchers and practitioners can choose the best-fitting technologies and frameworks for their needs. This means the MLOps process and components can be built out of "best-of-breed" open-source tools, out of enterprise solutions, or out of a mix-and-match combination of both. Enterprise software and cloud services often allow connections to open-source tools via their APIs and vice versa, and since the open-source tool market is growing rapidly, new combination options appear frequently; it is therefore worth keeping an eye on the newest developments. However, there are also constraints when it comes to API connections and combinations, and it is generally hard to say which technologies combine well and which do not. With the newly introduced examples of applications and the precise mentioning of tools, we demonstrate possible combinations.

As depicted in Figure 4, we illustrate an end-to-end process, from MLOps product initiation to the model serving. It includes (A) the MLOps product initiation steps; (B) the feature engineering pipeline, including the data ingestion to the feature store; (C) the experimentation; and (D) the automated ML workflow pipeline up to the model serving.

FIGURE 4. End-to-end MLOps architecture and workflow with functional components and roles.

(A) MLOps product initiation. (1) The business stakeholder (R1) analyzes the business and identifies a potential business problem that can be solved using ML. (2) The solution architect (R2) defines the architecture design for the overall ML system and decides on the technologies to be used after a thorough evaluation. (3) The data scientist (R3) derives an ML problem—such as whether regression or classification should be used—from the business goal. (4) The data engineer (R4) and the data scientist (R3) work together to understand which data is required to solve the problem. (5) Once the answers are clarified, the data engineer (R4) and data scientist (R3) collaborate to locate the raw data sources for the initial data analysis. They check the distribution and quality of the data, as well as performing validation checks. Furthermore, they ensure that the incoming data from the data sources is labeled, meaning that a target attribute is


known, as this is a mandatory requirement for supervised ML. In this example, the data sources already had labeled data available, as the labeling step was covered during an upstream process.

(B1) Requirements for feature engineering pipeline. The features are the relevant attributes required for model training. After the initial understanding of the raw data and the initial data analysis, the fundamental requirements for the feature engineering pipeline are defined, as follows: (6) The data engineer (R4) defines the data transformation rules (normalization, aggregations) and cleaning rules to bring the data
into a usable format. (7) The data scientist (R3) and data engineer (R4) together define the feature engineering rules, such as the calculation of new and more advanced features based on other features. These initially defined rules must be iteratively adjusted by the data scientist (R3), either based on the feedback coming from the experimental model engineering stage or from the monitoring component observing the model performance.

(B2) Feature engineering pipeline. The initially defined requirements for the feature engineering pipeline are taken by the data engineer (R4) and software engineer (R5) as a starting point to build up the prototype of the feature engineering pipeline. The initially defined requirements and rules are updated according to the iterative feedback coming either from the experimental model engineering stage or from the monitoring component observing the model's performance in production.

As a foundational requirement, the data engineer (R4) defines the code required for the CI/CD (C1) and orchestration component (C3) to ensure the task orchestration of the feature engineering pipeline. This role also defines the underlying infrastructure resource configuration. (8) First, the feature engineering pipeline connects to the raw data, which can be (for instance) streaming data, static batch data, or data from any cloud storage. (9) The data will be extracted from the data sources. (10) The data preprocessing begins with data transformation and cleaning tasks. The transformation rule artifact defined in the requirement gathering stage serves as input for this task, and the main aim of this task is to bring the data into a usable format. These transformation rules are continuously improved based on the feedback. (11) The feature engineering task calculates new and more advanced features based on other features. The predefined feature engineering rules serve as input for this task. These feature engineering rules are continuously improved based on the feedback. (12) Lastly, a data ingestion job loads batch or streaming data into the feature store system (C4). The target can either be the offline or online database (or any kind of data store).

An example of the implementation of an entire feature engineering pipeline can be found in Esmaeilzadeh et al. [52], who implemented an NLP pipeline with Apache Spark. As another example, Xu [53] demonstrates how a financial institution may use Spark to process and analyze large amounts of customer credit data, such as credit history, income, and demographics. The data is then transformed and cleaned using Spark's DataFrame and SQL functionality, and various feature engineering techniques are applied to create a set of relevant features for the credit risk model. These features can then be passed through an ML pipeline, also implemented in Spark, to train and evaluate a predictive model for assessing credit risk (a sketch of such a pipeline follows below). In addition, Apache Kafka can be used for near real-time streaming data ingestion into the Spark-based feature engineering pipeline [54]. However, to some extent, a traditional ETL tool can be used to build a feature engineering pipeline [55].
(C) Experimentation. Most tasks in the experimentation stage are led by the data scientist (R3), including the initial configuration of the hardware and runtime environment. The data scientist is supported by the software engineer (R5). (13) The data scientist (R3) connects to the feature store system (C4) for the data analysis. (Alternatively, the data scientist (R3) can also connect to the raw data for an initial analysis.) In case of any required data adjustments, the data scientist (R3) reports the required changes back to the data engineering zone (feedback loop). (14) Then the preparation and validation of the data coming from the feature store system is required. This task also includes the train and test split dataset creation. (15) The data scientist (R3) estimates the best-performing algorithm and hyperparameters, and the model training is then triggered with the training data (C5). The software engineer (R5) supports the data scientist (R3) in the creation of well-engineered model training code. (16) Different model parameters are tested and validated interactively during several rounds of model training. Once the performance metrics indicate good results, the iterative training stops. The best-performing model parameters are identified via parameter tuning. The model training task and model validation task are then iteratively repeated; together, these tasks can be called "model engineering." The model engineering aims to identify the best-performing algorithm and hyperparameters for the model. (17) The data scientist (R3) exports the model and commits the code to the repository.

As a foundational requirement, either the DevOps engineer (R6) or the ML engineer (R7) defines the code for the (C2) automated ML workflow pipeline and commits it to the repository. Once either the data scientist (R3) commits a new ML model or the DevOps engineer (R6) and the ML engineer (R7) commit new ML workflow pipeline code to the repository, the CI/CD component (C1) detects the updated code and automatically triggers the CI/CD pipeline carrying out the build, test, and delivery steps. The build step creates artifacts containing the ML model and tasks of the ML workflow pipeline. The test step validates the ML model and ML workflow pipeline code. The delivery step pushes the versioned artifact(s)—such as images—to the artifact store (e.g., image registry).

Typical technologies used for the experimentation step are notebook-based solutions like the ones from Jupyter. One example of an industry case where ML experiments


are performed with a notebook-based environment is in the field of natural language processing (NLP) [56]. A company that provides NLP-based services, such as sentiment analysis, text summarization, and named entity recognition, may use Jupyter notebooks to perform ML experiments on large amounts of text data. The company's data scientists use Jupyter notebooks to prepare the data. Then, they can train, evaluate, and optimize different ML models, such as deep learning models, and test the results. To track the experiments with the textual data, i.e., the tracking of metadata and the storing of the resulting models, commonly used solutions in combination with Jupyter are, among others, MLflow (e.g., Obeid [57] for assessing the risk of COVID-19 based on health records) or Neptune.ai (e.g., Aljabri [58] for NLP-based fake news detection).
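As a minimal sketch of this kind of experiment tracking with MLflow, runnable from a notebook cell; the synthetic data, experiment name, and hyperparameter grid are illustrative assumptions, not an experiment from the cited studies.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("nlp-risk-experiments")  # hypothetical experiment name

# Each hyperparameter trial becomes one tracked run (P7: metadata logging).
for C in [0.01, 0.1, 1.0]:
    with mlflow.start_run():
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        accuracy = accuracy_score(y_test, model.predict(X_test))
        mlflow.log_param("C", C)
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.sklearn.log_model(model, "model")  # stores the resulting model
```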
(D) Automated ML workflow pipeline. The DevOps engineer (R6) and the ML engineer (R7) take care of the management of the automated ML workflow pipeline. They also manage the runtime environments and the underlying model training infrastructure in the form of hardware resources and frameworks supporting computation, such as Kubernetes (C5). The workflow orchestration component (C3) orchestrates the tasks of the automated ML workflow pipeline. For each task, the required artifacts (e.g., images) are pulled from the artifact store (e.g., image registry). Each task can be executed via an isolated environment (e.g., containers). Finally, the workflow orchestration component (C3) gathers metadata for each task in the form of logs, completion time, and so on.

Once the automated ML workflow pipeline is triggered, each of the following tasks is managed automatically: (18) automated pulling of the versioned features from the feature store systems (data extraction). Depending on the use case, features are extracted from either the offline or online database (or any kind of data store). (19) Automated data preparation and validation; in addition, the train and test split is defined automatically. (20) Automated final model training on new unseen data (versioned features). The algorithm and hyperparameters are already predefined based on the settings of the previous experimentation stage. The model is retrained and refined. (21) Automated model evaluation and iterative adjustments of hyperparameters are executed, if required. Once the performance metrics indicate good results, the automated iterative training stops. The automated model training task and the automated model validation task can be iteratively repeated until a good result has been achieved. (22) The trained model is then exported and (23) pushed to the model registry (C6), where it is stored, e.g., as code or containerized together with its associated configuration and environment files.
For all training job iterations, the ML metadata store (C7) records metadata such as the parameters used to train the model and the resulting performance metrics. This also includes the tracking and logging of the training job ID, training date and time, duration, and sources of artifacts. Additionally, the model-specific metadata called "model lineage," combining the lineage of data and code, is tracked for each newly registered model. This includes the source and version of the feature data and model training code used to train the model. Also, the model version and status (e.g., staging or production-ready) are recorded.

Once the status of a well-performing model is switched from staging to production, it is automatically handed over to the DevOps engineer or ML engineer for model deployment. From there, the (24) CI/CD component (C1) triggers the continuous deployment pipeline. The production-ready ML model and the model serving code are pulled (initially prepared by the software engineer (R5)). The continuous deployment pipeline carries out the build and test steps of the ML model and serving code and deploys the model for production serving. The (25) model serving component (C8) makes predictions on new, unseen data coming from the feature store system (C4). This component can be designed by the software engineer (R5) as online inference for real-time predictions or as batch inference for predictions concerning large volumes of input data. For real-time predictions, features must come from the online database (low latency), whereas for batch predictions, features can be served from the offline database (normal latency). Model-serving applications are often configured within a container, and prediction requests are handled via a REST API. When deploying an ML/AI application, it is good practice to use A/B testing to determine in a real-world scenario which model performs better—for example, deploying a "challenger model" in addition to an existing "champion model" and collecting feedback to find out which one performs better, as when predicting hotel booking cancellations [61].
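A champion/challenger setup can be as simple as randomized traffic splitting in the serving layer; the following sketch illustrates the idea, with the traffic share, model objects, and logging hook as hypothetical assumptions.

```python
import random

def ab_predict(features, champion, challenger, challenger_share=0.1, log=print):
    """Route a small share of prediction traffic to the challenger model.

    `champion` and `challenger` are any objects with a predict() method;
    `log` stands in for the feedback-collection mechanism (step 27).
    """
    variant = "challenger" if random.random() < challenger_share else "champion"
    model = challenger if variant == "challenger" else champion
    prediction = model.predict([features])[0]
    # Record which variant served the request so real-world outcomes (e.g.,
    # actual booking cancellations) can later be attributed to each model.
    log({"variant": variant, "prediction": prediction})
    return prediction
```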
As a foundational requirement, the ML engineer (R7) manages the model-serving computation infrastructure. The (26) monitoring component (C9) continuously observes the model-serving performance and infrastructure in real time. Once a certain threshold is reached, such as the detection of low prediction accuracy, the information is forwarded via the feedback loop. The (27) feedback loop is connected to the monitoring component (C9) and ensures fast and direct feedback, allowing for more robust and improved predictions. It enables continuous training, retraining, and improvement. With the support of the feedback loop, information is transferred from the model monitoring component to several upstream receiver points, such as the experimental stage, the data engineering zone, and the scheduler (trigger). The feedback to the experimental stage is taken forward by the data scientist for further model improvements. The feedback to the data engineering zone allows for the adjustment of the features prepared for the feature store system. Additionally, the detection of concept drifts as a feedback mechanism can enable (28) continuous training. Concept drifts occur in real-world applications when the input data changes over time, e.g., when a sensor breaks. Decreased prediction accuracy due


to concept drift can be detected by a certain concept drift detection algorithm [59]. Once the model-monitoring com-
ponent (C9) detects a drift in the data [60], the information is
forwarded to the scheduler, which then triggers the automated
ML workflow pipeline for retraining (continuous training).
As explained, a change in adequacy of the deployed model
can be detected using distribution comparisons to identify
drift. Retraining is not only triggered automatically when
a statistical threshold is reached; it can also be triggered
when new feature data is available, or it can be scheduled
periodically.
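One simple distribution-comparison approach to the drift detection just described is a two-sample Kolmogorov–Smirnov test per feature; this sketch uses SciPy, and the window sizes, threshold, and trigger action are hypothetical choices.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Compare the live feature distribution against a training-time reference.

    A small p-value means the two samples are unlikely to come from the
    same distribution, which we treat as evidence of drift.
    """
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Hypothetical usage inside the monitoring component (C9):
rng = np.random.default_rng(0)
reference_window = rng.normal(loc=0.0, size=5000)  # features seen at training time
live_window = rng.normal(loc=0.5, size=1000)       # recent serving-time features

if detect_drift(reference_window, live_window):
    # Forward the signal to the scheduler, which triggers retraining (step 28).
    print("drift detected: trigger automated ML workflow pipeline")
```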
Typical technologies supporting the automated ML workflow pipeline are, among others, Apache Airflow, Kubeflow Pipelines, IBM Watson Studio Pipelines, or SageMaker Pipelines. One example of an industry use case for an automated machine learning workflow pipeline using Airflow is in the field of online advertising [62]. A company may use Airflow to automate the process of training and deploying machine learning models for ad targeting and optimization. The pipeline starts by extracting, transforming, and loading large amounts of data from various sources, such as website clickstream data, user demographics, and campaign performance data. This data is then passed through a series of preprocessing and feature engineering steps, implemented as Airflow operators. Next, different machine learning models are trained and evaluated on the processed data, also using Airflow operators. The best-performing model is then deployed to a production environment, where it is used to make real-time ad targeting decisions. In this case, Apache Airflow is used to automate the entire process, including scheduling, monitoring, and re-running failed tasks. A sketch of such a pipeline definition follows below.
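A minimal Airflow 2.x sketch of this kind of retraining pipeline; the task callables are empty placeholders, and the DAG id, schedule, and task breakdown are illustrative assumptions rather than details from [62].

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task implementations; real operators would call Spark jobs,
# training scripts, and deployment hooks.
def extract_data():
    print("extract clickstream, demographics, and campaign data")

def engineer_features():
    print("run preprocessing and feature engineering")

def train_and_evaluate():
    print("train candidate models and pick the best performer")

def deploy_model():
    print("push the best model to production serving")

with DAG(
    dag_id="ad_targeting_retraining",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_data", python_callable=extract_data)
    features = PythonOperator(task_id="engineer_features", python_callable=engineer_features)
    train = PythonOperator(task_id="train_and_evaluate", python_callable=train_and_evaluate)
    deploy = PythonOperator(task_id="deploy_model", python_callable=deploy_model)

    # DAG edges define the execution order (P2: workflow orchestration).
    extract >> features >> train >> deploy
```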
FIGURE 5. Intersection of disciplines of the MLOps paradigm.

VI. CONCEPTUALIZATION

With the findings at hand, we conceptualize the literature and interviews. It becomes obvious that the term MLOps is positioned at the intersection of machine learning, software engineering, DevOps, and data engineering (see Figure 5). We define MLOps as follows:

MLOps (Machine Learning Operations) is a paradigm, including aspects like best practices, sets of concepts, as well as a development culture when it comes to the end-to-end conceptualization, implementation, monitoring, deployment, and scalability of machine learning products. Most of all, it is an engineering practice that leverages three contributing disciplines: machine learning, software engineering (especially DevOps), and data engineering. MLOps is aimed at productionizing machine learning systems by bridging the gap between development (Dev) and operations (Ops). Essentially, MLOps aims to facilitate the creation of machine learning products by leveraging these principles: CI/CD automation; workflow orchestration; reproducibility; versioning of data, model, and code; collaboration; continuous ML training and evaluation; ML metadata tracking and logging; continuous monitoring; and feedback loops.

VII. OPEN CHALLENGES

Several challenges for adopting MLOps have been identified after conducting the literature review, tool review, and interview study. These open challenges have been organized into the categories of organizational, ML system, and operational challenges.

A. ORGANIZATIONAL CHALLENGES

The mindset and culture of data science practice is a typical challenge in organizational settings [63]. As our insights from literature and interviews show, to successfully develop and run ML products, there needs to be a culture shift away from model-driven machine learning toward a product-oriented discipline [γ]. The recent trend of data-centric AI also addresses this aspect by putting more focus on the data-related aspects taking place prior to the ML model building. Especially the roles associated with these activities should have a product-focused perspective when designing ML products [γ]. A great number of skills and individual roles are required for MLOps [β]. As our identified sources point out, there is a lack of highly skilled experts for these roles—especially with regard to architects, data engineers, ML engineers, and DevOps engineers [26], [32], [38], [α, ε]. This is related to the necessary education of future professionals, as MLOps is typically not part of data science education [33], [γ]. Posoldova [6] further stresses this aspect by remarking that students should not only learn about model creation, but must also learn about the technologies and components necessary to build functional ML products.

Data scientists alone cannot achieve the goals of MLOps. A multi-disciplinary team is required [25]; thus, MLOps needs to be a group process [α]. This is often hindered because teams work in silos rather than in cooperative setups [α]. Additionally, different knowledge levels and specialized terminologies make communication difficult. To lay the foundations for more fruitful setups, the respective decision-makers


need to be convinced that an increased MLOps maturity [2] M. K. Gourisaria, R. Agrawal, G. M. Harshvardhan, M. Pandey, and
and a product-focused mindset will yield clear business S. S. Rautaray, ‘‘Application of machine learning in industry 4.0,’’ in
Machine Learning: Theoretical Foundations and Practical Applications.
improvements [γ]. Cham, Switzerland: Springer, 2021, pp. 57–87.
[3] A. D. L. Heras, A. Luque-Sendra, and F. Zamora-Polo, ‘‘Machine learning
B. ML SYSTEM CHALLENGES technologies for sustainability in smart cities in the post-COVID era,’’
Sustainability, vol. 12, no. 22, p. 9320, Nov. 2020.
A major challenge with regard to MLOps systems is design- [4] R. Kocielnik, S. Amershi, and P. N. Bennett, ‘‘Will you accept an imperfect
ing for fluctuating demand, especially in relation to the AI?: Exploring designs for adjusting end-user expectations of AI systems,’’
process of ML training [33]. This stems from potentially in Proc. CHI Conf. Hum. Factors Comput. Syst., May 2019, pp. 1–14.
voluminous and varying data [28], which makes it difficult [5] R. van der Meulen and T. McCall. (2018). Gartner Says Nearly Half
of CIOs Are Planning to Deploy Artificial Intelligence. Accessed:
to precisely estimate the necessary infrastructure resources Dec. 4, 2021. [Online]. Available: https://www.gartner.com/en/newsroom/
(CPU, RAM, and GPU) and requires a high level of flexibility press-releases/2018-02-13-gartner-says-nearly-half-of-cios-are-planning-
in terms of scalability of the infrastructure [7], [33], [δ]. to-deploy-artificial-intelligence
[6] A. Posoldova, ‘‘Machine learning pipelines: From research to production,’’
IEEE Potentials, vol. 39, no. 6, pp. 38–42, Nov. 2020.
C. OPERATIONAL CHALLENGES [7] L. E. Lwakatare, I. Crnkovic, E. Rånge, and J. Bosch, ‘‘From a data science
In productive settings, it is challenging to operate ML driven process to a continuous delivery process for machine learning
systems,’’ in Product-Focused Software Process Improvement (Lecture
manually due to different stacks of software and hardware Notes in Computer Science), vol. 12562. Springer, 2020, pp. 185–201, doi:
components and their interplay as well as the selection of 10.1007/978-3-030-64148-1_12.
both ([64], [65]. Therefore, robust automation is required [8] W. W. Royce, ‘‘Managing the development of large software systems,’’ in
Proc. IEEE WESCON, Aug. 1970, pp. 1–9.
[24], [33]. Also, a constant incoming stream of new data
[9] K. Beck et al., ‘‘The agile manifesto,’’ 2001. [Online]. Available:
forces retraining capabilities. This is a repetitive task which, http://agilemanifesto.org/
again, requires a high level of automation [66], [θ]. These [10] P. Debois. (2009). Patrick Debois Devopsdays Ghent. Accessed:
repetitive tasks yield a large number of artifacts that require Mar. 25, 2021. [Online]. Available: https://devopsdays.org/events/2019-
ghent/speakers/patrick-debois/
a strong governance [27], [32], [45] as well as versioning
[11] S. Mezak. (Jan. 25, 2018). The Origins of DevOps: What’s in a
of data, model, and code to ensure robustness and repro- Name? DevOps.com. Accessed: Mar. 25, 2021. [Online]. Available:
ducibility [7], [32], [39]. Lastly, it is challenging to resolve https://devops.com/the-origins-of-devops-whats-in-a-name/
a potential support request (e.g., by finding the root cause), [12] L. Leite, C. Rocha, F. Kon, D. Milojicic, and P. Meirelles, ‘‘A survey of
DevOps concepts and challenges,’’ ACM Comput. Surv., vol. 52, no. 6,
as many parties and components are involved. Failures can pp. 1–35, Nov. 2020, doi: 10.1145/3359981.
be a combination of ML infrastructure and software within [13] R. W. Macarthy and J. M. Bass, ‘‘An empirical taxonomy of DevOps in
the MLOps stack [7]. practice,’’ in Proc. 46th Euromicro Conf. Softw. Eng. Adv. Appl. (SEAA),
VIII. CONCLUSION
With the increase of data availability and analytical capabilities, coupled with the constant pressure to innovate, more machine learning products than ever are being developed. However, only a small number of these proofs of concept progress into deployment and production. Furthermore, the academic space has focused intensively on machine learning model building and benchmarking, but too little on operating complex machine learning systems in real-world scenarios. In the real world, we observe data scientists still managing ML workflows manually to a great extent. The paradigm of Machine Learning Operations (MLOps) addresses these challenges. In this work, we shed more light on MLOps. By conducting a mixed-method study analyzing existing literature and tools, as well as interviewing eight experts from the field, we uncover four main aspects of MLOps: its principles, components, roles, and architecture. From these aspects, we infer a holistic definition. The results support a common understanding of the term MLOps and its associated concepts, and will hopefully assist researchers and professionals in setting up successful ML products in the future.
REFERENCES
[1] M. Aykol, P. Herring, and A. Anapolsky, "Machine learning for continuous innovation in battery technologies," Nature Rev. Mater., vol. 5, no. 10, pp. 725–727, Jun. 2020.
[9] K. Beck et al., "The agile manifesto," 2001. [Online]. Available: http://agilemanifesto.org/
[10] P. Debois. (2009). Patrick Debois Devopsdays Ghent. Accessed: Mar. 25, 2021. [Online]. Available: https://devopsdays.org/events/2019-ghent/speakers/patrick-debois/
[11] S. Mezak. (Jan. 25, 2018). The Origins of DevOps: What's in a Name? DevOps.com. Accessed: Mar. 25, 2021. [Online]. Available: https://devops.com/the-origins-of-devops-whats-in-a-name/
[12] L. Leite, C. Rocha, F. Kon, D. Milojicic, and P. Meirelles, "A survey of DevOps concepts and challenges," ACM Comput. Surv., vol. 52, no. 6, pp. 1–35, Nov. 2020, doi: 10.1145/3359981.
[13] R. W. Macarthy and J. M. Bass, "An empirical taxonomy of DevOps in practice," in Proc. 46th Euromicro Conf. Softw. Eng. Adv. Appl. (SEAA), Aug. 2020, pp. 221–228, doi: 10.1109/SEAA51224.2020.00046.
[14] M. Rütz, "DEVOPS: A systematic literature review," FH Wedel, Wedel, Germany, Aug. 2019. [Online]. Available: https://www.researchgate.net/publication/335243102_DEVOPS_A_SYSTEMATIC_LITERATURE_REVIEW
[15] P. Perera, R. Silva, and I. Perera, "Improve software quality through practicing DevOps," in Proc. 17th Int. Conf. Adv. ICT Emerg. Regions (ICTer), Sep. 2017, pp. 1–6.
[16] J. Webster and R. Watson, "Analyzing the past to prepare for the future: Writing a literature review," MIS Quart., vol. 26, no. 2, pp. xiii–xxiii, 2002. [Online]. Available: https://www.jstor.org/stable/4132319
[17] B. Kitchenham, O. P. Brereton, D. Budgen, M. Turner, J. Bailey, and S. Linkman, "Systematic literature reviews in software engineering—A systematic literature review," Inf. Softw. Technol., vol. 51, no. 1, pp. 7–15, Jan. 2009, doi: 10.1016/j.infsof.2008.09.009.
[18] M. D. Myers and M. Newman, "The qualitative interview in IS research: Examining the craft," Inf. Org., vol. 17, no. 1, pp. 2–26, Jan. 2007, doi: 10.1016/j.infoandorg.2006.11.001.
[19] U. Schultze and M. Avital, "Designing interviews to generate rich data for information systems research," Inf. Org., vol. 21, no. 1, pp. 1–16, Jan. 2011, doi: 10.1016/j.infoandorg.2010.11.001.
[20] J. M. Corbin and A. Strauss, "Grounded theory research: Procedures, canons, and evaluative criteria," Qualitative Sociol., vol. 13, no. 1, pp. 3–21, 1990, doi: 10.1007/BF00988593.
[21] B. Glaser and A. Strauss, The Discovery of Grounded Theory: Strategies for Qualitative Research. London, U.K.: Aldine, 1967, doi: 10.4324/9780203793206.
[22] T. Granlund, A. Kopponen, V. Stirbu, L. Myllyaho, and T. Mikkonen, "MLOps challenges in multi-organization setup: Experiences from two real-world cases," 2021, arXiv:2103.08937.
[23] Y. Zhou, Y. Yu, and B. Ding, "Towards MLOps: A case study of ML pipeline platform," in Proc. Int. Conf. Artif. Intell. Comput. Eng. (ICAICE), Oct. 2020, pp. 494–500, doi: 10.1109/ICAICE51518.2020.00102.


[24] I. Karamitsos, S. Albarhami, and C. Apostolopoulos, "Applying devops practices of continuous automation for machine learning," Information, vol. 11, no. 7, pp. 1–15, 2020, doi: 10.3390/info11070363.
[25] A. Goyal, "MLOps machine learning operations," Int. J. Inf. Technol. Insights Transformations, vol. 4, no. 2, 2020. Accessed: Apr. 15, 2021. [Online]. Available: http://technology.eurekajournals.com/index.php/IJITIT/article/view/655
[26] D. A. Tamburri, "Sustainable MLOps: Trends and challenges," in Proc. 22nd Int. Symp. Symbolic Numeric Algorithms Sci. Comput. (SYNASC), Sep. 2020, pp. 17–23, doi: 10.1109/SYNASC51798.2020.00015.
[27] O. Spjuth, J. Frid, and A. Hellander, "The machine learning life cycle and the cloud: Implications for drug discovery," Expert Opinion Drug Discovery, vol. 16, no. 9, pp. 1071–1079, 2021, doi: 10.1080/17460441.2021.1932812.
[28] B. Derakhshan, A. R. Mahdiraji, T. Rabl, and V. Markl, "Continuous deployment of machine learning pipelines," in Proc. EDBT, Mar. 2019, pp. 397–408, doi: 10.5441/002/edbt.2019.35.
[29] R. R. Karn, P. Kudva, and I. A. M. Elfadel, "Dynamic autoselection and autotuning of machine learning models for cloud network analytics," IEEE Trans. Parallel Distrib. Syst., vol. 30, no. 5, pp. 1052–1064, May 2019, doi: 10.1109/TPDS.2018.2876844.
[30] S. Shalev-Shwartz, "Online learning and online convex optimization," Found. Trends Mach. Learn., vol. 4, no. 2, pp. 107–194, 2012.
[31] A. Molner Domenech and A. Guillén, "Ml-experiment: A Python framework for reproducible data science," J. Phys., Conf. Ser., vol. 1603, no. 1, Sep. 2020, Art. no. 012025, doi: 10.1088/1742-6596/1603/1/012025.
[32] S. Makinen, H. Skogstrom, E. Laaksonen, and T. Mikkonen, "Who needs MLOps: What data scientists seek to accomplish and how can MLOps help?" in Proc. IEEE/ACM 1st Workshop AI Eng. Softw. Eng. AI (WAIN), May 2021, pp. 109–112.
[33] L. C. Silva, F. R. Zagatti, B. S. Sette, L. N. dos Santos Silva, D. Lucredio, D. F. Silva, and H. de Medeiros Caseli, "Benchmarking machine learning solutions in production," in Proc. 19th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2020, pp. 626–633, doi: 10.1109/ICMLA51294.2020.00104.
[34] A. Banerjee, C. C. Chen, C. C. Hung, X. Huang, Y. Wang, and R. Chevesaran, "Challenges and experiences with MLOps for performance diagnostics in hybrid-cloud enterprise software deployments," in Proc. OpML USENIX Conf. Oper. Mach. Learn., 2020, pp. 7–9.
[35] B. Benni, M. Blay-Fornarino, S. Mosser, F. Precioso, and G. Jungbluth, "When DevOps meets meta-learning: A portfolio to rule them all," in Proc. ACM/IEEE 22nd Int. Conf. Model Driven Eng. Lang. Syst. Companion (MODELS-C), Sep. 2019, pp. 605–612, doi: 10.1109/MODELS-C.2019.00092.
[36] C. Vuppalapati, A. Ilapakurti, K. Chillara, S. Kedari, and V. Mamidi, "Automating tiny ML intelligent sensors DevOPS using Microsoft Azure," in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2020, pp. 2375–2384, doi: 10.1109/BigData50022.2020.9377755.
[37] Á. L. García, J. M. D. Lucas, M. Antonacci, W. Z. Castell, and M. David, "A cloud-based framework for machine learning workloads and applications," IEEE Access, vol. 8, pp. 18681–18692, 2020, doi: 10.1109/ACCESS.2020.2964386.
[38] C. Wu, E. Haihong, and M. Song, "An automatic artificial intelligence training platform based on Kubernetes," in Proc. 2nd Int. Conf. Big Data Eng. Technol., Jan. 2020, pp. 58–62, doi: 10.1145/3378904.3378921.
[39] G. Fursin, "Collective knowledge: Organizing research projects as a database of reusable components and portable workflows with common interfaces," Phil. Trans. Roy. Soc. A, Math., Phys. Eng. Sci., vol. 379, no. 2197, May 2021, Art. no. 20200211, doi: 10.1098/rsta.2020.0211.
[40] M. Schmitt, "Airflow vs. Luigi vs. Argo vs. MLFlow vs. KubeFlow," Tech. Rep., 2022. [Online]. Available: https://www.datarevenue.com/en-blog/airflow-vs-luigi-vs-argo-vs-mlflow-vs-kubeflow
[41] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proc. IEEE 26th Symp. Mass Storage Syst. Technol. (MSST), May 2010, pp. 1–10.
[42] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proc. 19th ACM Symp. Operating Syst. Princ., Oct. 2003, pp. 29–43.
[43] A. Lakshman and P. Malik, "Cassandra: A decentralized structured storage system," ACM SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35–40, Apr. 2010.
[44] J. C. Corbett et al., "Spanner: Google's globally distributed database," ACM Trans. Comput. Syst. (TOCS), vol. 31, no. 3, pp. 1–22, 2013.
[45] Y. Liu, Z. Ling, B. Huo, B. Wang, T. Chen, and E. Mouine, "Building a platform for machine learning operations from open source frameworks," IFAC-PapersOnLine, vol. 53, no. 5, pp. 704–709, 2020, doi: 10.1016/j.ifacol.2021.04.161.
[46] G. S. Yoon, J. Han, S. Lee, and J. W. Kim, DevOps Portal Design for SmartX AI Cluster Employing Cloud-Native Machine Learning Workflows, vol. 47. Cham, Switzerland: Springer, 2020, doi: 10.1007/978-3-030-39746-3_54.
[47] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," J. Mach. Learn. Res., vol. 18, no. 1, pp. 6869–6898, 2017.
[48] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," 2015, arXiv:1510.00149.
[49] L. E. Lwakatare, I. Crnkovic, and J. Bosch, "DevOps for AI—Challenges in development of AI-enabled applications," in Proc. Int. Conf. Softw., Telecommun. Comput. Netw. (SoftCOM), Sep. 2020, pp. 1–6, doi: 10.23919/SoftCOM50211.2020.9238323.
[50] C. Renggli, L. Rimanic, N. M. Gürel, B. Karlaš, W. Wu, and C. Zhang, "A data quality-driven view of MLOps," 2021, arXiv:2102.07750.
[51] W. J. van den Heuvel and D. A. Tamburri, Model-Driven ML-Ops for Intelligent Enterprise Applications: Vision, Approaches and Challenges, vol. 391. Cham, Switzerland: Springer, 2020, doi: 10.1007/978-3-030-52306-0_11.
[52] A. Esmaeilzadeh, M. Heidari, R. Abdolazimi, P. Hajibabaee, and M. Malekzadeh, "Efficient large scale NLP feature engineering with Apache Spark," in Proc. IEEE 12th Annu. Comput. Commun. Workshop Conf. (CCWC), Jan. 2022, pp. 274–280.
[53] J. Xu, "MLOps in the financial industry: Philosophy, practices, and tools," in The Future and FinTech: ABCDI and Beyond. Singapore: World Scientific, 2022, p. 451, doi: 10.1142/9789811250903_0014.
[54] F. Carcillo, A. D. Pozzolo, Y.-A. L. Borgne, O. Caelen, Y. Mazzer, and G. Bontempi. SCARFF: A Scalable Framework for Streaming Credit Card Fraud Detection With Spark. Accessed: Feb. 17, 2023. [Online]. Available: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
[55] J. Dhanalakshmi and N. Ayyanathan, "A dynamic web data extraction from SRLDC (southern regional load dispatch centre) and feature engineering using ETL tool," in Proc. 2nd Int. Conf. Artif. Intell., Adv. Appl. Springer, 2022, pp. 443–449, doi: 10.1007/978-981-16-6332-1_38.
[56] J. Foster and J. Wagner, "Naive Bayes versus BERT: Jupyter notebook assignments for an introductory NLP course," in Proc. 5th Workshop Teaching NLP, 2021, pp. 112–114.
[57] J. S. Obeid, M. Davis, M. Turner, S. M. Meystre, P. M. Heider, E. C. O'Bryan, and L. A. Lenert, "An artificial intelligence approach to COVID-19 infection risk assessment in virtual visits: A case report," J. Amer. Med. Inform. Assoc., vol. 27, no. 8, pp. 1321–1325, Aug. 2020.
[58] M. Aljabri, D. M. Alomari, and M. Aboulnour, "Fake news detection using machine learning models," in Proc. 14th Int. Conf. Comput. Intell. Commun. Netw. (CICN), Dec. 2022, pp. 473–477.
[59] L. Baier, N. Kühl, and G. Satzger, "How to cope with change?—Preserving validity of predictive services over time," in Proc. Annu. Hawaii Int. Conf. Syst. Sci., 2019, pp. 1–10.
[60] L. Baier, T. Schlör, J. Schöffer, and N. Kühl, "Detecting concept drift with neural network model uncertainty," 2021, arXiv:2107.01873.
[61] N. Antonio, A. de Almeida, and L. Nunes, "Predicting hotel bookings cancellation with a machine learning classification model," in Proc. 16th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2017, pp. 1049–1054, doi: 10.1109/ICMLA.2017.00-11.
[62] T. Cui, Y. Wang, and B. Namih, "Build an intelligent online marketing system: An overview," IEEE Internet Comput., vol. 23, no. 4, pp. 53–60, Jul. 2019.
[63] L. Baier and S. Seebacher, "Challenges in the deployment and operation of machine learning in practice," in Proc. 27th Eur. Conf. Inf. Syst. (ECIS), Stockholm, Sweden, Jun. 2019, pp. 1–15. [Online]. Available: https://aisel.aisnet.org/ecis2019_rp/163
[64] P. Ruf, M. Madan, C. Reich, and D. Ould-Abdeslam, "Demystifying MLOps and presenting a recipe for the selection of open-source tools," Appl. Sci., vol. 11, no. 19, p. 8861, Sep. 2021.
[65] N. Hewage and D. Meedeniya, "Machine learning operations: A survey on MLOps tool support," 2022, arXiv:2202.10169.
[66] B. Karlaš, M. Interlandi, C. Renggli, W. Wu, C. Zhang, D. M. I. Babu, J. Edwards, C. Lauren, A. Xu, and M. Weimer, "Building continuous integration services for machine learning," in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2020, pp. 2407–2415, doi: 10.1145/3394486.3403290.


DOMINIK KREUZBERGER received the B.Sc. degree in business information systems as part of a dual study program from Baden-Wuerttemberg Cooperative State University (DHBW), Stuttgart, and the M.Sc. degree in information systems engineering and management with a focus on digital services and machine learning operations (MLOps) from the Karlsruhe Institute of Technology (KIT). He is currently an IT architect, specializing in hybrid cloud computing and artificial intelligence solutions. At IBM, he focuses on client success and designs and builds enterprise-grade data and machine learning products based on IBM technology. Before joining IBM, he worked for nearly a decade with the multinational sports company adidas, where he held various positions in the areas of e-commerce and data and analytics.

SEBASTIAN HIRSCHL is currently a Senior Engineer/Architect with IBM and leads the ML engineering and platform activities for the machine learning practice in Germany. His computer science background combines machine learning and artificial intelligence. He developed the discipline of machine learning (ML) engineering within IBM over the last five years, including best practices, methods, roles, and tools. He designs and leads the implementation of enterprise-grade data and ML products for clients in Germany and Europe. Together with a team, he publishes the IBM Data Science Best Practices and shapes the IBM Data and AI reference architecture. In his role as an ML engineering expert, he drives the evolution of the paradigm of MLOps internally and externally.

NIKLAS KÜHL received the Ph.D. degree (summa cum laude) in information systems with a focus on applied machine learning. In his research, he works on conceptualizing, designing, and implementing artificial intelligence (AI) products, with a focus on inter-organizational learning as well as fair and effective collaboration within human–AI teams. He is currently a Full Professor of information systems and human-centric AI with the University of Bayreuth. He is also a Group Lead for business analytics with Fraunhofer FIT as well as a Senior Expert in artificial intelligence with IBM. In the past, he was a Managing Consultant for data science with IBM, which complemented his theoretical knowledge with practical insights from the field. He has been working on machine learning (ML) and AI in different domains since 2014. He collaborates internationally with multiple institutions, such as the University of Texas and the MIT–IBM Watson AI Laboratory.
