Machine Learning Operations (MLOps): Overview, Definition, and Architecture
ABSTRACT The final goal of all industrial machine learning (ML) projects is to develop ML products
and rapidly bring them into production. However, it is highly challenging to automate and operationalize
ML products and thus many ML endeavors fail to deliver on their expectations. The paradigm of Machine
Learning Operations (MLOps) addresses this issue. MLOps includes several aspects, such as best practices,
sets of concepts, and development culture. However, MLOps is still a vague term and its consequences
for researchers and professionals are ambiguous. To address this gap, we conduct mixed-method research,
including a literature review, a tool review, and expert interviews. As a result of these investigations,
we contribute to the body of knowledge by providing an aggregated overview of the necessary principles,
components, and roles, as well as the associated architecture and workflows. Furthermore, we provide a
comprehensive definition of MLOps and highlight open challenges in the field. Finally, this work provides
guidance for ML researchers and practitioners who want to automate and operate their ML products with a
designated set of technologies.
INDEX TERMS CI/CD, DevOps, machine learning, MLOps, operations, workflow orchestration.
27 articles were selected based on our inclusion and exclusion criteria (e.g., the term MLOps or DevOps and CI/CD in combination with ML was described in detail, the article was written in English, etc.). All 27 of these articles were peer-reviewed.

B. TOOL REVIEW
After going through 27 articles and eight interviews, various open-source tools, frameworks, and commercial cloud ML services were identified. These tools, frameworks, and ML services were reviewed to gain an understanding of the technical components of which they consist. An overview of the identified tools is depicted in Table 1.

C. INTERVIEW STUDY
To answer the research questions with insights from practice, we conduct semi-structured expert interviews according to Myers and Newman [18]. One major aspect in the research design of expert interviews is choosing an appropriate sample size [20]. We apply a theoretical sampling approach [21], which allows us to choose experienced interview partners to obtain high-quality data. Such data can provide meaningful insights with a limited number of interviews. To get an adequate sample group and reliable insights, we use LinkedIn—a social network for professionals—to identify experienced ML professionals with profound MLOps knowledge on a global level. To gain insights from various perspectives, we choose interview partners from different organizations and industries, different countries and nationalities, as well as different genders. Interviews are conducted until no new categories and concepts emerge in the analysis of the data. According to Glaser and Strauss [21], this stage is called ''theoretical saturation.'' In total, we conduct eight interviews with experts (pseudonymized as α–θ), whose details are depicted in Table 2. All interviews are conducted between June and August 2021.

With regard to the interview design, we prepare a semi-structured guide with several questions, documented as an interview script [18]. During the interviews, ''soft laddering'' is used with ''how'' and ''why'' questions to probe the interviewees' means-end chain [19]. This methodical approach allowed us to gain additional insight into the experiences of the interviewees when required. All interviews are recorded and then transcribed. To evaluate the interview transcripts, we use an open coding scheme [20]. The open coding process allows the data to be broken down in an analytical manner so that conceptually similar topics can be grouped into categories and subcategories. These categories are called ''codes.'' Concepts were identified when they appeared multiple times in different interviews [21].

IV. RESULTS
We apply the described methodology and structure our resulting insights into a presentation of important principles, their resulting instantiation as components, the description of necessary roles, as well as a suggestion for the architecture and workflow resulting from the combination of these aspects. Finally, we derive the conceptualization of the term and provide a definition of MLOps.
…code [23], [24], [36], [37], [38] [α, β, γ, ζ, θ]. Examples include Bitbucket [39], [ζ], GitLab [24], [39], [ζ], GitHub [37], [ζ, η], and Gitea [23].

C3 Workflow Orchestration Component (P2, P3, P6). The workflow orchestration component offers task orchestration of an ML workflow via directed acyclic graphs (DAGs). These graphs represent the execution order and artifact usage of the single steps of the workflow. A workflow uses, for example, packaged code artifacts in the respective process step, like extracting data, training, inference, or embedding a model binary into an application [6], [7], [23], [26], [31], [α, β, γ, δ, ε, ζ, η]. Examples include Apache Airflow [α, ζ], Kubeflow Pipelines [ζ], Watson Studio Pipelines [γ], Luigi [ζ], AWS SageMaker Pipelines [β], and Azure Pipelines [ε]. In theory, CI/CD tools could also be used to schedule the triggering of specific tasks sequentially; however, the complexity of data engineering and ML pipeline tasks has increased the need for tools specifically designed for workflow or task orchestration. These workflow orchestration tools make it easier to efficiently manage interrelated and interdependent tasks, because they are specifically designed to manage complex task chains [40].

C4 Feature Store System (P3, P4). A feature store system ensures central storage of commonly used features. It has two databases configured: one database as an offline feature store to serve features with normal latency for experimentation, and one database as an online store to serve features with low latency for predictions in production [25], [28], [α, β, ζ, ε, θ]. Examples include Google Feast [ζ], Amazon AWS Feature Store [β, ζ], Tecton.ai, and Hopsworks.ai [ζ]. This is where most of the data for training ML models will come from. Moreover, data can also come directly from any kind of data store. A feature store poses complex requirements, which are highly dependent on the use case. Its databases can be hosted on on-premises infrastructure or in the cloud. However, scalability is typically realized with cloud infrastructure. Most use cases have a read-heavy workload, combined with a batch- or streaming-based ingestion pattern on (very) large data sets. Such high scalability can be achieved with distributed file systems [41], [42] or distributed databases [43], [44], combined with parallel and distributed data processing algorithms (e.g., MapReduce or a higher-level API like Spark).

C5 Model Training Infrastructure (P6). The model training infrastructure provides the foundational computation resources, e.g., CPUs, RAM, and GPUs. The provided infrastructure can be either distributed or non-distributed. In general, a scalable and distributed infrastructure is recommended [7], [23], [27], [28], [32], [33], [37], [45], [46], [δ, ζ, η, θ]. Examples include local machines (not scalable) or cloud computation [33] [η, θ], as well as non-distributed or distributed computation (several worker nodes) [7], [37]. Frameworks supporting computation are Kubernetes [η, θ] and Red Hat OpenShift [γ]. Typically, deep learning workloads (training and inference) are matrix-multiplication-heavy and therefore computation-bound. GPUs are optimized towards this type of workload and should be the primary focus for compute node specification. In edge devices, where storage and computation power are limited, quantized neural networks with low-precision floating-point operations [47], in combination with pruning and Huffman coding, should be investigated [48].

C6 Model Registry (P3, P4). The model registry centrally stores the trained ML models together with their metadata. It has two main functionalities: storing the ML artifact and storing the ML metadata (see C7) [7], [24], [25], [34], [35], [α, β, γ, ε, ζ, η, θ]. Advanced storage examples include MLflow [α, η, ζ], AWS SageMaker Model Registry [ζ], Microsoft Azure ML Model Registry [ζ], and Neptune.ai [α]. Simple storage examples include Microsoft Azure Storage, Google Cloud Storage, and Amazon AWS S3 [23].

C7 ML Metadata Stores (P4, P7). ML metadata stores allow for the tracking of various kinds of metadata, e.g., for each orchestrated ML workflow pipeline task. Another metadata store can be configured within the model registry for tracking and logging the metadata of each training job (e.g., training date and time, duration, etc.), including the model-specific metadata—e.g., the used parameters and the resulting performance metrics, as well as the model lineage (the data and code used) [7], [25], [31], [37], [49], [α, β, δ, ζ, θ]. Examples include orchestrators with built-in metadata stores tracking each step of experiment pipelines [α], such as Kubeflow Pipelines [α, ζ], AWS SageMaker Pipelines [α, ζ], Azure ML, and IBM Watson Studio [γ]. MLflow provides an advanced metadata store in combination with the model registry [6], [31].

C8 Model Serving Component (P1). The model serving component can be configured for different purposes. Examples are online inference for real-time predictions or batch inference for predictions using large volumes of input data. The serving can be provided, e.g., via a REST API. As a foundational infrastructure layer, a scalable and distributed model serving infrastructure is recommended [23], [27], [33], [37], [39], [46], [α, β, δ, ζ, η, θ]. One example of a model serving component configuration is the use of Kubernetes and Docker technology to containerize the ML model, leveraging a Python web application framework like Flask [24] with an API for serving [α]. Other Kubernetes-supported frameworks are KServe (formerly KFServing) of Kubeflow [α], TensorFlow Serving, and Seldon.io serving [27]. Inferencing could also be realized with Apache Spark for batch predictions [θ]. Examples of cloud services include Microsoft Azure ML REST API [ε], AWS SageMaker Endpoints [α, β], IBM Watson Studio [γ], and Google Vertex AI prediction service [δ]. The actual deployment of the model depends on the use case and typically falls into one of these categories: real-time, batch, or serverless inference. Real-time inference can be achieved by hosting the model in a RESTful web service, batch inference can be an idempotent MapReduce workflow, and serverless inference is used when cost-efficient and scalable serving is required.
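To make the Flask-based serving configuration described for C8 more tangible, the following minimal sketch shows an endpoint that loads a serialized model and answers prediction requests via REST. It is an illustration only: the model file name, the /predict route, and the payload shape are assumptions of this sketch, not prescriptions from the reviewed tools or interviews.

```python
# Minimal sketch of a containerizable Flask serving endpoint (C8).
# Assumptions: a scikit-learn-style model serialized to model.pkl and
# a JSON payload {"features": [[...]]}; names and paths are illustrative.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:  # e.g., pulled from the model registry (C6)
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    predictions = model.predict(payload["features"])
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    # In production this would run behind a WSGI server inside a Docker
    # container orchestrated by Kubernetes, as described above.
    app.run(host="0.0.0.0", port=8080)
```

Packaged into a Docker image and deployed to Kubernetes, such a service corresponds to the containerized serving configuration outlined above.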
C. ROLES
After describing the principles and their resulting instantiation as components, we identify the roles necessary to realize MLOps. MLOps is an interdisciplinary group process, and the interplay of different roles is crucial to design, manage, automate, and operate an ML system in production. In the following, every role, its purpose, and its related tasks are briefly described:

R1 Business Stakeholder (similar roles: Product Owner, Project Manager). The business stakeholder defines the business goal to be achieved with ML and takes care of the communication side of the business, e.g., presenting the return on investment (ROI) generated with an ML product [7], [24], [45] [α, β, δ, θ].

R2 Solution Architect (similar role: IT Architect). The solution architect designs the architecture and defines the technologies to be used, following a thorough evaluation [7], [24], [α, ζ].

R3 Data Scientist (similar roles: ML Specialist, ML Developer). The data scientist translates the business problem into an ML problem and takes care of the model engineering, including the selection of the best-performing algorithm and hyperparameters [7], [25], [32], [33], [α, β, γ, δ, ε, ζ, η, θ].

R4 Data Engineer (similar role: DataOps Engineer). The data engineer builds up and manages data and feature engineering pipelines. Moreover, this role ensures proper data ingestion to the databases of the feature store system [25], [26], [32], [α, β, γ, δ, ε, ζ, η, θ].

R5 Software Engineer. The software engineer applies software design patterns, widely accepted coding guidelines, and best practices to turn the raw ML problem into a well-engineered product [32], [α, γ].

R6 DevOps Engineer. The DevOps engineer bridges the gap between development and operations and ensures proper CI/CD automation, ML workflow orchestration, model deployment to production, and monitoring [7], [22], [25], [51], [α, β, γ, ε, ζ, η, θ].

R7 ML Engineer/MLOps Engineer. The ML engineer or MLOps engineer combines aspects of several roles and thus has cross-domain knowledge. This role incorporates skills from data scientists, data engineers, software engineers, DevOps engineers, and backend engineers (see Figure 3). This cross-domain role builds up and operates the ML infrastructure, manages the automated ML workflow pipelines and model deployment to production, and monitors both the model and the ML infrastructure [7], [24], [25], [32], [α, β, γ, δ, ε, ζ, η, θ].

FIGURE 3. Roles and their intersections contributing to the MLOps paradigm.

V. ARCHITECTURE AND WORKFLOW
On the basis of the identified principles, components, and roles, we derive a generalized MLOps end-to-end architecture to give ML researchers and practitioners proper guidance. It is depicted in Figure 4. Additionally, we depict the workflows, i.e., the sequence in which the different tasks are executed in the different stages. The artifact was designed to be technology-agnostic, so ML researchers and practitioners can choose the best-fitting technologies and frameworks for their needs. This means the MLOps process and components can be built from ''best-of-breed'' open-source tools, from enterprise solutions, or from a mix-and-match combination of both. Enterprise software and cloud services often allow connections to open-source tools via their APIs, and vice versa. The latest developments should therefore be kept in view, as the open-source tool market is growing rapidly and new combination options appear frequently. However, there are certainly also constraints when it comes to API interfaces and combinations, and in general it is hard to say which technologies combine well and which do not. With the introduced examples of applications and the precise mentioning of tools, however, we demonstrate possible combinations.

As depicted in Figure 4, we illustrate an end-to-end process, from MLOps product initiation to the model serving. It includes (A) the MLOps product initiation steps; (B) the feature engineering pipeline, including the data ingestion to the feature store; (C) the experimentation; and (D) the automated ML workflow pipeline up to the model serving.
FIGURE 4. End-to-end MLOps architecture and workflow with functional components and roles.
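Before walking through the stages in detail, a brief illustration may help connect the workflow to the orchestration component (C3): the automated ML workflow pipeline of stage (D), described below, is typically declared as a DAG. The following sketch uses Apache Airflow, one of the orchestrators named earlier; the task names and the pipeline_tasks module are hypothetical placeholders, not part of the reviewed architecture.

```python
# Illustrative sketch of the automated ML workflow pipeline (D) as an
# Airflow DAG (C3). Task names and Python callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from pipeline_tasks import (  # hypothetical module holding the task logic
    extract_features, prepare_data, train_model, evaluate_model, register_model,
)

with DAG(
    dag_id="automated_ml_workflow",
    start_date=datetime(2021, 6, 1),
    schedule_interval=None,  # triggered externally, e.g., by the feedback loop
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    prepare = PythonOperator(task_id="prepare_and_validate", python_callable=prepare_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)
    register = PythonOperator(task_id="register_model", python_callable=register_model)

    # The execution order mirrors steps (18)-(23) of the workflow below.
    extract >> prepare >> train >> evaluate >> register
```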
(A) MLOps product initiation. (1) The business stakeholder (R1) analyzes the business and identifies a potential business problem that can be solved using ML. (2) The solution architect (R2) defines the architecture design for the overall ML system and decides on the technologies to be used after a thorough evaluation. (3) The data scientist (R3) derives an ML problem—such as whether regression or classification should be used—from the business goal. (4) The data engineer (R4) and the data scientist (R3) work together to understand which data is required to solve the problem. (5) Once the answers are clarified, the data engineer (R4) and data scientist (R3) collaborate to locate the raw data sources for the initial data analysis. They check the distribution and quality of the data, as well as performing validation checks. Furthermore, they ensure that the incoming data from the data sources is labeled, meaning that a target attribute is known, as this is a mandatory requirement for supervised ML. In this example, the data sources already had labeled data available, as the labeling step was covered during an upstream process.

(B1) Requirements for feature engineering pipeline. The features are the relevant attributes required for model training. After the initial understanding of the raw data and the initial data analysis, the fundamental requirements for the feature engineering pipeline are defined as follows: (6) The data engineer (R4) defines the data transformation rules (normalization, aggregations) and cleaning rules to bring the data into a usable format. (7) The data scientist (R3) and data engineer (R4) together define the feature engineering rules, such as the calculation of new and more advanced features based on other features. These initially defined rules must be iteratively adjusted by the data scientist (R3), either based on the feedback coming from the experimental model engineering stage or from the monitoring component observing the model performance.

(B2) Feature engineering pipeline. The initially defined requirements for the feature engineering pipeline are taken by the data engineer (R4) and software engineer (R5) as a starting point to build up the prototype of the feature engineering pipeline. The initially defined requirements and rules are updated according to the iterative feedback coming either from the experimental model engineering stage or from the monitoring component observing the model's performance in production.

As a foundational requirement, the data engineer (R4) defines the code required for the CI/CD (C1) and orchestration component (C3) to ensure the task orchestration of the feature engineering pipeline. This role also defines the underlying infrastructure resource configuration. (8) First, the feature engineering pipeline connects to the raw data, which can be (for instance) streaming data, static batch data, or data from any cloud storage. (9) The data is extracted from the data sources. (10) The data preprocessing begins with data transformation and cleaning tasks. The transformation rule artifact defined in the requirement gathering stage serves as input for this task, and the main aim of this task is to bring the data into a usable format. These transformation rules are continuously improved based on the feedback. (11) The feature engineering task calculates new and more advanced features based on other features. The predefined feature engineering rules serve as input for this task. These feature engineering rules are continuously improved based on the feedback. (12) Lastly, a data ingestion job loads batch or streaming data into the feature store system (C4). The target can be either the offline or online database (or any kind of data store).

An example of the implementation of an entire feature engineering pipeline can be found in Esmaeilzadeh et al. [52], who implemented an NLP pipeline with Apache Spark. As another example, Xu [53] demonstrates how a financial institution may use Spark to process and analyze large amounts of customer credit data, such as credit history, income, and demographics. The data is then transformed and cleaned using Spark's DataFrame and SQL functionality, and various feature engineering techniques are applied to create a set of relevant features for the credit risk model. These features can then be passed through an ML pipeline, also implemented in Spark, to train and evaluate a predictive model for assessing credit risk. In addition, Apache Kafka can be used for near real-time streaming data ingestion into the Spark-based feature engineering pipeline [54]. However, to some extent, a traditional ETL tool can also be used to build a feature engineering pipeline [55].

(C) Experimentation. Most tasks in the experimentation stage are led by the data scientist (R3), including the initial configuration of the hardware and runtime environment. The data scientist is supported by the software engineer (R5). (13) The data scientist (R3) connects to the feature store system (C4) for the data analysis. (Alternatively, the data scientist (R3) can also connect to the raw data for an initial analysis.) In case of any required data adjustments, the data scientist (R3) reports the required changes back to the data engineering zone (feedback loop). (14) Then the preparation and validation of the data coming from the feature store system is required. This task also includes the creation of the train and test split datasets. (15) The data scientist (R3) estimates the best-performing algorithm and hyperparameters, and the model training is then triggered with the training data (C5). The software engineer (R5) supports the data scientist (R3) in the creation of well-engineered model training code. (16) Different model parameters are tested and validated interactively during several rounds of model training. Once the performance metrics indicate good results, the iterative training stops. The best-performing model parameters are identified via parameter tuning. The model training task and model validation task are iteratively repeated; together, these tasks can be called ''model engineering.'' The model engineering aims to identify the best-performing algorithm and hyperparameters for the model. (17) The data scientist (R3) exports the model and commits the code to the repository.

As a foundational requirement, either the DevOps engineer (R6) or the ML engineer (R7) defines the code for the automated ML workflow pipeline and commits it to the repository (C2). Once either the data scientist (R3) commits a new ML model or the DevOps engineer (R6) or ML engineer (R7) commits new ML workflow pipeline code to the repository, the CI/CD component (C1) detects the updated code and automatically triggers the CI/CD pipeline carrying out the build, test, and delivery steps. The build step creates artifacts containing the ML model and the tasks of the ML workflow pipeline. The test step validates the ML model and the ML workflow pipeline code. The delivery step pushes the versioned artifact(s)—such as images—to the artifact store (e.g., image registry).

Typical technologies used for the experimentation step are notebook-based solutions like the ones from Jupyter. One example of an industry case where ML experiments are performed with a notebook-based environment is in the field of natural language processing (NLP) [56]. A company that provides NLP-based services such as sentiment analysis, text summarization, and named entity recognition may use Jupyter notebooks to perform ML experiments on large amounts of text data. The company's data scientists use Jupyter notebooks to prepare the data. Then, they can train, evaluate, and optimize different ML models, such as deep learning models, and test the results. To track the experiments with the textual data, i.e., to track the metadata and store the resulting models, commonly used solutions in combination with Jupyter are, among others, MLflow (e.g., Obeid et al. [57] for assessing the risk of COVID-19 based on health records) or Neptune.ai (e.g., Aljabri et al. [58] for NLP-based fake news detection).

(D) Automated ML workflow pipeline. The DevOps engineer (R6) and the ML engineer (R7) take care of the management of the automated ML workflow pipeline. They also manage the runtime environments and the underlying model training infrastructure in the form of hardware resources and frameworks supporting computation, such as Kubernetes (C5). The workflow orchestration component (C3) orchestrates the tasks of the automated ML workflow pipeline. For each task, the required artifacts (e.g., images) are pulled from the artifact store (e.g., image registry). Each task can be executed via an isolated environment (e.g., containers). Finally, the workflow orchestration component (C3) gathers metadata for each task in the form of logs, completion time, and so on.

Once the automated ML workflow pipeline is triggered, each of the following tasks is managed automatically: (18) automated pulling of the versioned features from the feature store systems (data extraction). Depending on the use case, features are extracted from either the offline or online database (or any kind of data store). (19) Automated data preparation and validation; in addition, the train and test split is defined automatically. (20) Automated final model training on new, unseen data (versioned features). The algorithm and hyperparameters are already predefined based on the settings of the previous experimentation stage. The model is retrained and refined. (21) Automated model evaluation and iterative adjustments of hyperparameters are executed, if required. Once the performance metrics indicate good results, the automated iterative training stops. The automated model training task and the automated model validation task can be iteratively repeated until a good result has been achieved. (22) The trained model is then exported and (23) pushed to the model registry (C6), where it is stored, e.g., as code or containerized together with its associated configuration and environment files.

For all training job iterations, the ML metadata store (C7) records metadata such as the parameters used to train the model and the resulting performance metrics. This also includes the tracking and logging of the training job ID, training date and time, duration, and sources of artifacts. Additionally, model-specific metadata called ''model lineage,'' combining the lineage of data and code, is tracked for each newly registered model. This includes the source and version of the feature data and model training code used to train the model. Also, the model version and status (e.g., staging or production-ready) are recorded.

Once the status of a well-performing model is switched from staging to production, it is automatically handed over to the DevOps engineer or ML engineer for model deployment. From there, (24) the CI/CD component (C1) triggers the continuous deployment pipeline. The production-ready ML model and the model serving code (initially prepared by the software engineer (R5)) are pulled. The continuous deployment pipeline carries out the build and test steps for the ML model and serving code and deploys the model for production serving. (25) The model serving component (C8) makes predictions on new, unseen data coming from the feature store system (C4). This component can be designed by the software engineer (R5) as online inference for real-time predictions or as batch inference for predictions concerning large volumes of input data. For real-time predictions, features must come from the online database (low latency), whereas for batch predictions, features can be served from the offline database (normal latency). Model-serving applications are often configured within a container, and prediction requests are handled via a REST API. When deploying an ML/AI application, it is good practice to use A/B testing to determine in a real-world scenario which model performs better compared to another, for example, by deploying a ''challenger model'' in addition to an existing ''champion model'' and collecting feedback to find out which one performs better, e.g., when predicting hotel booking cancellations [61].

As a foundational requirement, the ML engineer (R7) manages the model-serving computation infrastructure. (26) The monitoring component (C9) continuously observes the model-serving performance and infrastructure in real time. Once a certain threshold is reached, such as the detection of low prediction accuracy, the information is forwarded via the feedback loop. (27) The feedback loop is connected to the monitoring component (C9) and ensures fast and direct feedback, allowing for more robust and improved predictions. It enables continuous training, retraining, and improvement. With the support of the feedback loop, information is transferred from the model monitoring component to several upstream receiver points, such as the experimental stage, the data engineering zone, and the scheduler (trigger). The feedback to the experimental stage is taken forward by the data scientist for further model improvements. The feedback to the data engineering zone allows for the adjustment of the features prepared for the feature store system. Additionally, the detection of concept drift as a feedback mechanism can enable (28) continuous training. Concept drift occurs in real-world applications when the input data changes over time, e.g., when a sensor breaks. Decreased prediction accuracy due to such drift can then serve as a trigger for retraining the model.
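The monitoring and feedback-loop behavior just described can be made concrete with a small sketch. Assuming a simple accuracy-based criterion (the window size, threshold, and retraining hook below are illustrative assumptions, not elements prescribed by the architecture), a monitor could look as follows:

```python
# Hedged sketch of a feedback-loop trigger for steps (26)-(28): watch a
# rolling accuracy window and fire a retraining trigger when it degrades,
# e.g., under concept drift. All parameter values are illustrative.
from collections import deque

class AccuracyDriftMonitor:
    def __init__(self, window_size=500, threshold=0.85, trigger_retraining=None):
        self.outcomes = deque(maxlen=window_size)  # 1 = correct, 0 = wrong
        self.threshold = threshold
        self.trigger_retraining = trigger_retraining  # e.g., starts the pipeline (D)

    def record(self, prediction, label):
        self.outcomes.append(int(prediction == label))
        # Only act once the window is full, to avoid noisy early triggers.
        if len(self.outcomes) == self.outcomes.maxlen:
            accuracy = sum(self.outcomes) / len(self.outcomes)
            if accuracy < self.threshold and self.trigger_retraining:
                self.trigger_retraining(accuracy)  # feedback loop (27)

# Usage sketch:
# monitor = AccuracyDriftMonitor(trigger_retraining=lambda acc: print(f"retrain, acc={acc:.2f}"))
```

In practice, this naive threshold rule would typically be replaced by a dedicated concept drift detector (see, e.g., [59], [60]).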
…need to be convinced that an increased MLOps maturity and a product-focused mindset will yield clear business improvements [γ].

B. ML SYSTEM CHALLENGES
A major challenge with regard to MLOps systems is designing for fluctuating demand, especially in relation to the process of ML training [33]. This stems from potentially voluminous and varying data [28], which makes it difficult to precisely estimate the necessary infrastructure resources (CPU, RAM, and GPU) and requires a high level of flexibility in terms of the scalability of the infrastructure [7], [33], [δ].

C. OPERATIONAL CHALLENGES
In productive settings, it is challenging to operate ML manually due to the different stacks of software and hardware components and their interplay, as well as the selection of both [64], [65]. Therefore, robust automation is required [24], [33]. Also, a constant incoming stream of new data forces retraining capabilities. This is a repetitive task which, again, requires a high level of automation [66], [θ]. These repetitive tasks yield a large number of artifacts that require strong governance [27], [32], [45], as well as versioning of data, model, and code to ensure robustness and reproducibility [7], [32], [39]. Lastly, it is challenging to resolve a potential support request (e.g., by finding the root cause), as many parties and components are involved. Failures can be a combination of ML infrastructure and software within the MLOps stack [7].

VIII. CONCLUSION
With the increase of data availability and analytical capabilities, coupled with the constant pressure to innovate, more machine learning products than ever are being developed. However, only a small number of these proofs of concept progress into deployment and production. Furthermore, the academic space has focused intensively on machine learning model building and benchmarking, but too little on operating complex machine learning systems in real-world scenarios. In the real world, we observe data scientists still managing ML workflows manually to a great extent. The paradigm of Machine Learning Operations (MLOps) addresses these challenges. In this work, we shed more light on MLOps. By conducting a mixed-method study analyzing existing literature and tools, as well as interviewing eight experts from the field, we uncover four main aspects of MLOps: its principles, components, roles, and architecture. From these aspects, we infer a holistic definition. The results support a common understanding of the term MLOps and its associated concepts, and will hopefully assist researchers and professionals in setting up successful ML products in the future.

REFERENCES
[1] M. Aykol, P. Herring, and A. Anapolsky, ''Machine learning for continuous innovation in battery technologies,'' Nature Rev. Mater., vol. 5, no. 10, pp. 725–727, Jun. 2020.
[2] M. K. Gourisaria, R. Agrawal, G. M. Harshvardhan, M. Pandey, and S. S. Rautaray, ''Application of machine learning in industry 4.0,'' in Machine Learning: Theoretical Foundations and Practical Applications. Cham, Switzerland: Springer, 2021, pp. 57–87.
[3] A. D. L. Heras, A. Luque-Sendra, and F. Zamora-Polo, ''Machine learning technologies for sustainability in smart cities in the post-COVID era,'' Sustainability, vol. 12, no. 22, p. 9320, Nov. 2020.
[4] R. Kocielnik, S. Amershi, and P. N. Bennett, ''Will you accept an imperfect AI?: Exploring designs for adjusting end-user expectations of AI systems,'' in Proc. CHI Conf. Hum. Factors Comput. Syst., May 2019, pp. 1–14.
[5] R. van der Meulen and T. McCall. (2018). Gartner Says Nearly Half of CIOs Are Planning to Deploy Artificial Intelligence. Accessed: Dec. 4, 2021. [Online]. Available: https://www.gartner.com/en/newsroom/press-releases/2018-02-13-gartner-says-nearly-half-of-cios-are-planning-to-deploy-artificial-intelligence
[6] A. Posoldova, ''Machine learning pipelines: From research to production,'' IEEE Potentials, vol. 39, no. 6, pp. 38–42, Nov. 2020.
[7] L. E. Lwakatare, I. Crnkovic, E. Rånge, and J. Bosch, ''From a data science driven process to a continuous delivery process for machine learning systems,'' in Product-Focused Software Process Improvement (Lecture Notes in Computer Science), vol. 12562. Springer, 2020, pp. 185–201, doi: 10.1007/978-3-030-64148-1_12.
[8] W. W. Royce, ''Managing the development of large software systems,'' in Proc. IEEE WESCON, Aug. 1970, pp. 1–9.
[9] K. Beck et al., ''The agile manifesto,'' 2001. [Online]. Available: http://agilemanifesto.org/
[10] P. Debois. (2009). Patrick Debois Devopsdays Ghent. Accessed: Mar. 25, 2021. [Online]. Available: https://devopsdays.org/events/2019-ghent/speakers/patrick-debois/
[11] S. Mezak. (Jan. 25, 2018). The Origins of DevOps: What's in a Name? DevOps.com. Accessed: Mar. 25, 2021. [Online]. Available: https://devops.com/the-origins-of-devops-whats-in-a-name/
[12] L. Leite, C. Rocha, F. Kon, D. Milojicic, and P. Meirelles, ''A survey of DevOps concepts and challenges,'' ACM Comput. Surv., vol. 52, no. 6, pp. 1–35, Nov. 2020, doi: 10.1145/3359981.
[13] R. W. Macarthy and J. M. Bass, ''An empirical taxonomy of DevOps in practice,'' in Proc. 46th Euromicro Conf. Softw. Eng. Adv. Appl. (SEAA), Aug. 2020, pp. 221–228, doi: 10.1109/SEAA51224.2020.00046.
[14] M. Rütz, ''DEVOPS: A systematic literature review,'' FH Wedel, Aug. 2019. [Online]. Available: https://www.researchgate.net/publication/335243102_DEVOPS_A_SYSTEMATIC_LITERATURE_REVIEW
[15] P. Perera, R. Silva, and I. Perera, ''Improve software quality through practicing DevOps,'' in Proc. 17th Int. Conf. Adv. ICT Emerg. Regions (ICTer), Sep. 2017, pp. 1–6.
[16] J. Webster and R. Watson, ''Analyzing the past to prepare for the future: Writing a literature review,'' MIS Quart., vol. 26, no. 2, pp. 8–23, 2002. [Online]. Available: https://www.jstor.org/stable/4132319
[17] B. Kitchenham, O. P. Brereton, D. Budgen, M. Turner, J. Bailey, and S. Linkman, ''Systematic literature reviews in software engineering—A systematic literature review,'' Inf. Softw. Technol., vol. 51, no. 1, pp. 7–15, Jan. 2009, doi: 10.1016/j.infsof.2008.09.009.
[18] M. D. Myers and M. Newman, ''The qualitative interview in IS research: Examining the craft,'' Inf. Org., vol. 17, no. 1, pp. 2–26, Jan. 2007, doi: 10.1016/j.infoandorg.2006.11.001.
[19] U. Schultze and M. Avital, ''Designing interviews to generate rich data for information systems research,'' Inf. Org., vol. 21, no. 1, pp. 1–16, Jan. 2011, doi: 10.1016/j.infoandorg.2010.11.001.
[20] J. M. Corbin and A. Strauss, ''Grounded theory research: Procedures, canons, and evaluative criteria,'' Qualitative Sociol., vol. 13, no. 1, pp. 3–21, 1990, doi: 10.1007/BF00988593.
[21] B. Glaser and A. Strauss, The Discovery of Grounded Theory: Strategies for Qualitative Research. London, U.K.: Aldine, 1967, doi: 10.4324/9780203793206.
[22] T. Granlund, A. Kopponen, V. Stirbu, L. Myllyaho, and T. Mikkonen, ''MLOps challenges in multi-organization setup: Experiences from two real-world cases,'' 2021, arXiv:2103.08937.
[23] Y. Zhou, Y. Yu, and B. Ding, ''Towards MLOps: A case study of ML pipeline platform,'' in Proc. Int. Conf. Artif. Intell. Comput. Eng. (ICAICE), Oct. 2020, pp. 494–500, doi: 10.1109/ICAICE51518.2020.00102.
[24] I. Karamitsos, S. Albarhami, and C. Apostolopoulos, ''Applying DevOps practices of continuous automation for machine learning,'' Information, vol. 11, no. 7, pp. 1–15, 2020, doi: 10.3390/info11070363.
[25] A. Goyal, ''MLOps machine learning operations,'' Int. J. Inf. Technol. Insights Transformations, vol. 4, no. 2, 2020. Accessed: Apr. 15, 2021. [Online]. Available: http://technology.eurekajournals.com/index.php/IJITIT/article/view/655
[26] D. A. Tamburri, ''Sustainable MLOps: Trends and challenges,'' in Proc. 22nd Int. Symp. Symbolic Numeric Algorithms Sci. Comput. (SYNASC), Sep. 2020, pp. 17–23, doi: 10.1109/SYNASC51798.2020.00015.
[27] O. Spjuth, J. Frid, and A. Hellander, ''The machine learning life cycle and the cloud: Implications for drug discovery,'' Expert Opinion Drug Discovery, vol. 16, no. 9, pp. 1071–1079, 2021, doi: 10.1080/17460441.2021.1932812.
[28] B. Derakhshan, A. R. Mahdiraji, T. Rabl, and V. Markl, ''Continuous deployment of machine learning pipelines,'' in Proc. EDBT, Mar. 2019, pp. 397–408, doi: 10.5441/002/edbt.2019.35.
[29] R. R. Karn, P. Kudva, and I. A. M. Elfadel, ''Dynamic autoselection and autotuning of machine learning models for cloud network analytics,'' IEEE Trans. Parallel Distrib. Syst., vol. 30, no. 5, pp. 1052–1064, May 2019, doi: 10.1109/TPDS.2018.2876844.
[30] S. Shalev-Shwartz, ''Online learning and online convex optimization,'' Found. Trends Mach. Learn., vol. 4, no. 2, pp. 107–194, 2012.
[31] A. Molner Domenech and A. Guillén, ''Ml-experiment: A Python framework for reproducible data science,'' J. Phys., Conf. Ser., vol. 1603, no. 1, Sep. 2020, Art. no. 012025, doi: 10.1088/1742-6596/1603/1/012025.
[32] S. Makinen, H. Skogstrom, E. Laaksonen, and T. Mikkonen, ''Who needs MLOps: What data scientists seek to accomplish and how can MLOps help?'' in Proc. IEEE/ACM 1st Workshop AI Eng. Softw. Eng. AI (WAIN), May 2021, pp. 109–112.
[33] L. C. Silva, F. R. Zagatti, B. S. Sette, L. N. dos Santos Silva, D. Lucredio, D. F. Silva, and H. de Medeiros Caseli, ''Benchmarking machine learning solutions in production,'' in Proc. 19th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2020, pp. 626–633, doi: 10.1109/ICMLA51294.2020.00104.
[34] A. Banerjee, C. C. Chen, C. C. Hung, X. Huang, Y. Wang, and R. Chevesaran, ''Challenges and experiences with MLOps for performance diagnostics in hybrid-cloud enterprise software deployments,'' in Proc. OpML USENIX Conf. Oper. Mach. Learn., 2020, pp. 7–9.
[35] B. Benni, M. Blay-Fornarino, S. Mosser, F. Precioso, and G. Jungbluth, ''When DevOps meets meta-learning: A portfolio to rule them all,'' in Proc. ACM/IEEE 22nd Int. Conf. Model Driven Eng. Lang. Syst. Companion (MODELS-C), Sep. 2019, pp. 605–612, doi: 10.1109/MODELS-C.2019.00092.
[36] C. Vuppalapati, A. Ilapakurti, K. Chillara, S. Kedari, and V. Mamidi, ''Automating tiny ML intelligent sensors DevOPS using Microsoft Azure,'' in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2020, pp. 2375–2384, doi: 10.1109/BigData50022.2020.9377755.
[37] Á. L. García, J. M. D. Lucas, M. Antonacci, W. Z. Castell, and M. David, ''A cloud-based framework for machine learning workloads and applications,'' IEEE Access, vol. 8, pp. 18681–18692, 2020, doi: 10.1109/ACCESS.2020.2964386.
[38] C. Wu, E. Haihong, and M. Song, ''An automatic artificial intelligence training platform based on Kubernetes,'' in Proc. 2nd Int. Conf. Big Data Eng. Technol., Jan. 2020, pp. 58–62, doi: 10.1145/3378904.3378921.
[39] G. Fursin, ''Collective knowledge: Organizing research projects as a database of reusable components and portable workflows with common interfaces,'' Phil. Trans. Roy. Soc. A, Math., Phys. Eng. Sci., vol. 379, no. 2197, May 2021, Art. no. 20200211, doi: 10.1098/rsta.2020.0211.
[40] M. Schmitt, ''Airflow vs. Luigi vs. Argo vs. MLFlow vs. KubeFlow,'' Tech. Rep., 2022. [Online]. Available: https://www.datarevenue.com/en-blog/airflow-vs-luigi-vs-argo-vs-mlflow-vs-kubeflow
[41] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, ''The Hadoop distributed file system,'' in Proc. IEEE 26th Symp. Mass Storage Syst. Technol. (MSST), May 2010, pp. 1–10.
[42] S. Ghemawat, H. Gobioff, and S.-T. Leung, ''The Google file system,'' in Proc. 19th ACM Symp. Operating Syst. Princ., Oct. 2003, pp. 29–43.
[43] A. Lakshman and P. Malik, ''Cassandra: A decentralized structured storage system,'' ACM SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35–40, Apr. 2010.
[44] J. C. Corbett, ''Spanner: Google's globally distributed database,'' ACM Trans. Comput. Syst. (TOCS), vol. 31, no. 3, pp. 1–22, 2013.
[45] Y. Liu, Z. Ling, B. Huo, B. Wang, T. Chen, and E. Mouine, ''Building a platform for machine learning operations from open source frameworks,'' IFAC-PapersOnLine, vol. 53, no. 5, pp. 704–709, 2020, doi: 10.1016/j.ifacol.2021.04.161.
[46] G. S. Yoon, J. Han, S. Lee, and J. W. Kim, DevOps Portal Design for SmartX AI Cluster Employing Cloud-Native Machine Learning Workflows, vol. 47. Cham, Switzerland: Springer, 2020, doi: 10.1007/978-3-030-39746-3_54.
[47] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, ''Quantized neural networks: Training neural networks with low precision weights and activations,'' J. Mach. Learn. Res., vol. 18, no. 1, pp. 6869–6898, 2017.
[48] S. Han, H. Mao, and W. J. Dally, ''Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,'' 2015, arXiv:1510.00149.
[49] L. E. Lwakatare, I. Crnkovic, and J. Bosch, ''DevOps for AI—Challenges in development of AI-enabled applications,'' in Proc. Int. Conf. Softw., Telecommun. Comput. Netw. (SoftCOM), Sep. 2020, pp. 1–6, doi: 10.23919/SoftCOM50211.2020.9238323.
[50] C. Renggli, L. Rimanic, N. M. Gürel, B. Karlaš, W. Wu, and C. Zhang, ''A data quality-driven view of MLOps,'' 2021, arXiv:2102.07750.
[51] W. J. van den Heuvel and D. A. Tamburri, Model-Driven ML-Ops for Intelligent Enterprise Applications: Vision, Approaches and Challenges, vol. 391. Cham, Switzerland: Springer, 2020, doi: 10.1007/978-3-030-52306-0_11.
[52] A. Esmaeilzadeh, M. Heidari, R. Abdolazimi, P. Hajibabaee, and M. Malekzadeh, ''Efficient large scale NLP feature engineering with Apache Spark,'' in Proc. IEEE 12th Annu. Comput. Commun. Workshop Conf. (CCWC), Jan. 2022, pp. 274–280.
[53] J. Xu, ''MLOps in the financial industry: Philosophy, practices, and tools,'' in The Future and Fintech: ABCDI and Beyond. Singapore: World Scientific, 2022, p. 451, doi: 10.1142/9789811250903_0014.
[54] F. Carcillo, A. D. Pozzolo, Y.-A. L. Borgne, O. Caelen, Y. Mazzer, and G. Bontempi. SCARFF: A Scalable Framework for Streaming Credit Card Fraud Detection With Spark. Accessed: Feb. 17, 2023. [Online]. Available: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
[55] J. Dhanalakshmi and N. Ayyanathan, ''A dynamic web data extraction from SRLDC (Southern Regional Load Dispatch Centre) and feature engineering using ETL tool,'' in Proc. 2nd Int. Conf. Artif. Intell., Adv. Appl. Springer, 2022, pp. 443–449, doi: 10.1007/978-981-16-6332-1_38.
[56] J. Foster and J. Wagner, ''Naive Bayes versus BERT: Jupyter notebook assignments for an introductory NLP course,'' in Proc. 5th Workshop on Teaching NLP, 2021, pp. 112–114.
[57] J. S. Obeid, M. Davis, M. Turner, S. M. Meystre, P. M. Heider, E. C. O'Bryan, and L. A. Lenert, ''An artificial intelligence approach to COVID-19 infection risk assessment in virtual visits: A case report,'' J. Amer. Med. Inform. Assoc., vol. 27, no. 8, pp. 1321–1325, Aug. 2020.
[58] M. Aljabri, D. M. Alomari, and M. Aboulnour, ''Fake news detection using machine learning models,'' in Proc. 14th Int. Conf. Comput. Intell. Commun. Netw. (CICN), Dec. 2022, pp. 473–477.
[59] L. Baier, N. Kühl, and G. Satzger, ''How to cope with change?—Preserving validity of predictive services over time,'' in Proc. Annu. Hawaii Int. Conf. Syst. Sci., 2019, pp. 1–10.
[60] L. Baier, T. Schlör, J. Schöffer, and N. Kühl, ''Detecting concept drift with neural network model uncertainty,'' 2021, arXiv:2107.01873.
[61] N. Antonio, A. de Almeida, and L. Nunes, ''Predicting hotel bookings cancellation with a machine learning classification model,'' in Proc. 16th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2017, pp. 1049–1054, doi: 10.1109/ICMLA.2017.00-11.
[62] T. Cui, Y. Wang, and B. Namih, ''Build an intelligent online marketing system: An overview,'' IEEE Internet Comput., vol. 23, no. 4, pp. 53–60, Jul. 2019.
[63] L. Baier and S. Seebacher, ''Challenges in the deployment and operation of machine learning in practice,'' in Proc. 27th Eur. Conf. Inf. Syst. (ECIS), Stockholm, Sweden, Jun. 2019, pp. 1–15. [Online]. Available: https://aisel.aisnet.org/ecis2019_rp/163
[64] P. Ruf, M. Madan, C. Reich, and D. Ould-Abdeslam, ''Demystifying MLOps and presenting a recipe for the selection of open-source tools,'' Appl. Sci., vol. 11, no. 19, p. 8861, Sep. 2021.
[65] N. Hewage and D. Meedeniya, ''Machine learning operations: A survey on MLOps tool support,'' 2022, arXiv:2202.10169.
[66] B. Karlaš, M. Interlandi, C. Renggli, W. Wu, C. Zhang, D. M. I. Babu, J. Edwards, C. Lauren, A. Xu, and M. Weimer, ''Building continuous integration services for machine learning,'' in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2020, pp. 2407–2415, doi: 10.1145/3394486.3403290.