Machine Learning Systems: A Survey From A Data-Oriented Perspective
for integrating ML models. Even though papers on deployed ML systems do not mention DOA, their authors made design
decisions that implicitly follow DOA. Implicit decisions create a knowledge gap, limiting the practitioners’ ability to implement
ML-based systems. This paper surveys why, how, and to what extent practitioners have adopted DOA to implement and
deploy ML-based systems. We overcome the knowledge gap by answering these questions and explicitly showing the design
decisions and practices behind these systems. The survey follows a well-known systematic and semi-automated methodology
for reviewing papers in software engineering. The majority of reviewed works partially adopt DOA. Such an adoption enables
systems to address requirements such as Big Data management, low latency processing, resource management, security and
privacy. Based on these findings, we formulate practical advice to facilitate the deployment of ML-based systems.
CCS Concepts: • Software and its engineering; • Computing methodologies → Artificial intelligence; Machine
learning;
Additional Key Words and Phrases: Artificial Intelligence, Machine Learning, Real World Deployment, Systems Architecture,
Data-Oriented Architecture.
ACM Reference Format:
Christian Cabrera, Andrei Paleyes, Pierre Thodoroff, and Neil D. Lawrence. 2023. Machine Learning Systems: A Survey from a
Data-Oriented Perspective. ACM Comput. Surv. 1, 1 (July 2023), 37 pages. [Link]
1 INTRODUCTION
Artificial Intelligence solutions based on ML have gained increasing attention. They have been used
to address challenging problems in several domains (e.g., healthcare, science) [77, 138]. Their success
is a product of the growth of available data, increasingly powerful hardware, and the development of novel
ML models [41, 113]. Many of these models have demonstrated practical value, which has led to their rapid
adoption in real-world software systems. The contrast between real-world environments and the more controlled
environments in which ML models are developed makes it challenging to deploy systems that integrate these models [27].
In particular, real-world environments produce large amounts of complex, dynamic, and sometimes sensitive data,
which systems must process efficiently [25, 65]. ML-based systems deployed in the real world must be scalable,
autonomous, and resource-efficient, while enabling data availability, monitoring, security, and trust [80, 106, 109].
Service-oriented architectures (SOAs) and microservices are the dominant software architectural styles nowa-
days [11, 96]. These styles are characterised by services that separate systems’ functionalities, communicate
through well-defined programming interfaces (i.e., APIs), and are usually deployed in the cloud [134]. The
necessity to process large volumes of data, typical for ML-based real-world systems, leads to new requirements
that challenge SOA-based systems. For example, separation of concerns facilitates system maintenance by
following the divide-and-conquer principle [70, 87]. However, data monitoring and transparency requirements
are arduous to satisfy because services hide the systems’ data behind their APIs [34]. This situation is known as
“The Data Dichotomy”: while high-quality data management requires exposing systems’ data, services hide it
behind their interfaces [131]. Another example is that services deployed in flexible cloud platforms support highly
available and scalable systems [67]. However, cloud deployments struggle to meet low-latency requirements for
critical applications. The physical distance between end users and cloud data centres impacts systems’ end-to-end
response time [28]. Such cloud-based deployments also affect data security, privacy, and trust requirements
because the ownership of the data changes from end users to cloud providers. This change of ownership also
generates availability, privacy and security concerns that can impact ML model training [4, 23, 74].
Data-Oriented Architecture (DOA) is an emerging architectural style that complements the current practices
for implementing and deploying systems. DOA considers data as the common denominator between system
components [65]. Services in DOA are distributed, autonomous, and communicate with each other at the data level
(i.e., data coupling) using asynchronous messages [121, 139]. DOA enables systems to achieve data availability,
reusability, and monitoring, as well as system autonomy and resource efficiency [65, 121, 139]. These properties facilitate
the integration of ML models into production systems. For example, the data coupling between DOA-based
system components means the system stores, by design, the states of the data flowing through it. A practitioner
or a meta-learning system can analyse such records, identify when the system's data distribution changes,
and update the system's learning models. Similarly, DOA enables the creation of open systems because of the
autonomous communication between their components. New devices can automatically join and provide computing
power to DOA-based systems. Such flexibility facilitates the deployment of ML models in resource-constrained
environments and helps meet low-latency requirements, as services are physically closer to end users [24, 133].
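As a minimal illustration of this monitoring capability (not taken from any of the surveyed systems), the following Python sketch replays data records stored by a data-coupled system and flags a change in their distribution with a two-sample Kolmogorov–Smirnov test; the windows, feature values, and threshold are assumptions made only for the example.

```python
# Illustrative sketch: detecting data drift from records stored by a
# data-coupled system. The stored records, windows, and threshold are
# hypothetical; a real system would read them from its data medium.
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(reference: np.ndarray, recent: np.ndarray,
                   p_threshold: float = 0.01) -> bool:
    """Two-sample KS test between a reference window and recently observed data."""
    statistic, p_value = ks_2samp(reference, recent)
    return p_value < p_threshold

# Replay the states of the data that flowed through the system.
reference_window = np.random.normal(loc=0.0, scale=1.0, size=5_000)  # training-time data
recent_window = np.random.normal(loc=0.4, scale=1.0, size=5_000)     # data observed after deployment

if drift_detected(reference_window, recent_window):
    print("Distribution change detected: trigger model retraining.")
```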
Even though most papers on deployed ML-based systems do not describe DOA explicitly, their authors faced
the same challenges that DOA aims to address. Such an implicit adoption creates a knowledge gap that limits the
practitioners’ ability to deploy ML-based systems in real-world conditions. We close this gap in this survey paper
by explicitly showing the design decisions behind implementing and deploying ML-based systems and analysing
why, how and to what extent their authors have implicitly adopted DOA principles. Such an analysis constitutes
the first in-depth study of the DOA architectural style for ML-based systems. Based on the existing literature on DOA,
we formulate high-level principles of this architectural style and contextualise these principles through the ML
deployment challenges. We then conduct a comprehensive survey to understand the implicit adoption of DOA
principles among practitioners when implementing existing ML-based systems1. Based on the survey findings,
this paper shows the benefits and limitations of DOA, formulates practical advice for deploying ML-based systems,
and discusses open challenges and research directions to advance DOA.
2 RELATED WORK
Although DOA as an architectural style for ML-based systems is an emerging idea, the principles behind DOA are
not new. Many domains have discovered and are reaping the benefits of applying similar principles. For instance,
data-oriented design (DOD) applies many DOA principles on a lower level of abstraction, with claims of significant
1 We refer to "ML-based systems in the real world" as software systems that deliver ML-powered capabilities deployed and evaluated in
production environments, with real users, unpredictable physical processes, and unseen, variable data. Such a definition contrasts with isolated
academic settings with fixed datasets, simulated users and environments, and tightly controlled experimental conditions.
improvements over analogous object-oriented programming (OOP) solutions. Video game development is one
domain where DOD is widespread to improve memory and cache utilisation [1]. For example, Coherent Labs
utilised DOD while creating their game engine Hummingbird [91] to overcome the performance limitations of
solutions based on Chromium and WebKit. DOD helped the developers eliminate cache misses and compiler
branch mispredictions, leading to a 6.12x improvement in animation rendering speed. Outside of gaming, Mironov
et al. [83] utilised DOD to improve the performance of a trading strategy backtesting utility. They improved
the parallelisation opportunities and increased performance by 66%. These works illustrate that DOA-like
principles, especially around data prioritisation, emerge at different levels of abstraction with different
motivations and nevertheless bring significant benefits to diverse practitioners. Our work extends these previous
efforts by reviewing DOA principles in the context of ML-based systems at deployment.
A growing body of literature focuses on surveying the application of AI. Table 1 shows how our work compares
against these surveys. One category of current surveys presents use cases of AI algorithms that solve problems in
specific domains. They report the challenges these algorithms face at deployment. For example, Cai et al. [29]
survey multimodal ML techniques applied to the healthcare domain. They review systems for disease analysis,
triage, diagnosis, and treatment. Bohg et al. [19] survey data-driven methodologies for robot grasping. They
focus on techniques based on object recognition and pose estimation for known, familiar, and unknown objects.
The review compares data-driven methodologies and analytics approaches. Qin et al. [110] analyse data-driven
methods for industrial fault detection and diagnosis. This survey focuses on fault detectability and identifiability
methods for industrial processes with different complexities and sizes. Wong et al. [144] present the challenges
of applying deep learning (DL) techniques in radio frequency applications. This work reviews DL applications
in the radio frequency domain from the perspective of data trust, security, and hardware/software issues for
DL deployment in real-world wireless communication applications. Joshi et al. [64] survey the deployment of
deep learning approaches at the edge. Their review presents architectures of deep learning models, enabling
technologies, and adaptation techniques. Their work also describes different metrics for designing and evaluating
deep learning models at the edge. Our survey is domain-agnostic and focuses on papers that report deployed
ML-based systems. In addition, we extend these domain-specific surveys by analysing the design decisions and
software architectural patterns behind the systems the authors report.
Another category of papers reviews software engineering patterns for ML applications. Since DOA is a general
architectural style that builds on multiple design patterns, our survey closely relates to such existing works.
Muccini and Vaidhyanathan [86] emphasise the importance of research on architectural patterns for ML-based
systems and provide high-level research directions. Washizaki et al. [141] focus on a more in-depth discussion
of the patterns for building ML-based systems by reviewing academic literature and engineering blog posts
to identify 33 patterns. DOA directly applies or encourages many of these patterns (e.g., “Kappa Architecture”
and “Modularisation of ML Components”). Chai et al. surveyed challenges that arise at different stages of ML
pipelines and acknowledged data management as one of the crucial challenges for ensuring the high quality
of a pipeline [32]. Giray [53] performed a systematic analysis of software engineering practices for ML-based
systems. Heck [59] conducted a mapping study on data engineering for AI systems, briefly summarising relevant
technical solutions and architectural styles. At a high level, these works describe multiple architectural styles,
patterns and paradigms for designing and developing ML-based systems. They attempt to deepen the community’s
understanding of modern software engineering for ML. We provide further insights by exclusively reviewing
ML-based systems from a data-oriented perspective (i.e., data focus), which is the crucial and distinct aspect of
modern and future data-driven systems. Moreover, none of the papers in the related work perform a detailed
study of a single architectural style or pattern in the context of ML-based systems. In contrast, our work surveys
the adoption level of DOA when deploying ML-based systems. We define the DOA core principles, compare them
against ML challenges at deployment, and provide advice on applying DOA in practice. We quantify the adoption
of DOA principles among already existing ML-based systems by conducting a deep architectural analysis of these works. Our long-term goal is to unify multiple approaches for deploying ML-based systems under the DOA architectural style and to increase its recognition in the software architecture community.

Table 1. Related work compared against this survey. Our work is the only survey that measures an architectural style adoption (i.e., DOA) in the context of ML-based systems at deployment.

Related work | Domain | Focus on Architecture Style | Focus on Deployment | Focus on Data | Focus on Adoption
Cai et al. [29] | Healthcare | No | Yes | No | No
Bohg et al. [19] | Robotics | No | Yes | No | No
Qin et al. [110] | Industrial processes | No | Yes | No | No
Wong et al. [144] | Radio frequency | No | Yes | No | No
Joshi et al. [64] | Edge computing | No | Yes | No | No
Muccini and Vaidhyanathan [86] | ML software architecture | Yes | Yes | No | No
Washizaki et al. [141] | ML engineering | Yes | No | No | No
Chai et al. [32] | ML pipelines | Yes | No | No | No
Giray [53] | ML engineering | Yes | Yes | No | No
Heck [59] | Data engineering | No | Yes | Yes | Yes
This work | ML-based Systems | Yes | Yes | Yes | Yes
[Figure 1 depicts the ML workflow stages (data management: data collection, data augmentation, data preprocessing, data analysis; model learning: model selection, hyper-parameter selection, model training; model verification: requirements encoding, formal verification, test-based verification; model deployment: integration, composition, monitoring; cross-cutting aspects: ethics, security, law) connected to the DOA principles (data-driven, data coupling, every entity stores data chunks; prioritise decentralisation: local first, peer-to-peer first; every entity is autonomous).]
Fig. 1. Map between ML Workflow Challenges at Deployment [106] and DOA principles. The left side shows the ML
challenges at deployment, and the right side shows the DOA principles. The links between them represent which principles
support addressing the respective challenges.
Data-Oriented Architecture (DOA) is an emerging architectural style that can help us overcome these challenges
when implementing and deploying ML-based systems. DOA proposes a set of principles for implementing data-
oriented software: considering data as a first-class citizen, decentralisation as a priority, and openness [65, 74,
121, 139]. Figure 1 maps ML deployment challenges and DOA principles addressing them. The left side of the
figure shows the challenges of the ML workflow at deployment discussed by Paleyes et al. [106], while the right
side shows the DOA principles we have extracted from the literature [65, 74, 121, 139]. This section introduces
each DOA principle and discusses how they support the deployment of ML-based systems in the real world.
example, they learn from non-quadratic, non-convex, and high-dimensional data sets [54, 66, 97]. Hyperparameters
improve the efficiency of the training process and the accuracy of the learning models [55]. However, selecting
these hyperparameters is also a resource-demanding optimisation problem [17, 99]. In addition, expensive and
physically centralised deployments can fail to meet low-latency requirements from critical applications. The
physical distance between end users and data centres impacts systems’ end-to-end response time [28]. Data
ownership, security, and trust requirements are also affected in centralised systems. The ownership of the data
changes from end users to service providers, which can entail privacy and security issues as end users are unaware
of where and how their data is stored, processed, and used [4]. Recent research shows that Large Language
Models (LLMs) create risks of leaking sensitive information from the model’s training data. Adversarial attacks
or prompt engineering strategies can extract personal information from LLMs [42].
The simplest way to deliver a shared data model (Principle 3.1) would seem to be to centralise it, but, in
practice, scalability, resource constraints, low latency, and security requirements mean that in DOA we prioritise
decentralisation [74, 139]. Such decentralisation should be logical and physical. Logical decentralisation enables
organisations to scale when developing ML-based systems, as different development teams focus on smaller
systems’ components. Logical decentralisation is a way to achieve a clear separation of concerns, avoid central
bottlenecks in the computational process, and increase flexibility in system development. Physical decentralisation
enables the deployment of ML models in constrained environments without expensive computational resources
and improves privacy and ownership of data [74]. Practitioners should deploy components of ML-based systems
as decentralised entities that store data chunks of the shared information model described in the previous principle.
These entities first perform their operations with their local resources (i.e. local first, [74]). If local resources (i.e.,
data, computing time, or storage) are insufficient, entities can connect temporarily with other participants to
share resources. Entities first scan their local environment for potential resources they need. They prioritise
interactions with nodes in the close vicinity to share or ask for data and computing resources (i.e., peer-to-peer
first). Centralised servers are fallback mechanisms [139]. This principle facilitates data management, enabling the
system’s data replication by design because different entities can store the same data chunk. Such replication
provides data availability because if one entity fails, its information is not lost. Similarly, replication provides
scalability as different entities can respond to concurrent data requests [139]. Prioritised decentralisation also
alleviates the high demand for resources of ML-based systems and reduces the inference latency. ML-based
systems can perform data-related tasks (e.g., ML model training) on smaller data sets that are partitioned by design,
and trained models can be deployed closer to end users to provide faster responses [24]. In addition, decentralisation
creates a flexible ecosystem where we can use resources from different devices on demand. This DOA principle
advocates for a more sustainable and democratic approach that prioritises the computational power available in
everyday devices over expensive cloud resources. It is important to note that logical or physical decentralisation
is not always necessary. For example, there is no need to perform local first computations when centralised
computational resources are available. However, data replication, partitioned data sets, and flexible resource
management are DOA-enabled properties that can still benefit even partial decentralisation of ML-based systems.
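The following Python sketch illustrates the local-first, peer-to-peer-first resolution order described above; the Entity class, its fields, and fetch_from_central are hypothetical names used only for illustration and are not part of any DOA specification or surveyed system.

```python
# Minimal sketch of the "local first, peer-to-peer first" resolution order.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    local_chunks: dict = field(default_factory=dict)   # locally stored data chunks
    peers: list = field(default_factory=list)          # entities in the close vicinity

    def resolve(self, chunk_id: str):
        # 1. Local first: use resources already held by this entity.
        if chunk_id in self.local_chunks:
            return self.local_chunks[chunk_id]
        # 2. Peer-to-peer first: ask nearby entities before any central server.
        for peer in self.peers:
            if chunk_id in peer.local_chunks:
                return peer.local_chunks[chunk_id]
        # 3. Centralised servers are only a fallback mechanism.
        return fetch_from_central(chunk_id)

def fetch_from_central(chunk_id: str):
    raise RuntimeError(f"fallback to a central store for {chunk_id!r} (not implemented here)")

# Example: two entities replicate the same chunk, providing availability.
a = Entity("edge-a", local_chunks={"sensor-2024-01": [1, 2, 3]})
b = Entity("edge-b", peers=[a])
print(b.resolve("sensor-2024-01"))   # served by the nearby peer, no central call
```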
3.3 Openness
Developing ML components often involves automating various stages of data processing, both at training
and at inference time. The amounts of data ML-based systems manage, the complex processes they perform, and the
fact that their users are usually experts in domains different from ML (e.g. healthcare, physics, etc.) [140] demand
such automation. AutoML is a recent research area that aims to automate the ML training life cycle. AutoML
approaches include data augmentation, model selection, and hyperparameter optimisation [43, 137]. Nevertheless,
implementing and deploying ML-based systems in real-world environments sets automation requirements for
adopting ML models that AutoML does not cover. Real-world environments are dynamic, and deployed systems
must respond autonomously to changing goals, variable data, unexpected failures, security threats, complex
uncertainties, etc. Autonomous responses are required because practitioners do not have complete control over
such deployments [98], their complexity usually exceeds the limit of human comprehension [101], and human
intervention is not feasible [25]. Current ML-based systems rely on the interaction of heterogeneous components.
Static systems architectures usually pre-define the components and devices that compose a system, and third-party
entities (e.g., central controllers) usually integrate, orchestrate, and maintain these components to satisfy
end-to-end system quality requirements [26]. The scale and dynamic nature of real-world deployments demand
autonomous solutions and challenge these static and centralised architectures.

Fig. 2. The survey process depicts the steps from the review motivation to the full-text reading of the selected papers. This process relies on the methodology proposed by Kitchenham et al. [72, 73].
Autonomous and independent software components are the building blocks of systems that can adapt and
respond to uncertain environments without human intervention [143]. DOA proposes openness as a principle
that enables designing and implementing such autonomous systems’ components in open environments where
they interact autonomously [65]. DOA represents components as autonomous and asynchronous entities that
communicate with each other using a common message exchange protocol. Systems can leverage such environments
to integrate ML models because their decentralised components can autonomously perform the integration,
composition, monitoring, and adaptation tasks. Asynchronous entities produce their outputs and can subscribe
to inputs at any time [121]. These steps are transparent, favouring data trust and traceability. Similarly, the
system’s components are autonomous in deciding which data to store, which to make public, and which to hide
for security and privacy [139]. The message exchange protocol between asynchronous components replaces
interface dependencies in service-oriented or microservices architectures with asynchronous messages between
data producers and consumers. Asynchronous communication protocols enable data coupling, considered the
loosest form of coupling, which in turn supports scalability and low-latency requirements [93, 94].
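A minimal sketch of such data coupling is shown below, assuming a toy in-memory topic bus that stands in for a real message broker; the entity names, topics, and readings are illustrative and do not correspond to any surveyed system.

```python
# Sketch of data coupling through asynchronous messages: entities never call
# each other's interfaces; they publish to and subscribe from named topics.
import asyncio
from collections import defaultdict

class Bus:
    """A toy in-memory topic bus; a real system would use a broker (e.g., Kafka or MQTT)."""
    def __init__(self):
        self.topics = defaultdict(list)             # topic name -> subscriber queues

    def subscribe(self, topic: str) -> asyncio.Queue:
        queue = asyncio.Queue()
        self.topics[topic].append(queue)
        return queue

    async def publish(self, topic: str, message) -> None:
        for queue in self.topics[topic]:            # producers never address consumers directly
            await queue.put(message)

async def sensor(bus: Bus) -> None:
    for reading in (20.1, 20.4, 21.0):
        await bus.publish("readings", reading)      # emits data without knowing who consumes it

async def predictor(bus: Bus, inbox: asyncio.Queue) -> None:
    for _ in range(3):
        reading = await inbox.get()
        await bus.publish("predictions", reading + 0.5)   # stand-in for model inference

async def main() -> None:
    bus = Bus()
    inbox = bus.subscribe("readings")               # consumers declare interest in data, not in producers
    monitor = bus.subscribe("predictions")          # a monitor can tap the same data flow at any time
    await asyncio.gather(sensor(bus), predictor(bus, inbox))
    while not monitor.empty():
        print("prediction:", monitor.get_nowait())

asyncio.run(main())
```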
4 SURVEY METHODOLOGY
This survey assesses why, how, and to what extent practitioners have adopted the DOA principles when imple-
menting and deploying ML systems and the motivation and methods behind such adoption. We survey research
works that have used ML to solve problems in different domains and have been deployed and tested in real-world
settings. We are particularly interested in works that report the software architectures behind these systems and,
if possible, the authors’ design decisions toward such architectures.
The selection of relevant works for our survey is not straightforward because many papers that apply ML in
different domains have been published in recent years, making manual selection unfeasible. For this reason, we
developed a semi-automatic framework based on a well-known methodology for systematic literature reviews
(SLRs) in software engineering [72, 73]. The framework is publicly available on GitHub3 to allow the reproducibility
of this work, as well as its reuse in other surveys. The framework queries the search
APIs from different digital libraries to automatically retrieve papers’ metadata (e.g., title, abstract, and citations).
It then applies syntactic and semantic filters over the retrieved records to reduce the search space. Then, we
3 Semi-automatic Literature Survey: [Link]
Table 2. Search query format used to retrieve papers. The query is composed of three search terms in conjunction. The values
in the second column replace each search term.
Table 3. Synonyms to extend queries. Search terms in the Word column are expanded using their respective synonyms.
Word | Synonyms
"health" | "healthcare", "health care", "medicine", "medical", "diagnosis"
"industry" | "industry 4", "manufacture", "manufacturing", "factory", "manufactory", "industrial"
"smart cities" | "sustainable city", "smart city", "digital city", "urban", "city", "cities", "mobility", "transport", "transportation system"
"multimedia" | "virtual reality", "augmented reality", "3D", "digital twin", "video games", "video", "image recognition", "audio", "speech recognition", "speech"
"science" | "physics", "psychology", "chemistry", "biology", "geology", "social", "maths", "materials", "astronomy", "climatology", "oceanology", "space"
"autonomous vehicle" | "self-driving vehicle", "self-driving car", "autonomous car", "driverless car", "driverless vehicle", "unmanned car", "unmanned vehicle", "unmanned aerial vehicle"
"networking" | "computer network", "intranet", "internet", "world wide web"
"e-commerce" | "marketplace", "electronic commerce", "shopping", "buying"
"robotics" | "robot"
"finance" | "banking"
"machine learning" | "ML", "deep learning", "neural network", "reinforcement learning", "supervised learning", "unsupervised learning", "artificial intelligence", "AI"
"deploy" | "deployment", "deployed", "implemented", "implementation", "software"
"real world" | "reality", "real", "physical world"
manually filtered the list of works to select the papers to survey. Figure 2 depicts the stages of the survey process.
It has two principal stages described in the next section.
Table 4. We used the following categories and keywords for the Lbl2Vec algorithm [119].
Category | Keywords
"system" | "architecture", "framework", "platform", "tool", "prototype"
"software" | "develop", "engineering", "methodology", "architecture", "design", "implementation", "open", "source", "application"
"deploy" | "production", "real", "world", "embedded", "physical", "cloud", "edge", "infrastructure"
"simulation" | "synthetic", "simulate"
b. Research questions: The main research question we want to answer with this survey is: To what extent
and how have researchers adopted the principles of Data-Oriented Architecture (DOA) in the implementation
and deployment of modern ML-based systems, and what are the motivations behind such adoption? We split
this question into three complementary research questions to determine the level and way of adoption
of each DOA principle in deployed ML-based systems: data as a first-class citizen (RQ1), prioritise
decentralisation (RQ2), and openness (RQ3). The answers to these questions will enable us to identify
the research gaps and inform the development of the next generation of DOAs. Our long-term goal is to
establish DOA as a mature and competitive architectural style for designing, developing, implementing,
deploying, monitoring, and adapting ML-based systems.
c. Search terms: We want to search for papers presenting ML-based systems deployed in real-world environ-
ments to answer the research questions above. Table 2 shows the query format and the search terms we
use to retrieve such papers. The query has three search terms in conjunction (i.e., AND operator). The first
term refers to popular domains that have applied ML. The second term filters papers that apply machine
learning in these domains, and the third term filters papers that deploy their solution in the real world.
Search engines in scientific databases use different matching algorithms. Some engines search for exact
words in the papers’ attributes (e.g., title or abstract), which can be too restrictive. We expand these queries
by including synonyms for the search terms (Table 3). Synonyms extend queries using inclusive disjunction
(i.e. OR operator).
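As an illustration only, the snippet below assembles such a query as a Boolean string: each search term is expanded into a disjunction of its synonyms, and the groups are joined in conjunction. The synonym lists are abridged from Table 3, and the exact query syntax accepted by each digital library's API will differ.

```python
# Illustrative query construction: three search terms in conjunction, each
# expanded into a disjunction of its synonyms (abridged from Table 3).
SYNONYMS = {
    "health": ["healthcare", "health care", "medicine", "medical", "diagnosis"],
    "machine learning": ["ML", "deep learning", "neural network",
                         "artificial intelligence", "AI"],
    "deploy": ["deployment", "deployed", "implemented", "implementation", "software"],
    "real world": ["reality", "real", "physical world"],
}

def or_group(words: list[str]) -> str:
    terms = []
    for word in words:
        terms += [word] + SYNONYMS.get(word, [])
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"

def build_query(domain: str) -> str:
    # domain AND machine learning AND deployment in the real world
    return " AND ".join([or_group([domain]),
                         or_group(["machine learning"]),
                         or_group(["deploy", "real world"])])

print(build_query("health"))
```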
d. Source selection: We search for papers in the most popular scientific repositories using their APIs. They are
IEEEXplore4 , Springer Nature5 , Scopus6 , Semantic Scholar7 , CORE8 , and ArXiv9 . Some repositories, such
as the ACM digital library, could not be used as they do not provide an API to query. Nevertheless, we
are confident in the sufficient coverage of our search because of the significant overlap with other sources
(papers can be published in multiple libraries or indexed by meta-repositories such as Semantic Scholar).
Table 5. Number of papers selected at each stage of the systematic literature review process from each scientific repository.
respective API expects, submits the request, and stores the search results in separate CSV files. The search
results are the metadata of the retrieved papers (e.g., title, abstract, publication data).
b. Preprocessing of Retrieved Data: Each API provides papers’ metadata in a different format. There are
duplicated papers between repositories, and some records can be incomplete (e.g., a paper missing an
abstract). The preprocessing step prepares the data for the following steps in our semi-automated framework.
It joins the papers’ metadata in a single file, cleaning the data and removing repeated and incomplete
records. The process selected a total of 34,931 works after this step.
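A sketch of this preprocessing step is shown below, assuming per-repository CSV exports with title and abstract columns; the file and column names are illustrative, since each API labels its metadata differently.

```python
# Merge per-repository exports, drop incomplete records, and de-duplicate by title.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("results/*.csv")]
papers = pd.concat(frames, ignore_index=True)

papers = papers.dropna(subset=["title", "abstract"])           # remove incomplete records
papers["title_key"] = papers["title"].str.lower().str.strip()
papers = papers.drop_duplicates(subset="title_key")            # papers retrieved from several repositories

papers.to_csv("all_papers.csv", index=False)
print(f"{len(papers)} unique records after preprocessing")
```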
c. Automatic Filtering: All the data of the retrieved papers are stored in a single file after the preprocessing
step. However, the number of papers is still too large for manual processing. We reduce the search space
by applying two filters. A syntactic filter selects the papers whose abstracts discuss real-world
deployments according to our definition of "ML systems in the real world". In particular, this filter searches
for the "real world" and "deploy" terms and their synonyms (Table 3) in the papers' abstracts. We classified
the selected papers into four categories after the syntactic filtering. The first category includes the papers
that present architectures of deployed ML-based systems, the second category groups papers that present
software engineering approaches to build ML-based systems in practice, and the third category represents
papers that show physical implementations (e.g., edge architectures) of ML-based systems with a special
focus on the infrastructure. The final category includes papers that evaluate ML algorithms and systems
based on synthetic data and simulated environments. We used an unsupervised learning algorithm, Lbl2Vec,
to semantically classify the selected papers into these four groups, following the work proposed by Schopf et
al. [119]. Lbl2Vec requires as inputs a set of texts to classify (i.e., the selected papers' abstracts) and a set of
predefined categories (Table 4). The algorithm assigns papers to the most relevant category. We use this
semantic classification to select the papers that belong to the first three categories (i.e., system, software,
and deploy). These filters produced a total of 5,559 papers.
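The following sketch illustrates the syntactic part of this filtering, assuming the preprocessed metadata file from the previous step; the term list is abridged from Table 3, and the semantic Lbl2Vec classification is not reproduced here.

```python
# Keep papers whose abstract mentions deployment- or real-world-related terms.
import pandas as pd

DEPLOYMENT_TERMS = ["real world", "reality", "physical world", "deploy",
                    "deployment", "deployed", "implemented", "implementation"]

papers = pd.read_csv("all_papers.csv")
abstracts = papers["abstract"].str.lower()
mask = abstracts.apply(lambda text: any(term in text for term in DEPLOYMENT_TERMS))
print(f"{mask.sum()} of {len(papers)} papers pass the syntactic filter")
papers[mask].to_csv("syntactic_filtered.csv", index=False)
```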
d. Manual filtering: The automatic filters in the previous step reduce the set of papers to a more feasible
number to explore manually. In this step, our framework shows the paper information to the user in a
centralised interface where papers are marked as included or excluded. Such manual selection has two
stages following the methodology defined by Kitchenham et al. [72, 73]. We select the papers by reading
their abstracts in the first stage. Then, we filter the works by skimming the full text. The skimming process
includes assessing the papers’ quality. According to the goal of our survey, we focused on papers whose
content introduces the software architectures behind ML-based systems and reports the results of their
deployment in the real world. We discarded the works that did not present these two attributes. One
researcher performed the manual filtering, which produced 101 papers.
e. Snowballing: We use the API from Semantic Scholar to retrieve metadata of the papers that cite the selected
works from the previous stage. Our tool preprocesses and filters (i.e., syntactically and semantically) the
citing papers and adds the ones that pass these filters to the final set of selected papers. This process
added 2 more papers to the selection, resulting in 103 papers for full-text reading.
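A sketch of this citation retrieval is shown below, using the public Semantic Scholar Graph API; the endpoint, parameters, and response fields follow its documentation at the time of writing and should be checked against the current version, and the paper identifier is a hypothetical placeholder.

```python
# Retrieve papers citing a selected work via the Semantic Scholar Graph API.
import requests

API = "https://api.semanticscholar.org/graph/v1/paper/{paper_id}/citations"

def citing_papers(paper_id: str, limit: int = 100) -> list[dict]:
    response = requests.get(API.format(paper_id=paper_id),
                            params={"fields": "title,abstract", "limit": limit},
                            timeout=30)
    response.raise_for_status()
    # Each entry wraps the citing paper's metadata under the "citingPaper" key.
    return [entry["citingPaper"] for entry in response.json().get("data", [])]

selected_ids = ["DOI:10.1000/example-doi"]   # hypothetical identifiers of already selected papers
for paper_id in selected_ids:
    for citing in citing_papers(paper_id):
        print(citing.get("title"))
```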
f. Final Selection: Two researchers read the 103 papers, selected 45 papers, and extracted data from these to
answer our survey research questions. We agreed on the selection criteria based on two aspects. First, we
considered the papers’ relevance by identifying if they introduce the architectures behind the ML-based
systems and report results from real-world deployments. Second, we use the DOA principles definition to
determine to what extent these papers adopt the DOA principles (Section 3). We used a data collection
spreadsheet10 to analyse and extract the data from the 45 selected papers. The spreadsheet has each paper’s
metadata (i.e., ID, URL, and title) and columns each researcher must fill out according to the paper analysis.
We use the "Quick Summary" column to describe the research work, its domain, problem, solution, and
reasons for inclusion or exclusion. The "Real-World Deployment" column describes how authors deploy
the selected works in the real world. We use the "Data as First Class Citizen" (i.e., RQ1), "Decentralised
Architecture" (i.e., RQ2), and "Open Architecture" (i.e., RQ3) columns to describe the extent to which and how
these papers follow the DOA principles. We use the "Data Flow Design" and "Data Coupling" columns to evaluate whether
the authors design systems with a focus on data flowing between their components (i.e., data-first) and whether
such components communicate through data mediums (i.e., decentralisation and openness). For each of
these columns, we defined three possible outcomes:
– "YES" meaning the reviewed paper fully follows the evaluated principle. For example, a system fully
follows the data as a first-class citizen principle if the system is data-driven, its components share a data
model, and they communicate through such a data model (i.e., data coupling).
– "PARTIAL" meaning the reviewed paper partially follows the evaluated principle. For example, a system
partially prioritises decentralisation if it stores and processes data chunks locally but does not have
peer-to-peer communication with components in the same layer.
– "NO" meaning the reviewed paper does not follow the evaluated principle in any way. For example, a
system does not follow the openness principle when it does not implement autonomous and asynchronous
entities exchanging messages between them.
Each researcher justifies the outcome for each column and paper in the data collection spreadsheet. In
addition, the spreadsheet also has a column "Comments" to annotate extra information researchers find
relevant for our study. We addressed selection and data extraction conflicts in group discussions. Section 5
presents the survey of the 45 papers using a descriptive and quantitative synthesis of the results we
extracted in the spreadsheet.
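As a small illustration of this quantitative synthesis (not the authors' actual scripts), the snippet below tallies the YES/PARTIAL/NO outcomes per principle from an exported copy of the spreadsheet; the file and column names mirror the description above and are otherwise assumptions.

```python
# Tally YES / PARTIAL / NO outcomes per DOA principle from the shared spreadsheet.
import pandas as pd

sheet = pd.read_csv("data_collection.csv")          # exported copy of the data collection spreadsheet
principles = ["Data as First Class Citizen", "Decentralised Architecture", "Open Architecture"]

for principle in principles:
    counts = sheet[principle].str.upper().value_counts()
    print(principle, counts.to_dict())
```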
g. Exclusion criteria: We exclude the following works during the different stages of our semi-automatic process:
– Papers that introduce ML-based systems without deployment and evaluation in the real world or a
production environment.
– Papers that present ML-based systems but do not report their software architecture.
– Papers that report experiments on synthetic data or simulated environments where ML-based systems
do not interact with actual data, physical entities, or people.
– Papers presenting isolated ML algorithms not part of larger systems.
– Papers with missing metadata that our framework cannot analyse (e.g., a paper without an abstract).
– Duplicated papers.
– Papers not written in English.
Table 6. All reviewed papers in our survey. This table shows the extent to which, and in which papers, practitioners adopt (fully or partially) each of the DOA sub-principles discussed in Section 3.
[Table 6 columns: Research work | Data driven | Shared data model | Data coupling | Local data chunks | Local first | Peer-to-peer first | Autonomous entities | Asynchronous entities | Message exchange protocol]