
Machine Learning Systems:

A Survey from a Data-Oriented Perspective

CHRISTIAN CABRERA, ANDREI PALEYES, PIERRE THODOROFF, and NEIL D. LAWRENCE,


Department of Computer Science and Technology, University of Cambridge, United Kingdom
Engineers are deploying ML models as parts of real-world systems with the upsurge of AI technologies. Real-world envi-
ronments challenge the deployment of such systems because these environments produce large amounts of heterogeneous
data, and users require increasingly efficient responses. These requirements push prevalent software architectures to the
limit when deploying ML-based systems. Data-oriented Architecture (DOA) is an emerging style that equips systems better
for integrating ML models. Even though papers on deployed ML systems do not mention DOA, their authors made design
decisions that implicitly follow DOA. Implicit decisions create a knowledge gap, limiting the practitioners’ ability to implement
ML-based systems. This paper surveys why, how, and to what extent practitioners have adopted DOA to implement and
deploy ML-based systems. We overcome the knowledge gap by answering these questions and explicitly showing the design
decisions and practices behind these systems. The survey follows a well-known systematic and semi-automated methodology
for reviewing papers in software engineering. The majority of reviewed works partially adopt DOA. Such an adoption enables
systems to address requirements such as Big Data management, low latency processing, resource management, security and
privacy. Based on these findings, we formulate practical advice to facilitate the deployment of ML-based systems.
CCS Concepts: • Software and its engineering; • Computing methodologies → Artificial intelligence; Machine
learning;
Additional Key Words and Phrases: Artificial Intelligence, Machine Learning, Real World Deployment, Systems Architecture,
Data-Oriented Architecture.
ACM Reference Format:
Christian Cabrera, Andrei Paleyes, Pierre Thodoroff, and Neil D. Lawrence. 2023. Machine Learning Systems: A Survey from a
Data-Oriented Perspective. ACM Comput. Surv. 1, 1 (July 2023), 37 pages. [Link]

1 INTRODUCTION
Artificial Intelligence solutions based on ML have gained an increased level of attention. They have been used
to address challenging problems in several domains (e.g., healthcare, science, etc.) [77, 138]. Their success
is a product of the growth of available data, increasingly powerful hardware, and the development of novel
ML models [41, 113]. Many of these models have demonstrated practical value, which has led to their rapid
adoption in real-world software systems. The contrast between real-world environments and the more controlled
environments from which ML models originate challenges deploying systems that integrate these ML models [27].
In particular, real-world environments produce large amounts of complex, dynamic, and sometimes sensitive data,
which systems must process efficiently [25, 65]. ML-based systems deployed in the real world must be scalable,
autonomous, and resource-efficient, while enabling data availability, monitoring, security, and trust [80, 106, 109].
Authors’ address: Christian Cabrera; Andrei Paleyes; Pierre Thodoroff; Neil D. Lawrence, {chc79,ap2169,pt440,ndl21}@[Link],
Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
permissions@[Link].
© 2023 Association for Computing Machinery.
0360-0300/2023/7-ART $15.00
[Link]


Service-oriented architectures (SOAs) and microservices are the dominant software architectural styles nowa-
days [11, 96]. These styles are characterised by services that separate systems’ functionalities, communicate
through well-defined programming interfaces (i.e., APIs), and are usually deployed in the cloud [134]. The
necessity to process large volumes of data, typical for ML-based real-world systems, leads to new requirements
that challenge SOA-based systems. For example, separation of concerns facilitates systems maintenance by
following the divide-and-conquer principle [70, 87]. However, data monitoring and transparency requirements
are arduous to satisfy because services hide the systems’ data behind their APIs [34]. This situation is known as
“The Data Dichotomy”: while high-quality data management requires exposing systems’ data, services hide it
behind their interfaces [131]. Another example is that services deployed in flexible cloud platforms support highly
available and scalable systems [67]. However, cloud deployments struggle to meet low-latency requirements for
critical applications. The physical distance between end users and cloud data centres impacts systems’ end-to-end
response time [28]. Such cloud-based deployments also affect data security, privacy, and trust requirements
because the ownership of the data changes from end users to cloud providers. This change of ownership also
generates availability, privacy and security concerns that can impact ML model training [4, 23, 74].
Data-Oriented Architecture (DOA) is an emerging architectural style that complements the current practices
for implementing and deploying systems. DOA considers data as the common denominator between system
components [65]. Services in DOA are distributed, autonomous, and communicate with each other at the data level
(i.e., data coupling) using asynchronous messages [121, 139]. DOA enables systems to achieve data availability,
reusability, monitoring, and systems autonomy and resource efficiency [65, 121, 139]. These properties facilitate
systems integrating ML models in production environments. For example, the data coupling between DOA-based
system components enables the storage of the states of the data that flows through it by design. A practitioner
or a meta-learning system can analyse such records, identify when the system’s data distribution changes,
and update the system’s learning models. Similarly, DOA enables the creation of open systems because of the
autonomous communication between its components. New devices can automatically join and provide computing
power to DOA-based systems. Such flexibility facilitates the deployment of ML models in resource-constrained
environments and can mitigate low-latency requirements as services are physically closer to end users [24, 133].
Even though most papers on deployed ML-based systems do not describe DOA explicitly, their authors faced
the same challenges that DOA aims to address. Such an implicit adoption creates a knowledge gap that limits the
practitioners’ ability to deploy ML-based systems in real-world conditions. We close this gap in this survey paper
by explicitly showing the design decisions behind implementing and deploying ML-based systems and analysing
why, how and to what extent their authors have implicitly adopted DOA principles. Such an analysis constitutes
the first in-depth study of DOA architectural style for ML-based systems. Based on existing literature on DOA,
we formulate high-level principles of this architectural style and contextualise these principles through the ML
deployment challenges. We then conduct a comprehensive survey to understand the implicit adoption of DOA
principles among practitioners when implementing existing ML-based systems1 . Based on the survey findings,
this paper shows the benefits and limitations of DOA, formulates practical advice for deploying ML-based systems,
and discusses open challenges and research directions to advance DOA.

2 RELATED WORK
Although DOA as an architectural style for ML-based systems is an emerging idea, the principles behind DOA are
not new. Many domains have discovered and are reaping the benefits of applying similar principles. For instance,
data-oriented design (DOD) applies many DOA principles on a lower level of abstraction, with claims of significant

1 We refer to "ML-based systems in the real world" as software systems that deliver ML-powered capabilities deployed and evaluated in
production environments, with real users, unpredictable physical processes, and unseen, variable data. Such a definition contrasts with isolated
academic settings with fixed datasets, simulated users and environments, and tightly controlled experimental conditions.


improvements over analogous object-oriented programming (OOP) solutions. Video game development is one
domain where DOD is widespread to improve memory and cache utilisation [1]. For example, Coherent Labs
utilised DOD while creating their game engine Hummingbird [91] to overcome the performance limitations of
solutions based on Chromium and WebKit. DOD helped the developers eliminate cache misses and compiler
branch mispredictions, leading to a 6.12x improvement in animation rendering speed. Outside of gaming, Mironov
et al. [83] utilised DOD to improve the performance of a trading strategy backtesting utility. They improved
the parallelisation opportunities and increased performance by 66%. These works illustrate that DOA-like
principles, especially around data prioritisation, emerge at a different level of abstraction with different
motivations and nevertheless bring significant benefits to diverse practitioners. Our work extends these previous
efforts by reviewing DOA principles in the context of ML-based systems at deployment.
A growing body of literature focuses on surveying the application of AI. Table 1 shows how our work compares
against these surveys. One category of current surveys presents use cases of AI algorithms that solve problems in
specific domains. They report the challenges these algorithms face at deployment. For example, Cai et al. [29]
survey multimodal ML techniques applied to the healthcare domain. They review systems for disease analysis,
triage, diagnosis, and treatment. Bohg et al. [19] survey data-driven methodologies for robot grasping. They
focus on techniques based on object recognition and pose estimation for known, familiar, and unknown objects.
The review compares data-driven methodologies and analytics approaches. Qin et al. [110] analyse data-driven
methods for industrial fault detection and diagnosis. This survey focuses on fault detectability and identifiability
methods for industrial processes with different complexities and sizes. Wong et al. [144] present the challenges
of applying deep learning (DL) techniques in radio frequency applications. This work reviews DL applications
in the radio frequency domain from the perspective of data trust, security, and hardware/software issues for
DL deployment in real-world wireless communication applications. Joshi et al. [64] survey the deployment of
deep learning approaches at the edge. Their review presents architectures of deep learning models, enabling
technologies, and adaptation techniques. Their work also describes different metrics for designing and evaluating
deep learning models at the edge. Our survey is domain-agnostic and focuses on papers that report ML-based
systems at deployment. In addition, we extend specific domain surveys by analysing the design decisions and
software architectural patterns behind the systems the authors report.
Another category of papers reviews software engineering patterns for ML applications. Since DOA is a general
architectural style that builds on multiple design patterns, our survey closely relates to such existing works.
Muccini and Vaidhyanathan [86] emphasise the importance of research on architectural patterns for ML-based
systems and provide high-level research directions. Washizaki et al. [141] focus on a more in-depth discussion
of the patterns for building ML-based systems by reviewing academic literature and engineering blog posts
to identify 33 patterns. DOA directly applies or encourages many of these patterns (e.g., “Kappa Architecture”
and “Modularisation of ML Components”). Chai et al. surveyed challenges that arise at different stages of ML
pipelines and acknowledged data management as one of the crucial challenges for ensuring the high quality
of a pipeline [32]. Giray [53] performed a systematic analysis of software engineering practices for ML-based
systems. Heck [59] conducted a mapping study on data engineering for AI systems, briefly summarising relevant
technical solutions and architectural styles. At a high level, these works describe multiple architectural styles,
patterns and paradigms for designing and developing ML-based systems. They attempt to deepen the community’s
understanding of modern software engineering for ML. We provide further insights by exclusively reviewing
ML-based systems from a data-oriented perspective (i.e., data focus), which is the crucial and distinct aspect of
modern and future data-driven systems. Moreover, none of the papers in the related work perform a detailed
study of a single architectural style or pattern in the context of ML-based systems. In contrast, our work surveys
the adoption level of DOA when deploying ML-based systems. We define the DOA core principles, compare them
against ML challenges at deployment, and provide advice on applying DOA in practice. We quantify the adoption
of DOA principles among already existing ML-based systems by conducting a deep architectural analysis of


Table 1. Related work compared against this survey. Our work is the only survey that measures an architectural style
adoption (i.e., DOA) in the context of ML-based systems at deployment.

Related work                     Domain                    Focus on     Focus on   Focus on             Focus on
                                                           Deployment   Data       Architecture Style   Adoption
Cai et al. [29]                  Healthcare                No           Yes        No                   No
Bohg et al. [19]                 Robotics                  No           Yes        No                   No
Qin et al. [110]                 Industrial processes      No           Yes        No                   No
Wong et al. [144]                Radio frequency           No           Yes        No                   No
Joshi et al. [64]                Edge computing            No           Yes        No                   No
Muccini and Vaidhyanathan [86]   ML software architecture  Yes          Yes        No                   No
Washizaki et al. [141]           ML engineering            Yes          No         No                   No
Chai et al. [32]                 ML pipelines              Yes          No         No                   No
Giray [53]                       ML engineering            Yes          Yes        No                   No
Heck [59]                        Data engineering          No           Yes        Yes                  Yes
This work                        ML-based Systems          Yes          Yes        Yes                  Yes

these works. Our long-term goal is to unify multiple approaches for deploying ML-based systems under the DOA
architectural style and to increase its recognition in the software architecture community.

3 DATA-ORIENTED ARCHITECTURES AND ML CHALLENGES


Service-oriented architecture (SOA) is the prevalent architectural style to design and implement modern software.
It defines services as individual components that expose their functionalities through interfaces (APIs) and
interact via calls to these interfaces [131]. Modularity enables the assignment of systems components to different
teams, allows flexible scaling, and facilitates maintainability. Independent departments in the organisation focus
on specific services instead of the whole system because of the well-defined interfaces between the parts of
the system [70, 87]. SOAs usually deploy services and microservices in cloud platforms for high availability
and robustness. Flexible and automatic management of resources (e.g., replicas, load-balancers, etc.) enables
systems to be fault-tolerant and guarantee the quality of service despite the amount of data to be processed [67].
Maintainability, high availability, and scalability are essential requirements that ML-based systems must satisfy
when deployed in the real world. Service abstractions and cloud deployments enable current system architectures
to meet these requirements. However, real-world environments pose additional requirements to ML-based
systems, challenging SOA. Data monitoring and transparency are challenging because services hide data behind
their interfaces (“The Data Dichotomy”, [131]). Service interactions based on interfaces prioritise control flow
and make data exchanged during the communication ephemeral [100]. Cloud platforms are expensive and take
ownership of end users’ data. The physical distance between end users and cloud servers impacts the end-to-end
latency of critical applications supported by AI (e.g., real-time health diagnosis) [28]. Cloud resources are usually
expensive, failing to satisfy resource constraint requirements when there are budget constraints in the real-world
systems [75, 123]. Cloud deployments imply the change of data ownership from end users to service providers. End
users are unaware of where and how providers process their sensitive data and the associated security risks. Data
ownership, security, accountability, and trust requirements are hard to satisfy in such deployments [4, 34, 42, 74].


[Figure 1 diagram. Left side, the ML workflow challenges at deployment: Data Management (data collection, data augmentation, data preprocessing, data analysis); Model Learning (model selection, hyper-parameter selection, model training); Model Verification (requirements encoding, formal verification, test-based verification); Model Deployment (integration, composition, monitoring, updating); and Cross-cutting Aspects (ethics, trust, security, law). Right side, the DOA principles: Data as First-Class Citizen (data-driven, invariant and shared data model, data availability, data coupling); Prioritise Decentralisation (every entity stores data chunks, local first, peer-to-peer first); and Openness (every entity is autonomous, every entity is asynchronous, message exchange protocol). Links connect each challenge group to the principles that address it.]
Fig. 1. Map between ML Workflow Challenges at Deployment [106] and DOA principles. The left side shows the ML
challenges at deployment, and the right side shows the DOA principles. The links between them represent which principles
support addressing the respective challenges.
Data-Oriented Architecture (DOA) is an emerging architectural style that can help us overcome these challenges
when implementing and deploying ML-based systems. DOA proposes a set of principles for implementing data-
oriented software: considering data as a first-class citizen, decentralisation as a priority, and openness [65, 74,
121, 139]. Figure 1 maps ML deployment challenges and DOA principles addressing them. The left side of the
figure shows the challenges of the ML workflow at deployment discussed by Paleyes et al. [106], while the right


side shows the DOA principles we have extracted from the literature [65, 74, 121, 139]. This section introduces
each DOA principle and discusses how they support the deployment of ML-based systems in the real world.

3.1 Data as a First Class Citizen


Data collection, preprocessing, analysis, and monitoring tasks are challenging for systems deployed in the real
world. Such environments usually generate high volumes of variable data, which needs to be processed in real
time to create value. These requirements come from the properties of the data real-world systems must handle,
which are also known as "the five Vs" of big data: volume, variability, velocity, veracity, and value [89]. Such
data properties and requirements become more critical for ML-based systems implementation because they are
inherently data-driven. The outputs and overall performance of ML-based systems depend on the quality of
the data that flows through their components [46]. Practitioners must design data-driven systems focusing on
the specific data to be handled and processed. These systems must monitor their data to evaluate performance
at runtime, identify or predict failures, and trigger maintenance or adaptation operations. Architectural styles
oriented to individual services encapsulate atomic capabilities and interact through their interfaces, falling into
"The Data Dichotomy". These architectures hide their data behind service interfaces, but data management requires
exposing data [131]. The data dichotomy does not suit data management tasks because additional efforts are
needed to access the data that flows through the system. Such data unavailability complicates systems’ monitoring
and transparency and the related tasks [34, 149]. In the ML context, it means that systems’ data unavailability
challenges the selection, parametrisation, training, testing, verification, integration, monitoring, composition,
and updating of the learning model components as well as systems’ data transparency requirements [106].
DOA proposes to treat the data as a first-class citizen of a software system, understanding data as the
common denominator between disparate components [65]. The data in a DOA-based system is primary, and
the operations on the data are secondary. This principle makes systems data-driven by design, which matches
the nature of ML-based components. DOA-based systems rely on a shared data model, which is processed and
nourished by multiple system components [121]. The shared data model is a single data structure equivalent to
the one that data engineers build into the data management stage of ML workflows. The key difference is that
the system automatically creates this data model from its components’ interactions. Systems components do not
expose any APIs and interact via data mediums, where input is listened to, and output is offloaded [121]. Such
interactions enable DOA-based systems to achieve data coupling, considered the loosest form of coupling [93, 94].
The shared data model stores the history of the system’s state during its whole life cycle. Systems’ data is fully
available, describing the systems’ current and past states. Data availability facilitates data management, learning
model management, systems monitoring, failure detection, and adaptation tasks. Components’ behaviour is
programmatically observable, traceable, and auditable. Such transparency helps the responsible design and
implementation of ML-based systems to resolve questions regarding accountability, trust, and traceability.
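
To make data coupling concrete, the following is a minimal, illustrative sketch (not drawn from any specific reviewed system) of two components that never call each other's APIs: each one offloads its output to, and listens on, a shared data medium that keeps the full record history. SharedLog, Record, and the component names are our own assumptions.

```python
# Minimal sketch of data coupling: components never call each other's APIs; they only
# append records to and read records from a shared, append-only data medium.
# SharedLog, Record, and the component names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Record:
    topic: str
    payload: Any

@dataclass
class SharedLog:
    """Shared data model: the full history of the system's state stays available."""
    records: list = field(default_factory=list)

    def offload(self, topic: str, payload: Any) -> None:
        self.records.append(Record(topic, payload))

    def listen(self, topic: str):
        return [r for r in self.records if r.topic == topic]

def sensor_component(log: SharedLog, reading: float) -> None:
    # Produces data without knowing anything about downstream consumers.
    log.offload("sensor.readings", {"value": reading})

def model_component(log: SharedLog) -> None:
    # Consumes readings and offloads predictions back onto the same medium.
    for record in log.listen("sensor.readings"):
        prediction = record.payload["value"] > 0.5   # stand-in for an ML model
        log.offload("model.predictions", {"anomaly": prediction})

log = SharedLog()
sensor_component(log, 0.7)
model_component(log)
print(log.records)   # the complete, auditable data history of the system
```

Because the medium retains every record, a monitoring component or a practitioner can replay the complete history to detect changes in the system's data distribution, which is the transparency property discussed above.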

3.2 Prioritise Decentralisation


Big tech companies drove recent ML breakthroughs thanks to the increasingly powerful hardware and the efforts
in building massive data sets [41, 77]. These enablers are not always present in systems deployed in the real world.
Smaller organisations are using ML to solve challenging problems (e.g. sustainable farming, climate change, etc.)
with limited budgets and resources 2 . Most organisations worldwide cannot afford expensive cloud computing
resources or the effort of building and maintaining massive learning models. Such a reality threatens the practical
and democratic adoption of ML models, as different stages of their workflow are computationally expensive. For
example, the training stage is an iterative process that solves an optimisation problem to find learning model
parameters. Complex models (e.g., large language models) make this process computationally expensive when, for
2 An example of such an organisation is Data Science Africa (DSA): [Link]


example, they learn from non-quadratic, non-convex, and high-dimensional data sets [54, 66, 97]. Well-chosen hyperparameters
improve the efficiency of the training process and the accuracy of the learning models [55]. However, selecting
these hyperparameters is also a resource-demanding optimisation problem [17, 99]. In addition, expensive and
physically centralised deployments can fail to meet low-latency requirements from critical applications. The
physical distance between end users and data centres impacts systems’ end-to-end response time [28]. Data
ownership, security, and trust requirements are also affected in centralised systems. The ownership of the data
changes from end users to service providers, which can entail privacy and security issues as end users are unaware
of where and how their data is stored, processed, and used [4]. Recent research shows that Large Language
Models (LLMs) create risks of leaking sensitive information from the model’s training data. Adversarial attacks
or prompt engineering strategies can extract personal information from LLMs [42].
The simplest way to deliver a shared data model (Principle 3.1) would seem to be to centralise it, but, in
practice, scalability, resource constraints, low latency, and security requirements mean that in DOA we prioritise
decentralisation [74, 139]. Such decentralisation should be logical and physical. Logical decentralisation enables
organisations to scale when developing ML-based systems, as different development teams focus on smaller
systems’ components. Logical decentralisation is a way to achieve a clear separation of concerns, avoid central
bottlenecks in the computational process, and increase flexibility in system development. Physical decentralisation
enables the deployment of ML models in constrained environments without expensive computational resources
and improves privacy and ownership of data [74]. Practitioners should deploy components of ML-based systems
as decentralised entities that store data chunks of the shared information model described in the previous principle.
These entities first perform their operations with their local resources (i.e. local first, [74]). If local resources (i.e.,
data, computing time, or storage) are insufficient, entities can connect temporarily with other participants to
share resources. Entities first scan their local environment for potential resources they need. They prioritise
interactions with nodes in the close vicinity to share or ask for data and computing resources (i.e., peer-to-peer
first). Centralised servers are fallback mechanisms [139]. This principle facilitates data management, enabling the
system’s data replication by design because different entities can store the same data chunk. Such replication
provides data availability because if one entity fails, its information is not lost. Similarly, replication provides
scalability as different entities can respond to concurrent data requests [139]. Prioritised decentralisation also
alleviates the high demand for resources of ML-based systems and reduces the inference latency. ML-based
systems can perform their data-related tasks (e.g., ML model training) on smaller data sets partitioned by design,
and trained models can be deployed closer to end users to provide faster responses [24]. In addition, decentralisation
creates a flexible ecosystem where we can use resources from different devices on demand. This DOA principle
advocates for a more sustainable and democratic approach that prioritises the computational power available in
everyday devices over expensive cloud resources. It is important to note that logical or physical decentralisation
is not always necessary. For example, there is no need to perform local first computations when centralised
computational resources are available. However, data replication, partitioned data sets, and flexible resource
management are DOA-enabled properties that can still benefit even partial decentralisation of ML-based systems.
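
The resolution order this principle prescribes, local resources first, nearby peers second, and centralised servers only as a fallback, can be sketched as follows. This is an illustrative sketch under our own naming assumptions (Node, resolve, fetch_from_cloud); it is not a prescribed DOA implementation.

```python
# Illustrative sketch of the "local first, peer-to-peer first" resolution order, with
# centralised servers as the fallback. Node, resolve, and fetch_from_cloud are our own
# names, not part of any specific DOA implementation.
from typing import Optional

def fetch_from_cloud(chunk_id: str) -> Optional[bytes]:
    # Placeholder for the expensive, remote fallback mechanism.
    return None

class Node:
    def __init__(self, name: str, chunks: dict, peers: list):
        self.name = name
        self.chunks = chunks   # locally stored chunks of the shared data model
        self.peers = peers     # nodes in the close vicinity

    def resolve(self, chunk_id: str) -> Optional[bytes]:
        # 1. Local first: use local resources when they are sufficient.
        if chunk_id in self.chunks:
            return self.chunks[chunk_id]
        # 2. Peer-to-peer first: ask nearby nodes before any central server.
        for peer in self.peers:
            if chunk_id in peer.chunks:
                data = peer.chunks[chunk_id]
                self.chunks[chunk_id] = data   # replication by design
                return data
        # 3. Centralised servers are only a fallback mechanism.
        return fetch_from_cloud(chunk_id)

edge_a = Node("edge-a", {"patients-2024-01": b"..."}, peers=[])
edge_b = Node("edge-b", {}, peers=[edge_a])
print(edge_b.resolve("patients-2024-01"))   # served by a nearby peer, not the cloud
```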

3.3 Openness
Developing ML components often involves the automation of various stages of data processing, both at the training
and inference stages. The amounts of data ML-based systems manage, the complex processes they perform, and the
fact that their users are usually experts in domains different from ML (e.g. healthcare, physics, etc.) [140] demand
such automation. AutoML is a recent research area that aims to automate the ML training life cycle. AutoML
approaches include data augmentation, model selection, and hyperparameter optimisation [43, 137]. Nevertheless,
implementing and deploying ML-based systems in real-world environments sets automation requirements for
adopting ML models that AutoML does not cover. Real-world environments are dynamic, and deployed systems


[Figure 2 diagram. The pipeline runs: need identification → research questions → search terms → automatic search (92,906 papers) → data preprocessing (34,931 papers) → automatic filters (5,559 papers) → manual filters (101 papers) → snowballing (103 papers) → final selection (45 papers). The snowballing branch retrieves 596 citing papers, reduced in stages to 18, 3, and finally 2 additional papers.]

Fig. 2. The survey process depicts the steps from the review motivation to the full-text reading of the selected papers. This
process relies on the methodology proposed by Kitchenham et al. [72, 73]
must respond autonomously to changing goals, variable data, unexpected failures, security threats, complex
uncertainties, etc. Autonomous responses are required because practitioners do not have complete control over
such deployments [98], their complexity usually exceeds the limit of human comprehension [101], and human
intervention is not feasible [25]. Current ML-based systems rely on the interaction of heterogeneous components.
Static systems architectures usually pre-define the components and devices that compose a system, and third-
party entities (e.g., central controllers) usually integrate, orchestrate, and maintain these components to satisfy
end-to-end system quality requirements [26]. The scale and dynamic nature of real-world deployments demand
autonomous solutions and challenge these static and centralised architectures.
Autonomous and independent software components are the building blocks of systems that can adapt and
respond to uncertain environments without human intervention [143]. DOA proposes openness as a principle
that enables designing and implementing such autonomous systems’ components in open environments where
they interact autonomously [65]. DOA represents components as autonomous and asynchronous entities that
communicate with each other using a common message exchange protocol. Systems can leverage such environments
to integrate ML models because their decentralised components can autonomously perform the integration,
composition, monitoring, and adaptation tasks. Asynchronous entities produce their outputs and can subscribe
to inputs at any time [121]. These steps are transparent, favouring data trust and traceability. Similarly, the
system’s components are autonomous in deciding which data to store, which to make public, and which to hide
for security and privacy [139]. The message exchange protocol between asynchronous components replaces
interface dependencies in service-oriented or microservices architectures with asynchronous messages between
data producers and consumers. Asynchronous communication protocols enable data coupling, considered the
loosest form of coupling, enabling scalability and low latency requirements [93, 94].
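
A minimal sketch of the openness principle follows: autonomous, asynchronous entities exchange messages over a shared protocol and can subscribe to inputs at any time. The Bus abstraction, the topic names, and the camera/detector entities are illustrative assumptions rather than components of any particular DOA framework.

```python
# Toy sketch of openness: autonomous, asynchronous entities exchanging messages over a
# common protocol. Bus, topic names, and the camera/detector entities are illustrative.
import asyncio

class Bus:
    """A toy message-exchange protocol: topic -> subscriber queues."""
    def __init__(self):
        self.subscribers: dict = {}

    def subscribe(self, topic: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self.subscribers.setdefault(topic, []).append(queue)
        return queue

    async def publish(self, topic: str, message) -> None:
        for queue in self.subscribers.get(topic, []):
            await queue.put(message)

async def camera(bus: Bus) -> None:
    # Produces output without knowing who (if anyone) consumes it.
    for frame_id in range(3):
        await bus.publish("frames", {"frame": frame_id})
        await asyncio.sleep(0.01)

async def detector(bus: Bus) -> None:
    # An entity can join at any time, subscribe, and start consuming asynchronously.
    frames = bus.subscribe("frames")
    while True:
        frame = await frames.get()
        await bus.publish("detections", {"frame": frame["frame"], "objects": []})

async def main() -> None:
    bus = Bus()
    detections = bus.subscribe("detections")
    detector_task = asyncio.create_task(detector(bus))
    await asyncio.sleep(0)          # let the detector entity join and subscribe
    await camera(bus)               # produce three frames
    for _ in range(3):
        print(await detections.get())
    detector_task.cancel()

asyncio.run(main())
```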

4 SURVEY METHODOLOGY
This survey assesses why, how, and to what extent practitioners have adopted the DOA principles when imple-
menting and deploying ML systems and the motivation and methods behind such adoption. We survey research
works that have used ML to solve problems in different domains and have been deployed and tested in real-world
settings. We are particularly interested in works that report the software architectures behind these systems and,
if possible, the authors’ design decisions toward such architectures.
The selection of relevant works for our survey is not straightforward because many papers that apply ML in
different domains have been published in recent years, making manual selection unfeasible. For this reason, we
developed a semi-automatic framework based on a well-known methodology for systematic literature reviews
(SLRs) in software engineering [72, 73]. The framework is available publicly on GitHub3 to allow the reproducibility
of this work, as well as the reusability of this framework in other surveys. The framework queries the search
APIs from different digital libraries to automatically retrieve papers’ metadata (e.g., title, abstract, and citations).
It then applies syntactic and semantic filters over the retrieved records to reduce the search space. Then, we
3 Semi-automatic Literature Survey: [Link]


Table 2. Search query format used to retrieve papers. The query is composed of three search terms in conjunction. The values
in the second column replace each search term.

Search Query       <search_term_1> AND <search_term_2> AND <search_term_3>
<search_term_1>    "autonomous vehicle" OR "health" OR "industry" OR "smart cities" OR "multimedia" OR "science" OR "robotics" OR "oceanology" OR "finance" OR "space" OR "e-commerce"
<search_term_2>    "machine learning"
<search_term_3>    "real world" AND "deploy"

Table 3. Synonyms to extend queries. Search terms in the Word column are expanded using their respective synonyms.

Word                   Synonyms
"health"               "healthcare", "health care", "healthcare", "medicine", "medical", "diagnosis"
"industry"             "industry 4", "manufacture", "manufacturing", "factory", "manufactory", "industrial"
"smart cities"         "sustainable city", "smart city", "digital city", "urban", "city", "cities", "mobility", "transport", "transportation system"
"multimedia"           "virtual reality", "augmented reality", "3D", "digital twin", "video games", "video", "image recognition", "audio", "speech recognition", "speech"
"science"              "physics", "psychology", "chemistry", "biology", "geology", "social", "maths", "materials", "astronomy", "climatology", "oceanology", "space"
"autonomous vehicle"   "self-driving vehicle", "self-driving car", "autonomous car", "driverless car", "driverless vehicle", "unmanned car", "unmanned vehicle", "unmanned aerial vehicle"
"networking"           "computer network", "intranet", "internet", "world wide web"
"e-commerce"           "marketplace", "electronic commerce", "shopping", "buying"
"robotics"             "robot"
"finance"              "banking"
"machine learning"     "ML", "deep learning", "neural network", "reinforcement learning", "supervised learning", "unsupervised learning", "artificial intelligence", "AI"
"deploy"               "deployment", "deployed", "implemented", "implementation", "software"
"real world"           "reality", "real", "physical world"

manually filtered the list of works to select the papers to survey. Figure 2 depicts the stages of the survey process.
It has two principal stages described in the next section.

4.1 Planning Stage


The first stage consists of the review plan definition as follows:
a. Need identification: Section 3 introduced the DOA principles and how these can support practitioners
when deploying ML in the real world. Despite these potential benefits and several surveys on ML and its
applications (see Section 2), it remains unclear to what extent, why, and how practitioners have adopted
these principles in deploying ML-based systems. Practitioners usually follow DOA principles implicitly,
creating a knowledge gap around the DOA style and limiting the implementation of ML-based systems. We
close such a gap by surveying deployed ML-based systems from a DOA perspective to identify the design
decisions, implementation practices, and deployment strategies practitioners followed.


Table 4. We used the following categories and keywords for the Lbl2Vec algorithm [119].

Category       Keywords
"system"       "architecture", "framework", "platform", "tool", "prototype".
"software"     "develop", "engineering", "methodology", "architecture", "design", "implementation", "open", "source", "application".
"deploy"       "production", "real", "world", "embedded", "physical", "cloud", "edge", "infrastructure".
"simulation"   "synthetic", "simulate".

b. Research questions: The main research question we want to answer with this survey is: To what extent
and how have researchers adopted the principles of Data-Oriented Architecture (DOA) in the implementation
and deployment of modern ML-based systems, and what are the motivations behind such adoption? We split
this question into three complementary research questions to determine the level and way of adoption
of each DOA principle in deployed ML-based systems: data as a first-class citizen (RQ1), prioritise
decentralisation (RQ2), and openness (RQ3). The answers to these questions will enable us to identify
the research gaps and inform the development of the next generation of DOAs. Our long-term goal is to
establish DOA as a mature and competitive architectural style for designing, developing, implementing,
deploying, monitoring, and adapting ML-based systems.
c. Search terms: We want to search for papers presenting ML-based systems deployed in real-world environ-
ments to answer the research questions above. Table 2 shows the query format and the search terms we
use to retrieve such papers. The query has three search terms in conjunction (i.e., AND operator). The first
term refers to popular domains that have applied ML. The second term filters papers that apply machine
learning in these domains, and the third term filters papers that deploy their solution in the real world.
Search engines in scientific databases use different matching algorithms. Some engines search for exact
words in the papers’ attributes (e.g., title or abstract), which can be too restrictive. We expand these queries
by including synonyms for the search terms (Table 3). Synonyms extend queries using inclusive disjunction
(i.e., OR operator); a sketch of this expansion appears at the end of this subsection.
d. Source selection: We search for papers in the most popular scientific repositories using their APIs. They are
IEEEXplore4 , Springer Nature5 , Scopus6 , Semantic Scholar7 , CORE8 , and ArXiv9 . Some repositories, such
as the ACM digital library, could not be used as they do not provide an API to query. Nevertheless, we
are confident in the sufficient coverage of our search because of the significant overlap with other sources
(papers can be published in multiple libraries or indexed by meta-repositories such as Semantic Scholar).
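
The query expansion described in step c can be sketched as follows. The function names and the reduced synonym dictionary are ours, and the actual framework may format queries differently for each repository API; the sketch only illustrates how Table 2 and Table 3 combine into a single boolean query.

```python
# Illustrative sketch of the query expansion in step c: each search term is replaced by
# the inclusive disjunction of itself and its synonyms (Table 3), and the three expanded
# terms are joined in conjunction (Table 2). SYNONYMS here is a reduced excerpt.
SYNONYMS = {
    "machine learning": ["ML", "deep learning", "neural network", "artificial intelligence"],
    "health": ["healthcare", "medicine", "medical", "diagnosis"],
    "deploy": ["deployment", "deployed", "implemented", "implementation"],
    "real world": ["reality", "real", "physical world"],
}

def expand(term: str) -> str:
    """Expand one search term into an inclusive disjunction with its synonyms."""
    options = [term] + SYNONYMS.get(term, [])
    return "(" + " OR ".join(f'"{option}"' for option in options) + ")"

def build_query(domain_terms: list) -> str:
    """<search_term_1> AND <search_term_2> AND <search_term_3>, as in Table 2."""
    term_1 = "(" + " OR ".join(expand(domain) for domain in domain_terms) + ")"
    term_2 = expand("machine learning")
    term_3 = expand("real world") + " AND " + expand("deploy")
    return " AND ".join([term_1, term_2, term_3])

print(build_query(["health", "robotics"]))
```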

4.2 Conducting Stage


The second stage consists of the review execution as follows:
a. Automatic search: We implemented clients that consume the APIs exposed by the selected repositories
as part of our semi-automatic framework. In this step, each client creates the query in the format the

4 IEEEXplore API: [Link]


5 SpringerNature API: [Link]
6 Scopus API: [Link]
7 Semantic Scholar: [Link]
8 CORE API: [Link]
9 ArXiv API: [Link]


Table 5. Number of papers selected at each stage of the system literature review process from each scientific repository.

Repository         Automatic Search   Preprocessing   Automatic Filters   Manual Filters   Snowballing   Final Selection
IEEEXplore 12,467 9,084 2,557 58 58 21
Springer Nature 58,199 15,617 463 7 7 2
Scopus 4,537 2,239 727 10 10 8
Semantic Scholar 12,308 5,225 638 6 8 5
CORE 1,917 386 251 2 2 2
ArXiv 3,480 2,374 923 18 18 7
Total Papers 92,908 34,931 5,559 101 103 45

respective API expects, submits the request, and stores the search results in separate CSV files. The search
results are the metadata of the retrieved papers (e.g., title, abstract, publication data).
b. Preprocessing of Retrieved Data: Each API provides papers’ metadata in a different format. There are
duplicated papers between repositories, and some records can be incomplete (e.g., a paper missing an
abstract). The preprocessing step prepares the data for the following steps in our semi-automated framework.
It joins the papers’ metadata in a single file, cleaning the data and removing repeated and incomplete
records. The process selected a total of 34,931 works after this step.
c. Automatic Filtering: All the data of the retrieved papers are stored in a single file after the preprocessing
step. However, the number of papers is still too large for manual processing. We reduce the search space
by applying two filters. A syntactic filter selects the papers that talk in the abstract about real-world
deployments according to our definition of "ML systems in the real world". In particular, this filter searches
for the "real world" and "deploy" words and their synonyms (Table 3) in the papers’ abstracts. We classified
the selected papers into four categories after the syntactic filtering. The first category includes the papers
that present architectures of deployed ML-based systems, the second category groups papers that present
software engineering approaches to build ML-based systems in practice, and the third category represents
papers that show physical implementations (e.g., edge architectures) of ML-based systems with a special
focus on the infrastructure. The final category includes papers that evaluate ML algorithms and systems
based on synthetic data and simulated environments. We used an unsupervised-learning algorithm, Lbl2Vec,
to semantically classify the selected papers in these four groups following the work proposed by Schopf et
al. [119]. Lbl2Vec requires as inputs a set of texts to classify (i.e., selected papers’ abstracts) and a set of
predefined categories (Table 4). The algorithm assigns papers to the most relevant category. We use this
semantic classification to select the papers that belong to the first three categories (i.e., system, software,
and deploy). These filters produced a total of 5,559 papers.
d. Manual filtering: The syntactic filters in the previous step reduce the set of papers to a more feasible
number to be manually explored. Our framework in this step shows the paper information to the user in a
centralised interface where papers are selected as included or excluded. Such manual selection has two
stages following the methodology defined by Kitchenham et al. [72, 73]. We select the papers by reading
their abstracts in the first stage. Then, we filter the works by skimming the full text. The skimming process
includes assessing the papers’ quality. According to the goal of our survey, we focused on papers whose
content introduces the software architectures behind ML-based systems and reports the results of their
deployment in the real world. We discarded the works that did not present these two attributes. One
researcher performed the manual filters that produced 101 papers.


e. Snowballing: We use the API from Semantic Scholar to retrieve metadata of the papers that cite the selected
works from the previous stage. Our tool preprocesses and filters (i.e., syntactically and semantically) the
citing papers and adds the ones that pass these filters to the final set of selected papers. This process
resulted in 2 more papers added to the selection, resulting in 103 papers for full-text reading.
f. Final Selection: Two researchers read the 103 papers, selected 45 papers, and extracted data from these to
answer our survey research questions. We agreed on the selection criteria based on two aspects. First, we
considered the papers’ relevance by identifying if they introduce the architectures behind the ML-based
systems and report results from real-world deployments. Second, we use the DOA principles definition to
determine to what extent these papers adopt the DOA principles (Section 3). We used a data collection
spreadsheet10 to analyse and extract the data from the 45 selected papers. The spreadsheet has each paper’s
metadata (i.e., ID, URL, and title) and columns each researcher must fill out according to the paper analysis.
We use the "Quick Summary" column to describe the research work, its domain, problem, solution, and
reasons for inclusion or exclusion. The "Real-World Deployment" column describes how authors deploy
the selected works in the real world. We use the "Data as First Class Citizen" (i.e., RQ1), "Decentralised
Architecture (i.e., RQ2), and "Open Architecture" (i.e., RQ3) columns to describe the extent to which and how
these papers follow the DOA principles. We use the "Data Flow Design" and "Data Coupling" to evaluate if
the authors’ design systems with a focus on data flowing between their components (i.e., data-first) and if
such components communicate through data mediums (i.e., decentralisation and openness). For each of
these columns, we defined three possible outcomes:
– "YES" meaning the reviewed paper fully follows the evaluated principle. For example, a system fully
follows the data as a first-class citizen principle if the system is data-driven, its components share a data
model, and they communicate through such a data model (i.e., data coupling).
– "PARTIAL" meaning the reviewed paper partially follows the evaluated principle. For example, a system
partially prioritises decentralisation if it stores and processes data chunks locally but does not have
peer-to-peer communication with components in the same layer.
– "NO" meaning the reviewed paper does not follow the evaluated principle in any way. For example, a
system does not follow the openness principle when it does not implement autonomous and asynchronous
entities exchanging messages between them.
Each researcher justifies the outcome for each column and paper in the data collection spreadsheet. In
addition, the spreadsheet also has a column "Comments" to annotate extra information researchers find
relevant for our study. We addressed selection and data extraction conflicts in group discussions. Section 5
presents the survey of the 45 papers using a descriptive and quantitative synthesis of the results we
extracted in the spreadsheet.
g. Exclusion criteria: We exclude the following works during the different stages of our semi-automatic process:
– Papers that introduce ML-based systems without deployment and evaluation in the real world or a
production environment.
– Papers that present ML-based systems but do not report their software architecture.
– Papers that report experiments on synthetic data or simulated environments where ML-based systems
do not interact with actual data, physical entities, or people.
– Papers presenting isolated ML algorithms not part of larger systems.
– Papers with missing metadata that our framework cannot analyse (e.g., a paper without an abstract).
– Duplicated papers.
– Papers not written in English.

10 Data collection spreadsheet: [Link]


– Survey and review papers.


– Thesis and report documents.
We use the exclusion criteria to distinguish ML-based systems deployed in the real world from ML systems
or algorithms developed and tested in controlled and isolated environments.
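
The semantic classification referenced in the automatic filtering step can be approximated with the sketch below. It is a simplified stand-in for Lbl2Vec [119]: instead of jointly embedded document and label vectors, it scores each abstract against the Table 4 keyword categories using TF-IDF and cosine similarity. It conveys the idea but is not the implementation used in our framework.

```python
# Simplified stand-in for the Lbl2Vec classification step: score each abstract against
# keyword-defined categories (Table 4) and assign the most similar one. The real
# pipeline uses Lbl2Vec's jointly embedded document and label vectors [119]; here we
# approximate the idea with TF-IDF and cosine similarity for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

CATEGORIES = {
    "system":     "architecture framework platform tool prototype",
    "software":   "develop engineering methodology architecture design implementation open source application",
    "deploy":     "production real world embedded physical cloud edge infrastructure",
    "simulation": "synthetic simulate",
}

def classify(abstracts: list) -> list:
    vectorizer = TfidfVectorizer()
    # Fit on the category keyword "documents" and the abstracts together.
    corpus = list(CATEGORIES.values()) + abstracts
    matrix = vectorizer.fit_transform(corpus)
    label_vectors = matrix[: len(CATEGORIES)]
    doc_vectors = matrix[len(CATEGORIES):]
    labels = list(CATEGORIES.keys())
    similarities = cosine_similarity(doc_vectors, label_vectors)
    return [labels[row.argmax()] for row in similarities]

abstracts = [
    "We deploy an edge architecture for real world production monitoring.",
    "We evaluate the model on synthetic data in a simulated environment.",
]
print(classify(abstracts))   # papers assigned to 'simulation' are filtered out
```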

5 ML-BASED SYSTEMS SURVEY FROM DOA PRINCIPLES PERSPECTIVE


This section presents a survey of the selected ML-based systems. Our main goal is to understand why, how,
and to what extent practitioners have adopted the DOA principles (Section 3). The answer to these questions
allows us to identify the systems’ requirements that DOA principles satisfy and the practical approaches for
implementing DOA, as well as to define the open challenges for building DOA-based systems (Section 6). Table 6
shows the systems that adopt the DOA principles by contrasting them with the list of reviewed papers. The
ML-based systems reported in these papers can fully, partially, or not adopt the DOA principles. The level of
adoption depends on the requirements these systems need to satisfy, the nature of the data they handle, and their
deployment environments. This section details the adoption of each principle and sub-principle. Table 6 shows
that nine out of the forty-five papers follow all or most DOA principles, reflecting an overall low adoption of
DOA when implementing current ML-based systems. The lack of maturity and tooling around the DOA style, in
contrast with popular SOA cloud-based deployments, is the reason behind this finding.

5.1 Data as a First Class Citizen


Figure 3 shows the extent to which the reviewed papers adopt the data as a first-class citizen principle. We
found that all reviewed papers report data-driven systems, which is an expected result because ML-based systems
are data-driven by nature. All the surveyed systems contain one or multiple learning models because they want to
satisfy data needs such as identifying anomalies, optimising operations, automating processes, etc. These systems’
performance depends on the data flowing through them, and they accomplish their functional requirements
using diverse data-driven methods. For example, Schumann et al. [122] use a Bayesian network for health
monitoring of space vehicles (e.g., rovers), Sarabia-Jacome et al. [118] build a DL model for fall detection in
Ambient Assisted Living (AAL) environments, Junchen et al. [62] propose Pytheas as a data-driven approach that
optimises the Quality of Experience (QoE) of applications based on network metrics, Agarwal et al. [2] present an
RL-based decision-making system that Microsoft products use for different optimisation tasks, such as content
recommendation system in MSN and virtual machine management in Azure Compute.
Our findings show that 60% of the surveyed papers fully adopt the sub-principle of handling their information
using shared data models. These systems implement shared data models using data streams or databases. Practi-
tioners select data streams because these structures enable systems to manage and process information from
continuous data sources. That is the case of environments that require systems to process multimedia streams (e.g.,
video) [37, 44, 50, 57, 62], sensors data [22, 35, 61, 69, 79, 90, 122, 128, 148], social media data [88, 125, 145, 147],
and network metrics [117, 132]. We observed that streams are common in systems built with dataflow architecture
[36, 100]. A data flow architecture transforms data inputs using different components while they flow through
the system. Herrero et al. [61] present a good example of a system based on a dataflow architecture. They
propose a platform for Industry 4.0 based on the RAI4.0 reference architecture [81] that models software and
hardware components as stream producers and consumers. This platform handles data-intensive, continuous,
and dynamic data sources from a water pump and its environment. Streams enable the system to manage and
respond to real-time changes while providing predictive maintenance. Similarly, Sultana et al. [132] present a
software framework to detect Distributed Denial of Service (DDoS) attacks in real-time network traffic. This
software relies on a Support Vector Machine (SVM) model deployed using the Acumos and ONAP platforms.
The framework decouples components that exchange data between them via Kafka streams. System designers


Table 6. All reviewed papers in our survey. This table shows the extent to which and in what papers practitioners adopt (fully
or partially) each of the DOA sub-principles discussed in Section 3.

Columns, left to right: Data as a First Class Citizen (Data driven, Shared data model, Data coupling);
Prioritise Decentralisation (Local data chunks, Local first, Peer-to-peer first);
Openness (Autonomous entities, Asynchronous entities, Message exchange protocol).

Research work

Lebofsky et al. [76] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓


Herrero et al. [61] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Zhang et al. [147] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Dai et al. [37] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Junchen et al. [62] ✓ ✓ ✓ – ✓ ✓ ✓ ✓ ✓
Karageorgou et al. [68] ✓ – ✓ ✓ ✓ ✓ ✓ ✓ ✓
Shan et al. [125] ✓ ✓ ✓ ✓ ✓ ✓ ✓ – –
Zhang et al. [148] ✓ ✓ ✓ ✓ ✓ – – –
Sultana et al. [132] ✓ ✓ ✓ – ✓ ✓ ✓
Santana et al. [117] ✓ ✓ – – – – –
Alonso et al. [8] ✓ – – – – – –
Sarabia-Jácome et al. [118] ✓ – – – – – –
Xu et al. [145] ✓ ✓ ✓ – ✓ ✓
Alves et al. [9] ✓ ✓ ✓ – – –
Conroy et al. [35] ✓ ✓ ✓ – – –
Bellocchio et al. [15] ✓ ✓ – – – –
Shih et al. [128] ✓ ✓ – – – –
Gallagher et al. [49] ✓ ✓ – – – –
Salhaoui et al. [115] ✓ – – – – –
Shi et al. [127] ✓ – – – – –
Nguyen et al. [90] ✓ ✓ ✓ ✓ ✓
Schumann et al. [122] ✓ ✓ ✓ ✓ ✓
Schubert et al. [120] ✓ ✓ ✓ ✓ ✓
Agarwal et al. [2] ✓ ✓ ✓ – –
Habibi et al. [57] ✓ ✓ – – –
Müller et al. [88] ✓ ✓ – – –
Lu et al. [79] ✓ ✓ – – –
Calancea et al. [30] ✓ – – – –
Quintero et al. [112] ✓ – – – –
Franklin et al. [48] ✓ ✓ ✓ ✓
Brumbaugh et al. [21] ✓ – – –
Hegemier et al. [60] ✓ – – –
Gorkin et al. [56] ✓ – – –
Barachi et al. [13] ✓ – – –
Niu et al. [92] ✓ – – –
Qiu et al. [111] ✓ – – –
Cabanes et al. [22] ✓ ✓ ✓
Gao et al. [50] ✓ ✓ –
Amrollahi et al. [10] ✓ – –
Kemsaram et al. [69] ✓ ✓ –
Bayerl et al. [14] ✓ – –
Johny et al. [63] ✓ – –
Falcao et al. [44] ✓ ✓
Hawes et al. [58] ✓ ✓
Ali et al. [7] ✓ –
✓ = Adopted, – = Partially adopted, blank = Not adopted


[Figure 3 bar chart: adoption level (Full, Partial, No) for each sub-principle (Data driven, Shared data model, Data coupling).]
Fig. 3. Data as a first-class citizen principle adoption. All systems reviewed are data-driven, over 60% at least partially adopt
a shared data model, and just under half utilise data coupling.

prefer databases as shared data models when the system’s functionalities require making decisions based on
historical data records [2, 9, 15, 49, 58, 76, 120]. The reason behind this design decision is that a shared data
model implemented as a database facilitates the storage and management of historical records and provides an
efficient communication medium between systems’ components compared to synchronous API calls or Remote
Procedure Calls (RPCs). For example, Alves et al. [9] propose a system for industrial predictive maintenance based
on historical data collected from IoT devices. These devices produce and write monitoring data in a database.
The system’s components read such data for the respective maintenance decision-making. Schubert et al. [120]
introduce a deep learning-based system used by the Texas Spacecraft Laboratory. The system’s components store
and retrieve data from AWS S3 buckets to generate large data sets of synthetic images that support on-orbit
spacecraft operations. Hawes et al. [58] present a robotic platform that stores the Robotic Operative System (ROS)
messages in a document-oriented database (i.e., MongoDB). These messages describe the status of the robot
and its interactions with the environment. A monitoring component reads this history to predict future states.
This robotic platform is the only reviewed work that uses the shared data model and data coupling principle for
self-monitoring purposes.
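
The database-backed shared data model can be illustrated with a minimal sketch, loosely modelled on the MongoDB message store used by Hawes et al. [58]. The connection string, database, collection, and field names are illustrative assumptions and are not taken from the reviewed system.

```python
# Minimal sketch of a database-backed shared data model, loosely modelled on the
# MongoDB message store in Hawes et al. [58]: components write status records to a
# shared collection, and a monitoring component replays the history to anticipate
# future states. Connection string, database, collection, and field names are
# illustrative, not taken from the reviewed system.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
messages = client["robot_platform"]["status_messages"]   # the shared data model

def report_status(component: str, payload: dict) -> None:
    # Any component appends its state; nothing is hidden behind an API.
    messages.insert_one({
        "component": component,
        "payload": payload,
        "timestamp": datetime.now(timezone.utc),
    })

def monitor(component: str, window: int = 100) -> list:
    # The monitoring component reads the shared history to predict future states.
    cursor = messages.find({"component": component}).sort("timestamp", -1).limit(window)
    return list(cursor)

report_status("navigation", {"battery": 0.82, "pose": [1.2, 3.4, 0.0]})
recent = monitor("navigation")
```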


Over thirteen per cent of the reviewed papers partially adopt a shared data model between their systems’
components. That is the case for systems combining heterogeneous storage technologies in shared data models [10, 21, 68]. These systems must store, monitor, and trace the data status in different stages of its processing.
Practitioners design and implement systems combining different storage technologies and models because the
data structure and format change after each processing stage. For example, the AIDEx platform [10] queries
an electronic medical records (EMR) database and passes such data to a set of microservices that predict the
risk of patients developing sepsis based on their data models. AIDEx creates patients’ health data streams based
on the prediction results stored in a MongoDB instance. The system uses this data for visualisation purposes.
Karageorgou et al. [68] propose a system for multilingual sentiment analysis of Twitter streams. This platform
uses RabbitMQ queues to collect tweets in different languages. The system then models such data as Kafka streams
for sentiment analysis. Other systems that partially adopt a shared data model are the ones that must collect
data from spatially distributed sources [8, 14, 118]. Systems distribute such data to satisfy low latency, resource
constraints, and privacy requirements. Distributed nodes use the same models to store local data, but do not
share this data between them. Subsection 5.2 reports more details on distributed and decentralised deployments.
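
A minimal sketch of this partial adoption pattern is shown below: raw records live in a relational store, while prediction results are re-modelled as documents on a stream for downstream visualisation. The schema, the placeholder risk function, and the use of SQLite and an in-process queue are assumptions made for illustration only, not the implementations of the cited platforms.

```python
import json
import queue
import sqlite3

# Stage 1: relational store holding raw records (stand-in for an EMR database).
emr = sqlite3.connect(":memory:")
emr.execute("CREATE TABLE vitals (patient_id TEXT, heart_rate REAL, temp REAL)")
emr.executemany("INSERT INTO vitals VALUES (?, ?, ?)",
                [("p1", 118.0, 39.2), ("p2", 72.0, 36.8)])

def risk_score(heart_rate: float, temp: float) -> float:
    """Placeholder prediction component; a real system would call an ML model."""
    return min(1.0, 0.4 * (heart_rate > 100) + 0.6 * (temp > 38.0))

# Stage 2: predictions are re-modelled as documents on a stream for visualisation,
# mimicking the change of data model between processing stages.
prediction_stream: "queue.Queue[str]" = queue.Queue()
for patient_id, hr, temp in emr.execute("SELECT * FROM vitals"):
    doc = {"patient_id": patient_id, "sepsis_risk": risk_score(hr, temp)}
    prediction_stream.put(json.dumps(doc))

while not prediction_stream.empty():
    print(prediction_stream.get())
```
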
Just over a quarter of the reviewed papers report systems whose components do not have shared data models.
Practitioners design systems without shared data models because these systems are small and not data-intensive.
These are either implemented as monoliths in the cloud [56, 92] or as edge architectures that use edge servers as
gateways for data transmission and inference. Such edge solutions deploy trained learning models on embedded
devices, like in the monitoring and prediction systems proposed by Quintero et al. [112], Salhaoui et al. [115],
Hegemier et al. [60], and Johny et al. [63]. Monolithic deployments follow a client-server pattern to collect
sensor data and respond to requests. For example, Niu et al. [92] deploy a recognition system of indoor daily
activities for one-person household apartments. Some systems do not have shared data models but are complex
and large and process big volumes of data. These rely on cloud data centres and follow well-known software
architectures (e.g., microservices) to satisfy such requirements [7, 13, 30, 48, 111, 127]. Functionalities and data of
systems’ subcomponents are encapsulated and hidden behind interfaces (e.g., service APIs). Designers prefer
such implementations when they do not prioritise transparency and traceability requirements and when they
rely on third-party cloud components for monitoring tasks (e.g., middleware solutions [26]).
Systems’ components in 33% of papers follow the data coupling sub-principle. These systems’ components
interact by reading from and writing to data mediums. Engineers adopt this principle because systems must
handle real-time data that needs to be processed continuously. We found that components of real-time systems [21,
22, 35, 37, 61, 62, 68, 90, 122, 132] act as subscribers and publishers of data to streams that represent the state
of the data at different stages in a workflow. Stream-based systems make use of different technologies such
as Apache Kafka [61, 62, 132] or Spark Streaming [21, 62], sometimes adopting the underlying stream-based
programming model for the entire system [37]. Message queues (e.g., RabbitMQ) are also used in systems to
collect heterogeneous data from different sources or enable interaction between components in parallel [68, 90].
Systems also adopt the data coupling sub-principle to handle large amounts of data that are not possible to send
through APIs or RPCs [76, 120] or when they analyse historical data [2, 9, 125, 147]. Practitioners design and
implement data-coupled components using databases because systems and their components must process large
pieces of data in batches. The system proposed by Zhang et al. [147] illustrates these types of data coupling by
combining streams and databases. Distributed Apache Kafka streams store social media data from Weibo and
Chinese forums, which are processed using Apache Storm to enable real-time sentiment analysis. The system
also offers batch processing where social media is stored using the Hadoop Distributed File System (HDFS) and
an HBase database. The Apache Spark machine learning library (MLlib) performs distributed data analysis on
top of this data. Systems that partially adopt data coupling (i.e., 15.5%) [10, 21, 49, 50, 57, 88, 128] are the ones
where some components communicate through data mediums and others use traditional procedure calls. The reason
behind such a hybrid approach is satisfying interoperability requirements while managing large and real-time


data. These systems use APIs or RPCs to communicate with distributed components or external entities, while
the centralised system’s components interact through data mediums. For example, systems can use API calls
to collect data from heterogeneous data sources [10, 88] or to interact with end users [49, 50]. Over half of the
surveyed papers do not adopt data coupling. These systems mostly use REST [13, 56, 63, 69, 79, 92, 112, 117] or
RPCs [8, 14, 15, 30, 44, 48, 58, 60, 111, 115, 118, 127, 145, 148] for communication. These systems’ components
hide the data behind their interfaces, which causes the data dichotomy issue impacting monitoring tasks [131].
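
The sketch below illustrates stream-based data coupling in the spirit of the Kafka-based systems above: a producer publishes raw events to a topic, and a separate consumer subscribes to it and publishes its results to the next topic. The broker address, topic names, and the stand-in sentiment rule are assumptions; the sketch requires the kafka-python client and a running broker, and the two halves would normally run as separate processes.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer side: a component publishes raw events to a topic instead of calling
# downstream components directly.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw-posts", {"post_id": 42, "text": "service was great"})
producer.flush()

# Consumer side: a sentiment component subscribes to the same topic and publishes
# its results to the next stage of the workflow.
consumer = KafkaConsumer(
    "raw-posts",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    post = message.value
    sentiment = "positive" if "great" in post["text"] else "neutral"  # stand-in model
    producer.send("scored-posts", {"post_id": post["post_id"], "sentiment": sentiment})
```
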
We found that not all systems where components share a data model necessarily follow the data coupling sub-
principle. Different nodes in a distributed system can have the same data model even when they do not interact.
Designers made this decision when systems must address low latency and data privacy requirements, but do not
need to aggregate data from geographically distributed sources for training their models. Some of these distributed
systems create independent deployments that are in charge of the functionality of the whole system in a given
geographical region or coverage area [15, 58, 69, 79]. Other distributed systems rely on edge devices that play the
role of intermediate nodes that collect, preprocess, and transmit data to centralised servers as well as perform
inference tasks when designers deploy cloud-trained learning models on the edge [8, 14, 44, 117, 118, 125, 145, 148].

5.2 Prioritise Decentralisation


Figure 4 presents the extent to which the papers prioritise decentralisation when deploying their ML-based
systems. Almost eighteen per cent of the papers report systems where decentralised entities store local data chunks
with minimal presence of centralised entities. Decentralised approaches [37, 61, 68, 76, 125, 145, 147, 148] rely on
local data chunks, which provide partitioning and replication by design. Practitioners implement systems based on
local data chunks to exploit such data partition and replication, enabling systems to manage computing resources
efficiently and be fault-tolerant. That is the case of the Breakthrough Listen program presented by Lebofsky [76].
This program collects data from radio and optical telescopes to search for extraterrestrial intelligence (SETI).
The search leverages different strategies, including ML models for modulation scheme classification and outlier
detection. The paper reports the software and hardware architecture behind the data collection, reduction,
archival, and public dissemination pipeline. This architecture consists of storage clusters and compute nodes
that handle raw data volumes averaging 1PB daily. Decentralised storage nodes use Hierarchical Data Format
version 5 (HDF5) to organise the large data sets in groups that facilitate data management along the pipeline.
Another example of decentralisation is Poligraph [125], a system that detects fake news by combining ML models
and human knowledge. Users send fake news detection requests to decentralised servers that run different ML
models and collect news reviews from experts. These servers run the Byzantine Fault Tolerant (BFT) protocol to
replicate users’ requests and news to mitigate experts’ unavailability (i.e., fault tolerance). Systems designers also
use decentralisation to implement systems for processing intensive and sparse data sources [61, 68, 145, 147].
For example, Zhang et al. [147] propose a system for sentiment analysis of tweets from Chinese cities
(i.e., Beijing, Shanghai, Guangzhou, and Chengdu). The decentralised storage and management of data leverages
HBase and Kafka. We found that decentralised approaches must rely on distributed computing protocols and
technologies to mitigate the challenges that arise from the absence of a central control entity. In addition to the
technologies mentioned before, Karageorgou et al. [68] use Spark for scalable sentiment analysis on multilingual
data from social media, Xu et al. [145] use Distributed Hash Tables (DHTs) to facilitate the retrieval of social data
stored by topic for spam detection, and Herrero et al. [61] integrate Zookeeper, Kafka, and Apache Cassandra in
a decentralised platform for a predictive maintenance industrial service.
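
As a rough illustration of local data chunks organised with HDF5, the sketch below writes one node’s chunk as groups and datasets and reads it back for local processing. The group names, attributes, and file layout are hypothetical and are not taken from the Breakthrough Listen pipeline.

```python
import numpy as np
import h5py  # pip install h5py

# Each storage node keeps its local chunk of observations, organised in groups
# so that later pipeline stages can locate data by telescope and scan.
with h5py.File("node_03_chunk.h5", "w") as f:
    scan = f.create_group("green_bank/scan_0001")
    scan.create_dataset("spectra", data=np.random.rand(1024, 256), compression="gzip")
    scan.attrs["start_utc"] = "2023-01-01T00:00:00Z"
    scan.attrs["receiver"] = "L-band"

# A compute node processes only the chunk it holds locally.
with h5py.File("node_03_chunk.h5", "r") as f:
    spectra = f["green_bank/scan_0001/spectra"][:]
    print(spectra.mean())
```
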
Some systems partially decentralise the storage of data (17.8% of the papers). These approaches exploit
decentralisation properties (e.g., closeness to data owners) while having centralised control of the system [107, 133].
Partial decentralisation creates federated networks where collected data from sensor devices (e.g., IoT sensors,
smartphones, wearables, etc.) is stored in local databases [8, 15, 60, 127] or encoded in local learning models trained


Fig. 4. Adoption of the prioritising decentralisation principle. Most systems rely on centralised storage and processing to the
detriment of the decentralisation sub-principles.

using more powerful computing infrastructure [60, 62, 115, 117, 118]. In both cases, edge servers preprocess and
filter the collected data before transmitting it to back-end servers. Such central entities have a complete view
of the system’s state and perform more complex tasks (e.g., learning models training, job scheduling, resource
allocation, etc.). System developers exploit these federated architectures to satisfy systems requirements such
as data ownership, privacy, and security, as devices and edge nodes can decide when and how to transmit
the information to back-end servers. Santana et al. [117] propose a crowd management system based on WiFi
frames originating from people’s smartphones in the context of the SmartSantander project [116]. The system
collects WiFi frames as streams preprocessed in edge nodes. This preprocessing includes data anonymisation and
dimensionality reduction to protect people’s identities and extract key features from the streams. A centralised
processing tier stores and uses the preprocessed and anonymised data to train ML models (e.g., logistic regression,
naive Bayesian, and random forest classifiers) for crowdsensing estimation. Practitioners also exploit local data
storage to facilitate the deployment of ML-based solutions in resource-constrained environments and optimise
cost. For example, Alonso et al. [8] implement a real-time monitoring edge system for smart farming, where
farmers want to reduce storage and compute costs while monitoring the state of dairy cattle and feed grain.


Edge servers in the system collect, buffer, and filter data from IoT devices to eliminate possible noise and discard
duplicated frames. Such filtering also reduces transmission and storage costs.
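
The following sketch captures the edge-side preprocessing pattern described above: frames are de-duplicated and device identifiers are replaced by salted hashes before anything leaves the edge node. The field names, salt, and filtering rule are illustrative assumptions rather than the actual SmartSantander or smart-farming implementations.

```python
import hashlib
import json

def anonymise(mac: str, salt: str = "edge-node-secret") -> str:
    """Replace the device identifier with a salted hash before it leaves the edge."""
    return hashlib.sha256((salt + mac).encode()).hexdigest()[:16]

def preprocess(frames: list[dict]) -> list[dict]:
    """Drop duplicate frames and keep only the features the central tier needs."""
    seen, kept = set(), []
    for frame in frames:
        key = (frame["mac"], frame["timestamp"])
        if key in seen:
            continue            # duplicate frame: discard to save bandwidth
        seen.add(key)
        kept.append({
            "device": anonymise(frame["mac"]),
            "rssi": frame["rssi"],
            "timestamp": frame["timestamp"],
        })
    return kept

raw_frames = [
    {"mac": "aa:bb:cc:dd:ee:ff", "rssi": -60, "timestamp": 1},
    {"mac": "aa:bb:cc:dd:ee:ff", "rssi": -60, "timestamp": 1},  # duplicate
    {"mac": "11:22:33:44:55:66", "rssi": -72, "timestamp": 1},
]
payload = json.dumps(preprocess(raw_frames))
# `payload` is what the edge node would transmit to the centralised tier.
print(payload)
```
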
Our findings show that most reviewed papers report systems that rely on centralised storage (i.e., 64.4%). Some of
these systems are deployed on single cloud servers [2, 7, 10, 13, 21, 30, 35, 48–50, 56, 57, 79, 88, 92, 111, 120, 128, 132],
where all data is stored. Designers rely on centralised cloud servers because these are flexible and can handle
Big Data requirements. For example, Conroy et al. [35] present a system that monitors and predicts COVID-19
infections to aid military workforce readiness. This system monitored personnel of the US Department of Defence
during the pandemic. It had around 10,000 users in ten months, collected 201 million hours of data, and delivered
599,174 total user-days of predictive service. Similarly, Calancea et al. [30] use the Google Cloud Datastore service
to store the data behind iAssistMe, a platform to assist people with eye disabilities. Both systems assure scalability
by increasing or reducing the number of servers based on the number of requests. Other systems are deployed on
single resource-constrained devices like mobile phones [14, 112], robots [58, 69, 122], or small processing units
(e.g., Raspberry Pi) [9, 22, 44, 63, 90]. System designers implement this type of deployment to fit data problems
that require online and fast inference and actuation, but where data storage is not needed or infeasible due
to hardware or bandwidth constraints. For example, Nguyen et al. [90] present a neuroprosthetic hand with
embedded deep learning-based control. The authors deploy a recurrent neural network (RNN) decoder on an
NVIDIA Jetson Nano, enabling the implementation of the neuroprosthetic hand as a portable and self-contained
unit with real-time control of individual finger movements. Data is not stored as it is processed online.
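
A minimal sketch of this online, store-nothing pattern is given below: samples enter a fixed-size rolling window, a decoder produces an actuation command, and the samples are then discarded. The sensor reader and the threshold-based decoder are stand-ins for the acquisition hardware and the RNN decoder of the cited system.

```python
import random
import time
from collections import deque

WINDOW = 20                      # samples kept in memory at any time
buffer: deque = deque(maxlen=WINDOW)

def read_sensor() -> float:
    """Stand-in for reading one sample from the acquisition hardware."""
    return random.gauss(0.0, 1.0)

def decode(window: list) -> str:
    """Stand-in for the trained decoder; a real system would run an RNN here."""
    return "flex" if sum(window) / len(window) > 0.5 else "rest"

for _ in range(100):             # the real loop would run indefinitely
    buffer.append(read_sensor())
    if len(buffer) == WINDOW:
        command = decode(list(buffer))
        # Actuate immediately; samples are never persisted.
        print("actuator command:", command)
    time.sleep(0.01)             # pacing stand-in for the device sampling rate
```
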
Over 17% of systems prioritise local processing. These systems deploy learning models and other components
on distributed computing nodes that process users’ requests, make decisions, and provide systems’ functionalities.
The main reason for decentralised processing is to meet scalability and low latency requirements in data-intensive
applications [37, 61, 62, 68, 76, 125, 147, 148]. Dai et al. [37] introduce BigDL, a distributed deep learning framework
for big data platforms and workflows. Intel developed the BigDL framework to ease the application of deep learning
in real-world data pipelines for its industrial users: Mastercard, World Bank, Cray, Talroo, UCSF, JD, UnionPay,
Telefonica, GigaSpaces, and more. These companies handle dynamic and messy data that requires complex,
iterative, and recurrent processing that is challenging to implement efficiently in centralised architectures. BigDL
distributes training and inference on top of the scalable architecture of the Spark compute model. Spark partitions
data across workers that form computing clusters. Workers apply map, filter, and reduce operations in a parallel
fashion. The data partitioning and parallel processing enable the design and implementation of fault-tolerant and
efficient systems based on learning models built from large data sets. A few cases offer systems’ functionalities
based on the cooperation of distributed nodes. These systems also follow the peer-to-peer first sub-principle and
correspond to 20% of the reviewed papers [37, 61, 62, 68, 76, 125, 145, 147, 148]. System designers adopt local
processing and peer-to-peer first principles to satisfy scalability, latency, fault-tolerance, and resource management
requirements of systems that handle data from intensive and distributed sources. That is the case with systems
that analyse social media data, for example, sentiment analysis of tweets or spam detection [68, 145, 147], or the
Breakthrough Listen program [76] in the search for extraterrestrial intelligence (SETI). Adopting the peer-to-peer
sub-principle requires distributed computing technologies and protocols to handle the complexity of decentralised
processing. Systems also use these technologies when they adopt the local first sub-principle. Examples of the use
of these technologies are Zhang et al. [147], Dai et al. [37], and Karageorgou et al. [68] using Spark, Junchen et
al. [62] based on Kafka, Xu et al. [145] using DHTs, Lebofsky et al. [76] storing data on Hierarchical Data Format
(HDF5), Shan et al. [125] using the BFT protocol, and Herrero et al. [61] based on Zookeeper.
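
The data-parallel style underpinning these local-first systems can be sketched with PySpark as follows; the workload, partition count, and thresholds are arbitrary examples rather than BigDL code, which builds on the same Spark compute model.

```python
from pyspark.sql import SparkSession  # pip install pyspark

spark = SparkSession.builder.appName("local-first-sketch").getOrCreate()
sc = spark.sparkContext

# Data is partitioned across workers; each worker applies map/filter locally and
# only the reduced result travels back to the driver.
readings = sc.parallelize(range(1_000_000), numSlices=8)
feature_sum = (
    readings
    .map(lambda x: x * 0.001)        # local transformation on each partition
    .filter(lambda x: x > 0.5)       # local filtering
    .reduce(lambda a, b: a + b)      # aggregation across partitions
)
print(feature_sum)
spark.stop()
```
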
Over thirty-one per cent of the reviewed systems partially adopt the local first sub-principle. Nodes in the
network are in charge of different functionalities in these systems. Centralised servers are in charge of computationally expensive processing such as model training and updating [14, 60, 63, 69, 112, 115, 118, 127, 145], global
decision making [8, 30, 117], and systems’ orchestration [21, 132]. Centralised servers execute these tasks because
they have enough computing power and a global view of the system. Central processing enables easier control of


the components of the system, which can be challenging in fully decentralised systems as autonomous entities
must self-govern. Practitioners opt for such central control to make systems robust as they rely on centralised
servers to offer scalable and highly available solutions. The works of Sultana et al. [132] and Brumbaugh et al. [21]
are good examples of central control. They encapsulate systems’ functionalities in logically distributed containers
that Kubernetes centrally controls. Kubernetes is the central platform that manages container instantiation,
resource allocation, scheduling, and execution. Systems that partially adopt the local first sub-principle use edge
nodes or mobile devices to offer data collection and preprocessing functionalities. These nodes also offer inference
capabilities when they host trained learning models. System developers use edge-computing architectures to
enable local inference and fit low latency and data ownership requirements as learning models are closer to end
users [28, 133]. For example, Sarabia et al. [118] deploy an edge-based fall detection system for Ambient Assisted
Living (AAL) environments. AAL systems process highly sensitive patient data and require very low processing
time. The authors propose a 3-layer fog-cloud architecture composed of medical devices, fog nodes, and a cloud
server. Deep learning models are deployed in fog nodes to detect patients’ falls. The authors prove that the
proposed platform performs better than a cloud baseline (i.e., better efficiency and response time). One or more
fog nodes attend to each patient, creating a local area network to preprocess (e.g., filtering) the patient’s data.
Dedicated fog nodes enable data privacy and security by design. Kemsaram et al. [69] present the architecture
design and development of an onboard stereo vision system for cooperative automated vehicles. The platform
relies on a stereo camera that captures left and right images processed by pre-trained DNNs to perform object
perception, lane perception, and free space perception. An object tracker then estimates different metrics over
the classified images, such as depth and radial distance, relative velocity, and azimuth and elevation angle. The
system implements a 4-layer architecture deployed in each vehicle to guide cooperative autonomous navigation.
We found that over half of the papers describe centralised systems in which central nodes host all their
components, which reflects the prevalent preference for the traditional client-server architecture. Some of these
systems work in extremely resource-constrained scenarios that demand deployment on a single device [22, 44, 58,
90, 122]. Engineers leverage such deployments in environments that require autonomous and self-contained
systems because communication with back-end servers is limited or impossible. That is the case of the work
proposed by Schumann et al. [122] that reports an ML-based system deployed in rovers to enable automatic vehicle
monitoring in space missions. The Spirit and Opportunity Mars Exploration Rovers (MER) were autonomously
functional on the surface of Mars for 6 and 14 years, respectively.11 These rovers performed data collection and navigation
tasks. The STRANDS Core System [58] supports long-term autonomy (LTA) robot applications in security and
care environments. LTA refers to the capability of robots to operate continuously for multiple weeks. STRANDS
supported over 100 days of autonomous operations for robots performing navigation, human behaviour prediction,
and activity recognition tasks. We also found that cloud servers host most centralised systems. Practitioners
exploit cloud servers’ flexibility to address scalability and availability requirements [2, 7, 9, 10, 13, 15, 35, 48–
50, 56, 57, 79, 88, 92, 111, 120, 128]. That is the case of iTelescope [57], which is an ML-based platform for
real-time video classification. This platform collects data streams from the network traffic and processes them
with two learning models for video identification and resolution classification. iTeleScope worked in a campus
network, served several hundred users, and demonstrated good performance with a high number of concurrent streams.
Centralised entities rely on third-party entities to satisfy additional requirements such as monitoring and security.
For example, AIDEx [10] is a platform to predict patients’ risk of developing sepsis in the next 4 to 6 hours.
AIDEx fetches patients’ records from a real-time EMR database and displays hourly sepsis risk scores for each
patient. This platform is based on microservices for preprocessing data, executing the prediction learning model,
and storing and visualising the prediction outcomes. The privacy and security of patients’ data are key system
requirements, but designers leverage third-party mechanisms such as firewalls and virtual private clouds (VPCs).

11 NASA MER: [Link]


Systems that follow the peer-to-peer first sub-principle correspond to 20% of the reviewed papers. The rest
of the papers (i.e., 80%) report systems that do not follow this sub-principle. Some of these are implemented
as centralised architectures in the cloud [2, 7, 10, 13, 21, 48, 50, 57, 88, 111, 120, 128, 132]. Central entities in
these deployments orchestrate the interactions between systems’ components. Designers opt for such centralised
control as it enables robust and flexible systems that are easy to scale and highly available. For example, Ali et al. [7]
present ID-Viewer, a surveillance system for identifying infectious diseases in Pakistan. Data collection, analysis,
and visualisation components process data from different areas of the country. A central server hosts these
components to enable spatio-temporal analysis of the causes and evolution of infectious diseases. Several works
also describe systems as federated architectures [8, 9, 15, 30, 35, 49, 56, 60, 79, 92, 112, 115, 117, 118, 127] without
horizontal interactions between nodes at the same layer. Examples of these architectures are IoT deployments
in different domains such as healthcare [35, 92, 112, 118], smart buildings [79, 117], farming [8], and predictive
maintenance [9]. The architectures of these projects follow a layered pattern with vertical interaction between IoT
devices, edge servers, and cloud servers. Designers use different layers to satisfy different system requirements.
Lower layers (i.e., IoT devices) are in charge of data collection, middle layers (i.e., edge servers) are in charge
of data preprocessing, filtering, and transmission, and upper layers (i.e., cloud servers) are in charge of data
aggregation, decision making, visualisation, and user interaction. There are also self-contained systems deployed
in single resource-constrained devices [14, 22, 44, 58, 63, 69, 90, 122]. These are embedded systems that combine hardware
and software components for specific functions in environments that require fast reactions (e.g., neuroprosthetic
devices [90]) or have limited connectivity with back-end servers (e.g., NASA’s Mars rovers [122]).

5.3 Openness
Figure 5 shows to what extent practitioners adopt the openness principle. Twenty-two per cent of the papers
rely on flexible architectures where new components can join autonomously. These new components join and
cooperate with the existing ones to achieve the system’s goals. Practitioners exploit this flexibility when systems
must be scalable and highly available. Dynamic requirements need on-demand extension of storage and computing
capabilities [37, 48, 61, 62, 68, 76, 125, 132, 145, 147]. Such flexibility causes security concerns as malicious entities
can join the architecture. Designers must implement distributed protocols to address this risk. For example,
Shan et al. [125] propose a decentralised architecture for Poligraph based on the BFT protocol, which adds
servers and reviewers to the fake news detection process. Such an adaptation responds to variability in the number
of requests, enabling scalability, while parallelism enables low latency. A new node follows the BFT protocol to
join the network. It sends messages to existing nodes to verify their identities, receives requests, and participates
in the review consensus. The Sifter platform [145] is an online spam detection system for social networks. It
uses a recurrent neural network (RNN) for spam detection and Distributed Hash Tables (DHT) to address the
fast-changing nature of topics and events. Social data is stored by decentralised servers and grouped by topics
using the DHT structure to enable fast retrieval. Each server responds to spam detection requests using its local
neural network. New servers can join by following the DHT protocol, which extends the storage capability
of Sifter. A new node must contact an existing one, perform a lookup operation to find the closest nodes, and
update these nodes’ information and the network topology with its data. Local spam detection fits low latency
requirements as requests are solved closer to end users. We observe that the technologies and protocols used to
handle decentralised storage and peer-to-peer cooperation (Section 5.2) also support flexible architectures where
entities can join autonomously. That is the case for protocols such as DHT [145], HDF5 [76], Spark [37], or BFT [125],
and architectural patterns such as dataflow [62, 132], the observer [48], or publish/subscribe [61, 62, 68, 132, 147].
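
To give a flavour of how a new node can join such an open, decentralised structure, the sketch below implements a simplified hash ring: nodes and topics share an identifier space, a lookup routes a topic to its responsible node, and a join updates the topology. This is a toy consistent-hashing illustration, not the DHT protocol used by Sifter.

```python
import hashlib
from bisect import bisect_right, insort

def node_id(name: str) -> int:
    """Place nodes and topics on the same hash ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % 2**16

class Ring:
    def __init__(self) -> None:
        self.ids: list = []
        self.nodes: dict = {}

    def join(self, name: str) -> None:
        """A new node announces itself and the topology is updated."""
        nid = node_id(name)
        insort(self.ids, nid)
        self.nodes[nid] = name

    def lookup(self, topic: str) -> str:
        """Find the node responsible for a topic (closest clockwise on the ring)."""
        tid = node_id(topic)
        idx = bisect_right(self.ids, tid) % len(self.ids)
        return self.nodes[self.ids[idx]]

ring = Ring()
for name in ["server-a", "server-b"]:
    ring.join(name)
print(ring.lookup("breaking-news"))   # routed to one of the existing servers
ring.join("server-c")                 # a new node joins autonomously
print(ring.lookup("breaking-news"))   # topics may now be re-routed to the newcomer
```
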
There are systems based on federated [8, 15, 30, 60, 115, 117, 118, 127, 148] and centralised [7, 9, 13, 35, 49,
56, 63, 79, 92, 111, 112, 128] architectures that partially (i.e., 46% of the papers) adopt the autonomous entities
sub-principle. These architectures have mechanisms to add new sensing devices or users at the lowest layer. These


Fig. 5. Adoption of openness principles. Most reviewed systems use autonomous entities, but only a third use them
asynchronously. More than 50% utilise message protocols.

mechanisms consist of interoperable communication technologies that enable new actors to exchange messages
with existing nodes. These technologies include WiFi, Bluetooth, Zigbee, 5G, etc. However, these systems are not
flexible to include more storage or compute nodes at the edge or cloud layers. Authors exploit such features to
implement and deploy flexible systems where new sensors expand systems’ data collection capabilities and new
users expand systems’ service coverage. For example, Alonso et al. [8] use Fiware12 as a middleware to manage
devices that transmit data using the ZigBee wireless standard. The collected data supports automated real-time
decisions about the state of dairy cattle and the feed grain. Niu et al. [92] developed the SensorBox device that
integrates different IoT sensors to measure environmental variables. Supervised learning models are trained
with such data to classify daily activities. Barachi et al. [13] deploy a crowdsensing platform based on DL to
generate accident reports. Users are data consumers and collectors who can join the network by installing a
mobile application. Calancea et al. [30] also require users to install an Android application to use the NLP-based
platform that supports visually challenged people.
A third of the papers report ML-based systems whose components are not autonomous. Practitioners avoid
autonomous components when systems’ requirements do not demand dynamic changes to their components.
12 Fiware: [Link]


These systems are self-contained embedded systems [14, 22, 44, 58, 69, 90, 122] or rely on cloud-based architectures [2, 10, 21, 50, 57, 88, 120]. The authors deploy embedded systems to perform specific tasks with limited
resources. A change in the tasks of these systems implies their redesign, including their hardware components.
That is the case of the robots of the STRANDS project [58], specifically designed for navigation, human behaviour
prediction, and activity recognition tasks. Similarly, Cabanes et al. [22] particularly design an embedded system
to detect parking events. It uses a single 3D sensor to collect the input data for an object recognition component.
Cloud-based architectures rely on the cloud provider services to extend the storage or computing capabilities of
the systems. This flexibility enables systems to satisfy availability and scalability requirements. For example, the
Bighead framework [21] is based on Kubernetes to support the development of data-driven solutions across the
whole Airbnb organisation. Designers achieve such flexibility at the cost of the cloud service providers’ fees.
Almost a quarter of papers report systems that leverage asynchronous entities [37, 48, 61, 62, 68, 76, 90, 120,
122, 132, 147]. Software components act as data producers and consumers when communicating asynchronously.
Developers adopt this sub-principle because it enables components not to block each other and operate independently. These do not wait for responses from other components but subscribe to relevant data producers.
Systems based on these loosely coupled interactions satisfy low latency and big data processing requirements
because asynchronous entities enable parallelisation and decentralised computing. The papers that follow the
asynchronous entities sub-principle also follow the message exchange protocol one. Practitioners consider message
exchange protocols because autonomous entities must communicate in open environments. Such protocols (e.g.,
MQTT, RabbitMQ, etc.) determine how entities exchange messages. The sentiment analysis system proposed by
Karageorgou et al. [68] relies on asynchronous components. Data from different intensive sources is collected
using the publish/subscribe pattern, enabling different workers to process different data streams in parallel. This
communication is based on RabbitMQ and Apache Kafka standards, while Spark enables a configurable number
of workers. Similarly, BigDL [37] supports large-scale distributed training of deep learning models. Worker nodes
constitute a Spark cluster to compute and aggregate local gradients for each model in parallel. Spark provides a
data-parallel functional computing model that defines how workers operate. Workers perform map, reduce, and
filter operations to transform data represented as Resilient Distributed Datasets (RDDs). Robotic systems are also
based on asynchronous components because robots perform tasks in parallel. For example, a rover must move
while monitoring its sensors’ status [122]. Similarly, the neuroprosthetic hand proposed by Nguyen et al. [90]
consists of three parallel threads that perform data acquisition, preprocessing, and motor decoding tasks. A
queue-based protocol drives the communication between these systems’ components. Data coupling (Section 5.1)
enables asynchronous interactions by design as components write data to and read data from data mediums. We
observe a high correlation between the asynchronous entities and the data coupling sub-principles, as most papers
that follow the former also follow the latter.
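
The queue-driven, non-blocking interaction described above can be sketched with three threads coupled only through queues, loosely mirroring the acquisition, preprocessing, and decoding stages of the neuroprosthetic hand; the data, window logic, and sentinel-based shutdown are illustrative assumptions.

```python
import queue
import threading

raw_q: "queue.Queue" = queue.Queue()
feature_q: "queue.Queue" = queue.Queue()

def acquire() -> None:
    """Acquisition thread: produces raw samples without waiting on consumers."""
    for sample in range(5):
        raw_q.put([sample * 0.1] * 4)
    raw_q.put(None)  # sentinel: no more data

def preprocess() -> None:
    """Preprocessing thread: consumes raw samples and publishes features."""
    while (sample := raw_q.get()) is not None:
        feature_q.put(sum(sample) / len(sample))
    feature_q.put(None)

def decode() -> None:
    """Decoding thread: consumes features and drives actuation."""
    while (feature := feature_q.get()) is not None:
        print("actuate with feature", round(feature, 3))

threads = [threading.Thread(target=f) for f in (acquire, preprocess, decode)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
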
Half of the papers report systems that partially adopt the asynchronous entities and the message exchange
protocol sub-principles. Designers implement asynchronous communication because systems must listen to
requests continuously or notify users at any time [13, 30, 56, 125, 128]. iAssistMe [30] relies on the pub/sub
communication pattern to interact with visually challenged people and support their daily activities. The system
uses RabbitMQ and the Advanced Message Queuing Protocol (AMQP) to handle data producers and consumers.
Sharkeye [56] uses the AWS Simple Notification Service (SNS) to send messages warning people about the
presence of sharks via smartwatches. Users subscribe to the notification service following the pub/sub pattern.
Other authors implement systems that use asynchronous entities to listen to data sources (e.g., sensor devices)
that can produce data at any time [2, 8, 9, 15, 35, 49, 57, 79, 88, 92, 111, 112, 115, 117, 118, 127, 148]. Bellocchio et
al. [15] present the SEAL project to provide home automation based on ML for building energy management and
safety. SmartSEAL is the platform that manages IoT devices deployed in the buildings. It offers asynchronous
and synchronous communication using pub/sub and client-server protocols. The Robot Operating System (ROS)
supports the pub/sub mechanism where sensor nodes write to and read from a topic. They extend ROS with


a central entity (i.e., roscore), which provides device naming and registration services. Synchronous API calls
enable the consumption of these services. Alves et al. [9] propose an industrial system based on IoT nodes capable
of acquiring and transmitting machine status parameters using the pub/sub architectural pattern (i.e., the MQTT
protocol). IoT nodes post their readings in a shared database, which the system’s components read asynchronously.
Gallagher [49] uses synchronous and asynchronous mechanisms in the IntelliMaV platform. IntelliMaV is a
cloud-based system that applies machine learning to verify the performance of energy conservation measures in
near real-time. The system is deployed in the cloud and follows a 3-layer architecture. The user tier is the access
point for users through a web browser that presents an interface for model training and deployment. The cloud
tier is a virtual private cloud and hosts the application infrastructure. This tier processes the application requests
by running R code exposed through APIs. The site tier represents the pipelines that collect data from the industrial
site. These pipelines push the collected data to the cloud storage asynchronously for further processing. The smart
diary tracer platform [8] uses different protocols to collect data from diverse sources. These protocols include
Zigbee, LoRa, WIFI, Bluetooth, and 3G, which support communication with IoT sensors deployed on the crops and
the barns used to feed the livestock, the cattle to monitor their health, the factories for the traceability of packaged
products and energy monitoring, and the trucks that transport the dairy products. Sarabia et al. [118] propose a
3-layer fog-cloud architecture where new healthcare devices join using low-power wireless technologies such as
Bluetooth, Zigbee, 6LoWPAN, and Wi-Fi. The Bluetooth protocol is also used by Quintero et al. [112] to include
new devices to collect photoplethysmography data and heart rate monitoring.
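
A minimal MQTT sketch of this publish/subscribe interaction is shown below, assuming the paho-mqtt 1.x client and a broker reachable at localhost:1883; topic names and payloads are hypothetical, and the publisher and subscriber would normally run as separate processes.

```python
import json
import paho.mqtt.client as mqtt          # pip install "paho-mqtt<2"
import paho.mqtt.publish as publish

BROKER = "localhost"

# IoT node side: publish one machine-status reading to a topic.
publish.single(
    "factory/machine-7/status",
    json.dumps({"vibration": 0.42, "temp": 61}),
    hostname=BROKER,
)

# Analysis component side: subscribe and react asynchronously to each message.
def on_message(client, userdata, msg):
    status = json.loads(msg.payload)
    print(msg.topic, "->", status)

subscriber = mqtt.Client()
subscriber.on_message = on_message
subscriber.connect(BROKER, 1883)
subscriber.subscribe("factory/+/status")   # '+' matches any machine identifier
subscriber.loop_forever()                  # blocks; runs the network loop
```
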
The rest of the reviewed papers (i.e., 26.7%) report synchronous systems. These rely on traditional client/server
architectures and communication patterns such as RPC or API calls. The authors opt for this design decision when
systems are small, self-contained, and attend synchronous requests [14, 22, 44, 58, 60, 63, 69]. Hegemier et al. [60]
propose a real-time danger avoidance system for robots. They deploy a neural network in an edge server, which
receives prediction requests from robots in HTTP format. Requests include an image of the robot environment
and its status information (e.g., damage report). The edge server responds synchronously with a classification of
the image. Johny et al. [63] propose a system for metastasis detection. They deploy a deep learning model in a
Raspberry Pi, which performs image classification tasks. Users send their requests (i.e., images) to an embedded
web server installed in the Pi. Components of larger systems also interact synchronously. These are usually
deployed in the cloud and follow microservices architectures. They adopt layered and modular architectures
that satisfy scalability and availability requirements [7, 10, 21, 50, 145]. For example, the AIDEx [10] system
adopts a microservices architecture to predict patients’ risk of developing sepsis. The system encapsulates each functionality
as a microservice. The system exposes microservices for preprocessing data, executing the prediction algorithm,
storing the prediction outcomes, and visualising outcomes. The system orchestrates these components using API
calls to offer end-to-end capabilities such as sepsis prediction.

5.4 Summary
Table 7 summarises our survey findings against the research questions we formulated in Section 4. This section
highlights the key findings and provides practical advice towards implementing and deploying Data-Oriented
ML-based systems in the real world. We found that only one-fifth of the reviewed ML-based systems follow
all or most DOA principles [37, 61, 62, 68, 76, 125, 147, 148], which reflects a low adoption of DOA principles
when implementing current ML-based systems. Despite this low adoption, we found that designers exploit DOA
principles in the reviewed systems to satisfy requirements that are becoming increasingly common for data-driven
systems. These systems address data requirements that demand managing big data from distributed and intensive
sources, low-latency processing tasks, and efficient management of storage and computing resources. We observed
that systems based on architectural patterns such as dataflow and publish/subscribe are more data-oriented
and follow more DOA principles. These system designs focus on data exchange between components (i.e., data


Table 7. Summary of our paper findings against the research questions we formulate in Section 4 for each DOA principle.
The goal of this survey is to determine to what extent and how researchers have adopted the DOA principles to implement
current ML-based systems, and what are the motivations behind such adoption?

Columns: DOA Principle | Adoption Extent (RQs: To what extent? See Figures 3, 4, and 5 and Table 6) | Adoption Motivation (RQs: Why?) | Adoption in Practice (RQs: How?)

DOA Principle: Data as a First Class Citizen
- Adoption Extent: All reviewed systems are data-driven. Almost 75% of papers fully or partially adopt the shared data model sub-principle. Less than half of the systems adopt data coupling.
- Adoption Motivation: ML-based systems use data-driven algorithms to satisfy their functional requirements. These systems must handle continuous data sources and historical data, usually large and dynamic. Data coupling fits these data management requirements better.
- Adoption in Practice: Practitioners use diverse data-driven algorithms to provide different functionalities, for example, DL models for fall detection, RL models for resource optimisation, Bayesian networks for rover monitoring, etc. Designers implement shared data models and data coupling using databases (e.g., HDF5, HBase, etc.), streams (e.g., Apache Kafka, Spark Streaming, etc.), and message queues (e.g., RabbitMQ).

DOA Principle: Prioritise Decentralisation
- Adoption Extent: Only 35% of works fully or partially adopt the local data chunks sub-principle. Almost half of the reviewed papers fully or partially perform local first processing. Only 20% of systems rely on components that prioritise peer-to-peer collaboration.
- Adoption Motivation: Practitioners implement decentralised or federated architectures because these enable systems to manage resources efficiently and be fault-tolerant thanks to data partitioning and replication. In addition, local data chunks improve systems’ privacy and security as the data ownership does not change.
- Adoption in Practice: Practitioners rely on distributed storage and computing technologies and protocols to implement and manage decentralised or federated architectures. Examples of these technologies and protocols are Apache Kafka, HDFS, DHT, and BFT.

DOA Principle: Openness
- Adoption Extent: Almost 70% of the reviewed systems rely on autonomous entities. Almost 75% of papers rely on asynchronous entities that communicate using a message protocol. Hybrid architectures prevail, combining static and autonomous entities based on synchronous and asynchronous communication protocols.
- Adoption Motivation: Dynamic environments like the ones we find in the real world require flexible architectures where new components (e.g., IoT sensors) can join at runtime. Practitioners adopt the openness principle to design and implement such architectures. Autonomous entities usually communicate in an asynchronous fashion using different message exchange protocols.
- Adoption in Practice: Practitioners use distributed computing (e.g., Apache Kafka, DHT, etc.) and communication technologies to implement open systems. Examples of communication technologies are Wi-Fi, Bluetooth, Zigbee, and 5G. Protocols like MQTT and RabbitMQ work on top of these technologies to enable asynchronous communication between autonomous entities.

first), which does not assume any coupling on the control flow level. Practical advice: Developers focus first on
the data while creating data-intensive systems, and operations are secondary. Dataflow and publish/subscribe
patterns are well-suited for designing and implementing DOA systems.
We found that the data as a first-class citizen is the most adopted principle among the reviewed papers,
mainly because of the type of requirements ML-based systems have. These systems are data-driven by nature,
implying that they must manage large and dynamic data sets. Shared data models and data coupling sub-principles
fit better this type of requirements. Engineers adopt the data as a first-class citizen principle because it
enables efficient interactions between system components. This design decision avoids data transmission between
components as payloads of direct calls (e.g., API calls). Systems’ components act as producers and consumers of
data stored in shared data models. Such data coupling offers asynchronous interactions by design, which makes
components autonomous and non-blocking for each other. Data coupling also addresses low latency requirements
and resource constraints because autonomous components can process large data sets in parallel. The nature of
the data that systems handle influences the selection of the shared data models that the systems’ components populate.
Data coupling based on databases is appropriate for systems that work with historical data, while streams better
fit systems that handle continuous data from dynamic sources. An engineer can design a system with one or
more shared data models. This decision depends on how the system processes the data from inputs to outputs.
Practical advice: Data coupling enables systems to address big data processing requirements in data-driven
systems. Databases, streams, and message queues are examples of data mediums to consider when designing
data-first systems. These data mediums play the role of shared data models, which practitioners implement using
technologies such as Apache Kafka, Spark Streaming, HDF5, RabbitMQ, or HBase.
Our results show that the prioritise decentralisation principle is less popular than the data as a first-class citizen
one because centralised cloud deployments prevail as the preferred choice when deploying systems nowadays.
Practitioners have this strong preference because cloud platforms are flexible and mature enough to automate


cumbersome infrastructure management processes. Decentralised architectures (e.g., edge computing) require
communities to develop tools to facilitate their management for practitioners. We found that centralised architectures are the preferred design choice when deploying ML-based systems to the detriment of the prioritise
decentralisation principle. Cloud platforms cope well with the most relevant requirements that ML-based
systems have nowadays as they provide flexibility, high availability, scalability, and, most of the time, satisfy
current low-latency requirements. However, it is natural to expect these requirements to be more critical and
severe as data becomes more available, users require lower latency, and applications become more complex (e.g.,
digital twins, augmented reality, etc.). Further research on decentralised architectures is necessary to enable
the benefits of decentralisation while making it a feasible option for deployment. Reviewed works relying on
edge computing already leverage federated architectures with increasingly powerful edge servers that offer
local data storage and processing. However, the peer-to-peer interaction between nodes is missing in most cases.
Cloud servers are the backend and still play the central storage and processing roles. A collaboration between
nodes at lower layers in edge architectures can enable more sustainable systems by exploiting the storage and
computing power of everyday devices [74]. Practical advice: The absence of a central orchestrator in favour of
direct communication between any two nodes of the system is a straightforward way to move to decentralisation.
The authors of the reviewed papers used distributed storage technologies (e.g., Apache Kafka, HDFS, etc.) and
distributed protocols (e.g., DHTs and BFT) to implement decentralised solutions.
Systems that collect data from distributed devices (e.g., sensors) are usually open. These systems use communi-
cation technologies such as Wi-Fi, Bluetooth, Zigbee, and 5G, among others, to add new sensors and data sources.
Architectures are more closed and static at upper levels, where software components are static entities that
rely on synchronous communication via RPC or API calls. Data coupling and open environments have a strong
correlation. Reading from and writing to data mediums is an asynchronous communication process where systems’
components operate as data consumers and producers. This process requires components to utilise message
exchange protocols that describe how and when to read and write data. Such protocols also enable seamless and
flexible architectures where components can join or leave autonomously. Practical advice: The use of message
exchange protocols and data coupling results in systems that are open and flexible by design. It enables data
availability and horizontal scalability as systems add resources on demand. Communication protocols such as
MQTT and RabbitMQ are well-known tools for building open systems.

5.5 Threats to validity


In this section, we discuss threats to the validity and limitations of our work.
Internal validity. We used a multi-stage selection process to find papers for the survey. We have followed an
established methodology (SLR, [72, 73]). However, this approach can have drawbacks.
There might be missed terms in the search queries that we have run against digital libraries. To mitigate this
risk, we have iteratively improved our search procedure multiple times based on the analysis of the search output.
The search functionalities between different databases used in our automatic search differ. We addressed this
risk by including as many publicly available library APIs as possible, even if their content overlaps. Our tool is
publicly available and easy to extend to include new sources in the future to ensure quality and transparency.
The number of retrieved works required us to automate the filtering process. We used the state-of-the-art algorithm
Lbl2Vec to implement the automatic filter. However, this semantic filter could exclude relevant works. We tested
the Lbl2Vec algorithm under different configurations to determine the best possible result. The Lbl2Vec algorithm
requires defining classes for categorising papers as included and excluded. We explored different configurations
for the classes and validated the outputs against the research works we expected to include. Furthermore,
we used the snowballing step in the methodology to identify relevant papers we could have missed.


External validity. The conclusions of our study can be generalised beyond the projects described in the
reviewed papers and serve as general advice for ML engineers. Here, we discuss potential threats to such
generalisability and how we addressed them.
The scientific literature does not describe many ML deployments. Some deployments are presented as blog
posts, while others are not published anywhere. We did not include such projects in our study and dismissed
some published reports because they omitted information about their software architecture from the paper.
Nevertheless, our survey covers a significant range of subject areas, including ML deployments across fields ranging
from healthcare to autonomous driving (see Table 3). We have also taken great care not to impose unjustified
limits on the search procedure to improve coverage.
The analysis and conclusions of this paper can be prone to personal biases as it relies on the authors’ expertise.
Following the advice by Kitchenham et al. [71], the authors of this survey reviewed each other’s study selection,
analysis and conclusions. To further alleviate this risk, we sought feedback on our work, presenting intermediate
results internally to our research group and at external scientific events.

6 OPEN RESEARCH CHALLENGES


In the previous section, we observed that while the DOA principles offer the desired properties that enable
ML-based systems to achieve demanding requirements at deployment, their adoption rates are mostly low. More
research efforts are needed to advance the community’s understanding of DOA, its strengths and weaknesses,
why and how to build DOA systems, and the advantages of the capabilities DOA offers. Table 8 summarises
the identified DOA limitations from our survey. The rest of this section discusses them and formulates future
research directions accordingly.

6.1 Interplay with other areas


The DOA style offers a range of desirable properties for multiple software systems. The community needs
additional research efforts to study the interplay between the DOA style and other computer science areas. In this section, we
mention some of the research ideas that DOA can enable in the interplay with diverse fields of study. Shared
data models offer opportunities for new practices around emulating individual systems’ components. Such
emulation ability can lead to efficient systems’ end-to-end optimisation [3, 146]. Such optimisation is particularly
pertinent in cases where the system has to satisfy multiple competing requirements [12], in which multi-objective
Bayesian optimisation techniques can be applicable [104]. DOA can also accelerate advances in the area of
Edge Computing [28, 126, 133], Federated Learning [20] and self-adaptive systems [25, 52], all of which stand to
benefit from a distributed shared data model as well as autonomous open environments. The ability to discover
and inspect explicit flows of data within DOA systems has been highlighted as a key feature required for data
governance [5, 31, 124, 129], meaning DOA can play a role in developing AI systems that comply with existing
and upcoming legislation, such as the GDPR, the EU AI Act, or the Equal Credit Opportunity Act (ECOA). Similarly, access
to a complete dataflow graph means engineers can leverage causal inference techniques [102, 103].
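
As a small illustration of what an explicit dataflow graph enables, the sketch below models a hypothetical DOA system’s flows as a directed graph and answers two simple lineage questions; the component names are invented, and the graph library (networkx) is one possible choice among many.

```python
import networkx as nx  # pip install networkx

# A DOA system exposes its dataflow explicitly; here we model it as a directed
# graph and answer a simple governance question: which upstream sources feed a
# given model, and which downstream consumers are affected by a source?
flow = nx.DiGraph()
flow.add_edges_from([
    ("emr_database", "preprocessing"),
    ("wearable_stream", "preprocessing"),
    ("preprocessing", "risk_model"),
    ("risk_model", "clinician_dashboard"),
])

print(nx.ancestors(flow, "risk_model"))          # data sources the model depends on
print(nx.descendants(flow, "wearable_stream"))   # components affected by this source
```
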
The field of Natural Language Processing (NLP) has been experiencing rapid growth, with an increasing focus
on Large Language Models (LLMs). Models such as GPT-4 [95], Llama [135] or Codex [33] have shown remarkable
capabilities in understanding and generating human-like text or computer code. These massive models require
significant computational and storage resources, making software infrastructure a key determinant of their
success. Given LLMs’ dependency on the quantity and quality of data, DOAs bring benefits to building efficient
and robust infrastructure for them. An open research direction is to study the use of DOA principles for building
LLMs and how DOA can improve their infrastructure. LLMs depend on the quality of the training data, the
quality of the prompts, and the context users provide when asking questions. DOA-based systems can facilitate
training and contextual data collection to fine-tune and interact with LLMs. The realisation of this vision enables


Table 8. Survey findings, open challenges, and possible actions to address them.

Columns: Survey Findings | Open Challenge | Research Directions

Survey Findings: A few papers fully adopt the DOA principles. (RQ1, RQ2, RQ3)
Open Challenge: The DOA style is not fully integrated with other ML areas and the practitioners’ workflow.
Research Directions: The DOA style offers a range of desirable properties for multiple software systems. Integrating the DOA style within other AI and ML areas is an interesting and open research area. Alternative research paths in this area are:
- Develop DOA-based systems as enablers of ML pipelines (e.g., emulators, Bayesian optimisation, etc.).
- Explore the intersection between decentralised DOA-based systems and fields like edge computing, federated learning, self-adaptive systems, and causal inference.
- Case studies to define the role of DOA-based systems in data management and governance.
- Study how DOA principles can support LLMs in collecting data for fine-tuning or enriching prompting.

Survey Findings: Only one of the reviewed papers uses a shared data model and data coupling for monitoring purposes. (RQ1)
Open Challenge: Current systems do not exploit the DOA style’s monitoring properties.
Research Directions: One of the key features that shared data models and data coupling enable is the capability of systems’ self-monitoring. Research paths in this direction are:
- Develop frameworks that enable statistical emulators to consume data that systems expose in shared data models.
- Study and define how to apply the emulators’ mathematical concept in the software engineering domain.
- Exploit data exposed in DOAs to feed LLMs or transformers that can inform users about system states.
- Case studies to improve our understanding of the relationships between DOAs, surrogate models, and LLMs.

Survey Findings: Practitioners prefer centralised cloud deployments based on client/server or microservices over decentralised architectures when deploying current ML-based systems. (RQ2)
Open Challenge: Communities still miss tools and practices around the DOA style of system architecture.
Research Directions: The lack of maturity of the DOA style negatively impacts its adoption in different domains. Practitioners prefer centralised cloud deployments, relying on client/server or microservices architectures. We must research and develop knowledge, practices, and tools around the DOA style. Alternative research paths in this direction are:
- Build a knowledge base for DOA similar to the one communities have developed for SOA.
- Case studies to research and quantify the limits of the DOA architectural style.
- Research on the impact of the DOA style on the system’s lifecycle (i.e., design, development, deployment, and decommissioning).

Survey Findings: Systems’ architectures are flexible at lower layers to add new sensors or attend to users dynamically. However, this openness is absent in the upper layers, mainly because systems’ requirements do not demand dynamic architectures at these levels. (RQ3)
Open Challenge: The technological evolution and its fast pace are exacerbating systems’ requirements towards more flexible architectures, which threaten data security and privacy.
Research Directions: Future software systems will have exacerbated data requirements, demanding flexible architectures where openness plays a crucial role in creating decentralised deployments and exploiting the computing power of everyday devices. These flexible and open architectures generate privacy and security concerns that communities must mitigate. Research paths in this direction are:
- Explore and exploit different approaches from the security community to improve security and privacy aspects of the DOA style.
- Implement and evaluate homomorphic encryption and zero-knowledge proofs in the DOA context.
- Exploit decentralisation to deploy systems that store and process data locally (i.e., on end users’ devices).
- Develop the DOA style and appropriate tools for secure deployment on resource-constrained devices.


6.2 Systems Monitoring and Shadow Systems


DOA proposes that systems' components interact through data mediums. Shared data models store systems' historical states and make them fully available. This data availability should facilitate systems' monitoring tasks, in contrast to software design paradigms where interfaces hide data (e.g., microservices). However, in practice, we found
that only one of the forty-five reviewed papers uses the shared data model for monitoring purposes. Specifically,
the robotic platform proposed by Hawes et al. [58] monitors historical database records of the robot status and
its interactions with the environment. This limitation emerges because communities lack tools for monitoring DOA-based systems. New practices and frameworks are needed, and they can leverage the data availability that the DOA style of architecture offers.
One possible approach that fits with the key features of DOA is a network of statistical emulators (i.e., shadow
systems). A statistical emulator is a probabilistic surrogate model of a given process that allows quantifying
uncertainty to inform decision-making [105]. By exposing all intermediary data interactions within the system,
DOA can enable the creation and maintenance of a separate emulator for each system component. A network of
such emulators acts as a shadow system, which measures the gap in behaviour between the real-world deployment
and the designed software system. Such measuring includes monitoring, identifying the data stream fluctuations,
and estimating the effects of potential changes. Two research efforts are required to move the idea of a shadow system beyond a mere concept. First, the DOA community must identify the most efficient approaches and practices for automatically fitting surrogate models to software components. Existing work focuses on auto-tuning of system parameters [6, 38] and has limited scalability. Additional use cases should implement and test shadow emulators as monitoring and explainability tools for software, exploring scalable ways of automatically building surrogate models of systems' components. Second, innovation in the mathematical composition of individual
emulators is required to build networks that can efficiently propagate uncertainty between components [39].
While there is prior work on uncertainty propagation in software [84], it is still uncommon to see interfaces and
APIs that provide access or otherwise handle input or output uncertainty.
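To illustrate the kind of tooling this direction calls for, the following sketch fits a Gaussian-process emulator to records that a component exposes through a shared data medium and uses the predictive uncertainty to flag unexpected observations. It is a minimal example that assumes the scikit-learn library and fabricated latency records; it does not reproduce any of the reviewed systems.

```python
# Hedged sketch: a Gaussian-process emulator of a single DOA component, fitted to
# (request size, latency) records read from the shared data model. All numbers are
# fabricated and the feature/target choice is illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Records exposed by the component through the data medium.
X = np.array([[10.0], [50.0], [100.0], [200.0], [400.0]])  # request size
y = np.array([0.8, 1.1, 1.9, 3.2, 6.5])                    # observed latency (s)

kernel = RBF(length_scale=100.0) + WhiteKernel(noise_level=0.1)
emulator = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# The emulator predicts expected behaviour with uncertainty; a large gap between
# prediction and a newly observed record flags a potential anomaly to investigate.
mean, std = emulator.predict(np.array([[300.0]]), return_std=True)
observed = 9.7
z_score = abs(observed - mean[0]) / std[0]
print(f"expected {mean[0]:.2f}s +/- {std[0]:.2f}s, observed {observed}s, z-score {z_score:.1f}")
```

A network of such emulators, one per component, would constitute the shadow system described above, with the open challenge being how to fit and compose them automatically rather than by hand.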

6.3 DOA Concept Maturity and Tooling


Previous sections show that practitioners rely on several tools, frameworks, and services while building data-driven systems. Databases, streams, message queues, and file systems were all used as data communication mediums. At the same time, there is no reference guide that engineers could use when deciding which technology to adopt in their system. The adoption of DOA is currently limited because the concept itself lacks maturity. Practitioners prefer
centralised cloud deployments based on traditional client/server and microservices architectures. DOA can benefit
from having a technical stack taxonomy so developers can have a complete list of necessary abstractions in a
DOA system and a list of tools to choose for each abstraction (or their combination). Such taxonomies already
exist for other areas of engineering practice, such as MLOps 13 or microservices [51].
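As a concrete illustration of one abstraction such a taxonomy would need to name, the sketch below defines a minimal data-medium interface behind which concrete technologies (e.g., streams, message queues, or databases) could sit. The class and method names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a minimal "data medium" abstraction of the kind a DOA
# technical-stack taxonomy could standardise. Concrete backends (e.g., a message
# queue, a stream, or a database) would implement the same interface.
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List

class DataMedium(ABC):
    """Common interface components use to share data instead of calling each other."""

    @abstractmethod
    def publish(self, topic: str, record: Dict[str, Any]) -> None: ...

    @abstractmethod
    def subscribe(self, topic: str, handler: Callable[[Dict[str, Any]], None]) -> None: ...

class InMemoryMedium(DataMedium):
    """Toy backend for local testing; a production system would use a durable store."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[Dict[str, Any]], None]]] = {}

    def publish(self, topic: str, record: Dict[str, Any]) -> None:
        for handler in self._handlers.get(topic, []):
            handler(record)

    def subscribe(self, topic: str, handler: Callable[[Dict[str, Any]], None]) -> None:
        self._handlers.setdefault(topic, []).append(handler)

# Components remain decoupled: a model publishes predictions, a monitor consumes them.
medium = InMemoryMedium()
medium.subscribe("predictions", lambda r: print("monitor saw:", r))
medium.publish("predictions", {"input_id": 42, "score": 0.87})
```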
A question closely related to the choice of tools and frameworks is the operational maintenance of DOA
systems. Over the years, the software engineering community accumulated a vast amount of knowledge on how
to run SOA systems in production: what metrics to monitor, how to scale horizontally and vertically, how to
mitigate and troubleshoot performance issues [16, 40, 78, 130]. A similar knowledge base about DOA systems is necessary, and building it requires accumulating experience of running DOA software in production. In the meantime, researchers can study configurations of DOA setups, their performance, and critical metrics. The goal of such studies must be to
determine and quantify the impact of the DOA style on the systems’ life-cycle from design to decommissioning.
13 AI Infrastructure Alliance project: [Link]


6.4 Systems Security and Privacy


The DOA style encourages practitioners to implement open systems where autonomous entities can freely
access shared data models [82, 108, 142]. These data models can contain sensitive data, which naturally raises
questions regarding systems' security and privacy [136]. In such environments, malicious entities can access and modify systems' data and behaviour. Designers must restrict access to data and system components to a specific set of users, considering different policies depending on the application. Decentralised deployments can mitigate
security and privacy threats by storing and processing data in devices closer to end users. For example, the crowd
management system proposed by Santana et al. [117] anonymises Wi-Fi frames at the edge to protect people’s
identities. The authors use these frames to train a learning model for crowdsensing estimation. Despite these
efforts, managing authentication, permissions, and encryption in open environments is challenging.
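A minimal sketch of this pattern, salted hashing of device identifiers before records reach the shared data model, is shown below. The field names and salt handling are illustrative and not taken from the cited system.

```python
# Illustrative sketch: anonymising device identifiers at the edge before publishing
# records to the shared data model, in the spirit of the crowd-management example.
# The frame fields and salt rotation policy are hypothetical.
import hashlib
import os

SALT = os.urandom(16)  # kept only on the edge device and rotated periodically

def anonymise_frame(frame: dict) -> dict:
    """Replace the MAC address with a salted hash so identities cannot be recovered downstream."""
    digest = hashlib.sha256(SALT + frame["mac"].encode()).hexdigest()[:16]
    return {"device": digest, "rssi": frame["rssi"], "timestamp": frame["timestamp"]}

print(anonymise_frame({"mac": "a4:5e:60:d1:22:01", "rssi": -61, "timestamp": 1714550400}))
```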
Different research efforts from the security community can support DOA open setups to address security
and privacy challenges. Homomorphic encryption [47] is an interesting direction for performing decentralised
computations directly on encrypted data without providing the decryption key to the participant nodes. For
example, a payment system might inquire about the transaction’s validity without having access to the underlying
data (e.g., a bank account number). Early deployments of zero-knowledge proofs [45] and homomorphic encryption are taking place in industry [18]. These technologies offer a solution to privacy issues in decentralised networks, but the field is still in its infancy. The algorithms developed so far are often computationally expensive, so further research is needed to make them practical on resource-constrained devices. These novel research efforts must align with
the prioritisation of decentralised deployment, as this DOA principle benefits privacy and security attributes by
exploiting local storage and computing. In addition to these technical advances, security and privacy issues require
authorities to develop novel initiatives and policy frameworks to keep up with advances in technology [85].
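The sketch below illustrates the kind of computation this direction envisions: an untrusted node aggregates values published to a shared medium without ever decrypting them. It assumes the python-paillier (phe) package and fabricated values; it is an illustration under those assumptions rather than a recommended deployment.

```python
# Hedged sketch: additively homomorphic encryption over a DOA data medium,
# assuming the python-paillier package (phe). A node aggregates encrypted values
# published by data owners without seeing the plaintexts.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Data owners encrypt their records before publishing them to the shared medium.
encrypted_amounts = [public_key.encrypt(v) for v in (120.0, 75.5, 310.25)]

# An untrusted aggregator computes on ciphertexts only.
encrypted_total = encrypted_amounts[0]
for enc in encrypted_amounts[1:]:
    encrypted_total = encrypted_total + enc

# Only the key holder (e.g., the data owner or a trusted auditor) can decrypt.
print(private_key.decrypt(encrypted_total))  # 505.75
```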

7 CONCLUSIONS
This paper surveys the extent to which, why, and how practitioners have adopted DOA to design, implement, and
deploy ML-based systems in the real world. We first discuss the DOA principles based on the existing literature
and explain how these can support practitioners in addressing the challenges that emerge from deploying ML
models as part of larger systems. The works analysed in this survey were selected using a semi-automatic
process to facilitate this study’s reproducibility and evolution. We developed this process based on a well-known
methodology for literature reviews in the software engineering domain. The code, raw, and preprocessed data
generated in this process are available online.
Our survey shows that a few systems fully adopt the DOA principles, with most reviewed works demonstrating
partial adoption. We observed that systems that follow the DOA principles are mostly data-intensive. They address
requirements such as Big Data management, low-latency real-time processing, efficient resource management,
security and privacy. We also show the protocols and tools these systems use for adopting the principles and
distil recommendations for practitioners looking to build their data-intensive systems following DOA. We finish by formulating open research challenges that can improve the community's understanding and application of DOA.
We expect this work to increase community awareness of DOA and ignite interest in research, development, and
adoption of this promising architectural style for real-world ML-based systems.

ACKNOWLEDGMENTS
This research was generously funded by the Engineering and Physical Sciences Research Council (EPSRC) and
the Alan Turing Institute under grant EP/V030302/1. We thank Oluwatomisin Dada, Eric Meissner, Markus Kaiser,
and Peter Ochieng for the helpful and insightful discussion on this paper.


REFERENCES
[1] Mike Acton. 2014. Data-oriented design and C++. CppCon (2014).
[2] Alekh Agarwal, Sarah Bird, Markus Cozowicz, Luong Hoang, John Langford, Stephen Lee, Jiaji Li, Dan Melamed, Gal Oshri, Oswaldo
Ribas, Siddhartha Sen, and Alex Slivkins. 2016. Making Contextual Decisions with Low Technical Debt. (2016). [Link]
48550/ARXIV.1606.03966
[3] Virginia Aglietti, Xiaoyu Lu, Andrei Paleyes, and Javier González. 2020. Causal Bayesian Optimization. In Artificial intelligence and
statistics. PMLR.
[4] Hussain Akbar, Muhammad Zubair, and Muhammad Shairoze Malik. 2023. The Security Issues and Challenges in Cloud Computing.
International Journal for Electronic Crime Investigation 7, 1 (2023), 13–32.
[5] Sherif Akoush, Andrei Paleyes, Arnaud Van Looveren, and Clive Cox. 2022. Desiderata for next generation of ML model serving.
Workshop on Challenges in Deploying and Monitoring Machine Learning Systems, NeurIPS (2022).
[6] Sami Alabed and Eiko Yoneki. 2022. BoGraph: structured bayesian optimization from logs for expensive systems with many parameters.
Proceedings of the 2nd European Workshop on Machine Learning and Systems (2022).
[7] M.A. Ali, Z. Ahsan, M. Amin, S. Latif, A. Ayyaz, and M.N. Ayyaz. 2016. ID-Viewer: a visual analytics architecture for infectious diseases
surveillance and response management in Pakistan. Public Health 134 (2016), 72–85. [Link]
[8] Ricardo S. Alonso, Inés Sittón-Candanedo, Óscar García, Javier Prieto, and Sara Rodríguez-González. 2020. An intelligent Edge-IoT
platform for monitoring livestock and crops in a dairy farming scenario. Ad Hoc Networks 98 (2020), 102047. [Link]
adhoc.2019.102047
[9] Filipe Alves, Hasmik Badikyan, H.J. António Moreira, João Azevedo, Pedro Miguel Moreira, Luís Romero, and Paulo Leitão. 2020.
Deployment of a Smart and Predictive Maintenance System in an Industrial Case Study. In 2020 IEEE 29th International Symposium on
Industrial Electronics (ISIE). 493–498. [Link]
[10] Fatemeh Amrollahi, Supreeth Prajwal Shashikumar, Pradeeban Kathiravelu, Ashish Sharma, and Shamim Nemati. 2020. AIDEx - An
Open-source Platform for Real-Time Forecasting Sepsis and A Case Study on Taking ML Algorithms to Production. In 2020 42nd Annual
International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). 5610–5614. [Link]
2020.9175947
[11] Maurício Aniche, Joseph Yoder, and Fabio Kon. 2019. Current challenges in practical object-oriented software design. In 2019 IEEE/ACM
41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, 113–116.
[12] Brendan Avent, Javier I. González, Tom Diethe, Andrei Paleyes, and Borja Balle. 2020. Automatic Discovery of Privacy–Utility Pareto
Fronts. Proceedings on Privacy Enhancing Technologies 2020 (2020), 5 – 23.
[13] May El Barachi, Faouzi Kamoun, Jannatul Ferdaos, Mouna Makni, and Imed Amri. 2020. An artificial intelligence based crowdsensing
solution for on-demand accident scene monitoring. Procedia Computer Science 170 (2020), 303–310. [Link]
03.044 The 11th International Conference on Ambient Systems, Networks and Technologies (ANT) / The 3rd International Conference
on Emerging Data and Industry 4.0 (EDI40) / Affiliated Workshops.
[14] Sebastian P. Bayerl, Tommaso Frassetto, Patrick Jauernig, Korbinian Riedhammer, Ahmad-Reza Sadeghi, Thomas Schneider, Emmanuel
Stapf, and Christian Weinert. 2020. Offline Model Guard: Secure and Private ML on Mobile Devices. In 2020 Design, Automation & Test
in Europe Conference & Exhibition (DATE). 460–465. [Link]
[15] Enrico Bellocchio, Gabriele Costante, Silvia Cascianelli, Paolo Valigi, and Thomas A. Ciarfuglia. 2016. SmartSEAL: A ROS based home
automation framework for heterogeneous devices interconnection in smart buildings. In 2016 IEEE International Smart Cities Conference
(ISC2). 1–6. [Link]
[16] Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. 2016. Site reliability engineering: How Google runs production
systems. " O’Reilly Media, Inc.".
[17] Bernd Bischl, Martin Binder, Michel Lang, Tobias Pielok, Jakob Richter, Stefan Coors, Janek Thomas, Theresa Ullmann, Marc Becker,
Anne-Laure Boulesteix, et al. 2021. Hyperparameter optimization: Foundations, algorithms, best practices and open challenges. arXiv
preprint arXiv:2107.05847 (2021).
[18] Manuel Blum, Paul Feldman, and Silvio Micali. 2019. Non-interactive zero-knowledge and its applications. In Providing Sound
Foundations for Cryptography: On the Work of Shafi Goldwasser and Silvio Micali. 329–349.
[19] Jeannette Bohg, Antonio Morales, Tamim Asfour, and Danica Kragic. 2013. Data-driven grasp synthesis—a survey. IEEE Transactions
on Robotics 30, 2 (2013), 289–309.
[20] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný,
Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. 2019. Towards
Federated Learning at Scale: System Design. In Proceedings of Machine Learning and Systems, A. Talwalkar, V. Smith, and M. Zaharia
(Eds.), Vol. 1. 374–388. [Link]
[21] Eli Brumbaugh, Mani Bhushan, Andrew Cheong, Michelle Gu-Qian Du, Jeff Feng, Nick Handel, Andrew Hoh, Jack Hone, Brad Hunter,
Atul Kale, Alfredo Luque, Bahador Nooraei, John Park, Krishna Puttaswamy, Kyle Schiller, Evgeny Shapiro, Conglei Shi, Aaron Siegel,


Nikhil Simha, Marie Sbrocca, Shi-Jing Yao, Patrick Yoon, Varant Zanoyan, Xiao-Han T. Zeng, and Qiang Zhu. 2019. Bighead: A
Framework-Agnostic, End-to-End Machine Learning Platform. In 2019 IEEE International Conference on Data Science and Advanced
Analytics (DSAA). 551–560. [Link]
[22] Q. Cabanes, B. Senouci, and A. Ramdane-Cherif. 2019. A Complete Multi-CPU/FPGA-based Design and Prototyping Methodology for
Autonomous Vehicles: Multiple Object Detection and Recognition Case Study. In 2019 International Conference on Artificial Intelligence
in Information and Communication (ICAIIC). 158–163. [Link]
[23] Christian Cabrera, Viviana Bastidas, Jennifer Schooling, and Neil D. Lawrence. 2024. The Systems Engineering Approach in Times of
Large Language Models. arXiv preprint arXiv:2411.09050 (2024). arXiv:2411.09050 [[Link]] [Link]
[24] Christian Cabrera and Siobhán Clarke. 2021. A Reinforcement Learning-Based Service Model for the Internet of Things. In Service-
Oriented Computing, Hakim Hacid, Odej Kao, Massimo Mecella, Naouel Moha, and Hye-young Paik (Eds.). Springer International
Publishing, Cham, 790–799.
[25] Christian Cabrera and Siobhán Clarke. 2022. A Self-Adaptive Service Discovery Model for Smart Cities. IEEE Transactions on Services
Computing 15, 1 (2022), 386–399. [Link]
[26] Christian Cabrera, Fan Li, Vivek Nallur, Andrei Palade, M. A. Razzaque, Gary White, and Siobhán Clarke. 2017. Implementing
heterogeneous, autonomous, and resilient services in IoT: An experience report. In 2017 IEEE 18th International Symposium on A World
of Wireless, Mobile and Multimedia Networks (WoWMoM). 1–6. [Link]
[27] Christian Cabrera, Andrei Paleyes, and Neil David Lawrence. 2024. Self-sustaining Software Systems (S4): Towards Improved
Interpretability and Adaptation. In Proceedings of the 1st International Workshop on New Trends in Software Architecture (Lisbon,
Portugal) (SATrends ’24). Association for Computing Machinery, New York, NY, USA, 5–9. [Link]
[28] Christian Cabrera, Sergej Svorobej, Andrei Palade, Aqeel Kazmi, and Siobhán Clarke. 2023. MAACO: A Dynamic Service Placement
Model for Smart Cities. IEEE Transactions on Services Computing 16, 1 (2023), 424–437. [Link]
[29] Qiong Cai, Hao Wang, Zhenmin Li, and Xiao Liu. 2019. A survey on multimodal data-driven smart healthcare systems: approaches and
applications. IEEE Access 7 (2019), 133583–133599.
[30] Cristina Georgiana Calancea, Camelia-Maria Miluţ, Lenuţa Alboaie, and Adrian Iftene. 2019. iAssistMe - Adaptable Assistant for Persons
with Eye Disabilities. Procedia Computer Science 159 (2019), 145–154. [Link] Knowledge-Based
and Intelligent Information & Engineering Systems: Proceedings of the 23rd International Conference KES2019.
[31] Lucian Carata, Sherif Akoush, Nikilesh Balakrishnan, Thomas Bytheway, Ripduman Sohan, Margo Seltzer, and Andy Hopper. 2014. A
primer on provenance. Commun. ACM 57, 5 (2014), 52–60.
[32] Chengliang Chai, Jiayi Wang, Yuyu Luo, Zeping Niu, and Guoliang Li. 2022. Data management for machine learning: A survey. IEEE
Transactions on Knowledge and Data Engineering 35, 5 (2022), 4646–4667.
[33] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas
Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke
Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet,
Felipe Petroski Such, David W. Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William H. Guss,
Alex Nichol, Igor Babuschkin, S. Arun Balaji, Shantanu Jain, Andrew Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa,
Alec Radford, Matthew M. Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam
McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. ArXiv abs/2107.03374
(2021). [Link]
[34] Jennifer Cobbe, Michael Veale, and Jatinder Singh. 2023. Understanding Accountability in Algorithmic Supply Chains. In Proceedings of
the 2023 ACM Conference on Fairness, Accountability, and Transparency (Chicago, IL, USA) (FAccT ’23). Association for Computing
Machinery, New York, NY, USA, 1186–1197. [Link]
[35] Bryan Conroy, Ikaro Silva, Golbarg Mehraei, Robert Damiano, Brian Gross, Emmanuele Salvati, Ting Feng, Jeffrey Schneider, Niels
Olson, Anne G Rizzo, et al. 2022. Real-time infection prediction with wearable physiological monitoring and AI to aid military workforce
readiness during COVID-19. Scientific reports 12, 1 (2022), 1–12.
[36] David E Culler. 1986. Dataflow architectures. Annual review of computer science 1, 1 (1986), 225–253.
[37] Jason Jinquan Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry Li Zhang, Yan Wan, Zhichao Li,
Jiao Wang, Shengsheng Huang, Zhongyuan Wu, Yang Wang, Yuhao Yang, Bowen She, Dongjie Shi, Qi Lu, Kai Huang, and Guoqiong Song.
2019. BigDL: A Distributed Deep Learning Framework for Big Data. In Proceedings of the ACM Symposium on Cloud Computing (Santa
Cruz, CA, USA) (SoCC ’19). Association for Computing Machinery, New York, NY, USA, 50–60. [Link]
[38] Valentin Dalibard, Michael Schaarschmidt, and Eiko Yoneki. 2017. BOAT: Building auto-tuners with structured Bayesian optimization.
In Proceedings of the 26th International Conference on World Wide Web. 479–488.
[39] Andreas Damianou and Neil D Lawrence. 2013. Deep gaussian processes. In Artificial intelligence and statistics. PMLR, 207–215.
[40] Goran Delac, Marin Silic, and Sinisa Srbljic. 2012. Reliability modeling for SOA systems. In 2012 Proceedings of the 35th International
Convention MIPRO. IEEE, 847–852.


[41] Filip Karlo Došilović, Mario Brčić, and Nikica Hlupić. 2018. Explainable artificial intelligence: A survey. In 2018 41st International
convention on information and communication technology, electronics and microelectronics (MIPRO). IEEE, 0210–0215.
[42] Tugba Akinci D’Antonoli, Arnaldo Stanzione, Christian Bluethgen, Federica Vernuccio, Lorenzo Ugga, Michail E Klontzas, Renato Cuo-
colo, Roberto Cannella, and Burak Koçak. 2024. Large language models in radiology: fundamentals, applications, ethical considerations,
risks, and future directions. Diagnostic and Interventional Radiology 30, 2 (2024), 80.
[43] Hugo Jair Escalante. 2020. Automated Machine Learning–a brief review at the end of the early years. arXiv preprint arXiv:2008.08516
(2020).
[44] Joao Diogo Falcao, Prabh Simran S Baweja, Yi Wang, Akkarit Sangpetch, Hae Young Noh, Orathai Sangpetch, and Pei Zhang.
2021. PIWIMS: Physics Informed Warehouse Inventory Monitory via Synthetic Data Generation. In Adjunct Proceedings of the
2021 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2021 ACM International
Symposium on Wearable Computers (Virtual, USA) (UbiComp ’21). Association for Computing Machinery, New York, NY, USA, 613–618.
[Link]
[45] Uriel Fiege, Amos Fiat, and Adi Shamir. 1987. Zero knowledge proofs of identity. In Proceedings of the nineteenth annual ACM symposium
on Theory of computing. 210–217.
[46] Kathi Fisler, Shriram Krishnamurthi, Benjamin S. Lerner, and Joe Gibbs Politz. 2021. A Data-Centric Introduction to Computing.
[47] Caroline Fontaine and Fabien Galand. 2007. A survey of homomorphic encryption for nonspecialists. EURASIP Journal on Information
Security 2007 (2007), 1–10.
[48] Stan Franklin, Tamas Madl, Sidney D’Mello, and Javier Snaider. 2014. LIDA: A Systems-level Architecture for Cognition, Emotion, and
Learning. IEEE Transactions on Autonomous Mental Development 6, 1 (2014), 19–41. [Link]
[49] Colm V. Gallagher, Kevin Leahy, Peter O’Donovan, Ken Bruton, and Dominic T.J. O’Sullivan. 2019. IntelliMaV: A cloud computing
measurement and verification 2.0 application for automated, near real-time energy savings quantification and performance deviation
detection. Energy and Buildings 185 (2019), 26–38. [Link]
[50] Guanyu Gao, Yonggang Wen, and Cedric Westphal. 2016. Resource provisioning and profit maximization for transcoding in Information
Centric Networking. In 2016 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). 97–102. [Link]
10.1109/INFCOMW.2016.7562053
[51] Martin Garriga. 2018. Towards a taxonomy of microservices architectures. In Software Engineering and Formal Methods: SEFM 2017
Collocated Workshops: DataMod, FAACS, MSE, CoSim-CPS, and FOCLASA, Trento, Italy, September 4-5, 2017, Revised Selected Papers 15.
Springer, 203–218.
[52] Simos Gerasimou, Thomas Vogel, and Ada Diaconescu. 2019. Software Engineering for Intelligent and Autonomous Systems: Report
from the GI Dagstuhl Seminar 18343. [Link]
[53] Görkem Giray. 2021. A software engineering perspective on engineering machine learning systems: State of the art and challenges.
Journal of Systems and Software 180 (2021), 111031.
[54] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2017. Deep learning (adaptive computation and machine learning series).
Cambridge Massachusetts (2017), 321–359.
[55] Ian Goodfellow, Oriol Vinyals, and Andrew Saxe. 2015. Qualitatively Characterizing Neural Network Optimization Problems. In
International Conference on Learning Representations. [Link]
[56] Robert Gorkin, Kye R. Adams, Matthew J. Berryman, Sam Aubin, Wanqing Li, Andrew Davis, and Johan Barthélemy. 2020. Sharkeye:
Real-Time Autonomous Personal Shark Alerting via Aerial Surveillance. Drones (2020).
[57] Hassan Habibi Gharakheili, Minzhao Lyu, Yu Wang, Himal Kumar, and Vijay Sivaraman. 2019. iTeleScope: Softwarized Network
Middle-Box for Real-Time Video Telemetry and Classification. IEEE Transactions on Network and Service Management 16, 3 (2019),
1071–1085. [Link]
[58] Nick Hawes, Christopher Burbridge, Ferdian Jovan, Lars Kunze, Bruno Lacerda, Lenka Mudrova, Jay Young, Jeremy Wyatt, Denise
Hebesberger, Tobias Kortner, Rares Ambrus, Nils Bore, John Folkesson, Patric Jensfelt, Lucas Beyer, Alexander Hermans, Bastian Leibe,
Aitor Aldoma, Thomas Faulhammer, Michael Zillich, Markus Vincze, Eris Chinellato, Muhannad Al-Omari, Paul Duckworth, Yiannis
Gatsoulis, David C. Hogg, Anthony G. Cohn, Christian Dondrup, Jaime Pulido Fentanes, Tomas Krajnik, Joao M. Santos, Tom Duckett,
and Marc Hanheide. 2017. The STRANDS Project: Long-Term Autonomy in Everyday Environments. IEEE Robotics & Automation
Magazine 24, 3 (2017), 146–156. [Link]
[59] Petra Heck. 2024. What About the Data? A Mapping Study on Data Engineering for AI Systems. In Proceedings of the IEEE/ACM 3rd
International Conference on AI Engineering-Software Engineering for AI. 43–52.
[60] Henri Hegemier and Jaimie Kelley. 2021. Image Classification with Knowledge-Based Systems on the Edge for Real-Time Danger
Avoidance in Robots. In 2021 IEEE World AI IoT Congress (AIIoT). 0360–0365. [Link]
[61] Ricardo Dintén Herrero and Marta Zorrilla. 2022. An I4.0 data intensive platform suitable for the deployment of machine learning models:
a predictive maintenance service case study. Procedia Computer Science 200 (2022), 1014–1023. [Link]
3rd International Conference on Industry 4.0 and Smart Manufacturing.


[62] Junchen Jiang, Shijie Sun, Vyas Sekar, and Hui Zhang. 2017. Pytheas: Enabling Data-Driven Quality of Experience Optimization Using
Group-Based Exploration-Exploitation. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation
(Boston, MA, USA) (NSDI’17). USENIX Association, USA, 393–406.
[63] Anil Johny and KN Madhusoodanan. 2021. Edge Computing Using Embedded Webserver with Mobile Device for Diagnosis and
Prediction of Metastasis in Histopathological Images. International Journal of Computational Intelligence Systems 14, 1 (2021). https:
//[Link]/10.1007/s44196-021-00040-x
[64] Praveen Joshi, Haithem Afli, Mohammed Hasanuzzaman, Chandra Thapa, and Ted Scully. 2022. Enabling Deep Learning for All-in
EDGE paradigm. [Link]
[65] Rajive Joshi. 2007. Data-oriented architecture: A loosely-coupled real-time soa. whitepaper, Aug (2007).
[66] J Stephen Judd. 1990. Neural network design and the complexity of learning. MIT press.
[67] Işıl Karabey Aksakalli, Turgay Çelik, Ahmet Burak Can, and Bedir Tekinerdoğan. 2021. Deployment and communication patterns in
microservice architectures: A systematic literature review. Journal of Systems and Software 180 (2021), 111014. [Link]
[Link].2021.111014
[68] Ioanna Karageorgou, Panagiotis Liakos, and Alex Delis. 2020. A Sentiment Analysis Service Platform for Streamed Multilingual Tweets.
In 2020 IEEE International Conference on Big Data (Big Data). 3262–3271. [Link]
[69] Narsimlu Kemsaram, Anwesha Das, and Gijs Dubbelman. 2020. Architecture Design and Development of an On-board Stereo Vision
System for Cooperative Automated Vehicles. 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC) (2020),
1–8.
[70] Tom Killalea. 2016. The Hidden Dividends of Microservices. Commun. ACM 59, 8 (jul 2016), 42–45. [Link]
[71] Barbara Kitchenham, O Pearl Brereton, David Budgen, Mark Turner, John Bailey, and Stephen Linkman. 2009. Systematic literature
reviews in software engineering–a systematic literature review. Information and software technology 51, 1 (2009), 7–15.
[72] Barbara Kitchenham and Pearl Brereton. 2013. A systematic review of systematic review process research in software engineering.
Information and Software Technology 55, 12 (2013), 2049–2075. [Link]
[73] Barbara Ann Kitchenham and Stuart Charters. 2007. Guidelines for performing Systematic Literature Reviews in Software Engineering.
Technical Report EBSE 2007-001. Keele University and Durham University Joint Report. [Link]
misc/[Link]
[74] Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. 2019. Local-First Software: You Own Your Data,
in Spite of the Cloud. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections
on Programming and Software (Athens, Greece) (Onward! 2019). Association for Computing Machinery, New York, NY, USA, 154–178.
[Link]
[75] Mohit Kumar, Kalka Dubey, and Rakesh Pandey. 2021. Evolution of Emerging Computing paradigm Cloud to Fog: Applications,
Limitations and Research Challenges. In 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence).
257–261. [Link]
[76] Matthew Lebofsky, Steve Croft, Andrew P. V. Siemion, Danny C. Price, J. Emilio Enriquez, Howard Isaacson, David H. E. MacMahon,
David Anderson, Bryan Brzycki, Jeff Cobb, Daniel Czech, David DeBoer, Julia DeMarines, Jamie Drew, Griffin Foster, Vishal Gajjar,
Nectaria Gizani, Greg Hellbourg, Eric J. Korpela, Brian Lacki, Sofia Sheikh, Dan Werthimer, Pete Worden, Alex Yu, and Yunfan Gerry
Zhang. 2019. The Breakthrough Listen Search for Intelligent Life: Public Data, Formats, Reduction, and Archiving. Publications of the
Astronomical Society of the Pacific 131, 1006 (nov 2019), 124505. [Link]
[77] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. nature 521, 7553 (2015), 436–444.
[78] Grace A Lewis and Dennis B Smith. 2008. Service-oriented architecture and its implications for software maintenance and evolution.
In 2008 Frontiers of Software Maintenance. IEEE, 1–10.
[79] Qiuchen Lu, Ajith Kumar Parlikad, Philip Woodall, Gishan Don Ranasinghe, Xiang Xie, Zhenglin Liang, Eirini Konstantinou, James
Heaton, and Jennifer Schooling. 2020. Developing a Digital Twin at Building and City Levels: Case Study of West Cambridge
Campus. Journal of Management in Engineering 36, 3 (2020), 05020004. [Link]
arXiv:[Link]
[80] Lucy Ellen Lwakatare, Aiswarya Raj, Ivica Crnkovic, J. Bosch, and Helena Holmström Olsson. 2020. Large-scale machine learning
systems in real-world industrial settings: A review of challenges and solutions. Inf. Softw. Technol. 127 (2020), 106368.
[81] Patricia López Martínez, Ricardo Dintén, José María Drake, and Marta Zorrilla. 2021. A big data-centric architecture metamodel for
Industry 4.0. Future Generation Computer Systems 125 (2021), 263–284. [Link]
[82] Fang Miao, Wenjie Fan, Wenhui Yang, and Yan Xie. 2019. The study of data-oriented and ownership-based security architecture in
open Internet environment. In Proceedings of the 3rd International Conference on Cryptography, Security and Privacy. 121–129.
[83] Timur Mironov, Lilia Motaylenko, Dmitry Andreev, Igor Antonov, and Mikhail Aristov. 2021. Comparison of object-oriented program-
ming and data-oriented design for implementing trading strategies backtester. In Proceedings of the International Scientific and Practical
Conference.


[84] Kesari Mishra and Kishor S Trivedi. 2011. Uncertainty propagation through software dependability models. In 2011 IEEE 22nd
International Symposium on Software Reliability Engineering. IEEE, 80–89.
[85] Jessica Montgomery and Neil D. Lawrence. 2021. Data Governance in the 21st century: Citizen Dialogue and the Development of Data
Trusts. /publications/[Link]
[86] Henry Muccini and Karthik Vaidhyanathan. 2021. Software Architecture for ML-based Systems: What Exists and What Lies Ahead. In
2021 IEEE/ACM 1st Workshop on AI Engineering - Software Engineering for AI (WAIN). 121–128. [Link]
2021.00026
[87] Raja Mubashir Munaf, Jawwad Ahmed, Faraz Khakwani, and Tauseef Rana. 2019. Microservices Architecture: Challenges and Proposed
Conceptual Design. In 2019 International Conference on Communication Technologies (ComTech). 82–87. [Link]
COMTECH.2019.8737831
[88] Martin M. Müller and Marcel Salathé. 2019. Crowdbreaks: Tracking Health Trends Using Public Social Media Data and Crowdsourcing.
Frontiers in Public Health 7 (2019). [Link]
[89] Alfredo Nazabal, Christopher KI Williams, Giovanni Colavizza, Camila Rangel Smith, and Angus Williams. 2020. Data engineering for
data analytics: a classification of the issues, and case studies. arXiv preprint arXiv:2004.12929 (2020).
[90] Anh Tuan Nguyen, Markus W Drealan, Diu Khue Luu, Ming Jiang, Jian Xu, Jonathan Cheng, Qi Zhao, Edward W Keefer, and Zhi Yang.
2021. A portable, self-contained neuroprosthetic hand with deep learning-based finger control. Journal of Neural Engineering 18, 5 (oct
2021), 056051. [Link]
[91] Stoyan Nikolov. 2018. OOP is dead, long live Data-Oriented Design. CppCon.(Retrieved 02.02. 2019). URL: [Link] youtube.
com/watch (2018).
[92] Long Niu, Sachio Saiki, and Masahide Nakamura. 2017. Recognizing ADLs of one person household based on non-intrusive envi-
ronmental sensing. In 2017 18th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and
Parallel/Distributed Computing (SNPD). 477–482. [Link]
[93] A Jefferson Offutt, Mary Jean Harrold, and Priyadarshan Kolte. 1993. A software metric system for module coupling. Journal of Systems
and Software 20, 3 (1993), 295–308.
[94] Robert Olsson. 2014. Applying REST principles on local client-side APIs. Master’s thesis. KTH, School of Computer Science and
Communication (CSC).
[95] OpenAI. 2023. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023).
[96] O’Reilly. 2020. Microservices Adoption in 2020: A survey. Available at [Link]
[97] Genevieve B Orr and Klaus-Robert Müller. 2003. Neural networks: tricks of the trade. Springer.
[98] Claus Pahl. 2023. Research challenges for machine learning-constructed software. Service Oriented Computing and Applications 17, 1
(2023), 1–4.
[99] Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad Zolna, Alexander Novikov, Ziyu Wang, and Nando de
Freitas. 2020. Hyperparameter selection for offline reinforcement learning. Offline Reinforcement Learning Workshop, NeurIPS (2020).
[Link]
[100] Andrei Paleyes, Christian Cabrera, and Neil D. Lawrence. 2022. An Empirical Evaluation of Flow Based Programming in the Machine
Learning Deployment Context. In 2022 IEEE/ACM 1st International Conference on AI Engineering – Software Engineering for AI (CAIN).
54–64.
[101] Andrei Paleyes, Siyuan Guo, Bernhard Scholkopf, and Neil D. Lawrence. 2023. Dataflow graphs as complete causal graphs. In 2023
IEEE/ACM 2nd International Conference on AI Engineering – Software Engineering for AI (CAIN). 7–12. [Link]
2023.00010
[102] Andrei Paleyes, Siyuan Guo, Bernhard Schölkopf, and Neil D Lawrence. 2023. Dataflow graphs as complete causal graphs. In 2023
IEEE/ACM 2nd International Conference on AI Engineering – Software Engineering for AI (CAIN).
[103] Andrei Paleyes and Neil David Lawrence. 2023. Causal Fault Localisation in Dataflow Systems. In Proceedings of the 3rd Workshop on
Machine Learning and Systems (Rome, Italy) (EuroMLSys ’23). Association for Computing Machinery, New York, NY, USA, 140–147.
[Link]
[104] Andrei Paleyes, Henry B. Moss, Victor Picheny, Piotr Zulawski, and Felix Newman. 2022. A penalisation method for batch multi-
objective Bayesian optimisation with application in heat exchanger design. Workshop on Adaptive Experimental Design and Active
Learning in the Real World, ICML (2022). [Link]
[105] Andrei Paleyes, Mark Pullin, Maren Mahsereci, Neil Lawrence, and Javier González. 2019. Emulation of physical processes with Emukit.
In Second Workshop on Machine Learning and the Physical Sciences, NeurIPS.
[106] Andrei Paleyes, Raoul-Gabriel Urma, and Neil D. Lawrence. 2022. Challenges in Deploying Machine Learning: A Survey of Case
Studies. ACM Comput. Surv. (apr 2022). [Link]
[107] Milan Patel, Brian Naughton, Caroline Chan, Nurit Sprecher, Sadayuki Abeta, Adrian Neal, et al. 2014. Mobile-edge computing
introductory technical white paper. White paper, mobile-edge computing (MEC) industry initiative (2014), 1089–7801.


[108] Jan Pennekamp, Martin Henze, Simo Schmidt, Philipp Niemietz, Marcel Fey, Daniel Trauth, Thomas Bergs, Christian Brecher, and
Klaus Wehrle. 2019. Dataflow challenges in an internet of production: a security & privacy perspective. In Proceedings of the ACM
Workshop on Cyber-Physical Systems Security & Privacy. 27–38.
[109] Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2018. Data lifecycle challenges in production machine
learning: a survey. ACM SIGMOD Record 47, 2 (2018), 17–28.
[110] S Joe Qin. 2012. Survey on data-driven industrial process monitoring and diagnosis. Annual reviews in control 36, 2 (2012), 220–234.
[111] Xiwei Qiu, Yuanshun Dai, Peng Sun, and Xin Jin. 2020. PHM Technology for Memory Anomalies in Cloud Computing for IaaS. In 2020
IEEE 20th International Conference on Software Quality, Reliability and Security (QRS). 41–51. [Link]
00018
[112] Luis Quintero, Panagiotis Papapetrou, John E. Muñoz, and Uno Fors. 2019. Implementation of Mobile-Based Real-Time Heart Rate
Variability Detection for Personalized Healthcare. In 2019 International Conference on Data Mining Workshops (ICDMW). 838–846.
[Link]
[113] Mohaimenul Azam Khan Raiaan, Md. Saddam Hossain Mukta, Kaniz Fatema, Nur Mohammad Fahad, Sadman Sakib, Most Marufatul Jan-
nat Mim, Jubaer Ahmad, Mohammed Eunus Ali, and Sami Azam. 2024. A Review on Large Language Models: Architectures, Applications,
Taxonomies, Open Issues and Challenges. IEEE Access 12 (2024), 26839–26874. [Link]
[114] Diana Robinson, Christian Cabrera, Andrew D Gordon, Neil D Lawrence, and Lars Mennen. 2024. Requirements are All You Need: The
Final Frontier for End-User Software Engineering. arXiv preprint arXiv:2405.13708 (2024).
[115] Marouane Salhaoui, Antonio Guerrero Gonzalez, Mounir Arioua, Juan Carlos Molina Molina, Francisco J. Ortiz, and Ahmed El Oualkadi.
2020. Edge-Cloud Architectures Using UAVs Dedicated To Industrial IoT Monitoring And Control Applications. In 2020 International
Symposium on Advanced Electrical and Communication Technologies (ISAECT). 1–6. [Link]
[116] Luis Sanchez, Luis Muñoz, Jose Antonio Galache, Pablo Sotres, Juan R. Santana, Veronica Gutierrez, Rajiv Ramdhany, Alex Gluhak,
Srdjan Krco, Evangelos Theodoridis, and Dennis Pfisterer. 2014. SmartSantander: IoT experimentation over a smart city testbed.
Computer Networks 61 (2014), 217–238. [Link]
[117] Juan Ramón Santana, Luis Sánchez, Pablo Sotres, Jorge Lanza, Tomás Llorente, and Luis Muñoz. 2020. A Privacy-Aware Crowd
Management System for Smart Cities and Smart Buildings. IEEE Access 8 (2020), 135394–135405. [Link]
3010609
[118] David Sarabia-Jácome, Regel Usach, Carlos E. Palau, and Manuel Esteve. 2020. Highly-efficient fog-based deep learning AAL fall
detection system. Internet of Things 11 (2020), 100185. [Link]
[119] Tim Schopf, Daniel Braun, and Florian Matthes. 2021. Lbl2Vec: An Embedding-based Approach for Unsupervised Document Retrieval
on Predefined Topics. In Proceedings of the 17th International Conference on Web Information Systems and Technologies - WEBIST,.
124–132. [Link]
[120] Carson Schubert, Kevin Black, Daniel Fonseka, Abhimanyu Dhir, Jacob Deutsch, Nihal Dhamani, Gavin Martin, and Maruthi Akella.
2021. A Pipeline for Vision-Based On-Orbit Proximity Operations Using Deep Learning and Synthetic Imagery. In 2021 IEEE Aerospace
Conference (50100). 1–15. [Link]
[121] Robert Schuler, Carl Kesselman, and Karl Czajkowski. 2015. Data Centric Discovery with a Data-Oriented Architecture. In Proceedings of
the 1st Workshop on The Science of Cyberinfrastructure: Research, Experience, Applications and Models (Portland, Oregon, USA) (SCREAM
’15). Association for Computing Machinery, New York, NY, USA, 37–44. [Link]
[122] Johann Schumann, Timmy Mbaya, and Ole J Mengshoel. 2012. Software and system health management for autonomous robotics
missions. (2012).
[123] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. Commun. ACM 63, 12 (11 2020), 54–63. https:
//[Link]/10.1145/3381831
[124] Malte Schwarzkopf, Eddie Kohler, M. Frans Kaashoek, and Robert Tappan Morris. 2019. Position: GDPR Compliance by Construction.
In Poly/DMAH@VLDB.
[125] Guohou Shan, Boxin Zhao, James R. Clavin, Haibin Zhang, and Sisi Duan. 2022. Poligraph: Intrusion-Tolerant and Distributed Fake News
Detection System. IEEE Transactions on Information Forensics and Security 17 (2022), 28–41. [Link]
[126] Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. 2016. Edge computing: Vision and challenges. IEEE internet of things
journal 3, 5 (2016), 637–646.
[127] Wanxin Shi, Qing Li, Chao Wang, Gengbiao Shen, Weichao Li, Yu Wu, and Yong Jiang. 2019. LEAP: Learning-Based Smart Edge with
Caching and Prefetching for Adaptive Video Streaming. In 2019 IEEE/ACM 27th International Symposium on Quality of Service (IWQoS).
1–10. [Link]
[128] Chihhsiong Shih, Lai Youchen, Cheng-hsu Chen, and WillIam Cheng-Chung Chu. 2020. An Early Warning System for Hemodialysis
Complications Utilizing Transfer Learning from HD IoT Dataset. In 2020 IEEE 44th Annual Computers, Software, and Applications
Conference (COMPSAC). 759–767. [Link]
[129] Jatinder Singh, Jennifer Cobbe, and Chris Norval. 2018. Decision provenance: Harnessing data flow for accountable systems. IEEE
Access 7 (2018), 6562–6574.


[130] Neha Singh and Kirti Tyagi. 2015. A Literature Review of the Reliability of Composite Web Service in Service-Oriented Architecture.
SIGSOFT Softw. Eng. Notes 40, 1 (feb 2015), 1–8. [Link]
[131] Ben Stopford. 2016. The Data Dichotomy: Rethinking the Way We Treat Data and Services. Available at [Link]
dichotomy-rethinking-the-way-we-treat-data-and-services/.
[132] Shabnam Sultana, Philippe Dooze, and Vijay Venkatesh Kumar. 2021. Realization of an Intrusion Detection use-case in ONAP
with Acumos. In 2021 International Conference on Computer Communications and Networks (ICCCN). 1–6. [Link]
ICCCN52240.2021.9522281
[133] Hadi Tabatabaee Malazi, Saqib Rasool Chaudhry, Aqeel Kazmi, Andrei Palade, Christian Cabrera, Gary White, and Siobhán Clarke. 2022.
Dynamic Service Placement in Multi-Access Edge Computing: A Systematic Literature Review. IEEE Access 10 (2022), 32639–32688.
[Link]
[134] Davide Taibi, Valentina Lenarduzzi, and Claus Pahl. 2018. Architectural patterns for microservices: A systematic mapping study.
In CLOSER 2018 - Proceedings of the 8th International Conference on Cloud Computing and Services Science. SCITEPRESS, 221–232.
[Link] International Conference on Cloud Computing and Services Science ; Conference date:
19-03-2018 Through 21-03-2018.
[135] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman
Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971
(2023).
[136] Lillian Tsai, Malte Schwarzkopf, and Eddie Kohler. 2021. Privacy heroes need data disguises. In Proceedings of the Workshop on Hot
Topics in Operating Systems. 112–118.
[137] Lorenzo Vaccaro, Giuseppe Sansonetti, and Alessandro Micarelli. 2021. An Empirical Review of Automated Machine Learning.
Computers 10, 1 (2021), 11.
[138] Richard Van Noorden and Jeffrey M Perkel. 2023. AI and science: what 1,600 researchers think. Nature 621, 7980 (2023), 672–675.
[139] Christian Vorhemus and Erich Schikuta. 2017. A data-oriented architecture for loosely coupled real-time information systems. In
Proceedings of the 19th International Conference on Information Integration and Web-based Applications & Services. 472–481.
[140] Jonathan Waring, Charlotta Lindvall, and Renato Umeton. 2020. Automated machine learning: Review of the state-of-the-art and
opportunities for healthcare. Artificial Intelligence in Medicine 104 (2020), 101822.
[141] Hironori Washizaki, Hiromu Uchida, Foutse Khomh, and Yann-Gaël Guéhéneuc. 2020. Machine learning architecture and design
patterns. IEEE Software 8 (2020), 2020.
[142] Dawei Wei, Huansheng Ning, Feifei Shi, Yueliang Wan, Jiabo Xu, Shunkun Yang, and Li Zhu. 2021. Dataflow management in the
Internet of Things: Sensing, control, and security. Tsinghua Science and Technology 26, 6 (2021), 918–930.
[143] Danny Weyns. 2019. Software engineering of self-adaptive systems. Handbook of software engineering (2019), 399–443.
[144] Lauren J. Wong, IV William [Link], Bryse Flowers, R. Michael Buehrer, Alan J. Michaels, and William C. Headley. 2020. The RFML
Ecosystem: A Look at the Unique Challenges of Applying Deep Learning to Radio Frequency Applications. arXiv: Signal Processing
(2020).
[145] Hailu Xu, Boyuan Guan, Pinchao Liu, William Escudero, and Liting Hu. 2018. Harnessing the Nature of Spam in Scalable Online Social
Spam Detection. In 2018 IEEE International Conference on Big Data (Big Data). 3733–3736. [Link]
[146] Deze Zeng, Lin Gu, Song Guo, Zixue Cheng, and Shui Yu. 2016. Joint Optimization of Task Scheduling and Image Placement in Fog
Computing Supported Software-Defined Embedded System. IEEE Trans. Comput. 65, 12 (2016), 3702–3712. [Link]
2016.2536019
[147] Leihan Zhang, Jichang Zhao, and Ke Xu. 2016. Emotion-based social computing platform for streaming big-data: Architecture and
application. In 2016 13th International Conference on Service Systems and Service Management (ICSSSM). 1–6. [Link]
ICSSSM.2016.7538620
[148] Tao Zhang, Biyun Ding, Xin Zhao, Ganjun Liu, and Zhibo Pang. 2021. LearningADD: Machine learning based acoustic defect detection
in factory automation. Journal of Manufacturing Systems 60 (2021), 48–58. [Link]
[149] Jonathan Zittrain. 2022. Intellectual Debt: With Great Power Comes Great Ignorance. Cambridge University Press, 176–184.
