A Call for Embodied AI

Giuseppe Paolo 1, Jonas Gonzalez-Billandon 2, Balázs Kégl 1

1 Noah’s Ark Lab, Huawei Technologies France, Paris, France. 2 London Research Center, London, UK. Correspondence to: Giuseppe Paolo <[email protected]>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

arXiv:2402.03824v4 [cs.AI] 13 Sep 2024

Abstract

We propose Embodied AI (E-AI) as the next fundamental step in the pursuit of Artificial General Intelligence (AGI), juxtaposing it against current AI advancements, particularly Large Language Models (LLMs). We traverse the evolution of the embodiment concept across diverse fields (philosophy, psychology, neuroscience, and robotics) to highlight how E-AI distinguishes itself from the classical paradigm of static learning. By broadening the scope of E-AI, we introduce a theoretical framework based on cognitive architectures, emphasizing perception, action, memory, and learning as essential components of an embodied agent. This framework is aligned with Friston’s active inference principle, offering a comprehensive approach to E-AI development. Despite the progress made in the field of AI, substantial challenges, such as the formulation of a novel AI learning theory and the innovation of advanced hardware, persist. Our discussion lays down a foundational guideline for future E-AI research. Highlighting the importance of creating E-AI agents capable of seamless communication, collaboration, and coexistence with humans and other intelligent entities within real-world environments, we aim to steer the AI community towards addressing the multifaceted challenges and seizing the opportunities that lie ahead in the quest for AGI.

1. Introduction

Over recent years, the field of artificial intelligence (AI) has experienced a significant surge, leading to substantial breakthroughs in areas ranging from computer vision (CV) and natural language processing (NLP) to neuroscience. This journey through AI’s development has been marked by a series of significant triumphs interspersed with setbacks, including the well-documented AI winter of the mid-1980s. The ambitious goal that has propelled AI research forward from the beginning was to create intelligence that either parallels or exceeds human abilities. This quest for superhuman intelligence, commonly termed Artificial General Intelligence (AGI), has been seen differently by experts across different disciplines, yet it broadly refers to the ability of a system to understand, learn, and apply knowledge in a wide array of tasks and contexts, mirroring the cognitive flexibility of humans and animals.

The remarkable progress in AI over the past decade can largely be attributed to three pivotal developments: i) advancements in deep learning algorithms, ii) the advent of powerful new hardware, and iii) the availability of extensive datasets for training. A prime illustration of this advancement is the creation of Large Language Models (LLMs) like OpenAI’s GPT-4 (Achiam et al., 2023) and Google’s Gemini (Team et al., 2023). The surprising abilities of these LLMs have sparked discussions within the AI community, with some pondering whether these models have already achieved nascent forms of AGI. Foundation models (large networks with billions of parameters trained on massive datasets) have found success in varied fields, ranging from predicting 3D protein structures (Cramer, 2021) and robotic control (Brohan et al., 2023), to generating images and audio (Ramesh et al., 2022; Radford et al., 2022). This breadth of achievement supports the hypothesis that continued scaling and refinement of foundation models could be a viable path toward realizing AGI.

In our paper, we argue that despite the significant advances made by current AI technologies, they represent only the initial steps towards truly intelligent agents. Despite their impressive capabilities, these large networks are static and unable to evolve with time and experience. They leverage large datasets and cutting-edge hardware for scaling, but they lack the ability to properly care about the truth (Vervaeke & Coyne, 2024), which in turn makes it impossible to dynamically adjust their knowledge and actively search for valuable new information. The two primary manifestations of this fundamental shortfall are i) the difficulty in effectively aligning LLMs (Ouyang et al., 2022), and ii) their propensity to generate plausible but inaccurate information, a phenomenon known as confabulation
(Huang et al., 2023; Wei et al., 2023). Current strategies to mitigate these issues, such as post-processing, fine-tuning, prompt engineering, and incorporating human feedback, are undeniably valuable. However, we argue that these methods address only the superficial aspects of the problem and fall short in dealing with the core issue at play: the inherent lack of a deeper, grounded sense of care in LLMs.

Pursuing the development of AGI, we draw upon the insights of Vervaeke & Coyne (2024) to advocate for designing AI agents that are bound to, observe, interact with, and learn from the real world (including humans) in a continuous and dynamic manner. These Embodied AI (E-AI) agents ought to prioritize their continued existence and our bindings to them, thereby learning the value of truth. They should also be capable of adapting to environmental changes and evolving without human intervention.

While Large Language Models play a significant role in the development of AI systems, they fall short of capturing the essence of what constitutes an intelligent agent. Notably, intelligent beings, whether humans or animals, are characterized by three fundamental components: the mind, perception, and action capabilities (Kirchhoff et al., 2018). LLMs, or more broadly, foundation models, may be likened to an aspect of the mind’s reasoning function (Xi et al., 2023). Yet, the perceptive and action-oriented dimensions of intelligence, along with the pivotal ability to dynamically revise beliefs and knowledge based on experiences, remain unaddressed. Autoregressive LLMs are not designed to understand the causal relationships between events, but rather to identify proximate context and correlations within sequences (Bariah & Debbah, 2023). In contrast, a fully embodied agent should have the ability to grasp the causality underlying events and actions within its environment, be it digital or physical. By comprehending these causal relationships, such an agent can make informed decisions that consider both the anticipated outcomes and the reasons behind those outcomes.

In short, this paper argues that the necessary next step in our pursuit of truly intelligent and general AIs is the development and study of Embodied AI. We propose that current LLM-based foundation models could lay the groundwork for designing these agents, but are just one component of a truly embodied agent. This approach is akin to how neonates come into the world equipped with inherent priors to successfully adapt to the world (Reynolds & Roth, 2018).

In the remainder of this paper, we first define the concept of embodiment and what we mean by E-AI in Sec. 2, analyzing the literature and various scientific and philosophical currents. In Sec. 3, we discuss why we believe this is a necessary step towards Artificial General Intelligence (AGI). In Sec. 4 we analyze the main components of a truly embodied agent, and we discuss the major challenges to achieving this ambitious goal in Sec. 5. Our motivation for developing E-AI, and its fundamental role in our path towards AGI, is discussed throughout. Finally, Sec. 6 provides a short recap of our proposition.

2. What is embodied AI?

E-AI is a sub-field of AI, focusing on agents that interact with their physical environment, emphasizing sensorimotor coupling and situated intelligence. Rather than merely observing passively, E-AI agents act on their environment and learn from the reaction. E-AI is deeply rooted in embodied cognition (Shapiro, 2011; McNearney, 2011), a perspective in philosophy and cognitive science that posits a profound coupling between the mind and the body. This idea, which challenges Cartesian dualism, the historically dominant view that distinctly separates the mind from the body (Descartes, 2012), emerged in the early 20th century. Pioneers like Lakoff & Johnson (1979; 1999) have significantly contributed to this paradigm by proposing that reason is not based on abstract laws but is grounded in bodily experiences. Embodied cognition forms a critical part of the 4E cognitive science framework (Varela et al., 1991; Clark, 1997; Clark & Chalmers, 1998), encompassing embodied, enactive, embedded, and extended aspects of cognition. Within E-AI, the focus is predominantly on implementing the ‘embodied’ and ‘enactive’ aspects, while the ‘embedded’ and ‘extended’ components are more pertinent to situating AI in a social context and as an augmentation of human (individual or collective) cognition.

In AI, initial explorations into embodiment emerged in the 1980s, driven by a growing recognition of the inherent limitations in disembodied agents. These limitations were primarily attributed to the absence of rich, high-bandwidth interactions with the environment (Pfeifer & Iida, 2004; Pfeifer & Bongard, 2006). An early advocate for this paradigm shift was Brooks (1991), who built walking robots simulating insect-like locomotion. Simultaneously, the field of computer vision was undergoing its own transformation. Researchers and practitioners were increasingly focusing on enabling agents to interact with their surroundings. This emphasis on interaction led to a concentration on the perceptual elements of embodiment, particularly from a first-person point of view (POV) (Shapiro, 2021). This approach aligns with the concept of visual exploration and navigation (Ramakrishnan et al., 2021), where an agent acquires information about a 3D environment through movement and sensory perception, thereby continuously refining its model of the environment (Anderson et al., 2018; Chen et al., 2019). Such exploration techniques empower an agent to discover objects and understand their permanence. As a result of these developments, many contemporary benchmarks in E-AI have emerged predominantly
from the domains of vision and robotics (Duan et al., 2022), reflecting the integral role these disciplines have played in advancing the field.

That said, the broader definition of E-AI does not require vision. Sensorimotor coupling may be implemented using any physical sense (Pfeifer & Bongard, 2006). In the living world, many organisms survive and thrive without vision, using, for example, chemical or electric sensing (Bargmann, 2006). Levin (2022)’s Technological Approach to Mind Everywhere (TAME) framework further explores this idea, suggesting that cognition emerges from the collective intelligence of cell groups, themselves deeply embodied within their environment (the body they comprise). This framework challenges traditional Cartesian dualism, embedding cognition within the physical and biological makeup of an organism. In the TAME perspective, cognition is not just an attribute of higher-order organisms; it extends throughout the ontological hierarchy of living beings, from individual cells, through tissues and organs, to complex organisms. Each agent demonstrates cognitive capabilities that are inherently connected to its physical structure and the environmental interactions at its proper level. This broadened view of cognition and embodiment goes beyond the conventional focus on vision in robotics and computer vision. It posits that any entity capable of perceiving, interacting with, and learning from its environment, thereby adapting to it and influencing it, qualifies as embodied. A technological instantiation of this concept is an intelligent router in a telecommunication network. This device ‘lives’ in a realm dominated by electromagnetic sensing. It continuously learns from and adapts to the network traffic, effectively mapping and managing the flow of information. This example underscores the potential of applying the principles of E-AI beyond the traditional domains, embracing a more inclusive and diverse understanding of intelligence and embodiment.

This broadening of the notion of E-AI raises the question: how close are current commercial AI tools to embodiment? Here we examine two such tools: Large Language Models (Brown et al., 2020; Devlin et al., 2019) and Social Media Content AI Recommendation Systems (SMAI) (Bakshy et al., 2015; Covington et al., 2016; Eirinaki et al., 2018).

LLMs operate within a linguistic-symbolic domain, representing textual information and generating new text by completing prompts. Their foundational training is essentially static, relying on datasets meticulously compiled and curated by teams of AI engineers. Their goal is supervised: to generate likely tokens following a context. Their secondary training (fine-tuning) may involve both interactions with their symbolic environment (human users) and goals to reach (satisfy their human users), but these interactions present some limits due to both technical (e.g., catastrophic forgetting (Kirkpatrick et al., 2017; Parisi et al., 2019)) and business (e.g., managing individuated LLMs (Strubell et al., 2019; Kaplan et al., 2020)) reasons. Looking ahead, we anticipate advancements that might address these limitations, potentially leading to the emergence of “personal assistant” LLMs. These would represent a form of embodied agents within a symbolic realm. However, at present, LLMs largely resemble static Internet AI (I-AI) (Duan et al., 2022), differing significantly from the dynamic, interactive nature characteristic of E-AI.

It is intriguing that despite the growing concerns about the risks and alignment challenges of LLMs highlighted in recent research (Bender et al., 2021), SMAIs have attracted comparatively less scrutiny (Huszár et al., 2022; Ribeiro et al., 2020). This is noteworthy considering SMAIs have been around for a longer time and their influence on society is both wider and more profound. We propose that their widespread acceptance and their more integrated, less intrusive presence in our lives are due to their closer alignment with the principles of embodiment, in contrast to LLMs.

What do we mean by SMAIs being closer to embodiment? Firstly, SMAIs are driven by clear objectives: to captivate our attention and maximize our engagement with their respective platforms (Bozdag, 2013; Bodó, 2021). These goals are fundamentally linked to the business models of these platforms, which revolve around advertising. The specifics of these “engagement” objectives are typically proprietary, forming the core of the competitive advantage of these platforms. Although these goals are initially human-designed and not intrinsically generated (Covington et al., 2016), they are subject to evolutionary pressure and adaptation, and thus they are tied to the existence and the survival of the SMAI. Secondly, SMAIs learn almost entirely from the data they collect by interacting with us. This leads to a high level of individuation (adapting to our individual preferences (Nguyen et al., 2014)), and notions of exploration (offering us content not so much to satisfy us but for the sake of learning what we like). This creates a user experience that, when well-executed, resembles interaction with a considerate friend, who wants the best for us, who connects us to things we like, and who wants to understand us better. The flip side, however, is the potential for these systems to morph into mechanisms that perpetuate addictive behaviors or harmful content (Schüll, 2012). Nevertheless, since SMAIs connect and adapt to us in a more intuitive and deeper manner than LLMs, we often feel a greater sense of control over our interactions with these systems (by, for example, consciously not clicking on content that we know we do not want to see
in the long run). This control, albeit limited, is reminiscent of persuasion more than mechanical manipulation, aligning with how we interact with other sentient beings rather than machines. This type of relationship with AI systems is a fundamental aspect of Levin (2022)’s TAME proposal. Our stance on E-AI suggests that, while systems akin to SMAIs pose greater risks due to their seamless integration into our social fabric, they also present more natural opportunities for alignment with our values. This alignment process is procedural, perspectival, and evolutionary in nature (Vervaeke et al., 2012; Vervaeke & Coyne, 2024), contrasting with the primarily propositional approaches being applied to LLMs (Shen et al., 2023).

We posit that the potential for more effective and naturally aligned AI systems is, alone, a compelling reason to prioritize E-AI in the broader AI research agenda.

In the forthcoming section, we further explore the pivotal role that well-executed implementations of E-AI could play in the quest for AGI.

3. Why embodiment?

In the previous section, we examined how contemporary theories of embodiment, particularly the TAME framework (Levin, 2022), challenge the long-standing Cartesian dualism, which posits a distinct separation between mind and body (Descartes, 2012). This philosophical stance has significantly influenced the development of current generative AI models, such as LLMs, which primarily rely on static data and lack interaction with the physical or even the symbolic world. It is a prevalent belief that simply scaling up such models, in terms of data volume and computational power, could lead to AGI. We contest this view. We propose that true understanding, not only propositional truth but also the value of propositions that guide us in how to act, is achievable only through E-AI agents that live in the world and learn about it by interacting with it.

The significance of embodiment in cognitive development was demonstrated by Held & Hein (1963)’s carousel experiment with kittens. In this study, one kitten could actively interact with and control a carousel, while the other could only observe it passively. Despite both kittens receiving identical visual input, the one engaged in active interaction exhibited normal visual development, unlike its passively observing counterpart. This seminal experiment underscores the vital role of embodied interaction in shaping cognitive abilities (Shenavarmasouleh et al., 2022). It also reinforces the observation that all known forms of intelligence, including human intelligence, are inherently embodied (Smith & Gasser, 2005), suggesting that embodiment serves as a solid foundation for cognitive learning and development. Current AI learns in a very different way from humans. We humans learn by seeing, moving, interacting with the world and speaking with others. We also learn by collecting sequential experiences, not by passive observation of shuffled and randomized, even if carefully selected, data (Smith & Gasser, 2005; Westho et al., 2020). We advocate for an approach where insights from cognitive science and developmental psychology inform the design of AI systems. Such systems should be designed to learn through active interaction with their surroundings, mirroring the embodied learning processes fundamental to human cognition.

Even advocates of static learning concede that multimodal learning is the next milestone towards AGI (Fei et al., 2022; Parcalabescu et al., 2021). In I-AI, multimodal data needs to be collected and connected painstakingly. In contrast, E-AI agents, when equipped with multimodal sensors, will inherently collect and correlate multimodal data by mere co-occurrence. For instance, robots will see (CV), communicate (NLP), reason (general intelligence), navigate and interact with their environment (planning and RL), all simultaneously (Shenavarmasouleh et al., 2022). Intelligent routers will observe requests and traffic (sensing), communicate with other routers and human engineers, absorb news about their surroundings (NLP), reason (general intelligence), and control the traffic (control and RL). Despite the impressive progress in these domains, much of it has relied on the external collection and curation of vast datasets for algorithmic training. This approach has significant drawbacks: i) the collection and preparation of data demands substantial investments; ii) this data can contain biases that are hard to detect and rectify (Li & Deng, 2020; Balayn et al., 2021; Verma et al., 2021). The issue of biases is particularly pertinent in discussions on AI alignment (Shen et al., 2023; Ji et al., 2023). Efforts to align AI through rule-based and procedural methods (such as RLHF (Lambert et al., 2022)) often struggle, producing systems that feel mechanistic and “dumb”, rather than an agent which seamlessly acts according to values compatible with our society.

An embodied agent, designed to interact with and learn from its environment, fundamentally changes the traditional approach to data collection and curation in AI development. By being inherently integrated with its physical and social contexts, such an agent bypasses the labor-intensive processes previously required. This shift not only simplifies the challenge of aligning AI with human values but also enhances the agent’s learning efficiency by utilizing the unique features of its environment. As a result, the focus in AI development transitions from data to simulators. These simulators serve a dual purpose: they are both training grounds for E-AI and platforms for testing and refining concepts and algorithms (Duan et al., 2022). Moreover, the process of aligning these agents with human values becomes more intuitive as it involves defining goals re-
flective of those values. This approach does not claim to fully resolve the alignment challenge, as E-AI systems will still necessitate oversight and guidelines to avert unwanted behaviors. However, the alignment process becomes inherently more natural. Adjusting and defining goals is a more straightforward task than the extensive editing and curating of data. This methodology draws upon our inherent, non-propositional understanding and instincts about aligning embodied intelligences, whether it is guiding our own actions, nurturing children, or training pets.

Another important characteristic of E-AI, stemming from the coupling between the agent and its environment, is the agent’s capacity for ongoing evolution and adaptation. This adaptability is vital for any agent destined to navigate a world in perpetual change. It underscores the importance of continual learning: the process of assimilating new experiences while retaining previously acquired knowledge (Wang et al., 2023a).

Moreover, Ishiguro & Kawakatsu (2004) have shown, both through theory and practical application in robotics, that a close and effective integration of control mechanisms with body dynamics significantly enhances energy efficiency. Ororbia & Friston (2024) elaborate on energy efficiency, proposing that embodied mortal systems, which are characterized by their inherent lifecycle and eventual mortality, can optimize energy usage through adaptive processes. These processes allow the system to self-organize and maintain homeostasis by minimizing free energy, in line with the free energy principle (Friston, 2010; Friston et al., 2023). Coupled systems also lead to the emergence of intriguing behaviors that can be hard to explicitly program or learn from disembodied datasets (Rosas et al., 2020), an observation aligning with the principles of the TAME framework.

Embodiment is also a prerequisite for learning about affordances (Gibson, 1979). Learning, or more precisely realizing affordances, according to Vervaeke et al. (2012)’s perspectival learning, is a fundamental capacity of AGI, as affordances are what “fill our world with meaning” (Roli et al., 2022), and are thus necessary for agents that give meaning to their own world. Affordances emerge from the dynamic interplay between an agent’s perception, objectives, abilities, and the characteristics of objects and contexts within the environment; for example, a chair affords sitting, a glass affords drinking, and a hand affords grasping and picking up objects. Roli et al. (2022) argue that the capacity to comprehend, utilize, and be influenced by environmental affordances distinguishes biological intelligence from current artificial systems. Besides affordances, E-AI is also indispensable for investigating emergent phenomena such as qualia (Locke, 1847; Korth, 2022), consciousness (Solms, 2019), as well as creativity, empathy (Perez, 2023), and ethical understanding (Lake et al., 2017; Russell, 2021).

Finally, there is the important question of why an intelligent agent would do anything in the first place (Pfeifer & Iida, 2004). What drives it to engage and acquire new knowledge without external prompts? Within well-framed small worlds, such as a chess game, an agent’s purpose is straightforward: deciding the next move. However, when navigating large, open worlds, the motivations guiding an agent’s decisions grow increasingly ambiguous. The concepts of active inference and the free energy principle (Friston, 2010; Friston et al., 2023) provide a compelling framework for understanding the behaviors of intelligent agents. This principle posits that minimizing surprise and uncertainty is the core objective of the agents. They achieve this through the use of internal models to forecast outcomes, continually updating these models with sensory input, and proactively modifying their surroundings to better match their expectations. This concept resonates within the AI community, particularly in the design of agents equipped with mechanisms for intrinsic motivation (Oudeyer & Kaplan, 2007; Pathak et al., 2017), which incentivize agents to explore and acquire new knowledge to reduce uncertainty.

However, what propels an intelligent agent to act, especially beyond mere survival instincts, continues to be a matter of debate. We argue that exploring and developing embodied agents will illuminate this question. Thus, E-AI not only shows potential for significant breakthroughs toward achieving AGI, but also has deep implications for our understanding of cognition in general.

4. Theoretical framework

In previous sections, we have underscored the pivotal role of E-AI in advancing toward AGI. Shifting focus, we now delve into the essential components that, we believe, will comprise E-AIs. We draw heavily on the concept of cognitive architectures designed by cognitive scientists aiming to model the human mind (Thagard, 2012). Despite the promise these architectures hold for enhancing modern machine learning methods, progress on this front has been notably limited (Kotseruba & Tsotsos, 2020). The slow advancement is largely due to cognitive architectures being the domain of neuroscientists and cognitive scientists, with only a select few within the machine learning community exploring their potential for AGI. We advocate for a synergistic strategy that marries cognitive architectures with machine learning within the E-AI paradigm, proposing it as a viable path toward AGI. The emergence of agent-based LLMs, such as AutoGPT (FIRAT & Kuleli, 2023), which pioneers the generation of autonomous agents, and PanguAgent (Christianos et al., 2023), an agent-focused language model, indicate the potential of this approach.

We identify four essential components of an E-AI sys-
tem: (i) perception: the ability of the agent to sense its environment; (ii) action: the ability to interact with and change its environment; (iii) memory: the capacity to retain past experiences; and (iv) learning: integrating experiences to form new knowledge and abilities. These components are notably aligned with the active inference framework of Friston (2010). In this framework, the agent models its world through a probabilistic generative model that infers the causes of its sensory observations (perception). This model is hierarchical, forecasting future states in a top-down manner and reconciling these predictions with bottom-up sensory data, with discrepancies or errors being escalated upwards only when they cannot be reconciled at the initial level. The agent acts to minimize the divergence between its anticipations and reality, thus moving towards states of reduced uncertainty (action). Concurrently, it collects and stores new information about its environment (memory) and refines its internal model to minimize predictive errors (learning). In the sections that follow, we describe these four components in detail and how they make up the E-AI agent.

4.1. Perception

At the heart of an embodied agent lies the ability to perceive the world in which it exists. Perception is a process by which raw sensory data is transformed into a structured internal representation, enabling the agent to engage in cognitive tasks. The range of inputs that inform perception is vast, encompassing familiar human senses such as vision, hearing, smell, touch, and taste. It extends to any form of stimuli an agent might encounter, be it force sensors in robotics or signal strength indicators in wireless technology. The challenge with sensory data is that it is often not immediately actionable. It typically undergoes a process of transformation, a task where recent advances in machine learning can prove invaluable. The field has seen the development of sophisticated methods for learning feature and embedding spaces, facilitating the conversion of raw

active and goal-directed types. Reactive actions, akin to human reflexes, occur almost instantaneously in response to stimuli and play a crucial role in an agent’s immediate self-preservation by maintaining stability. Goal-directed actions, on the other hand, involve strategic planning and are motivated by high-level objectives. Reactive actions are important for self-preservation, with model-free reinforcement learning methods playing an important role in developing reactive control policies in tasks like robot walking (Rudin et al., 2022). On the other hand, for an agent to achieve more complex, high-level objectives, planning is indispensable, even if efficient planning remains an open area of research (Lin et al., 2022; Shi et al., 2022). Central to the concept of planning is the presence of a “world model” within the agent, which it can use to predict the consequences of its own actions. Model-based RL has made significant strides in developing algorithms that learn these world models and use them for planning (Silver et al., 2016; Kégl et al., 2021; Paolo et al., 2022).

4.3. Memory

Embodied agents learn from their experiences, which are stored in memory. Memory encompasses various dimensions, including its duration (short-term or long-term) and its nature (procedural, declarative, semantic, and episodic). Importantly, memory is not necessarily represented as explicit propositional knowledge; it can be implicitly encoded into the weights of a neural network (NN). To navigate cognitive tasks, agents require diverse types of memory systems, each playing a distinct role. Working and short-term memory offer temporary storage to support the agent’s immediate objectives. Long-term and episodic memories provide a reservoir for information over longer time spans. Episodic memory captures and stores unique, perspectival experiences, ready to be accessed when familiar scenarios unfold. Long-term memory, conversely, is the repository for broader propositional knowledge. LLMs, for example, implement long-term memory
data into meaningful information (Golinko & Zhu, 2019; using Retrieval-Augmented Generation (RAG) (Gao et al.,
Sivaraman et al., 2022). A particularly effective strategy 2024), a technique that reduces hallucinations using an ex-
has been self-supervised learning to learn such representa- ternal database. This technique showcases how sophisti-
tions. Although much of the research has concentrated on cated machine learning methods can be synergized with
single modalities, such as vision (Oquab et al., 2023), the cognitive architectures.
principles underlying these techniques are universally ap-
plicable across different sensory inputs (Orhan et al., 2022; 4.4. Learning
Lee et al., 2019). A defining trait of intelligent agents is their ability to
learn. Yet, how to learn, especially in a continuous and
4.2. Action dynamic way, remains a subject of ongoing research and
Embodied agents navigate the world by taking actions and debate (Wang et al., 2023a; Yifan et al., 2023). While re-
observing the outcomes. Acting can be broken down into cent strides in AI have largely been powered by training
two steps: (i) choosing what action to undertake next, like on static datasets, the concept of continual learning, es-
deciding to relocate to a specific spot, and (ii) determining sential for adapting over time, faces challenges. These
how to execute this action, such as plotting the course to challenges stem primarily from the inherent limitations of
that location. Actions can further be categorized into re- deep NNs, such as catastrophic forgetting (Kemker et al.,

6
A Call for Embodied AI

2018), and the complexities associated with learning from non-stationary data that result from an agent’s interaction with its environment (Fahrbach et al., 2023). The embodiment hypothesis suggests that true intelligence is born from such interactions (Smith & Gasser, 2005), underscoring the need for dynamic learning methodologies. In this context, simulators emerge as a vital tool, offering a shift away from the static learning typical of traditional AI. Instead, they enable agents to evolve through ongoing, interactive experiences within simulated environments (Duan et al., 2022).

5. Challenges
E-AI agents will adopt an egocentric perspective, experiencing their environment from a first-person viewpoint, in contrast to the allocentric perspective prevalent in current AI systems. This shift is not only essential for meaningful interaction with the world but also offers an advantage by allowing the agents to focus on modeling their immediate surroundings rather than the entirety of the world. On the other hand, E-AI introduces several challenges, including extending current learning theories, managing noise in perception and action effectively and safely, and ensuring meaningful communication with humans that adheres to ethical standards. The remainder of this section will cover these challenges, exploring potential pathways and solutions.

5.1. New learning theory
The principles of E-AI challenge us to reevaluate traditional learning theories (Devroye et al., 1996; Vapnik, 1998), bridging a gap between supervised and reinforcement learning. Supervised learning, while foundational in AI, assumes that the data is drawn from an unknown but fixed distribution, collected independently of the learning process. This theory gives rise to the classical notions of generalization, over- and underfitting, bias and variance, and asymptotic or finite-sample statistical consistency. This framing is obviously highly useful: even those who are not explicitly doing theory use it transparently as their lingua technica and cognitive scaffolding when working with algorithms and analyzing results.

When embodied agents interact dynamically with their environment, data collection becomes part of the data science pipeline (Pfeifer & Iida, 2004; Thrun et al., 2005). Classical supervised learning theory is insufficient to analyze these cases and to guide algorithm building. Extensions, like transfer learning (Pan & Yang, 2010), multitask learning (Caruana, 1997), distribution shift (Quiñonero-Candela et al., 2009), domain adaptation (Csurka, 2017) or out-of-distribution generalization, have been proposed to patch basic supervised learning theory, but most of these cling to the original framing, pretending that the data is coming from outside the learning process, encapsulating the value (business or otherwise) of the predictive pipeline. Practically, this is obviously not the case: the data on which we learn a predictor is often collected by the data scientist, responsible for the quality of the pipeline (O’Neil & Schutt, 2013; Provost & Fawcett, 2013). Furthermore, most of the debates around responsible AI turn around the data, not the learning algorithm (O’Neil, 2016; Selbst et al., 2019). Collecting, selecting, and curating data is obviously part of the pipeline. The text we use to train LLMs is created by its writers, rather than drawn from a distribution. In some cases, when collection and model-retraining are automated, the situation may be even worse. For example, in click-through-rate prediction (Bottou et al., 2013; Perlich et al., 2014) or recommendation systems (Deldjoo et al., 2020), the deployed predictor affects the data for the next round of training, generating an often adversarial feedback loop. A similar phenomenon is happening in the LLM world: as these AIs become the go-to tools for creative and business writing, the data collected for the next round of training will, in large part, be coming from the previous generation of LLMs.

Reinforcement Learning (Sutton & Barto, 2018) and related paradigms (Bayesian optimization (Mockus, 1989) or contextual bandits (Langford & Zhang, 2008)) offer a closer fit for embodied AI, when the prediction is not the end-product, but rather part of a predictive pipeline that also includes data collection. RL allows the data scientist to design a higher-level objective, letting the algorithm optimize both the predictor and the data it is trained on. Here, the mismatch between theory and practice is different from supervised learning. The analysis in RL or bandit theory often focuses on the convergence of the agent to a theoretical optimum, given a fixed but often unknown environment. RL theory usually does not offer tools to analyze the data collected during the learning process, especially when the collection is semi-automatic (includes a human curator in the loop). RL agents, in practice, usually do not converge even in a stationary environment; rather, they individuate, making, for example, quite perversely, the random seed part of the algorithm (Henderson et al., 2018). This is even more pronounced in non-stationary environments where the agent’s actions alter the environment, a situation in which AGI will definitely find itself (da Silva et al., 2006; Zhou et al., 2024).

A new learning theory for embodied AI must transcend these limitations. It should account for the dynamic, interactive nature of data in E-AI, where the agent’s actions continuously reshape its learning environment. This theory should not just aim for optimal performance in a fixed setting but should embrace a spectrum of behaviors suitable for evolving environments. Moreover, it should provide diagnostics to assess the quality and relevance of data generated through these interactions.
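
To make the feedback loop discussed in Section 5.1 concrete, the following minimal Python sketch (an illustration added here, not part of the original analysis) simulates a toy click-through-rate setting. All names and numbers (`simulate`, `greedy`, `uniform`, `TRUE_CTR`) are hypothetical. Under the `greedy` policy, the deployed predictor decides which item is shown and therefore which data it is later re-estimated on; under `uniform`, data is collected independently of the predictor, as the classical supervised framing assumes.

```python
import random


def simulate(policy, true_ctr, rounds=3000, seed=0):
    """Run `rounds` impressions and return (shows, estimates) per item.

    `policy(estimates, rng)` picks the item to display next; only the
    displayed item yields a click/no-click observation.
    """
    rng = random.Random(seed)
    shows = [0] * len(true_ctr)
    clicks = [0] * len(true_ctr)
    for _ in range(rounds):
        # Current CTR estimates; items never shown keep the default 0.0.
        estimates = [c / s if s else 0.0 for c, s in zip(clicks, shows)]
        item = policy(estimates, rng)
        shows[item] += 1
        clicks[item] += int(rng.random() < true_ctr[item])
    estimates = [c / s if s else 0.0 for c, s in zip(clicks, shows)]
    return shows, estimates


def greedy(estimates, rng):
    """The deployed predictor decides: show the highest-estimate item."""
    return max(range(len(estimates)), key=estimates.__getitem__)


def uniform(estimates, rng):
    """Data collected independently of the predictor (the i.i.d. framing)."""
    return rng.randrange(len(estimates))


TRUE_CTR = [0.05, 0.10, 0.30]  # hypothetical click-through rates
greedy_shows, greedy_est = simulate(greedy, TRUE_CTR)
uniform_shows, uniform_est = simulate(uniform, TRUE_CTR)

print("greedy  shows:", greedy_shows, "estimates:", greedy_est)
print("uniform shows:", uniform_shows, "estimates:", uniform_est)
```

In this toy setting the greedy policy only ever displays the first item, so its estimates for the other items never leave their initial value, whereas uniform logging observes every item. The sketch illustrates only the dependence between predictor and collected data; it is not a model of any real advertising or recommendation system.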


5.2. Noise and uncertainty
E-AI agents are tasked with navigating the real world, rife with noise and uncertainty. These elements can drastically affect both the agent’s perception of its surroundings and the quality of its decision-making. For example, elevated noise levels may distort the agent’s interpretation of environmental cues, leading to suboptimal decisions. This challenge is accentuated in an egocentric perspective, where agents frequently encounter continuous streams of fluctuating and imprecise data. Sources of noise include the natural imprecision of sensors and actuators, which might lack accuracy due to manufacturing inconsistencies, degradation over time, or external disturbances. Additionally, quantization error, a byproduct of converting analog signals into digital form (Widrow & Kollár, 2008), can further compromise data integrity.

As these agents learn and adapt to their environment, they must also grapple with uncertainty. This uncertainty can obscure the agent’s understanding of its environment, influencing its performance. This dilemma is especially prevalent in RL scenarios dealing with partial observability, where decisions must be made with incomplete information, leading to uncertainty in predicting the outcomes of its actions (Dulac-Arnold et al., 2021; Hess et al., 2023; Pattanaik et al., 2017). Therefore, managing noise and uncertainty effectively is paramount for the progress of E-AI.

5.3. Simulators
As we pivot towards E-AI, simulators will assume a fundamental role as a key driver of progress, similar to the role data sets play in the training of traditional I-AI models. These simulators offer a controlled, replicable environment where AI systems can be rigorously trained and tested. This setup allows for learning and adapting to diverse scenarios prior to deployment, ensuring both safety and cost-efficiency. A notable advantage of simulators, and a requirement, is their speed and ease of parallelization, which significantly accelerates training and makes it more feasible to train sophisticated AI models on multiple scenarios simultaneously.

Many advanced simulators have been introduced recently, yet they often demand significant computational resources and are predominantly geared towards robotics applications (Li et al., 2021; Gan et al., 2020; Yan et al., 2018; Puig et al., 2018; Gao et al., 2019). For these simulators to truly serve the needs of E-AI, they must expand their scope to a broader spectrum of environments. A major challenge in the use of simulators is bridging the “reality gap” (Bousmalis & Levine, 2017): the difference between simulated conditions and the agent’s eventual real-world or virtual deployment context (Ligot & Birattari, 2020). This gap can lead to a situation where models that excel in simulations fail in actual application, undermining the effectiveness of the training process. Despite numerous strategies being put forward to mitigate the reality gap (Salvato et al., 2021; Daza et al., 2023; Daoudi et al., 2023; Koos et al., 2012; Tobin et al., 2017), it remains an unresolved issue in the field, challenging the applicability of simulated training environments.

5.4. Interaction with humans
A key ambition of E-AI is to seamlessly interact with and learn from humans, enhancing AI’s ability to offer personalized and impactful solutions. By improving these interactions, E-AI will also diminish fear and mistrust towards AI technologies, leading to broader acceptance and integration. In this endeavor, LLMs stand out as particularly beneficial, with their ability to comprehend and produce human-like text, facilitating communication in natural language and making engagements with AI more natural and accessible. The domain of Human-Robot Interaction (HRI) offers valuable lessons for enhancing AI-human communication, as researchers in this domain have dedicated efforts to explore innovative methods for robots to better communicate with us (Amirova et al., 2021; Bonarini, 2020). Yet, the challenge of ensuring proper and ethical communication with AI systems persists. The effectiveness of LLMs, for instance, hinges significantly on their training and how well they are aligned with human intentions and values (Wang et al., 2023b). Integrating human oversight directly into the AI development process and establishing comprehensive guidelines and protocols for AI communication are among the proposed strategies to address these challenges, aiming to make AI interactions more meaningful and ethically sound.

5.5. Generalization
An important issue in AI is generalization. There have been many attempts at developing systems capable of quickly generalizing to settings unseen at training time (Pourpanah et al., 2022) in the same fashion living beings do. Nonetheless, this is still an open problem that will likely afflict embodied AIs as well, as acting in the real world exposes the agent to situations unseen at training time. For instance, consider a service robot trained in a simulated environment. When placed in a real household, it may encounter novel objects and behaviors not present in its training data, leading to suboptimal or even erroneous actions. This illustrates the critical need for AIs that can adapt and generalize beyond their initial programming. A promising direction in addressing this problem is leveraging the enormous amount of internet data. LLMs have demonstrated remarkable zero-shot learning capabilities with minimal fine-tuning (Wei et al., 2021). We can envision that some form of pretraining on internet datasets can kick-start the AI before its embodied phase, enhancing generalization
and adaptability.

Recent developments in robotics have started exploring this research direction. Ahn et al. (2024) used a mixed approach between I-AI and E-AI to effectively control multiple robots in different settings. However, only relying on internet data is insufficient. An important aspect is also the ability to accurately identify unknown situations and avoid overconfidence, a common shortcoming of LLMs, which often produce plausible-sounding responses that are factually incorrect (Xiong et al., 2024). The ability to assess its own uncertainty is essential, and can prompt the AI to seek human assistance, similarly to how infants ask for help in their early development. We believe that, while the integration of I-AI and E-AI will prove necessary as a foundation for the development of the next generation of intelligent systems, the active learning paradigm and precise uncertainty estimation are vital. Active learning, where the AI actively queries for information when uncertain, combined with reliable uncertainty estimation, can enable an E-AI agent to manage novel situations effectively.

Finally, we believe that to properly address the issue of generalization, the community must first clearly define what “generalization” means. Currently, discussions around this issue often rely on vague terms, referring to an agent’s ability to adapt to unseen settings or data. However, without a formal definition, it is challenging to assess or improve generalization effectively.

Consider the varying degrees of generalization required in different scenarios: transferring skills from driving a car to driving a bus represents generalization within a similar domain, whereas adapting from walking to swimming involves a more profound shift in the type of task. These examples illustrate the spectrum of generalization challenges that embodied AI might face.

To advance this field, it is imperative to develop a precise definition of generalization and establish standardized metrics and benchmarks for measuring an AI’s generalization capabilities (Kawaguchi et al., 2017). This necessity ties back to our discussion in Section 5.1, highlighting the urgent need for a new learning theory that can provide a principled approach to developing agents that generalize well. Addressing these questions will hopefully lead to more precise and principled approaches in the development of the field.

5.6. Hardware limitations
A significant challenge to the broad-scale development and integration of E-AI lies in the hardware requirements of these AI systems. Presently, AI technologies largely depend on GPU clusters, which are, while powerful, not ideally suited for embodied agents due to their high cost, energy consumption, and extensive heat output. Additionally, the physical bulk and heft of GPUs pose logistical challenges for mobile agents or those operating within spatial limitations. Addressing these constraints necessitates the innovation of new, energy-efficient hardware solutions that can be embedded within the agents. Promising developments are on the horizon, with Google’s Tensor Processing Unit (TPU) (Norrie et al., 2021; Cass, 2019) and Huawei’s Ascend chip (Liao et al., 2021) leading the charge. These advancements, coupled with the potential of neuromorphic computing and the strategic synergy of hardware-software co-design, signal a new era of hardware capability. Moreover, the development of energy- and data-efficient algorithms is critical. Such breakthroughs in hardware and algorithm efficiency will have a direct and profound effect on an AI’s ability to understand, decide, and interact within its environment, enabling E-AI agents to operate more autonomously and effectively in a diverse array of settings.

6. Conclusion
In this paper, we have articulated the critical role Embodied AI plays on the path toward achieving AGI, setting it apart from prevailing AI methodologies, notably LLMs. By integrating insights from a spectrum of research fields, we underscored how E-AI’s development benefits from existing knowledge, with LLMs enhancing the potential for intuitive interactions between humans and emerging AI entities. We introduced a comprehensive theoretical framework for the development of E-AI, grounded in the principles of cognitive science, highlighting perception, action, memory, and learning, and situating E-AI within the context of Friston’s active inference framework, thereby offering a wide-ranging theoretical backdrop for our discussion. Despite this outlook, the journey ahead is fraught with challenges, not least the formulation of a novel learning theory tailored for AI and the creation of sophisticated hardware solutions. This paper aims to serve as a roadmap for ongoing and future research into E-AI, proposing directions that could lead to significant advancements in the field.

Impact Statement
While the development of Embodied AI introduces complexities and challenges, particularly in hardware requirements, ethical considerations, and safety protocols, the potential benefits significantly outweigh these drawbacks. E-AI stands to evolve our interaction with technology, imbuing AI with a deeper understanding of and engagement with both the physical world and human society. This not only paves the way for more natural and effective human-AI interactions but also enhances AI’s adaptability and application across a broad spectrum of fields.
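
As an illustration of the uncertainty-triggered help seeking described in Section 5.5, the Python sketch below shows one simple way an agent could decide between acting and querying a human. It is a minimal example under stated assumptions, not a method proposed in this paper: it uses the Shannon entropy of a categorical predictive distribution as the uncertainty signal, and the function names, the action labels, and the `max_entropy_fraction` threshold are all hypothetical. It also assumes the probabilities are reasonably calibrated, which Section 5.5 notes is often not the case for LLMs.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def act_or_ask(probs, actions, max_entropy_fraction=0.5):
    """Return ("act", action) when confident, ("ask", None) otherwise.

    The agent asks for help when the entropy of its predictive
    distribution exceeds a fraction of the maximum possible entropy,
    log(number of actions). The threshold is a hypothetical choice.
    """
    if len(probs) != len(actions) or len(probs) < 2:
        raise ValueError("need one probability per action, at least two actions")
    limit = max_entropy_fraction * math.log(len(probs))
    if predictive_entropy(probs) > limit:
        return ("ask", None)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ("act", actions[best])


ACTIONS = ["grasp", "push", "wait"]  # hypothetical action labels

# A peaked distribution leads to acting on the most probable action.
print(act_or_ask([0.96, 0.02, 0.02], ACTIONS))
# A nearly flat distribution leads to a query for human assistance.
print(act_or_ask([0.40, 0.35, 0.25], ACTIONS))
```

Entropy is only one possible uncertainty signal; ensemble disagreement or other estimators could be substituted without changing the act-or-ask structure.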


References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Ahn, M., Dwibedi, D., Finn, C., Arenas, M. G., Gopalakrishnan, K., Hausman, K., Ichter, B., Irpan, A., Joshi, N., Julian, R., et al. AutoRT: Embodied foundation models for large scale orchestration of robotic agents. arXiv preprint arXiv:2401.12963, 2024.

Amirova, A., Rakhymbayeva, N., Yadollahi, E., Sandygulova, A., and Johal, W. 10 years of human-NAO interaction research: A scoping review. Frontiers in Robotics and AI, 8:744526, 2021. URL https://www.frontiersin.org/articles/10.3389/frobt.2021.744526/full.

Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., and Van Den Hengel, A. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3674–3683, 2018. URL https://ieeexplore.ieee.org/document/8578485.

Bakshy, E., Messing, S., and Adamic, L. A. Exposure to ideologically diverse news and opinion on Facebook. In Proceedings of the National Academy of Sciences, volume 112, pp. 5791–5796. National Acad Sciences, 2015. URL https://www.science.org/doi/10.1126/science.aaa1160.

Balayn, A., Lofi, C., and Houben, G.-J. Managing bias and unfairness in data for decision support: a survey of machine learning and data engineering approaches to identify and mitigate bias and unfairness within data management and analytics systems. The VLDB Journal, 30(5):739–768, 2021. URL https://link.springer.com/article/10.1007/s00778-021-00671-8.

Bargmann, C. I. Comparative chemosensation from receptors to ecology. Nature, 444(7117):295–301, 2006. URL https://www.nature.com/articles/nature05402.

Bariah, L. and Debbah, M. AI embodiment through 6G: Shaping the future of AGI. 2023.

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623. ACM, 2021. URL https://dl.acm.org/doi/10.1145/3442188.3445922.

Bodó, B. Selling news to audiences–a qualitative inquiry into the emerging logics of algorithmic news personalization in European quality news media. In Algorithms, Automation, and News, pp. 75–96. Routledge, 2021. URL https://www.tandfonline.com/doi/full/10.1080/21670811.2019.1624185.

Bonarini, A. Communication in human-robot interaction. Current Robotics Reports, 1:279–285, 2020. URL https://link.springer.com/article/10.1007/s43154-020-00026-1.

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. In Journal of Machine Learning Research, volume 14, pp. 3207–3260, 2013. URL https://jmlr.org/papers/v14/bottou13a.html.

Bousmalis, K. and Levine, S. Closing the simulation-to-reality gap for deep robotic learning. Google Research Blog, 1, 2017. URL https://blog.research.google/2017/10/closing-simulation-to-reality-gap-for.html.

Bozdag, E. Bias in algorithmic filtering and personalization. Ethics and information technology, 15:209–227, 2013. URL https://link.springer.com/article/10.1007/s10676-013-9321-6.

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M. G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, L., Lee, T.-W. E., Levine, S., Lu, Y., Michalewski, H., Mordatch, I., Pertsch, K., Rao, K., Reymann, K., Ryoo, M., Salazar, G., Sanketi, P., Sermanet, P., Singh, J., Singh, A., Soricut, R., Tran, H., Vanhoucke, V., Vuong, Q., Wahid, A., Welker, S., Wohlhart, P., Wu, J., Xia, F., Xiao, T., Xu, P., Xu, S., Yu, T., and Zitkovich, B. RT-2: Vision-language-action models transfer web knowledge to robotic control. In arXiv preprint arXiv:2307.15818, 2023. URL https://arxiv.org/abs/2307.15818.

Brooks, R. A. Intelligence without representation. Artificial intelligence, 47(1-3):139–159, 1991. URL https://www.sciencedirect.com/science/article/abs/pii/000437029190053M.


Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. URL https://arxiv.org/abs/2005.14165.

Caruana, R. Multitask learning. Machine Learning, 28(1):41–75, 1997. URL https://link.springer.com/article/10.1023/A:1007379606734.

Cass, S. Taking AI to the edge: Google’s TPU now comes in a maker-friendly package. IEEE Spectrum, 56(5):16–17, 2019. URL https://ieeexplore.ieee.org/document/8701189.

Chen, T., Gupta, S., and Gupta, A. Learning exploration policies for navigation. arXiv preprint arXiv:1903.01959, 2019. URL https://arxiv.org/abs/1903.01959.

Christianos, F., Papoudakis, G., Zimmer, M., Coste, T., Wu, Z., Chen, J., Khandelwal, K., Doran, J., Feng, X., Liu, J., et al. Pangu-agent: A fine-tunable generalist agent with structured reasoning. arXiv preprint arXiv:2312.14878, 2023. URL https://arxiv.org/2312.14878.

Clark, A. Being There: Putting Brain, Body, and World Together Again. MIT Press, 1997. URL https://mitpress.mit.edu/9780262531566/being-there/.

Clark, A. and Chalmers, D. The extended mind. Analysis, 58(1):7–19, 1998. URL https://era.ed.ac.uk/bitstream/handle/1842/1312/TheExtendedMind.pdf?sequence=1&isAllowed=y.

Covington, P., Adams, J., and Sargin, E. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198, 2016. URL https://dl.acm.org/doi/10.1145/2959100.2959190.

Cramer, P. AlphaFold2 and the future of structural biology. Nature structural & molecular biology, 28(9):704–705, 2021. URL https://www.nature.com/articles/s41594-021-00650-1.

Csurka, G. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017. URL https://link.springer.com/chapter/10.1007/978-3-319-58347-1_1.

da Silva, B. C., Basso, E. W., Bazzan, A. L. C., and Engel, P. M. Dealing with non-stationary environments using context detection. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pp. 217–224, 2006. URL https://dl.acm.org/doi/10.1145/1143844.1143872.

Daoudi, P., Prieur, C., Robu, B., Barlier, M., and Santos, L. D. A trust region approach for few-shot sim-to-real reinforcement learning, 2023. URL https://arxiv.org/abs/2312.15474.

Daza, I. G., Izquierdo, R., Martínez, L. M., Benderius, O., and Llorca, D. F. Sim-to-real transfer and reality gap modeling in model predictive control for autonomous driving. Applied Intelligence, 53(10):12719–12735, 2023. URL https://link.springer.com/article/10.1007/s10489-022-04148-1.

Deldjoo, Y., Di Noia, T., and Merra, F. A. Adversarial machine learning in recommender systems (AML-RecSys). In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 869–872, 2020. URL https://dl.acm.org/doi/abs/10.1145/3336191.3371877.

Descartes, R. Discourse on method. Hackett Publishing, 2012. URL https://hackettpublishing.com/discourse-on-method.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805.

Devroye, L., Györfi, L., and Lugosi, G. A Probabilistic Theory of Pattern Recognition. Springer, 1996. URL https://link.springer.com/book/10.1007/978-1-4612-0711-5.

Duan, J., Yu, S., Tan, H. L., Zhu, H., and Tan, C. A survey of embodied AI: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022. URL https://arxiv.org/abs/2103.04918.

Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., and Hester, T. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 110(9):2419–2468, 2021. URL https://link.springer.com/article/10.1007/s10994-021-05961-4.

Eirinaki, M., Gao, J., Varlamis, I., and Tserpes, K. Recommender systems for large-scale social networks: A review of challenges and solutions. Future Generation Computer Systems, 78:413–418, 2018. URL https://www.sciencedirect.com/science/article/pii/S0167739X17319684.


Fahrbach, M., Javanmard, A., Mirrokni, V., and Worah, P. Learning rate schedules in the presence of distribution shift. arXiv preprint arXiv:2303.15634, 2023. URL https://arxiv.org/abs/2303.15634.

Fei, N., Lu, Z., Gao, Y., Yang, G., Huo, Y., Wen, J., Lu, H., Song, R., Gao, X., Xiang, T., et al. Towards artificial general intelligence via a multimodal foundation model. Nature Communications, 13(1):3094, 2022. URL https://arxiv.org/abs/2110.14378.

FIRAT, M. and Kuleli, S. What if GPT4 became autonomous: The Auto-GPT project and use cases. Journal of Emerging Computer Technologies, 3(1):1–6, 2023. URL https://github.com/Significant-Gravitas/AutoGPT.

Friston, K. The free energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010. URL https://www.nature.com/articles/nrn2787.

Friston, K., Da Costa, L., Sajid, N., Heins, C., Ueltzhöffer, K., Pavliotis, G. A., and Parr, T. The free energy principle made simpler but not too simple. Physics Reports, 1024:1–29, 2023. URL https://arxiv.org/abs/2201.06387.

Held, R. and Hein, A. Movement-produced stimulation in the development of visually guided behavior. Journal of comparative and physiological psychology, 56(5):872, 1963. URL https://psycnet.apa.org/record/1964-03855-001.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. URL https://arxiv.org/abs/1709.06560.

Hess, F., Monfared, Z., Brenner, M., and Durstewitz, D. Generalized Teacher Forcing for Learning Chaotic Dynamics, October 2023. URL http://arxiv.org/abs/2306.04406. arXiv:2306.04406 [nlin].

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., and Liu, T. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, 2023. URL https://arxiv.org/abs/2311.05232.

Huszár, F., Ktena, S. I., O’Brien, C., Belli, L., Schlaikjer, A., and Hardt, M. Algorithmic amplification of politics on Twitter. Proceedings of the National Academy of Sciences, 119(1):e2025334119, 2022. doi: 10.1073/pnas.2025334119. URL https://www.pnas.org/doi/abs/10.1073/pnas.2025334119.

Gan, C., Schwartz, J., Alter, S., Mrowca, D., Schrimpf,
M., Traer, J., De Freitas, J., Kubilius, J., Bhandwal- Ishiguro, A. and Kawakatsu, T. How should con-
dar, A., Haber, N., et al. Threedworld: A platform trol and body systems be coupled? a robotic case
for interactive multi-modal physical simulation. arXiv study. In Embodied Artificial Intelligence: Interna-
preprint arXiv:2007.04954, 2020. URL https:// tional Seminar, Dagstuhl Castle, Germany, July 7-11,
arxiv.org/abs/2007.04954. 2003. Revised Papers, pp. 107–118. Springer, 2004.
URL https://link.springer.com/chapter/
Gao, X., Gong, R., Shu, T., Xie, X., Wang, S., and 10.1007/978-3-540-27833-7_8.
Zhu, S.-C. Vrkitchen: an interactive 3d virtual en-
vironment for task-oriented learning. arXiv preprint Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang,
arXiv:1903.05757, 2019. URL https://arxiv. K., Duan, Y., He, Z., Zhou, J., Zhang, Z., et al. Ai
org/abs/1903.05757. alignment: A comprehensive survey. arXiv preprint
arXiv:2310.19852, 2023. URL https://arxiv.
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, org/abs/2310.19852.
Y., Sun, J., Guo, Q., Wang, M., and Wang, H. Retrieval-
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B.,
augmented generation for large language models: A sur-
Chess, B., Child, R., Gray, S., Radford, A., Wu, J.,
vey, 2024. URL https://arxiv.org/abs/2312.
and Amodei, D. Scaling laws for neural language mod-
10997.
els. arXiv preprint arXiv:2001.08361, 2020. URL
https://arxiv.org/abs/2001.08361.
Gibson, J. J. The Ecological Approach to Visual Perception.
Houghton Mifflin, 1979. Kawaguchi, K., Kaelbling, L. P., and Bengio, Y.
Generalization in deep learning. arXiv preprint
Golinko, E. and Zhu, X. Generalized feature embed- arXiv:1710.05468, 1(8), 2017.
ding for supervised, unsupervised, and online learning
tasks. Information Systems Frontiers, 21:125–142, 2019. Kégl, B., Hurtado, G., and Thomas, A. Model-based
URL https://link.springer.com/article/ micro-data reinforcement learning: what are the crucial
10.1007/s10796-018-9850-y. model properties and which model to choose? arXiv

12
A Call for Embodied AI

preprint arXiv:2107.11587, 2021. URL https:// Lambert, N., Castricato, L., von Werra, L., and Havrilla,
arxiv.org/2107.11587. A. Illustrating reinforcement learning from human feed-
back (rlhf). Hugging Face Blog, 2022. URL https://
Kemker, R., McClure, M., Abitino, A., Hayes, T., and huggingface.co/blog/rlhf.
Kanan, C. Measuring catastrophic forgetting in neural
networks. In Proceedings of the AAAI conference on ar- Langford, J. and Zhang, T. The epoch-greedy algorithm
tificial intelligence, volume 32, 2018. URL https:// for contextual multi-armed bandits. In Advances
arxiv.org/abs/1708.02072. in Neural Information Processing Systems, vol-
ume 20, 2008. URL https://proceedings.
Kirchhoff, M., Parr, T., Palacios, E., Friston, K., and neurips.cc/paper/2007/file/
Kiverstein, J. The Markov blankets of life: autonomy, 4b04a686b0ad13dce35fa99fa4161c65-
active inference and the free energy principle. Journal Paper.pdf.
of The Royal Society interface, 15(138):20170792, 2018.
URL https://royalsocietypublishing. Lee, M. A., Zhu, Y., Srinivasan, K., Shah, P., Savarese, S.,
org/doi/10.1098/rsif.2017.0792. Fei-Fei, L., Garg, A., and Bohg, J. Making sense of vi-
sion and touch: Self-supervised learning of multimodal
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., representations for contact-rich tasks. In 2019 Interna-
Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ra- tional Conference on Robotics and Automation (ICRA),
malho, T., Grabska-Barwinska, A., et al. Overcom- pp. 8943–8950. IEEE, 2019. URL https://arxiv.
ing catastrophic forgetting in neural networks. Pro- org/1810.10191v2.
ceedings of the National Academy of Sciences, 114
Levin, M. Technological approach to mind everywhere:
(13):3521–3526, 2017. URL https://arxiv.org/
an experimentally-grounded framework for understand-
abs/1612.00796.
ing diverse bodies and minds. Frontiers in systems
Koos, S., Mouret, J.-B., and Doncieux, S. The trans- neuroscience, 16:768201, 2022. URL https://
ferability approach: Crossing the reality gap in evolu- www.frontiersin.org/articles/10.3389/
tionary robotics. IEEE Transactions on Evolutionary fnsys.2022.768201.
Computation, 17(1):122–145, 2012. URL https:// Li, C., Xia, F., Martı́n-Martı́n, R., Lingelbach, M., Srivas-
ieeexplore.ieee.org/document/6151107. tava, S., Shen, B., Vainio, K., Gokmen, C., Dharan, G.,
Korth, M. The purpose of qualia: What if human thinking Jain, T., et al. igibson 2.0: Object-centric simulation
is not (only) information processing? arXiv preprint for robot learning of everyday household tasks. arXiv
arXiv:2212.00800, 2022. URL https://arxiv. preprint arXiv:2108.03272, 2021. URL https://
org/abs/2212.00800. arxiv.org/abs/2108.03272.
Li, S. and Deng, W. A deeper look at facial expression
Kotseruba, I. and Tsotsos, J. K. 40 Years of Cog-
dataset bias. IEEE Transactions on Affective Computing,
nitive Architectures: Core Cognitive Abilities and
13(2):881–893, 2020. URL https://arxiv.org/
Practical Applications, volume 53. Springer Nether-
1904.11150.
lands, 2020. ISBN 9550141039. doi: 10.1007/
s10462-018-9646-y. URL https://doi.org/10. Liao, H., Tu, J., Xia, J., Liu, H., Zhou, X., Yuan, H.,
1007/s10462-018-9646-y. and Hu, Y. Ascend: a scalable and unified architecture
for ubiquitous deep neural network computing: Industry
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Ger- track paper. In 2021 IEEE International Symposium on
shman, S. J. Building machines that learn and think High-Performance Computer Architecture (HPCA), pp.
like people. Behavioral and brain sciences, 40:e253, 789–801. IEEE, 2021. URL https://ieeexplore.
2017. URL https://pubmed.ncbi.nlm.nih. ieee.org/document/9407221.
gov/27881212/.
Ligot, A. and Birattari, M. Simulation-only experiments
Lakoff, G. and Johnson, M. Metaphors we live by. Univer- to mimic the effects of the reality gap in the automatic
sity of Chicago press, 1979. URL https://press. design of robot swarms. Swarm Intelligence, 14(1):1–
uchicago.edu/ucp/books/book/chicago/ 24, 2020. URL https://link.springer.com/
M/bo3637992.html. article/10.1007/s11721-019-00175-w.
Lakoff, G. and Johnson, M. L. Philosophy in the flesh : Lin, H., Sun, Y., Zhang, J., and Yu, Y. Model-based re-
the embodied mind and its challenge to western thought. inforcement learning with multi-step plan value estima-
1999. URL https://api.semanticscholar. tion. arXiv preprint arXiv:2209.05530, 2022. URL
org/CorpusID:16103621. https://arxiv.org/abs/2209.05530.

Locke, J. An essay concerning human understanding. Kay & Troutman, 1847. URL https://www.gutenberg.org/files/10615/10615-h/10615-h.htm.

McNearney, S. A Brief Guide to Embodied Cognition: Why You Are Not Your Brain, 2011.

Mockus, J. Bayesian Approach to Global Optimization: Theory and Applications. Kluwer Academic Publishers, 1989. URL https://link.springer.com/book/10.1007/978-94-009-0909-0.

Nguyen, T. T., Hui, P.-M., Harper, F. M., Terveen, L., and Konstan, J. A. Exploring the filter bubble: the effect of using recommender systems on content diversity. In Proceedings of the 23rd international conference on World wide web, pp. 677–686, 2014. URL https://dl.acm.org/doi/10.1145/2566486.2568012.

Norrie, T., Patil, N., Yoon, D. H., Kurian, G., Li, S., Laudon, J., Young, C., Jouppi, N., and Patterson, D. The design process for google’s training chips: Tpuv2 and tpuv3. IEEE Micro, 41(2):56–63, 2021. URL https://ieeexplore.ieee.org/document/9351692.

O’Neil, C. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, 2016.

O’Neil, C. and Schutt, R. Doing Data Science. O’Reilly Media, Inc., 2013.

Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.-Y., Xu, H., Sharma, V., Li, S.-W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. Dinov2: Learning robust visual features without supervision, 2023.

Orhan, P., Boubenec, Y., and King, J.-R. Don’t stop the training: continuously-updating self-supervised algorithms best account for auditory responses in the cortex. arXiv preprint arXiv:2202.07290, 2022. URL https://arxiv.org/abs/2202.07290.

Ororbia, A. and Friston, K. Mortal computation: A foundation for biomimetic intelligence, 2024. URL https://arxiv.org/abs/2311.09589.

Oudeyer, P.-Y. and Kaplan, F. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:6, 2007. URL https://www.frontiersin.org/articles/10.3389/neuro.12.006.2007.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.

Paolo, G., Gonzalez-Billandon, J., Thomas, A., and Kégl, B. Guided safe shooting: model based reinforcement learning with safety constraints. arXiv preprint arXiv:2206.09743, 2022. URL https://arxiv.org/2206.09743.

Parcalabescu, L., Trost, N., and Frank, A. What is multimodality? arXiv preprint arXiv:2103.06304, 2021. URL https://arxiv.org/abs/2103.06304.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pp. 2778–2787. PMLR, 2017. URL https://arxiv.org/1705.05363.

Pattanaik, A., Tang, Z., Liu, S., Bommannan, G., and Chowdhary, G. Robust Deep Reinforcement Learning with Adversarial Attacks, December 2017. URL http://arxiv.org/abs/1712.03632. arXiv:1712.03632 [cs].

Perez, C. Artificial Empathy - A Roadmap for Human Aligned Artificial General Intelligence. Publisher, 2023.

Perlich, C., Dalessandro, B., Raeder, T., Stitelman, O., and Provost, F. Machine learning for targeted display advertising: transfer learning in action. In Machine learning, volume 95, pp. 103–127. Springer, 2014.

Pfeifer, R. and Bongard, J. How the Body Shapes the Way We Think: A New View of Intelligence. MIT Press, 2006.

Pfeifer, R. and Iida, F. Embodied artificial intelligence: Trends and challenges. Lecture notes in computer science, pp. 1–26, 2004. URL https://link.springer.com/chapter/10.1007/978-3-540-27833-7_1.

Pourpanah, F., Abdar, M., Luo, Y., Zhou, X., Wang, R., Lim, C. P., Wang, X.-Z., and Wu, Q. J. A review of generalized zero-shot learning methods. IEEE transactions on pattern analysis and machine intelligence, 45(4):4051–4070, 2022.

Provost, F. and Fawcett, T. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. O’Reilly Media, Inc., 2013.

Puig, X., Ra, K., Boben, M., Li, J., Wang, T., Fidler, S., and Torralba, A. Virtualhome: Simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8494–8502, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Puig_VirtualHome_Simulating_Household_CVPR_2018_paper.html.

Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (eds.). Dataset Shift in Machine Learning. The MIT Press, 2009.

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision, 2022. URL https://arxiv.org/abs/2212.04356.

Ramakrishnan, S. K., Jayaraman, D., and Grauman, K. An exploration of embodied visual exploration. International Journal of Computer Vision, 129:1616–1649, 2021. URL https://arxiv.org/abs/2001.02192.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents, 2022. URL https://arxiv.org/abs/2204.06125.

Reynolds, G. D. and Roth, K. C. The development of attentional biases for faces in infancy: A developmental systems perspective. Frontiers in psychology, 9:222, 2018. URL https://www.frontiersin.org/articles/10.3389/fpsyg.2018.00222.

Ribeiro, M. H., Ottoni, R., West, R., Almeida, V. A., and Meira Jr, W. Auditing radicalization pathways on youtube. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 131–141, 2020.

Roli, A., Jaeger, J., and Kauffman, S. A. How organisms come to know the world: fundamental limits on artificial general intelligence. Frontiers in Ecology and Evolution, 9:1035, 2022. URL https://www.frontiersin.org/articles/10.3389/fevo.2021.806283.

Rosas, F. E., Mediano, P. A., Jensen, H. J., Seth, A. K., Barrett, A. B., Carhart-Harris, R. L., and Bor, D. Reconciling emergences: An information-theoretic approach to identify causal emergence in multivariate data. PLoS computational biology, 16(12):e1008289, 2020. URL https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008289.

Rudin, N., Hoeller, D., Reist, P., and Hutter, M. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on Robot Learning, pp. 91–100. PMLR, 2022. URL https://arxiv.org/abs/2109.11978.

Russell, S. Human-compatible artificial intelligence. Human-like machine intelligence, pp. 3–23, 2021. URL http://aima.cs.berkeley.edu/~russell/papers/mi19book-hcai.pdf.

Salvato, E., Fenu, G., Medvet, E., and Pellegrino, F. A. Crossing the reality gap: A survey on sim-to-real transferability of robot controllers in reinforcement learning. IEEE Access, 9:153171–153187, 2021.

Schüll, N. D. Addiction by Design: Machine Gambling in Las Vegas. Princeton University Press, 2012.

Selbst, A. D., Boyd, D., Friedler, S. A., Venkatasubramanian, S., and Vertesi, J. Fairness and abstraction in sociotechnical systems. ACM Conference on Fairness, Accountability, and Transparency (FAT*), pp. 59–68, 2019.

Shapiro, L. Embodied Cognition. Routledge, 2011.

Shapiro, L. G. Computer vision: the last fifty years. University of Washington, Last access, 7(05), 2021.

Shen, T., Jin, R., Huang, Y., Liu, C., Dong, W., Guo, Z., Wu, X., Liu, Y., and Xiong, D. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025, 2023. URL https://arxiv.org/abs/2309.15025.

Shenavarmasouleh, F., Mohammadi, F. G., Amini, M. H., and Reza Arabnia, H. Embodied ai-driven operation of smart cities: A concise review. Cyberphysical Smart Cities Infrastructures: Optimal Operation and Intelligent Decision Making, pp. 29–45, 2022. URL https://arxiv.org/abs/2108.09823.

Shi, L. X., Lim, J. J., and Lee, Y. Skill-based model-based reinforcement learning. arXiv preprint arXiv:2207.07560, 2022. URL https://arxiv.org/abs/2207.07560.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.

Sivaraman, V., Wu, Y., and Perer, A. Emblaze: Illuminating machine learning representations through interactive comparison of embedding spaces. In 27th International Conference on Intelligent User Interfaces, pp. 418–432, 2022. URL https://arxiv.org/abs/2202.02641.

Smith, L. and Gasser, M. The development of embodied cognition: Six lessons from babies. Artificial life, 11(1-2):13–29, 2005. URL https://pubmed.ncbi.nlm.nih.gov/15811218/.

Solms, M. The hard problem of consciousness and the free energy principle. Frontiers in Psychology, 2019.

Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650, 2019.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT press, 2018.

Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

Thagard, P. Cognitive architectures. The Cambridge handbook of cognitive science, 3:50–70, 2012.

Thrun, S., Burgard, W., and Fox, D. Probabilistic Robotics. MIT Press, 2005.

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23–30. IEEE, 2017.

Vapnik, V. Statistical Learning Theory. Wiley, 1998.

Varela, F. J., Thompson, E., and Rosch, E. The Embodied Mind: Cognitive Science and Human Experience. MIT Press, 1991.

Verma, S., Ernst, M., and Just, R. Removing biased data to improve fairness and accuracy. arXiv preprint arXiv:2102.03054, 2021. URL https://arxiv.org/2102.03054.

Vervaeke, J. and Coyne, S. Mentoring the Machines. Hackett Publishing, 2024. URL https://www.mentoringthemachines.com.

Vervaeke, J., Lillicrap, T. P., and Richards, B. A. Relevance realization and the emerging framework in cognitive science. Journal of Logic and Computation, 22(1):79–99, 2012.

Wang, L., Zhang, X., Su, H., and Zhu, J. A comprehensive survey of continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487, 2023a. URL https://arxiv.org/abs/2302.00487.

Wang, Y., Zhong, W., Li, L., Mi, F., Zeng, X., Huang, W., Shang, L., Jiang, X., and Liu, Q. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966, 2023b.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.

Westhoff, B., Koele, I. J., and van de Groep, I. H. Social learning and the brain: How do we learn from and about other people? Everything You and Your Teachers Need to Know About the Learning Brain, pp. 42, 2020. URL https://kids.frontiersin.org/articles/10.3389/frym.2020.00095.

Widrow, B. and Kollár, I. Quantization noise: round-off error in digital computation, signal processing, control, and communications. Cambridge University Press, 2008.

Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., Zheng, R., Fan, X., Wang, X., Xiong, L., Zhou, Y., Wang, W., Jiang, C., Zou, Y., Liu, X., Yin, Z., Dou, S., Weng, R., Cheng, W., Zhang, Q., Qin, W., Zheng, Y., Qiu, X., Huang, X., and Gui, T. The rise and potential of large language model based agents: A survey, 2023. URL https://arxiv.org/abs/2309.07864.

Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., and Hooi, B. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms, 2024.

Yan, C., Misra, D., Bennett, A., Walsman, A., Bisk, Y., and Artzi, Y. Chalet: Cornell house agent learning environment. arXiv preprint arXiv:1801.07357, 2018. URL https://arxiv.org/abs/1801.07357.

Yifan, C., Yulu, C., Yadan, Z., and Wenbo, L. Continual learning in an easy-to-hard manner. Applied Intelligence, pp. 1–21, 2023. URL https://link.springer.com/article/10.1007/s10489-023-04454-2.

Zhou, Q., Chen, S., Wang, Y., Xu, H., Du, W., Zhang, H., Du, Y., Tenenbaum, J. B., and Gan, C. HAZARD challenge: Embodied decision making in dynamically changing environments, 2024.
