Control Meets Learning

Control Meets Learning: talk slides from Prof. Jonathan How (MIT). Bio: Jonathan P. How is the Richard C. Maclaurin Professor of Aeronautics and Astronautics at the Massachusetts Institute of Technology. He received a B.A.Sc. (aerospace) from the University of Toronto in 1987, and his S.M. and Ph.D. in Aeronautics and Astronautics from MIT in 1990 and 1993, respectively, then spent 1.5 years at MIT as a postdoctoral associate. He joined the MIT faculty in 2000.


Learning-based Planning and Control:

Opportunities and Challenges

Jonathan P. How with many collaborators


MIT Department of Aeronautics and Astronautics

Control Meets Learning Virtual Seminar Series


May 12th, 2021



http://www.cs.utexas.edu/~eladlieb/rl_interaction.png

Introduction
• Traditional planning/control plays a key role in finding globally optimal policies, but some systems…
• Difficult/impossible to model from first principles
• Result in very complex models
• Curse of dimensionality in dynamic programming
• Yield poor performance on the real-world system [Abbeel06]

• Motivates working directly with the system, and/or data from it → machine learning / reinforcement learning
• Similar to system ID and adaptive control

• Many recent advances


• E.g., 2011 IBM Watson beats humans in Jeopardy
• Significant advances since then through deep learning → super-human performance on many tasks such as image labeling (CNNs)

http://www.nvidia.com/content/events/geoInt2015/LBrown_DL_Image_ClassificationGEOINT.pdf
Recent Progress
• Learning-based algorithms are state-of-the-art in many complex robotic perception and motion-planning tasks
• In many cases these techniques outperform traditional model-based techniques
• Perhaps more significantly, they often require far less domain knowledge/expertise
• But numerous challenges still remain…

Video examples: Control of a Quadrotor with RL; Learning Quadrupedal Locomotion over Challenging Terrain; Human-level control in Atari games [Mnih15]; BADGR: An Autonomous Self-Supervised Learning-Based Navigation System; Socially Aware Motion Planning with Deep RL; Super-Human Performance in Gran Turismo Sport Using Deep RL

https://www.youtube.com/watch?v=T0A9voXzhng
https://www.youtube.com/watch?v=9j2a1oAHDL8
https://www.youtube.com/watch?v=UtoZEwrDHj4
https://www.youtube.com/watch?v=Zeyv1bN9v4A
https://www.youtube.com/watch?v=CK1szio7PyA



Guidance on Future Directions
• http://www.hichristensen.com/pdf/roadmap-2020.pdf
• F. Lamnabhi-Lagarrigue, A. Annaswamy, S. Engell, A. Isaksson, P. Khargonekar, R. M. Murray, H. Nijmeijer, T. Samad, D. Tilbury, and P. Van den Hof, "Systems & Control for the future of humanity, research agenda: Current and future roles, impact and grand challenges," Annual Reviews in Control, vol. 43, pp. 1–64, 2017. doi:10.1016/j.arcontrol.2017.04.001

[Embedded image: first page of the Annual Reviews in Control review article, including its abstract and table of contents; the paper argues that Systems & Control is at the heart of information and communication technologies across application domains and should be supported at the levels necessary to address critical societal challenges.]



Challenges in Reinforcement Learning
• Complexity
• How to learn from much less data (sample complexity)?
• Environmentally friendly learning (computational complexity)?

• Safe learning
• Lack of safety guarantees can have catastrophic
consequences (e.g., loss of human life);
• How to guarantee/verify safety?

• Multiagent Learning
• Scaling to larger teams
• Architectural design choices



Complexity
Learning With Mixed Simulated/Real World Data
• Challenges:
• Data collection: How to balance simulator fidelity and cost to collect data?
• Efficiency: How to learn on the real system from limited samples?
• Transfer: How to transfer policies trained on simulated data to real-world?

Figures: a multi-fidelity chain of simulators and learning agents, Multi-Fidelity RL (MFRL) [Cutler14]; learning from demonstrations [Thakur19]; conceptual view of a simulation-to-reality transfer process [Peng18]

• Simple models good to find globally optimal policies using traditional control
• Refine using learning-based control on more complex simulations

• MFRL with bi-directional transfer finds better policies in the real world with fewer samples [Cutler15] (a toy sketch follows this list)
• Forward transfer (model → simulation → real world) to initialize policy/exploration
• Reverse transfer of observed data (complex simulation ← real world) to evaluate the learned policy and determine if an update is needed
• Key part is a “design of experiments” process
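A minimal, self-contained sketch of the multi-fidelity loop with forward and reverse transfer described above, in the spirit of [Cutler15]. The chain-MDP "simulators," the tabular Q-learner, and the disagreement check are toy stand-ins chosen for brevity, not the drifting-car models or learners from the talk:

```python
# Toy multi-fidelity RL loop: learn cheaply at low fidelity, warm-start the
# higher-fidelity (more realistic) level, and flag disagreements for reverse
# transfer. All environments and hyperparameters are illustrative placeholders.
import numpy as np

N_STATES, N_ACTIONS = 10, 2

def make_sim(slip):
    """Toy chain MDP; larger `slip` plays the role of a higher-fidelity model."""
    def step(s, a, rng):
        move = 1 if a == 1 else -1
        if rng.random() < slip:                      # action occasionally fails
            move = -move
        s2 = int(np.clip(s + move, 0, N_STATES - 1))
        return s2, (1.0 if s2 == N_STATES - 1 else 0.0)
    return step

def q_learning(step, Q, episodes, rng, alpha=0.2, gamma=0.95, eps=0.1):
    """Refine a shared Q-table against one fidelity level; return visited data."""
    data = []
    for _ in range(episodes):
        s = 0
        for _ in range(50):
            a = rng.integers(N_ACTIONS) if rng.random() < eps else int(Q[s].argmax())
            s2, r = step(s, a, rng)
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            data.append((s, a, s2))
            s = s2
    return data

rng = np.random.default_rng(0)
low_fi, high_fi = make_sim(slip=0.0), make_sim(slip=0.2)

Q = np.zeros((N_STATES, N_ACTIONS))
q_learning(low_fi, Q, episodes=200, rng=rng)          # cheap: learn a rough policy
real_data = q_learning(high_fi, Q, episodes=20, rng=rng)   # forward transfer: warm start

# Reverse-transfer / design-of-experiments flavour: transitions the cheap model
# cannot reproduce indicate where the lower-fidelity level should be revisited.
check_rng = np.random.default_rng(1)
mismatches = sum(1 for (s, a, s2) in real_data if low_fi(s, a, check_rng)[0] != s2)
print(f"{mismatches}/{len(real_data)} high-fidelity transitions disagree with the low-fidelity model")
```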
Example: Autonomous Drifting
• Goal: Learn closed-loop drifting controllers
• RC car with slick wheels - difficult to drive/control
by hand
• Significant model uncertainty - not obvious how
to design controller
• System ID “magic parameters” and initialize
with optimal control

[Block diagram: state (measured position, velocity, and heading from motion capture) feeds Pure Pursuit and MFRL velocity control, which send actions (v_cmd, r_cmd, ψ_cmd, δ, ω_cmd throttle commands) over a wireless link to the on-board low-level control, where a PI wheel-speed controller tracks ω_cmd using ω_measured.]



Example: Autonomous Drifting
• Goal: Maintain constant body-frame velocities, with significant side-slip
1. Optimal control policy helpful, but further learning required in simulation
2. With good initial policy, learning on the real car is fast

[Plots: x-y trajectory and time histories of Vx (m/s), Vy (m/s), and ψ̇ (rad/s) over t (s), comparing the prior optimal control policy (sim), the learned trajectory (sim), and the learned trajectory (real car), each starting from the marked start point.]



MFRL Experiments



Meta-Learning
• Issue:
• Standard ML has high sample complexity
• Often difficult to collect samples https://meta-world.github.io/

• Approach: Meta-learning, which learns an adaptable policy from a collection of similar experiences [Schmidhuber87; Bengio92] (a toy sketch follows below)

$\arg\max_{\phi} \log p(\phi \mid \mathcal{D}) \;\Rightarrow\; \arg\max_{\phi} \log p(\phi \mid \mathcal{D}, \mathcal{D}_{\text{meta-train}})$

• Benefit: Fast/effective online adaptation to new tasks


or system perturbations during execution
• [Wang16, Duan16; Finn17, Mishra17, Nichol18, Nagabandi19]

• Challenge: effective learning of meta-parameters
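A toy sketch of the meta-learning objective above, in the style of first-order MAML [Finn17]: adapt to each sampled task with one inner gradient step, then update the meta-parameters using the adapted loss. The sine-regression tasks, tiny network, and finite-difference gradients are illustrative choices to keep the example dependency-free, not anything from the talk:

```python
# First-order MAML-style meta-training on toy sine-regression tasks.
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Each 'task' is a sine wave with its own amplitude and phase."""
    amp, phase = rng.uniform(0.5, 2.0), rng.uniform(0, np.pi)
    def data(n):
        x = rng.uniform(-3, 3, (n, 1))
        return x, amp * np.sin(x + phase)
    return data

def predict(w, x):
    """Tiny one-hidden-layer network: x -> tanh(x W1 + b1) W2 + b2."""
    return np.tanh(x @ w["W1"] + w["b1"]) @ w["W2"] + w["b2"]

def grad(w, x, y, eps=1e-4):
    """Finite-difference gradient of the MSE loss (keeps the sketch dependency-free)."""
    base, g = np.mean((predict(w, x) - y) ** 2), {}
    for name, v in w.items():
        gv = np.zeros_like(v)
        for idx in np.ndindex(v.shape):
            v[idx] += eps
            gv[idx] = (np.mean((predict(w, x) - y) ** 2) - base) / eps
            v[idx] -= eps
        g[name] = gv
    return g

meta_w = {"W1": rng.normal(0, 0.5, (1, 16)), "b1": np.zeros(16),
          "W2": rng.normal(0, 0.5, (16, 1)), "b2": np.zeros(1)}
inner_lr, outer_lr = 0.05, 0.01

for _ in range(200):                              # outer loop over sampled tasks
    task = sample_task()
    xs, ys = task(10)                             # support set: used to adapt
    xq, yq = task(10)                             # query set: evaluates adapted params
    g_support = grad(meta_w, xs, ys)
    adapted = {k: v - inner_lr * g_support[k] for k, v in meta_w.items()}   # inner step
    g_query = grad(adapted, xq, yq)               # first-order MAML: drop 2nd-order terms
    meta_w = {k: v - outer_lr * g_query[k] for k, v in meta_w.items()}      # outer step

print("meta-trained; one inner step now adapts quickly to a new sine task")
```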

https://www.youtube.com/watch?v=ejG2nzCNdZ8&feature=emb_title&ab_channel=AnushaNagabandi


Learning with Mixed Simulated/Real World Data
• How to transfer simulation-trained policies to
real world?
• Many new tools to simulate realistic perception
• Powered by video game graphics engines (Unreal & Unity)

Simulation tools: CAD2RL [Sadeghi17]; Domain Randomization [Tobin17]; Microsoft AirSim; FlightGoggles [Guerra19]

• Despite realistic simulations of sensors & vehicles,


still difficult to test response to other agent behaviors
Complexity: Future Directions and Opportunities
• “Sim to real” gap reducing for perception systems with the new software available
• Still time-consuming to create large datasets

• Data with AI agents/robots in proximity to humans much harder to obtain


• Safety; time consuming/expensive

• Training with less data (fewer samples) is possible, but how representative is
it of the real-world?
• Design of experiments
• Extensions to MFRL ideas beyond Gaussian process models

• Integration of physics-based modeling/constraints in learning needed



Certification
Safe & Robust Learning
• Algorithmic decisions can have major safety implications

Real-world adversarial example

• For safety-critical domains, what happens if we just train the latest RL algorithm?


• Might work in the simulator it was trained in, but...
• Unclear how it will transfer to the real world (which has noise)
• How do we know it will satisfy hard safety constraints?
• What if the agent operates far from the training distribution?

left img: https://spectrum.ieee.org/cars-that-think/transportation/self-driving/three-small-stickers-on-road-can-steer-tesla-autopilot-into-oncoming-lane


right img: https://www.theverge.com/2018/6/22/17492320/safety-driver-self-driving-uber-crash-hulu-police-report
Example: Collision Avoidance with Deep RL (CADRL)
• Developed CADRL for decentralized multiagent collision avoidance [Chen17]
• Learn a value function of future cost (good performance)
• Encodes a policy for fast lookup (reaction speed); a toy sketch of the value-lookup idea follows below
• Use deep RL to learn a policy
• 26% performance improvement and achieved real-time performance

• Safety?
• Certification?
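A toy sketch of the "value function encodes a policy" idea: at run time the agent propagates each candidate action one step and takes the action with the best discounted value, which is what makes the lookup fast. The value function, kinematics, and action set below are placeholders; CADRL [Chen17] learns V with a deep network over the joint agent states:

```python
# Value-lookup policy sketch: argmax over candidate actions of discounted V at
# the one-step propagated state. V and the dynamics are toy stand-ins.
import numpy as np

def V(state):
    """Placeholder value function: prefer being close to a goal at (5, 5)."""
    return -np.linalg.norm(state[:2] - np.array([5.0, 5.0]))

def propagate(state, action, dt=0.2):
    """Simple kinematics: state = [x, y], action = velocity command [vx, vy]."""
    return state + dt * np.asarray(action)

def value_lookup_policy(state, candidate_actions, gamma=0.97, dt=0.2):
    """Fast lookup: one V evaluation per candidate action, then take the argmax."""
    scores = [gamma ** dt * V(propagate(state, a, dt)) for a in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]

state = np.array([0.0, 0.0])
actions = [np.array([vx, vy]) for vx in (-1, 0, 1) for vy in (-1, 0, 1)]
print("chosen action:", value_lookup_policy(state, actions))
```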

Safe & Robust Learning
• NNs in the feedback loop date back to the 1960s, popular again in the 1990s
• Previously: NNs for control → Now: deep NNs used for perception, planning, control, modeling

• Major discovery: adversarial examples [Szegedy13]

• Motivated a new field of neural network verification in the computer vision / machine learning communities (adversarial-example images: [Goodfellow14], [Eykholt17])

Spectrum of NN verification methods, from fast to exact: Interval Propagation [Gowal18] → LP [Weng18, Wong18, Singh18] → QP/SDP [Raghunathan18, Fazlyab19] → Reluplex [Katz17]



Safe & Robust Learning: Issues
Must address both analysis and synthesis

• Analysis issues:
• Efficiently and effectively capturing the input uncertainty set
• Computation time (real time) and scalability

• Synthesis issues:
• Are analysis methods fast enough to be embedded in the design process?
• Typically require millions of analysis steps for synthesis
• How much effort to exert on robustness during training, given the use of real-time analysis?

• Many analogies in this process to the issues faced in the control field in the ’80s–’90s

Figure 1.1: A picture history of control (Zhou, Doyle, and Glover, Robust and Optimal Control, Pearson, 1995)



Safe & Robust Learning: Future Directions
[Figure: progression of robust learning over time (increasing usefulness and difficulty), shown separately for isolated NNs and neural feedback loops. Isolated NNs: standard learning (ResNet, AlexNet) → empirically robust training (adversarial re-training, distillation) → formal robustness analysis (exact verification, relaxations) → future of robust learning (fast analysis during training & tighter analysis during inference). Neural feedback loops: standard learning (DQN, PPO, model-based RL) → empirically robust training (robust rewards, adversarial environments) → closed-loop reachability analysis → future of robust learning (robust synthesis; empirically robust training + formal analysis).]


Safe & Robust Learning: Tight Analysis via Partitioning
• Issue: Real-world uncertainties at NN inputs, so how to estimate their effect on the outputs?
• NNs are very high dimensional & nonlinear
• Need to approximate, but not be too conservative
• Idea: [Everett20b] combines NN relaxations [Salman19] with set partitions [Xiang20] (a toy sketch follows the figure note below)
• Result: Outer bounds on the output set are refined to approach inner bounds from MC samples
[Figure: an uncertain NN input set (sensor noise, adversarial attacks, ...) maps to an NN output set; N Monte Carlo samples give an inner approximation (convex hull of MC samples) of the worst-/best-case robust outputs, while the analysis produces an estimated outer bound.]

github.com/mit-acl/nn_robustness_analysis
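A toy sketch of the partitioning idea: bound the NN output set once over the whole input box, then again over a grid of smaller cells and take the union, which can only tighten the outer bound. Plain interval bound propagation stands in for the tighter relaxations of [Salman19] used in [Everett20b], and the random two-layer ReLU network is a placeholder (see the repository above for the real implementation):

```python
# Output-set over-approximation with and without input-set partitioning.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 2)), rng.normal(size=2)

def ibp(lo, hi):
    """Propagate an axis-aligned input box through W1, ReLU, W2 -> output box."""
    for W, b, relu in ((W1, b1, True), (W2, b2, False)):
        c, r = (lo + hi) / 2, (hi - lo) / 2
        c, r = c @ W + b, r @ np.abs(W)
        lo, hi = c - r, c + r
        if relu:
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

x0, eps = np.array([0.5, -0.2]), 0.3
lo1, hi1 = ibp(x0 - eps, x0 + eps)                 # one big box: loose bound

# Partition the input box into a 4x4 grid, bound each cell, take the union.
cells, edges = [], np.linspace(-eps, eps, 5)
for i, j in product(range(4), range(4)):
    lo = x0 + np.array([edges[i], edges[j]])
    hi = x0 + np.array([edges[i + 1], edges[j + 1]])
    cells.append(ibp(lo, hi))
lo2 = np.min([c[0] for c in cells], axis=0)
hi2 = np.max([c[1] for c in cells], axis=0)

print("unpartitioned output box width:", hi1 - lo1)
print("partitioned   output box width:", hi2 - lo2)   # tighter (or equal)
```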


Safe & Robust Learning: Certifiably Robust Deep RL
• Issue: How to analyze a learned policy and robustify the implementation?
• Idea: Formulate a robust version of action selection → then solve via convex relaxations (a toy sketch follows below)

Robust deep RL: safe but conservative → with partitioning: safe and faster
Examples: defending against attacks in images; learned collision avoidance with adversarial attacks on the blue agent's position; CARRL agent beats the computer despite imperfect measurements
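A toy sketch of robust action selection: rather than the action that maximizes Q at the observed state, take the action whose worst-case Q over the observation-uncertainty box is largest. A cheap interval relaxation of a random toy Q-network stands in here for the tighter certified bounds used in CARRL:

```python
# Robust (worst-case) action selection under bounded observation uncertainty.
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 3)), rng.normal(size=3)   # 3 discrete actions

def q_values(s):
    """Nominal Q(s, .) for a tiny placeholder ReLU network."""
    return np.maximum(s @ W1 + b1, 0) @ W2 + b2

def q_lower_bound(lo, hi):
    """Lower bound on Q(s, a) for all s in the box [lo, hi] (interval arithmetic)."""
    c, r = (lo + hi) / 2, (hi - lo) / 2
    c, r = c @ W1 + b1, r @ np.abs(W1)
    lo_h, hi_h = np.maximum(c - r, 0), np.maximum(c + r, 0)
    c, r = (lo_h + hi_h) / 2, (hi_h - lo_h) / 2
    return (c @ W2 + b2) - r @ np.abs(W2)

s_obs, eps = rng.normal(size=4), 0.1          # noisy/attacked observation, bound eps

nominal_action = int(np.argmax(q_values(s_obs)))
robust_action = int(np.argmax(q_lower_bound(s_obs - eps, s_obs + eps)))
print("nominal action:", nominal_action, "| robust action:", robust_action)
```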
Safe & Robust Learning: Closed-Loop Analysis
• Issue: How to analyze the safety of control systems with learned policies?
• Nonlinear & high dimensional neural network (NN) in the feedback loop

• Objective: Given uncertainty in the initial state, compute bounds on the trajectory (reachable sets) to ensure the system robustly terminates in the goal set while robustly avoiding obstacles

• Known linear dynamics, NN controller, ICs, subject to sensor & process noise



Safe & Robust Learning: Closed-Loop Analysis
• Issue: Prior work solved the problem using SDP relaxations → yields tight bounds, but too slow for real-time [Hu20]
• Idea: Linear approximation of the nonlinear activations → LP formulation that yields fast solutions [Everett20c] (a toy sketch follows below)
• As expected, the convex relaxations yield looser bounds than SDP, but can more than recover the lost accuracy by partitioning the input set (and recomputing)

Comparison with Reach-SDP [Hu20]: Reach-LP-Partition provides a factor-of-10 improvement in error and a factor of 2 in time
github.com/mit-acl/nn_robustness_analysis
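A toy sketch of closed-loop reachability for x_{k+1} = A x_k + B π(x_k): bound the NN controller's output over the current state box, push the box through the known linear dynamics, and repeat. Plain interval arithmetic stands in for the LP relaxations and partitioning of Reach-LP [Everett20c]; the dynamics, initial-condition box, and network weights are placeholders:

```python
# Forward reachable-set over-approximation for a linear system with an NN controller.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])        # double-integrator-like dynamics
B = np.array([[0.005], [0.1]])

rng = np.random.default_rng(2)
W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)

def control_bounds(lo, hi):
    """Interval bounds on u = W2 relu(W1 x + b1) + b2 over the state box [lo, hi]."""
    c, r = (lo + hi) / 2, (hi - lo) / 2
    c, r = c @ W1 + b1, r @ np.abs(W1)
    lo_h, hi_h = np.maximum(c - r, 0), np.maximum(c + r, 0)
    c, r = (lo_h + hi_h) / 2, (hi_h - lo_h) / 2
    return (c @ W2 + b2) - r @ np.abs(W2), (c @ W2 + b2) + r @ np.abs(W2)

lo, hi = np.array([0.9, -0.1]), np.array([1.1, 0.1])    # initial-condition box
for k in range(5):
    u_lo, u_hi = control_bounds(lo, hi)
    # Propagate the state and control boxes through x+ = A x + B u.
    cx, rx = (lo + hi) / 2, (hi - lo) / 2
    cu, ru = (u_lo + u_hi) / 2, (u_hi - u_lo) / 2
    c = A @ cx + B @ cu
    r = np.abs(A) @ rx + np.abs(B) @ ru
    lo, hi = c - r, c + r
    print(f"step {k + 1}: reachable box in [{lo}, {hi}]")
```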
Safe & Robust Learning: Goal-Oriented MPC
• Problem: Autonomous navigation among pedestrians or other robots
• Issue: MPC requires global guidance & RL does not necessarily satisfy hard constraints
• Idea: Train a subgoal recommendation policy with RL → use as guidance for MPC [Brito21]
• MPC for vehicle dynamics and collision constraints, RL to account for interactions with other agents



Figures: GO-MPC architecture; 8 second-order unicycles running GO-MPC
Safe/Robust Learning: Future Directions and Opportunities
• Accurately describing real-world uncertainties with analysis tools
• Yields (very conservative) overbounds on properties of interest
• Perception errors: Wasserstein models [Wong19] may be more realistic than Lp-balls
• Best framework to capture the error models?

• Robustness to incomplete training data


• Could result from poor simulation process or sparsely sampled state space
• Techniques such as [Kahn18, Lutjens19] designed to behave more conservatively when far
from training data

• Synthesizing known-to-be-safe NNs


• Control barrier functions [Ames14] as a non-learning wrapper to keep controllers safe
• Express policies in a form that can be verified faster or more accurately

github.com/mit-acl/nn_robustness_analysis


Multiagent RL (MARL)
Multiagent Problems
• Single-agent (deep) RL has achieved great success, but many applications involve interaction between multiple agents
• Multi-robot control, multiplayer games, and analysis of social dilemmas
• Key Issues: Agents learning to interact/communicate
• Challenges:
• Inherent non-stationarity due to simultaneously learning agents
• Who/what to communicate among collaborative agents?
• Scalability to many agents

Examples: robot soccer; StarCraft; Texas hold'em poker; connected and autonomous vehicles
Multiagent Reinforcement Learning
• Non-stationarity is a core issue in MARL
• Difficult to converge (e.g., to an equilibrium)
• Challenging to learn an effective policy
• Standard mitigation: Centralized training with decentralized execution (CTDE); a structural sketch follows the figure note below
• Centralized algorithms can have different structures [Yang20]
• Actor-critic methods generalize to mixed cooperative-competitive settings
• Current value-based methods mostly used for cooperative settings

Figures: actor-critic methods, e.g., MADDPG [Lowe17]; value-based methods, e.g., QMIX [Rashid18]; spectrum of cooperative MARL methods (Q-DPP) [Yang20]
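A structural sketch of CTDE in the MADDPG flavour [Lowe17]: at execution time each agent acts only on its own observation, while at training time a centralized critic conditions on everyone's observations and actions. The linear actors/critic and dimensions are toy placeholders:

```python
# Centralized training with decentralized execution, reduced to its data flow.
import numpy as np

N_AGENTS, OBS_DIM, ACT_DIM = 3, 4, 2
rng = np.random.default_rng(0)

# Decentralized actors: one small policy per agent, pi_i(o_i) -> a_i.
actors = [rng.normal(scale=0.1, size=(OBS_DIM, ACT_DIM)) for _ in range(N_AGENTS)]

# Centralized critic: Q(o_1..o_N, a_1..a_N) -> scalar, used only during training.
critic = rng.normal(scale=0.1, size=(N_AGENTS * (OBS_DIM + ACT_DIM), 1))

def act(agent_id, obs):
    """Execution is decentralized: agent i needs only its own observation."""
    return np.tanh(obs @ actors[agent_id])

def centralized_q(all_obs, all_actions):
    """Training is centralized: the critic sees the joint observation-action."""
    joint = np.concatenate([*all_obs, *all_actions])
    return (joint @ critic).item()

obs = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]
actions = [act(i, o) for i, o in enumerate(obs)]
print("joint Q estimate used for the policy-gradient update:", centralized_q(obs, actions))
```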
Multiagent Reinforcement Learning
• Centralized learning with decentralized execution (CTDE) has achieved great
performance in challenging domains
• Soccer [Liu19], Starcraft [Mahajan19], and Hide and seek [Baker20]



https://www.youtube.com/watch?v=kopoLzvh5jY&t=105s
• Some key limitations of current CTDE
• Considers the current action/policy of other agents, but not that they continue to learn/evolve
• Scalability: joint state-action space increases exponentially with the number of agents → natural driver to distributed/decentralized computation
MARL: Addressing Non-stationarity by Considering Peer Learning
• Learn to additionally account for changes in other agents' policies [Kim20]

New terms enable Meta-MAPG to actively influence the future policies of other agents as well, through the peer-learning gradient (missing in Meta-PG)

Meta-PG [Al-Shedivat18]
Meta-MAPG [Kim20]

Results show that Meta-MAPG successfully adapts to new and learning peer agents throughout the Markov chain

Baselines: Meta-PG [Al-Shedivat18], LOLA-DiCE [Foerster18], REINFORCE [Williams92]


Multiagent Reinforcement Learning
• Decentralized learning with decentralized execution: only interact with some of the other agents [Zhang19]

Figure: three representative information structures in MARL

• Various decentralized MARL methods


• Mean field MARL [Yang18]: only consider mean effect of
nearby neighbors (blue)
• Consensus-based algorithms: time-varying and possibly sparse
communication network [Zhang18, Li20]
• Learning to communicate [Agarwal19, Liu20]: but the learned communication graph is often dense → poor scalability

Mean field MARL


MARL: Learn Adaptive Sparse Communication Graph
• Observations:
• Interactions between agents are often sparse at any given time instant, e.g., in soccer
• Interchangeability exists among homogeneous agents, which can enable parameter-sharing
• Learn the communication graph via a dot-product attention mechanism [Vaswani17], in a principled way without prior knowledge
• New sparsity-induced activation function: adaptive projection onto a probability simplex [Sun20] (a toy sketch follows below)
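A toy sketch of the sparse-communication idea: compute dot-product attention scores between agents [Vaswani17], then replace softmax with a projection onto the probability simplex. A sparsemax-style projection is used here as a stand-in for the adaptive projection in [Sun20], so each agent ends up attending to only a few peers:

```python
# Dense (softmax) vs. sparse (simplex-projection) communication weights.
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = z_sorted + (1.0 / k) * (1 - cumsum) > 0
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max
    return np.maximum(z - tau, 0)

rng = np.random.default_rng(0)
n_agents, d = 6, 8
queries = rng.normal(size=(n_agents, d))
keys = rng.normal(size=(n_agents, d))

scores = queries @ keys.T / np.sqrt(d)              # dot-product attention scores
softmax_w = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
sparse_w = np.vstack([sparsemax(row) for row in scores])

print("dense graph edges per agent :", (softmax_w > 1e-6).sum(axis=1))
print("sparse graph edges per agent:", (sparse_w > 1e-6).sum(axis=1))
```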

Figures: formation task with inherent sparsity; reward comparison with MAAC [Iqbal19] and Dense-Att [Agarwal19]; our learned sparse communication graph
MARL: Future Directions and Opportunities

• Increasing scalability of decentralized learning


• Tractable means to address inherent non-stationarities

• Sharing policies/value functions for heterogeneous agents

• Learning of Bayes-optimal policy for optimal exploration and exploitation trade-off

• Adaptive interaction architecture search that can dynamically expand and shrink depending on scenario



Summary
• Talk has highlighted numerous issues and advances, with a focus on the remaining
challenges, future directions, and opportunities

• Continued efforts needed to improve algorithmic performance for many systems


• Reduced sample and computational complexity

• As technologies mature and are transitioned and deployed into real-world environments (e.g., in close proximity to humans), we will see significant continued challenges in the areas of trust, verification, and explainability

• Thanks to Dr. Kasra Khosoussi, Dr. Kaveh Fathian, Dr. Michael Everett,
Dr. Chuangchuang Sun, Dr. Golnaz Habibi, and Dong-Ki Kim

• Research funded in part by ONR, AFOSR, ARO, ARL, DARPA, Boeing, Ford, Lockheed,
IBM, NGS, and AWS



Learning at MIT/ACL
Value of Experiments

Observation: dealing with real dynamics, noise, and uncertainties helps validate assumptions made in the theory and/or identify gaps in the algorithms.

Observation: there are few MARL experiments, since they require robust & capable robots that can operate for extended periods.
References
• M. Cutler, Thomas Walsh, Jonathan P. How, “Real-World Reinforcement Learning via Multi-Fidelity Simulators,” IEEE Transactions on Robotics, Vol. 31(3), pp. 655–671, June 2015. doi:
10.1109/TRO.2015.2419431
• Cutler, Mark, Thomas J. Walsh, and Jonathan P. How. "Reinforcement learning with multi-fidelity simulators." 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014.
• Thakur, Sanjay, et al. "Uncertainty aware learning from demonstrations in multiple contexts using bayesian neural networks." 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.
• Peng, Xue Bin, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. "Sim-to-real transfer of robotic control with dynamics randomization." In 2018 IEEE international conference on robotics and
automation (ICRA), pp. 1-8. IEEE, 2018.
• Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine learning 8.3-4 (1992): 229-256.
• Iqbal, Shariq, and Fei Sha. "Actor-attention-critic for multi-agent reinforcement learning." International Conference on Machine Learning. PMLR, 2019.
• Yang, Y., Wen, Y., Chen, L., Wang, J., Shao, K., Mguni, D., & Zhang, W. (2020). Multi-Agent Determinantal Q-Learning. arXiv preprint arXiv:2006.01482.
• Lowe, R., Wu, Y. I., Tamar, A., Harb, J., Pieter Abbeel, O., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing
systems, 30, 6379-6390.
• Rashid, T., Samvelyan, M., De Witt, C. S., Farquhar, G., Foerster, J., & Whiteson, S. (2018). QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint
arXiv:1803.11485.
• Liu, S., Lever, G., Merel, J., Tunyasuvunakool, S., Heess, N., & Graepel, T. (2019). Emergent coordination through competition. arXiv preprint arXiv:1902.07151.
• Mahajan, A., Rashid, T., Samvelyan, M., & Whiteson, S. (2019). Maven: Multi-agent variational exploration. In Advances in Neural Information Processing Systems (pp. 7613-7624).
• Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., & Mordatch, I. (2019). Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528.
• Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., & Abbeel, P. (2017). Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint
arXiv:1710.03641.
• Foerster, J., Farquhar, G., Al-Shedivat, M., Rocktäschel, T., Xing, E. P., & Whiteson, S. (2018). Dice: The infinitely differentiable monte-carlo estimator. arXiv preprint arXiv:1802.05098.
• Zhang, K., Yang, Z., & Başar, T. (2019). Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635.
• Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., & Wang, J. (2018). Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438.
• Li, W., Jin, B., Wang, X., Yan, J., & Zha, H. (2020). F2A2: Flexible Fully-decentralized Approximate Actor-critic for Cooperative Multi-agent Reinforcement Learning. arXiv preprint arXiv:2004.11145.
• Agarwal, A., Kumar, S., & Sycara, K. (2019). Learning transferable cooperative behavior in multi-agent teams. arXiv preprint arXiv:1906.01202.
• Liu, Y. C., Tian, J., Ma, C. Y., Glaser, N., Kuo, C. W., & Kira, Z. (2020). Who2com: Collaborative perception via learnable handshake communication. arXiv preprint arXiv:2003.09575.
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).



References
• G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: An efficient SMT solver for verifying deep neural networks,” in Computer Aided Verification - 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I, pp. 97–117, 2017.
• S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. Mann, and P. Kohli, “On the effectiveness of interval bound propagation for training verifiably robust models,” arXiv preprint arXiv:1810.12715, 2018.
• A. Raghunathan, J. Steinhardt, and P. Liang, “Certified defenses against adversarial examples,” in International Conference on Learning Representations (ICLR), 2018.
• M. Fazlyab, M. Morari, and G. J. Pappas, “Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming,” arXiv preprint arXiv:1903.01287, 2019.
• H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel, “Efficient neural network robustness certification with general activation functions,” in Advances in Neural Information Processing Systems, pp. 4939–4948, 2018.
• T. Weng, H. Zhang, H. Chen, Z. Song, C. Hsieh, L. Daniel, D. Boning, and I. Dhillon, “Towards fast computation of certified robustness for ReLU networks,” in International Conference on Machine Learning (ICML), 2018.
• E. Wong and J. Z. Kolter, “Provable defenses against adversarial examples via the convex outer adversarial polytope,” in ICML, vol. 80 of Proceedings of Machine Learning Research, pp. 5283–5292, 2018.
• G. Singh, T. Gehr, M. Mirman, M. Püschel, and M. Vechev, “Fast and effective robustness certification,” in Advances in Neural Information Processing Systems, pp. 10802–10813, 2018.
• W. Xiang, H.-D. Tran, X. Yang, and T. T. Johnson, “Reachable set estimation for neural network control systems: A simulation-guided approach,” IEEE Transactions on Neural Networks and Learning Systems, 2020.
• Y. F. Chen, M. Liu, M. Everett, and J. P. How, “Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 285–292, IEEE, 2017.
• H. Hu, M. Fazlyab, M. Morari, and G. J. Pappas, “Reach-SDP: Reachability analysis of closed-loop systems with neural network controllers via semidefinite programming,” in 59th IEEE Conference on Decision and Control, 2020.
• C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations (ICLR), 2014.
• A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs for safety critical systems,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3861–3876, 2016.
• Salman, H., Yang, G., Zhang, H., Hsieh, C. J., & Zhang, P. (2019). A convex relaxation barrier to tight robustness verification of neural networks. In Advances in Neural Information Processing Systems (pp. 9835–9846).
• Sadeghi, F., & Levine, S. (2016). CAD2RL: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201.
• Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017, September). Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 23–30). IEEE.
• Guerra, W., Tal, E., Murali, V., Ryou, G., & Karaman, S. (2019). FlightGoggles: Photorealistic sensor simulation for perception-driven robotics using photogrammetry and virtual reality. arXiv preprint arXiv:1905.11377.
• Everett, M., Lutjens, B., How, J., Certified adversarial robustness in deep reinforcement learning, IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2021.
• Everett, M., Habibi, G., How, J., Robustness Analysis of Neural Networks via Efficient Partitioning with Applications in Control Systems, IEEE Control Systems Letters, 2021.
• Everett, M., Habibi, G., How, J., Efficient Reachability Analysis of Closed-Loop Systems with Neural Network Controllers, International Conference on Robotics and Automation (ICRA), 2021.



References
• J. Schmidhuber. Evolutionary principles in self-referential learning. on learning how to learn: The meta-meta-meta...-hook. Diploma thesis, Technische Universitat Munchen, Germany, 14 May 1987.
• S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, volume 2. Univ. of Texas, 1992.
• J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
• Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
• C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
• N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive metalearner. arXiv preprint arXiv:1707.03141, 2017.
• A. Nichol and J. Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.
• A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn. Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning. arXiv preprint
arXiv:1803.11347, 2018.
• S. Thrun. Lifelong learning perspective for mobile robot control. In Proceedings of the IEEE/RSJ/GI International Conference on Intelligent Robots and Systems, volume 1, pp. 23–30, 1994.
• M. D. Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. arXiv preprint arXiv:1909.08383v2,
2020.
• R. Caruana. Multitask learning. Mach. Learn., 28(1):41–75, July 1997.
• B. Bakker and T. Heskes. Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Research (JMLR), vol. 4, pp. 83–99, 2003.
• M. Taylor and P. Stone. Transfer Learning for Reinforcement Learning Domains: A Survey. JMLR, vol. 10, pp. 1633–1685, 2009.
• F. L. da Silva and A. H. R. Costa. 2019. A Survey on Transfer Learning for Multiagent Reinforcement Learning Systems. JAIR , 2019.
• Z. Li and D. Hoiem. Learning without forgetting. ECCV, pp. 614–629. Springer, 2016.
• J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of
the National Academy of Sciences, 2017.
• S. Lee, J. Kim, J. Jun, J. Ha, and B. T. Zhang. Overcoming catastrophic forgetting by incremental moment matching. NeurIPS, pp. 4652–4662, 2017.
• F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. ICML. pp. 3987–3995, 2017.
• A. Rannen, R. Aljundi, M. B. Blaschko, and T. Tuytelaars. Encoder based lifelong learning. ICCV, pp. 1320–1328, 2017.
• A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. ECCV, pp. 532–547, 2018.
• M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and Gerald Tesauro. Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference. ICLR, 2019.
• G. Gupta, K. Yadav, and L. Paull. La-maml: Look-ahead meta learning for continual learning. arXiv preprint arXiv:/2007.13904, 2020.



References
• J. A. Clouse. On integrating apprentice learning and reinforcement learning, 1996.
• L. Torrey and M. Taylor. Teaching on a budget: Agents advising agents in reinforcement learning. AAMAS, 2013.
• O. Amir, E. Kamar, A. Kolobov, and B. J. Grosz. Interactive teaching strategies for agent training. IJCAI, 2016.
• F. L. da Silva, R. Glatt, and A. H. R. Costa. Simultaneously learning and advising in multiagent reinforcement learning. AAMAS, 2017.
• S. Omidshafiei, D. Kim, M. Liu, G. Tesauro, M. Riemer, C. Amato, M. Campbell, and J. P. How. Learning to Teach in Cooperative Multiagent Reinforcement Learning. AAAI, 2019.
• E. İlhan, J. Gow, and Diego Perez-Liebana. Teaching on a Budget in Multi-Agent Deep Reinforcement Learning. CoG, 2019.
• D. Kim, M. Liu, S. Omidshafiei, S. Lopez-Cot, M. Riemer, G. Habibi, G. Tesauro, S. Mourad, M. Campbell, J. P. How. Learning Hierarchical Teaching Policies for Cooperative Agents. AAMAS, 2020.
