Robomorphic Computing

A Design Methodology for Computer Architecture
Parameterized by Robot Morphology

by
Sabrina M. Neuman
S.B., Massachusetts Institute of Technology (2009)
M.Eng., Massachusetts Institute of Technology (2011)
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2020
© Sabrina M. Neuman, MMXX. All rights reserved.
The author hereby grants to MIT permission to reproduce and to
distribute publicly paper and electronic copies of this thesis document
in whole or in part in any medium now known or hereafter created.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science
August 28, 2020
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Srinivas Devadas
Professor of Electrical Engineering and Computer Science
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Leslie A. Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, Department Committee on Graduate Students
2
A Design Methodology for Computer Architecture
Parameterized by Robot Morphology
by
Sabrina M. Neuman
Submitted to the Department of Electrical Engineering and Computer Science

on August 28, 2020, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
Abstract
Robots that safely interact with people are a promising solution to address societal
challenges from elder care to hazardous work environments. A key computational
barrier to the robust autonomous operation of complex robots is running motion
planning online at real-time rates and under strict power budgets. A performance gap
of at least an order of magnitude has emerged: robot joint actuators respond at kHz
rates, but promising online optimal motion planning for complex robots is limited to
100s of Hz by state-of-the-art software. While domain-specific hardware accelerators
have improved the power and performance of other stages in the robotics pipeline
such as perception and localization, relatively little work has been done for motion
planning. Moreover, designing a single accelerator is not enough. It is essential to
map out design methodologies to keep the development process agile as applications
evolve.
We address these challenges by developing a generalized design methodology for
domain-specific computer architecture parameterized by robot morphology. We (i)
describe the design of a domain-specific accelerator to speed up a key bottleneck in
optimal motion planning, the rigid body dynamics gradients, which currently con-
sumes up to 90% of the total runtime for complex robots. Acceleration is achieved
by exploiting features of the robot morphology to expose fine-grained parallelism
and matrix sparsity patterns. We (ii) implement this accelerator on an FPGA for a
manipulator robot, to evaluate the performance and power efficiency compared to
existing CPU and GPU solutions. We then (iii) generalize this design to prescribe
an algorithmic methodology to design such accelerators for a broad class of robot
models, fully parameterizing the design according to robot morphology. This research
introduces a new pathway for cyber-physical design in computer architecture, me-
thodically translating robot morphology into accelerator morphology. The motion
planning accelerator produced by this methodology delivers a meaningful speedup over
off-the-shelf hardware. Shrinking the motion planning performance gap will enable
roboticists to explore longer planning horizons and implement new robot capabilities.
3
Thesis Supervisor: Srinivas Devadas
Title: Professor of Electrical Engineering and Computer Science
4
Acknowledgments
I thank my advisor, Professor Srini Devadas, and my thesis committee members,
Professor Vijay Janapa Reddi and Professor Daniel Sanchez, for their guidance and
advice in preparation of this thesis.
I thank my collaborators who contributed to the projects described in this work:
Brian Plancher, Thomas Bourgeat, Thierry Tambe, Twan Koolen, Jules Drean, Jason
Miller, and Robin Deits.
Thanks to my MIT CSAIL colleagues, my friends, and my family. Special thanks
to Olivia Leitermann for her support.
5
6
Contents
1 Introduction 19
1.1 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2 Background 23
2.1 Robotics Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.1 Robot Morphology . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.2 Robotics Algorithms Using Robot Morphology . . . . . . . . . 24
2.1.3 Rigid Body Dynamics . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Computing Hardware Background . . . . . . . . . . . . . . . . . . . . 26
2.2.1 CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.2 GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.3 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.4 Other Architectures . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Benchmarking and Workload Analysis of Robot Dynamics Algo-

rithms 31
3.1 Survey of Software Libraries . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Hardware Measurement Setup . . . . . . . . . . . . . . . . . . 36
3.2.2 Software Environment . . . . . . . . . . . . . . . . . . . . . . 37
3.2.3 Inputs and Cross-Library Verification of Results . . . . . . . . 37
3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Forward Dynamics Results . . . . . . . . . . . . . . . . . . . . 39
7
3.3.2 Mass Matrix Results . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.3 Inverse Dynamics Results . . . . . . . . . . . . . . . . . . . . 42
3.3.4 Memory Usage . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.5 Sources of Parallelism . . . . . . . . . . . . . . . . . . . . . . 44
3.3.6 Sensitivity to Compiler Choice . . . . . . . . . . . . . . . . . . 45
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.1 Observed Trends . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 Opportunities for Performance Gains . . . . . . . . . . . . . . 46
3.4.3 Implications for Dynamics Gradients . . . . . . . . . . . . . . 47
3.4.4 Contributions and Future Work . . . . . . . . . . . . . . . . . 48
4 Robomorphic Computing: A Design Methodology for Computer Ar-

chitecture Parameterized by Robot Morphology 51
4.1 Motivating Use Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1.1 Key Kernel: Forward Dynamics Gradient . . . . . . . . . . . . 53
4.1.2 Control Rate Performance . . . . . . . . . . . . . . . . . . . . 54
4.2 Robomorphic Computing . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.1 Step 1: Create Parameterized HW Template (Once) . . . . . . 57
4.2.2 Step 2: Set Template Parameters (Per-Robot) . . . . . . . . . 58
5 Implementation and Evaluation of a Robomorphic Dynamics Gradi-

ent Accelerator for a Manipulator 61
5.1 Robomorphic Accelerator Design . . . . . . . . . . . . . . . . . . . . 62
5.1.1 Algorithm Details . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.2 Step 1: Create Parameterized HW Template (Once) . . . . . . 65
5.1.3 Step 2: Set Template Parameters (Per-Robot) . . . . . . . . . 70
5.2 Architectural Design Choices . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.1 Hardware-Software Codesign . . . . . . . . . . . . . . . . . . . 71
5.2.2 Folding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.3 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.4 Future Work in Design Optimization . . . . . . . . . . . . . . 76
8
5.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.2 FPGA Accelerator Evaluation . . . . . . . . . . . . . . . . . . 80
5.3.3 End-to-End System Evaluation . . . . . . . . . . . . . . . . . 84
5.3.4 Synthesized ASIC Power and Performance . . . . . . . . . . . 85
5.3.5 Projected Control Rate Improvement . . . . . . . . . . . . . . 87
6 Hardware-Software Codesign for the Robot Dynamics Gradient 89

6.1 Hardware-Software Codesign . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.1 Algorithmic Features . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.2 FPGA Codesign . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.1.3 CPU Codesign . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.1.4 GPU Codesign . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.2 Single Computation Results . . . . . . . . . . . . . . . . . . . 102
6.2.3 End-to-End Results . . . . . . . . . . . . . . . . . . . . . . . . 103
6.3 Contributions and Future Work . . . . . . . . . . . . . . . . . . . . . 106
7 Discussion and Future Work 107

7.1 Targeting Different Robot Models . . . . . . . . . . . . . . . . . . . . 107
7.2 Other Robotics Applications . . . . . . . . . . . . . . . . . . . . . . . 109
7.3 Automating the Methodology . . . . . . . . . . . . . . . . . . . . . . 109
8 Conclusion 111
9
10
List of Figures
1-1 Overview of the thesis work in relation to a longer-term research vi-

sion. This thesis work introduces a methodology to transform robot
morphology to hardware accelerator morphology, and demonstrates
the application of that methodology to an accelerator for a single mo-
tion planning kernel. This is a first step towards a future end-to-end
processor for robotics, automatically optimized per-robot. . . . . . . . 20
2-1 The processing pipeline for robotics. This work demonstrates applying
robomorphic computing to a kernel in the motion planning and control
stage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2-2 Robot morphology as a topology of limbs, rigid links, and joints. Robo-
morphic computing exploits parallelism and matrix sparsity patterns
determined by this structure. . . . . . . . . . . . . . . . . . . . . . . 24
2-3 High level tradeoffs between CPUs, GPUs, and FPGAs. Each platform
has strengths and weaknesses along different design axes. . . . . . . . 27
3-1 3D visualizations of the robot models used as benchmark cases. From

left to right, LBR iiwa (KUKA AG), HyQ (DLS Lab at the Italian
Institute of Technology), and Atlas (Boston Dynamics, version used
during the DARPA Robotics Challenge). . . . . . . . . . . . . . . . . 32
3-2 Runtime in microseconds. Shorter runtimes indicate better performance.
Within each cluster, the bars are in the same order as the top legend. 40
3-3 Instructions per cycle (IPC). Higher IPC indicates better instruction
throughput. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
11
3-4 Total instructions retired by the processor, categorized into memory
accesses (loads and stores), branches, and other instructions. Results
normalized to RBDL on iiwa. . . . . . . . . . . . . . . . . . . . . . . 40
3-5 Total clock cycles, categorized into memory stall cycles, non-memory
“other” stall cycles, and non-stall cycles. Results normalized to RBDL
on iiwa. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3-6 Total floating-point operations, categorized into double precision op-

erations with scalar, 128-bit packed vector, and 256-bit packed vector
operands. Results normalized to RBDL on iiwa. . . . . . . . . . . . 41
3-7 L1 data cache accesses, categorized into hits and misses. Results
normalized to RBDL on iiwa. . . . . . . . . . . . . . . . . . . . . . . 41
4-1 Overview of robomorphic computing, a design methodology to transform

robot morphology into customized accelerator hardware morphology
by exploiting robot features such as limb topology and joint type . . . 52
4-2 An online optimal motion planning and control system. The future mo-
tion trajectory of the robot is refined over iterations of the optimization
control loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4-3 Estimated control rates for three robots using different trajectory lengths
(based on state-of-the-art rigid body dynamics software implementa-
tions [9]), compared to ideal control rates required for online use [27].
We assume 10 iterations of the optimization loop. Current control rates
fall short of the desired 250 Hz and 1 kHz targets for most trajectory
lengths. This performance gap is worse for more complex robots, and
grows with the number of optimization iterations. . . . . . . . . . . . 55
12
4-4 Using the robomorphic computing design flow for motion planning
and control. First, we create a parameterized hardware template of
the algorithm. We identify limb and link-based parallelism in the
algorithm, as well as operations involving the sparse link inertia, joint
transformation, and joint motion subspace matrices. This is done only
once for any given algorithm. Then, for every target robot model, we
set the parameters of the hardware template based on the morphology
of the robot to create a customized accelerator. . . . . . . . . . . . . 59
5-1 Visualization of data flow in the inverse dynamics (ID) and ∇ inverse
dynamics (∇ID) algorithms, referenced against the robot links. Fully
parallelized, the latency of ID and ∇ID grows with 𝑂(𝑁 ). . . . . . . 66
5-2 Datapath of forward and backward pass units for a single link in Step 2
of Algorithm 1. The length of a single datapath and the total number
of parallel datapaths are both parameterized by the number of robot
links. The forward pass unit is folded into three sequential stages for
efficient resource utilization. . . . . . . . . . . . . . . . . . . . . . . . 67
5-3 Example of one row of a transformation matrix dot product functional

unit, 𝑋·, for the joint between the first and second links of a manipulator.
The sparsity of the tree of multipliers and adders is determined by the
robot joint morphology. . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5-4 Parameterized hardware template of the dynamics gradients accelerator.

The number of links 𝑁 , link inertial properties, and joint types are all
parameters that can be set to customize the microarchitecture for a
target robot model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5-5 We use the Kuka LBR iiwa [58] manipulator robot to demonstrate using
the robomorphic computing methodology to design and implement a
customized accelerator for the gradient of forward dynamics. . . . . . 71
13
5-6 Accelerator designs incorporating different steps of the dynamics gradi-
ent (Algorithm 1), as part of the hardware-software codesign process.
Our final design, on the right, implements all three steps. . . . . . . . 72
5-7 Effect on total FPGA resource utilization per forward pass unit from
folding the forward pass units along the dividing lines shown in Figure 5-
2. Folding these units conserves our most heavily-used resources on
the XCVU9P FPGA, the digital signal processing (DSP) blocks, a
reduction of 1.79× per forward pass unit. . . . . . . . . . . . . . . . . 75
5-8 FPGA coprocessor system implementation. . . . . . . . . . . . . . . . 79
5-9 Latency of a single computation of the dynamics gradient. Our FPGA

accelerator gives 8× and 86× speedups over the CPU and GPU. The
GPU suffers here from high synchronization and focus on throughput,
not latency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5-10 To reduce FPGA utilization, we implement a single transformation

matrix-vector multiplication unit for all joints in the target robot. Using
a superposition of sparsity patterns, we recover 33.3% of average sparsity
while conserving area. . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5-11 We used 32-bit fixed-point with 16 decimal bits in our design due to
FPGA multiplier constraints. However, a range of fixed-point numerical
types deliver comparable optimization cost convergence to baseline 32-
bit floating-point after a fixed number of optimization iterations. This
indicates that it is feasible to use fewer than 32-bits in future work.
Fixed-point labeled as “Fixed{integer bits, decimal bits}”. . . . . . . 83
5-12 End-to-end system latency for a range of trajectory time steps. Our
FPGA accelerator (F) gives speedups of 2.2× to 2.9× over CPU (C)
and 1.9× to 5.5× over GPU (G). . . . . . . . . . . . . . . . . . . . . 84
5-13 ASIC synthesis shows a 4.5× to 7.2× speedup in single computation

latency over the FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . 86
14
5-14 Projected control rate improvements from our accelerator using the
analytical model from Figure 4-3. We enable planning on longer time
horizons for a given control rate, e.g., up to about 100 or 115 time steps
instead of 80 at 250 Hz. ASIC results show a narrow range between
process corners. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6-1 Latency of a single computation of the dynamics gradient. Our novel

GPU implementation gives a 5.4× speedup over the previous state-of-
the-art GPU solution (see Figure 5-9). Our CPU implementation is
comparable to state-of-the-art. The FPGA implementation is the same
one developed in Chapter 5. . . . . . . . . . . . . . . . . . . . . . . . 102
6-2 Throughput of 𝑁 = 16, 64, 128, and 256 computations of the gradient
of robot dynamics for the iiwa manipulator for the (G)PU using (s)plit
and (f)used and (c)ompletely-fused kernels. Using hardware-software
codesign to move from the split to completely-fused kernel improved
our overall design latency by at least 2.8× for all values of 𝑁 . . . . . 104
6-3 Throughput of 𝑁 = 16, 64, 128, and 256 computations of the gradient
of robot dynamics for the iiwa manipulator for the (C)PU, (G)PU, and
(F)PGA using the (c)ompletely-fused codesigned kernel partitionings.
Our FPGA implementation excels at lower numbers of computations,
but its throughput restrictions limit performance at higher numbers.
Our novel GPU implementation demonstrates consistent speedups over
the CPU, and surpasses the FPGA performance at higher numbers of
computations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7-1 Other examples of joints on real robots [7, 58, 95]. The transformation
matrices of these joints exhibit different sparsity patterns, which robo-
morphic computing translates into sparse matrix-vector multiplication
functional units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
15
16
List of Tables
3.1 Dynamics Libraries Evaluated . . . . . . . . . . . . . . . . . . . . . . 34

3.2 Hardware System Configuration . . . . . . . . . . . . . . . . . . . . . 36
5.1 State-of-the-art CPU timing [9] for the steps of the forward dynamics
gradient (Algorithm 1). This timing breakdown informed our hardware-
software codesign. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Percent resource utilization (of totals, given in parentheses) for accel-
erator designs incorporating different steps of the dynamics gradient
on our target XCVU9P FPGA. Our final design, with all three steps,
makes heavy use of the digital signal processing (DSP) blocks, which
we use for matrix element multiplications. . . . . . . . . . . . . . . . 73
5.3 Hardware System Configurations . . . . . . . . . . . . . . . . . . . . 78
5.4 Synthesized ASIC (12nm Global Foundries) and baseline FPGA results
for accelerator computational pipeline. . . . . . . . . . . . . . . . . . 86
6.1 Algorithmic features of the gradient of rigid body dynamics and qual-
itative assessments of their suitability for different target hardware
platforms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.1 Examples of different model parameters for real robots (pictured in

Figure 7-1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
17
18
Chapter 1
Introduction
Robots that safely interact with people in dynamic and unpredictable environments
are a promising solution to improve the well-being of individuals in society, from their
use in elder care [48, 96] to promoting the health and safety of humans in hazardous
environments [62, 110]. A key computational barrier to the robust autonomous
operation of complex robots is running motion planning and control online at real-time
rates and under strict power budgets [27, 52, 87, 100]. Motion planning is the stage
in the robotics pipeline where robots calculate a valid motion path from an initial
position to a goal state. While domain-specific hardware accelerators have improved
the power and performance of other stages in the robotics pipeline such as perception
and localization [12, 91, 101], relatively little work has been done for motion planning
and control [61, 72]. Moreover, designing a single accelerator is not enough. It is
essential to automate hardware synthesis flows to keep the design process agile as
applications evolve [44].
We address this challenge by developing a design methodology for hardware
accelerators for robotics applications, using motion planning and control as our
motivating use case. Applying computer architecture insights to the robotics domain,
we exploit high-level information about robot topology to achieve higher performance
than existing CPU and GPU solutions. This work is a first step towards a longer-term
research vision: to develop a design flow to automatically generate a processor for the
complete end-to-end robotics pipeline, optimized for the target robot. A high-level
19
Figure 1-1: Overview of the thesis work in relation to a longer-term research vision.
This thesis work introduces a methodology to transform robot morphology to hardware
accelerator morphology, and demonstrates the application of that methodology to an
accelerator for a single motion planning kernel. This is a first step towards a future
end-to-end processor for robotics, automatically optimized per-robot.
overview of the work in this thesis is shown in Figure 1-1.
In this thesis, we demonstrate the systematic design of a robot topology-parameterized

hardware accelerator for a major bottleneck in online optimal motion planning and
control, the gradient of rigid body dynamics [9,87]. In this accelerator, we exploit robot
topology to determine parallelism and sparsity in the accelerator microarchitecture,
transforming robotics software optimizations [9, 66] to the hardware domain. We
implement the design on an FPGA for a manipulator as a proof of concept, demon-
strating a meaningful speedup over off-the-shelf CPU and GPU hardware. We also
collect power and area results for a simulated ASIC implementation of accelerator. We
generalize this design methodology to prescribe an algorithmic design flow to design
accelerator hardware parameterized by robot morphology. Finally, we propose how
that methodology can be expanded upon and automated in future work, to broaden
its applicability to other robot models and algorithms.
This research introduces a new pathway for the design of domain-specific computer
20
architecture for robotics, methodically translating robot morphology into accelerator
morphology. This work produces an accelerator implemented on an FPGA platform
that speeds up a key bottleneck in optimal motion planning and control, the dynamics
gradient, which currently consumes 30% to 90% of the total runtime for complex
robots [9]. Shrinking that performance gap will enable roboticists to explore longer
planning horizons and implement new robot capabilities.
1.1 Thesis Overview
We begin this thesis by providing background information about computer architecture

and robotics that is helpful for understanding the work presented in the thesis. After
that, the remainder of the thesis is structured as follows:
First, we present an initial foray into formal understanding of the computational
limitations of state-of-the-art robotics applications through the benchmarking and
workload analysis of robot rigid body dynamics algorithms. I conducted this work with
my MIT collaborators, Twan Koolen, Jules Drean, and Jason E. Miller. Koolen was
the expert on robotics topics, and helped me identify rigid body dynamics functions
as important bottlenecks in modern robotics applications. Koolen, Drean, and I all
collaborated on developing code to analyze four state-of-the-art rigid body dynamics
libraries, and releasing that code as an open-source benchmark suite, available at:
http://github.com/rbd-benchmarks/rbd-benchmarks.
Based on our findings from the benchmarking study, we (1) identified the gradient
of rigid body dynamics as a promising kernel for hardware acceleration; and (2)
realized an opportunity to translate per-robot optimizations that were successful in
the software domain, mapping them to the hardware domain. This resulted in the
development of the primary contribution of this thesis: a design methodology for
computer architecture parameterized by robot morphology. We use the gradient of
rigid body dynamics as motivation for the need for hardware acceleration, and then
we introduce our hardware design methodology, which we call robomorphic computing.
Next, to demonstrate the application of robomorphic computing, we use the
21
methodology to design a hardware accelerator for the gradient of rigid body dynamics.
We implement this accelerator design for an industrial manipulator target robot
model, on an FPGA and on a synthesized ASIC, evaluating performance compared to
state-of-the-art CPU and GPU baselines. I designed the accelerator myself, and wrote
the hardware description code. I collaborated with Harvard roboticist Brian Plancher
in analyzing the rigid body dynamics gradient kernel. Plancher prepared the CPU
and GPU baseline software implementations for comparison. My colleague Thomas
Bourgeat from MIT wrote a data-marshalling wrapper for my code, to implement
end-to-end communications between the accelerator and a host CPU. Thierry Tambe
from Harvard performed ASIC synthesis on the accelerator design to produce power
and area results. Using robomorphic computing, the accelerator demonstrates a
meaningful speedup over state-of-the-art CPU and GPU solutions.
Finally, using lessons learned from the development of the hardware accelerator,
we explore the tradeoffs of using hardware-software codesign on different hardware
platforms (CPU, GPU, and FPGA) to accelerate the gradient of rigid body dynamics.
For example, one key insight was that during the FPGA development process, we found
that it was especially important to carefully choose which portions of the dynamics
gradient kernel to migrate to the accelerator, to manage the I/O bandwidth. We
discuss this codesign process on the FPGA, as well as other platform-specific tradeoffs
on the CPU and GPU as well. Again, I designed all versions of the FPGA accelerator
myself and wrote the hardware description code. Thomas Bourgeat from MIT wrote
data-marshalling wrappers for all versions of my FPGA code. Brian Plancher from
Harvard wrote the CPU and GPU code. Plancher, Bourgeat, and I all collaborated
on high-level discussions about design choices for the CPU and GPU code, based on
what we learned from FPGA development. This work indicates a synergistic path
forward for software and hardware development for robotics applications, translating
best practices across different platforms.
22
Chapter 2
Background
2.1 Robotics Background
In robotics, the main processing pipeline can be broken down into three fundamental
stages: (1) perception, (2) mapping and localization, and (3) motion planning and
control (see Figure 2-1). These stages can be run sequentially as a pipeline or in parallel
loops leveraging asynchronous data transfers during runtime. During perception, a
robot gathers information from its sensors and processes that data into usable semantic
information (e.g., depth, object classifications). Next, the robot uses that labeled
data to construct a map of its surrounding environment and estimates its location
within that map. Finally, the robot plans and executes a safe obstacle-free motion
trajectory through the space. If this can be done online in real time, it allows the
robot to adapt to unpredictable environments. In this work, we focus on the problem
of online motion planning and control.
Figure 2-1: The processing pipeline for robotics. This work demonstrates applying
robomorphic computing to a kernel in the motion planning and control stage.
23
Figure 2-2: Robot morphology as a topology of limbs, rigid links, and joints. Robo-
morphic computing exploits parallelism and matrix sparsity patterns determined by
this structure.
2.1.1 Robot Morphology
Traditional robots, including most commercial manipulator arms, quadrupeds, and

humanoids, can be modeled as a topology of rigid links connected by joints (see
Figure 2-2). The morphology of the robot can be disassembled into three principal
components: 𝐿 limbs, a chain of 𝑁 links in a single limb, and the joints that connect
those links.
The rigid links of the robot’s limbs have inertial properties determined by their
shape and distribution of mass. The joint type describes the movement constraints
imposed upon the links connected by the joint. For example, a “revolute” or hinge
joint about the 𝑥-axis means that the links connected by that joint can only rotate
about the 𝑥-axis with respect to one another, and all other degrees of freedom are
constrained.
A quadruped robot might have, e.g., 𝐿 = 4 limbs, each with 𝑁 = 3 links, a
thigh, shin, and foot. The joint types might all be revolute about the 𝑥-axis. If the
quadruped has an arm mounted on its torso, then it would have 𝐿 = 5 limbs, where
e.g., the arm has 𝑁 = 4 links with 𝑧-axis revolute joints.
2.1.2 Robotics Algorithms Using Robot Morphology
Many critical robotics applications use information about the morphology of the robot,
including collision detection, localization, kinematics, and dynamics for both soft and
rigid robots [5, 24, 59, 82]. The design methodology we develop in this work can be
24
extended to all of these applications. In this paper, we will focus on applying the
methodology to develop an accelerator for one key kernel, the gradient of rigid body
dynamics.
2.1.3 Rigid Body Dynamics
Subject to the standard rigid body assumptions, the dynamics of a general robot
without kinematic loops can be described using the well-known equations of motion,
𝑀 (𝑞)𝑣˙ + 𝑐(𝑞, 𝑣) = 𝜏, (2.1)
where:
• 𝑞 ∈ R𝑛 is the joint configuration (or position) vector;
• 𝑣 ∈ R𝑚 is the joint velocity vector;
• 𝜏 ∈ R𝑚 is the joint torque (or effort/force) vector, which may include motor
torques determined by a controller as well as friction torques and other external
disturbances;
• 𝑀 (𝑞) ∈ R𝑚×𝑚 is the positive definite (joint space) mass or inertia matrix;
• 𝑐(𝑞, 𝑣) ∈ R𝑚 includes velocity-dependent terms and terms due to gravity, and is

referred to as the generalized bias force [31] or nonlinear effects.
Three standard problems related to the equations of motion are:
1. forward dynamics: computing 𝑣˙ given 𝑞, 𝑣, and 𝜏 ;
2. inverse dynamics: computing 𝜏 given 𝑞, 𝑣, and 𝑣;

˙
3. mass matrix: computing 𝑀 (𝑞).
Algorithms that solve these standard problems are based on traversing the kinematic
tree of the robot in several passes, either outward from the root (fixed world) to the
leaves (e.g., the extremities of a humanoid robot) or inward from the leaves to the root.
25
While the presented algorithms are often described as being recursive, in practice the
kinematic tree is topologically sorted, so that recursions are transformed into simple
loops, and data associated with joints and bodies can be laid out flat in memory.
Data and intermediate computation results related to each of the joints and bodies
are represented using small fixed-size vectors and matrices (up to 6 × 6).
The dominant inverse dynamics algorithm is the Recursive Newton-Euler Algorithm
(RNEA) [64]. It has 𝑂(𝑁 ) time complexity, where 𝑁 is the number of bodies.
The Composite Rigid Body Algorithm (CRBA) [106] is the dominant algorithm for
computing the mass matrix. It has 𝑂(𝑁 𝑑) time complexity, where 𝑑 is the depth
of the kinematic tree. For forward dynamics, the 𝑂(𝑁 ) articulated body algorithm
(ABA) is often used [30]. Alternatively, one of the libraries under study in Chapter 3
computes 𝑀 (𝑞) using the CRBA and 𝑐(𝑞, 𝑣) using the RNEA, and then solves the
remaining linear equation for 𝑣˙ using a Cholesky decomposition of 𝑀 (𝑞). Though this
approach is 𝑂(𝑁 3 ), shared intermediate computation results between the RNEA and
CRBA and the use of a highly optimized Cholesky decomposition algorithm could
make this approach competitive for small 𝑁 . Each of the algorithms is described in
detail in [31]. To bound the scope of this work, we focus on dynamics without contact.
2.2 Computing Hardware Background

The choice of computing hardware involves tradeoffs between the strengths and
weaknesses of different platforms, including programmer ease-of-use, throughput, and
latency, as illustrated in Figure 2-3. Given these tradeoffs, the ideal hardware choice
depends on the underlying application. In the remainder of this section, we give a
high-level overview of the strengths and weaknesses of different hardware platforms
with respect to these key design choices.
2.2.1 CPU
Most modern CPUs have several processor cores designed to work on different task
threads in parallel, and can support algorithms that require threads to either perform
26
Figure 2-3: High level tradeoffs between CPUs, GPUs, and FPGAs. Each platform
has strengths and weaknesses along different design axes.
the same instructions on multiple data inputs (SIMD parallelism) or multiple instruc-
tions on multiple data inputs (MIMD parallelism). A rich ecosystem of supporting
software tools makes CPUs easy to program. CPUs are also versatile, offering a large
variety of functional units per processor core. Each individual core runs at a high
clock rate (typically several GHz), and is optimized for high performance on sequential
programs that operate on working sets of data that fit well in their cache memory
hierarchy. CPUs offer a balance of high performance on latency-sensitive operations
and throughput from a modest degree of parallelism.
2.2.2 GPU
A GPU is a much larger set of very simple processors, optimized specifically for
parallel computations with identical instructions (SIMD parallelism). Compared to a
CPU, each GPU processor core has many more arithmetic logic units (ALUs), but
reduced control logic and a smaller cache memory. GPUs are best at computing
highly regular and separable computations over large working sets of data (e.g., large
matrix-matrix multiplication). GPUs also run at about half the clock rate of CPUs,
further hindering their performance on sequential code. When leveraging a GPU as
an coprocessor, all data must be first transferred from the host CPU’s memory to the
GPU’s and then back again after the computation is complete. GPU manufacturers
27
suggest amortizing this I/O overhead by performing large amounts of arithmetic
operations per round-trip memory transfer. GPUs are well-suited for applications that
are naturally parallel, require high throughput, and can tolerate latency on individual
sequential computations.
2.2.3 FPGA
Field-programmable gate arrays (FPGAs) have reconfigurable logic blocks and pro-
grammable interconnects that allow them to implement custom hardware functional
units and processing pipelines. The data flow on FGPAs can be customized to effi-
ciently compute code that has irregular memory accesses and high levels of branching.
To enable this flexibility, FPGA designs typically have even slower clock rates than
GPUs. However, because FPGA designs are customized to a particular task, they can
often eliminate large amounts of computational overhead. FPGAs offer fine-grained
parallelism at the instruction level, allowing parallel operation across small customized
functional units. Like GPUs, data must be transferred between a host CPU and FPGA
for most use cases, incurring an I/O overhead cost. FPGAs are generally considered
more challenging to program than CPUs or GPUs, and the resulting designs are more
difficult to modify. Overall, FPGAs can be well-suited to computations that benefit
from fine-grained, instruction-level parallelism, and are very sensitive to latency.
Another characteristic of FPGA designs is that they often use “fixed-point” arith-
metic, which treats decimal values as integers with a fixed decimal point location.
This allows common mathematical operations to be performed faster and to use less
area per computation compared to “floating-point” units, enabling higher performance
while consuming less energy. The tradeoff is that because the decimal point is fixed,
the dynamic range and precision of fixed-point numbers is reduced. However, for
many applications, this reduction still produces high quality end-to-end solutions (e.g.,
quantization for neural networks [34, 40, 90]).
28
2.2.4 Other Architectures
There are other computing architectures besides CPUs, GPUs, and FPGAs. Coarse-
grained reconfigurable architectures (CGRAs), for example, are a promising alternative
computational fabric to investigate besides FPGAs in future work.
One nontraditional architecture style that has been gaining popularity recently is
spatial or “dataflow” architectures. These platforms offer an extreme degree of compute
parallelism, laying out large, multidimensional arrays of processing elements arranged
in regular patterns. The objective of these architectures is to minimize unnecessary
data movement by allowing data to “flow” through the arrays of processing elements,
without frequent reads and writes to centralized memory data structures. Some
examples of spatial architectures include Google’s Tensor Processing Unit (TPU) [47],
and research accelerators like Eyeriss [12], and ExTensor [41].
Current spatial architectures largely target neural network and machine learning
applications, which feature large and very sparse matrices (around 50% to 99.9%
sparse [41]). By contrast, the key matrices in the robotics applications that we focus
on in this work are small (6 × 6 elements) and middlingly sparse (around 30% to 60%
sparse). Additionally, while some of these spatial architectures that target neural
networks address sparsity using compression approaches, e.g., compressed sparse row
(CSR) encoding [41], this is not suitable for taking advantage of the sparsity in the
applications we focus on in this work, as it introduces large overheads to perform
encoding and decoding.
Spatial architectures that do not lean heavily into encoding-based approaches

are a promising alternative platform, but few of them are currently commercially
available. The most readily-available platform is the NVIDIA GPU with Tensor Cores.
While the Tensor Cores themselves may be promising, tying them to a GPU platform
imposes the drawbacks of a GPU system: sequential dependencies and synchronization
points, which are abundant in our target algorithms (more details in Chapter 3 and
Chapter 5), will still hurt overall performance. Besides, GPU power dissipation is
very high, and the Tensor Cores are rigid in bit-width, whereas in later sections (see
29
Section 5.3) we will show there is promise for reduced precision in our algorithms of
interest. As a result, we do not evaluate these architectures in this thesis, and leave
their exploration for future work.
30
Chapter 3
Benchmarking and Workload

Analysis of Robot Dynamics
Algorithms
Modern robotics relies heavily on rigid body dynamics software for tasks such as
simulation, online control, trajectory optimization, and system identification. But while
early robots performed simple, repetitive tasks in constrained environments, robots
are increasingly expected to perform complex operations in dynamic, unstructured,
and unpredictable environments, ranging from non-standard manipulation tasks [13]
to disaster response [55]. These new challenges will require robots to adapt to their
environments in real-time, which in turn will require more complex control algorithms
that place a greater burden on the rigid body dynamics implementations that drive
them.
One trend in the effort to improve the adaptability of robots to their environment is
the increased use of nonlinear model-based control (MPC) [18], which moves trajectory
optimization, traditionally an off-line task, into the realm of online control. Differential
dynamic programming (DDP) techniques, including the iterative linear quadratic
regulator (iLQR) algorithm, are being deployed on more complex robots [78, 103].
These techniques require simulating into the future at every control time step, and also
require gradients of the dynamics. Where traditional control approaches might only
31
(a) LBR iiwa (b) HyQ (c) Atlas
Figure 3-1: 3D visualizations of the robot models used as benchmark cases. From left
to right, LBR iiwa (KUKA AG), HyQ (DLS Lab at the Italian Institute of Technology),
and Atlas (Boston Dynamics, version used during the DARPA Robotics Challenge).
evaluate dynamical quantities once per time step, MPC requires many evaluations, with
the quality of the control policy improving with longer time horizons. Another trend is
the use of machine learning algorithms trained on data obtained from simulation. The
performance of these algorithms typically improves with larger quantities of input data
[94]. Thus, speeding up the production of simulation samples may enable improved
performance and lower training costs.
Much work has gone into creating high-performance rigid body dynamics li-
braries [11, 33, 53, 66, 73, 97]. However, existing implementations still do not have the
performance necessary to satisfactorily run algorithms like iLQR for complex robots
on off-the-shelf hardware [87]. To help improve this situation, we present benchmark
results and an associated benchmark suite for rigid body dynamics libraries. We
include representative examples of three different robot categories (see Fig. 3-1):
a manipulator (LBR iiwa), a quadruped (HyQ), and a humanoid (Atlas). These
benchmarks are aimed at helping control engineers and library authors understand
differences and similarities between the libraries and identifying possible areas for
optimization. We perform a workload analysis in this work, and we also released our
suite for use by the larger community.
This work makes several important contributions:
1. Open-source benchmarking: github.com/rbd-benchmarks/rbd-benchmarks;
2. Direct comparison of four state-of-the-art rigid body dynamics libraries;
32
3. Consistent comparison of each of the libraries, using the same inputs and ensuring
that their outputs match;
4. Low-level performance statistics collected on modern hardware using hardware

performance counters;
5. An analysis of trends and differences across the distinct combinations of algorithm

and implementation, to provide insight for future work in optimization and
acceleration of this workload.
To our knowledge, this work provides the most comprehensive analysis of the
rigid body dynamics workload to date. Previous work analyzing various software
implementations focused only on overall performance [11, 33, 53, 66, 73], but not a full
workload analysis from a microarchitectural perspective. To our knowledge, the only
evaluation to analyze additional performance measurements was in [73], where the
authors used the profiling tool Valgrind to report instruction counts, cache misses,
and branch misprediction rates on several hardware platforms. However, this study
was much more limited in scope – only two software libraries and two dynamics
algorithms were measured. By contrast, our suite includes three key dynamics
algorithms implemented by four software libraries, and we present measurements taken
from hardware performance counters, including instruction counts, cache misses, stall
cycles, floating-point vector operations, and instruction mix.
We begin with a brief review of the rigid body dynamics problems under study. We
will then survey current, state-of-the-art software implementations of these algorithms
and characterize their performance and use of resources on modern hardware. Finally,
we will examine the commonalities and differences between these implementations
to motivate future work on the acceleration of these algorithms, an important step
towards satisfying the future computational needs of real-time robot motion control.
33
Table 3.1: Dynamics Libraries Evaluated
Library Language Linear Algebra Coord. Released
RBDL 2.6.0 [33] C++ Eigen 3.3.7 Body May 2018
Pinocchio 2.0.0 [11] C++ Eigen 3.3.7 Body Jan 2019
RigidBodyDynamics.jl Julia StaticArrays.jl 0.10.2, World Feb 2019
1.4.0 [53] OpenBLAS 0.3.3
RobCoGen 0.5.1 [66] C++ Eigen 3.3.7 Body Dec 2018
3.1 Survey of Software Libraries
We briefly introduce the software libraries under study and highlight some of their
commonalities and key differences. All of the libraries have been developed fairly
recently, and are actively maintained, open source, and provided free of charge. The
libraries can all load the layout and properties of a robot from the common ROS
URDF [88] file format, or a file format to which URDF can be converted using
automated tools. A well-known library missing from this study is SD-FAST [97]. We
chose not to include this library because it is somewhat older, proprietary, and the
authors of several libraries included in the study have published benchmarks that show
significant improvements over SD-FAST. The libraries are summarized in Table 3.1.
In contrast to the other libraries, RobCoGen is a Java tool that generates C++
source code specific to a given robot. RigidBodyDynamics.jl (referred to from this
point on as RBD.jl) is unique in that it is implemented in Julia [4]. It is also the only
library that doesn’t use the ABA for forward dynamics, instead solving for the joint
accelerations using a Cholesky decomposition of the mass matrix. Furthermore, RBD.jl
annotates the typical spatial vector and matrix types (e.g., wrenches) with an additional
integer describing the coordinate frame in which the quantity is expressed, for user
convenience and to enable frame mismatch checks (disabled for our benchmarks).
All of the C++ libraries use Eigen [39] as a linear algebra backend. Eigen provides
fixed-size matrix and vector types that can be allocated on the stack, thereby avoiding
any overhead due to dynamic memory allocation. Operations involving these types
can use vector (SIMD) instructions, if available. RBD.jl, uses StaticArrays.jl for
similar fixed-size array functionality, in addition to using OpenBLAS through Julia’s
34
LinearAlgebra standard library for operations on dynamically-sized matrices (mainly,
the Cholesky decomposition of the mass matrix).
A key difference between the libraries lies in how different joint types (e.g., revolute
or floating) are handled. While a naive implementation might rely on inheritance
and virtual methods, all of the libraries have avoided this in various ways to improve
performance. RBDL enumerates the different joint types, and uses run-time branching
based on the joint type in the main algorithm loops to implement joint-specific
functionality. RBD.jl’s non-standard choice of implementing the algorithms in world
coordinates allows joint-specific computations to be performed out-of-order: data
for all joints of the same type are stored in separate vectors and are iterated over
separately, avoiding if-statements in tight loops. Pinocchio handles different joint
types using a visitor pattern based on the Boost C++ library. RobCoGen’s generated
code unrolls all of the loops in the dynamics algorithms, replacing them with explicit
repetition of their contents and thus avoiding the overhead of program control flow
(e.g., branches and address calculations) in general, including in the implementation
of joint-type-specific behavior.
A peculiarity of RobCoGen is that the authors have opted to implement a hybrid

dynamics algorithm for floating-base robots, instead of regular RNEA. In essence,
this hybrid dynamics algorithm does forward dynamics for the floating base joint,
while implementing standard RNEA for the non-floating (revolute) joints. Although
the inputs and outputs of this algorithm are not the same as for the ‘regular’ inverse
dynamics algorithms implemented by the other libraries, we still chose to include the
results, as we deemed the algorithm to still be similar enough.
Both Pinocchio and RBD.jl implement the algorithms in a generic (parameterized or

templated) way. This enables, for example, automatic differentiation of the dynamics
using non-standard special scalar types and function overloading.
35
Table 3.2: Hardware System Configuration
Feature Value
Processor / Frequency Intel i7-7700, 4 Cores / 3.6GHz
Private L1 / L2 Cache per Core 8-way, 32kB / 4-way, 256kB
L3 Cache / DRAM Channels 16-way, 2MB / 2 Channels
3.2 Methodology
This section details our method for collecting timing results and measurements from
performance counters on a hardware platform.
3.2.1 Hardware Measurement Setup
Evaluation of the software implementations was performed on a modern desktop

machine with a quad-core processor. Key hardware parameters are shown in Table 3.2.
This machine was selected for the workload measurements because quad-core Intel i7
processors are a common choice for state-of-the-art robots, including many featured
in the 2015 DARPA Robotics Challenge [55], such as Atlas [56], Valkyrie [89], and
CHIMP [99].
To isolate our measurements from the non-deterministic effects of changing clock
frequencies and thread migrations, TurboBoost and HyperThreading were disabled
in the BIOS. The clock frequency of all four cores was fixed at 3.6GHz. To measure
timing, we used the Linux system call clock_gettime(), with CLOCK_MONOTONIC as
the source.
To collect measurements of architectural statistics, we used the hardware per-
formance counter registers [46] provided by the processor, which can be configured
to monitor a given set of hardware events (e.g., cache misses, instructions retired).
We accessed these registers using LIKWID [104], a C library interface for reading
hardware performance counters. In our testbed code, calls to the dynamics libraries’
routines were instrumented with LIKWID routines, in order to carefully measure only
activity during the regions of interest.
36
3.2.2 Software Environment
Our hardware measurement platform ran Ubuntu 16.04. For RBD.jl, we used version
1.1.0 of Julia [4] with flags -O3 and –check-bounds=no. All C/C++ code was
compiled using Clang 6.0, which we chose because both Clang 6.0 and Julia 1.1.0 use
LLVM 6 as their backend. The RBDL, Pinocchio, and RobCoGen C++ libraries were
compiled with the “Release” CMake build type option. For RobCoGen, we had to
add the EIGEN_DONT_ALIGN_STATICALLY flag to the default flags in the generated
CMake file, to avoid crashes due to memory alignment issues.
While Julia is JIT-compiled by default, for this study we statically compiled the
relevant RBD.jl functions to a C-compatible dynamic library ahead of time using
the PackageCompiler.jl Julia package [14], so as to avoid benchmark artifacts due to
JIT compilation and to enable interoperation with measurement tools. As a further
precaution against observing JIT overhead, we called all RBD.jl routines once before
starting measurements.
3.2.3 Inputs and Cross-Library Verification of Results
To compare the dynamics libraries, we used three different robot models as inputs:
iiwa, a robot arm with 7 revolute joints [57], HyQ, a quadruped robot with 12 revolute
joints and a floating joint [95], and Atlas, a humanoid robot with 30 revolute joints
and a floating joint [6] (fingers not modeled).
For verification, it was important to ensure that all libraries were manipulating the
exact same representations of the robot models. This was a non-trivial task because
RBDL, Pinocchio, and RBD.jl all take as input a URDF file describing the robot
model, whereas RobCoGen uses a custom file format, KinDSL. We started with a set
of reference URDF files for each robot. To generate the KinDSL files, we used a file
conversion tool created by the authors of RobCoGen, URDF2KinDSL [92]. This tool
performs not only a change of syntax in the robot representation, but also a number
of less-trivial transformations, e.g., from extrinsic to intrinsic rotations. Unfortunately,
URDF2KinDSL did not produce a KinDSL file that exactly matched the URDF
37
for Atlas, so we opted to convert each of the KinDSL files back to URDF using
RobCoGen’s built-in conversion tool. It was these back-and-forth-converted URDF
files that were ultimately used as the inputs for RBDL, Pinocchio, and RBD.jl. While
there was still a somewhat larger mismatch between RobCoGen’s outputs and those of
the URDF-capable libraries, these could be attributed to the rotation transformations.
To simulate the effect of running many dynamics calculations during the operation
of a robot with time-varying input data, we executed each dynamics calculation
100,000 times in a row, using different inputs each time. The libraries each expect
their inputs in a different format, for example as a result of depth-first or breadth-first
topological ordering of the kinematic tree. To enable cross-library verification, we used
RBD.jl to generate 100,000 random inputs for each of the robot models (in RBD.jl’s
format), after which the inputs were transformed into the format that each of the
libraries expects. Similarly, corresponding expected output values were computed
using RBD.jl, and transformed to the representation expected from each of the other
libraries. These expected results were used to verify that all libraries indeed performed
the same computation. For the inverse and forward dynamics algorithms, gravitational
acceleration and external forces (an optional input) were set to zero to facilitate direct
comparison of the libraries.
For each combination of algorithm and robot model, statistics were measured
for the entire set of inputs and then these numbers were divided by 100,000 to
calculate the average per calculation. To further insulate our measurements from
potential background noise, each experiment run of 100,000 inputs was performed 10
non-consecutive times, and the results across those 10 experiments were averaged.
3.3 Evaluation
This section presents the results of running the four different software implementations
of the robot dynamics algorithms for the three different robot models (Section 3.2).
The average execution times of the dynamics algorithms are shown in Fig. 3-2. Each
cluster of bars shows the results for all of the implementations on a particular robot
38
model.
The runtime values shown are averaged across multiple experimental runs of
100, 000 inputs each (as described in Section 3.2.3). We examined the standard
deviation, 𝜎, of the execution time of a single dynamics calculation from a single
input value, and we found that for all of our data collected, 0.2% < 𝜎 < 1.3% of the
overall mean runtimes. This suggests that for these implementations of the dynamics
algorithms, the performance is not sensitive to differences in the input data, as we
expected.
The top performing libraries overall were RobCoGen and Pinocchio. These results
will be explored in more detail in Sections 3.3.1, 3.3.2, and 3.3.3, which examine the
data from each algorithm separately. We will also comment on the memory usage of
the algorithms, and the effects of various types of parallelism on their performance.
3.3.1 Forward Dynamics Results
RobCoGen gave the fastest runtimes for forward dynamics for all robot models, at
1.1𝜇s, 2.4𝜇s, and 5.9𝜇s for iiwa, HyQ, and Atlas, respectively (Fig. 3-2a). The key
to RobCoGen’s performance on this algorithm is indicated by its instruction count
data (Fig. 3-4a). The total number of instructions retired by RobCoGen is much
lower than the total instructions retired by all of the other implementations. This
remarkable reduction can be attributed to RobCoGen’s technique of using explicit
loop unrolling. The unrolled loops can perform the same calculation with far lower
instruction count overhead because they eliminate branches (Fig. 3-4a) and calculation
of branch conditions, and can reuse address calculation temporaries. For the limited
number of links in the robot models evaluated, loop unrolling is an effective strategy
for performance (though it should be noted this may not necessarily be the case for
robots with many more links). RobCoGen gives the fastest performance because it
does so few instructions overall, despite having a somewhat lower rate of instruction
throughput, measured in instructions per cycle (IPC) (Fig. 3-3). (Note that the
maximum rate of IPC per core for our testbed hardware platform is 4 [19].)
The other three libraries, which do not perform explicit loop unrolling, all have
39
RBDL Pinocchio RBD.jl RobCoGen RBDL Pinocchio RBD.jl RobCoGen RBDL Pinocchio RBD.jl RobCoGen
16 7
16
14 6 14
12
MM Time [us]
FD Time [us]
ID Time [us]
12
10 10
4
8 8
3
6 6
4 2
4
2 1 2
0 0 0
a Q s a Q s a Q s
iiw Hy Atla iiw Hy Atla iiw Hy Atla
(a) Forward Dynamics Runtime (b) Mass Matrix Runtime (c) Inverse Dynamics Runtime
Figure 3-2: Runtime in microseconds. Shorter runtimes indicate better performance.

Within each cluster, the bars are in the same order as the top legend.
3.0 3.0 3.0
2.5 2.5 2.5
2.0 2.0 2.0
MM IPC
FD IPC
ID IPC
1.5 1.5 1.5
1.0 1.0 1.0
0.5 0.5 0.5
0.0 0.0 0.0
a Q s a Q s a Q s
(a) Forward Dynamics (b) Mass Matrix (c) Inverse Dynamics

Instructions Per Cycle Instructions Per Cycle Instructions Per Cycle
Figure 3-3: Instructions per cycle (IPC). Higher IPC indicates better instruction
throughput.
9 7 20
MM Instructions (Norm.)
FD Instructions (Norm.)
ID Instructions (Norm.)
8 6
7 15
load 5 load load
6
5 store 4 store store
10
4 branch 3 branch branch
3 other other other
2 5
2
1 1
0 0 0
a Q s a Q s a Q s
(a) Forward Dynamics Instructions (b) Mass Matrix Instructions (c) Inverse Dynamics Instructions
Figure 3-4: Total instructions retired by the processor, categorized into memory
accesses (loads and stores), branches, and other instructions. Results normalized to
RBDL on iiwa.
40
8 6 20
7
MM Cycles (Norm.)
FD Cycles (Norm.)
ID Cycles (Norm.)
6 15
5 mem. stall
4 mem. stall mem. stall
4 other stall 3 other stall 10 other stall
3 execution execution execution
2
2 5
1 1
0 0 0
a Q s a Q s a Q s

Clock Cycles Clock Cycles Clock Cycles
Figure 3-5: Total clock cycles, categorized into memory stall cycles, non-memory
“other” stall cycles, and non-stall cycles. Results normalized to RBDL on iiwa.
MM Floating-Point Ops. (Norm.)
FD Floating-Point Ops. (Norm.)
ID Floating-Point Ops. (Norm.)

7 7 45
6 6 40
35
5 5
dp scalar dp scalar
30 dp scalar
4 4 25
dp 128b dp 128b dp 128b
3 3 20
dp 256b dp 256b dp 256b
15
2 2
10
1 1 5
0 0 0
a Q s a Q s a Q s

Floating-Point Operations Floating-Point Operations Floating-Point Operations
Figure 3-6: Total floating-point operations, categorized into double precision operations
with scalar, 128-bit packed vector, and 256-bit packed vector operands. Results
normalized to RBDL on iiwa.
MM L1-D Accesses (Norm.)
FD L1-D Accesses (Norm.)
ID L1-D Accesses (Norm.)
8 7 45
7 6 40
6 35
5
5 30
miss 4 miss 25 miss
4
hit 3 hit 20 hit
3 15
2 2
10
1 1 5
0 0 0
a Q s a Q s a Q s

L1 Data Cache Accesses L1 Data Cache Accesses L1 Data Cache Accesses
Figure 3-7: L1 data cache accesses, categorized into hits and misses. Results normalized
to RBDL on iiwa.
41
similar rates of instruction throughput (Fig. 3-3), so it is not surprising that their
relative runtimes directly correspond to the total instructions retired by each. Recall
that RBD.jl, which generates the most instructions (Fig. 3-4a), also uses a different
algorithm for forward dynamics than the other libraries (see Section 2.1.3), which
requires significantly more memory accesses (loads and stores).
These results indicate that reducing the overall amount of instructions to be
performed and avoiding extra work (e.g., branch calculations) that does not directly
contribute to the algorithm is a clear path to performance success for this algorithm.
For RobCoGen, this reduction came from aggressive loop unrolling, leading to a
significant reduction in the number of instructions.
3.3.2 Mass Matrix Results
For the mass matrix calculation, RobCoGen was fastest for iiwa and HyQ (at 0.6𝜇s
and 1.1𝜇s, respectively), but Pinocchio was slightly faster for the Atlas robot, at
3.5𝜇s (Fig. 3-2b). RobCoGen’s impressive performance on this algorithm has the
same explanation as its superior performance on the forward dynamics algorithm
(Section 3.3.1): RobCoGen’s optimized code greatly reduces the number of superfluous
instructions (Fig. 3-4b) that are not related to the core calculations of the algorithm.
Once again, RobCoGen’s unrolled loops cut down on its total number of branch
instructions (Fig. 3-4b) and L1 data cache accesses (Fig. 3-7b) compared to the other
implementations.
Pinocchio’s good performance on this algorithm corresponds with an increased
IPC throughput (Fig. 3-3b) relative to the other libraries.
3.3.3 Inverse Dynamics Results
For inverse dynamics, RobCoGen was the fastest library for iiwa (at 0.6𝜇s). However,
for the floating base robots, RobCoGen was the slowest library by far, and Pinocchio
gave the fastest runtimes, at 1.5𝜇s and 3.5𝜇s for HyQ and Atlas, respectively (Fig. 3-
2a). This is due to the much larger number of instructions executed. In these cases,
42
RobCoGen executes many more loads and stores (Fig. 3-4c) than the other libraries,
and it also performs many more floating-point operations (Fig. 3-6c).
To understand this degradation of performance, recall that RobCoGen in fact
implements a hybrid dynamics algorithm for floating base robots (see Section 3.1).
For more detail, we profiled RobCoGen with the software profiling tool Valgrind [74].
We found that there were a handful of function calls in RobCoGen’s inverse dynamics
routine that were generating a clear majority of the function calls, as well as the
resulting high numbers of branch instructions and L1 data cache accesses observed in
the data from the hardware performance counters (Figs. 3-4c and 3-7c). Upon reviewing
the code, we found that this is likely caused by coordinate-frame transformation of
composite rigid body inertias, which are the intermediate result of the hybrid dynamics
algorithm that RobCoGen uses for floating-base robots. We suspect that this operation
can be easily and significantly optimized, as RobCoGen’s mass matrix algorithm
already uses a more efficient implementation of the same coordinate transformation.
In addition, this coordinate transformation is only needed as a result of RobCoGen’s
choice to use a hybrid dynamics algorithm for floating-base robots; the baseline
recursive Newton-Euler algorithm used by other libraries does not require this step.
For the iiwa robot, this effect does not come into play because RobCoGen does use
‘regular’ (non-hybrid) recursive Newton-Euler for fixed-base robots.
The runtimes for Pinocchio and RBDL are fairly close for inverse dynamics, with
Pinocchio giving the shortest runtimes for the HyQ and Atlas robots. RBD.jl executes
significantly more instructions overall than Pinocchio and RBDL, so it has a longer
runtime than those libraries.
3.3.4 Memory Usage
All of the software implementations spent the clear majority of their cycles on execution
(Fig. 3-5) rather than waiting for memory (stall cycles), so they can all be considered to
be compute-bound, not memory-bound. In fact, most of the software implementations
suffered almost no misses in the L1 data cache (Fig. 3-7), with average miss rates of
all computations < 1.4%. From this, we can see that for most of the implementations
43
of these algorithms, the working set fits comfortably in the 32kB L1 data cache on
this machine.
For inverse dynamics on HyQ and Atlas, RobCoGen has a much higher number
of L1 memory accesses (Fig. 3-7c). This is caused by non-optimal access patterns
related to the coordinate-frame transformation described in Section 3.3.3. Again,
this high number of memory accesses corresponds with a higher total number of
instructions executed (Fig. 3-2c) and greatly increases the runtime of RobCoGen for
inverse dynamics on floating-base robots.
A likely cause for RBD.jl’s comparatively high number of memory accesses is its
use of the integer frame annotations mentioned in Section 3.1, which need to be copied
along with each intermediate computation result.
3.3.5 Sources of Parallelism
All of the studied implementations are written sequentially at the top level, largely
because they implement a set of recursive algorithms. As a result, there is currently
no task-level parallelism exploited by any of the libraries. However, some future
opportunities for task-level parallelism are within reach. For example, RBD.jl uses
a world frame implementation (see Section 2.1.3), which enables a loop interchange
where joints of the same type can be stored in separate vectors and iterated over not
in topological order. This presents an opportunity for task-level parallelism, however,
it is not currently implemented in a parallel manner.
There was also data-level parallelism present in the floating-point workload (see
Fig. 3-6). Interestingly, RBD.jl was the only implementation whose linear algebra
library made significant use of the widest vector operations available, the 256-bit packed
double precision floating-point instructions. The main source of these densely packed
operations is Julia’s use of the superword-level parallelism (SLP) vectorizer LLVM
compiler pass in combination with unrolled code for small linear algebra operations
generated by StaticArrays.jl. While RBD.jl took a performance hit by generating
many more instructions overall than the other libraries (Fig. 3-4), a combination of
instruction codebase efficiency and vectorization together could result in increased
44
performance.
3.3.6 Sensitivity to Compiler Choice
All results presented in this section were taken from software compiled with Clang
6.0.1, but we performed some additional experiments to see the effect of compiling
the non-Julia libraries (RBDL, Pinocchio, and RobCoGen) with GCC/G++ 7.4,
released the same year as Clang 6.0.1. For RBDL and Pinocchio, results with GCC
demonstrated degraded performance (e.g., runtime increased by 54% for forward
dynamics with Pinocchio). For RobCoGen, performance was roughly the same for
forward dynamics and the mass matrix. The only case where performance improved at
all from using GCC was with RobCoGen on the inverse (or rather, hybrid) dynamics
benchmark for HyQ and Atlas (both runtimes decreased by about 40%). However,
these times are still far behind the best times observed for inverse dynamics using
Clang (see Fig. 3-2). From these additional experiments, it is clear that compiler
choice can have a large impact on performance for these applications.
3.4 Discussion
In this section, we note several trends that span the different algorithms and imple-
mentations and describe some possible strategies for improving performance in future
work. We also briefly speculate on the implications that our findings for the dynamics
workloads might have for another related set of computations, the dynamics gradients.
3.4.1 Observed Trends
One clear trend in our results is that none of these calculations are memory-bound.
Interestingly, all of them show extremely low L1 cache miss rates (except for Atlas
with RBD.jl where they are merely low, see Fig. 3-7) despite having an unusually high
proportion of load and store instructions (Fig. 3-4). From this, we conclude that these
routines have very small working sets with large amounts of locality (either spatial or
45
temporal). This is consistent with the majority of calculations being linear algebra
routines on small arrays. However, the large proportion of loads and stores indicates
that few operations are being performed on each fetched element before it is stored.
This suggests that there will be an opportunity to improve performance by combining
operations or reorganizing data access patterns to avoid loading and storing the same
values repeatedly.
Another observation is that the scaling trends of the algorithms (see Section 2.1.3),
are also demonstrated by their corresponding software implementations. Performance
and most other measures scale approximately linearly with robot complexity (i.e.,
number of joints). The Atlas robot has approximately 4.4× the number of joints of
iiwa and takes about 6× as long to calculate, on average. HyQ falls proportionally in
the middle. This suggests that our results can be extrapolated to estimate performance
for other robots using the degrees of freedom. However, there may be a point at higher
numbers of joints where the internal matrices’ sizes will exceed the L1 cache size and
performance will degrade substantially.
One final observation relates to the use of floating-point operations. These routines
vary considerably in how much they use packed (vector) floating-point instructions.
Since the majority of the math they are doing is linear algebra, we would expect this
workload to be highly amenable to vectorization. This suggests two things: 1) that
having high-performance floating-point units with vector support will benefit these
algorithms, and 2) that these implementations are probably not taking advantage of
vector floating-point instructions as much as they could be.
3.4.2 Opportunities for Performance Gains
As previously mentioned, all of these implementations have some inefficiencies in their

operation. The large proportion of load and store instructions indicates that there is
a lot of overhead beyond what the mathematics requires. This view is supported by
the lower overheads we see in RobCoGen and Pinocchio; for forward dynamics and
the mass matrix calculation, they perform significantly fewer loads and stores than
the other libraries. RobCoGen further reduces its total number of instructions by
46
unrolling loops and eliminating conditional branching, efficiently reducing overhead.
Careful profiling may expose additional opportunities.
A major opportunity for performance gains for the dynamics algorithms would be
better use of parallel resources. There would seem to be room for improvement in
all three types of parallelism: instruction, data and task. While the instruction-level
parallelism (ILP) we measured was reasonable, the processor in our machine is capable
of much more. In addition, there is much variability in the use of 128- and 256-bit
vector operations indicating that these highly-efficient data-parallel operations may be
under-utilized. Finally, none of the implementations make effective use of task-level
(a.k.a. thread-level) parallelism. This means that three of the four cores in our
testbed machine went unused. As trends in processor architectures are towards greater
parallelism rather than greater single-thread performance, it would be worthwhile to
exploit these resources.
3.4.3 Implications for Dynamics Gradients
As novel control techniques push more of the motion planning workload to the low-
level, high-rate part of a robot control architecture, an important requirement will
be fast evaluation of gradients of the dynamics. This is because motion planning
techniques typically employ gradient-based optimization and local approximations of
the dynamics.
Various approaches can be employed to compute gradients of dynamics-related
quantities. Perhaps the easiest but crudest technique is numerical differentiation.
Automatic differentiation may also be employed, which exploits the chain rule of
differentiation to compute exact gradients evaluated at a point, often at a reduced
computational cost compared to numerical differentiation. Employing automatic dif-
ferentiation requires writing the dynamics algorithms in a generic (in C++, templated)
way, so that non-standard input types that carry derivative-related information may
be used. Pinocchio, RBD.jl, and a fork of RobCoGen [78] are written to support
such non-standard inputs. Further performance gains may be achieved using analyti-
cal derivatives that exploit the structure of rigid body dynamics, as currently only
47
implemented by Pinocchio [11].
The relation between the performance of algorithms for dynamics quantities and
for their gradients is perhaps clearest for numerical differentiation, where the original
algorithm is simply run once for each perturbation direction. However, each of these
gradient computation approaches has clear links to the basic algorithms analyzed in
this work. As such, we expect insights gained and performance gains made for the
basic dynamics algorithms to extend to gradient computations to a large degree.
We also note that if gradients are required, it may be more worthwhile to utilize
task-level parallelization in the computation of gradients, rather than in the basic
algorithms themselves, because gradient computations can be trivially parallelized
with one partial derivative per thread, while we also expect lower threading overhead
due to a more significant workload per thread.
3.4.4 Contributions and Future Work
The calculation of robot dynamics is an essential and time-consuming part of controlling

a highly-articulated robot. This work introduced a new benchmark suite with four
different implementations of the three most commonly-used dynamics calculations.
Our initial analysis of the suite provides information to help steer future optimization
and acceleration of this workload.
Our goal in this work was to provide researchers and library implementers with
information they can use to understand and optimize these calculations in the future.
We found significant similarities and differences between the implementations, which
helps to highlight which characteristics are intrinsic to the problem and which are due
to the particular implementation. Key insights are that all of these implementations
have small working sets and are not highly parallelized. However, it is also clear that
they differ in the efficiency with which they perform the essential calculations, leading
us to believe that there are still significant opportunities for improvement.
One promising avenue for future work is the use of parallelism. These imple-
mentations do not fully exploit the parallelism currently available on even modest
desktop computers. It is possible that by using different algorithms or data structures,
48
additional parallelism can be exposed.
In Chapter 5, we design and evaluate a hardware accelerator to exploit these
opportunities for parallelism in calculating the gradient of forward dynamics.
To guide this design process, in Chapter 4 we present a design methodology for
computer architecture parameterized by robot morphology. Building on the success
of dynamics library techniques that optimize software for a particular robot model,
including function templating in Pinocchio [11] and code generation in RobCoGen [66]
(Section 3.1), our systematic design methodology introduces robotics software opti-
mizations to the hardware domain by exploiting high-level information about a robot
model to parameterize features of the accelerator architecture.
49
50
Chapter 4
Robomorphic Computing: A
Design Methodology for Computer
Architecture Parameterized by
Robot Morphology
Complex robots such as manipulators, quadrupeds and humanoids that can safely
interact and cooperate with people in dynamic, unstructured, and unpredictable
environments are a promising solution to address critical societal challenges, from elder
care [48, 96] to the health and safety of humans in hazardous environments [62, 110]. A
major obstacle to the deployment of complex robots is the need for high-performance
computing in a portable form factor. Robot perception, localization, and motion
planning applications must be run online at real-time rates and under strict power
budgets [27, 52, 87, 100].
Domain-specific hardware acceleration is a emerging solution to this problem,

building on the success of accelerators for other domains such as neural networks [12,
47, 91]. However, while accelerators have improved the power and performance of
robot perception and localization [12, 91, 101], relatively little work has been done for
motion planning [61, 72].
51
Figure 4-1: Overview of robomorphic computing, a design methodology to transform
robot morphology into customized accelerator hardware morphology by exploiting
robot features such as limb topology and joint type. This methodology can be applied
to a wide variety of complex robots. Pictured are the Atlas [7], Spot [8], and LBR
iiwa [58] robots, as examples.
Motion planning is the stage in the robotics pipeline where robots calculate a valid
motion path from an initial position to a goal state. Adaptive, online motion planning
approaches [78, 103] rely heavily on latency-critical calculation of functions describing
the underlying physics of the robot, e.g., rigid body dynamics and its gradient [9,31,37].
There exist several competing software implementations that are sufficient for use in
traditional control approaches [11, 33, 53, 66, 73, 97], but emerging techniques such as
nonlinear model predictive control (MPC) [18, 52] reveal a significant performance gap
of at least an order of magnitude: robot joint actuators respond at kHz rates, but these
promising approaches for complex robots are limited to 100s of Hz by state-of-the-art
software [27, 87]. This gap persists despite the use of software templating and code
generation to optimize functions for a particular robot model [11, 66] (as seen in
Chapter 3).
Hardware acceleration can shrink the motion planning performance gap, but the
paramount challenge in designing accelerators for all robotics applications has been
52
to provide formalized methodologies for accelerator design that can generalize across
different robot platforms or different algorithms. Traditional accelerator design can
be tedious, iterative, and costly because there is no principled, turn-key methodology.
It is essential to define systematic hardware synthesis flows to keep the design process
agile as applications evolve [44].
We address this challenge with robomorphic computing: a methodology to transform
robot morphology into customized accelerator hardware morphology. Our design
methodology (summarized in Figure 4-1) introduces a mapping between the physical
structure of a robot and basic architectural primitives such as parallelism and data
structure sparsity. In the robomorphic computing design flow: (1) a parameterized
hardware template is created for a robotics algorithm once, exposing parallelism and
matrix sparsity; then, (2) for each robot, template parameters are set according to
the robot morphology, e.g., limb topology and joint types, creating an accelerator
customized to that robot model.
This work provides a roadmap for future hardware accelerators for robotics. Our
design flow provides a reliable pathway towards identifying useful algorithmic features
in robotics applications, and a mechanistic way of encoding them in hardware. This
relieves the burden of hardware designers in approaching new algorithms and robots.
4.1 Motivating Use Case

As motivation for the robomorphic computing methodology, we examine a particular
use case: a computational bottleneck kernel in motion planning and control. In
Chapter 5, we use robomorphic computing to implement a hardware accelerator
targeting this application.
4.1.1 Key Kernel: Forward Dynamics Gradient
A key kernel in many motion planning techniques is the first-order gradient of for-
ward dynamics, which can be calculated in several ways: finite differences [52, 103],
Lagrangian derivation [36], automatic differentiation [37], or direct analytical deriva-
53
Figure 4-2: An online optimal motion planning and control system. The future motion
trajectory of the robot is refined over iterations of the optimization control loop.
tion [9]. The fastest of these methods uses analytical derivatives of the the recursive
Newton-Euler algorithm (RNEA) for inverse dynamics [9, 31, 63]. The result is then
multiplied by a matrix of inertial quantities, to recover the gradient of forward dynamics
(algorithm details in Section 5.1.1).
As seen in Chapter 3, the fastest software implementations of rigid body dynamics
and its gradient [76] use templating and code generation [11, 66] to optimize functions
for a particular robot model, incorporating robot morphology features into the code.
In this work, we extend this software approach to the hardware domain with the robo-
morphic computing methodology. Using this methodology, we exploit computational
opportunities in the dynamics gradient kernel to design a hardware accelerator.
4.1.2 Control Rate Performance
To motivate our work, we focus on a promising online motion planning and control
approach, nonlinear model predictive control (MPC) [27, 52, 87] (see Figure 4-2).
Nonlinear MPC involves iteratively optimizing a candidate trajectory describing a
robot’s motion through space over time. This trajectory is made up of the robot’s
state at discrete time steps, looking some time horizon into the future. Longer
time horizons increase resilience to disturbances, and can enable robots to perform
complicated movements and behaviors. This online approach allows a robot to adapt
to unpredictable environments by quickly recomputing safe trajectories in response to
changes in the world.
54
Figure 4-3: Estimated control rates for three robots using different trajectory lengths
(based on state-of-the-art rigid body dynamics software implementations [9]), compared
to ideal control rates required for online use [27]. We assume 10 iterations of the
optimization loop. Current control rates fall short of the desired 250 Hz and 1 kHz
targets for most trajectory lengths. This performance gap is worse for more complex
robots, and grows with the number of optimization iterations.
A significant performance gap exists for current state-of-the-art implementations

of nonlinear MPC. Figure 4-3 shows estimated motion planning control rates for three
robot models of increasing complexity (a manipulator, quadruped, and humanoid), for
trajectories of different lengths. Fewer time steps are typically used for trajectories
extending shorter time horizons into the future, while more time steps are used
for trajectories extending longer time horizons into the future. The three bands
represent control rates for the three robots, extrapolated analytically from state-of-
the-art compute times for the gradient of robot dynamics [9], a key bottleneck kernel
in optimal motion planning which accounts for 30% to 90% of the total runtime
depending on implementation details [9, 87]. The lower boundary of these bands
represents dynamics gradient calculations taking 30% of the total runtime, while the
upper boundary of these bands represents them taking 90% of the total runtime.
Control rates are calculated for 10 iterations of the optimization loop, allowing the
trajectory to begin to converge towards an optimal solution.
55
Two horizontal thresholds are shown in Figure 4-3. The upper threshold is the
1 kHz control rate at which robot joint actuators are capable of responding. The lower
250 Hz threshold is a conservative minimum suggested rate for nonlinear MPC to be
run online [27]. The 250 Hz MPC planner would have to be part of a hierarchical
system with faster-running low-level controllers interacting with the joint actuators.
A performance gap of at least an order of magnitude has emerged: if nonlinear MPC
could be run at kHz rates instead of 100s of Hz, the joint actuators could be controlled
directly, maximizing robot reflexes and responsiveness.
With current software solutions, nonlinear MPC is unable to meet the desired
1 kHz or 250 Hz control rates in most cases. For example, the manipulator can only
achieve the 1 kHz rate for short time horizons (under 25 time steps) and cannot achieve
the minimum 250 Hz rate for more than about 80 time steps. The performance gap is
worse for more complex robots such as the quadruped and humanoid. Additionally, for
some applications, more than 10 iterations of the optimization loop may be required
to achieve convergence [67], which would grow the performance gap even further.
The fundamental computational kernels used in motion planning and control
techniques like nonlinear MPC are tightly coupled with the physical properties of the
robot under control. As mentioned previously, state-of-the-art software solutions take
advantage of this coupling, performing code generation and templating functions based
on the particular topology and joint mechanics of a specific robot [9, 66] (Chapter 3).
Despite these optimizations, software solutions have not bridged the performance gap
illustrated by Figure 4-3.
4.2 Robomorphic Computing

We introduce robomorphic computing to address performance challenges in motion
planning and other robotics applications. Robomorphic computing is a hardware
design methodology for robotics accelerators with two steps:
1. Create a hardware template for an algorithm once, parameterized by key com-

ponents of robot morphology, e.g., limbs, links, and joints;
56
2. For each robot, set the template parameters to customize the processors and
functional units to produce a hardware accelerator tailored to that particular
robot model.
This methodology enables the systematic identification of fundamental architectural

design paradigms, such as parallelism and sparse data structures, in robotics algorithms.
After these structures are identified once, a portable accelerator template can be used
indefinitely to programmatically exploit these computational opportunities for each
new robot platform.
In this section, we briefly describe the robomorphic design flow in the context of
motion planning and control applications (Figure 4-4). In Chapter 5 we will follow
this methodology to create a hardware accelerator template for the dynamics gradient
kernel, and customize that template for an industrial manipulator. An overview of the
robomorphic computing design flow in the context of motion planning applications in
general is shown in Figure 4-4. Motion planning and control accelerators produced
using this methodology can shrink the motion planning and control performance gap,
allowing complex robots to successfully plan motion trajectories online and further
into the future, enabling robust operation and new capabilities.
4.2.1 Step 1: Create Parameterized HW Template (Once)
A parameterized hardware template needs to be created once per robotics algorithm,

e.g., inverse dynamics (Section 2.1.3). We break down the algorithm to identify
structural features that are parameterizable by robot morphology.
We take parallelism from loops iterating over the robot’s limbs and links, and
map that to parallel processing elements in the hardware accelerator template. We
identify linear algebra operations on key (sparse) robot matrices, e.g., link inertia,
joint transformation, and joint motion subspace matrices, and map those operations
to functional units whose constant values and sparsity patterns are parameterized by
the robot links and joint types.
The benefit of this method is that algorithm features that can be exploited for
57
accelerator design (e.g., parallelism, sparse matrix operations) only need to be identified
once, after which it is trivial to tune their parameters for each robot model.
Libraries of hardware templates can be distributed like software libraries or param-
eterized hardware intellectual property (IP) cores, e.g., RISC-V “soft” processors [43].
4.2.2 Step 2: Set Template Parameters (Per-Robot)
Once a hardware template has been created, it can be used to create customized
accelerators for many different robot models by setting the parameters to match the
robot morphology.
We use the numbers of limbs and links in the robot to set the numbers of paral-
lel processing elements in the accelerator template. We use link inertia values and
joint types to set constant values and sparsity patterns in the link and joint matri-
ces, streamlining the complexity of the functional units that perform linear algebra
operations.
For example, the first two links in a manipulator might be connected by a hinge
joint whose transformation matrix has only 13 of 36 elements populated. Then,
in the corresponding functional unit for matrix-vector multiplication using a tree
of multipliers and adders, we can prune operations on zeroed elements, reducing
multipliers by 64% and adders by 77%.
Setting such parameters per-robot results in a customized accelerator design based
on the robot morphology.
58
59
Figure 4-4: Using the robomorphic computing design flow for motion planning and control. First, we create a parameterized
hardware template of the algorithm. We identify limb and link-based parallelism in the algorithm, as well as operations involving
the sparse link inertia, joint transformation, and joint motion subspace matrices. This is done only once for any given algorithm.
Then, for every target robot model, we set the parameters of the hardware template based on the morphology of the robot to
create a customized accelerator.
60
Chapter 5
Implementation and Evaluation of

a Robomorphic Dynamics Gradient
Accelerator for a Manipulator
We demonstrate the use of our robomorphic computing methodology to design what

we believe to be the first domain-specific hardware architecture for the gradient of rigid
body dynamics. This kernel is a key bottleneck in online optimal motion planning and
control, taking as much as 30% to 90% of the total runtime [9,78,83,87] and contributing
to a significant performance gap (see Section 4.1). Using robomorphic computing, we
identify opportunities in the dynamics gradient algorithm to parameterize parallelism
and matrix sparsity patterns based on robot topology and joint types. Then, for a
specific robot model we can systematically set these parameters to exploit parallelism
in hardware datapaths and sparsity in linear algebra functional units, resulting in a
robot-customized accelerator.
We implement our dynamics gradient accelerator design on an FPGA for an indus-
trial manipulator, and integrate the FPGA accelerator in an end-to-end coprocessor
system, as it would be deployed for an off-the-shelf solution today. Our FPGA accel-
erator achieves speedups of 8× and 86× over state-of-the-art CPU and GPU latency,
and maintains an overall speedup from 1.9× to 2.9× when deployed in an end-to-end
system. We also synthesize an ASIC implementation to evaluate the performance and
61
power opportunities of a system on chip. We synthesize the design using the Global
Foundries 12 nm technology node, at slow and typical process corners. ASIC synthesis
indicates an additional 7.2× speedup factor over our FPGA implementation.
Our accelerator for the gradient of rigid body dynamics is a critical step towards
enabling emerging techniques for real-time online motion control for complex robots.
5.1 Robomorphic Accelerator Design
To demonstrate the robomorphic computing methodology, we design a domain-specific

architecture for the dynamics gradient algorithm. Following the design steps in
Figure 4-4, we:
1. Create a parameterized hardware template of the dynamics gradients algorithm;

and
2. Set the template parameters to customize the accelerator for an example target
robot model, an industrial manipulator.
First, we examine the dynamics gradient algorithm in detail. Then, we use high-level
insights from from robot-specific topology to design an accelerator template. We
exploit parallelism corresponding to the flow of physics through the robot’s structure
and use robot joint types to design customized functional units for the sparse matrix
operations and irregular data accesses prevalent in the calculation. We also employ
traditional computer architecture techniques, such as pipelining to hide latency, and
re-using folded processing elements for area and resource utilization efficiency.
Finally, after we have created our hardware template, we set the parameters to
customize the template for an industrial manipulator. For simplicity, we chose a
target robot with only a single limb as a proof of concept, however, the techniques
demonstrated here readily generalize to robots with multiple limbs (see Section 7.1).
We evaluate this novel accelerator implemented in an FPGA coprocessor and a
synthesized ASIC in Section 5.3.
62
5.1.1 Algorithm Details
The state of the art method to calculate the gradient of forward dynamics, used in
the Pinocchio software library [9, 11], is to make use of quantities computed earlier
in the optimization process (the joint acceleration 𝑞¨ and the inverse of the inertial
matrix 𝑀 −1 ) and take the following steps (see Algorithm 1):
1. Compute inverse dynamics using 𝑞¨ (Algorithm 2);
2. Compute the gradient of the inverse dynamics (Algorithm 3) with respect to

inputs from all links; and
3. Recover the gradient of forward dynamics by multiplication with 𝑀 −1 .
Before we present the design of an accelerator for the forward dynamics gradient,
we will briefly examine the two key functions: inverse dynamics and the gradient of
inverse dynamics.
Inverse Dynamics. The standard implementation of inverse dynamics (ID) is the
Recursive Newton-Euler Algorithm [31] (Algorithm 2). First, there is a sequential
forward pass from the base link of the robot out to its furthest link 𝑁 , propagating
per-link velocities, accelerations, and forces (𝑣𝑖 , 𝑎𝑖 , 𝑓𝑖 ) outward. Then, there is a
sequential backward pass from link 𝑁 back to the base link, updating the force values
𝑓𝑖 , and generating the output joint torques 𝜏𝑖 .
The inputs to the algorithm are the link position, velocity, and acceleration (𝑞𝑖 , 𝑞˙𝑖 , 𝑞¨𝑖 )
expressed in their local coordinate frame, and three key matrices: the link inertia
matrix, 𝐼𝑖 ; the joint transformation matrix, 𝑖 𝑋𝜆𝑖 ; and the joint motion subspace matrix,
𝑆𝑖 . The 𝐼𝑖 , 𝑖 𝑋𝜆𝑖 , 𝑆𝑖 matrices all have deterministic sparsity patterns derived from the
morphology of the robot model. We will elaborate on this sparsity in Section 5.1.2,
and describe how custom functional units can take advantage of these patterns. The
sine and cosine of the link position 𝑞 are used to construct the transformation matrices
𝑖
𝑋𝜆𝑖 . Our accelerator design takes sin 𝑞 and cos 𝑞 directly as inputs. These values are
computed at an earlier stage of the optimization algorithm when the inverse of the
inertial matrix 𝑀 −1 is calculated, and can be cached alongside 𝑀 −1 at that time.
63
Algorithm 1 ∇ Forward Dynamics w.r.t. 𝑢 = {𝑞, 𝑞}
˙ [9].
1: 𝑣, 𝑎, 𝑓 = Inverse Dynamics(𝑞, 𝑞,
˙ 𝑞¨) ◁ Step 1
2: 𝜕𝜏 /𝜕𝑢 = ∇ Inverse Dynamics(𝑞,
˙ 𝑣, 𝑎, 𝑓 ) ◁ Step 2
−1
3: 𝜕 𝑞¨/𝜕𝑢 = −𝑀 𝜕𝜏 /𝜕𝑢 ◁ Step 3
Algorithm 2 Inverse Dynamics [31].

Sparse matrices 𝑖 𝑋𝜆𝑖 , 𝐼𝑖 , 𝑆𝑖 derived from robot morphology.
1: 𝑣0 = 0; 𝑎0 = gravity; Define 𝜆𝑖 = Parent of Link 𝑖
2: for Link 𝑖 = 1 : 𝑁 do ◁ Forward Pass
3: Update 𝑖 𝑋𝜆𝑖 (𝑞𝑖 )
4: 𝑣𝑖 = 𝑖 𝑋𝜆𝑖 𝑣𝜆𝑖 + 𝑆𝑖 𝑞˙𝑖
5: 𝑎𝑖 = 𝑖 𝑋𝜆𝑖 𝑎𝜆𝑖 + 𝑆𝑖 𝑞¨𝑖 + 𝑣𝑖 × 𝑆𝑖 𝑞˙𝑖
6: 𝑓𝑖 = 𝐼𝑖 𝑎𝑖 + 𝑣𝑖 ×* 𝐼𝑖 𝑣𝑖 − 𝑓𝑖ext
7: end for
8: for Link 𝑖 = 𝑁 : 1 do ◁ Backward Pass
9: 𝑓𝜆𝑖 += 𝜆𝑖 𝑋𝑖 𝑓𝑖
10: 𝜏𝑖 = 𝑆𝑖𝑇 𝑓𝑖
11: end for
∇ Inverse Dynamics. The gradient of inverse dynamics (∇ID), shown in Algo-

rithm 3, is derived from line-by-line analytical derivatives of the ID (Algorithm 2)
with respect to position 𝑞𝑗 and velocity 𝑞˙𝑗 for all links 𝑗. For details, see previous
work [9]. Again, there is a sequential forward pass and backward pass. Operations on
the sparse 𝐼𝑖 , 𝑖 𝑋𝜆𝑖 , 𝑆𝑖 matrices are similarly fundamental to this algorithm.
𝑀 −1 Multiplication. After calculating ∇ID with respect to inputs from all links,
the final outputs are transformed by multiplication with the matrix 𝑀 −1 to recover
the forward dynamics gradient (Algorithm 1).
Workload Characteristics. The dynamics gradient kernel spends most of its
runtime on computation, instead of on memory accesses, making it a “compute-bound”
application. Most of the computational workload is matrix-vector multiplication using
matrices that are small (6 × 6 elements) and middlingly sparse (around 30% to 60%
sparse), compared to the large and very sparse matrices in applications such as neural
networks (around 50% to 99.9% sparse) [41] which have been the subject of much
64
Algorithm 3 ∇ Inverse Dynamics w.r.t. 𝑢 = {𝑞, 𝑞}
˙ [9].
Again, sparse matrices 𝑖 𝑋𝜆𝑖 , 𝐼𝑖 , 𝑆𝑖 derived from robot morphology.
1: 𝜕𝑣 0 𝜕𝑎0
,
𝜕𝑢 𝜕𝑢
= 0; Define 𝜆𝑖 = Parent of Link 𝑖
2: for Link 𝑖 = 1 : 𝑁 do ◁ Forward Pass
𝑖
3: Update 𝑋𝜆𝑖 (𝑞𝑖 ) ⎧
⎨(𝑖 𝑋𝜆 𝑣𝜆 ) × 𝑆𝑖
⎪
𝑢≡𝑞
𝜕𝑣𝑖 𝑖 𝜕𝑣 𝑖 𝑖
4: 𝜕𝑢
= 𝑋𝜆𝑖 𝜕𝑢𝜆𝑖
+⎪
⎩𝑆
𝑖 𝑢 ≡ 𝑞˙
⎧
⎨(𝑖 𝑋𝜆 𝑎𝜆 ) × 𝑆𝑖
⎪
𝜕𝑎𝑖 𝑖 𝜕𝑎𝜆𝑖 𝜕𝑣𝑖 𝑖 𝑖
5: 𝜕𝑢
= 𝑋𝜆𝑖 𝜕𝑢 + 𝜕𝑢 × 𝑆𝑖 𝑞˙𝑖 +
𝑣𝑖 × 𝑆𝑖
⎪
⎩
𝜕𝑓𝑖
6: 𝜕𝑢
= 𝐼𝑖 𝜕𝑎
𝜕𝑢
𝑖
+ 𝜕𝑣𝑖
𝜕𝑢
× 𝐼𝑖 𝑣𝑖 + 𝑣𝑖 × 𝐼𝑖 𝜕𝑣
*
𝜕𝑢
𝑖 *
7: end for
8: for Link 𝑖 = 𝑁 : 1 do ◁ Backward Pass
𝜕𝑓𝜆𝑖
(︁ )︁
9: 𝜕𝑢
+= 𝜆𝑖 𝑋𝑖 𝜕𝑓
𝜕𝑢
𝑖
+ 𝜆𝑖
𝑋𝑖 𝑓𝑖 ×* 𝑆𝑖
10: 𝜕𝜏𝑖
𝜕𝑢
= 𝑆𝑖𝑇 𝜕𝑓
𝜕𝑢
𝑖
11: end for
recent work on hardware acceleration [12, 47, 91]. Compression approaches that have
offered high performance for large sparse matrices, e.g., compressed sparse row (CSR)
encoding [41], are not suitable for taking advantage of the sparsity in this application
because they introduce large overheads to perform encoding and decoding.
There is opportunity for fine-grained link parallelism because the computation
of each per-link partial derivative that makes up the full gradient is independent up
until multiplication with 𝑀 −1 . However, within each partial derivative calculation
the forward and backward passes create chains of sequential dependencies between
parent and child links that scale with the total links 𝑁 and do not parallelize easily.
See Chapter 3 for further workload analysis of the inverse dynamics algorithm.
5.1.2 Step 1: Create Parameterized HW Template (Once)
We follow Step 1 of the robomorphic computing design flow shown in Figure 4-4 to
create a parameterized hardware template for an accelerator for the gradient of rigid
body dynamics.
65
Figure 5-1: Visualization of data flow in the inverse dynamics (ID) and ∇ inverse
dynamics (∇ID) algorithms, referenced against the robot links. Fully parallelized, the
latency of ID and ∇ID grows with 𝑂(𝑁 ).
Link-Based Parallelism and Data Flow. The first step of the forward dynamics
gradient (Algorithm 1) is computing ID (Algorithm 2). This is a sequential operation
with forward and backward passes whose lengths are parameterized by the number of
links in a robot, 𝑁 .
For the second step, ∇ID (Algorithm 3), we can design the datapaths of our acceler-
ator template to exploit fine-grained link-based parallelism between partial derivatives.
We refer to this parallelism as “fine-grained” because of the short duration of the
independent threads of execution. They are joined in the third step of Algorithm 1,
multiplication of the full gradient matrix with 𝑀 −1 .
To exploit this parallelism, we create separate datapaths per link, made up of
sequential chains of forward and backward pass processing units to compute each
partial derivative (see Figure 5-1). The latency of the computation in each datapath
grows linearly with the number of links, 𝑂(𝑁 ). Because there are also as many
datapaths as links, the total amount of work in the ∇ID step grows with 𝑂(𝑁 2 ), but
when parallelized, its latency grows with 𝑂(𝑁 ).
Note that we compute ∇ID with respect to two inputs, position 𝑞 and velocity 𝑞.
˙
The gradient with respect to 𝑞 and 𝑞˙ are completely independent, but share common
inputs, so datapaths for both can run in parallel in the accelerator and take advantage
of data locality by processing the same inputs at the same time.
66
Figure 5-2: Datapath of forward and backward pass units for a single link in Step
2 of Algorithm 1. The length of a single datapath and the total number of parallel
datapaths are both parameterized by the number of robot links. The forward pass
unit is folded into three sequential stages for efficient resource utilization.
Every step in the ∇ID forward and backward passes requires inputs 𝑣𝑖 , 𝑎𝑖 , 𝑓𝑖 ,
produced by the ID for link 𝑖. To satisfy this data dependency, the steps of the
computation of ID must execute one link ahead of the computation of the ∇ID
datapaths. As a result, we are able to exploit parallelism between the first two steps
of Algorithm 1: the datapath that computes ID can run almost entirely in parallel to
the ∇ID datapaths (offset by one link). With this design, computing both ID and
∇ID with respect to 𝑞 and 𝑞˙ can all be done with 𝑂(𝑁 ) total latency.
Link and Joint-Based Functional Units. The datapaths of the accelerator are
built from chains of forward and backward pass processing units. A single datapath of
forward and backward pass units is illustrated in Figure 5-2. Within these units are
circuits of sparse matrix-vector multiplication functional units, e.g., the 𝐼·, 𝑋·, and ·𝑣𝑗
blocks in the forward pass. To minimize latency, dot products in these functional units
are implemented as trees of multipliers and adders. Robot link and joint information
can be used to parameterize these operations.
The link inertia matrix 𝐼 has a fixed sparsity pattern for all robots, but its elements
are constant values that are determined by the distribution of mass in a robot’s links.
If these values are set per-robot, the multipliers in the 𝐼· unit can all be implemented
as multiplications by a constant value, which are smaller and simpler circuits than
67
Figure 5-3: Example of one row of a transformation matrix dot product functional
unit, 𝑋·, for the joint between the first and second links of a manipulator. The sparsity
of the tree of multipliers and adders is determined by the robot joint morphology.
full multipliers.
The joint transformation matrix 𝑖 𝑋𝜆𝑖 has a variable sparsity pattern that is
determined by robot joint type. When this sparsity is set per-robot, the tree of
multipliers and adders in the 𝑋· can be pruned to remove operations on zeroed matrix
elements, streamlining the functional unit (see Figure 5-3).
The joint motion subspace matrix 𝑆𝑖 has a sparsity pattern determined by joint
type. For many common joints (e.g., revolute, prismatic), the columns of 𝑆𝑖 are vectors
of all zeroes with a single 1 that filter out individual columns of matrices multiplied
by 𝑆𝑖 . Per robot, the effect of filtering by 𝑆𝑖 can be encoded within functional units
such as 𝑋· and ·𝑣𝑗 by pruning or muxing operations and outputs from matrix columns
that are not selected by the 𝑆𝑖 sparsity.
Robot-agnostic sparsity is also exploited in the hardware template. Cross product

operations are encoded either as muxes that re-order outputs, or implemented as
matrix-vector multiplications, e.g., in the 𝑓 𝑥· units.
Other Architectural Optimizations. We make several additional optimizations

to the computation of the dynamics gradient in the final, parameterized accelerator
template design. These design choices are discussed in detail in Section 5.2. We
performed hardware-software codesign to select which steps of the algorithm to
incorporate in the design. We used pipelining to increase throughput and we folded
processing elements and functional units to reduce area (or analogously, resource
utilization on an FPGA).
68
Figure 5-4: Parameterized hardware template of the dynamics gradients accelerator.
The number of links 𝑁 , link inertial properties, and joint types are all parameters
that can be set to customize the microarchitecture for a target robot model.
Briefly, we folded the forward and backward pass pipeline stages each into a
processor containing just a single row of per-link cores. We also fold the forward pass
link units along three divisions, indicated in Figure 5-2. Finally, to further conserve
multiplier units, we also performed folding to incorporate the third step of Algorithm 1,
multiplication by 𝑀 −1 , into the existing logic for the earlier steps. Without aggressive
folding, the number of multipliers needed for the template design would be enormous
for almost any robot model, making it impossible to implement using the limited
number of digital signal processing multiplier units on an FPGA, and consuming a
large amount of area in an ASIC implementation.
Final Template Design. The microarchitecture of the accelerator template for the
dynamics gradient is shown in Figure 5-4. The first stage performs the forward passes
69
of the inverse dynamics and their gradients with respect to position and velocity. It
takes 𝑁 , the total number of links, iterations of this processor stage to traverse the
robot links. One extra iteration is added for the imperfect overlap of the inverse
dynamics with the gradients, and the iterations each take 3 cycles because of the
forward pass folding, bringing the latency of the first stage to a total of (𝑁 + 1) * 3
cycles.
The second stage performs the backward pass in (𝑁 + 1) cycles. Because of

pipelining between these stages, its latency is entirely overlapped by the latency of
the forward pass stage. Finally, the third stage of the processor is the two matrix
multiplications that are folded in with units from the backward pass processor. These
multiplications add an additional delay of 2 cycles to the latency of the second stage,
however it is still dominated by the latency of the first stage.
5.1.3 Step 2: Set Template Parameters (Per-Robot)
In Section 5.3 we evaluate our accelerator design implemented for the LBR iiwa
industrial manipulator (pictured in Figure 5-5). This target robot model has 𝑁 = 7
links and “revolute”-type joints about the 𝑧-axis. We set these parameters to customize
the template in Figure 5-4, instantiating 7 parallel datapaths in the forward and
backward passes of the gradient with respect to velocity 𝑞,
˙ and fixing the constants
and sparsity patterns of the functional units (see the example illustrated in Figure 5-3).
Note that for this particular robot model the forward pass of the gradient with respect
to position 𝑞 is actually always zero for the first link, so for the gradient with respect
to 𝑞 we instantiate 6 parallel datapaths for the forward pass and the full 7 datapaths
for the backward pass.
Again, these parameters can all be systematically changed to target different robot
models, e.g., a quadruped with 𝑁 = 3 links per limb, and different joint types.
70
Figure 5-5: We use the Kuka LBR iiwa [58] manipulator robot to demonstrate using
the robomorphic computing methodology to design and implement a customized
accelerator for the gradient of forward dynamics.
5.2 Architectural Design Choices

In this section, we discuss several of the key architectural choices made in the design
of our accelerator template for the dynamics gradient (Figure 5-4). We performed
hardware-software codesign to determine which steps of Algorithm 1 to include in
the accelerator. We folded hardware units at multiple levels of the design hierarchy,
to conserve area and resources. We also implemented pipelining to hide latency and
increase throughput. These design optimizations produce an accelerator whose FPGA
implementation demonstrates meaningful speedups over state-of-the-art CPU and
GPU solutions (see Section 5.3). Additional microarchitectural optimizations are
possible, to continue optimizing performance. While implementing further design
modifications is beyond the scope of this thesis, we briefly discuss possibilities for
future work in design optimization at the end of this section.
5.2.1 Hardware-Software Codesign
Hardware-software codesign is a design process to achieve system-level goals by

leveraging synergy between hardware and software [15]. This includes partitioning of
a target algorithm between software and hardware. (See Chapter 6 for more details
on hardware-software codesign for different hardware platforms, including CPUs and
GPUs.)
In this section we describe how we used hardware-software codesign in the creation
of the accelerator template for the FPGA to determine which steps of the gradient of
71
Table 5.1: State-of-the-art CPU timing [9] for the steps of the forward dynamics
gradient (Algorithm 1). This timing breakdown informed our hardware-software
codesign.
Forward Dynamics Gradient Step CPU Time [us]
Step 1, Inverse Dynamics (ID) 1.20
Step 2, ∇ Inverse Dynamics (∇ID) 3.34
−1
Step 3, 𝑀 Multiplication 0.62
Figure 5-6: Accelerator designs incorporating different steps of the dynamics gradient
(Algorithm 1), as part of the hardware-software codesign process. Our final design, on
the right, implements all three steps.
forward dynamics (see Algorithm 1) to include in the accelerator hardware. Recall,

those steps are: Step 1, compute inverse dynamics (ID) using 𝑞¨; Step 2, compute the
inverse dynamics gradient (∇ID) with respect to inputs from all links; and Step 3,
recover the forward dynamics gradient by multiplication with 𝑀 −1 [9, 11].
State-of-the-art CPU implementation times [9] reported for the three steps of
forward dynamics gradient are shown in Table 5.1. Accelerator designs incorporating
different steps of the forward dynamics gradient are shown in Figure 5-6. Resource
utilization breakdowns for the three designs incorporating different steps of the forward
dynamics gradient are shown in Table 5.2.
Our initial accelerator design implemented only the longest latency step, Step 2,
∇ID (see Figure 5-6, left). However, isolating only that step in an an FPGA imple-
72
Table 5.2: Percent resource utilization (of totals, given in parentheses) for accelerator
designs incorporating different steps of the dynamics gradient on our target XCVU9P
FPGA. Our final design, with all three steps, makes heavy use of the digital signal
processing (DSP) blocks, which we use for matrix element multiplications.
Design Breakdown LUT Reg CARRY8 MUX DSP
(1182240) (2364480) (147780) (591120) (6840)
∇ID Only Fwd. Pass 21.16% 1.07% 8.89% 0.49% 36.49%
Bwd. Pass 9.54% 0.24% 6.22% 0.08% 26.20%
TOTAL 30.70% 1.32% 15.12% 0.57% 62.69%
∇ID & ID Fwd. Pass 22.99% 1.10% 9.56% 0.53% 39.30%
Bwd. Pass 10.18% 0.25% 6.65% 0.08% 27.54%
TOTAL 33.17% 1.35% 16.22% 0.61% 66.84%
∇ID, ID, 𝑀 −1 Fwd. Pass 22.99% 1.10% 9.56% 0.53% 39.30%
Bwd. Pass 9.31% 0.52% 7.16% 0.08% 38.19%
TOTAL 32.30% 1.62% 16.73% 0.61% 77.49%
mentation required a large number of input bits: a total of 4704 bits per gradient
calculation for our target robot model, the iiwa manipulator (Figure 5-5). These input
bits need to be transferred to the FPGA from a host CPU, so more bits consume
more of the limited I/O bandwidth. In our full coprocessor system implementation in
Section 5.3, the I/O channel was limited to PCIe gen1, so a large number of input
bits was especially undesirable.
To reduce the number of input bits sent from a host CPU, we next chose to also
implement Step 1 of Algorithm 1, ID, into our design (see Figure 5-6, middle). This
reduced the number of input bits from 4704 to 896, a reduction of 5.25×, by producing
the 𝑣𝑖 , 𝑎𝑖 , 𝑓𝑖 signals from ID units within the accelerator. The additional cost of doing
this was a modest increase in FPGA resource utilization, most significantly an extra
4.15% increase in our most heavily utilized resource, the digital signal processing
(DSP) units (see Table 5.2). In our FPGA implementation, we use those DSP units
for the many matrix element multiplications performed in the workload.
Finally, we decided to implement Step 3 of the forward dynamics gradient, the
final 𝑀 −1 multiplication, in our accelerator design because we determined that this
step would be faster on the FPGA than the host CPU, and we had enough DSP units
remaining on our target FPGA for this calculation if we folded it with the backward
73
pass (Section 5.2.2). Total DSP utilization was at 66.8% after implementing Steps 1
and 2 (Table 5.2). After implementing Step 3 as well, total DSP utilization increased
to 77.5%. This increase came entirely in the backward pass of the processor because
we used folding to incorporate the 𝑀 −1 multiplication into existing hardware for the
backward pass (details in Section 5.2.2). Receiving the 𝑀 −1 matrix as an input also
increased the number of input bits up to 2464, shrinking the reduction from the second
design from 5.25× down to 1.9×. However, in exchange for these tradeoffs, we are
able to complete the entire forward dynamics gradient algorithm in our design. The
results in Section 5.3 Figure 5-9 demonstrate that we achieve meaningful speedups on
all three steps of the algorithm compared to CPU and GPU implementations.
5.2.2 Folding
We performed folding at two different levels of the design, to compress total accelerator
area and conserve resources for FPGA implementation. We fold the forward passes of
all parallel datapaths into a processor that executes for a single link at a time, then
feeds back the results to iterate over all links in the sequential chain. This gives a
significant reduction in the number of functional units (a reduction of approximately
𝑂(𝑁 2 )× in area) in exchange for a very small latency penalty (the cost of loading and
storing intermediate results to registers). We fold the backward passes in the same
manner.
For additional area and resource savings, we also fold the forward pass link units
along three divisions, indicated in Figure 5-2. This allows us to re-use the sparse
matrix-vector joint functional units, conserving the number of multipliers and adders
needed in the design. Figure 5-7 shows how folding the forward pass units impacts
total FPGA resource utilization. While folding gives slight increases in the number of
several types of resources, it greatly reduces the utilization of the most heavily-used
resource, the DSP blocks (see Table 5.2). Using our folding scheme, we are able to
reduce DSP utilization by 1.79× per forward pass unit over an unfolded design.
Finally, to further conserve multiplier units, we also performed folding to incor-
porate the third step of Algorithm 1, multiplication by 𝑀 −1 , into the existing logic
74
Figure 5-7: Effect on total FPGA resource utilization per forward pass unit from
folding the forward pass units along the dividing lines shown in Figure 5-2. Folding
these units conserves our most heavily-used resources on the XCVU9P FPGA, the
digital signal processing (DSP) blocks, a reduction of 1.79× per forward pass unit.
for the earlier steps. If we just added the two 𝑀 −1 multiplications as separate logic
without folding with existing units, the total additional number of required DSP units
would be 2744, which would raise the total DSP utilization to 117.6%. If we folded
the two 𝑀 −1 multiplications with each other, it would still be an increase of 1372
DSPs, raising total utilization to 97.54%. Very high resource utilization on an FPGA
is undesirable because it makes routing signals between resources in the FPGA fabric
very constrained and difficult. This can cause the design to not synthesize, or to
synthesize at a very slow clock speed.
To avoid these problems, we fold the 𝑀 −1 multiplications into the backward pass
units for ∇ID with respect to velocity 𝑞.
˙ This design choice results in an increase
of only 728 DSPs, raising total utilization to only 77.49% (see Table 5.2). After the
backward pass row processor has completed work on a sequential chain of operation,
the multipliers from the sparse matrix-vector units in the ∇ID backward pass with
respect to 𝑞˙ are augmented with additional multipliers to perform the multiplication
of the 𝑀 −1 matrix and the columns of the outputs of the gradients backward passes,
˙ These 𝑀 −1 matrix multiplications are performed in two sequential
𝜕𝜏 /𝜕𝑞 and 𝜕𝜏 /𝜕 𝑞.
clock cycles after the backward pass has been computed.
75
5.2.3 Pipelining
We pipeline the forward pass and backward pass of all datapaths, to hide latency
when computing multiple gradient operations for a motion trajectory. In our FPGA
coprocessor system implementation (see Figure 5-8), we also pipeline the receipt and
buffering of inputs from the PCIe I/O connection with the rest of the accelerator, so
we can simultaneously process one dynamics gradient calculation while receiving data
for the next calculation.
5.2.4 Future Work in Design Optimization
More design optimizations can be done in future work to further increase the perfor-
mance of the design, or to reduce the total area and FPGA resource utilization.
One optimization for a faster clock speed would be to introduce more pipelining
into the design to reduce critical timing paths. In the current design, the critical path
is in the backward pass processor, going through the 𝑋 𝑇 · units. These units could
be pipelined into two stages, each performing half of the required multiplications.
Alternatively, the multiplication units themselves could be pipelined, to give results
at a higher throughput.
Another design optimization might be additional folding of the forward and
backward pass units for different links into each other. This may be especially effective
for the forward passes of ∇ID, where for some robot models, patterns of forward pass
outputs that are always zero can be determined from robot topology. However, this
will require significantly more complicated control flow to deliver inputs for different
links to different forward pass units at the same time, which is why we did not pursue
this optimization within the scope of the work presented here.
Yet another design optimization would be to fold the backward pass units them-
selves into multiple stages, like the folding we performed on the forward pass units
(see Figure 5-2). This could reduce total area and resource utilization. The latency of
this additional folding would be largely hidden by the latency of the folded forward
pass units, which are pipelined with the backward pass.
76
These future optimizations may be especially useful in reducing resource utilization
in order to fit the design onto a smaller FPGA platform, or to save space to fit an
accelerator parameterized to a more complex robot with a greater number of total
links than the 7-link manipulator we use as the basis of our evaluation in this work.
5.3 Evaluation
In this section, we evaluate the performance of the FPGA implementation of our
accelerator, comparing it to off-the-shelf CPU and GPU implementations. We also
evaluate a synthesized ASIC version of the accelerator pipeline. For both the FPGA and
ASIC, we evaluate our accelerator design implemented using an industrial manipulator.
However, in Section 7.1 we describe how robomorphic computing can be used to adapt
the design for different robot models.
5.3.1 Methodology
Our accelerator and the software baseline implementations all target the Kuka LBR
iiwa-14 manipulator [58] (pictured in Figure 4-1 and Figure 5-5) using the robot model
description file from the rigid body dynamics benchmark suite, RBD-Benchmarks [76].
Baselines. As baselines for comparison, we used state-of-the-art CPU and GPU
software implementations of the dynamics gradient from previous work. The CPU
baseline is from the dynamics library Pinocchio [11]. The application was parallelized
across the trajectory time steps using a thread pool so that the overheads of creating and
joining threads did not impact the timing of the region of interest. The GPU baseline
is taken from previous work on nonlinear MPC on a GPU [87]. This implementation
is also parallelized across the motion trajectory time steps.
Hardware Platforms. The platforms used in our evaluation are summarized in
Table 5.3. Our CPU platform is a 3.6 GHz quad-core Intel Core i7-7700 CPU
with 16 GB of RAM running Ubuntu 16.04.6. Quad-core Intel i7 processors have
been a common choice for the on-board computation for complex robot platforms,
including many featured in the DARPA Robotics Challenge [55], such as the Atlas [56],
77
Table 5.3: Hardware System Configurations
Platform CPU GPU FPGA
Processor i7-7700 RTX 2080 XCVU9P
# of Cores 4 2944 CUDA (46 SM) N/A
Max Frequency 3.6 GHz 1.7 GHz 55.6 MHz
I/O Bus N/A PCIe Gen3 PCIe Gen1
Valkyrie [89], and CHIMP [99] humanoids. Similarly, the Spot quadruped from Boston
Dynamics [8] offers a quad-core Intel i5 processor on board. The GPU platform is an
NVIDIA GeForce RTX 2080 with 2944 CUDA cores. This platform offers comparable
compute resources to the GPU available as an add-on for the Boston Dynamics
Spot quadruped, the NVIDIA P5000 with 2560 CUDA cores. We implemented our
accelerator on a Xilinx Virtex UltraScale+ VCU-118 board with a XCVU9P FPGA.
This platform was selected because it offers a high number of digital signal processing
(DSP) units, which we use to perform the element-wise multiplications in our linear
algebra-heavy workload. The FPGA design was synthesized at a clock speed of
55.6 MHz. (Note that it may be possible to increase the clock speed in future work by
implementing further design optimizations such as those suggested in Section 5.2.4.)
CPU code was compiled using Clang 10. Code for the GPU was compiled with
nvcc 11 using g++5.4. (The NVIDIA nvcc compiler calls g++ for final compilation.)
The GPU and FPGA were connected to a host CPU over PCIe connections. The
GPU was connected by a PCIe Gen3 interface, however, the FPGA was connected
with a PCIe Gen1 interface due to software limitations in Connectal [49]. Note that
for ease of implementation, the final output data returned from the FPGA to the host
CPU is written back in a transposed order from the CPU and GPU baselines.
Measurements. For clean timing measurements, we disabled TurboBoost on the

CPU and fixed the clock frequency of all four cores to the maximum 3.6 GHz.
HyperThreading was enabled so that threadpools could be used. Latency time was
measured with the Linux system call clock_gettime(), using CLOCK_MONOTONIC as
the source. For the single computation measurements in Section 5.3.2, we take the
mean of one million trials for the CPU and GPU data. For the FPGA data, we extract
78
Figure 5-8: FPGA coprocessor system implementation.
the timing from cycle counts and the clock frequency. For the multiple computation
results in Section 5.3.3, we take the mean of one hundred thousand trials. For the
GPU and FPGA 98% of trials were within 2% of the mean. For the CPU 90% of
trials were within 2% of the mean for longer trajectories but only within 10% of the
mean for shorter trajectories.
End-to-End System Integration We integrated our accelerator design into a com-

plete end-to-end system with an FPGA implementation of the accelerator connected
as a coprocessor to a host CPU (see Figure 5-8). The CPU generates inputs and calls
the accelerator to compute the dynamics gradient.
The communications between the CPU and FPGA is implemented using the tool
Connectal [49]. The Connectal framework sets up communication links that act as
direct memory accesses in C++ on the CPU-side software, and as FIFO buffers in
Bluespec System Verilog on the FPGA-side hardware. Connectal’s support of the
FPGA board used in our evaluation (VCU-118) is currently limited to the first version
of PCIe, but we evaluated that this limited bandwidth is still sufficient to deliver high
enough performance for our accelerator to out-perform CPU and GPU baselines (see
Section 5.3). Conversion between the floating-point data types used on the CPU and
the fixed-point data types used in the FPGA accelerator is implemented on the FPGA
using Xilinx IP Cores.
79
Figure 5-9: Latency of a single computation of the dynamics gradient. Our FPGA
accelerator gives 8× and 86× speedups over the CPU and GPU. The GPU suffers
here from high synchronization and focus on throughput, not latency.
5.3.2 FPGA Accelerator Evaluation
Latency Results. To evaluate the performance benefit of our accelerator for the
target algorithm, we compare the latency of a single execution of forward dynamics
gradient on the CPU, GPU, and our FPGA implementation (see Figure 5-9). This
result is broken down into the three steps of Algorithm 1:
1. Calculate 𝑣, 𝑎, 𝑓 from inverse dynamics (ID);
2. Calculate 𝜕𝜏 /𝜕𝑢, the gradient of inverse dynamics (∇ID); and
3. Recover 𝜕 𝑞¨/𝜕𝑢, the gradient of forward dynamics, from multiplication by the

𝑀 −1 matrix.
Because we are interested in comparing pure computational latency in this experi-

ment, we ignore pipelining between the forward and backward passes in the FPGA
accelerator. Additionally, in our accelerator design the datapaths of the ID and ∇ID
are overlapped for the forward and backward passes, but offset by one cycle each, so
the timing data for ID on the FPGA in Figure 5-9 is given as 2 cycles of overhead.
80
The results in Figure 5-9 show the accelerator implemented on the FPGA demon-
strating a significant computational speedup over the CPU and GPU, despite operating
with a much slower clock speed (see Table 5.3). The total latency of the accelerator’s
forward dynamics gradient computation is 8× faster than the CPU and 86× faster
than the GPU.
The FPGA outperforms the CPU and GPU because it has minimal control flow
overhead and can fully exploit parallelism from small matrix operations throughout
the workload and parallelism from partial derivatives in ∇ID before they join for
the 𝑀 −1 multiplications. This is because the structure of the algorithm is explicitly
implemented in the datapaths of the accelerator (Figure 5-4), including parallelism
directly determined by the target robot.
The GPU fares poorly in this experiment because it is a platform optimized
for parallel throughput, not the latency of a single calculation. It experiences an
especially long latency for ∇ID, the step of Algorithm 1 with the largest computational
workload. GPU processors are designed for large vector operations and have difficulty
exploiting parallelism from the small sparse matrices in ∇ID. The algorithm is also
very serial because of inter-loop dependencies in the forward and backward passes,
forcing many synchronization points and causing overall poor thread occupancy.
Additionally, because threads must synchronize for 𝑀 −1 multiplications, parallelism
between partial derivatives in ∇ID is of limited duration, further contributing to
frequent synchronization penalties.
The CPU can also only exploit limited parallelism in the linear algebra through vec-
tor operations, but its pipeline is optimized for single-thread latency, so it outperforms
the GPU.
Joint Transformation Matrix Sparsity. Our FPGA platform offered a limited
number of digital signal processing multipliers, so to conserve multiplier resources in
our design, we implemented a single transformation matrix-vector multiplication unit
for all seven joints in our target robot model (see Section 5.1.3). This unit covers
a superposition of the matrix sparsity patterns in all individual joints. This design
choice uses sparsity to reduce the total operations, while avoiding the waste of area
81
Figure 5-10: To reduce FPGA utilization, we implement a single transformation matrix-
vector multiplication unit for all joints in the target robot. Using a superposition of
sparsity patterns, we recover 33.3% of average sparsity while conserving area.
and resources from creating seven separate units.

Figure 5-10 compares the reduction in operations from this design choice to
several baselines. “No Sparsity” is total operations for a dense 6 × 6 matrix-vector
multiplication. “Robot-Agnostic” is a domain-specific optimization that assumes the
upper-right quadrant of the transformation matrix is empty, which is true regardless of
robot model. “Robomorphic, Superposition All Joints” is the design choice we made in
our accelerator, where we implemented a single matrix-vector multiplication unit that
could be used for all joints in the iiwa manipulator. The sparsity in that functional
unit was based on the superposition of the sparsity patterns in all the individual joint
matrices. Finally, “Robomorphic, Average All Joints” represents the average sparsity
of those individual joint matrices, to indicate how well the superposition captures the
sparsity of the individual joints. To achieve or surpass this bound, however, would
require creating seven separate units, one for each joint.
Our design choice recovered 33.3% of the average robomorphic sparsity of the
individual joint matrices in a single matrix-vector multiplication unit, without the
expense of area and resources it would take to instead instantiate seven matrix-vector
multiplication units, one per joint. This was advantageous in our FPGA design, where
area and resources were highly constrained. This design choice could be re-visited in
future platforms with less stringent area constraints.
Accuracy and Numerical Precision. In our FPGA design, we used a 32-bit fixed-
82
Figure 5-11: We used 32-bit fixed-point with 16 decimal bits in our design due to
FPGA multiplier constraints. However, a range of fixed-point numerical types deliver
comparable optimization cost convergence to baseline 32-bit floating-point after a
fixed number of optimization iterations. This indicates that it is feasible to use fewer
than 32-bits in future work. Fixed-point labeled as “Fixed{integer bits, decimal bits}”.
point numerical type with 16 decimal bits. Fixed-point reduces the area and complexity
of arithmetic operations compared to floating-point. To validate this design choice
and test the sensitivity of the algorithm to different degrees of numerical precision, we
experimented with different data types for the dynamics gradient function within a
nonlinear MPC implementation [87]. Figure 5-11 shows optimization cost convergence
results. This optimization converged for the standard 32-bit floating point type used
by our baseline software implementations. To test different numerical types, we used
the Julia programming language to implement a type-generic software implementation
of the dynamics gradient function that performs equivalent mathematical operations
to our accelerator design. This type-generic code allowed us to use 32-bit fixed-point
numbers where we could allocate different numbers of bits for the integer versus decimal
part of the number. Figure 5-11 shows convergence results using the standard 32-bit
floating-point type and using 32-bit fixed point types with different numbers of bits
allocated to the integer versus decimal part of the number (labeled as “Fixed{integer
bits, decimal bits}”).
We found that a range of 32-bit fixed point values worked as well as 32-bit floating
83
Figure 5-12: End-to-end system latency for a range of trajectory time steps. Our
FPGA accelerator (F) gives speedups of 2.2× to 2.9× over CPU (C) and 1.9× to 5.5×
over GPU (G).
point, validating our use of fixed point units in our accelerator design, which reduced
the complexity of the mathematical operators like multipliers. Results indicate it is
possible to shrink the integer down to 14 bits (“Fixed{14,18}”) and the decimal down
to down to 6 bits (“Fixed{26,6}”). This suggests we may ultimately be able to use a
20-bit fixed-point type (14 integer bits, 6 decimal bits) in future implementations of
our accelerator design. This would reduce the bit width throughout the computation
by 37.5%, which could lead to a faster runtime and lower power dissipation than
full-width 32-bit fixed-point math.
However, we used 32 bits in our current design because it was convenient for data
I/O with a CPU, and because our FPGA’s digital signal processing multipliers are
27 × 18 bits, so all operands between 19 and 36 bits require twice as many multipliers.
5.3.3 End-to-End System Evaluation
End-to-End Timing Results. While Figure 5-9 shows compute-only latency for a
single calculation of dynamics gradient, Figure 5-12 shows timing for an end-to-end
coprocessor system implementation of the accelerator (Figure 5-8) performing multiple
gradient calculations. FPGA and GPU results include I/O overheads. We evaluate
84
a representative range of trajectories, from 10 to 128 time steps [16, 22, 79, 87, 103].
Each time step requires one dynamics gradient calculation.
The FPGA accelerator gives speedups of 2.2× to 2.9× over CPU and 1.9× to 5.5×
over GPU because of its very low latency (see Figure 5-9). However, the scaling of
FPGA performance in this experiment is ultimately limited by throughput at higher
numbers of time steps. On our current FPGA platform we heavily utilized the limited
digital signal processor resources on the FPGA for linear algebra operations. As a
result, we could only instantiate the complete accelerator pipeline for a single gradient
computation. By contrast, the CPU has 4 cores and the GPU has 46 SM cores, so
they can process multiple gradient computations in parallel.
The FPGA and GPU have comparable I/O overhead from round-trip memory
transfers to the host CPU, despite the FPGA having a lower bandwidth I/O connection
than the GPU (PCI Gen1 vs. Gen3). We achieve this by pipelining the I/O data
marshalling with the execution of each computation. While one time step is being
processed, the inputs for the next time step are being received and buffered on the
FPGA.
Under 64 time steps, the CPU benefits from low latency (Figure 5-9) and outper-
forms the GPU. Beginning at 64 time steps, the GPU benefits from high throughput
on the large number of parallel computations, and surpasses the CPU. Thread and
kernel launch overheads flatten the scaling of the CPU and GPU at low numbers of
time steps (10, 16, and 32).
5.3.4 Synthesized ASIC Power and Performance
While we already see meaningful speedups from implementing our accelerator on

an FPGA (Figures 5-9 and 5-12), an ASIC platform offers additional performance
and power benefits. We ran ASIC synthesis on the core computational pipeline of
our design: the forward and backward pass processors, plus the intermediate SRAM
between them (Figure 5-4), using the Global Foundries 12nm tech node at both slow
and typical process corners. Table 5.4 compares these results to our XCVU9P FPGA
synthesized using the Vivado Design Suite [32].
85
Table 5.4: Synthesized ASIC (12nm Global Foundries) and baseline FPGA results for
accelerator computational pipeline.
Platform FPGA Synthesized ASIC
Process Corner Typical Slow Typical
Technology Node [nm] 14 12 12
Max Clock [MHz] 55.6 250 400
2
Area [mm ] N/A 1.627 1.885
Power [W] 9.572 0.921 1.095
Figure 5-13: ASIC synthesis shows a 4.5× to 7.2× speedup in single computation
latency over the FPGA.
The maximum clock of the core computational pipeline on the ASIC (typical
corner) is 7.2× faster than the clock of our FPGA implementation. Figure 5-13
compares the latency for a single computation on the ASIC versus the FPGA.
A system-on-chip will allow us to instantiate multiple parallel pipelines, improving
the throughput of our accelerator. On the FPGA we can only fit a single pipeline
due to limited multiplier resources (Section 5.3.3). An ASIC area of 1.9 mm2 (typical
corner), however, suggests many pipelines can fit on a chip. For example, Intel’s 14 nm
quad-core SkyLake processor [28] is around 122 mm2 , nearly 65× our pipeline area.
A major ASIC benefit is low power dissipation. Power budgets are an emerging
constraint in robotics [100], especially for untethered robots carrying heavy batteries.
For example, the Spot quadruped has a typical runtime of 90 minutes on a single
charge of its battery [8], limiting its range and potential use cases. Power dissipation
of our design on an ASIC (typical corner) is 8.7× lower than the calculated power on
an FPGA.
86
Figure 5-14: Projected control rate improvements from our accelerator using the
analytical model from Figure 4-3. We enable planning on longer time horizons for a
given control rate, e.g., up to about 100 or 115 time steps instead of 80 at 250 Hz.
ASIC results show a narrow range between process corners.
5.3.5 Projected Control Rate Improvement
Finally, we revisit the analytical model from Figure 4-3 to project control rate
improvements from using our accelerator (Figure 5-14). We enable faster control rates,
which robots can use to either perform more optimization loop iterations to compute
better trajectories, or plan on longer time horizons, e.g., up to about 100 or 115 time
steps instead of 80 at 250 Hz. Exploring longer planning time horizons allows robots
to be more adaptive to their environment and more robust in their motion, increasing
their resilience to disturbances and unlocking new behaviors.
87
88
Chapter 6
Hardware-Software Codesign for

the Robot Dynamics Gradient
As established in Chapter 3 and Chapter 4, rigid body dynamics algorithms and their
gradients are core computational kernels for many robotics applications, including mo-
tion planning and control. State-of-the-art software implementations of dynamics and
their gradients are based on spatial algebra [29], and are optimized to take advantage
of sparsity in the deterministic structure of the underlying matrix operations [10, 35].
However, as demonstrated in Chapter 5, existing software implementations for the
CPU and GPU do not take full advantage of opportunities for parallelism present
in the algorithm, limiting their performance [75]. As a consequence, while real-time
nonlinear model predictive control (MPC) is a promising motion planning and control
technique for complex robots [21, 26, 51, 65, 77, 86, 102], this optimization problem must
still be initially solved offline for high-dimensional dynamic systems [20]. As shown in
Section 4.1, computation of the gradient of rigid body dynamics is a bottleneck in
these optimizations, taking 30% to 90% of the total runtime [10, 51, 77, 86].
The performance of traditional CPU hardware has been limited by thermal dissi-
pation, enforcing a utilization wall that restricts the performance a single chip can
deliver [25,105]. Computing platforms such as GPUs and FPGAs provide opportunities
for higher performance and throughput by exploiting greater parallelism than CPUs
(see Chapter 2).
89
In robotics, there has been recent interest in the use of GPUs. Previous work on
GPUs implemented sample-based motion planning largely through Monte Carlo roll-
outs [42,45,81,85,107,108]. For motion planning and control approaches using dynamic
trajectory optimization, recent efforts leveraging multi-core CPUs and GPUs indicate
that significant computational benefits from parallelism are possible [1,26,38,54,84,86].
However, current state-of-the-art robot dynamics packages are not optimized for use
on the GPU [86], and GPU-optimized rigid body physics implementations from the
computer graphics community lack the accuracy needed for robotics applications [23].
Meanwhile, previous work on FPGAs for robot motion planning and control has
been limited to fast mesh and voxel-based collision detection [3, 60, 61, 69–71, 98], and
planning for systems with simple dynamics, such as cars and drones [68,93,109]. These
solutions do not address complex multi-body robots, e.g., industrial manipulators,
quadrupeds, and humanoids, which are the focus of our work in Chapter 4 and
Chapter 5.
The dynamics gradient accelerator design and evaluation in Chapter 5 demonstrates
the potential of leveraging a highly-parallel computing platform like an FPGA to
perform motion planning and control for complex robots. The process of hardware-
software codesign was critical to the design of that accelerator, helping it achieve
significant speedups for the rigid body dynamics gradient kernel over state-of-the-art
software implementations. In this chapter, we describe how that accelerator made use
of hardware-software codesign for the FPGA platform.
Then, inspired by the performance improvements yielded by the FPGA codesign
process, we use hardware-software codesign to create new software implementations of
the dynamics gradient for CPU and GPU platforms. In particular, this work describes
a novel optimized implementation of the gradient of rigid body dynamics for the GPU
showing significant speedups over state-of-the-art CPU and GPU implementations.
We demonstrate the benefits of using hardware-software codesign to refine GPU
designs, improving performance of our own initial design by 2.7× using this process.
We also characterize common codesign pitfalls, such as I/O overhead costs. Finally,
we discuss tradeoffs in targeting CPUs, GPUs, and FPGAs, to provide insight for
90
future work in codesign for these platforms in the field of robotics. We use this work
as a case study to roadmap the process of hardware-software codesign in the context
of a key robotics kernel, to illustrate how researchers can leverage parallel computing
platforms for other robotics applications in future work.
6.1 Hardware-Software Codesign

Hardware-software codesign is the process of achieving system-level goals by leveraging
synergy between hardware and software [15]. This includes the deliberate mapping
of an application to a specific hardware platform, and thoughtful partitioning of the
overall problem between software and hardware.
CPU, GPU, and FPGA computing systems offer different design tradeoffs (Figure 2-
3), and different types of algorithmic features map better and worse to each platform.
To design high-performance, hardware-optimized implementations of a given algorithm,
it is necessary to understand both the common canonical patterns found in the
algorithm, and the overall end-to-end application in which the algorithm will be run.
In this section, we identify some such features present in the gradient of rigid body
dynamics (Algorithm 1). While not an exhaustive list of all possible features, we use
this as an illustrative example of how to evaluate the hardware acceleration potential
of a robotics algorithm. Referencing these features, we then examine the process of
hardware-software codesign for the dynamics gradient on CPU, GPU, and FPGA
hardware platforms.
6.1.1 Algorithmic Features
Table 6.1 summarizes key algorithmic features seen in the computation of the gradient
of rigid body dynamics, and offers qualitative assessments of how well these patterns
map onto CPUs, GPUs, and FPGAs.
Recall that programmer ease-of-use is another tradeoff to consider when choosing
among hardware platforms. As discussed in Section 2.2, CPUs have a well-developed
software ecosystem, making them easy to use for programmers of all experience
91
Table 6.1: Algorithmic features of the gradient of rigid body dynamics and qualitative
assessments of their suitability for different target hardware platforms.
Algorithmic Features CPU GPU FPGA
Thread-Level Parallelism moderate excellent moderate
Fine-Grained Thread Parallelism poor moderate excellent
Fine-Grained Operation Parallelism moderate excellent excellent
Structured Sparsity good moderate excellent
Irregular Memory Accesses moderate poor excellent
Sequential Dependencies good poor good
Small Working Set Size excellent moderate good
Complex Control Flow excellent poor good
I/O Overhead excellent poor poor
levels. GPUs are more challenging to program effectively because of their rigid SIMD
programming model. FPGA development requires a high level of user expertise and
the hardware description code can be difficult to debug.
Thread-Level Parallelism. One algorithmic feature that is readily accelerated

is thread-level parallelism, of either the single-instruction multiple-data (SIMD) or
multiple-instruction multiple-data (MIMD) variety. This type of parallelism features
threads that operate completely independently for a long time before needing to join
or synchronize. Thread-level parallelism lends itself well to execution in separate
software threads on a CPU or GPU, and to parallel datapaths on an FPGA. GPUs in
particular are designed with SIMD thread-level parallelism as their primary strength.
For the dynamics gradient, SIMD thread-level parallelism is available from the
high-level usage and invocation of the algorithm. Many modern nonlinear MPC
implementations have a step which requires tens to hundreds of embarassingly parallel
computations of the gradient of rigid body dynamics [17, 21, 65, 86, 102], one for each
motion trajectory time step. While heavy SIMD parallelism is available, there is not
much native opportunity for MIMD thread-level parallelism in the original algorithm.
Fine-Grained Parallelism. Within the thread-level parallel SIMD threads of the

dynamics gradient algorithm, there are opportunities for smaller-scale, fine-grained
parallelism. We include both parallelism between short-duration threads and par-
allelism between mathematical operations in this category. The difference between
92
thread-level and fine-grained parallelism is the timescale: fine-grained parallelism
is of shorter duration than thread-level parallelism, existing between short threads
of execution and mathematical operations for brief periods of time before reaching
synchronization points.
The key to exploiting fine-grained parallelism between threads on CPU and GPU
platforms is finding enough independent work to perform before arriving at synchro-
nization points to amortize the overhead of distributing the work to different threads.
This can be challenging on a CPU, where thread overheads are substantial. On a GPU,
if calculations can be structured to spread across independent entries of large data
arrays, then GPU platforms can excel at fine-grained parallelism. However, if there are
frequent sequential data dependencies, then GPUs arrive at too many synchronization
points to use fine-grained parallelism effectively. On an FPGA, the hardware can be
designed to exploit many patterns of fine-grained parallelism natively in the routing of
the processor datapaths, making these platforms excellent at this type of parallelism.
As described in Section 5.1.1, within a single computation each column 𝑗 of the
computation of 𝜕 𝑞¨/𝜕𝑢𝑗 is fully independent, and can be computed in parallel using
separate SIMD threads until synchronization at the 𝑀 −1 multiplication step of the
algorithm. If the calculation of these partial derivatives is structured to have sequential
dependencies within them, e.g., from iterative loops, then it is difficult to exploit this
parallelism at the thread level, and it is available only as fine-grained parallelism.
The dynamics gradient algorithm also contains opportunities for fine-grained par-
allelism between low-level mathematical operations. For example, there is parallelism
to be found between some matrix-vector multiplication operations, and parallelism
between the element-wise multiplications within a single matrix-vector multiplication
itself. Depending on the structure of the implementation, GPUs can offer high perfor-
mance for fine-grained parallelism between mathematical operations, because they
have a large number of wide vector arithmetic units. FPGAs also have a large number
of arithmetic units woven into the computational fabric. CPUs offer only a small
number of parallel arithmetic units, limited to vector operations of modest width.
Structured Sparsity. Optimizations in previous software implementations of rigid
93
body dynamics and their gradients [10, 35] exploit structured sparsity to increase
performance. For example, robots with only revolute joints can be described with
transformations such that all 𝑆𝑖 = [0, 0, 1, 0, 0, 0]𝑇 . In this way, all computations that
are right multiplied by 𝑆𝑖 can simply be reduced to only computing and extracting
the third column or value.
There is also structured sparsity in the transformation matrices 𝑖 𝑋𝜆𝑖 , and cross
product matrices × and ×* (shown in Equation 6.1).
As mentioned in Chapter 2 and Chapter 5, spatial algebra represents most kine-

matic and dynamic quantities as operations over vectors in R6 and matrices in R6×6 .
Quantities are defined in the frame of each rigid body which are numbered 𝑖 = 1 to 𝑛
such that each body’s parent 𝜆𝑖 is a lower number. Transformations of quantities from
frame 𝜆𝑖 to 𝑖 are denoted as 𝑖 𝑋𝜆𝑖 and can be constructed from the rotation matrix
and transformation vector between the two coordinate frames which themselves are
functions of the joint position 𝑞𝑖 between those frames and constants derived from the
robot’s topology.
Spatial algebra also uses spatial cross product operators × and ×* , in which a
vector is re-ordered into a sparse matrix, and then a standard matrix multiplication
is performed. For example, the operation 𝑣 × 𝑤 can be solved by computing the
standard matrix vector multiplication between 𝑣× and 𝑤. The reordering is shown in
Equation 6.1 for a vector 𝑣 ∈ R6 :
0 −𝑣[2] 𝑣[1] 0 0 0
⎡ ⎤
⎢ 𝑣[2]
⎢ 0 −𝑣[0] 0 0 0 ⎥
⎥
⎢ −𝑣[1] 𝑣[0] 0 0 0 0 ⎥
𝑣× = ⎢
⎢ 0 −𝑣[5] 𝑣[4]
⎥
0 −𝑣[2] 𝑣[1] ⎥
⎢
⎣ 𝑣[5] 0 −𝑣[3] 𝑣[2] 0 −𝑣[0] ⎦
⎥ (6.1)
−𝑣[4] 𝑣[3] 0 −𝑣[1] 𝑣[0] 0
𝑣×* = −𝑣 ×𝑇 .
Memory Access Patterns. Memory access patterns are another key algorithmic
characteristic that must be considered for high-performance implementations. Regular
access patterns are consistent and predictable, enabling quick and easy computations
94
of memory address locations, and allowing the compiler to leverage striding patterns
of loads and stores, and even “pre-fetch” some anticipated memory loads in advance.
Irregular access patterns, on the other hand, make it difficult to calculate memory
addresses, and lead to increased round trips to memory, hindering performance.
The gradient of rigid body dynamics exhibits regular access patterns through
its ordered loops (assuming the local variables for each link 𝑗 of each knot point
𝑘 are stored regularly in memory). However, the cross product operations require
reorderings of data elements, which lead to irregular memory access patterns.
Sequential Dependencies. The dynamics gradient algorithm has long stretches of
sequential dependencies from loops whose computations depend on the results of the
previous loop iteration. This is seen, for example, in the many references to parent and
child links throughout both Algorithm 2 and Algorithm 3. These sequential chains of
operations do not parallelize and must be computed in order.
Working Set Size. Another algorithm characteristic that impacts hardware imple-
mentations is the size of the active “working set”, the set of memory addresses the
algorithm accesses frequently.
The dynamics gradient algorithm has a relatively small working set. It most
frequently accesses only a few local variables between loop iterations, and a set of
small matrices, in particular, 𝐼𝑖 , and 𝑖 𝑋𝜆𝑖 ∈ R6×6 . This means that the working set
size of algorithm can easily fit into small fast hardware memory structures, e.g., caches.
Control Flow Complexity. The “control flow” refers to conditional switching,
branching, and pointers redirecting the program order of low-level computational
primitive operations. This can also have a large impact on the performance of
a codesigned hardware implementation, by introducing overheads related to e.g.,
instruction address calculations and thread divergence.
The dynamics gradient algorithm can have a number of conditional operations when
implemented to take advantage of structured sparsity patterns, where computation
is skipped for certain elements of data structures. This complexity impacts the
optimizations that can be made by compilers, increases the code size (putting pressure
on instruction caches), and makes the algorithm harder to implement in a SIMD fashion,
95
where all threads are supposed to follow identical control flow without divergence.
I/O Overhead. Finally, an important algorithmic feature to consider for hardware
implementations is the amount of input and output (I/O) data that must be transferred
between computing platforms. The transmission of large amounts of I/O data can
introduce long latency delays in the system, slowing performance.
Depending on the exact partitioning of the algorithm, the gradient of rigid body
dynamics can require a substantial amount of I/O data bits to be transferred, so this
overhead must often be mitigated in the codesign process.
6.1.2 FPGA Codesign
Chapter 5 details our design of an FPGA accelerator implementation of the dynamics

gradient algorithm. In this section, we briefly review the high-level hardware-software
codesign choices that informed that implementation, and relate them to the algorithmic
features we identify in Table 6.1. The design methodology we introduce in Chapter 4
allowed us to systematically identify such features through the lens of robot morphology,
and programmatically map them to architectural hardware structures. Based on these
insights, in the following sections we develop revised software implementations of the
dynamics gradient algorithm for the CPU and GPU platforms.
The computational fabric on an FPGA can be configured to exploit both thread-
level and fine-grained parallelism in the structure of customized functional units and
datapaths. In our design we use this customizable parallelism to, for example, compute
the partial derivatives of ∇ID (Algorithm 3) in parallel datapaths (see Figure 5-1).
We also implemented fine-grained parallelism between (and within) linear algebra
operations.
Using the reconfigurable connections of an FPGA, we were also able to exploit the
structure of sparse data structures like the joint transformation matrices and spatial
cross product matrices. For sparse matrix-vector multiplications, we prune operations
from trees of multipliers and adders (see Figure 5-3). Implementing this sparsity
natively in hardware datapaths also helps our FPGA implementation avoid irregular
memory access patterns that might arise from sparse matrices, encoding them directly
96
in the signal routing.
By creating processing units tailored to the dataflow of the algorithm, our FPGA
implementation of the dynamics gradient streamlines sequential chains of dependencies
between links and minimizes control flow complexity by completely unrolling the
computations and hard-coding their layout in the hardware.
As described in Section 5.3.1, we used a framework called Connectal [50] to
implement the I/O transfer between the FPGA and a host CPU. Connectal’s support
for the FPGA platform we used (the VCU-118) is currently restricted to PCIe Gen1,
limiting the bandwidth of our I/O link.
As a result, one major choice we made in the hardware-software codesign of the
dynamics gradient for our FPGA implementation was partitioning the algorithm
between hardware and software to balance the number of I/O bits required with our
other hardware constraints. See Section 5.2.1 for details on this process, including
quantitative results from intermediate design steps.
Briefly, our main design decision was which steps of Algorithm 1 to include in the
accelerator hardware:
1. Calculate 𝑣, 𝑎, 𝑓 from inverse dynamics (ID);
2. Calculate 𝜕𝜏 /𝜕𝑢, the gradient of inverse dynamics (∇ID); and
3. Recover 𝜕 𝑞¨/𝜕𝑢, the gradient of forward dynamics, from multiplication by the

𝑀 −1 matrix.
After analyzing tradeoffs between different partitionings (see Figure 5-6 and Table 5.2),
we ultimately chose to use a completely-fused kernel, which implements all three steps
of Algorithm 1 on the FPGA.
6.1.3 CPU Codesign
Our CPU implementation is based on previous state-of-the-art work [10, 35]. In order
to efficiently balance the thread-level parallelism exposed by the tens to hundreds of
97
naturally parallel motion trajectory time step computations across a handful of pro-
cessor cores, each core must work through several computations sequentially. Efficient
threading implementations can have a large impact on the final design. We designed a
custom threading wrapper to reuse threads in a “thread pool” instead of continuously
launching and joining threads, leading to a reduction in total computational time. We
use the Eigen library [39] to vectorize many linear algebra operations to take some
advantage of some limited fine-grained parallelism.
We also wrote custom functions to exploit the structured sparsity in the algorithm
using explicit loop unrolling and zero skipping as the current fastest CPU dynamics
package, RobCoGen, does this during its code generation step [35, 75]. While this
approach creates irregular memory accesses patterns, this is not a problem for the CPU
as the values are small enough to fit in the CPU’s cache. The sequential dependencies,
small working sets, and complex control flow are also handled well by the CPU pipeline
and memory hierarchy. With all operations occurring on the CPU, there are no I/O
overheads and no problem partitioning was necessary.
6.1.4 GPU Codesign
The GPU is highly optimized for the natural thread-level parallelism present in our
application. It can natively use blocks of SIMD threads to efficiently take advantage
of thread-level parallelism across independent dynamics gradient computations.
Fine-grained parallelism, on the other hand, is harder to exploit in a GPU. The
state-of-the-art GPU implementation [87] used as a baseline in the evaluation performed
in Section 5.3 has difficulty exploiting parallelism from the small sparse matrices in
∇ID, and even had trouble with parallelism between partial derivatives in ∇ID because
they are independent for only a short duration and contain sequential operations and
irregular memory accesses within them. This forces frequent synchronization points
and leads to overall poor thread occupancy, which is the major performance limitation
of this implementation.
In this section, we present a novel GPU implementation of the dynamics gradient,
using hardware-software codesign and drawing on insights from the FPGA accelerator
98
Algorithm 4 ∇ Inverse Dynamics-GPU w.r.t. 𝑢 = {𝑞, 𝑞}.
˙ Optimized GPU imple-
mentation, incorporating insights from our FPGA design (see Chapter 5) and CPU
implementations from previous work [9].
1: for link 𝑖 = 1 : 𝑛 in parallel do
2: 𝛼𝑖 = (𝑖 𝑋𝜆𝑖 𝑣𝜆𝑖 ) × 𝑆𝑖
3: 𝛽𝑖 = 𝐼𝑖 𝑣𝑖
4: 𝛾𝑖 = 𝑣𝑖 ×* 𝐼𝑖
5: end for
6: for link 𝑖 = 1 : 𝑛 do⎧
⎨𝛼𝑖
⎪
𝜕𝑣𝑖 𝜕𝑣𝜆𝑖
7: 𝜕𝑢
= 𝑖 𝑋 𝜆𝑖 𝜕𝑢
+⎪
⎩𝑆
𝑖
8: end for
9: for link 𝑖 = 1 : 𝑛 in parallel
⎧ do
𝑖
⎨( 𝑋𝜆 𝑎𝜆 ) × 𝑆𝑖
⎪
𝜕𝑣 𝑖 𝑖
10: 𝜌𝑖 = 𝜕𝑢𝜆𝑖 × 𝑆𝑖 𝑞˙𝑖 +
⎩𝑣 × 𝑆
⎪
𝑖 𝑖
𝜕𝑣𝑖 *
11: 𝜇𝑖 = 𝜕𝑢 ×
12: end for
13: for link 𝑖 = 1 : 𝑛 do
𝜕𝑎𝑖 𝜕𝑎𝜆𝑖
14: 𝜕𝑢
= 𝑖 𝑋 𝜆𝑖 𝜕𝑢
+ 𝜌𝑖
15: end for
16: for link 𝑖 = 1 : 𝑛 in parallel do
𝜕𝑓𝑖
17: 𝜕𝑢
= 𝐼𝑖 𝜕𝑎
𝜕𝑢
𝑖
+ 𝜇𝑖 𝛽𝑖 + 𝛾𝑖 𝜕𝑣
𝜕𝑢
𝑖
18: end for

19: for link 𝑖 = 𝑛 : 1 do
𝜕𝑐′𝑖
20: 𝜕𝑢
= 𝑆𝑖𝑇 𝜕𝑓
𝜕𝑢
𝑖
𝜕𝑓𝜆𝑖
21: 𝜕𝑢
+= 𝜆𝑖 𝑋𝑖 𝜕𝑓
𝜕𝑢
𝑖
+ 𝛾𝑖
22: end for
design process (detailed in Section 6.1.2 and Chapter 5). The core of this novel
GPU implementation is a refactored version of the ∇ID, shown in Algorithm 4.
By being cognizant of algorithmic features (Section 6.1.1), we are able to reduce
thread synchronization points and deliver high performance. We evaluate this novel
implementation in Section 6.2.
Unlike on the CPU and FPGA, on the GPU it is actually faster to not exploit
99
the sparsity in the 𝑋, ×, and ×* matrices. The GPU’s computational model requires
groups of threads within each thread block to compute the same operation on memory
accessed via regular patterns for maximal performance. We created temporary values
for the ×* matrices and then use standard threaded matrix multiplication to compute
the final values. For example, the temporary value 𝜇𝑖 = 𝜕𝑣𝑖 /𝜕𝑢×* takes as an input
𝑛 values of 𝜕𝑣𝑖 /𝜕𝑢, each having 2𝑖 columns ∀𝑖 ∈ (1, 𝑛), and produces a matrix ∈ R6×6
for each column. The most efficient GPU solution is to have each matrix computed
in parallel with a loop over the entries in order to avoid thread divergence in the
control flow, even though each entry in each matrix is naturally parallel. Out-of-order
computations and memory access patterns are inefficient on the GPU because they
limit natural data parallelism, so it can be a worthwhile tradeoff to avoid these, even
at the expense of creating large numbers of temporary variables.
As mentioned above, the sequential dependencies within each dynamics gradient

computation are challenging for the GPU, as it must introduce synchronization points
at every data dependency, reducing the number of separable computations. Therefore,
in our novel implementation we re-ordered the computations to minimize the amount
of work done in synchronized loops, at the expense of generating even more temporary
values. Fortunately, due to the small working set size, even after generating these
additional temporary values we were able to fit everything in the GPU’s shared
memory (cache), minimizing any latency penalties. Despite these optimizations,
due to the highly serial nature and overall complex control flow in the algorithm,
many serial operations and synchronization points still exist in our novel codesigned
implementation (Algorithm 4).
Finally, because data needs to be transferred between the GPU and a host CPU,
I/O is a serious constraint for GPU, just as it was for the FPGA (see Section 6.1.2).
As we did with the FPGA (see Figure 5-6), we again implemented both split and fused
GPU kernels combining different steps of Algorithm 1 to analyze the I/O tradeoffs.
In the split kernel we only computed the most parallel and compute intensive section
of the algorithm on the accelerator, the 𝜕𝑐′ /𝜕𝑢 computation. In the initial fused
kernel we minimized I/O by computing both the 𝑣 ′ , 𝑎′ , 𝑓 ′ and 𝜕𝑐′ /𝜕𝑢 computations
100
on the accelerator. We also designed a completely-fused kernel computing all of
Algorithm 1. This leads to an increase in I/O but reduces synchronization points in
the overall algorithm. To further reduce I/O in our GPU and FPGA implementations,
we computed the variable values in the transformation matrices 𝑋 on-board, even in
the split kernel where it was already computed on the CPU.
On the GPU we used the NVIDIA CUDA library’s [80] built-in functions for
transferring data in bulk to and from the GPU over PCIe Gen3. To minimize
synchronization points, we copy all of the needed data over to the GPU memory once,
let it run, and then copy all of the results back. Finally, to better leverage the PCIe
data bus, we ensured that all values that needed to be transferred were stored as a
single contiguous block of memory.
6.2 Evaluation
In this section, we compare the performance of our FPGA implementation of the
rigid body dynamics gradient developed in Chapter 5 to our novel hardware-software
codesigned implementations for the CPU and GPU. We evaluate two timing results
to understand the performance of our designs: the latency of a single computation of
the algorithm (similar to Section 5.3.2) and the end-to-end latency, including I/O and
other overheads, to execute a set of 𝑁 computations (similar to Section 5.3.3).
6.2.1 Methodology
Our methodology for these experiments was largely the same as our setup for the
evaluation presented in Section 5.3. Again, all of our designs were implemented using a
Kuka LBR IIWA-14 manipulator as our robot model. CPU results were collected on a
desktop computer with a 3.6 GHz quad-core Intel Core i7-7700 CPU running Ubuntu
16.04.6 and CUDA 11. For GPU and FPGA results, we used host CPUs connected
over PCIe to an NVIDIA GeForce GTX 2080 GPU and a Virtex UltraScale+ VCU-118
FPGA. The FPGA design was synthesized at a clock speed of 55.6 MHz. For clean
timing measurements, we disabled TurboBoost and fixed the clock frequency to the
101
Figure 6-1: Latency of a single computation of the dynamics gradient. Our novel
GPU implementation gives a 5.4× speedup over the previous state-of-the-art GPU
solution (see Figure 5-9). Our CPU implementation is comparable to state-of-the-art.
The FPGA implementation is the same one developed in Chapter 5.
maximum 3.6 GHz. Code was compiled with CUDA 11 and g++7.4 and latency time
was measured with the Linux system call clock_gettime(), using CLOCK_MONOTONIC
as the source. For the single computation measurements, we take the mean of one
million trials for the CPU and GPU data. For the FPGA data, we extract the timing
from cycle counts and the clock frequency. For the end-to-end results, we take the
mean of one hundred thousand trials.
6.2.2 Single Computation Results
The results of the latency for a single computation are presented in Figure 6-1. These
times represent the computation performed in a single run of the codesigned versions
of each algorithm on its target hardware platform excluding overheads. The three
colors represent the steps of Algorithm 1.
Following the optimizations in Section 6.1.3, our CPU implementation is compara-
ble to the well-optimized state of the art [10].
Our GPU implementation is still slower than the CPU implementation in this
102
experiment because GPUs have difficulty with latency sensitive sequential operations
and derive their benefit instead from throughput offered by extreme parallelism. In
this single computation measurement, there is limited opportunity for parallelism.
However, while the 𝑣 ′ , 𝑎′ , 𝑓 ′ latency grows by 4.6×, the 𝜕𝑐′ /𝜕𝑢 latency grows by only
1.6×. This improved scaling is the result of the re-factoring of the 𝜕𝑐/𝜕𝑢 algorithm
(Algorithm 4) done in the codesign process (Section 6.1.4) to expose more SIMD
parallelism and reduce the computations done in synchronized loops. Leveraging these
optimizations, our GPU implementation is 5.4× faster than the state of the art [86].
The FPGA implementation is still much faster than either the CPU or the GPU
for a single computation, despite its slower clock frequency. As in Section 5.3.2, this
is because the dedicated datapaths in the FPGA implementation handle sequential
operations well, and the fine-grained parallelism available in the FPGA allows for
the execution of many densely-packed parallel operations. Furthermore, during the
codesign process (Section 6.1.2), to hide I/O latency, we were able to largely overlap
the computation of 𝑣 ′ , 𝑎′ , 𝑓 ′ and 𝜕 𝑞¨/𝜕𝑢 with the computation of 𝜕𝑐′ /𝜕𝑢. Therefore,
the 𝑣 ′ , 𝑎′ , 𝑓 ′ and 𝜕 𝑞¨/𝜕𝑢 times shown represent only the small latency penalty incurred
to accommodate this design choice.
6.2.3 End-to-End Results
We collected end-to-end latency results from running the entire algorithm for 𝑁 = 16,
64, 128, and 256 computations because, as mentioned earlier, many nonlinear MPC
applications use tens to hundreds of motion trajectory time steps. These results are
shown in Figures 6-2, and 6-3. These end-to-end numbers include all computations
of the dynamics gradient kernel as well as the time for I/O overhead needed to
communicate with a host CPU (for the GPU and FPGA). In each of the figures there
are groups of bars for each of the 𝑁 total motion trajectory time step values.
Figure 6-2 compares (going left to right within each grouping) the “[s]plit”, “[f]used”,
and “[c]ompletely fused” [G]PU kernels developed during the codesign problem parti-
tioning process described in Section 6.1.4. Moving more of the computation onto the
GPU increases the latency for a single computation (as also seen in Figure 6-1), and
103
Figure 6-2: Throughput of 𝑁 = 16, 64, 128, and 256 computations of the gradient of
robot dynamics for the iiwa manipulator for the (G)PU using (s)plit and (f)used and
(c)ompletely-fused kernels. Using hardware-software codesign to move from the split
to completely-fused kernel improved our overall design latency by at least 2.8× for all
values of 𝑁 .
Figure 6-3: Throughput of 𝑁 = 16, 64, 128, and 256 computations of the gradient of
robot dynamics for the iiwa manipulator for the (C)PU, (G)PU, and (F)PGA using
the (c)ompletely-fused codesigned kernel partitionings. Our FPGA implementation
excels at lower numbers of computations, but its throughput restrictions limit perfor-
mance at higher numbers. Our novel GPU implementation demonstrates consistent
speedups over the CPU, and surpasses the FPGA performance at higher numbers of
computations.
104
the 𝜕𝑐′ /𝜕𝑢 computation is by far the most expensive. However, because our optimized
GPU implementation can readily perform most of the computations in parallel, the
total end-to-end latency increase is quite small. This can be seen in the slight increase
from the blue to the green to the teal bar in each grouping. At the same time, due
to the optimized memory layouts in our novel GPU implementation and the high
bandwidth PCIe 3.0 connection, the changes in I/O latency are also relatively small
when moving between the various design points.
Performing more computation on the GPU reduces the frequency of high-level
algorithmic synchronization points between the CPU and GPU, and also allows our
optimized GPU implementation to accelerate the computations otherwise done on the
CPU (orange and yellow bars), improving overall end-to-end latency. Through our
codesign approach we were able to improve the end-to-end time of the GPU designs
moving from the “Gs” to “Gc” kernels by at least 2.8× for all values of 𝑁 .
Figure 6-3 compares the CPU to [c]ompletely-fused [G]PU and [F]PGA implemen-
tations. Within each group, the first bar is the CPU design, where the entire algorithm
is computed in SIMD threads. These bars scale superlinearly with increasing 𝑁 , as
the multiple cores on the machine become overtasked with the increasing number of
computations.
The second bar is the “Gc” kernel which was able to provide a 1.5× to 3.8×
performance improvement over the CPU design for 𝑁 = 16, 256 respectively. The
GPU performed better as the number of computations increased and was the top
performer at higher numbers of computations (𝑁 = 128, 256), where its support for
massive parallelism prevailed.
The third bar is the “Fc” kernel, representing our FPGA accelerator implementation
from Chapter 5, which was again able to outperform the CPU in all cases by leveraging
its custom datapaths, pipelining of computations, and exploitation of fine-grained
parallelism (Section 6.1.2). It also remained the top performer at lower numbers of
computations (𝑁 = 16, 64), but was surpassed at higher numbers of computations
by our novel GPU implementation because of throughput limitations that keep it
from exploiting thread-level parallelism. Recall that we only compute one dynamics
105
gradient calculation at a time in our current FPGA design. (See Section 5.2.4 for
suggested design improvements to address this limitation.)
6.3 Contributions and Future Work

In this work, we systematically explored the process of hardware-software codesign
for CPU, GPU, and FPGA implementations of the gradient of rigid body dynamics.
We showed that using these codesigned implementations can lead to performance
improvements both in the runtime of individual kernels and in overall end-to-end
coprocessor system timing. Our best GPU and FPGA performance results show mean-
ingful runtime improvements over our optimized CPU design. However, it is significant
that these results were achieved by exploiting opposing design tradeoffs. Our best
GPU design takes advantage of embarrassing parallelism between computations, using
algorithmic refactoring and problem partitioning to emphasize the GPU platform’s
strength at delivering throughput from thread-level parallelism. By contrast, our
FPGA design emphasizes low latency through custom functional units and datap-
aths that exploit opportunities for fine-grained parallelism. Further FPGA design
optimizations are still needed to take full advantage of throughput from thread-level
parallelism (see Section 5.2.4). Leveraging different tradeoffs for success on different
hardware platforms demonstrates the importance of hardware-software codesign in
achieving high computational performance.
We hope that this work provides a roadmap for researchers to apply hardware-
software codesign for hardware platforms such as GPUs and FPGAs to other compu-
tational bottlenecks in robotics. Promising directions for future work include further
refinement of the GPU and FPGA designs described in this work (some suggestions
for improving the FPGA implementation in particular are given in Section 5.2.4).
These designs can also be extended to higher dimensional robotic systems, such as
quadrupeds and humanoids (as will be discussed in Chapter 7).
106
Chapter 7
Discussion and Future Work
7.1 Targeting Different Robot Models
In Chapter 5, we evaluated an accelerator that was customized to target the iiwa

industrial manipulator [58]. Here, we describe how the accelerator design would change
if it were implemented for other robot models with different joint types and topologies
of limbs and links. Because we used robomorphic computing to design a parameterized
accelerator template (see Chapter 4 and Chapter 5), it is systematic and simple to
customize the template to different robots.
Like the iiwa joint in Figure 5-3, additional examples of joints on real robots
are shown in Figure 7-1: the left hind knee of HyQ [95], and the right shoulder of
Atlas [7]. Each joint’s transformation matrix has a fixed sparsity pattern. Using
robomorphic computing, these sparsity patterns directly program the structure of
sparse matrix-vector multiplication units by pruning a tree of multipliers and adders.
The topology of limbs and links in a robot model parameterizes the parallelism
exposed in the hardware template. Some examples of different robot model parameters
are shown in Table 7.1. For example, if we target our dynamics gradient template to
the HyQ quadruped, the customized accelerator will have 4 parallel limb processors,
each with 3 parallel datapaths (one per link). The limb outputs will be periodically
synchronized at a central torso processor to combine their overall impact.
107
Figure 7-1: Other examples of joints on real robots [7, 58, 95]. The transformation ma-
trices of these joints exhibit different sparsity patterns, which robomorphic computing
translates into sparse matrix-vector multiplication functional units.
108
Table 7.1: Examples of different model parameters for real robots (pictured in Figure 7-
1).
Robot iiwa [58] HyQ [95] Atlas [7]
Type manipulator quadruped humanoid
Total Limbs 1 4 4
Total Links 7 12 30
7.2 Other Robotics Applications

The robomorphic computing design methodology can be applied to other critical
robotics applications that draw on robot morphology information, including collision
detection, localization, kinematics, and dynamics for flexible “soft” robots [5,24,59,82].
These algorithms are tightly coupled with the physical properties of the target robot.
A parameterized template only needs to be created once per algorithm.
7.3 Automating the Methodology

The most significant contribution of robomorphic computing is that while we performed
it manually as a proof-of-concept in Chapter 5, it is a systematic flow (Figure 4-4)
that can be automated in future work.
The design of the parameterized hardware template can be automated using
a domain-specific language and a high-level synthesis (HLS) flow [2]. These HLS
flows use a combination of domain-specific libraries of hand-optimized RTL modules
(written once by a hardware engineer) and high-level languages that parameterize and
instantiate those modules. Robomorphic computing could be integrated into one of
those existing frameworks.
Setting the parameters per-robot to create customized accelerators is also simple
to automate. The necessary parameters are already parsed and extracted from robot
description files by existing robot dynamics software libraries [11,66]. These parameters
can be used in an HLS flow to automatically output customized hardware. Users can
create accelerators without intervention from roboticists or hardware engineers.
109
110
Chapter 8
Conclusion
We demonstrated the systematic design of a hardware accelerator parameterized by

robot morphology addressing a major bottleneck in online optimal motion planning
and control, the gradient of rigid body dynamics [9, 87]. In this accelerator, we exploit
robot structure to determine parallelism and sparsity in the accelerator microarchi-
tecture, transforming robotics software optimizations [9, 66] to the hardware domain.
We implemented the design on an FPGA for a manipulator as a proof of concept,
demonstrating a meaningful speedup over off-the-shelf CPU and GPU solutions. We
also collected power and area results for a simulated ASIC implementation of the
accelerator. Inspired by the performance improvements yielded by the FPGA design
process, we used hardware-software codesign to create new software implementations
of the dynamics gradient for CPU and GPU platforms. We generalized our design
methodology to prescribe an algorithmic design flow to implement accelerator hard-
ware parameterized by robot morphology. Finally, we proposed how that methodology
can be expanded upon and automated in future work, to broaden its applicability to
other robot models and algorithms.
Robomorphic computing provides a systematic and reliable shortcut to the tradi-
tional hardware accelerator design process, which is otherwise tedious, error-prone,
and requires substantial intervention from domain experts.
Our accelerator for the gradient of rigid body dynamics represents meaningful
progress towards real-time, online motion planning and control for complex robots,
111
the performance of which is limited by current software solutions.
Using robomorphic computing to shrink this performance gap will allow robots to
plan further into the future, helping them to safely interact with people in dynamic,
unstructured, and unpredictable environments. This is a critical step towards enabling
robots to realize their potential to address important societal challenges from elder
care [48, 96], to the health and safety of humans in hazardous environments [62, 110].
112
Bibliography
[1] Thomas Antony and Michael J. Grant. Rapid Indirect Trajectory Optimization
on Highly Parallel Computing Architectures. 54(5):1081–1091.
[2] Oriol Arcas-Abella, Geoffrey Ndu, Nehir Sonmez, Mohsen Ghasempour, Adria
Armejach, Javier Navaridas, Wei Song, John Mawer, Adrián Cristal, and Mikel
Luján. An empirical evaluation of high-level synthesis languages and tools
for database acceleration. In 2014 24th International Conference on Field
Programmable Logic and Applications (FPL), pages 1–8. IEEE, 2014.
[3] Nuzhet Atay and Burchan Bayazit. A motion planning processor on recon-
figurable hardware. In Proceedings 2006 IEEE International Conference on
Robotics and Automation, 2006. ICRA 2006., pages 125–132. IEEE, 2006.
[4] Jeff Bezanson, Stefan Karpinski, Viral B Shah, and Alan Edelman. Julia: A
fast dynamic language for technical computing. arXiv preprint arXiv:1209.5145,
2012.
[5] Michael Bloesch, Marco Hutter, Mark A Hoepflinger, Stefan Leutenegger, Chris-
tian Gehring, C David Remy, and Roland Siegwart. State estimation for legged
robots-consistent fusion of leg kinematics and imu. Robotics, 17:17–24, 2013.
[6] Boston Dynamics. Atlas - the world’s most dynamic humanoid, accessed in
2018.
[7] Boston Dynamics. Atlas® | boston dynamics, Accessed in 2020. Available:

bostondynamics.com/atlas.
[8] Boston Dynamics. Spot® | boston dynamics, Accessed in 2020. Available:

bostondynamics.com/spot.
[9] Justin Carpentier and Nicolas Mansard. Analytical derivatives of rigid body
dynamics algorithms. Robotics: Science and Systems, 2018.
[10] Justin Carpentier and Nicolas Mansard. Analytical derivatives of rigid body
dynamics algorithms. In Robotics: Science and Systems, 2018.
[11] Justin Carpentier, Florian Valenza, Nicolas Mansard, et al. Pinocchio: fast
forward and inverse dynamics for poly-articulated systems, 2015–2018.
113
[12] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial architecture for
energy-efficient dataflow for convolutional neural networks. In ISCA. ACM/IEEE,
2016.
[13] Nikolaus Correll, Kostas E Bekris, Dmitry Berenson, Oliver Brock, Albert Causo,
Kris Hauser, Kei Okada, Alberto Rodriguez, Joseph M Romano, and Peter R
Wurman. Analysis and observations from the first amazon picking challenge.
IEEE Transactions on Automation Science and Engineering, 15(1):172–188,
2018.
[14] Simon Danisch and contributors. PackageCompiler.jl, 2018.
[15] G De Michell and Rajesh K Gupta. Hardware/software co-design. Proceedings
of the IEEE, 85(3):349–365, 1997.
[16] Jared Di Carlo, Patrick M Wensing, Benjamin Katz, Gerardo Bledt, and Sangbae
Kim. Dynamic locomotion in the mit cheetah 3 through convex model-predictive
control. In 2018 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pages 1–9. IEEE, 2018.
[17] Jared Di Carlo, Patrick M Wensing, Benjamin Katz, Gerardo Bledt, and Sangbae
Kim. Dynamic locomotion in the mit cheetah 3 through convex model-predictive
control. In 2018 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pages 1–9. IEEE, 2018.
[18] Moritz Diehl, Hans Joachim Ferreau, and Niels Haverbeke. Efficient numerical
methods for nonlinear mpc and moving horizon estimation. In Nonlinear model
predictive control, pages 391–417. Springer, 2009.
[19] Jack Doweck, Wen-Fu Kao, Allen Kuan-yu Lu, Julius Mandelblat, Anirudha
Rahatekar, Lihu Rappoport, Efraim Rotem, Ahmad Yasin, and Adi Yoaz. Inside
6th-generation intel core: new microarchitecture code-named skylake. IEEE
Micro, 37(2):52–62, 2017.
[20] Boston Dynamics. More parkour atlas, 2019.
[21] T. Erez, K. Lowrey, Y. Tassa, V. Kumar, S. Kolev, and E. Todorov. An
integrated system for real-time model predictive control of humanoid robots. In
2013 13th IEEE-RAS International Conference on Humanoid Robots.
[22] Tom Erez, Kendall Lowrey, Yuval Tassa, Vikash Kumar, Svetoslav Kolev, and
Emanuel Todorov. An integrated system for real-time model predictive control of
humanoid robots. In 2013 13th IEEE-RAS International conference on humanoid
robots (Humanoids), pages 292–299. IEEE, 2013.
[23] Tom Erez, Yuval Tassa, and Emanuel Todorov. Simulation tools for model-based
robotics: Comparison of bullet, havok, mujoco, ode and physx. In 2015 IEEE
international conference on robotics and automation (ICRA), pages 4397–4404.
IEEE, 2015.
114
[24] Christer Ericson. Real-time collision detection. CRC Press, 2004.
[25] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam,
and Doug Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings
of the 38th Annual International Symposium on Computer Architecture, ISCA
’11, pages 365–376. ACM.
[26] F. Farshidian, E. Jelavic, A. Satapathy, M. Giftthaler, and J. Buchli. Real-time

motion planning of legged robots: A model predictive control approach. In 2017
IEEE-RAS 17th International Conference on Humanoid Robotics.
[27] Farbod Farshidian, Edo Jelavic, Asutosh Satapathy, Markus Giftthaler, and
Jonas Buchli. Real-time motion planning of legged robots: A model predic-
tive control approach. In 2017 IEEE-RAS 17th International Conference on
Humanoid Robotics (Humanoids), pages 577–584. IEEE, 2017.
[28] Eyal Fayneh, Marcelo Yuffe, Ernest Knoll, Michael Zelikson, Muhammad
Abozaed, Yair Talker, Ziv Shmuely, and Saher Abu Rahme. 14nm 6th-generation
core processor soc with low power consumption and improved performance. In
2016 IEEE International Solid-State Circuits Conference (ISSCC), pages 72–73.
IEEE, 2016.
[29] Roy Featherstone. Rigid Body Dynamics Algorithms. Springer.
[30] Roy Featherstone. The calculation of robot dynamics using articulated-body

inertias. The International Journal of Robotics Research, 2(1):13–30, 1983.
[31] Roy Featherstone. Rigid body dynamics algorithms. Springer, 2008.
[32] Tom Feist. Vivado design suite. White Paper, 5:30, 2012.
[33] Martin L. Felis. RBDL: an efficient rigid-body dynamics library using recursive
algorithms. Autonomous Robots, pages 1–17, 2016.
[34] Ambrose Finnerty and Hervé Ratigner. Reduce power and cost by converting
from floating point to fixed point. 2017.
[35] Marco Frigerio, Jonas Buchli, Darwin G. Caldwell, and Claudio Semini. RobCo-
Gen: A code generator for efficient kinematics and dynamics of articulated
robots, based on Domain Specific Languages. 7(1):36–54.
[36] Gianluca Garofalo, Christian Ott, and Alin Albu-Schäffer. On the closed
form computation of the dynamic matrices and their differentiations. In 2013
IEEE/RSJ International Conference on Intelligent Robots and Systems, pages
2364–2359. IEEE, 2013.
[37] Markus Giftthaler, Michael Neunert, Markus Stäuble, Marco Frigerio, Claudio
Semini, and Jonas Buchli. Automatic differentiation of rigid body dynamics for
optimal control and estimation. Advanced Robotics, 31(22):1225–1237, 2017.
115
[38] Markus Giftthaler, Michael Neunert, Markus Stäuble, Jonas Buchli, and Moritz
Diehl. A Family of Iterative Gauss-Newton Shooting Methods for Nonlinear
Optimal Control.
[39] Gaël Guennebaud, Benoıt Jacob, Philip Avery, Abraham Bachrach, Sebastien
Barthelemy, et al. Eigen v3, 2010.
[40] Song Han, Huizi Mao, and William Dally. Deep compression: Compressing deep
neural networks with pruning, trained quantization and huffman coding. In The
International Conference on Learning Representations (ICLR), 10 2016.
[41] Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer
Jaleel, Edgar Solomonik, Joel Emer, and Christopher W Fletcher. Extensor:
An accelerator for sparse tensor algebra. In Proceedings of the 52nd Annual
IEEE/ACM International Symposium on Microarchitecture, pages 319–333, 2019.
[42] S. Heinrich, A. Zoufahl, and R. Rojas. Real-time trajectory optimization under

motion uncertainty using a GPU. In 2015 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pages 3572–3577.
[43] Carsten Heinz, Yannick Lavan, Jaco Hofmann, and Andreas Koch. A catalog
and in-hardware evaluation of open-source drop-in compatible risc-v softcore
processors. In 2019 International Conference on ReConFigurable Computing
and FPGAs (ReConFig), pages 1–8. IEEE, 2019.
[44] John L Hennessy and David A Patterson. A new golden age for computer
architecture. Communications of the ACM, 2019.
[45] B. Ichter, E. Schmerling, A. a Agha-mohammadi, and M. Pavone. Real-time

stochastic kinodynamic motion planning via multiobjective search on GPUs. In
2017 IEEE International Conference on Robotics and Automation.
[46] Intel Inc. Intel® 64 and ia-32 architectures software developer’s manual. Volume
4: Model-Specific Registers., (No. 335592-067US), 2018.
[47] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal,
Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick
Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley,
Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra
Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg,
John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski,
Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen
Kumar, Steve Lacy, James Laudon, James Law, iemthu DLe, Chris Leary,
Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore,
Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni,
Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps,
Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory
Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes
116
Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan,
Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter
performance analysis of a tensor processing unit. In ISCA. ACM/IEEE, 2017.
[48] Claudia Kalb. Could a robot care for grandma? National Geographic, Jan 2020.
[49] Myron King, Jamey Hicks, and John Ankcorn. Software-driven hardware
development. In Proceedings of the 2015 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, pages 13–22, 2015.
[50] Myron King, Jamey Hicks, and John Ankcorn. Software-driven hardware
development. In Proceedings of the 2015 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, pages 13–22, 2015.
[51] Jonas Koenemann, Andrea Del Prete, Yuval Tassa, Emanuel Todorov, Olivier
Stasse, Maren Bennewitz, and Nicolas Mansard. Whole-body Model-Predictive
Control applied to the HRP-2 Humanoid. In Proceedings of the IEEERAS
Conference on Intelligent Robots.
[52] Jonas Koenemann, Andrea Del Prete, Yuval Tassa, Emanuel Todorov, Olivier
Stasse, Maren Bennewitz, and Nicolas Mansard. Whole-body model-predictive
control applied to the hrp-2 humanoid. In 2015 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pages 3346–3351. IEEE,
2015.
[53] Twan Koolen and contributors. RigidBodyDynamics.jl, 2018.
[54] Dimitris Kouzoupis, Rien Quirynen, Boris Houska, and Moritz Diehl. A Block
Based ALADIN Scheme for Highly Parallelizable Direct Optimal Control. In
Proceedings of the American Control Conference.
[55] Eric Krotkov, Douglas Hackett, Larry Jackel, Michael Perschbacher, James
Pippine, Jesse Strauss, Gill Pratt, and Christopher Orlowski. The DARPA
robotics challenge finals: results and perspectives. Journal of Field Robotics,
34(2):229–240, 2017.
[56] Scott Kuindersma, Robin Deits, Maurice Fallon, Andrés Valenzuela, Hongkai Dai,
Frank Permenter, Twan Koolen, Pat Marion, and Russ Tedrake. Optimization-
based locomotion planning, estimation, and control design for the atlas humanoid
robot. Autonomous Robots, 40(3):429–455, 2016.
[57] KUKA AG. LBR iiwa, accessed in 2018.
[58] KUKA AG. Lbr iiwa | kuka ag, Accessed in 2020. Available: kuka.com/
products/robotics-systems/industrial-robots/lbr-iiwa.
[59] Steven M LaValle. Planning algorithms. Cambridge university press, 2006.
117
[60] Ruige Li, Xiangcai Huang, Sijia Tian, Rong Hu, Dingxin He, and Qiang Gu.
Fpga-based design and implementation of real-time robot motion planning.
In 2019 9th International Conference on Information Science and Technology
(ICIST), pages 216–221. IEEE, 2019.
[61] Shiqi Lian, Yinhe Han, Xiaoming Chen, Ying Wang, and Hang Xiao. Dadu-p:
A scalable accelerator for robot motion planning in a dynamic environment. In
2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pages 1–6.
IEEE, 2018.
[62] Courtney Linder. A cave is no place for humans, so darpa is sending in the
robots. Popular Mechanics, Aug 2019.
[63] J Luh, M Walker, and R Paul. Resolved-acceleration control of mechanical

manipulators. IEEE Trans. on Automatic Control, 25(3):468–474, 1980.
[64] John YS Luh, Michael W Walker, and Richard PC Paul. On-line computational
scheme for mechanical manipulators. Journal of Dynamic Systems, Measurement,
and Control, 102(2):69–76, 1980.
[65] M. Neunert, F. Farshidian, A. W. Winkler, and J. Buchli. Trajectory Opti-

mization Through Contacts and Automatic Gait Discovery for Quadrupeds.
2(3):1502–1509.
[66] Frigerio Marco, Buchli Jonas, Darwin G Caldwell, and Semini Claudio. RobCo-
Gen: a code generator for efficient kinematics and dynamics of articulated
robots, based on domain specific languages. Journal of Software Engineering in
Robotics, 7(1):36–54, 2016.
[67] Carlos Mastalli, Rohan Budhiraja, Wolfgang Merkt, Guilhem Saurel, Bilal
Hammoud, Maximilien Naveau, Justin Carpentier, Ludovic Righetti, Sethu
Vijayakumar, and Nicolas Mansard. Crocoddyl: An Efficient and Versatile
Framework for Multi-Contact Optimal Control. In IEEE International Confer-
ence on Robotics and Automation (ICRA), 2020.
[68] Ian McInerney, George A Constantinides, and Eric C Kerrigan. A survey of the
implementation of linear model predictive control on fpgas. IFAC-PapersOnLine,
51(20):381–387, 2018.
[69] S. Murray, W. Floyd-Jones, Y. Qi, G. Konidaris, and D. J. Sorin. The microar-

chitecture of a real-time robot motion planning accelerator. In 2016 49th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), pages
1–12.
[70] Sean Murray, Will Floyd-Jones, George Konidaris, and Daniel J Sorin. A
programmable architecture for robot motion planning acceleration. In 2019 IEEE
30th International Conference on Application-specific Systems, Architectures and
Processors (ASAP), volume 2160, pages 185–188. IEEE, 2019.
118
[71] Sean Murray, Will Floyd-Jones, Ying Qi, Daniel J. Sorin, and George Konidaris.
Robot Motion Planning on a Chip. In Robotics: Science and Systems.
[72] Sean Murray, William Floyd-Jones, Ying Qi, George Konidaris, and Daniel J
Sorin. The microarchitecture of a real-time robot motion planning accelerator.
In MICRO. IEEE/ACM, 2016.
[73] Maximilien Naveau, Justin Carpentier, Sébastien Barthelemy, Olivier Stasse, and
Philippe Souères. Metapod: Template meta-programming applied to dynamics:
Cop-com trajectories filtering. In Humanoid Robots (Humanoids), 2014 14th
IEEE-RAS International Conference on, pages 401–406. IEEE, 2014.
[74] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight
dynamic binary instrumentation. In ACM Sigplan notices, volume 42, pages
89–100. ACM, 2007.
[75] Sabrina Neuman, Twan Koolen, Jules Drean, Jason Miller, and Srini Devadas.
Benchmarking and workload analysis of robot dynamics algorithms. In 2019
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
2019.
[76] Sabrina M Neuman, Twan Koolen, Jules Drean, Jason E Miller, and Srinivas
Devadas. Benchmarking and workload analysis of robot dynamics algorithms. In
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
IEEE, 2019.
[77] M. Neunert, C. de Crousaz, F. Furrer, M. Kamel, F. Farshidian, R. Siegwart,
and J. Buchli. Fast nonlinear Model Predictive Control for unified trajectory
optimization and tracking. In 2016 IEEE International Conference on Robotics
and Automation (ICRA), pages 1398–1404.
[78] Michael Neunert, Cédric De Crousaz, Fadri Furrer, Mina Kamel, Farbod Farshid-
ian, Roland Siegwart, and Jonas Buchli. Fast nonlinear model predictive control
for unified trajectory optimization and tracking. In Robotics and Automation
(ICRA), 2016 IEEE International Conference on, pages 1398–1404. IEEE, 2016.
[79] Michael Neunert, Farbod Farshidian, Alexander W Winkler, and Jonas Buchli.
Trajectory optimization through contacts and automatic gait discovery for
quadrupeds. IEEE Robotics and Automation Letters, 2(3):1502–1509, 2017.
[80] NVIDIA. NVIDIA CUDA C Programming Guide. Version 9.1 edition.
[81] Shimpei Ohyama and Hisashi Date. Parallelized nonlinear model predictive
control on gpu. In 2017 11th Asian Control Conference (ASCC), pages 1620–1625.
IEEE, 2017.
[82] Cagdas D Onal and Daniela Rus. Autonomous undulatory serpentine locomotion
utilizing body dynamics of a fluidic soft robot. Bioinspiration & biomimetics,
8(2):026003, 2013.
119
[83] Zherong Pan, Bo Ren, and Dinesh Manocha. Gpu-based contact-aware trajectory
optimization using a smooth force model. In Proceedings of the 18th annual
ACM SIGGRAPH/Eurographics Symposium on Computer Animation, page 4.
ACM, 2019.
[84] Zherong Pan, Bo Ren, and Dinesh Manocha. Gpu-based contact-aware trajectory
optimization using a smooth force model. In Proceedings of the 18th Annual
ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’19,
pages 4:1–4:12, New York, NY, USA, 2019. ACM.
[85] C. Park, J. Pan, and D. Manocha. Real-time optimization-based planning in
dynamic environments using GPUs. In 2013 IEEE International Conference on
Robotics and Automation, pages 4090–4097.
[86] Brian Plancher and Scott Kuindersma. A Performance Analysis of Parallel
Differential Dynamic Programming on a GPU. In International Workshop on
the Algorithmic Foundations of Robotics (WAFR).
[87] Brian Plancher and Scott Kuindersma. A performance analysis of parallel
differential dynamic programming on a gpu. In International Workshop on the
Algorithmic Foundations of Robotics (WAFR), 2018.
[88] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy
Leibs, Rob Wheeler, and Andrew Y Ng. Ros: an open-source robot operating
system. In ICRA workshop on open source software, volume 3, page 5. Kobe,
Japan, 2009.
[89] Nicolaus A Radford, Philip Strawser, Kimberly Hambuchen, Joshua S Mehling,
William K Verdeyen, A Stuart Donnan, James Holley, Jairo Sanchez, Vienny
Nguyen, Lyndon Bridgwater, et al. Valkyrie: NASA’s first bipedal humanoid
robot. Journal of Field Robotics, 32(3):397–419, 2015.
[90] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M.
Hernández-Lobato, G. Y. Wei, and D. Brooks. Minerva: Enabling Low-Power,
Highly-Accurate Deep Neural Network Accelerators. In 2016 ACM/IEEE 43rd
Annual International Symposium on Computer Architecture (ISCA), pages 267–
278.
[91] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang
Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David
Brooks. Minerva: Enabling low-power, highly-accurate deep neural network
accelerators. In ISCA. ACM/IEEE, 2016.
[92] RobCoGen team. urdf2kindsl, 2018.
[93] Jacob Sacks, Divya Mahajan, Richard C Lawson, and Hadi Esmaeilzadeh. Robox:
An end-to-end solution to accelerate autonomous control in robotics. In 2018
ACM/IEEE 45th Annual International Symposium on Computer Architecture
(ISCA), pages 479–490. IEEE, 2018.
120
[94] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp
Moritz. Trust region policy optimization. In International Conference on
Machine Learning, volume 37, pages 1889–1897, 2015.
[95] Claudio Semini. HyQ - design and development of a hydraulically actuated
quadruped robot. Doctor of Philosophy (Ph. D.), University of Genoa, Italy,
2010.
[96] Jonathan Shaw. The coming eldercare tsunami. Harvard Magazine, Jan 2020.
[97] Michael A Sherman and Dan E Rosenthal. SD/FAST, 2013.
[98] Xuesong Shi, Lu Cao, Dawei Wang, Ling Liu, Ganmei You, Shuang Liu, and
Chunjie Wang. Hero: Accelerating autonomous robotic tasks with fpga. In 2018
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
pages 7766–7772. IEEE, 2018.
[99] Anthony Stentz, Herman Herman, Alonzo Kelly, Eric Meyhofer, G Clark Haynes,
David Stager, Brian Zajac, J Andrew Bagnell, Jordan Brindza, Christopher
Dellin, et al. CHIMP, the CMU highly intelligent mobile platform. Journal of
Field Robotics, 32(2):209–228, 2015.
[100] Soumya Sudhakar, Sertac Karaman, and Vivienne Sze. Balancing actuation and
computing energy in motion planning. In ICRA. IEEE, 2020.
[101] Amr Suleiman, Zhengdong Zhang, Luca Carlone, Sertac Karaman, and Vivi-
enne Sze. Navion: a fully integrated energy-efficient visual-inertial odometry
accelerator for auto. nav. of nano drones. In VLSI Circuits. IEEE, 2018.
[102] Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and Stabilization of
Complex Behaviors through Online Trajectory Optimization. In IEEE/RSJ
International Conference on Intelligent Robots and Systems.
[103] Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization
of complex behaviors through online trajectory optimization. In Intelligent
Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on,
pages 4906–4913. IEEE, 2012.
[104] Jan Treibig, Georg Hager, and Gerhard Wellein. Likwid: A lightweight
performance-oriented tool suite for x86 multicore environments. In Parallel
Processing Workshops (ICPPW), 2010 39th International Conference on, pages
207–216. IEEE, 2010.
[105] Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vla-
dyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford
Taylor. Conservation Cores: Reducing the Energy of Mature Computations. In
Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for
Programming Languages and Operating Systems, ASPLOS XV, pages 205–218.
ACM.
121
[106] Michael W Walker and David E Orin. Efficient dynamic computer simulation of
robotic mechanisms. Journal of Dynamic Systems, Measurement, and Control,
104(3):205–211, 1982.
[107] Grady Williams, Andrew Aldrich, and Evangelos A Theodorou. Model predictive
path integral control: From theory to parallel computation. Journal of Guidance,
Control, and Dynamics, 40(2):344–357, 2017.
[108] Alexander Wittig, Viktor Wase, and Dario Izzo. On the Use of Gpus for
Massively Parallel Optimization of Low-thrust Trajectories.
[109] Fang Xu, Hong Chen, Xun Gong, and Qin Mei. Fast nonlinear model predictive
control on fpga using particle swarm optimization. IEEE Transactions on
Industrial Electronics, 63(1):310–321, 2015.
[110] Guang-Zhong Yang, Bradley J. Nelson, Robin R. Murphy, Howie Choset, Henrik
Christensen, Steven H. Collins, Paolo Dario, Ken Goldberg, Koji Ikuta, Neil
Jacobstein, Danica Kragic, Russell H. Taylor, and Marcia McNutt. Combating
covid-19—the role of robotics in managing public health and infectious diseases.
Science Robotics, Mar 2020.
122

Robomorphic Computing

Uploaded by

Robomorphic Computing

Uploaded by

A Design Methodology for Computer Architecture

Parameterized by Robot Morphology

Submitted to the Department of Electrical Engineering and Computer Science

3 Benchmarking and Workload Analysis of Robot Dynamics Algo-

4 Robomorphic Computing: A Design Methodology for Computer Ar-

5 Implementation and Evaluation of a Robomorphic Dynamics Gradi-

6 Hardware-Software Codesign for the Robot Dynamics Gradient 89

7 Discussion and Future Work 107

1-1 Overview of the thesis work in relation to a longer-term research vi-

3-1 3D visualizations of the robot models used as benchmark cases. From

3-6 Total floating-point operations, categorized into double precision op-

4-1 Overview of robomorphic computing, a design methodology to transform

5-3 Example of one row of a transformation matrix dot product functional

5-4 Parameterized hardware template of the dynamics gradients accelerator.

5-8 FPGA coprocessor system implementation. . . . . . . . . . . . . . . . 79

5-9 Latency of a single computation of the dynamics gradient. Our FPGA

5-10 To reduce FPGA utilization, we implement a single transformation

5-13 ASIC synthesis shows a 4.5× to 7.2× speedup in single computation

6-1 Latency of a single computation of the dynamics gradient. Our novel

3.1 Dynamics Libraries Evaluated . . . . . . . . . . . . . . . . . . . . . . 34

7.1 Examples of different model parameters for real robots (pictured in

overview of the work in this thesis is shown in Figure 1-1.

In this thesis, we demonstrate the systematic design of a robot topology-parameterized

1.1 Thesis Overview

We begin this thesis by providing background information about computer architecture

2.1 Robotics Background

2.1.1 Robot Morphology

Traditional robots, including most commercial manipulator arms, quadrupeds, and

2.1.2 Robotics Algorithms Using Robot Morphology

2.1.3 Rigid Body Dynamics

𝑀 (𝑞)𝑣˙ + 𝑐(𝑞, 𝑣) = 𝜏, (2.1)

• 𝑞 ∈ R𝑛 is the joint configuration (or position) vector;

• 𝑣 ∈ R𝑚 is the joint velocity vector;

• 𝑐(𝑞, 𝑣) ∈ R𝑚 includes velocity-dependent terms and terms due to gravity, and is

Three standard problems related to the equations of motion are:

1. forward dynamics: computing 𝑣˙ given 𝑞, 𝑣, and 𝜏 ;

2. inverse dynamics: computing 𝜏 given 𝑞, 𝑣, and 𝑣;

3. mass matrix: computing 𝑀 (𝑞).

2.2 Computing Hardware Background

Spatial architectures that do not lean heavily into encoding-based approaches

Benchmarking and Workload

1. Open-source benchmarking: github.com/rbd-benchmarks/rbd-benchmarks;

2. Direct comparison of four state-of-the-art rigid body dynamics libraries;

4. Low-level performance statistics collected on modern hardware using hardware

5. An analysis of trends and differences across the distinct combinations of algorithm

3.1 Survey of Software Libraries

A peculiarity of RobCoGen is that the authors have opted to implement a hybrid

Both Pinocchio and RBD.jl implement the algorithms in a generic (parameterized or

3.2.1 Hardware Measurement Setup

Evaluation of the software implementations was performed on a modern desktop

3.2.3 Inputs and Cross-Library Verification of Results

3.3.1 Forward Dynamics Results

Figure 3-2: Runtime in microseconds. Shorter runtimes indicate better performance.

(a) Forward Dynamics (b) Mass Matrix (c) Inverse Dynamics

(a) Forward Dynamics (b) Mass Matrix (c) Inverse Dynamics

ID Floating-Point Ops. (Norm.)

(a) Forward Dynamics (b) Mass Matrix (c) Inverse Dynamics

ID L1-D Accesses (Norm.)

(a) Forward Dynamics (b) Mass Matrix (c) Inverse Dynamics

3.3.2 Mass Matrix Results

3.3.3 Inverse Dynamics Results

3.3.4 Memory Usage

3.3.5 Sources of Parallelism

3.3.6 Sensitivity to Compiler Choice

3.4.1 Observed Trends

3.4.2 Opportunities for Performance Gains

As previously mentioned, all of these implementations have some inefficiencies in their

3.4.3 Implications for Dynamics Gradients

3.4.4 Contributions and Future Work

The calculation of robot dynamics is an essential and time-consuming part of controlling

Domain-specific hardware acceleration is a emerging solution to this problem,

4.1 Motivating Use Case

4.1.1 Key Kernel: Forward Dynamics Gradient

4.1.2 Control Rate Performance

A significant performance gap exists for current state-of-the-art implementations

4.2 Robomorphic Computing