Assembly Programming
with RISC-V
1st edition
An Introduction to Assembly Programming
with RISC-V
© 2021 Edson Borin
Copyright
All rights reserved. This book or any portion thereof may not be reproduced
or used in any manner whatsoever without the express written permission of
the author except for the use of brief quotation in a book review.
ISBN:978-65-00-15811-3
First edition 2021
Edson Borin
Institute of Computing - University of Campinas
Av. Albert Einstein, 1251
Cidade Universitária Zeferino Vaz
Barão Geraldo - Campinas - SP - Brasil
www.ic.unicamp.br/~edson
13083-852
An updated version of this book and other material may be available at:
www.riscv-programming.org
Foreword
This book focuses on teaching the art of programming in assembly language, using
the RISC-V ISA as the guiding example. Towards this goal, the text spans, at an
introductory level, the organization of computing systems, describes the mechanics of
how programs are created and introduces basic programming concepts including both
user level and system programming. The ability to read and write low-level
assembly code is a powerful skill: it allows programmers to create high performance
programs and to access features of the machine that are not easily accessible from
high-level languages such as C, Java or Python, for example to control peripheral devices.
The book introduces the organization of computing systems, and the mechan-
ics of creating programs and converting them to machine-readable format suitable
for execution. It also teaches the components of a program, or how a programmer
communicates her intent to the system via directives, data allocation primitives and
finally the ISA instructions, and their use. Basic programming concepts of control
flow, loops as well as the runtime stack are introduced.
Next the book describes the organization of code sequences into routines and
subroutines, to compose a program. The text also addresses issues related to system
programming, including notions of peripheral control and interrupts.
This text, and its ancillary teaching materials, have been used in introductory classes
at the University of Campinas, Brazil (UNICAMP) and have undergone refinement
and improvement over several editions.
Mauricio Breternitz
Principal Investigator & Invited Associate Professor
ISTAR ISCTE Laboratory
ISCTE Instituto Universitario de Lisboa
Lisbon, Portugal
Notices:
• Document version: May 9, 2022
• Please report typos and other issues to Prof. Edson Borin (edson@ic.unicamp.br).
Contents
Foreword
Glossary
Acronyms
4 Assembly language
  4.1 Comments
  4.2 Assembly instructions
  4.3 Immediate values
  4.4 Symbol names
  4.5 Labels
  4.6 The location counter and the assembling process
  4.7 Assembly directives
    4.7.1 Adding values to the program
    4.7.2 The .section directive
    4.7.3 Allocating variables on the .bss section
    4.7.4 The .set and .equ directives
    4.7.5 The .globl directive
    4.7.6 The .align directive
II User-level programming
5 Introduction
    7.4.1 Searching for the maximum value on an array
8 Implementing routines
  8.1 The program memory layout
  8.2 The program stack
    8.2.1 Types of stacks
  8.3 The ABI and software composition
  8.4 Passing parameters to and returning values from routines
    8.4.1 Passing parameters to routines
    8.4.2 Returning values from routines
  8.5 Value and reference parameters
  8.6 Global vs local variables
    8.6.1 Allocating local variables on memory
  8.7 Register usage policies
    8.7.1 Caller-saved vs Callee-saved registers
    8.7.2 Saving and restoring the return address
  8.8 Stack Frames and the Frame Pointer
    8.8.1 Stack Frames
    8.8.2 The Frame Pointer
    8.8.3 Keeping the stack pointer aligned
  8.9 Implementing RISC-V ilp32 compatible routines
  8.10 Examples
    8.10.1 Recursive routines
    8.10.2 The standard “C” library syscall routines
A RV32IM ISA reference card
Glossary
binary digit is a digit that may assume one of two values: “0” (zero) or “1” (one).
10, 11, 14
byte addressable memory is a memory in which each memory word stores a single
byte. 3–5, 10, 18–21, 23, 54
Control and Status Register, or CSR, is an internal CPU register that exposes
the CPU status to the software and allows software to control the CPU. 10, 120,
121, 129, 131
endianness refers to the order in which the bytes are stored on a computing system.
There are two common formats: little-endian and big-endian. The little-endian
format places the least significant byte on the memory position associated with
the lowest address while the big-endian format places the most significant byte
on the memory position associated with the lowest address. 10, 19, 46, 64–67
Instruction Set Architecture defines the computer instructions set, including, but
not limited to, the behavior of the instructions, their encoding, and resources
that may be accessed by the instructions, such as CPU registers. 4, 10, 49, 53,
54, 56, 60, 62, 64–67, 70–72, 74, 75, 84, 120, 127, 128
integer overflow occurs when the result of an arithmetic operation on two integer
m-bit binary numbers is outside of the range that can be represented by an
m-bit binary number. 10, 15, 16
ISA native datatype is a datatype that can be naturally processed by the ISA. 10,
54, 64
load instruction is an instruction that loads a value from main memory into a
register. 10, 56
main memory is a storage device used to store the instructions and data of pro-
grams that are being executed. 2–5, 10, 12, 18, 20, 21, 23, 27, 32, 33, 35, 36,
48, 107–112, 114–119, 123–125, 128
native program is a program encoded using instructions that can be directly exe-
cuted by the CPU, without help from an emulator or a virtual machine. 2, 10,
24, 26
opcode the opcode, or operation code, is a code (usually encoded as a binary num-
ber) that indicates the operation that an instruction must perform. 10, 57
peripherals are input/output, or I/O, devices that are connected to the computer.
Examples of peripheral devices include video cards (also known as graphics
cards), USB controllers, network cards, etc. 2, 10, 107
persistent storage is a storage device capable of preserving its contents when the
power is shut down. Hard disk drives (HDDs), solid state drives (SSDs), and
flash drives are examples of persistent storage devices. 2, 3, 10, 107
privilege level defines which ISA resources are accessible by the software being ex-
ecuted. 10, 120, 127, 128
privilege mode defines the privilege level for the currently executing software. 10,
13, 128–131
program counter or PC, is the register that holds the address of the next instruc-
tion to be executed. In other words, it holds the address of the memory position
that contains the next instruction to be executed. It is also known as instruction
pointer, or IP, in some computer architectures. 10, 55
pseudo-instruction is an assembly instruction that does not have a corresponding
machine instruction on the ISA, but can be translated automatically by the
assembler into one, or more, alternative machine instructions to achieve the
same effect. 10, 39, 40, 56, 58, 68, 69
register is a small memory device usually located inside the Central Processing Unit
(CPU) for quick read and write access. 3, 10
row-major order specifies that the elements of a two-dimensional array are orga-
nized in memory row by row. In this context, the elements of the first row are
placed first then the elements of the second row are placed after the elements
of the first one and so on. 10, 21
stack pointer is a pointer that points to the top of the program stack. In other
words, it holds the address of the top of the program stack. In RISC-V, the
stack pointer is stored in the sp register. 10
store instruction is an instruction that stores values into main memory. 10
unprivileged ISA is the sub-set of the ISA that is accessible by the software running
on unprivileged mode. 10, 55, 128
unprivileged mode is the privilege mode with least privileges. In RISC-V, it is the
User/Application privilege mode. 10, 13, 128
unprivileged registers are a set of registers accessible on the unprivileged mode.
10, 55
user application is an application designed to be executed at user-mode on a system
managed by an operating system. 10
user-mode on RISC-V, the user-mode is equivalent to the User/Application mode.
10, 13, 128
Acronyms
ABI Application Binary Interface. 10, 54, 84, 91–94, 99, 102, 103, 125
ASCII American Standard Code for Information Interchange. 10, 16–18
bit Binary digit. 2–5, 10–20, 22, 23, 25, 28, 29, 35, 37, 40, 44–46, 48–50, 102, 108–114,
120, 121, 125
CPU Central Processing Unit. 2–5, 10–13, 32, 36, 49, 52, 107–112, 114–125, 128–132
CSR Control and Status Register. 10, 120–125, 129–132
ISA Instruction Set Architecture. 4, 10, 13, 24–26, 38–40, 49, 50, 55, 109–111, 120,
124, 125, 128, 129
ISR Interrupt Service Routine. 10, 117–120, 124, 125
UTF-8 Universal Coded Character Set (or Unicode) Transformation Format - 8-bit.
10, 16–18
Part I
Introduction to computer
systems and assembly
language
Chapter 1
Execution of programs: a
10,000 ft overview
There are several ways of encoding a computer program. Some programs, for ex-
ample, are encoded using abstract instruction sets and are executed by emulators or
virtual machines, which are other programs designed to interpret and execute the
abstract instruction set. Bash scripts, Java byte-code programs, and Python scripts
are common examples of programs that are encoded using abstract instruction sets
and require an emulator or a virtual machine to support their execution.
A native program is a program encoded using instructions that can
be directly executed by the computer hardware, without help from an
emulator or a virtual machine. In this book, we focus our discussion on native
programs. Hence, from now on, whenever we use the term “program”, unless stated
otherwise, we are referring to native programs.
Native program instructions usually perform simple operations, such as adding or
comparing two numbers. Nonetheless, by executing multiple instructions, a computer
is capable of solving complex problems.
Most modern computers are built using digital electronic circuitry. These machines
usually represent information using voltage levels that are mapped to two states,
HIGH and LOW, or “1” (one) and “0” (zero). Hence, the basic unit of information
on modern computers is a binary digit, i.e., “1” or “0”. Consequently, information
and instructions are encoded as sequences of binary digits, or bits.
• Main memory: The main memory is used to store the instructions and data of
programs that are being executed. The main memory is usually volatile, hence,
if the computer is turned off, its contents are lost.
• Central Processing Unit: the Central Processing Unit, or CPU, is the com-
ponent responsible for executing the computer programs. The CPU retrieves
programs’ instructions from the main memory for execution. Also, when execut-
ing instructions, the CPU often reads/writes data from/to the main memory.
1.1. MAIN COMPONENTS OF COMPUTERS
Figure 1.1 illustrates a computer system in which the CPU, the main memory, a
persistent storage device (HDD) and two I/O devices are connected through a system
bus.
[Figure 1.1: the CPU, the main memory, an HDD, an input device, and an output device connected through a system bus.]
Address   Memory location (1 byte)
          (a) binary    (b) hexadecimal
0         00110110      36
1         00000000      00
2         00001000      08
3         10000000      80
4         11110000      F0
5         11111111      FF
6         00001111      0F
7         11100001      E1
...       ...           ...
Figure 1.2: Organization of a byte addressable memory with its contents represented
in the binary (a) and the hexadecimal (b) bases.
• Registers: a CPU register is a small memory device located inside the CPU.
The CPU usually contains a small set of registers. RISC-V processors, for
example, contain thirty-one 32-bit registers¹ that can be used by programs to
store information inside the CPU. Computers often contain instructions that
copy values from the main memory into CPU registers, known as “load”
instructions, and instructions that copy values from the CPU registers into the
main memory, known as “store” instructions.
¹ A 32-bit register is a register that is capable of storing 32 bits, i.e., values composed of 32 bits.
• A control unit: the control unit is the unit responsible for orchestrating the
computer operation. It is capable of controlling the datapath and other compo-
nents, such as the main memory, by sending commands through the bus. For
example, it may send a sequence of commands to the datapath and to the main
memory to orchestrate the execution of a program instruction.
Accessing data on registers is much faster than accessing data on the main memory.
Hence, programs tend to copy data from memory and keep them on CPU registers to
enable faster processing. Once the data is no longer needed, it may be discarded or
saved back on the main memory to free CPU registers.
The Instruction Set Architecture, or ISA, defines the computer instructions set,
including, but not limited to, the behavior of the instructions, their encoding, and
resources that may be accessed by the instructions, such as CPU registers. A program
that was generated for a given ISA can be executed by any computer that implements
a compatible ISA.
ISAs tend to evolve over time; however, ISA designers try to keep newer ISA
versions compatible with previous ones so that legacy code, i.e., code generated for
previous versions of the ISA, can still be executed by newer CPUs. For example, a
program that was generated for the 80386 ISA can be executed by any processor that
implements this or any other compatible ISA, such as the 80486 ISA.
Address   Memory word   Assembly language (RV32I)
...       ...
8000      00010011₂     loop: addi a0, a0, 1
8001      00000101₂
8002      00010101₂
8003      00000000₂
8004      10010011₂     addi a1, a1, -1
8005      10000101₂
8006      11110101₂
8007      11111111₂
8008      11100011₂     beq a0, a1, loop
8009      00001100₂
800A      10110101₂
800B      11111110₂
...       ...
Figure 1.3: Three RV32I instructions stored on a byte addressable memory starting
at address 8000.
performed by a simple RV32I CPU. First, the CPU uses the address in the PC to fetch
an instruction (a sequence of four memory words, i.e., 32 bits) from main memory
and store it on an internal register called IR. Then, it updates the PC so it points to
the next instruction in memory. Finally, it executes the instruction that was fetched
from memory. Notice that when executing the instruction, the CPU may also access
the main memory to retrieve or store data.
Algorithm 1: RV32I instructions execution cycle.
while True do
    // Fetch instruction and update PC
    IR ← MainMemory[PC]
    PC ← PC + 4
    ExecuteInstruction(IR)
end
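The cycle in Algorithm 1 can be sketched in C. This is only a minimal illustration, not a complete RV32I implementation: it assumes the program is loaded at address 0 (instead of address 8000 used in Figure 1.3), it handles only the addi and beq instructions, and the names sign_extend and run_rv32i_sketch are hypothetical helpers introduced here.

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend the low `bits` bits of `value`. */
static int32_t sign_extend(uint32_t value, unsigned bits) {
    uint32_t sign = 1u << (bits - 1);
    return (int32_t)((value ^ sign) - sign);
}

/* Runs the fetch / update PC / execute cycle over a byte addressable,
 * little-endian memory. Stops when the PC leaves the memory, when
 * max_steps instructions were executed, or when it finds an instruction
 * that this sketch does not cover (anything other than addi and beq). */
static void run_rv32i_sketch(const uint8_t *mem, uint32_t size,
                             uint32_t regs[32], uint32_t max_steps) {
    uint32_t pc = 0; /* assumption: the entry point is address 0 */
    while (max_steps-- > 0 && pc + 4 <= size) {
        /* IR <- MainMemory[PC]: four memory words form one instruction. */
        uint32_t ir = (uint32_t)mem[pc] | ((uint32_t)mem[pc + 1] << 8) |
                      ((uint32_t)mem[pc + 2] << 16) |
                      ((uint32_t)mem[pc + 3] << 24);
        uint32_t this_pc = pc;
        pc += 4; /* PC <- PC + 4 */

        /* ExecuteInstruction(IR) */
        uint32_t opcode = ir & 0x7f;
        uint32_t funct3 = (ir >> 12) & 0x7;
        uint32_t rd = (ir >> 7) & 0x1f;
        uint32_t rs1 = (ir >> 15) & 0x1f;
        uint32_t rs2 = (ir >> 20) & 0x1f;
        if (opcode == 0x13 && funct3 == 0) { /* addi rd, rs1, imm */
            int32_t imm = sign_extend(ir >> 20, 12);
            if (rd != 0) regs[rd] = regs[rs1] + (uint32_t)imm;
        } else if (opcode == 0x63 && funct3 == 0) { /* beq rs1, rs2, offset */
            uint32_t raw = ((ir >> 31) << 12) | (((ir >> 7) & 1u) << 11) |
                           (((ir >> 25) & 0x3f) << 5) |
                           (((ir >> 8) & 0xf) << 1);
            if (regs[rs1] == regs[rs2])
                pc = this_pc + (uint32_t)sign_extend(raw, 13);
        } else {
            break; /* instruction not covered by this sketch */
        }
    }
}
```

Running the sketch over the twelve bytes shown in Figure 1.3, with a0 (register 10) set to 0 and a1 (register 11) set to 2, executes the loop body twice and leaves a0 = 2 and a1 = 0.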
To execute a program, the operating system essentially loads the program into the
main memory (e.g., from a persistent storage device) and sets the PC so it points to
the program entry point.
BIOS. More modern computers use the Unified Extensible Firmware Interface, or UEFI, standard.
Chapter 2
Data representation on
modern computers
This chapter discusses how information is represented on computers. First, Section 2.1
introduces the concepts of numeral systems and the positional notation. Then, sec-
tions 2.2 and 2.3 discuss how numbers and text are represented on computers, respec-
tively. Next, Section 2.4 shows how data is organized in memory. Finally, Section 2.5
discusses how instructions are encoded.
• D10 be the set of symbols used in the decimal numeral system, i.e., D10 =
{“0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”}; and
2.1. NUMERAL SYSTEMS AND THE POSITIONAL NOTATION
The value of a number with m digits in the decimal numeral system is computed
by Equation 2.2.
$$\mathit{value}(\mathit{number}_{10}) = \sum_{i=0}^{i<m} \underbrace{\mathit{symbol\_value}(d^{i}) \times 10^{i}}_{\text{digit value}} \tag{2.2}$$
For example, the value of any natural number represented on the binary numeral
system (base = 2) is defined by Equation 2.5
$$\mathit{value}(\mathit{number}_{2}) = \sum_{i=0}^{i<m} \underbrace{\mathit{symbol\_value}(d^{i}_{2}) \times 2^{i}}_{\text{digit value}} \tag{2.5}$$
Notice that the value of the sequence 11 is three on the binary numeral system
($1 \times 2^1 + 1 \times 2^0$) while it is eleven on the decimal numeral system ($1 \times 10^1 + 1 \times 10^0$)
and seventeen on the hexadecimal numeral system ($1 \times 16^1 + 1 \times 16^0$).
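The positional computation can be sketched in C as follows. The function names symbol_value and positional_value are hypothetical and introduced only for this illustration. The loop accumulates the digits from left to right, which yields the same sum as adding each symbol value multiplied by the corresponding power of the base.

```c
#include <stddef.h>

/* Hypothetical helper: value of one symbol ('0'-'9', 'A'-'F'), or -1. */
static int symbol_value(char symbol) {
    if (symbol >= '0' && symbol <= '9') return symbol - '0';
    if (symbol >= 'A' && symbol <= 'F') return symbol - 'A' + 10;
    return -1;
}

/* Value of a digit sequence in the given base (2 to 16).
 * Returns -1 if a symbol does not belong to the base. */
static long positional_value(const char *digits, int base) {
    long value = 0;
    for (size_t i = 0; digits[i] != '\0'; i++) {
        int v = symbol_value(digits[i]);
        if (v < 0 || v >= base) return -1;
        value = value * base + v; /* shift previous digits one position left */
    }
    return value;
}
```

With this sketch, the sequence "11" evaluates to three in base 2, eleven in base 10, and seventeen in base 16.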
When working with multiple numeral systems it is often necessary to annotate the
numbers so that it is possible to identify the numeral system being used, and hence, its
value. A common notation is to append a subscript suffix to the number indicating
the base of the positional numeral system. For example, the value of number $11_{10}$ is
eleven while the value of number $11_2$ is three¹.
Appending a subscript suffix to the number is not a natural way of annotating
numbers in computer programs. In these cases, a common approach is to prepend a
prefix that indicates the base. For example, in “C”, the programmer may use the
prefix “0b”/“0”/“0x” to indicate that the number is in the binary/octal/hexadecimal
base, i.e., base 2/8/16 (the “0b” prefix is a compiler extension in versions of the
language older than C23). In “C”, numbers that lack a prefix belong to the decimal
numeral system.
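As a small illustration of these prefixes, the following C sketch writes the value twenty six with three different literals. The function name literals_agree is hypothetical. The binary prefix is left out on purpose, since before C23 it is available only as a compiler extension.

```c
/* Returns 1 when the three literals below denote the same value. */
static int literals_agree(void) {
    int decimal = 26;       /* no prefix: base 10 */
    int octal = 032;        /* prefix 0: base 8   */
    int hexadecimal = 0x1A; /* prefix 0x: base 16 */
    return decimal == octal && octal == hexadecimal;
}
```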
The binary and octal numeral systems use a subset of the symbols used on the
decimal numeral system to represent the numbers. The hexadecimal numeral system,
on the other hand, requires more than ten symbols, hence, new symbols are needed.
In this case, the first letters of the alphabet are used to complement the set of symbols.
Table 2.1 shows the symbols used in each one of these positional numeral systems and
their corresponding values.
Symbol   Symbol value   Used in base
                        2    8    10   16
“0”      zero           X    X    X    X
“1”      one            X    X    X    X
“2”      two                 X    X    X
“3”      three               X    X    X
“4”      four                X    X    X
“5”      five                X    X    X
“6”      six                 X    X    X
“7”      seven               X    X    X
“8”      eight                    X    X
“9”      nine                     X    X
“A”      ten                           X
“B”      eleven                        X
“C”      twelve                        X
“D”      thirteen                      X
“E”      fourteen                      X
“F”      fifteen                       X
Table 2.1: Set of symbols used in binary, octal, decimal, and hexadecimal numeral
systems and their respective values.
Table 2.2 shows how values zero to twenty are represented in the hexadecimal,
decimal, octal, and binary numeral systems.
Value       Numeral system
            Hexadecimal   Decimal   Octal   Binary
zero        0₁₆           0₁₀       0₈      0₂
one         1₁₆           1₁₀       1₈      1₂
two         2₁₆           2₁₀       2₈      10₂
three       3₁₆           3₁₀       3₈      11₂
four        4₁₆           4₁₀       4₈      100₂
five        5₁₆           5₁₀       5₈      101₂
six         6₁₆           6₁₀       6₈      110₂
seven       7₁₆           7₁₀       7₈      111₂
eight       8₁₆           8₁₀       10₈     1000₂
nine        9₁₆           9₁₀       11₈     1001₂
ten         A₁₆           10₁₀      12₈     1010₂
eleven      B₁₆           11₁₀      13₈     1011₂
twelve      C₁₆           12₁₀      14₈     1100₂
thirteen    D₁₆           13₁₀      15₈     1101₂
fourteen    E₁₆           14₁₀      16₈     1110₂
fifteen     F₁₆           15₁₀      17₈     1111₂
sixteen     10₁₆          16₁₀      20₈     10000₂
seventeen   11₁₆          17₁₀      21₈     10001₂
eighteen    12₁₆          18₁₀      22₈     10010₂
nineteen    13₁₆          19₁₀      23₈     10011₂
twenty      14₁₆          20₁₀      24₈     10100₂
Table 2.2: Values zero to twenty represented in the hexadecimal, decimal, octal, and
binary numeral systems.
$$V = v(d^{m-1}_{b}) \times b^{m-1} + \cdots + v(d^{1}_{b}) \times b^{1} + v(d^{0}_{b}) \times b^{0} \tag{2.6}$$
Notice that $v(d^0_b)$ and $V'$ are equivalent to the remainder and the quotient, respectively, of the
division of $V$ by $b$. Hence, to find out the symbol value of digit $d^0_b$ (i.e., $v(d^0_b)$) it
suffices to compute the remainder of the division of $V$ by $b$.
Using the same reasoning, the symbol value of digit $d^1_b$ may be computed by
dividing $V'$ by $b$. Notice that the remainder of the division of $V'$ by $b$ is equal to
$v(d^1_b)$.
Let symbol_from_value(v, b) be a function that returns the symbol used to
represent value v on base b (e.g., symbol_from_value(eleven, 16) = “B”). Algorithm 2
shows an algorithm to compute the sequence of digits $d^{m-1}_b d^{m-2}_b \cdots d^1_b d^0_b$ that repre-
[Figure: successive divisions of twenty six. Dividing by 2 produces the remainders $v(d^0_2) = 0$, $v(d^1_2) = 1$, $v(d^2_2) = 0$, $v(d^3_2) = 1$, and $v(d^4_2) = 1$, i.e., $11010_2$. Dividing by 16 produces $v(d^0_{16}) = 10$ (digit “A”) and $v(d^1_{16}) = 1$ (digit “1”), i.e., $1A_{16}$ = twenty six.]
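The repeated-division procedure can be sketched in C as shown next. This is an illustrative sketch and not necessarily the book's Algorithm 2: the function to_base and its buffer-based interface are assumptions made here, and symbol_from_value is restricted to bases up to 16.

```c
#include <stddef.h>

/* Symbol used to represent a value (0 to 15) in bases up to 16. */
static char symbol_from_value(unsigned value) {
    return "0123456789ABCDEF"[value];
}

/* Writes the representation of `value` in `base` (2 to 16) into `out`.
 * Returns the number of digits, or 0 if the base is invalid or the
 * output buffer is too small. */
static size_t to_base(unsigned long value, unsigned base,
                      char *out, size_t out_size) {
    char reversed[64]; /* remainders come out least significant first */
    size_t n = 0;
    if (base < 2 || base > 16) return 0;
    do {
        reversed[n++] = symbol_from_value((unsigned)(value % base)); /* remainder */
        value /= base;                                               /* quotient  */
    } while (value != 0 && n < sizeof reversed);
    if (value != 0 || n + 1 > out_size) return 0;
    for (size_t i = 0; i < n; i++) out[i] = reversed[n - 1 - i];
    out[n] = '\0';
    return n;
}
```

For the value twenty six, the sketch produces the digits 11010 in base 2 and 1A in base 16.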
converting between these bases can be done by replacing subsets of consecutive four
bits by single hexadecimal digits, and vice versa. For example, Equation 2.8 illustrates
how number $3E8_{16}$ can be converted to binary and Equation 2.9 shows how number
$10110101110101_2$ can be converted to hexadecimal.
$$\underbrace{3}_{0011_2}\underbrace{E}_{1110_2}\underbrace{8}_{1000_2}{}_{16} = 001111101000_2 = 1111101000_2 \tag{2.8}$$
$$\underbrace{10}_{2_{16}}\underbrace{1101}_{D_{16}}\underbrace{0111}_{7_{16}}\underbrace{0101}_{5_{16}}{}_{2} = 2D75_{16} \tag{2.9}$$
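A minimal C sketch of the hexadecimal-to-binary direction follows. The function name hex_to_binary is hypothetical, and the sketch assumes uppercase hexadecimal digits and a caller-provided buffer with room for four characters per digit plus the terminator.

```c
#include <stddef.h>

/* Expands each hexadecimal digit of `hex` into four binary digits.
 * `out` must hold 4 * strlen(hex) + 1 characters.
 * Returns 0 on success, or -1 if an invalid digit is found. */
static int hex_to_binary(const char *hex, char *out) {
    size_t k = 0;
    for (size_t i = 0; hex[i] != '\0'; i++) {
        char c = hex[i];
        int v;
        if (c >= '0' && c <= '9') v = c - '0';
        else if (c >= 'A' && c <= 'F') v = c - 'A' + 10;
        else return -1;
        for (int bit = 3; bit >= 0; bit--) /* most significant bit first */
            out[k++] = ((v >> bit) & 1) ? '1' : '0';
    }
    out[k] = '\0';
    return 0;
}
```

For the input "3E8" the sketch writes 001111101000, the same twelve bits shown in Equation 2.8 before the leading zeros are dropped.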
To illustrate this concept, Table 2.3 shows the list of unsigned numbers that can
be represented with three-bit words.
Three-bit word   Value in the unsigned representation
000              0₁₀
001              1₁₀
010              2₁₀
011              3₁₀
100              4₁₀
101              5₁₀
110              6₁₀
111              7₁₀
Table 2.3: Unsigned numbers that can be represented with three-bit words.
Three-bit word   Value
                 unsigned   sign and magnitude
000              0₁₀        0₁₀
001              1₁₀        1₁₀
010              2₁₀        2₁₀
011              3₁₀        3₁₀
100              4₁₀        −0₁₀
101              5₁₀        −1₁₀
110              6₁₀        −2₁₀
111              7₁₀        −3₁₀
Table 2.4: Values of three-bit words on the unsigned and sign and magnitude
representations.
The “sign and magnitude” representation can represent numbers in the range
$[-(2^{m-1} - 1) \,..\, (2^{m-1} - 1)]$ with a sequence of m bits.
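As a sketch of how such words can be decoded, the C function below interprets the m least significant bits of a word in the sign and magnitude representation, treating the most significant of those bits as the sign. The function name and the restriction to word sizes between 2 and 16 bits are assumptions made for this illustration.

```c
/* Decodes the low m bits of `word` (2 <= m <= 16 assumed) as a
 * sign and magnitude number. */
static int sign_magnitude_value(unsigned word, unsigned m) {
    unsigned sign_bit = (word >> (m - 1)) & 1u;          /* bit m-1 */
    unsigned magnitude = word & ((1u << (m - 1)) - 1u);  /* remaining m-1 bits */
    return sign_bit ? -(int)magnitude : (int)magnitude;
}
```

Note that the words 000 and 100 both decode to zero, which reflects the two representations of zero in Table 2.4.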
One’s complement
The “one’s complement” representation is a number representation that can be used
to represent signed numbers. In this representation, signed numbers are represented
as a sequence of m bits so that bit $d^{m-1}_2$ represents the sign of the number. If bit
$d^{m-1}_2$ is “1”, then the number is negative, otherwise, it is non-negative.
The magnitude of a non-negative number represented in one’s complement (i.e.,
a number in which $d^{m-1}_2$ = “0”) is computed in the same way the value of unsigned
numbers is computed, i.e., using Equation 2.5. For example, the three-bit number
“010” is a non-negative number in one’s complement representation and its magnitude
is two, since $0 \times 2^2 + 1 \times 2^1 + 0 \times 2^0$ = two.
The magnitude of a negative number represented in one’s complement (i.e., a
number in which $d^{m-1}_2$ = “1”) is computed by first “complementing” (inverting) all
bits and then using Equation 2.5. For example, the three-bit number “110” is a
negative number in one’s complement representation and its magnitude is one, since
the value of its complement (“001”) is one ($0 \times 2^2 + 0 \times 2^1 + 1 \times 2^0$ = one).
Table 2.5 shows the values of three-bit words on the unsigned, sign and magnitude,
and one’s complement representations. Similar to the sign and magnitude
representation, there are two representations for value zero on the one’s complement
representation.
Three-bit word   Value
                 unsigned   sign and magnitude   one’s complement   two’s complement
000              0₁₀        0₁₀                  0₁₀                0₁₀
001              1₁₀        1₁₀                  1₁₀                1₁₀
010              2₁₀        2₁₀                  2₁₀                2₁₀
011              3₁₀        3₁₀                  3₁₀                3₁₀
100              4₁₀        −0₁₀                 −3₁₀               −4₁₀
101              5₁₀        −1₁₀                 −2₁₀               −3₁₀
110              6₁₀        −2₁₀                 −1₁₀               −2₁₀
111              7₁₀        −3₁₀                 −0₁₀               −1₁₀
Table 2.5: Values of three-bit words on the unsigned, sign and magnitude, one’s
complement and two’s complement representations.
The “one’s complement” representation can represent numbers in the range
$[-(2^{m-1} - 1) \,..\, (2^{m-1} - 1)]$ with a sequence of m bits.
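The decoding rule described above (invert all bits of a negative number to obtain its magnitude) can be sketched in C as follows. The function name and the assumption that m is between 2 and 16 are choices made for this illustration.

```c
/* Decodes the low m bits of `word` (2 <= m <= 16 assumed) as a
 * one's complement number. */
static int ones_complement_value(unsigned word, unsigned m) {
    unsigned mask = (1u << m) - 1u; /* keeps only the m bits of the word */
    word &= mask;
    if ((word >> (m - 1)) & 1u)     /* sign bit set: negative number */
        return -(int)(~word & mask);/* magnitude is the inverted word */
    return (int)word;
}
```

For three-bit words, 110 decodes to minus one and 100 decodes to minus three, as in Table 2.5, while 111 decodes to zero (the “negative zero”).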
Two’s complement
Figure 2.3: The addition of two numbers produces the same sequence of bits both in
the unsigned numbers representation and the two’s complement representation.
The two’s complement representation can represent numbers in the range
$[-(2^{m-1}) \,..\, (2^{m-1} - 1)]$ with a sequence of m bits.
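One way to decode an m-bit two's complement word is to read it as an unsigned number and subtract $2^m$ when the most significant bit is set. The C sketch below follows this approach; the function name and the assumption that m is between 2 and 16 are choices made for this illustration.

```c
/* Decodes the low m bits of `word` (2 <= m <= 16 assumed) as a
 * two's complement number. */
static int twos_complement_value(unsigned word, unsigned m) {
    unsigned mask = (1u << m) - 1u;
    word &= mask;
    if ((word >> (m - 1)) & 1u)            /* sign bit set: negative number */
        return (int)word - (int)(1u << m); /* unsigned value minus 2^m */
    return (int)word;
}
```

For three-bit words the decoded values range from minus four (100) to three (011), which matches the range stated above for m = 3.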
m bits) and if any overflow beyond these bits are discarded. This is usually the case in modern
computers.
4 As discussed in the previous section, adding (subtracting) two m-bit unsigned binary numbers
produces the same sequence of m bits as adding (subtracting) two m-bit signed numbers on the two’s
complement representation. Hence, the same approach used for unsigned binary numbers may be
used to perform addition and subtraction operations on signed numbers using the two’s complement
representation.
Figure 2.4: Adding two three-bit binary numbers. (a) Red arrows indicate where the
carry out comes from. (b) Simplified representation (without arrows).
In these cases, “some value” is borrowed from the left digit and this borrowed value
must be accounted for when performing the subtraction on the left digit.
Figure 2.5 illustrates the subtraction of two three-bit unsigned binary numbers.
Figure 2.5 (b) shows the first step, in which the least significant digits are subtracted.
Since “1” cannot be subtracted from “0”, some value must be borrowed from the left
column. The “*” character indicates that value had to be borrowed from the left
column. The result in this column is “1”, since “10” (two) minus “1” (one) is “1”.
Figure 2.5 (c) illustrates the operation on the second least significant digit. Since
the right column borrowed from this column, the first operand is now “0”. Again,
since “1” cannot be subtracted from “0”, some value must be borrowed from the
left column. The result in this column is “1”, since “10” (two) minus “1” (one) is
“1”. Figure 2.5 (d) shows the subtraction of the most significant digits (zero minus
zero) and the final result. Figure 2.5 (e) shows a simplified representation of the
subtraction.
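The column-by-column procedure can be sketched in C as follows. The function name subtract_with_borrow and its interface are assumptions made for this illustration. Each iteration handles one column, starting at the least significant digit, and records whether value had to be borrowed from the column to the left.

```c
/* Subtracts the m-bit number b from the m-bit number a, one column at a
 * time. Returns the m-bit result and stores in *borrow_out whether the
 * most significant column still needed to borrow. */
static unsigned subtract_with_borrow(unsigned a, unsigned b, unsigned m,
                                     unsigned *borrow_out) {
    unsigned result = 0, borrow = 0;
    for (unsigned i = 0; i < m; i++) {
        /* First operand digit, reduced if the right column borrowed from it. */
        int top = (int)((a >> i) & 1u) - (int)borrow;
        int bottom = (int)((b >> i) & 1u);
        if (top < bottom) { /* borrow "10" (two) from the left column */
            top += 2;
            borrow = 1;
        } else {
            borrow = 0;
        }
        result |= (unsigned)(top - bottom) << i;
    }
    *borrow_out = borrow;
    return result;
}
```

Subtracting 011 from 110 with this sketch borrows in the two least significant columns and produces 011, as in the example of Figure 2.5.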
[Figure 2.5, panels (a) to (e): column-by-column subtraction of 011₂ (3₁₀) from 110₂ (6₁₀), borrowing from the middle and from the most significant columns, with result 011₂ (3₁₀).]
Figure 2.5: Subtraction of two three-bit binary numbers. (a) The digits of both num-
bers are aligned on columns. (b) First, the least significant digits are subtracted - the
“*” character indicates that some value was borrowed from the left column. (c) The
second least significant digits are subtracted - again, the “*” character indicates that
some value was borrowed from the left column. (d) Finally, the most significant digits
are subtracted producing digit “0”. (e) Simplified representation of the subtraction.
unsigned binary numbers causes an integer overflow. In this case, adding one to seven
should result in eight, however, the value eight cannot be represented using only three
bits on the unsigned binary representation.
1 1 1      ← carry digits
  0 0 1    (1₁₀)
+ 1 1 1    (7₁₀)
  0 0 0    (0₁₀)
Even though the operation illustrated on Figure 2.6 characterizes an integer overflow
on the unsigned binary representation, it does not characterize an integer overflow
on the signed (two’s complement) binary representation. In this case, the operation
is adding one (001) to minus one⁵ (111) and the expected result, i.e., zero, can be
represented by a three-bit signed binary number.
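For unsigned numbers, an integer overflow on an addition can be detected by checking whether there is a carry out of the most significant digit. The C sketch below illustrates this for m-bit words; the function name and the assumption that m is between 1 and 16 are choices made for this illustration.

```c
/* Adds two m-bit unsigned numbers (1 <= m <= 16 assumed). Returns the
 * m-bit result and sets *overflow to 1 when the true sum does not fit
 * in m bits, i.e., when there is a carry out of the last column. */
static unsigned add_unsigned_m_bits(unsigned a, unsigned b, unsigned m,
                                    int *overflow) {
    unsigned mask = (1u << m) - 1u;
    unsigned sum = (a & mask) + (b & mask);
    *overflow = (sum >> m) != 0; /* carry out of the most significant bit */
    return sum & mask;           /* bits beyond the m-th are discarded */
}
```

Adding 001 and 111 as three-bit unsigned numbers yields 000 with the overflow flag set, while adding 011 and 001 yields 100 without overflow.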
Figure 2.7 shows an example in which the addition of two three-bit signed binary
numbers using the two’s complement method causes an integer overflow. In this case,
however, there was no integer overflow on the unsigned binary number representation.
Notice that the result of the operation is as expected, i.e., four (100), on the unsigned
binary representation.
  1 1      ← carry digits
  0 1 1    (3₁₀)
+ 0 0 1    (1₁₀)
  1 0 0    (−4₁₀)
Figure 2.7: Example of an integer overflow on the signed binary representation. The
result of three plus one cannot be represented by a three-bit signed binary number
using the two’s complement representation.
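For two's complement numbers, a common way to detect an integer overflow on an addition is to check whether both operands have the same sign bit and the result has a different one. The C sketch below applies this test to m-bit words; the function name and the assumption that m is between 2 and 16 are choices made for this illustration.

```c
/* Adds two m-bit two's complement numbers (2 <= m <= 16 assumed).
 * Returns the m-bit result and sets *overflow to 1 when the operands
 * share a sign bit that differs from the sign bit of the result. */
static unsigned add_signed_m_bits(unsigned a, unsigned b, unsigned m,
                                  int *overflow) {
    unsigned mask = (1u << m) - 1u;
    unsigned sign = 1u << (m - 1);
    unsigned sum = ((a & mask) + (b & mask)) & mask;
    *overflow = (((a ^ sum) & (b ^ sum)) & sign) != 0;
    return sum;
}
```

Adding 011 and 001 yields 100 with the overflow flag set, while adding 001 and 111 yields 000 without a signed overflow.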
representation.
6 Control characters are not intended to represent printable information. A line feed, carriage return
Table 2.6: Subset of the characters encoded by the ASCII character encoding stan-
dard. Hex. and Dec. columns show the encoding value in hexadecimal and decimal
representation while the Char. column shows the symbol encoded by the character.
W3Techs web site8 indicated that more than 95.5 % of the world wide web websites
are encoded with the UTF-8 character encoding standard.
The “Unicode (or Universal Coded Character Set) Transformation Format - 8-bit”,
or UTF-8 for short, is a variable-width character encoding standard. In this
standard, each character may be represented by one, two, three, or four bytes, i.e.,
one, two, three, or four 8-bit numbers. Common characters, such as letters “a”, “b”,
and “c”, are represented by a single byte, while more exotic ones are represented
using multiple bytes. The euro currency sign (€), for example, is encoded using three
bytes: 11100010₂, 10000010₂, and 10101100₂.
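The variable-width encoding can be sketched in C as follows. The function name utf8_encode is hypothetical, and the sketch takes a Unicode code point as input (for example, 0x20AC for the euro sign), which is a detail not discussed in the text above.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes a Unicode code point as UTF-8. Returns the number of bytes
 * written to `out` (1 to 4), or 0 if the code point is not valid. */
static size_t utf8_encode(uint32_t cp, uint8_t out[4]) {
    if (cp <= 0x7F) {             /* one byte: same value as in ASCII */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {            /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {           /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are invalid */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {         /* four bytes: 11110xxx followed by three 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For the euro sign (code point 0x20AC) the sketch produces the three bytes E2, 82, and AC in hexadecimal, which are the three binary values listed above.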
The UTF-8 standard was designed to be backward compatible with the ASCII
standard. Hence, ASCII characters are represented on the UTF-8 standard using a
single byte with the same value. For example, letter “a” is represented by value
ninety seven in both standards. In this way, software designed to work with the
UTF-8 standard can naturally open and handle ASCII encoded files.
Texts are represented in computers as sequences of characters on memory.
For example, the word “Yes” is represented by a sequence of three characters
(“Y”, “e”, and “s”) stored on consecutive memory positions. In case the ASCII
character encoding standard is being used, the three consecutive memory positions
will contain values 89₁₀, 101₁₀, and 115₁₀, respectively. Figure 2.8 illustrates how
the word “maçã”⁹ is represented in three different character encoding standards: the
UTF-8, the ISO-LATIN-1 and the Mac OS Roman. Each square represents a byte
and the values inside the squares are in hexadecimal. Notice that the UTF-8 standard
requires two bytes to represent letter “ç” and two bytes to represent letter “ã”.
Encoding        M    a    ç        ã
ISO-LATIN-1     4D   61   E7       E3
Mac OS Roman    4D   61   8D       8B
UTF-8           4D   61   C3 A7    C3 A3
Figure 2.8: Word “maçã” represented in three different character encoding standards:
the UTF-8, the ISO-LATIN-1 and the Mac OS Roman.
8 https://w3techs.com/technologies/overview/character_encoding
9 “Maçã” is the word for apple in Portuguese.
1 #include <stdio.h>
2 char name1[] = "John";
3 char name2[] = {0x4a, 0x6f, 0x68, 0x6e, 0x00}; /* 'J', 'o', 'h', 'n', and the terminator */
4 int main(void)
5 {
6 printf("Name 1: \"%s\"\n", name1);
7 printf("Name 2: \"%s\"\n", name2);
8 printf("Size of name 1 = %zu\n", sizeof(name1)); /* sizeof yields a size_t value, printed with %zu */
9 printf("Size of name 2 = %zu\n", sizeof(name2));
10 return 0;
11 }
Both strings (name1 and name2) in the previous code require five memory words to
be stored on memory¹¹. In fact, since the hexadecimal values used in line 3 are
the values for the “J”, “o”, “h”, and “n” letters on the ASCII and UTF-8 encoding
standards, both strings are identical. The following listing shows the output of the
previous program.
1 Name 1: "John"
2 Name 2: "John"
3 Size of name 1 = 5
4 Size of name 2 = 5
Figure 2.9: 32-bit number $00000000\ 00000000\ 00000100\ 00000001_2$ ($1025_{10}$) stored on
four consecutive memory words using the (a) little-endian and the (b) big-endian
endianness formats.
Vector elements are usually organized linearly in memory. Hence, when translating the
previous code into machine language, all elements (int values) of vector V are placed
in consecutive memory positions, starting with the first element, i.e., V[0]. The base
address of an array is the address of the first memory word that is being used
to store the array elements. Assuming the base address of vector V is 000, the first
element (V[0]) is stored starting at memory address 000. Also, assuming each element
requires four memory words13, the second element is stored starting at memory address
004 and the third one starting at memory address 008. Figure 2.10 illustrates the
contents of vector V placed in memory starting at address 000. Notice that, in this
example, each element is a 32-bit number that is stored on four consecutive memory
words using the little-endian format.
12 This is a property of the “C” programming language. Other languages, such as Pascal, associate
numbers.
Address   Contents (binary)   Element
000       00001001            V[0] = 9 (decimal)
001       00000000
002       00000000
003       00000000
004       00001000            V[1] = 8 (decimal)
005       00000000
006       00000000
007       00000000
008       00000001            V[2] = 1 (decimal)
009       00000000
010       00000000
011       00000000
The previous example showed an array of int elements. Nonetheless, in “C”, the
programmer may also create arrays of other types. For example, one may create an
array of char, in which each element occupies only one byte, an array of double, in
which each element occupies 8 bytes, or even an array of a new type defined with the
struct keyword, in which each element may occupy several bytes.
Let:
In “C”, and several other programming languages, each element V[i] occupies
elem_size memory words of a byte addressable memory and is placed in the main
memory starting at address &V[i], which is defined by Equation 2.12.
The way two-dimensional arrays are organized on memory depends on the pro-
gramming language. In “C”, elements are grouped by row and each row is placed on
memory consecutively. Hence, in the previous example, the elements of the first row,
i.e., M[0][0]=7, M[0][1]=9, and M[0][2]=11, are placed first on memory. Then, the
elements of the second row, i.e., M[1][0]=2, M[1][1]=8, and M[1][2]=1, are placed
after the elements of the first row. This way of organizing two-dimensional arrays on
memory is known as row-major order.
Let:
In “C”, each element A[x][y] occupies elem_size memory words of a byte addressable
memory and is placed in the main memory starting at address &A[x][y], which can
be computed using Equation 2.13.
Notice that Equation 2.13 adds two offsets to the base address (A_addr): offset 1
and offset 2. The first offset is the amount of space in bytes required to store all
elements that belong to previous rows, i.e., rows that must be placed before row x.
The second offset is the amount of space in bytes required to store all elements that
belong to the same row but must be placed before element A[x][y], i.e., the elements
that have a column index less than y.
#include <stdio.h>

struct user_id {
    int id;
    char name[256];
    short level;
};

struct user_id manager; /* global variable (declaration assumed) */

void print_manager()
{
    printf("Manager id = %d\n", manager.id);
    printf("Manager name = %s\n", manager.name);
    printf("Manager level = %d\n", manager.level);
}
Address   Contents (binary)   Field
000       00000001            id
001       00000000
002       00000000
003       00000000
004       01001010            name[0]
...       ...
259       00000000            name[255]
260       00000000            level
261       00000111
Figure 2.11: Elements of variable manager stored on memory starting at address 000.
Figure 2.12: Encoding of two different instructions: (a) instruction mov, which belongs
to the x86 instruction set architecture, and (b) instruction add, which belongs to the
RISC-V instruction set architecture.
parameters. These parameters specify that the add operation must be performed
using the values stored in registers x1 and x0 and the result stored in register x3,
which are identified by the binary values 0001, 0000, and 0011 in fields rs1, rs2,
and rd, respectively.
Most modern computers store the code, i.e., the program instructions, in the same
memory in which they store the data: the main memory. Also, modern computer
instructions are encoded using multiples of 8 bits so that they fit the size of multiple
memory words on a byte addressable memory. Figure 2.13 shows an example of how a
program written in x86 assembly language is mapped to machine language and stored on
a byte addressable memory. Notice that the first instruction, push %ebp, is encoded
using one byte while the third one, imul $113, 12(%ebp), %eax, is encoded using
four bytes. Also, notice that instructions are placed sequentially in memory, in the
same order in which they appear in the original assembly program.
Figure 2.13: Mapping x86 instructions from an assembly language program to mem-
ory.
This chapter presents the main concepts and elements of assembly, object, and exe-
cutable files.
int func(int x); /* defined in another file */

int main()
{
    int r = func(10);
    return r + 1;
}
3.1. GENERATING NATIVE PROGRAMS
An assembly program is also a program encoded as a plain text file. The following
code shows an example of a program written using the RV32I assembly language.
This program has the same semantics as the previous C program.
1 .text
2 .align 2
3 main:
4 addi sp,sp,-16
5 li a0,10
6 sw ra,12(sp)
7 jal func
8 lw ra,12(sp)
9 addi a0,a0,1
10 addi sp,sp,16
11 ret
Unlike high-level languages, assembly language is very close to the ISA.
For example, the previous assembly program contains references to instructions (e.g.,
addi, li, ...) and registers (e.g., sp, ra, a0) that belong to the RV32I ISA. Lines
4 to 11 of the previous code contain assembly instructions, which are converted by
the assembler into RV32I machine instructions. As a consequence, they are ISA
dependent, i.e., an assembly program generated for one ISA is usually not compatible
with other ISAs.
Machine language is a low-level language that can be directly processed
by a computer’s central processing unit (CPU). An assembler is a tool
that translates a program in assembly language into a program in machine
language. For example, it converts assembly instructions (encoded as sequences of
ASCII characters) into machine instructions (encoded as sequences of bits according
to the ISA). Each assembly language is associated with a given ISA.
The “GNU Assembler”2 tool, or as, is an assembler that is capable of translating
programs written in several assembly languages into machine language for their
respective ISAs. In this book we will use the as tool to translate RV32IM assembly
programs to machine language programs. The following command line illustrates how
the riscv64-unknown-elf-as tool, a version of the GNU Assembler that generates
code for RISC-V ISAs, can be invoked to assemble an RV32I assembly program. In
this example, the RV32I assembly program is stored in the main.s file and the result,
a file that contains code in machine language, will be stored in the main.o file.
Assemblers usually produce object files that are encoded in binary and contain
code in machine language. The object file also contains other information, such as
the list of symbols (e.g., global variables and functions) defined in the file. There
are several known file formats used to encode object files. The Executable and
Linking Format, or ELF, is frequently used on Linux-based systems while the
Portable Executable format, or PE, is used on Windows-based systems. The
riscv64-unknown-elf-as tool, used in the previous example, produces an ELF-based
object file.
Even though the object file produced by the assembler contains code in machine
language, it is usually incomplete in the sense that it may still need to be relocated
(more on relocation later) or linked with other object files to compose the whole
program. For example, the code in an object file may need to be linked with the C
library so that the program can invoke the printf function. As a consequence, the
object file produced by the assembler is not an executable file.
A linker is a tool that “links” together one or more object files and
produces an executable file. The executable file is similar to an object file in the
sense that it is encoded in binary and contains code in machine language. Nonetheless,
it contains all the required elements (e.g., libraries) for execution.
[Figure excerpt: assembly language program (text file) with two routines. main: li a0, 10;
jal func; ret. func: li a1, 113; mul a0, a0, a1; ret.]
2 https://www.gnu.org/software/binutils/
The following command line illustrates how the riscv64-unknown-elf-ld tool,
a version of the GNU Linker3 tool that links object files generated for RISC-V ISAs,
can be invoked to link two object files together: the main.o and mylib.o object files.
In this example, the linker will produce an executable file named main.x.
Figure 3.1 illustrates the code generation process used to produce a native program
executable file from a C program organized in two files. First, the two C program files
are translated into assembly programs by the compiler. Then, the assembly programs
are assembled by the assembler, which produces object files. Finally, the linker links
the object files together producing an executable file.
Assuming the high-level language program files are named main.c and func.c,
the following sequence of commands produces an RV32I executable file named main.x.
x:
    .word 10            # 32-bit value identified by label x

sum10:
    lw a0, x            # load the value stored at x into a0
    addi a0, a0, 10     # a0 = a0 + 10
    ret
Global variables and program routines are program elements that are stored in
the computer main memory. Each variable and each routine occupies a sequence of
memory words and is identified by the address of the first memory word it occupies.
On the one hand, to read the contents of a global variable, or execute a routine, it
suffices to have its address, i.e., the address of the first memory word it occupies4.
On the other hand, the addresses assigned to variables and routines are only final in
the executable file, after the linker links together the multiple object files into a
single file. Hence, assembly programs require a mechanism to refer to global variables
and routines. This is accomplished by using labels, as illustrated in the previous
example. In this context, before allocating space for each global variable or producing
the code for each routine, the programmer (or the compiler) defines a label that will
be used to identify the variable or the routine.
Program symbols are “names” that are associated with numerical values
and the “symbol table” is a data structure that maps each program
symbol to its value. Labels are automatically converted into program symbols by
the assembler, and each one is associated with a numerical value that represents its
position in the program, which is a memory address. The assembler adds all symbols
to the program’s “symbol table”, which is also stored in the object file.
We can inspect the contents of the object file by using tools that decode the
information in the object file and show it in a human-readable format, i.e., a textual
format. The GNU nm tool, for example, can be used to inspect the “symbol table”
of an object file. Assuming the previous code was encoded into an object file named
sum10.o, we can inspect its symbol table by executing the riscv64-unknown-elf-nm
tool as follows.
$ riscv64-unknown-elf-nm sum10.o
00000004 t .L0
00000004 t sum10
00000000 t x
4 To execute a routine it suffices to set the PC with the address of the first instruction of the
routine - this is usually done by executing a “jump” instruction, which sets the PC with a given
value.
Notice that, in this case, the symbol table contains three symbols: .L0 5, sum10,
and x, which are associated with values 00000004, 00000004, and 00000000, respec-
tively.
The programmer may also explicitly define symbols by using the .set directive.
The following code shows a fragment of assembly code that employs the .set directive
to define a symbol named answer and assign value 42 to it.
1 .set answer, 42
2 get_answer:
3 li a0, answer
4 ret
Assuming the previous code is stored in a program file named get_answer.s, we
can assemble it and inspect the object file symbol table by executing the following
commands:
Notice that the symbol table contains two symbols: answer and get_answer. The
answer symbol is an absolute symbol, i.e., its value is not changed during the linking
process; this is indicated by the letter ‘a’ in the output. The get_answer symbol
is a symbol that represents a location in the .text section and may have its value
(which is an address) changed during the relocation process. The next sections discuss
the relocation process and the concept of program sections.
trunk42:
    li t1, 42           # t1 = 42
    bge t1, a0, done    # if 42 >= a0, keep a0 unchanged
    mv a0, t1           # otherwise, a0 = 42
done:
    ret
When assembling this program, the assembler translates each assembly instruction
(e.g., li, bge, ...) to a machine instruction, i.e., an instruction encoded with 32
bits. As a result, the program occupies a total of 16 memory words, four for each
instruction. Also, the assembler maps the first instruction to address 0, the second
one to address 4, and so on. In this context, the trunk42 label, which marks the
beginning of the program, is associated with address 0 and the done label, which
marks the position in which instruction ret is located, is associated with address c.
Since the bge instruction has a reference to label done, the assembler encodes the
location of the done label (address c) in the fields of this instruction, as an offset
relative to the address of the bge instruction itself.
5 The .L0 symbol was automatically introduced by the assembler when translating the lw a0, x
assembly instruction. This is a special instruction called pseudo-instruction that will be discussed
later in Section 6.4.
6 A branch instruction is an instruction that changes the execution flow under certain conditions.
In this example, the bge t1, a0, done (branch greater equal) instruction jumps to the position
identified by the done label if the value in register t1 is greater than or equal to the value in
register a0.
The GNU objdump tool can be used to inspect several parts of the object file.
The following example shows how to use the riscv64-unknown-elf-objdump7 tool
to decode the data and instructions on the trunk.o file so that we can inspect its
contents.
$ riscv64-unknown-elf-objdump -D trunk.o
00000000 <trunk42>:
0: 02a00313 li t1,42
4: 00a35463 bge t1,a0,c <done>
8: 00030513 mv a0,t1
0000000c <done>:
c: 00008067 ret
...
Notice that, for each instruction, it shows its address, its encoding in hexadeci-
mal, and a text that resembles assembly code8. The bge t1, a0, done instruction,
for example, is mapped to address 4 and encoded with the 32-bit value 00a35463.
The objdump tool indicates that it refers to the label done, which is mapped to ad-
dress c (bge t1,a0,c <done>). Also, notice that the labels (and their addresses) are
displayed on their respective program position.
In the previous example, the trunk42 function starts at address 0; however, when
linking this object file (trunk.o) with others, the linker may need to move the code
(assign new addresses) so that the object files do not occupy the same addresses. In
this process, the addresses associated with labels may change and each reference to a
label must also be fixed to reflect the new addresses.
Relocation is the process in which the code and data are assigned new
memory addresses. As discussed previously, during the relocation process, the
linker needs to adjust the code and data to reflect the new addresses. More specifically,
the addresses associated with labels in the symbol table and the references to labels
must be adjusted. The relocation table is a data structure that contains
information that describes how the program instructions and data need to
be modified to reflect the address reassignment. Each object file contains
a relocation table and the linker uses its information to adjust the code when
performing the relocation process.
The following example shows how to use the riscv64-unknown-elf-objdump tool
to inspect the contents of the relocation table on the trunk.o file. Notice that, in this
case, the object file contains one relocation record, which indicates that the instruction
on address 4, a RISC-V branch instruction, contains a reference to label done. The
linker uses this information to adjust the branch instruction when the done label is
assigned a new address.
$ riscv64-unknown-elf-objdump -r trunk.o
00010054 <trunk42>:
10054: 02a00313 li t1,42
10058: 00a35463 bge t1,a0,10060 <done>
1005c: 00030513 mv a0,t1
00010060 <done>:
10060: 00008067 ret
...
Notice that the code on the trunk.x program was relocated, i.e., assigned new
addresses. In this example, the code of the trunk42 routine starts at address 10054
and the bge instruction jumps to address 10060 in case the value in register t1 is
greater or equal to the value in register a0.
The assembler assembles this program and registers the exit label in the symbol
table as an undefined symbol. The riscv64-unknown-elf-nm tool identifies the
undefined symbols by placing the ‘U’ character before the symbol name. Assuming the
previous code was assembled into the main.o object file, we can inspect the contents
of its symbol table as follows:
$ riscv64-unknown-elf-nm main.o
00000000 t start
U exit
The assembler also registers the reference to this symbol in the relocation table.
The riscv64-unknown-elf-objdump tool shows that the main.o object file includes
a relocation record for the reference to the exit label on the jal instruction.
9 The jal instruction is used to invoke routines. In this case, it is invoking the exit routine. This
$ riscv64-unknown-elf-objdump -r main.o
When linking the object files, the linker must resolve the undefined symbols, i.e.,
it must find the symbol definition and adjust the symbol table and the code with
the symbol value. In the previous example, the linker will look for the exit symbol
so that it can adjust the jal instruction to refer to the correct address. In case it
cannot find the definition of the symbol, it stops the linking process and emits an
error message. The following example illustrates this situation. In this case, we are
trying to link the main.o file without providing another object file that contains a
definition of the exit label. Notice that the linker emits the error message undefined
reference to ‘exit’.
Assuming the code that invokes the exit function is located on the main.s file and
the exit function is located on the exit.s file, the following sequence of commands
shows how to assemble both files and link them together.
Notice that the linker no longer reports the undefined reference to ‘exit’
error. It is worth noting that, in case the exit label is not registered as a global
label, the linker will not use it to resolve the undefined symbol and the linking process
will fail.
Once the start label is registered as a global symbol, the linker uses its address
to set the entry point information. The following sequence of commands assembles the
main.s and the exit.s assembly programs and links them together into the main.x
executable file. In this case, there are neither error nor warning messages because we
used the .globl directive to set both the exit and the start labels as global symbols,
allowing the linker to resolve the exit reference and set the program entry point.
Notice that, when invoking the linker in this case, we passed the exit.o object file
before the main.o file. Because of this, the linker places the contents of the exit.o
file before the contents of the main.o file in the main.x file. This can be observed by
listing the contents of the main.x file with the riscv64-unknown-elf-objdump tool,
as follows:
$ riscv64-unknown-elf-objdump -D main.x
00010054 <exit>:
10054: 00000513 li a0,0
10058: 05d00893 li a7,93
1005c: 00000073 ecall
00010060 <start>:
Even though the linker placed the exit function first, the code associated with the
start label will be executed first because the entry point field contains the address
associated with the start label.
The GNU readelf tool can be used to display information about ELF files. The
following command shows how the riscv64-unknown-elf-readelf tool can be used
to inspect the header of the main.x executable file. Notice that the entry point
address field was set to 0x10060, i.e., the address of the start label.
$ riscv64-unknown-elf-readelf -h main.x
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2’s complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10060
Start of program headers: 52 (bytes into file)
Start of section headers: 476 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 1
Size of section headers: 40 (bytes)
Number of section headers: 6
Section header string table index: 5
When linking multiple object files, the linker groups information from sections
with the same name and places them together into a single section in the executable
file. For example, when linking multiple object files, the contents of the .text sections
from all object files are grouped together and placed sequentially in the executable
file in a single section that is also called .text. Figure 3.2 shows the layout of an
RV32I executable file that was generated by the riscv64-unknown-elf-ld tool and
is encoded using the Executable and Linking Format. This file contains three sections:
the .data, the .rodata, and the .text sections. The contents of section .text are
mapped to addresses 8000 to 8007, while the contents of section .data are mapped
to addresses 800d to 8011.
[Figure 3.2: layout of the executable file, one byte per address. From the bottom up: the
section header table, the .text section (addresses 8000 to 8007), the .rodata section
(starting at address 8008), and the .data section (addresses 800d to 8011).]
By default, the GNU assembler tool adds all the information to the .text section.
To instruct the assembler to add the assembled information into other sections, the
programmer (or the compiler) may use the .section secname directive. This di-
rective instructs the assembler to place the following assembled information into the
section named secname. The following example illustrates how the .section direc-
tive can be used to add instructions to the .text section and variables to the .data
section.
1 .section .data
2 x: .word 10
3 .section .text
4 update_x:
5 la t1, x
6 sw a0, (t1)
7 ret
8 .section .data
9 y: .word 12
10 .section .text
11 update_y:
12 la t1, y
13 sw a0, (t1)
14 ret
The .section .data directive in the first line of the previous example instructs
the assembler to add information to the .data section from this point on. The
second line contains a label (x:) and a .word directive, which are used together to
declare and initialize a global variable named x. The .section .text directive in the
third line instructs the assembler to add the following information to the .text
section. As a consequence, the update_x label (line 4) refers to a position in the .text
section and the next three instructions (lines 5-7) are added to the .text section. The
.section .data directive in line 8 instructs the assembler to add the following
information to the .data section. Hence, the y variable, created by combining the
y: label with the .word directive, is added to the .data section, right after the x
variable. After this, the .section .text directive in line 10 instructs the assembler to
add the next information to the .text section. Finally, the update_y label (line 11)
refers to a position in the .text section and the remaining instructions (lines 12-14)
are added to the .text section.
Assuming the previous code is stored on a file named prog.s, we may assemble the
program and inspect the contents of the object file using the following commands 10:
00000000 <update_x>:
0: 00000317 auipc t1,0x0
4: 00030313 mv t1,t1
8: 00a32023 sw a0,0(t1) # 0 <update_x>
c: 00008067 ret
00000010 <update_y>:
10: 00000317 auipc t1,0x0
14: 00030313 mv t1,t1
18: 00a32023 sw a0,0(t1) # 10 <update_y>
1c: 00008067 ret
00000000 <x>:
0: 000a c.slli zero,0x2
...
00000004 <y>:
4: 000c 0xc
...
Notice that the program routines, represented by the update_x and update_y
labels, and the program instructions are all located in the .text section while the
global variables, represented by the x and y labels, and the 32-bit values 000a and
000c, are located in the .data section. Also, notice that the elements of each section
are assigned addresses starting at zero. However, since instructions and data are
stored in the same memory, the main memory, we may not load the variables and
the instructions into the same memory addresses. The linker prevents this problem by
relocating the instructions and data so they are assigned non-conflicting addresses.
The following commands show how to invoke the linker to produce an executable file
named prog.x and how to invoke the objdump tool to inspect its contents:
assembler into two machine instructions: auipc and mv. These two instructions will be later updated
by the linker so that they load the address of the respective label into the target register, i.e., register
t1.
00010074 <update_x>:
10074: 00001317 auipc t1,0x1
10078: 01c30313 addi t1,t1,28 # 11090 <__DATA_BEGIN__>
1007c: 00a32023 sw a0,0(t1)
10080: 00008067 ret
00010084 <update_y>:
10084: 80418313 addi t1,gp,-2044 # 11094 <y>
10088: 00a32023 sw a0,0(t1)
1008c: 00008067 ret
00011090 <__DATA_BEGIN__>:
11090: 000a c.slli zero,0x2
...
00011094 <y>:
11094: 000c 0xc
...
Notice that the linker assigned addresses 10074 to 1008f to the contents of the
.text section and addresses 11090 to 11097 to the contents of the .data section.
• Addresses on object files are not final and elements from different sections may
be assigned the same addresses, as discussed in Section 3.3. As a consequence,
the elements of different sections may not reside in the main memory at the
same time;
• Object files usually contain several references to undefined symbols, which are
expected to be resolved by the linker;
• Object files contain a relocation table so that instructions and data on object
files can be relocated on linking. Addresses on executable files are usually final;
• Object files do not have an entry point;
Assembly language
Assembly programs are encoded as plain text files and contain four main elements:
• Comments: comments are textual notes that are often used to document
information on the code; however, they have no effect on the code generation and
the assembler discards them;
• Labels: as discussed in Section 3.2.1, labels are “markers” that represent
program locations. They are usually defined by a name ended with the suffix “:”
and can be inserted into an assembly program to “mark” a program position so
that it can be referred to by assembly instructions or other assembly commands,
such as assembly directives;
• Assembly instructions: Assembly instructions are instructions that are
converted by the assembler into machine instructions. They are usually encoded
as a string that contains a mnemonic and a sequence of parameters, known as
operands. For example, the “addi a0, a1, 1” string contains the addi
mnemonic and three operands: a0, a1, and 1;
• Assembly directives: Assembly directives are commands used to coordinate
the assembling process. They are interpreted by the assembler. For example, the
.word 10 directive instructs the assembler to assemble a 32-bit value (10) into
the program. Assembly directives are usually encoded as strings that contain
the directive name, which has a dot (‘.’) prefix, and its arguments.
As discussed before, comments have no effect on the assembling process and are
discarded by the assembler. This is usually performed by a preprocessor, which
removes all comments and extra white spaces. Once comments and extra white spaces
are discarded, the assembly program contains only three kinds of elements: labels,
assembly instructions and assembly directives. Assuming <label>, <instruction>,
and the <directive> represent valid labels, assembly instructions, and assembly
directives, respectively, the following regular expression can be used to summarize
the syntax of the assembly language once its comments and extra white spaces are
removed.
The first two rules of the previous regular expression indicate that an assembly
program is composed of one or more lines, which are delimited by the end of line
character, i.e., ‘\n’. The last rule implies that:
• a line may be empty. Notice that the <label>, the <instruction>, and the
<directive> elements are optional1;
1 Elements expressed between brackets on a regular expression are optional.
The following RV32I assembly code contains examples of valid assembly lines:
1 x:
2
The following RV32I assembly code contains examples of invalid assembly lines.
The first line contains two labels and the second one contains an instruction followed
by a label (notice that the label has to precede the instruction when both are located
in the same line). The third line contains two instructions and the fourth one contains
two assembly directives; however, only one instruction or one directive is allowed per
line. The fifth line contains an assembly directive followed by a label; however, the
label has to precede the directive when both are located in the same line. The sixth
line contains an instruction and an assembly directive, while the seventh line contains
an invalid directive.
1 x: z:
2 addi a0, a1, 1 sum:
3 li a0, 2 li a1, 1
4 .word 10 .word 20
5 .word 10 y:
6 addi a0, a1, 1 .word 12
7 .sdfoiywer 1
The following RV32I assembly code is also invalid because all elements of a single
instruction, i.e., its mnemonic and operands, must be expressed in the same line.
This is also a requirement for assembly directives.
1 addi
2 a0, a1, 1
4.1 Comments
RV32I assembly programs may contain line or multi-line comments. On GNU as-
semblers, line comments are delimited by a line comment character, which is target
specific, i.e., it depends on the target ISA. The RV32I GNU assembler uses the #
character as the line comment character. All characters located between the first
occurrence of the line comment character (e.g., #) on the line and the end of the
same line are considered part of the comment. The following assembly code shows
examples of line comments.
1 x: .word 10
2 foo:
3 addi a0, a1, 1
sum1:
    /* This
       is
       a
       multi-line
       comment.
    */
    addi a0, a1, 1
    ret
hard-wired to zero.
example is the “mv” instruction, which copies the contents of one register into an-
other. In this case, the pseudo-instruction “mv a5, a7”, which copies the contents of
a7 into a5, is converted into the instruction “addi a5, a7, 0”, which adds zero to
the value in a7 and stores the result on register a5.
Appendix A presents a list with most of the RV32I assembly instructions, and the
chapters in Part II discuss how these instructions can be used to implement program
structures, including conditional sentences, loops, and routines.
The operands of assembly instructions may contain:
• A register name: a register name identifies one of the ISA registers. RV32I ISA
registers are numbered from 0 to 31 and are named x0, x1, ..., x31. RV32I
registers may also be identified by their aliases, for example, a0, t1, ra, etc..
Appendix A presents a list of RV32I registers and their respective aliases.
• A symbol name: symbol names identify symbols on the symbol table and are
replaced by their respective values during the assembling and linking processes.
They may identify, for example, symbols that were explicitly defined by the
user or symbols created automatically by the assembler, such as the symbols
created for labels. Their values are also encoded into the machine instruction as
a sequence of bits.
are automatically converted into symbols by the assembler. Also, the programmer,
or the compiler, may explicitly create symbols by using the .set directive.
Symbol names are defined by a sequence of alphanumeric characters and the
underscore character (_). However, the first character may not be a numeric character.
The following names are examples of valid symbol names: x, var1, z12345, _x, _, _1,
_123, and a12b.
The following names are examples of invalid symbol names: 1, 1var, z@12345,
x-y, -var, and a+b.
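The naming rule above can be checked mechanically. The following Python sketch is only an illustration of the rule as stated in this section (the function name is_valid_symbol_name is hypothetical, and the GNU assembler itself accepts a few additional characters that this sketch ignores).

```python
import re

# Rule stated in the text: letters, digits, and underscores only,
# and the first character may not be a digit.
SYMBOL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_symbol_name(name: str) -> bool:
    """Return True if name follows the symbol-name rule described above."""
    return SYMBOL_NAME.fullmatch(name) is not None


for name in ["x", "var1", "z12345", "_x", "a12b", "1var", "z@12345", "x-y", "a+b"]:
    print(f"{name!r}: {is_valid_symbol_name(name)}")
```

The first five names are reported as valid and the last four as invalid, matching the examples listed above.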
The following code shows examples of instructions that use symbol names as
operands (lines 4 and 5). The .set directive (line 1) creates the max_temp symbol
and associates the value 100 with it. The load immediate instruction (line 4) loads
the value of symbol max_temp into register t1. The branch if less than or equal
instruction (ble) jumps to the code position represented by symbol temp_ok (which is
defined automatically by the label temp_ok:) if the value in register a0 is less than
or equal to the value in register t1.
4.5 Labels
As discussed in Section 3.2.1, labels are “markers” that represent program locations.
They can be referred to by instructions and assembly directives and are translated to
addresses during the assembling and linking processes.
GNU assemblers usually accept two kinds of labels: symbolic and numeric labels.
Symbolic labels are stored as symbols in the symbol table and are often used to
identify global variables and routines. They are defined by an identifier followed by
a colon (:). The identifier follows the same syntax of symbol names, as defined in
the previous section. The following code contains two symbolic labels: age: and
get_age:.
    age: .word 42

    get_age:
        la t1, age
        lw a0, (t1)
        ret
Numeric labels are defined by a decimal number, typically a single digit, followed by
a colon (:). They are used for local reference and are not included in the symbol
table of executable files. Also, they can be redefined repeatedly in the same assembly
program.
References to numeric labels contain a suffix that indicates whether the reference
is to a numeric label positioned before (‘b’ suffix) or after (‘f’ suffix) the reference.
The following code contains examples of numeric labels and references to them. This
code has one symbolic label (pow) and two numeric labels (both named 1:). The first
numeric label, located at line 7, marks the beginning of a sequence of instructions
that belongs to a loop. The jump instruction located at line 11 jumps back to this
label – notice the reference 1b, which refers to the numeric label ‘1’ positioned before
the reference. The second numeric label, located at line 12, marks the location of
the instruction that is positioned after the loop, i.e., the instruction that must be
executed when the execution flow leaves the loop. The instruction at line 8 jumps to
this numeric label when the value of register a1 is equal to zero – notice the reference
1f, which refers to the numeric label ‘1’ positioned after the reference.
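The way the 'b' and 'f' suffixes select a definition can be sketched in Python. This is a simplified model, not the assembler's actual implementation: it assumes each definition and reference is identified by its line number, and the function name resolve_numeric_label is hypothetical.

```python
def resolve_numeric_label(definitions, label, ref_line, direction):
    """Return the line of the numeric label targeted by a reference.

    definitions: list of (line, label) pairs, one per numeric label definition.
    direction:   'b' selects the closest definition before ref_line,
                 'f' selects the closest definition after ref_line.
    """
    lines = sorted(line for line, name in definitions if name == label)
    if direction == "b":
        candidates = [line for line in lines if line < ref_line]
        return candidates[-1] if candidates else None
    if direction == "f":
        candidates = [line for line in lines if line > ref_line]
        return candidates[0] if candidates else None
    raise ValueError("direction must be 'b' or 'f'")


# Two definitions of label 1, at lines 7 and 12, as in the example described above.
defs = [(7, 1), (12, 1)]
print(resolve_numeric_label(defs, 1, 11, "b"))  # reference 1b at line 11
print(resolve_numeric_label(defs, 1, 8, "f"))   # reference 1f at line 8
```

With the line numbers used in the description above, the reference 1b at line 11 resolves to the label at line 7, and the reference 1f at line 8 resolves to the label at line 12.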
    sum42:
        addi a0, a0, 42
        ret
Upon start, the GNU assembler clears the contents of the sections and the symbol
table, initializes all location counters with zero, and selects the .text section as the
active section. Figure 4.1 illustrates the status of the internal assembler structures
upon start.
Figure 4.1: Assembler internal structures upon start. The sections’ contents and the
symbol table are cleared, the location counters are initialized with zero, and the .text
section is set as the active section.
Once the internal structures are initialized, the assembler reads the input assem-
bly file sequentially, line by line, processing the labels, assembly instructions, and
assembly directives one by one, in the order they appear.
The first element in our assembly program is a label named “sum42:”. When
processing a label, the assembler registers it as a symbol at the symbol table and
associates it with an address that represents the current program location. The
current location is indicated by the active location counter. Figure 4.2 illustrates how
the symbol table is updated when the “sum42:” label is processed by the assembler.
Notice that the assembler registers the name sum42 in the symbol table (1) and
associates it with address zero (2), i.e., the address in the active location counter.
Figure 4.2: Symbol table after the sum42: label is processed. The symbol sum42 (1)
is associated with address 000 (2), the value of the .text location counter.
The next element in our assembly program is the addi a0, a0, 42 assembly
instruction. In this case, the assembler (1) translates it into a machine instruction,
(2) adds it to the active section in the address indicated by the active location counter,
and (3) updates the active location counter so it points to the next available address.
In this case, the active location counter is incremented by four units because the
RV32I instruction that was added to the active section occupies four bytes, i.e., four
memory locations. Figure 4.3 illustrates this process.
Figure 4.3: Assembling the addi a0, a0, 42 instruction. The assembler (1) translates
it into the machine instruction 0x02a50513, (2) adds its four bytes (00010011, 00000101,
10100101, and 00000010, in binary) to the .text section at addresses 000 to 003, and
(3) advances the .text location counter from 000 to 004.
The last element in our assembly program is also an assembly instruction. Again,
the assembler (1) translates it into a machine instruction, (2) adds it to the active
section in the address indicated by the active location counter, and (3) updates the
active location counter so it points to the next available address. This process is
illustrated in Figure 4.4.
Figure 4.4: Assembling the ret instruction. The assembler (1) translates it into the
machine instruction 0x00008067, (2) adds its four bytes to the .text section starting
at address 004, and (3) advances the .text location counter from 004 to 008.
Once all elements from the input file are processed, the assembler stores the section
contents, the symbol table, and other relevant information (such as the relocation
records) in the object file.
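The steps above can be mimicked with a few lines of Python. This is a minimal sketch of the bookkeeping only, not of a real assembler: the helper encode_i_type and the variable names are assumptions made for this illustration, and the register numbers (a0 is x10, ra is x1) are written by hand.

```python
def encode_i_type(imm, rs1, funct3, rd, opcode):
    """Encode an RV32I I-type instruction as a 32-bit integer."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


symbol_table = {}
text_section = bytearray()
location_counter = 0  # location counter of the .text section

# Label "sum42:" -> register the symbol with the current location.
symbol_table["sum42"] = location_counter

# "addi a0, a0, 42" (a0 is x10) and "ret" (an alias of "jalr x0, 0(x1)").
program = [
    encode_i_type(42, 10, 0b000, 10, 0b0010011),
    encode_i_type(0, 1, 0b000, 0, 0b1100111),
]
for machine_instruction in program:
    # Store the four bytes in little-endian order and advance the counter.
    text_section += machine_instruction.to_bytes(4, "little")
    location_counter += 4

print(symbol_table, text_section.hex(), location_counter)
```

The two encoded values are 0x02a50513 and 0x00008067, the same machine instructions shown in Figures 4.3 and 4.4, and the location counter ends at 8.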
Table 4.1: List of assembly directives that can be used to add values to a program.
All directives on Table 4.1 add values to the active section. The .byte, .half,
.word, and .dword directives add one or more values to the active section. Their
arguments may be expressed as immediate values, as discussed in Section 4.3, symbols,
which are replaced by their value during the assembling and linking processes, or by
arithmetic expressions that combine both. The following code shows examples of
valid arguments for these directives. The .byte in the first line adds four 8-bit values
to the active section (10, 12, 97, and 10). The .word directive in the second line
adds a 32-bit value associated with symbol x to the active section. Notice that the
value associated with symbol x is the address assigned to label x:. The .word directive in
the third line also adds a 32-bit value to the active section, however, in this case, the
value is computed by adding four to the value associated with symbol y, which is the
address assigned to label y:.
The .string, .asciz, and .ascii directives add strings to the active section. The
string is encoded as a sequence of bytes, as discussed in Section 2.3. The .string
and .asciz directives also add, after the string, an extra byte with value zero. They
are useful to add NULL-terminated strings to the program3.
To illustrate the use of the previous directives, let us assemble the following pro-
gram, which adds values to the .data section:
    .section .data
    msg: .ascii "hello"
    x: .word 10
As discussed in Section 4.6, the GNU assembler first clears the contents of the
sections and the symbol table, initializes all location counters with zero, and selects
the .text section as the active section. Then, it starts processing the input file.
The first assembly element in the input file is the .section .data directive, which
instructs the assembler to make .data the active section. Figure 4.5 illustrates
this process.
Figure 4.5: The .section .data directive makes .data the active section (1). Both
location counters still hold the value 000.
The next element in our assembly program is a label named “msg:”. In this
case, the assembler (1) registers the symbol named msg in the symbol table and (2)
associates it with an address that represents the current program location, which
is indicated by the active location counter, i.e., the location counter of the .data
section. Figure 4.6 illustrates this process.
3 Strings declared in C programs are NULL-terminated strings.
Figure 4.6: Symbol table after the msg: label is processed. The symbol msg (1) is
associated with address 000 (2), the value of the .data location counter.
The next element in our assembly program is the .ascii "hello" directive, which
instructs the assembler to add a string to the active section. Assuming our input file is
encoded using the ASCII standard, the assembler (1) encodes the string as a sequence
of bytes based on the ASCII standard, (2) adds these bytes to the next available
addresses on the active section, and (3) updates the location counter. Figure 4.7
illustrates this process.
Figure 4.7: Assembling the .ascii "hello" directive. The five bytes of the string are
added to the .data section starting at address 000 (the first byte is 01101000 in
binary, the ASCII code of 'h'), and (3) the .data location counter advances from
000 to 005.
The next element in our assembly program is a label named “x:”. In this case,
the assembler registers the symbol named x in the symbol table and associates it with
the address that represents the current program location, i.e., the address in the active
location counter.
Finally, the last element in our assembly program is the .word 10 directive, which
instructs the assembler to add a 32-bit value to the active section. In this case, the
assembler (1) encodes the 32-bit value as a sequence of four bytes, (2) stores the bytes
on the active section using the little-endian convention4, and (3) updates the location
counter. Figure 4.8 illustrates this process.
Figure 4.8: Assembling the .word 10 directive into the .data section.
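The processing of the three-line .data program above can be reproduced with a short Python sketch. It is only a model of the bookkeeping described in the text; the helper names define_label, emit_ascii, and emit_word are hypothetical, and the location counter is modeled as the current length of the section.

```python
import struct

symbol_table = {}
data_section = bytearray()


def define_label(name):
    """Associate name with the current location counter of the section."""
    symbol_table[name] = len(data_section)


def emit_ascii(text):
    """Model .ascii: append the ASCII bytes, with no terminating zero."""
    data_section.extend(text.encode("ascii"))


def emit_word(value):
    """Model .word: append a 32-bit value in little-endian byte order."""
    data_section.extend(struct.pack("<I", value & 0xFFFFFFFF))


define_label("msg")   # msg:
emit_ascii("hello")   # .ascii "hello"
define_label("x")     # x:
emit_word(10)         # .word 10

print(symbol_table)
print(data_section.hex())
```

In this model, msg is associated with address 0 and x with address 5, and the value 10 occupies the four bytes that follow the string, with the least significant byte (0x0a) stored first.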
section. To instruct the assembler to add the assembled information into other sec-
tions, the programmer (or the compiler) may use the .section secname directive.
As discussed in Section 4.6, when assembling a program, the assembler encodes and
adds each assembly element to the active section. The “.section secname” directive
changes the active section to secname; hence, all the information processed by the
assembler after this directive is added to the secname section.
Program instructions are expected to be placed on the .text section, while con-
stants, i.e., read-only data, must be placed on the .rodata section. Also, initialized
global variables must be placed on the .data section, and uninitialized global variables
should be placed on the .bss section.
The following assembly code shows how the .section directive can be used to
add the program instructions to the .text section and the program variables to the
.data and .rodata sections.
    .section .text
    set_x:
        la t1, x
        sw a0, (t1)
        ret
    get_msg:
        la a0, msg
        ret
    .section .data
    x: .word 10
    .section .rodata
    msg: .string "Assembly rocks!"
NOTE: The RV32I GNU assembler also contains the “.text”, “.data”,
and “.bss” directives, which are aliases to the “.section .text”, “.section
.data”, and “.section .bss” directives, respectively.
Since no information is stored on the .bss section in object and executable files,
the GNU assembler does not allow assembly programs to add data to the .bss section.
To illustrate this situation, let us consider the following code and assume it is stored
on a file named data-on-bss.s:
    .section .bss
    x: .word 10
    .section .text
    set_x:
        la t1, x
        sw a0, (t1)
        ret
This code is trying to use the .word 10 directive to add a 32-bit value to the .bss
section. However, when processing the .word 10 directive, the GNU assembler stops
assembling the code and emits the following error message:
To allocate variables on the .bss section, it suffices to declare a label to identify the
variable and advance the .bss location counter by the number of bytes the variable
requires, so that further variables are allocated on other addresses.
The .skip N directive advances the location counter by N units and can be used to
allocate space for variables on the .bss section. The following code shows how the
.skip directive can be combined with labels to allocate space for three distinct
variables: x, V, and y. In this example, the program is allocating 4 bytes for each
of the variables x and y and 80 bytes for variable V. As a consequence, labels x,
V, and y will be associated with addresses 0x0, 0x4, and 0x54, respectively.
    .section .bss
    x: .skip 4
    V: .skip 80
    y: .skip 4
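The addresses mentioned above follow directly from the sizes passed to .skip. The Python sketch below recomputes them; the function allocate is a hypothetical helper written for this illustration, and it assumes the .bss location counter starts at zero.

```python
def allocate(variables):
    """Map each (name, size_in_bytes) pair to an address, in declaration order."""
    addresses = {}
    location_counter = 0
    for name, size in variables:
        addresses[name] = location_counter  # the label takes the current location
        location_counter += size            # .skip size advances the counter
    return addresses, location_counter


addresses, bss_size = allocate([("x", 4), ("V", 80), ("y", 4)])
print({name: hex(addr) for name, addr in addresses.items()}, bss_size)
```

The computed addresses are 0x0, 0x4, and 0x54, and the section ends up 88 bytes long.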
NOTE: Some systems initialize the memory words dedicated to the .bss
section with zeros when loading the program into the main memory for execu-
tion. Nonetheless, the programmer should not assume variables on the .bss
section will be initialized with zeros.
    .set max_value, 42

    truncates_value_to_max:
        li t1, max_value
        ble a0, t1, ok
        mv a0, t1
    ok:
        ret
The .equ directive performs the same task as the .set directive.
    .globl max_value
    .globl start

    .set max_value, 42

    start:
        li a0, max_value
        jal process_temp
        ret
1 .text
2 foo:
3 j next
4 .byte 0xa
5 next:
6 ret
Even though the previous program was assembled by the assembler, an RV32I CPU
will fail when trying to execute the ret instruction because it requires all instructions
to be stored starting at addresses that are multiples of four.
The programmer (or the compiler) is responsible for keeping RV32I instructions
aligned to 4-byte boundaries, i.e., at addresses that are multiples of four. In the
previous example, it could be done by advancing the location counter by three units
right after adding the 8-bit value to the program (line 4). In this context, the following
code would be executed correctly by an RV32I CPU.
1 .text
2 foo:
3 j next
4 .byte 0xa
5 .skip 3 # Advancing the location counter by 3 units.
6 # This is a very poor way of keeping the
7 # location counter aligned to a 4 byte boundary.
8 next:
9 ret
The proper way of ensuring the location counter is aligned is by using the .align
N directive. The .align N directive checks if the location counter is a multiple of
2^N. If it is, the directive has no effect on the program; otherwise, it advances the
location counter to the next value that is a multiple of 2^N.
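The effect of .align N on the location counter can be expressed as a rounding operation. The Python function below is a minimal sketch of that arithmetic only (the name align is chosen for this illustration; it does not model padding bytes or any other assembler behavior).

```python
def align(location_counter, n):
    """Model .align n: round the location counter up to a multiple of 2**n."""
    boundary = 1 << n
    return (location_counter + boundary - 1) & ~(boundary - 1)


# After "j next" (4 bytes) and ".byte 0xa" (1 byte), the counter holds 5.
print(align(5, 2))  # advanced to the next multiple of four
print(align(8, 2))  # already a multiple of four, so unchanged
```

With a location counter of 5, .align 2 advances it to 8, whereas a counter that already holds 8 is left untouched.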
The compiler usually inserts a .align 2 directive5 before routine labels to ensure
the routine instructions start on addresses that are multiples of four. The following
code uses the .align 2 directive to align the location counter before each routine.
    .text
    .align 2
    func1:
        addi a0, a0, 2
        ret
    .align 2
    func2:
        addi a0, a0, 42
        ret
Notice that, if the location counter already contains a value that is a multiple of
2^N, then the .align N directive has no effect on the location counter. Hence, since
the code in the previous example starts at address zero and each assembly element is
an assembly instruction that occupies exactly four bytes, the .align 2 directive has
no effect on the program.
The RV32I ISA allows programs to load and store data on unaligned memory
addresses; however, for performance reasons, The RISC-V Instruction Set Manual [4]
recommends 16-bit, 32-bit, and 64-bit values to be stored on addresses that are
multiples of two, four, and eight, respectively. The following code shows how the
.align N directive can be used to align multi-byte variables on the memory.
    .data
    .align 1
    i: .half 1      # 16-bit variable initialized with value 1
    .align 2
    x: .word 9      # 32-bit variable initialized with value 9
    .align 3
    y: .dword 11    # 64-bit variable initialized with value 11
    .bss
    .align 3
    z: .skip 8      # 64-bit variable (uninitialized)
5 The .align 2 directive aligns the location counter to a multiple of four (2^2).
User-level programming
Chapter 5
Introduction
Many computer systems are organized so that the software is divided into user and
system software. The system software (e.g., the operating system kernel and device
drivers) is the software responsible for protecting and managing the whole system,
including interacting with peripherals to perform input and output operations and
loading and scheduling user applications for execution. The user software is usually
limited to performing operations with data that is located on CPU registers and the
main memory. Whenever the user software needs to perform a procedure that requires
interacting with other parts of the system, such as reading data from a file or showing
information on the computer display, it invokes the system software to perform the
procedure on behalf of the user software.
In this part of the book, we will focus on the implementation of user software, i.e.,
software that performs operations with data that is located on CPU registers and the
main memory. We will also discuss how user software may invoke an operating system
to perform other operations on its behalf, such as input/output from/to peripherals.
Part III covers system-level programming, including interacting with peripherals
and securing the system against faulty or malicious user programs.
Chapter 6
The RV32I ISA
The RISC-V is a modular Instruction Set Architecture, allowing the design of a wide
variety of microprocessors. This flexibility allows industry players to design micro-
processors for applications with different requirements, including ultra-low-power and
compact microprocessors for embedded devices and high-performing microprocessors
for powerful servers running on data centers.
To achieve this flexibility, the RISC-V Instruction Set Architecture relies on four
base Instruction Set Architectures and several extensions that can be combined with
the base Instruction Set Architectures to implement specialized versions of the
Instruction Set Architecture. Table 6.1 presents the base Instruction Set Architectures
and some of their extensions1.
Base ISAs
Name Description
RV32I 32-bit integer instruction set
RV32E 32-bit integer instruction set for embedded microprocessors
RV64I 64-bit integer instruction set
RV128I 128-bit integer instruction set
Extensions
Suffix Description
M Standard extension for integer multiplication and division
A Standard extension for atomic instructions
F Standard extension for single-precision Floating-Point
D Standard extension for double-precision Floating-Point
G Shorthand for the base and above extensions
Q Standard extension for quad-precision floating-point
L Standard extension for decimal floating-point
C Standard extension for compressed instructions
B Standard extension for bit manipulation
J Standard extension for dynamically translated languages
T Standard extension for transactional memory
P Standard extension for packed-SIMD instructions
V Standard extension for vector operations
N Standard extension for user-level interrupts
H Standard extension for hypervisor
In this book, we will focus on the RV32IM, which combines the RV32I base with
extension M, i.e., the instructions for integer multiplication and division. The
RV32IM has the following properties:
1 Some of these extensions are still under development. Refer to the official RISC-V consortium
CHAPTER 6. THE RV32I ISA
• It also contains instructions to multiply and divide values held in the integer
registers (M extension).
Address   Contents, binary (a)   Contents, hexadecimal (b)
0         00110110               36
1         00000000               00
2         00001000               08
3         10000000               80
4         11110000               F0
5         11111111               FF
6         00001111               0F
7         11100001               E1
...       ...                    ...
(each memory location stores 1 byte)
Figure 6.1: Organization of a byte addressable memory with its contents represented
in the binary (a) and the hexadecimal (b) bases.
Datatypes larger than one byte are stored on multiple memory locations. Hence,
when a halfword (word) datatype value is stored on memory, its two (four) bytes
are stored on two (four) consecutive memory locations.
When translating a program written in “C” to RV32I assembly code, the C
datatypes must be converted into RISC-V native datatypes. Table 6.3 shows the
mappings from C native datatypes to RV32I native datatypes2 . All pointers in C
(e.g., int*, char*, and void*) represent memory addresses and are mapped to the
unsigned word datatype.
2 This mapping is valid for the RISC-V ilp32 ABI, which is used in this book and
further discussed in Chapter 8.
Register   Alias    Description                        Caller-save   Callee-save
x0         zero     Hard-wired zero
x1         ra       Return address                     x
x2         sp       Stack pointer                                    x
x3         gp       Global pointer
x4         tp       Thread pointer
x5-x7      t0-t2    Temporaries 0 to 2                 x
x8         s0/fp    Saved register 0 / Frame pointer                 x
x9         s1       Saved register 1                                 x
x10-x17    a0-a7    Function arguments 0 to 7          x
x18-x27    s2-s11   Saved registers 2 to 11                          x
x28-x31    t3-t6    Temporaries 3 to 6                 x
pc         pc       Program counter
words, to read/write a value from/to memory, the software must execute a load/store
instruction.
The RISC-V is a Load/Store architecture; hence, to perform operations (e.g.,
arithmetic operations) on data stored on memory, it requires the data to be first
retrieved from memory into a register by executing a load instruction. As an example,
let us consider the following assembly code, which loads a value from memory,
multiplies it by two, and stores the result on memory.
    lw a5, 0(a0)
    add a6, a5, a5
    sw a6, 0(a0)
The first instruction, called load word and indicated by the mnemonic lw, is a
load instruction. It retrieves a word value from memory and stores it on register a5.
The expression 0(a0) indicates the address of the memory position that stores the
value that must be loaded. In this case, the address is the sum of the contents of
register a0 and the constant 0. In other words, in case register a0 contains the value
8000 when this load instruction is executed, the hardware will retrieve the data from
the memory location associated with address 8000.
The second instruction, indicated by the mnemonic add, adds two values and
stores the result on a register. In this case, it is adding the values from registers a5
and a5 and storing the result on register a6. Notice that, since both source operands
are the same, i.e., a5, the result is equivalent to multiplying the contents of a5 by
two.
Finally, the third instruction, called store word and indicated by the mnemonic sw,
stores the value from register a6 into memory. Again, the expression 0(a0) indicates
the address of the memory position that will receive the data.
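The effect of these three instructions can be traced with a small Python model. It is a sketch under simplifying assumptions: the memory is a hypothetical 16-byte array, register a0 holds the address 8 (instead of 8000), the initial value 21 is arbitrary, and 32-bit overflow on the addition is not modeled.

```python
import struct

memory = bytearray(16)                  # hypothetical 16-byte memory
registers = {"a0": 8, "a5": 0, "a6": 0}
struct.pack_into("<i", memory, 8, 21)   # place the value 21 at address 8


def lw(rd, offset, base):
    """Load a 32-bit little-endian word from address registers[base] + offset."""
    registers[rd] = struct.unpack_from("<i", memory, registers[base] + offset)[0]


def sw(rs, offset, base):
    """Store the 32-bit word in registers[rs] at address registers[base] + offset."""
    struct.pack_into("<i", memory, registers[base] + offset, registers[rs])


lw("a5", 0, "a0")                                    # lw  a5, 0(a0)
registers["a6"] = registers["a5"] + registers["a5"]  # add a6, a5, a5
sw("a6", 0, "a0")                                    # sw  a6, 0(a0)

result = struct.unpack_from("<i", memory, 8)[0]
print(result)
```

After the three steps, the word stored at the address held in a0 contains twice its original value, while the arithmetic itself happened only on registers.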
6.4 Pseudo-instructions
When assembling an assembly program, the assembler converts each assembly in-
struction (encoded as plain text) into its corresponding machine instruction (encoded
as binary). For example, the assembly instruction “add x10, x11, x12” is con-
verted into its corresponding machine instruction, which is encoded in four bytes as
“0x00c58533”.
A pseudo-instruction is an assembly instruction that does not have a corresponding
machine instruction on the ISA, but can be translated automatically by the assembler
into one or more alternative machine instructions to achieve the same effect.
As an example, the no operation instruction, or “nop”, is an RV32I pseudo-instruction
that is converted by the assembler into the “addi x0, x0, 0” instruction5. Another
example is the “mv” instruction, which copies the contents of one register into an-
other. In this case, the pseudo-instruction “mv a5, a7”, which copies the contents of
a7 into a5, is converted into the instruction “addi a5, a7, 0”, which adds zero to
the value in a7 and stores the result on register a5.
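The translation of these two pseudo-instructions can be pictured as a simple text rewriting step. The Python function below is only a sketch of that idea: expand_pseudo is a hypothetical name, it handles just the nop and mv cases discussed above, and a real assembler performs this translation while encoding the instruction rather than by rewriting text.

```python
def expand_pseudo(instruction):
    """Expand the two pseudo-instructions discussed above; pass others through."""
    mnemonic, _, operands = instruction.strip().partition(" ")
    if mnemonic == "nop":
        return "addi x0, x0, 0"
    if mnemonic == "mv":
        rd, rs = [operand.strip() for operand in operands.split(",")]
        return f"addi {rd}, {rs}, 0"
    return instruction.strip()


print(expand_pseudo("mv a5, a7"))
print(expand_pseudo("nop"))
```

For the inputs above, the sketch produces "addi a5, a7, 0" and "addi x0, x0, 0", the same instructions named in the text.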
Since the focus of the book is on assembly programming, the remainder of this
book will not differentiate pseudo-instructions from real RV32I machine instructions.
We refer the reader to the RISC-V Instruction Set Manual [4] for a full list of real
RV32I machine instructions and pseudo-instructions.
hard-wired to zero.
    MNM rd, rs1, rs2
or
    MNM rd, rs1, imm
where MNM indicates the instruction mnemonic, rd indicates the target register, rs1
indicates the first source operand and rs2 (or imm) indicates the second operand. The
following assembly code shows examples of logic, shift, and arithmetic RV32I assembly
instructions:
    and  a0, a2, a6    # a0 <= a2 & a6
    slli a1, a3, 2     # a1 <= a3 << 2
    sub  a4, a5, a6    # a4 <= a5 - a6
The first instruction performs a bitwise “and” operation using the values from a2
and a6 and stores the result on a0. The second instruction shifts the value from a3
to the left twice and stores the result on a1. In this case, the second source operand
(2) is an immediate value (imm). Finally, the third instruction subtracts the value at
a6 from the value at a5 and stores the result on register a4.
Any general purpose register (x0-x31) may be used as rd, rs1, or rs2. However,
it is worth noticing that if x0 (zero) is indicated as a target operand (rd), then the
result will be discarded. This happens because x0 is hard-wired to zero.
They are invalid because the assembler cannot encode the immediate values into
the instruction (notice that they cannot be encoded as 12-bit two's-complement
signed numbers). In this example, the assembler will fail to assemble the code and
potentially show an error message. The GNU assembler (as) shows the following
message when trying to assemble an assembly program with these instructions:
6 An immediate value is a constant value encoded into the instruction itself.
To perform operations with immediate values less than -2048 or greater than 2047,
the programmer could employ multiple instructions to compose the value, store it into
a register, and use an instruction that reads the second source operand from a register.
There are several ways of composing these values using RV32I instructions. As an
example, one could load a small constant (e.g., 1000) into a register, shift its value to
the left to multiply it by a power of two, and add another small constant to produce
the desired value. The following assembly code produces the value 4005 by loading
the value 1000 into a5, shifting it twice to the left7, and adding 5 to it.
    li   a5, 1000      # a5 <= 1000
    slli a5, a5, 2     # a5 <= a5 << 2 (4000)
    addi a5, a5, 5     # a5 <= a5 + 5  (4005)
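The range restriction and the composition strategy can both be checked with a few lines of Python. This is an illustrative sketch; the function name fits_in_imm12 is hypothetical, and the variable a5 simply stands for the register used in the description above.

```python
def fits_in_imm12(value):
    """True if value can be encoded as a 12-bit two's-complement immediate."""
    return -2048 <= value <= 2047


a5 = 1000      # load a small constant (fits in 12 bits)
a5 = a5 << 2   # shift left twice: 1000 * 4 = 4000
a5 = a5 + 5    # add another small constant
print(fits_in_imm12(4005), fits_in_imm12(1000), a5)
```

The value 4005 does not fit in the 12-bit immediate field, but each constant used in the three steps (1000, 2, and 5) does, which is why the composition works.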
Instruction          Description
and rd, rs1, rs2     Performs the bitwise “and” operation on rs1 and rs2 and stores
                     the result on rd.
or rd, rs1, rs2      Performs the bitwise “or” operation on rs1 and rs2 and stores
                     the result on rd.
xor rd, rs1, rs2     Performs the bitwise “xor” operation on rs1 and rs2 and stores
                     the result on rd.
andi rd, rs1, imm    Performs the bitwise “and” operation on rs1 and imm and stores
                     the result on rd.
ori rd, rs1, imm     Performs the bitwise “or” operation on rs1 and imm and stores
                     the result on rd.
xori rd, rs1, imm    Performs the bitwise “xor” operation on rs1 and imm and stores
                     the result on rd.
The following code loads two immediate values into registers a1 and a2 and per-
forms an “and” operation. The result of this operation (0x0000AB00) is stored in
register a0.
Instruction          Description
sll rd, rs1, rs2     Performs a logical left shift on the value at rs1 and stores the
                     result on rd. The amount of left shifts is indicated by the
                     value on rs2.
srl rd, rs1, rs2     Performs a logical right shift on the value at rs1 and stores the
                     result on rd. The amount of right shifts is indicated by the
                     value on rs2.
sra rd, rs1, rs2     Performs an arithmetic right shift on the value at rs1 and stores
                     the result on rd. The amount of right shifts is indicated by the
                     value on rs2.
slli rd, rs1, imm    Performs a logical left shift on the value at rs1 and stores the
                     result on rd. The amount of left shifts is indicated by the
                     immediate value imm.
srli rd, rs1, imm    Performs a logical right shift on the value at rs1 and stores the
                     result on rd. The amount of right shifts is indicated by the
                     immediate value imm.
srai rd, rs1, imm    Performs an arithmetic right shift on the value at rs1 and stores
                     the result on rd. The amount of right shifts is indicated by the
                     immediate value imm.
The logical left shift instructions (sll or slli) perform a logical left shift on
a value that is stored on a register. The amount of shifts is indicated as an operand
to the instruction and can be either the value in a register or an immediate value.
The following code shows examples of logical left shift instructions. The first shift
instruction (slli) shifts the value in a2 twice to the left and stores the result in a0.
The second one (sll) performs a similar operation, but the amount of shifts to the
left is defined by the value in a3.
    li   a2, 24        # a2 <= 24
    slli a0, a2, 2     # a0 <= a2 << 2
    sll  a1, a2, a3    # a1 <= a2 << a3
The first two instructions on the previous code load the immediate value 24 into
register a2, shift its contents twice to the left, and store the result on a0. The
immediate value 24 is represented by the binary number
    00000000 00000000 00000000 00011000
The logical left shift operation shifts the bits to the left, discarding the leftmost bits
and adding zeros to the right. Hence, after shifting it twice to the left, the result will
be the binary number
    00000000 00000000 00000000 01100000
This binary number corresponds to the decimal value 96 and is equivalent to 24 × 4.
Shift operations are easier to implement in hardware than the multiply operation
and usually take less time and/or energy to be executed. As a consequence, whenever
possible, compilers try to generate these instructions to perform multiplications.
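The relation between left shifts and multiplication by powers of two can be tried out in Python. The sketch below models a 32-bit register with a mask (Python integers are unbounded); the function name sll mirrors the instruction but is otherwise just an illustration.

```python
MASK32 = 0xFFFFFFFF


def sll(value, amount):
    """Logical left shift of a 32-bit register value."""
    # RV32I uses only the five least significant bits of the shift amount.
    return (value << (amount & 0x1F)) & MASK32


print(sll(24, 2), 24 * 4)
print(format(sll(24, 2), "032b"))
```

Shifting 24 twice to the left gives the same value as multiplying it by four. The equivalence holds as long as no significant bits are shifted out of the 32-bit register.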
The logical right shift instructions (srl or srli) perform a logical right shift
on a value that is stored on a register. Similar to the logical left shift instructions,
the amount of shifts is indicated as an operand to the instruction and can be either
the value in a register or an immediate value. The following code shows examples of
logical right shift instructions. The first shift instruction (srli) shifts the value in
a5 twice to the right and stores the result in a0. The second one (srl) performs a
similar operation, but the amount of shifts to the right is defined by the value in a7.
    li   a5, 24        # a5 <= 24
    srli a0, a5, 2     # a0 <= a5 >> 2
    srl  a1, a5, a7    # a1 <= a5 >> a7
The first two instructions on the previous code load the immediate value 24 into
register a5, shift its contents twice to the right, and store the result on a0. The
immediate value 24 is represented by the binary number
    00000000 00000000 00000000 00011000
The logical right shift operation shifts the bits to the right, discarding the rightmost
bits and adding zeros to the left. Hence, after shifting it twice to the right, the result
will be the binary number
    00000000 00000000 00000000 00000110
This binary number corresponds to the decimal value 6 and is equivalent to 24/4. In
fact, logical right shift operations may be used to integer divide numbers by powers
of two.
In the previous example, we verified that performing a logical right shift operation
twice on the value 24 resulted in the value 6, i.e., 24/4. Nonetheless, this is not valid
for negative numbers. Let us take the value −24 as an example. In RISC-V, which
represents signed numbers in two's complement, this value is represented by the
following binary number
    11111111 11111111 11111111 11101000
The logical right shift operation shifts the bits to the right, discarding the rightmost
bits and adding zeros to the left. Hence, after shifting it twice to the right, the result
will be the binary number
    00111111 11111111 11111111 11111010
Notice, however, that this number is not negative and does not correspond to the
division of −24 by four. In fact, it is a very large positive number (1 073 741 818).
In case we assume the unsigned representation, the previous binary number rep-
resents the value 4 294 967 272, instead of −24. In this case, the result of applying the
logical right shift operation twice would result in the division of this number by four,
i.e., 1 073 741 818.
In summary, the logical right shift operation may only be used to divide
unsigned numbers. In this case, shifting an unsigned number N times to the
right with a logical right shift operation is equivalent to integer dividing
the unsigned number by 2^N.
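The rule above can be checked with a small C model. This is only a sketch: the names srl_model and srl_demo are hypothetical, and a uint32_t variable stands in for a 32-bit register. As in RV32I, only the five least significant bits of the shift amount are used.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of srl/srli on a 32-bit register: zeros enter on the left.
   Only the 5 least significant bits of the shift amount are used. */
uint32_t srl_model(uint32_t value, uint32_t amount)
{
    return value >> (amount & 31u);
}

/* Checks the two cases discussed in the text. */
void srl_demo(void)
{
    /* Shifting 24 twice to the right is the same as 24 / 4. */
    assert(srl_model(24u, 2u) == 24u / 4u);
    /* 0xFFFFFFE8 is the bit pattern of -24; read as unsigned it is 4294967272. */
    assert(srl_model(0xFFFFFFE8u, 2u) == 1073741818u);
    assert(4294967272u / 4u == 1073741818u);
}
```

Calling srl_demo() passes silently, which confirms that the shift matches the unsigned division in both cases, while the result for the −24 bit pattern is not −6.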
The arithmetic right shift instructions (sra or srai) perform an arithmetic
right shift on a value that is stored on a register. Similar to the logical right shift
instructions, the number of shifts to the right is indicated as an operand on the
instruction and can be either the value in a register or an immediate value. The following
code shows examples of arithmetic right shift instructions. The first arithmetic right
shift instruction (srai) shifts the value in a5 twice to the right and stores the result
in a0. The second one (sra) performs a similar operation, but the number of shifts
to the right is defined by the value in a7.
li a5, -24 # a5 <= -24
srai a0, a5, 2 # a0 <= a5 >> 2
sra a1, a5, a7 # a1 <= a5 >> a7
The first two instructions in the previous code load the immediate value −24
into register a5, shift its contents twice to the right, and store the result in a0. As
discussed before, the immediate value −24 is represented by the binary number
11111111 11111111 11111111 11101000
The arithmetic right shift operation shifts the bits to the right, discarding the
rightmost bits. Instead of simply adding zeros to the left, it replicates the leftmost bit, i.e.,
if the leftmost bit is equal to 1, then it inserts ones on the left. In case the leftmost
bit is equal to 0, then it inserts zeros on the left. As a result, after performing an
arithmetic right shift twice on the previous value, the result will be the binary number
11111111 11111111 11111111 11111010
This binary number corresponds to the decimal value −6 and is equivalent to −24/4.
In fact, arithmetic right shift operations may be used to integer divide signed numbers
by powers of two. Notice that this instruction can also be used to integer divide
positive signed numbers by powers of two. It works because the leftmost bit of
positive signed numbers is zero. Hence, the arithmetic right shift operation will push
zeros to the left when shifting the value.
In summary, the arithmetic right shift operation may only be used to
integer divide signed numbers. In this case, shifting a signed number N
times to the right with an arithmetic right shift operation is equivalent to
integer dividing the signed number by 2^N.
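A small C model makes the sign replication explicit. This is a sketch under stated assumptions: sra_model and sra_demo are hypothetical names, and the model is written with unsigned arithmetic so that it does not depend on how a particular C compiler shifts negative values. One caveat worth keeping in mind: when the negative value is not a multiple of 2^N, the shift rounds toward minus infinity, whereas the C division operator truncates toward zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of sra/srai: the sign bit is replicated on the left. */
int32_t sra_model(int32_t value, uint32_t amount)
{
    uint32_t n = amount & 31u;             /* only 5 bits of the amount are used */
    uint32_t bits = (uint32_t)value >> n;  /* logical shift first */
    if (value < 0) {
        bits |= ~(0xFFFFFFFFu >> n);       /* fill the n leftmost bits with ones */
    }
    /* Reinterpret the 32-bit pattern as a two's complement value. */
    return (bits & 0x80000000u) ? -(int32_t)(~bits) - 1 : (int32_t)bits;
}

void sra_demo(void)
{
    assert(sra_model(-24, 2u) == -6);   /* the example from the text */
    assert(sra_model(24, 2u) == 6);     /* positive values: zeros enter on the left */
    assert(sra_model(-25, 2u) == -7);   /* the shift rounds toward minus infinity */
    assert(-25 / 4 == -6);              /* C division truncates toward zero */
}
```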
Instruction           Description
add rd, rs1, rs2      Adds the values in rs1 and rs2 and stores the result on rd.
sub rd, rs1, rs2      Subtracts the value in rs2 from the value in rs1 and stores the
                      result on rd.
addi rd, rs1, imm     Adds the value in rs1 to the immediate value imm and stores the
                      result on rd.
mul rd, rs1, rs2      Multiplies the values in rs1 and rs2 and stores the result on rd.
div{u} rd, rs1, rs2   Divides the value in rs1 by the value in rs2 and stores the result
                      on rd. The U suffix is optional and must be used to indicate that
                      the values in rs1 and rs2 are unsigned.
rem{u} rd, rs1, rs2   Calculates the remainder of the division of the value in rs1 by the
                      value in rs2 and stores the result on rd. The U suffix is optional
                      and must be used to indicate that the values in rs1 and rs2 are
                      unsigned.
The add instructions (add and addi) add two numbers and store the result
on a register (rd). In both cases, the first number is retrieved from the register rs1.
Instruction add retrieves the second number from register rs2 while instruction addi
uses the immediate value imm.
The subtract instruction (sub) subtracts the value in rs2 from the value in
rs1 and stores the result on rd. The RV32I does not contain a subi instruction,
i.e., an instruction that subtracts an immediate value from the contents of a register
and stores the result on another register. Nonetheless, it is worth noticing that a
programmer can easily achieve this effect by adding a negative immediate value using
the addi instruction. The following code shows an example of an instruction that
subtracts the immediate value 10 from the contents of register a2 and stores the
result on a0 using an addi instruction.
addi a0, a2, -10 # a0 <= a2 - 10
The multiply instruction (mul) multiplies the values in rs1 and rs2 and stores
the result on rd.
The divide instructions (div and divu) divide the value in rs1 by the value in
rs2 and store the result on rd. Instruction div divides signed numbers while divu
divides unsigned numbers.
The remainder instructions (rem and remu) compute the remainder of the
division of the value in rs1 by the value in rs2 and store the result on rd. Instruction
rem computes the remainder for divisions of signed numbers while remu computes the
remainder for divisions of unsigned numbers.
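The difference between the signed and unsigned variants can be sketched in C, where the type of the operands selects the interpretation of the same 32-bit pattern. The function names below are hypothetical. The models rely on the fact that both the RISC-V signed division and the C division operator round toward zero; they do not cover division by zero or the division of the most negative value by −1, which are undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical models; callers must keep b != 0 and avoid INT32_MIN / -1. */
int32_t  div_model (int32_t a,  int32_t b)  { return a / b; }
uint32_t divu_model(uint32_t a, uint32_t b) { return a / b; }
int32_t  rem_model (int32_t a,  int32_t b)  { return a % b; }
uint32_t remu_model(uint32_t a, uint32_t b) { return a % b; }

void div_demo(void)
{
    /* 0xFFFFFFE8 is -24 when signed and 4294967272 when unsigned. */
    assert(div_model(-24, 4) == -6);
    assert(divu_model(0xFFFFFFE8u, 4u) == 1073741818u);
    /* 0xFFFFFFE7 is -25 when signed and 4294967271 when unsigned. */
    assert(rem_model(-25, 4) == -1);
    assert(remu_model(0xFFFFFFE7u, 4u) == 3u);
}
```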
The following assembly code shows examples of RV32IM arithmetic instructions:
Did you know? In case you are programming for an RV32I that lacks the M
extension, i.e., it does not contain the multiply and divide instructions, you may
be able to combine arithmetic and shift instructions to perform multiplications and
divisions. The following assembly code shows an example of how the slli and add
instructions may be used to multiply the value of a2 by 5 and by 10:
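The arithmetic behind this trick can be sketched in C. The sketch assumes the usual decomposition, 5x = 4x + x and 10x = 2(5x), so that each multiplication needs only shifts and one addition; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* x * 5 == (x << 2) + x: one shift left by 2 and one addition. */
uint32_t times5(uint32_t x)
{
    return (x << 2) + x;
}

/* x * 10 == (x * 5) << 1: one extra shift left by 1. */
uint32_t times10(uint32_t x)
{
    return times5(x) << 1;
}

void times_demo(void)
{
    for (uint32_t x = 0; x < 1000u; x++) {
        assert(times5(x) == x * 5u);
        assert(times10(x) == x * 10u);
    }
}
```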
Instruction        Description
lw rd, imm(rs1)    Loads a 32-bit signed or unsigned word from memory into register rd.
                   The memory address is calculated by adding the immediate value imm
                   to the value in rs1.
lh rd, imm(rs1)    Loads a 16-bit signed halfword from memory into register rd. The
                   memory address is calculated by adding the immediate value imm to
                   the value in rs1.
lhu rd, imm(rs1)   Loads a 16-bit unsigned halfword from memory into register rd. The
                   memory address is calculated by adding the immediate value imm to
                   the value in rs1.
lb rd, imm(rs1)    Loads an 8-bit signed byte from memory into register rd. The memory
                   address is calculated by adding the immediate value imm to the value
                   in rs1.
lbu rd, imm(rs1)   Loads an 8-bit unsigned byte from memory into register rd. The
                   memory address is calculated by adding the immediate value imm to
                   the value in rs1.
sw rs1, imm(rs2)   Stores the 32-bit value at register rs1 into memory. The memory
                   address is calculated by adding the immediate value imm to the value
                   in rs2.
sh rs1, imm(rs2)   Stores the 16 least significant bits from register rs1 into memory.
                   The memory address is calculated by adding the immediate value imm
                   to the value in rs2.
sb rs1, imm(rs2)   Stores the 8 least significant bits from register rs1 into memory.
                   The memory address is calculated by adding the immediate value imm
                   to the value in rs2.
Instruction               Description
mv rd, rs                 Copies the value from register rs into register rd.
li rd, imm                Loads the immediate value imm into register rd.
la rd, symbol             Loads the address of label symbol into register rd.
L{W|H|HU|B|BU} rd, lab    For each one of the lw, lh, lhu, lb, and lbu machine
                          instructions there is a pseudo-instruction that performs the
                          same operation, but the memory address is calculated based on
                          a label (lab).
S{W|H|B} rs, lab, rt      For each one of the sw, sh, and sb machine instructions there
                          is a pseudo-instruction that performs the same operation, but
                          the memory address is calculated based on a label (lab).
                          Register rt is used as a temporary to compute the address.
The load word instruction (lw) loads a 32-bit word from memory into a register.
Since a word has four bytes, this instruction loads four bytes from four
consecutive memory positions (starting at the calculated address) and stores these
four bytes into the target register. The RV32I follows the little-endian endianness
format, hence, the byte loaded from the memory position associated with the smallest
address is loaded into the register’s least significant byte. Figure 6.2 illustrates a value
being loaded from memory into register a0. In this example, the data (a four-byte
word) is being loaded from four consecutive memory locations, starting at address
8000. The start address is calculated by adding the immediate value (0) to the
contents of register a2 (8000₁₀).
Figure 6.2: Example of a word (0x0A0E0108) being loaded by a load word instruction. [Diagram: memory addresses 8000 to 8003 hold the bytes 08 01 0E 0A; register a2 holds 0x00001F40 (8000₁₀); after lw a0, 0(a2), register a0 holds 0x0A0E0108.]
The load word instruction is used to load int, unsigned int, long, unsigned
long, and pointers from memory (footnote 9).
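The little-endian byte order described above can be modeled in a few lines of C. This is a sketch, not the way a processor is implemented: lw_model and lw_demo are hypothetical names, and an array of bytes stands in for the main memory starting at the calculated address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of lw: mem points to the byte at the calculated address.
   The byte at the smallest address becomes the least significant byte. */
uint32_t lw_model(const uint8_t *mem)
{
    return (uint32_t)mem[0]
         | ((uint32_t)mem[1] << 8)
         | ((uint32_t)mem[2] << 16)
         | ((uint32_t)mem[3] << 24);
}

void lw_demo(void)
{
    /* Bytes at addresses 8000..8003, as in Figure 6.2. */
    const uint8_t memory[4] = { 0x08, 0x01, 0x0E, 0x0A };
    assert(lw_model(memory) == 0x0A0E0108u);
}
```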
The load unsigned byte instruction (lbu) loads an 8-bit unsigned byte from
memory into a register. Since registers have 32 bits, or four bytes, the unsigned
byte loaded from memory is stored on the least significant register byte and the
other three register bytes are set to zero. Figure 6.3 illustrates an unsigned byte
value being loaded from memory into register a0. In this example, the data (an
unsigned byte) is being loaded from the memory location associated with address
8000, which is calculated by adding the immediate value (0) to the contents of register
a2 (8000₁₀).
Figure 6.3: Example of unsigned byte (0x08) being loaded by a load unsigned byte instruction. [Diagram: memory addresses 8000 to 8003 hold the bytes 08 01 0E 0A; register a2 holds 0x00001F40 (8000₁₀); after the load, register a0 holds 0x00000008.]
The load unsigned byte instruction is used to load unsigned char datatype values
from memory.
9 Section 6.1 discusses the mappings from C datatypes to the RV32I ISA native datatypes.
The load byte instruction (lb) loads an 8-bit signed byte from memory into a
register. Again, since registers have 32 bits, the signed byte loaded from memory
is stored on the least significant register byte. In case the value is non-negative, the
other three register bytes are set to zero. In case it is negative, the bits of the other
three register bytes are set to one. Figure 6.4 illustrates a non-negative signed byte
value (0x08 = 8₁₀) being loaded from memory into register a0. In this example,
the data, a non-negative signed byte, is being loaded from the memory location
associated with address 8000, which is calculated by adding the immediate value (0)
to the contents of register a2 (8000₁₀). Notice that the three most significant register
bytes are set to zero.
Figure 6.4: Example of a non-negative signed byte value (0x08 = 8₁₀) being loaded by the load byte instruction. [Diagram: memory addresses 8000 to 8003 hold the bytes 08 01 0E 0A; register a2 holds 0x00001F40 (8000₁₀); after lb a0, 0(a2), register a0 holds 0x00000008.]
Figure 6.5 illustrates a negative signed byte value (0xFE = −2₁₀) being loaded
from memory into register a0. Again, the data is being loaded from the memory
location associated with address 8000, which is calculated by adding the immediate
value (0) to the contents of register a2 (8000₁₀). Notice, however, that the bits of the
three most significant register bytes are set to ones and the final value is properly set
to 0xFFFFFFFE, i.e., −2₁₀.
Figure 6.5: Example of a negative signed byte value (0xFE = −2₁₀) being loaded by the load byte instruction. [Diagram: memory addresses 8000 to 8003 hold the bytes FE 01 0E 0A; register a2 holds 0x00001F40 (8000₁₀); after lb a0, 0(a2), register a0 holds 0xFFFFFFFE.]
The load byte instruction is used to load char C datatype values from memory.
The load unsigned halfword instruction (lhu) loads a 16-bit unsigned halfword
from memory into a register. Since an unsigned halfword datatype has two bytes,
this instruction loads two bytes from two consecutive memory positions (starting at
the calculated address) and stores these two bytes into the target register. Again,
since the RV32I follows the little-endian endianness format, the byte loaded from
the memory position associated with the smallest address is loaded into the
register’s least significant byte and the second byte is loaded into the register’s second-least
significant byte. The other two register bytes are set to zero. Figure 6.6 illustrates
an unsigned halfword value (0x0108) being loaded from memory into register a0.
In this example, the data, an unsigned halfword, is being loaded from the
memory locations starting at address 8000, which is calculated by adding the immediate
value (0) to the contents of register a2 (8000₁₀). Notice that the two most-significant
register bytes are set to zero.
The load unsigned halfword instruction is used to load unsigned short C datatype
values from memory.
Figure 6.6: Example of an unsigned halfword value (0x0108) being loaded by the load unsigned halfword instruction
The load halfword instruction (lh) loads a 16-bit signed halfword from memory
into a register. Since a halfword datatype has two bytes, this instruction loads two
bytes from two consecutive memory positions (starting at the calculated address) and
stores these two bytes into the target register. Again, since the RV32I follows the
little-endian endianness format, the byte loaded from the memory position associated
with the smallest address is loaded into the register’s least significant byte and the
second byte is loaded into the register’s second-least significant byte. In case the value
is non-negative, the bits of the other two bytes of the register are set to zeros, and
in case it is negative, they are set to ones. Figure 6.7 illustrates a non-negative
halfword value (0x0102 = 258₁₀) being loaded from memory into register a0. In
this example, the data, a non-negative halfword, is being loaded from the memory
locations starting at address 8000, which is calculated by adding the immediate
value (0) to the contents of register a2 (8000₁₀). Notice that the two most significant
register bytes are properly set to zero and the final result is 0x00000102, i.e., 258₁₀.
Figure 6.7: Example of a non-negative signed halfword value (0x0102 = 258₁₀) being loaded by the load halfword instruction. [Diagram: memory addresses 8000 to 8003 hold the bytes 02 01 0E 0A; register a2 holds 0x00001F40 (8000₁₀); after lh a0, 0(a2), register a0 holds 0x00000102.]
Figure 6.8 illustrates a negative halfword value (0xFFFE = −2₁₀) being loaded
from memory into register a0. Again, the data is being loaded from the memory
locations starting at address 8000, which is calculated by adding the immediate value
(0) to the contents of register a2 (8000₁₀). Notice, however, that the bits of the two
most significant register bytes are set to ones and the final value is properly set to
0xFFFFFFFE, i.e., −2₁₀.
Figure 6.8: Example of a negative signed halfword value (0xFFFE = −2₁₀) being loaded by the load halfword instruction. [Diagram: memory addresses 8000 to 8003 hold the bytes FE FF 0E 0A; register a2 holds 0x00001F40 (8000₁₀); after lh a0, 0(a2), register a0 holds 0xFFFFFFFE.]
The load halfword instruction is used to load short C datatype values from mem-
ory.
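The difference between the signed and unsigned byte and halfword loads is only in how the upper register bits are filled. The following C sketch models the four instructions; the function names are hypothetical, the return value is the 32-bit pattern left in the target register, and an array of bytes stands in for the memory at the calculated address.

```c
#include <assert.h>
#include <stdint.h>

/* lbu: one byte, upper 24 bits set to zero. */
uint32_t lbu_model(const uint8_t *mem) { return mem[0]; }

/* lb: one byte, bit 7 replicated into bits 8..31. */
uint32_t lb_model(const uint8_t *mem)
{
    uint32_t v = mem[0];
    return (v & 0x80u) ? (v | 0xFFFFFF00u) : v;
}

/* lhu: two bytes in little-endian order, upper 16 bits set to zero. */
uint32_t lhu_model(const uint8_t *mem)
{
    return (uint32_t)mem[0] | ((uint32_t)mem[1] << 8);
}

/* lh: the same two bytes, bit 15 replicated into bits 16..31. */
uint32_t lh_model(const uint8_t *mem)
{
    uint32_t v = lhu_model(mem);
    return (v & 0x8000u) ? (v | 0xFFFF0000u) : v;
}

void load_demo(void)
{
    const uint8_t positive[2] = { 0x02, 0x01 };  /* halfword 0x0102 */
    const uint8_t negative[2] = { 0xFE, 0xFF };  /* halfword 0xFFFE */
    assert(lb_model(positive)  == 0x00000002u);
    assert(lb_model(negative)  == 0xFFFFFFFEu);  /* -2 as a signed byte */
    assert(lbu_model(negative) == 0x000000FEu);  /* 254 as an unsigned byte */
    assert(lh_model(positive)  == 0x00000102u);
    assert(lh_model(negative)  == 0xFFFFFFFEu);  /* -2 as a signed halfword */
    assert(lhu_model(negative) == 0x0000FFFEu);  /* 65534 as an unsigned halfword */
}
```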
Figure 6.9: Example of a word value (0x0A0E0108) being stored by the store word instruction. [Diagram: register a0 holds 0x0A0E0108; register a2 holds 0x00001F40 (8000₁₀); after sw a0, 0(a2), memory addresses 8000 to 8003 hold the bytes 08 01 0E 0A.]
The store word instruction is used to store int, unsigned int, long, unsigned
long, and pointers into memory.
The store halfword instruction (sh) stores the least significant 16-bit halfword
from rs1 into memory. Since a halfword datatype has two bytes, this instruction
stores the two least significant bytes from register rs1 into two consecutive memory
positions (starting at the calculated address). The RV32I follows the little-endian
endianness format, hence, the least significant byte is stored on the memory position
associated with the smallest address and so on. Figure 6.10 illustrates a halfword
value from a0 being stored into the main memory by a sh instruction. In this example,
the data (a two-byte halfword) is being stored on two consecutive memory locations,
starting at address 8000. The start address is calculated by adding the immediate
value (0) to the contents of register a2 (8000₁₀).
Figure 6.10: Example of a halfword value (0x0108) being stored by the store halfword instruction. [Diagram: register a0 holds 0x0A0E0108; register a2 holds 0x00001F40 (8000₁₀); after sh a0, 0(a2), memory addresses 8000 and 8001 hold the bytes 08 01.]
The store halfword instruction is used to store short and unsigned short C
datatype values into memory.
The store byte instruction (sb) stores the least significant 8-bit byte from rs1
into memory. Since a byte datatype has only one byte, this instruction stores the
least significant byte from register rs1 into a single memory position, indicated by
the calculated address. Figure 6.11 illustrates a byte value from a0 being stored into
the main memory by a sb instruction. In this example, the data (a single byte) is
being stored on a single memory location, at address 8000. The memory address is
calculated by adding the immediate value (0) to the contents of register a2 (8000₁₀).
Figure 6.11: Example of a byte value (0x08) being stored by the store byte instruction. [Diagram: register a0 holds 0x0A0E0108; register a2 holds 0x00001F40 (8000₁₀); after sb a0, 0(a2), memory address 8000 holds the byte 08.]
The store byte instruction is used to store char and unsigned char C datatype
values into memory.
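The three store instructions differ only in how many of the register's least significant bytes reach memory. The following C sketch models them with hypothetical function names, using an array of bytes in place of the memory at the calculated address.

```c
#include <assert.h>
#include <stdint.h>

/* sb: stores the least significant byte of the register. */
void sb_model(uint8_t *mem, uint32_t reg)
{
    mem[0] = (uint8_t)(reg & 0xFFu);
}

/* sh: stores the two least significant bytes, smallest address first. */
void sh_model(uint8_t *mem, uint32_t reg)
{
    sb_model(mem, reg);
    sb_model(mem + 1, reg >> 8);
}

/* sw: stores all four bytes, smallest address first. */
void sw_model(uint8_t *mem, uint32_t reg)
{
    sh_model(mem, reg);
    sh_model(mem + 2, reg >> 16);
}

void store_demo(void)
{
    uint8_t memory[4] = { 0, 0, 0, 0 };
    const uint32_t a0 = 0x0A0E0108u;   /* register contents used in the figures */

    sb_model(memory, a0);
    assert(memory[0] == 0x08 && memory[1] == 0x00);

    sh_model(memory, a0);
    assert(memory[0] == 0x08 && memory[1] == 0x01 && memory[2] == 0x00);

    sw_model(memory, a0);
    assert(memory[0] == 0x08 && memory[1] == 0x01);
    assert(memory[2] == 0x0E && memory[3] == 0x0A);
}
```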
mv rd, rs
where rd indicates the target register and rs indicates the source register.
The load immediate instruction (li) is a pseudo-instruction that loads an
immediate value into a register. As discussed in Section 6.5.2, depending on the
immediate value, the assembler may convert this pseudo-instruction into a single or
multiple machine instructions. The assembly syntax for this instruction is
li rd, imm
where rd indicates the target register and imm the immediate value to be loaded into
the target register.
The load address instruction (la) is a pseudo-instruction that loads a 32-bit
address, indicated by a label, into a register. The assembly syntax for this instruction
is
la rd, symbol
where rd indicates the target register and symbol the name of the label.
The load global instructions are a set of pseudo-instructions to facilitate the
load of values from memory positions that are identified by labels. The assembly
syntax for these instructions is
l{w|h|hu|b|bu} rd, symbol
where rd indicates the target register and symbol the name of the label. As an
example, instruction
lh a0, var_x
loads a halfword datatype value from the memory positions starting at the address
associated with label “var_x”. The loaded value is stored into register a0. Since a
label represents a 32-bit address and may not be encoded as an immediate value on a
load instruction (footnote 10), the assembler may generate multiple RV32I machine
instructions to perform the load operation. In this case, it first generates instructions
to load the address of the label into rd (footnote 11). Then, it generates a machine
load instruction to load the value from memory into rd.
10 The immediate field (imm) of load and store instructions are limited to values that can be repre-
The store global instructions are a set of pseudo-instructions to facilitate the
store of values into memory positions that are identified by labels. The assembly
syntax for these instructions is
s{w|h|b} rs, symbol, rt
where rs indicates the source register, symbol indicates the name of the label, and
rt indicates a temporary register to support the computation of the address. As an
example, instruction
sw a0, var_x, a5
stores the word value from register a0 into the memory positions starting at the
address associated with label “var_x”. Similar to the load global instructions, the
label may not be encoded as an immediate value, hence, the assembler may generate
multiple RV32I machine instructions so that it can load the label address into a
register before executing a machine store instruction. In this case, however, the
assembler cannot use rs as a temporary register, since it would destroy the
contents of the register before storing it on the memory. For this reason, the user is
required to explicitly indicate a general-purpose register so that the assembler may
use it as a temporary when generating the code for the pseudo-instruction.
NOTE: The verb “to jump” is commonly used to indicate that a control-flow
instruction changed the normal execution flow.
11 The assembler uses rd as a temporary, since its previous value will be discarded anyway after
NOTE: The target address is the address of the next instruction that must
be fetched in case the instruction jumps.
In the following code, the beq instruction jumps to label L (which indicates the
target address) if the contents of a0 is equal to the contents of a1. In this case, the
next instruction to be executed will be the sub instruction.
If the contents of registers a0 and a1 differ from each other, then the beq instruction
does not jump and the execution flow continues normally, i.e., the next instruction
on memory (add) is executed.
There are several conditional control-flow instructions in the RV32I . Table 6.10
shows the RV32I conditional control-flow instructions and pseudo-instructions.
Instruction           Description
beq rs1, rs2, lab     Jumps to label lab if the value in rs1 is equal to the value in rs2.
bne rs1, rs2, lab     Jumps to label lab if the value in rs1 is different from the value
                      in rs2.
beqz rs1, lab         Jumps to label lab if the value in rs1 is equal to zero
                      (pseudo-instruction).
bnez rs1, lab         Jumps to label lab if the value in rs1 is not equal to zero
                      (pseudo-instruction).
blt rs1, rs2, lab     Jumps to label lab if the signed value in rs1 is smaller than the
                      signed value in rs2.
bltu rs1, rs2, lab    Jumps to label lab if the unsigned value in rs1 is smaller than the
                      unsigned value in rs2.
bge rs1, rs2, lab     Jumps to label lab if the signed value in rs1 is greater or equal
                      to the signed value in rs2.
bgeu rs1, rs2, lab    Jumps to label lab if the unsigned value in rs1 is greater or equal
                      to the unsigned value in rs2.
The blt rs1, rs2, lab instruction jumps to label lab if the value stored on the register
indicated by rs1 is less than the value stored on the register indicated by rs2. In this case,
the processor assumes that the values in rs1 and rs2 are signed values (represented
in two’s complement), hence, 0xFFFFFFFF (−1 as a signed value) is considered less
than 0x00000000 by this instruction. To compare unsigned values, one must use the
bltu rs1, rs2, lab instruction. In this case, the processor assumes that the values in
rs1 and rs2 are unsigned binary values, hence, 0xFFFFFFFF (4 294 967 295 as an
unsigned value) is considered greater than 0x00000000 by this instruction.
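The two comparisons can be sketched in C on the same pair of 32-bit patterns. The function names are hypothetical. The signed comparison is written with a sign-bit flip, a standard trick that maps the two's complement order onto the unsigned order, so the sketch does not rely on converting out-of-range values to a signed type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* bltu: the jump is taken if rs1 < rs2 when both are read as unsigned. */
bool bltu_taken(uint32_t rs1, uint32_t rs2)
{
    return rs1 < rs2;
}

/* blt: the jump is taken if rs1 < rs2 when both are read as two's
   complement. Flipping the sign bit of both operands turns the signed
   order into the unsigned order. */
bool blt_taken(uint32_t rs1, uint32_t rs2)
{
    return (rs1 ^ 0x80000000u) < (rs2 ^ 0x80000000u);
}

void branch_demo(void)
{
    /* Signed view: 0xFFFFFFFF is -1, which is less than 0. */
    assert(blt_taken(0xFFFFFFFFu, 0x00000000u));
    /* Unsigned view: 0xFFFFFFFF is 4294967295, which is greater than 0. */
    assert(!bltu_taken(0xFFFFFFFFu, 0x00000000u));
    assert(bltu_taken(0x00000000u, 0xFFFFFFFFu));
}
```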
The following assembly code shows examples of RV32I conditional control-flow
instructions:
NOTE: In the RISC-V ISA manual [4], Waterman and Asanović use the
term conditional branch to refer to conditional control-flow instructions. No-
tice that the term “branch” suggests that the execution may diverge to two
different paths, which is the case for RISC-V conditional control-flow instruc-
tions. The authors use the term unconditional jumps to refer to unconditional
control-flow instructions.
opcode and other parameters (such as registers) inside a single 32-bit long
instruction. To overcome this limitation, direct target addresses are encoded as an offset
that is added to the program counter (PC, see footnote 15) when the instruction is
executed. In RV32I, the target address of conditional control-flow instructions is encoded
as a 12-bit signed offset. Using a 12-bit offset to encode the target prevents the
instruction from jumping to addresses that are too far away from the address of the
instruction itself. In these cases, the programmer loads the target address into a register
and then uses an indirect jump instruction to jump to the target address.
6 FOO:
7 sub a0, a1, a2
8 # ...
NOTE: In the RISC-V ISA manual [4], Waterman and Asanović use the
term “unconditional jumps” to refer to unconditional control-flow instructions.
There are several unconditional control-flow instructions in the RV32I . Table 6.11
shows the RV32I unconditional control-flow instructions and pseudo-instructions.
Instruction          Description
j lab                Jumps to the address indicated by label lab (pseudo-instruction).
jr rs1               Jumps to the address stored on register rs1 (pseudo-instruction).
jal lab              Stores the return address (PC+4) on the return register (ra), then
                     jumps to label lab (pseudo-instruction).
jal rd, lab          Stores the return address (PC+4) on register rd, then jumps to
                     label lab.
jalr rd, rs1, imm    Stores the return address (PC+4) on register rd, then jumps to the
                     address calculated by adding the immediate value imm to the value
                     on register rs1.
ret                  Jumps to the address stored on the return register (ra)
                     (pseudo-instruction).
Instruction “j lab” jumps to label lab while instruction “jr rs1” jumps to the
address stored on register rs1.
The jal instruction, also known as jump and link, is particularly useful when invoking routines. The following
code, annotated with addresses, illustrates this process: First, the jal instruction
(located at memory address 0x8000) is used to invoke (jump to) routine FOO (line 1).
When executed, this instruction will store the address of the subsequent instruction
(PC+4 = 0x8004) into register ra and jump to FOO. Then, once FOO is executed,
the routine returns by performing an indirect jump (jr) to the contents of register ra
(line 8). At this point, since ra contains the value 0x8004, the execution flows back
to the instruction sub (line 2). Next, another jump and link instruction is executed
to invoke routine FOO. Again, the jump and link instruction will store the address
of the subsequent instruction (PC+4 = 0x800C) into register ra and jump to FOO.
Notice, however, that at this time, the subsequent instruction is the mul instruction,
and the return address is 0x800C. Finally, once FOO is executed, the routine returns
by performing an indirect jump (jr) to the contents of register ra (line 8), which
in this case contains the value 0x800C and will cause the execution to flow back to
instruction mul.
The “jalr rd, rs1, imm” instruction also stores the address of the subsequent
instruction (i.e., PC+4) in register rd, however, in this case, the target address is
defined by adding the contents of rs1 to the immediate imm. Hence, it is an indirect
jump.
NOTE: The jump register (jr) and return (ret) instructions are
pseudo-instructions that are converted into jalr instructions by the assembler. The
jr rs1 is converted into jalr zero, rs1, 0 and ret is converted into jalr
zero, ra, 0. Notice that, in RV32I, any value stored in register zero is
discarded.
As an example, let us assume the user wants to invoke the Linux operating system
service routine “write” (also known as write syscall) to display some information on
the screen. The write syscall takes three parameters: the file descriptor, the address
of a buffer that contains the information that must be written, and the number of
bytes that must be written. The file descriptor is an integer value that identifies a
file (footnote 16) or a device. In this case, it indicates the file or device where the information
must be written to. In many Linux distributions, the file descriptor ‘1’ (one) is used to
represent the standard output, or stdout, which is usually the terminal screen. The
following code shows an example in which the contents of the msg buffer is written to
the file descriptor ‘1’, hence, the screen. First, the code sets the system call parameters
(lines 6 to 8), then it sets register a7 with a number that indicates the service routine
that must be invoked, which, in this case, is the write syscall (line 9). Finally, it
invokes the operating system by executing the ecall instruction.
1 .data
2 msg: .asciz "Assembly rocks" # Allocates a string on memory
3
4 .text
5 start:
6 li a0, 1 # a0: File descriptor = 1 (stdout)
7 la a1, msg # a1: Message buffer address
8 li a2, 14 # a2: Message buffer size (14 bytes)
9 li a7, 64 # Syscall code (write = 64)
10 ecall # Invoke the syscall
NOTE: Each operating system may have a different set of service routines.
The focus of this book is not to discuss the set of service routines offered by
a specific operating system; nonetheless, to illustrate the concepts, whenever
necessary, it will use service routines available on the Linux operating system.
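For comparison, the same service can be requested from C. The sketch below assumes a Linux (POSIX) system, where the C library function write takes the same three parameters as the syscall: the file descriptor, the buffer address, and the number of bytes. The names print_message and print_demo are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Writes the message to file descriptor 1 (stdout). Returns the number of
   bytes written, or -1 on error. */
ssize_t print_message(void)
{
    static const char msg[] = "Assembly rocks";
    return write(1, msg, strlen(msg));   /* fd, buffer address, byte count */
}

void print_demo(void)
{
    /* The message has 14 characters, the same size used in the assembly code. */
    assert(strlen("Assembly rocks") == 14);
    assert(print_message() == 14);
}
```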
Instruction           Description
slt rd, rs1, rs2      Sets rd with 1 if the signed value in rs1 is less than the signed
                      value in rs2, otherwise, sets it with 0.
slti rd, rs1, imm     Sets rd with 1 if the signed value in rs1 is less than the
                      sign-extended immediate value imm, otherwise, sets it with 0.
sltu rd, rs1, rs2     Sets rd with 1 if the unsigned value in rs1 is less than the
                      unsigned value in rs2, otherwise, sets it with 0.
sltiu rd, rs1, imm    Sets rd with 1 if the unsigned value in rs1 is less than the
                      immediate value imm (sign-extended and then treated as an unsigned
                      value), otherwise, sets it with 0.
seqz rd, rs1          Sets rd with 1 if the value in rs1 is equal to zero, otherwise,
                      sets it with 0 (pseudo-instruction).
snez rd, rs1          Sets rd with 1 if the value in rs1 is not equal to zero,
                      otherwise, sets it with 0 (pseudo-instruction).
sltz rd, rs1          Sets rd with 1 if the signed value in rs1 is less than zero,
                      otherwise, sets it with 0 (pseudo-instruction).
sgtz rd, rs1          Sets rd with 1 if the signed value in rs1 is greater than zero,
                      otherwise, sets it with 0 (pseudo-instruction).
two unsigned integers. In this case, the bltu instruction jumps to label handle_ov in
case there is an overflow.
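The unsigned check can be sketched in C. The sketch assumes the usual idiom, in which the wrapped sum is compared with one of the operands; the function name and the registers mentioned in the comments are assumptions made for illustration, not a copy of the original listing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when a + b does not fit in 32 bits. The sum wraps around
   modulo 2^32, so it ends up smaller than a exactly when a carry was lost. */
bool add_overflows_u32(uint32_t a, uint32_t b, uint32_t *sum)
{
    *sum = a + b;       /* e.g., add  a0, a1, a2        (assumed registers) */
    return *sum < a;    /* e.g., bltu a0, a1, handle_ov (assumed registers) */
}

void overflow_demo(void)
{
    uint32_t sum;
    assert(!add_overflows_u32(1u, 2u, &sum) && sum == 3u);
    assert(!add_overflows_u32(0xFFFFFFFFu, 0u, &sum) && sum == 0xFFFFFFFFu);
    assert(add_overflows_u32(0xFFFFFFFFu, 1u, &sum) && sum == 0u);
}
```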
The following code shows how to detect overflow when adding two signed numbers.
17 This notation is used to indicate that two registers are being used to store a 64-bit value. The
least significant value is being stored by register a0 while the most significant value is being stored
by register a1.
if (x >= 10)
{
    y = x;
}
1 li t1, 10
2 blt a3, t1, skip # jumps to skip if x < 10
3 mv a4, a3 # y = x
4 skip:
In case there are multiple statements in the “then block”, they can be placed
between the branch less than instruction (line 2) and the skip label (line 4).
signed variable. If variable x were an unsigned integer (unsigned “C” type),
the programmer (or the compiler) would have to use instruction bltu to perform the
comparison.
if (x >= 10)
{
    y = y + 1;
}
else
{
    y = x;
}
1 li t3, 10
2 bltu a1, t3, else # jumps to else if x < 10
3 addi a2, a2, 1 # y = y + 1
4 j cont # skip the else block
5 else:
6 mv a2, a1 # y = x
7 cont:
In case there are multiple statements in the “then block”, they can be placed
between the instruction bltu (line 2) and the j cont instruction (line 4). In case
there are multiple statements in the “else block”, they can be placed between labels
else (line 5) and cont (line 7).
Assuming x and y are signed integer variables (int “C” type) mapped to registers
a1 and a2, respectively, the following code shows how the previous “C” code can
be implemented in assembly language. First, the code loads the constant ten into
temporary register t1 (line 1). Then, if the contents of register a1 (variable x) are
less than ten, it jumps to label skip to skip the execution of the “then block”. Notice
that, because of the and operator (represented by && in “C”), if the first part of the
boolean expression is false, then the whole expression is false, hence, there is no need
to check the second part (footnote 4). If the first part of the boolean expression (lines
1 and 2) has been evaluated to true, then the code must check the second part, which is
verified by instructions in lines 3 and 4. In this case, if the contents of variable y are
greater or equal to 20, then the code (line 4) skips the “then block” by jumping to the
skip label. Otherwise, it executes the next instruction (line 5), which corresponds to
the first instruction of the “then block”.
1 li t1, 10
2 blt a1, t1, skip # jumps to skip if x < 10
3 li t1, 20
4 bge a2, t1, skip # jumps to skip if y >= 20
5 mv a1, a2 # x = y
6 skip:
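The branch structure of this listing can be mirrored in C with goto statements, which makes it easy to check against the structured form. This is a sketch: the condition (x >= 10) && (y < 20) is inferred from the branches in the listing, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Structured form, with the condition inferred from the assembly listing. */
int32_t and_structured(int32_t x, int32_t y)
{
    if ((x >= 10) && (y < 20)) {
        x = y;
    }
    return x;
}

/* Branch form: one conditional jump per sub-expression, each testing the
   opposite condition and jumping past the "then block". */
int32_t and_branches(int32_t x, int32_t y)
{
    if (x < 10)  goto skip;   /* blt a1, t1, skip */
    if (y >= 20) goto skip;   /* bge a2, t1, skip */
    x = y;                    /* mv  a1, a2       */
skip:
    return x;
}

void and_demo(void)
{
    for (int32_t x = 0; x < 30; x++) {
        for (int32_t y = 0; y < 40; y++) {
            assert(and_structured(x, y) == and_branches(x, y));
        }
    }
}
```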
The following code shows an example in which the boolean expression contains an
or (||) operation.
if ((x>=10) || (y<20))
{
    x = y;
}
Assuming x and y are signed integer variables (int “C” type) mapped to registers
a1 and a2, respectively, the following code shows how the previous “C” code can
be implemented in assembly language. First, the code loads the constant ten into
temporary register t1 (line 1). Then, if the contents of register a1 (variable x) are
greater or equal to ten, it jumps to label then to execute the “then block”. Notice
that, because of the or operator (represented by || in “C”), if the first part of the
boolean expression evaluates to true, then the whole expression is true, hence, there
is no need to check the second part (footnote 5). If the first part of the boolean
expression (lines 1 and 2) evaluates to false, then the code must check the second part,
which is verified by instructions in lines 3 and 4. In this case, if the contents of
variable y are greater or equal to 20, then the code (line 4) skips the “then block” by
jumping to the skip label. Otherwise, it executes the next instruction (line 6), which
corresponds to the first instruction of the “then block”.
4 The semantics of the “C” programming language implies that the remaining of this boolean
1 li t1, 10
2 bge a1, t1, then # jumps to then if x >= 10
3 li t1, 20
4 bge a2, t1, skip # jumps to skip if y >= 20
5 then:
6 mv a1, a2 # x = y
7 skip:
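The control flow of the assembly code above can be cross-checked in “C” by writing one goto per branch instruction. The following sketch is illustrative only: the function names or_structured and or_lowered are hypothetical and do not come from the original text.

```c
/* Structured version: the "C" statement from the example. */
int or_structured(int x, int y)
{
    if ((x >= 10) || (y < 20)) {
        x = y;
    }
    return x;
}

/* Lowered version: one goto per branch instruction of the assembly code. */
int or_lowered(int x, int y)
{
    if (x >= 10) goto then_block;   /* bge a1, t1, then (t1 = 10) */
    if (y >= 20) goto skip;         /* bge a2, t1, skip (t1 = 20) */
then_block:
    x = y;                          /* mv a1, a2 */
skip:
    return x;
}
```

Both functions should return the same value for any pair of inputs. Notice that the second test is reached only when the first one fails, which is how the short-circuit behavior of the || operator shows up in the lowered code.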
1 if (x == 10)
2 {
3 x = 5;
4 if (y == 20)
5 {
6 x = 0;
7 }
8 }
Translating the previous code to assembly code can be easily done by starting with
the outer if-then statement. Assuming x and y are variables mapped to registers
a1 and a2, respectively, the following code shows the skeleton for the outer if-then
statement.
1 li t1, 10
2 bne a1, t1, skip # jumps to skip if x != 10
3 # <= Insert the code for the then block here
4 skip:
Once the skeleton code is generated, the next step is to generate the code for the
“then block”. The following code shows the final code.
1 li t1, 10
2 bne a1, t1, skip # jumps to skip if x != 10
3 li a1, 5 # x = 5
4 li t1, 20
5 bne a2, t1, skip_inner # jumps to skip_inner if y != 20
6 li a1, 0 # x = 0
7 skip_inner:
8 skip:
In the previous example, we used two different labels, one to skip the execution of
the “then block” of the outer if statement (skip) and another to skip the execution of
the “then block” of the inner if statement (skip_inner). In this case, since both labels
represent the same address (notice that there are no instructions or data between the
two labels), we could simplify the code by using a single skip label. Notice, however, that
this simplification does not affect the code generated by the assembler, since labels
are basically address annotations.
1 int i=0;
2 while (i < 20)
3 {
4 y = y+3;
5 i = i+1;
6 }
Assuming that variables i and y are mapped to registers a1 and a2, respectively,
the following code shows how the previous “C” code can be implemented in assembly
language. First, it contains the code that comes before the while loop - in this case
an instruction that loads the constant zero into register a1 (line 1). Then, there is
a label that defines the beginning of the loop (line 2) and the code that checks the
“loop condition” (lines 3 and 4). Notice that instruction bge checks whether variable
i (contents of register a1) is greater than or equal to 20. If so, it jumps to label skip to
leave the loop. Otherwise, the execution continues on the next instruction, which is
the first instruction of the “loop body”. The “loop body” in this example is composed
of two instructions (lines 5 and 6). The first one implements the statement y=y+3
and the second one implements the statement i=i+1. After concluding the execution
of the “loop body”, the code jumps back to the beginning of the loop (line 7) so that
the loop can be executed again, starting by verifying again the “loop condition”.
1 li a1, 0 # i=0
2 while:
3 li t1, 20 # if i>=20
4 bge a1, t1, skip # jump to skip to leave the loop
5 addi a2, a2, 3 # y = y+3
6 addi a1, a1, 1 # i = i+1
7 j while # loop back
8 skip:
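As a cross-check, the same loop can be written in “C” with goto statements that mirror the assembly instructions one by one. This is only a sketch; the function name while_lowered is hypothetical, and the parameter y plays the role of the initial contents of register a2.

```c
/* Returns the final value of y after the loop finishes. */
int while_lowered(int y)
{
    int i = 0;                      /* li a1, 0 */
loop:
    if (i >= 20) goto skip;         /* li t1, 20 ; bge a1, t1, skip */
    y = y + 3;                      /* addi a2, a2, 3 */
    i = i + 1;                      /* addi a1, a1, 1 */
    goto loop;                      /* j while */
skip:
    return y;
}
```

Since the loop body runs 20 times, the result is the initial value of y plus 60.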
the “loop body” (lines 3-5). The condition also consists of a boolean expression,
however, in the do-while statement, the condition must be evaluated after each
iteration of the do-while loop, i.e., after each execution of the “loop body”. If the
“loop condition” evaluates to true, then the “loop body” must be executed again. In
this case, after completing the execution of the “loop body”, the “loop condition” is
evaluated again. If the “loop condition” evaluates as false, then the loop execution is
complete and the execution continues after the loop.
1 int i=0;
2 do
3 {
4 y = y+2;
5 i = i+1;
6 } while (i < 10);
Assuming that variables i and y are mapped to registers a1 and a2, respectively,
the following code shows how the previous “C” code can be implemented in assembly
language. First, it contains the code that comes before the do-while loop - in this case an
instruction that loads the constant zero into register a1 (line 1). Then, there is a label
that marks the beginning of the loop (line 2) and the “loop body”, which is composed
of two instructions (lines 3 and 4). The first one implements the statement y=y+2
and the second one implements the statement i=i+1. After the “loop body”, there is
the code that checks the “loop condition” (lines 5 and 6). Notice that instruction blt
checks whether variable i (contents of register a1) is less than 10. If so, it jumps back
to label dowhile to repeat the loop execution. Otherwise, the execution continues on
the next instruction, leaving the loop.
1 li a1, 0 # i=0
2 dowhile:
3 addi a2, a2, 2 # y = y+2
4 addi a1, a1, 1 # i = i+1
5 li t1, 10
6 blt a1, t1, dowhile # jumps back to dowhile if i < 10
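The practical difference between the two kinds of loop is that the do-while body always runs at least once, because the condition is tested only after the body. The following sketch counts body executions for both forms; the function names and the parameters i and limit are hypothetical and were introduced only for this illustration.

```c
/* Counts body executions of: do { i = i+1; } while (i < limit); */
int do_while_iterations(int i, int limit)
{
    int count = 0;
dowhile:
    count = count + 1;              /* the body runs before any test */
    i = i + 1;
    if (i < limit) goto dowhile;    /* blt a1, t1, dowhile */
    return count;
}

/* Counts body executions of: while (i < limit) { i = i+1; } */
int while_iterations(int i, int limit)
{
    int count = 0;
loop:
    if (i >= limit) goto done;      /* the test comes first */
    count = count + 1;
    i = i + 1;
    goto loop;
done:
    return count;
}
```

Starting with i equal to limit, the do-while form executes its body once while the while form does not execute it at all; for the values used in the examples (i starting at zero) both forms execute the body the same number of times.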
Assuming that variables i and y are mapped to registers a1 and a2, respectively,
the following code shows how the previous “C” code can be implemented in assembly
language. First, it contains the “initialization code”, i.e., i=0 (line 1). Then, there
is a label that defines the beginning of the loop (line 2) and the code that checks the
“loop condition” (lines 3 and 4). Notice that instruction bge checks whether variable
i (contents of register a1) is greater than or equal to 10. If so, it jumps to label skip to
leave the loop. Otherwise, the execution continues on the next instruction, which is
the first instruction of the “loop body”. The “loop body” in this example is composed
of only one instruction (line 5), which implements the statement y=y+2. The “update
code”, i.e., i=i+1 (line 6), is placed right after the “loop body”. Finally, after the
“update code”, the code jumps back to the beginning of the loop (line 7) so that the
loop can be executed again, starting by verifying again the “loop condition”.
1 li a1, 0 # i=0
2 for:
3 li t1, 10 # if i >= 10 then jumps
4 bge a1, t1, skip # to skip to leave the loop
5 addi a2, a2, 2 # y = y+2
6 addi a1, a1, 1 # i = i+1
7 j for
8 skip:
In these cases, the instruction can be hoisted (moved) before the loop to improve
the code performance. This is an optimization called “loop-invariant code motion”
(LICM) commonly applied by compilers. The following code shows the code after
applying LICM to the previous code. Notice that, in this case, each loop repetition
executes only four instructions.
1 li a1, 0 # i=0
2 li t1, 10 # t1=10
3 for:
4 bge a1, t1, skip # if i >= 10 then jumps to skip to leave the loop
5 addi a2, a2, 2 # y = y+2
6 addi a1, a1, 1 # i = i+1
7 j for
8 skip:
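The gain from hoisting can be estimated by counting executed instructions. The sketch below models both versions of the loop and adds up the instructions each one executes; it is a counting model written for this illustration (the function names and the parameter n, which stands for the loop bound, are hypothetical), not a measurement on real hardware.

```c
/* Executed instructions when the constant is loaded inside the loop. */
int count_original(int n)
{
    int executed = 1;               /* li a1, 0 */
    int i = 0;
    for (;;) {
        executed += 2;              /* li t1, n ; bge a1, t1, skip */
        if (i >= n) break;
        executed += 3;              /* addi a2 ; addi a1 ; j for */
        i = i + 1;
    }
    return executed;
}

/* Executed instructions after hoisting the load of the constant. */
int count_hoisted(int n)
{
    int executed = 2;               /* li a1, 0 ; li t1, n */
    int i = 0;
    for (;;) {
        executed += 1;              /* bge a1, t1, skip */
        if (i >= n) break;
        executed += 3;              /* addi a2 ; addi a1 ; j for */
        i = i + 1;
    }
    return executed;
}
```

Under this model, a loop that repeats n times executes 5n+3 instructions before the transformation and 4n+3 after it, i.e., one instruction is saved per repetition.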
when the routine is invoked. The label usually defines the routine name6 and the
fragment of code contains the instructions that implement the routine. The following
code shows an example of a routine called update_x. This routine stores in variable
x the value from register a0 and then returns.
Invoking a routine is as simple as jumping to the label that defines its entry point.
However, before invoking (jumping to) the routine, it is important to save
the return address so that the routine can return to the calling site after
its execution.
As discussed in Section 6.7.3, RV32I contains a special jump instruction to facilitate
saving the return address when invoking a routine. This instruction, called
jump and link, or jal lab, stores the return address7 (PC+4) in the return address
register (ra) and then jumps to the label lab.
The following fragment of code shows how to invoke the update_x routine to
update the value of variable x with value 42. First, it loads value 42 into register a0,
then it invokes the update_x routine using the jump and link (jal) instruction.
1 .data
2 x: .skip 4
3
When the update_x routine finishes executing, it needs to return to the calling
site. This operation can be performed by jumping to the address stored in register
ra, which was set by the jal instruction when the routine was invoked. The pseudo-
instruction ret performs this operation.
7.4 Examples
7.4.1 Searching for the maximum value on an array
The following “C” code shows a global array variable named numbers and a function
that returns the largest value from the array.
1 /* Global array */
2 int numbers[10];
3
1 .data
2 # Allocate the numbers array (10 integers = 40 bytes)
3 numbers: .skip 40
4
5 .text
6 get_largest_number:
7 la a5, numbers # a5 = &numbers
8 lw a0, (a5) # a0 (largest) = numbers[0]
9 li a1, 1 # a1 (i) = 1
10 li t4, 10
11 for:
12 bge a1, t4, end # if i >= 10, then end loop
13 slli t1, a1, 2 # t1 = i * 4
14 add t2, a5, t1 # t2 = &numbers + i*4
15 lw t3, (t2) # t3 = numbers[i]
16 blt t3, a0, skip # if numbers[i] < largest, then skip
17 mv a0, t3 # Update largest
18 skip:
19 addi a1, a1, 1 # i = i+1
20 j for
21 end:
22 ret # Return
Alternative solutions:
1 get_largest_number:
2 lui a5,%hi(numbers)
3 lw a0,%lo(numbers)(a5)
4 addi a5,a5,%lo(numbers)
5 addi a3,a5,36
6 .L3:
7 lw a4,4(a5)
8 addi a5,a5,4
9 bge a0,a4,.L2
10 mv a0,a4
11 .L2:
12 bne a5,a3,.L3
13 ret
1 get_largest_number:
2 la a5, numbers # a5: pointer to current element
3 lw a0, (a5) # Load first element (numbers[0])
4 addi a5, a5, 4 # Advance pointer to next element
5 addi a3, a5, 36 # a3 <= address after last element
6 do_while:
7 lw a4, (a5) # Load current element
8 bge a0, a4, skip
9 mv a0, a4 # Update largest
10 skip:
11 addi a5, a5, 4 # Advance pointer to next element
12 bne a5, a3, do_while # do while a5 != a3
13 ret
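The alternative solutions above replace the index i with a pointer that walks the array and stops when it reaches the address right after the last element. A “C” sketch of the same idea is shown below; the function name largest_by_pointer and its parameters are hypothetical, and the code assumes the array has at least two elements (as in the example, which has ten).

```c
#include <stddef.h>

/* Returns the largest value in numbers[0..n-1]; requires n >= 2. */
int largest_by_pointer(const int *numbers, size_t n)
{
    const int *p = numbers;         /* a5: pointer to the current element */
    const int *end = numbers + n;   /* a3: address after the last element */
    int largest = *p;               /* a0 = numbers[0] */
    p = p + 1;                      /* advance to numbers[1] */
    do {
        int current = *p;           /* lw a4, (a5) */
        if (current > largest) {    /* skipped when largest >= current */
            largest = current;      /* mv a0, a4 */
        }
        p = p + 1;                  /* addi a5, a5, 4 */
    } while (p != end);             /* bne a5, a3, do_while */
    return largest;
}
```

Comparing pointers instead of indices removes the shift and the add that the first solution needs to compute the address of numbers[i] on every repetition.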
Implementing routines
[Figure: main memory layout. From address 0x0000 upward: Code, Static Data, Heap (ending at the program break), free space, and the Stack at the high addresses, with the top of the stack marked.]
the heap. Data allocated using the malloc routine, for example, is allocated into the heap. The
memory allocation library is responsible for keeping track of which addresses within the heap are
free or allocated.
2 The program break defines the end of the heap.
For example, in the following code, routine fun is invoked by routine bar, which is
invoked by routine main. Initially, the main routine is active. Then, it invokes the
bar routine, which also becomes active. Finally, the fun routine is invoked and it also
becomes active. At this point, there are three active routines in the system.
1 int a = 10;
2
3 int main()
4 {
5 return bar() + 2;
6 }
7 int bar()
8 {
9 return fun() + 4;
10 }
11 int fun()
12 {
13 return a;
14 }
The set of active routines increases whenever a routine is invoked and decreases
whenever a routine returns. Routines are activated and deactivated in a last-in-first-
out fashion, i.e., the last one to be activated must be the first one to be deactivated.
Consequently, the most natural data structure to keep track of active routines
is a stack.
Routines usually need memory space to store important information, such as local
variables, parameters, and the return address. Hence, whenever a routine is invoked
(and becomes active), the system needs to allocate memory space to store information
related to the routine. Once it returns (is deactivated), all the information associated
with the routine invocation is not needed anymore and this memory space must be
freed.
The program stack is a stack data structure that stores information
about active routines, such as local variables, parameters, and the return address.
The program stack is stored in the main memory and, whenever a routine is invoked,
the information about the routine is pushed on top of the stack, which causes it to
grow. Also, whenever a routine returns, the information about the routine is discarded
by dropping the contents at the top of the stack, which causes it to shrink.
The program stack is allocated at the stack space, which is usually placed at the
end (high addresses) of the memory. As a consequence, the program stack must grow
towards low addresses.
The stack pointer is a pointer to the top of the stack, i.e., it stores the
address of the top of the stack. Growing or shrinking the stack is performed by
adjusting the stack pointer.
In RISC-V, the stack pointer is stored in register sp. Also, in RISC-V, the
stack grows towards low addresses, hence, growing (or allocating space on) the stack
can be performed by decreasing the value of register sp (the stack pointer). The
following code shows how to push the contents of register a0 into the stack. First, the
stack pointer is decreased to allocate space (4 bytes), then, the contents of register
a0 (4 bytes) are stored on the top of the program stack using the sw instruction.
Figure 8.2 (a) illustrates a program stack that starts at address 0x0500 and grows
down to address 0x04E4. Since the stack pointer points to the top of the stack,
register sp contains the value 0x04E4.
(a) (b)
Figure 8.2: Example of a program stack starting at address 0x0500 and growing
downward to address 0x04E4 (a) before and (b) after pushing value 0xFA0312B0 into
the stack.
Figure 8.2 (b) shows how the program stack is modified after executing the following
code, i.e., after pushing the contents of register a0 into the stack. Notice that
the value of the stack pointer (sp) was decremented by 4 and the contents of
register a0 (0xFA0312B0) were stored in memory starting at address 0x04E0.
1 li a0, 0xFA0312B0
2 addi sp, sp, -4 # allocate stack space
3 sw a0, 0(sp) # store data into the stack
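The effect of these instructions on memory can be modeled in “C”. The following sketch is a simplified model written for this illustration, not part of the original text: the names memory, sp, store_word, load_word, push_word, and pop_word are hypothetical, the memory size matches the addresses used in Figure 8.2, and words are assumed to be stored in little-endian byte order.

```c
#include <stdint.h>

#define MEM_SIZE 0x0500u

uint8_t memory[MEM_SIZE];           /* modeled main memory, byte addressed */
uint32_t sp = 0x04E4u;              /* modeled stack pointer, as in Figure 8.2 (a) */

/* Models sw: stores a 32-bit word, least significant byte first. */
void store_word(uint32_t addr, uint32_t value)
{
    for (int k = 0; k < 4; k++)
        memory[addr + k] = (uint8_t)(value >> (8 * k));
}

/* Models lw: loads a 32-bit word stored by store_word. */
uint32_t load_word(uint32_t addr)
{
    uint32_t value = 0;
    for (int k = 0; k < 4; k++)
        value |= (uint32_t)memory[addr + k] << (8 * k);
    return value;
}

void push_word(uint32_t value)
{
    sp = sp - 4;                    /* addi sp, sp, -4: allocate stack space */
    store_word(sp, value);          /* sw a0, 0(sp): store data into the stack */
}

uint32_t pop_word(void)
{
    uint32_t value = load_word(sp); /* lw a0, 0(sp): retrieve the data */
    sp = sp + 4;                    /* addi sp, sp, 4: deallocate stack space */
    return value;
}
```

Calling push_word(0xFA0312B0) in this model moves sp from 0x04E4 to 0x04E0 and places the value at address 0x04E0; a subsequent pop_word() returns the same value and moves sp back to 0x04E4.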
The previous examples discussed how to push and pop a single word (4-byte
value) to and from the stack. In many situations, a program may need to push or
pop multiple values to or from the stack. For example, the program may need to save
a set of register values on the stack. In these cases, the code may be optimized by
adjusting (increasing or decreasing) the stack pointer only once. The following code
shows how to push four values from registers a0, a1, a2, and a3 into the program
stack. Notice that the stack pointer was adjusted only once and the immediate field
of the store word instruction (sw) was used to select the proper position to store each
one of the values. In this example, the last value pushed into the stack was the value
stored in the register a3.
The following code shows how to pop four values from the program stack into
registers a3, a2, a1, and a0. Notice that the stack pointer was adjusted only once and
the immediate field of the load word instruction (lw) was used to select the proper
position to load each one of the values. In this example, the first value popped from
the stack was stored into register a3.
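The two operations described above can be modeled in “C” as follows. This is a sketch under stated assumptions: the stack is modeled as an array of 32-bit words, all names (stack_words, sp_index, push4, pop4) are hypothetical, and the offsets shown in the comments were inferred from the description (the value of a3 is the last one pushed, so it sits at the top of the stack).

```c
#include <stdint.h>

uint32_t stack_words[64];           /* modeled stack region, one entry per word */
int sp_index = 64;                  /* index of the word at the top of the stack */

/* Pushes a0, a1, a2, and a3 with a single stack pointer adjustment. */
void push4(uint32_t a0, uint32_t a1, uint32_t a2, uint32_t a3)
{
    sp_index -= 4;                  /* addi sp, sp, -16 */
    stack_words[sp_index + 3] = a0; /* sw a0, 12(sp) */
    stack_words[sp_index + 2] = a1; /* sw a1, 8(sp)  */
    stack_words[sp_index + 1] = a2; /* sw a2, 4(sp)  */
    stack_words[sp_index + 0] = a3; /* sw a3, 0(sp): last value pushed */
}

/* Pops four values into a3, a2, a1, and a0 with a single adjustment. */
void pop4(uint32_t *a0, uint32_t *a1, uint32_t *a2, uint32_t *a3)
{
    *a3 = stack_words[sp_index + 0]; /* lw a3, 0(sp): first value popped */
    *a2 = stack_words[sp_index + 1]; /* lw a2, 4(sp)  */
    *a1 = stack_words[sp_index + 2]; /* lw a1, 8(sp)  */
    *a0 = stack_words[sp_index + 3]; /* lw a0, 12(sp) */
    sp_index += 4;                   /* addi sp, sp, 16 */
}
```

The immediate offsets play the role that repeated stack pointer adjustments would otherwise play, so eight instructions that adjust sp are replaced by two.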
Previous examples showed how to perform push and pop operations on the program
stack. Popping data from the stack consists of retrieving the data and then
deallocating the stack space. However, if the data is not needed anymore, only
the deallocation needs to be performed. As we will discuss in sections 8.6 and
8.4, allocating or deallocating stack space without immediately storing or retrieving
data to or from the stack is useful in many cases.
The stack pointer register must be initialized to the base of the program stack before
executing the program. When running the program without the support of an operating
system (for example, in an embedded system) the stack pointer must be initialized
by the system initialization code. On the other hand, when running the program on
top of an operating system, the execution environment (e.g., the operating system
kernel) usually initializes the stack pointer before jumping to the program’s entry
point.
Also, let us assume John made this routine available through a binary library.
Now, let us assume Mary has John’s library and wants to invoke the jsort routine.
So far, we know that Mary can link her program with John’s library and invoke the
jsort routine by executing a jal jsort instruction. However, where should Mary
place the routine parameters, i.e., the pointer to the array (char* a) and the size of
the array (int n)?
The answer to the previous question is “it depends on where the code implemented
by John is expecting the parameters”. For example, if the jsort routine is expecting
the first parameter (the pointer to the array) to be placed at register a0 and the
second one (the array size) to be placed at register a1, then, Mary has to place these
parameters in these two registers, otherwise, John’s code will not work properly.
The calling convention, defined by the ABI, defines where routine parameters must
be passed. Assuming John and Mary are following the same ABI, it should be easy
for Mary to place the routine parameters in the correct registers.
There may be multiple, different, ABIs defined for a single computer architecture.
This is the case for x86, for example, with different ABIs defined for different operating
systems. The RISC-V consortium defines several ABIs3. Unless otherwise stated, in
this text we will use the RISC-V ilp32 ABI, which defines that int, long, and
pointers are all 32 bits long. It also defines that long long is a 64-bit type, char
is 8-bit, and short is 16-bit.
NOTE: Only code generated for the same ABI can be linked together by
the linker.
NOTE: When generating code with GCC, the user may specify the ABI using
the -mabi flag. For example, the following command may be used to compile
and assemble the program.c file using the ilp32 ABI:
gcc -c program.c -mabi=ilp32 -o program.o
program stack. As discussed in Section 8.3, the proper places to pass parameters to
a routine are defined by the ABI.
The RISC-V ilp32 ABI defines a set of conventions to pass parameters to routines4.
These conventions specify how different types of values (char, integer, structs,
...) must be passed on registers or the stack. To simplify the discussion, we will focus
on scalar values that can be represented with 32 or fewer bits. These values can be
(unsigned) char, (unsigned) short, (unsigned) integer, or pointer values. The
integer calling convention specifies the following rules to pass these types of values as
parameters:
• The first eight scalar parameters of the routine are passed through registers a0
to a7, one parameter per register. Hence, if the routine has fewer than 9
parameters, all of them are passed through registers. Integer scalars narrower
than 32 bits (e.g., char or short) are widened according to the sign of their type
up to 32 bits, then sign-extended to 32 bits.
• In case there are more than 8 parameters, the remaining parameters, i.e., parameters
9 to N, are passed through the stack. In this case, the parameters must
be pushed into the program stack. The last parameter, i.e., the Nth parameter,
must be pushed first and the 9th one must be pushed last. These parameters
must be later removed from the stack by the same routine that pushed them
into the stack. Again, integer scalars narrower than 32 bits (e.g., char or short)
are widened according to the sign of their type up to 32 bits, then sign-extended
to 32 bits. Consequently, scalar values smaller than 32 bits are expanded to 32
bits and occupy 4 bytes when passed as parameters through the program stack.
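The two rules above can be summarized with a small “C” sketch. It is only an illustration of the rules as stated in this section for 32-bit (or narrower) scalar parameters; the function names param_location, widen_signed_char, and widen_unsigned_char are hypothetical.

```c
#include <stdint.h>

/* Where the k-th scalar parameter (k = 1, 2, ...) is found at the routine
   entry point. Returns n, meaning register "an", for the first eight
   parameters. Returns -1 when the parameter is on the stack; in that case
   *sp_offset receives its byte offset from the stack pointer. */
int param_location(int k, int *sp_offset)
{
    if (k <= 8) {
        return k - 1;               /* parameters 1..8 go on a0..a7 */
    }
    *sp_offset = 4 * (k - 9);       /* the 9th is pushed last, so it is at 0(sp) */
    return -1;
}

/* Narrow scalars are widened according to the sign of their type. */
int32_t widen_signed_char(int8_t v)
{
    return (int32_t)v;              /* sign extension */
}

uint32_t widen_unsigned_char(uint8_t v)
{
    return (uint32_t)v;             /* zero extension */
}
```

For example, in a routine with 10 parameters the 9th one is found at offset 0 and the 10th one at offset 4 from the stack pointer at the routine entry point.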
Before invoking a routine, the caller must set the parameters, i.e., they must
be placed in the registers and into the stack according to the ABI. To illustrate
how to pass parameters, let us assume there is a routine called sum10 that takes, as
parameters, 10 integer values, sums them, and returns the result. The following code
shows the sum10 signature:
According to the RISC-V ilp32 ABI, to invoke this routine, one must place parameters
a, b, c, d, e, f, g, and h on registers a0, a1, a2, a3, a4, a5, a6, and a7,
respectively. Also, parameters i and j must be placed on the stack, with the value
of j pushed first and the value of i pushed last.
The following code shows how to call the sum10 routine passing as arguments
values 10, 20, 30, 40, 50, 60, 70, 80, 90, 100. Notice that the 9th and the 10th
parameters are pushed into the stack (lines 11-15). Also, notice that these parameters
are later removed from the program stack by the same routine that placed them on
the program stack (line 17), i.e., the main routine.
1 # sum10(10,20,30,40,50,60,70,80,90,100);
2 main:
3 li a0, 10 # 1st parameter
4 li a1, 20 # 2nd parameter
5 li a2, 30 # 3rd parameter
6 li a3, 40 # 4th parameter
7 li a4, 50 # 5th parameter
8 li a5, 60 # 6th parameter
9 li a6, 70 # 7th parameter
10 li a7, 80 # 8th parameter
11 addi sp, sp, -8 # Allocate stack space
4 https://github.com/riscv/riscv-elf-psabi-doc/blob/master/riscv-elf.md
The routine that was invoked (sum10 in this example) must retrieve the parameters
from the registers and the stack according to the ABI. The following code shows a
possible implementation for the sum10 routine and illustrates this process. It loads
the values of the 9th and the 10th parameters from the stack into registers t1 and t2.
Notice that, even though it loads these values from the program stack, it does not
pop (deallocate) the values from the program stack - this cleaning process must be
performed by the caller routine, i.e., the routine that invoked the sum10 routine.
1 sum10:
2 lw t1, 0(sp) # Loads the 9th parameter into t1
3 lw t2, 4(sp) # Loads the 10th parameter into t2
4 add a0, a0, a1 # Sums all parameters
5 add a0, a0, a2
6 add a0, a0, a3
7 add a0, a0, a4
8 add a0, a0, a5
9 add a0, a0, a6
10 add a0, a0, a7
11 add a0, a0, t1
12 add a0, a0, t2 # Place return value on a0
13 ret # Returns
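For reference, a “C” version of the routine and of the call discussed above could look like the sketch below. The signature is an assumption based on the description (ten int parameters named a to j), and the helper call_sum10 is a hypothetical name introduced only to show the call.

```c
/* Assumed signature: ten int parameters, named a to j as in the text. */
int sum10(int a, int b, int c, int d, int e,
          int f, int g, int h, int i, int j)
{
    return a + b + c + d + e + f + g + h + i + j;
}

/* Equivalent of the call sum10(10,20,30,40,50,60,70,80,90,100). */
int call_sum10(void)
{
    return sum10(10, 20, 30, 40, 50, 60, 70, 80, 90, 100);
}
```

When this code is compiled for the ilp32 ABI, the compiler is the one responsible for placing the first eight arguments on registers a0 to a7 and the last two on the program stack, and for removing them from the stack after the call.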
1 int pow2(int v)
2 {
3 return v*v;
4 }
According to the RISC-V ilp32 ABI, this parameter must be passed in register
a0. Since it is passed by value, register a0 will contain the value itself. The following
code shows the implementation of the previous “C” code in assembly language. Notice
that the code multiplies the contents of register a0 by itself.
1 pow2:
2 mul a0, a0, a0 # a0 = a0 * a0
3 ret # return
The following code shows how to invoke the pow2 routine to compute the square
of 32. Notice that the value itself is directly placed into register a0.
1 main:
2 li a0, 32 # set the parameter with value 32
3 jal pow2 # invoke pow2
4 ret
1 void inc(int* v)
2 {
3 *v = *v + 1;
4 }
Again, according to the RISC-V ilp32 ABI, this parameter must be passed in
register a0. Since it is passed by reference, register a0 will contain the address of the
variable. The following code shows the implementation of the previous “C” code in
assembly language. Notice that the code uses the address passed in register a0 to
update the contents of the variable.
1 inc:
2 lw a1, (a0) # a1 = *v
3 addi a1, a1, 1 # a1 = a1 + 1
4 sw a1, (a0) # *v = a1
5 ret
The following code shows how the inc routine can be invoked to increase the value
of variable y. Notice that the address of variable y, instead of its value, is loaded into
register a0.
1 .data
2 y: .skip 4
3
4 .text
5 main:
6 la a0, y # set the parameter with the address of y
7 jal inc # invoke inc
8 ret
Reference parameters can be used to pass information in and out of routines. Since
a reference is essentially a memory address, the information being passed into or out
of the routine must be located in the memory.
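The difference between the two kinds of parameters can be observed from the caller side. In the sketch below, pow2_value and inc_ref mirror the pow2 and inc routines shown previously; these two names and the two checking functions are hypothetical and were introduced only for this illustration.

```c
/* Receives a copy of the value (mirrors pow2). */
int pow2_value(int v)
{
    return v * v;
}

/* Receives the address of the variable (mirrors inc). */
void inc_ref(int *v)
{
    *v = *v + 1;
}

/* Returns 1 if the variable of the caller is left untouched by the call. */
int by_value_keeps_caller_copy(void)
{
    int n = 32;
    int squared = pow2_value(n);
    return (n == 32) && (squared == 1024);
}

/* Returns the value of the variable of the caller after the call. */
int by_reference_updates_caller(void)
{
    int y = 7;
    inc_ref(&y);                    /* y must live in memory to have an address */
    return y;
}
```

In the first case the routine works on a copy held in a register, so the variable of the caller cannot change. In the second case the routine uses the address to update the variable in memory.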
1 int x;
2
3 int main()
4 {
5 return x+1;
6 }
Global variables are allocated on the static data space by the assembler
and are usually declared on assembly programs with the help of directives.
The following code shows the assembly code for the previous “C” program.
1 .data
2 x:
3 .skip 4
4
5 .text
6 main:
7 la a0, x # Loads the address of variable x
8 lw a0, 0(a0) # Loads the value of x
9 addi a0, a0, 1 # Increments the value
10 ret # Return
In the previous code, the .data directive (line 1) informs the assembler that the
following contents must be placed into the static data space. The x label marks the
address of variable x. The .skip 4 directive (line 3) instructs the assembler to skip
four bytes, which is used to allocate space for variable x. The .text directive (line 5)
informs the assembler that the following contents must now be placed into the code
space. The remainder of the code (lines 6-10) implements the main routine.
In high-level languages, such as in “C”, local variables are variables declared inside
routines, and can be used only inside the routine that declares them.
Ideally, local variables should be allocated on registers. The following code contains
a local variable called tmp that can be allocated on a register.
The following code shows the assembly code for the previous “C” code. Notice
that the local variable tmp was allocated on register a2.
1 exchange:
2 lw a2, (a1) # tmp = *b
3 lw a3, (a0) # a3 = *a
4 sw a3, (a1) # *b = a3
5 sw a2, (a0) # *a = tmp
6 ret
1 int foo()
2 {
3 int userid;
4 get_uid(&userid);
5 return userid;
6 }
The following code shows the assembly code for the foo routine. First, the stack
pointer is decreased to allocate space for the userid variable (line 2). Then, the
address of the userid variable is loaded into register a0 to be passed as parameter
to routine get_uid (line 3). Notice that, since the last element added to the program
stack was the userid variable, the stack pointer points to (contains the address of)
this variable. Next, the get_uid routine is invoked (line 4) and, after returning, the
value of the userid variable is loaded into register a0 to be returned (line 5)5. Finally,
before returning from the foo routine, the stack pointer is increased to deallocate the
userid variable from the program stack (line 6).
1 foo:
2 addi sp, sp, -4 # Allocate userid
3 mv a0, sp # a0 = address of userid (&userid)
4 jal get_uid # Invoke the get_uid routine
5 lw a0, (sp) # a0 = userid
6 addi sp, sp, 4 # Deallocate userid
7 ret
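To experiment with the “C” version of the foo routine, a stand-in for get_uid is needed, since this routine is not shown in the text. In the sketch below, the body of get_uid and the value 1000 are assumptions made only to obtain a complete example.

```c
/* Hypothetical stand-in for get_uid: writes a user id through the pointer. */
void get_uid(int *uid)
{
    *uid = 1000;                    /* arbitrary value chosen for this example */
}

int foo(void)
{
    int userid;                     /* must live in memory so that &userid exists */
    get_uid(&userid);               /* mv a0, sp ; jal get_uid */
    return userid;                  /* lw a0, (sp): reloaded after the call */
}
```

Since the address of userid is passed to another routine, the variable cannot be kept only on a register, which is why the assembly code allocates it on the program stack.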
The following code shows another example in which a local variable needs to be
allocated in memory. In this case, the my_array local variable needs to be allocated
on memory because it is an array. Also, the address of this variable is passed to
routine init_array.
1 int bar()
2 {
3 int my_array[8];
4 init_array(my_array);
5 return my_array[4];
6 }
5 Since the value of variable userid may have been modified by the get_uid routine, the code
needs to load the value of variable userid from memory after the execution of the get_uid routine.
The following code shows the assembly code for the bar routine. First, the stack
pointer is decreased to allocate space for the my_array variable (line 2)6. Then, the
address of the my_array variable is loaded into register a0 to be passed as parameter
to routine init_array (line 3). Again, since the last element added to the program
stack was the my_array variable, the stack pointer points to (contains the address of)
this variable7. Next, the init_array routine is invoked (line 4) and, after returning,
the value of my_array[4] is loaded into register a0 for return (line 5)8. Finally,
before returning from the bar routine, the stack pointer is increased to deallocate the
my_array variable (line 6) from the program stack.
1 bar:
2 addi sp, sp, -32 # Allocate my_array
3 mv a0, sp # a0 = address of my_array
4 jal init_array # Invoke the init_array routine
5 lw a0, 16(sp) # Load my_array[4] into a0
6 addi sp, sp, 32 # Deallocate my_array
7 ret
The following code shows yet another example in which a local variable needs to
be allocated in memory. In this case, the d local variable needs to be allocated on
memory because it is a struct. Also, the address of this variable is passed to routine
init_date.
1 typedef struct
2 {
3 int year;
4 int month;
5 int day;
6 } date_t;
7
8 int get_current_day()
9 {
10 date_t d;
11 init_date(&d);
12 return d.day;
13 }
The following code shows the assembly code for the get_current_day routine.
First, the stack pointer is decreased to allocate space for the d variable (line 2)9.
Then, the address of variable d is loaded into register a0 to be passed as
parameter to routine init_date (line 3). Again, since the last element added to the
program stack was variable d, the stack pointer points to (contains the address of) this
variable10. Next, the init_date routine is invoked (line 4) and, after returning, the
value of d.day is loaded into register a0 for return (line 5)11. Finally, before returning
from the get_current_day routine, the stack pointer is increased to deallocate the d
variable (line 6) from the program stack.
6 Notice that the my_array variable is a 32-byte long array - it contains eight 4-byte integers.
7 At this point, the stack pointer points to the first element of the my_array array, i.e., my_array[0].
8 Since the stack pointer (sp) is pointing to my_array[0] and each array element has four bytes,
field.
11 Since the stack pointer (sp) is pointing to d.year and each field has four bytes, d.day is located
1 get_current_day:
2 addi sp, sp, -12 # Allocate d
3 mv a0, sp # a0 = address of d
4 jal init_date # Invoke the init_date routine
5 lw a0, 8(sp) # Load d.day into a0
6 addi sp, sp, 12 # Deallocate d
7 ret
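The offsets 16 and 8 used by the lw instructions in the bar and get_current_day routines, and the sizes 32 and 12 used to allocate the variables, can be double-checked in “C”. The sketch below uses int32_t instead of int so that the sizes match the ilp32 ABI regardless of the machine used to compile it; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t year;
    int32_t month;
    int32_t day;
} date_t;

/* Byte offset of element "index" in an array of 32-bit integers. */
size_t int_array_offset(size_t index)
{
    return index * sizeof(int32_t);
}

/* Byte offset of field day from the beginning of a date_t variable. */
size_t date_day_offset(void)
{
    return offsetof(date_t, day);
}

/* Number of bytes required to allocate a date_t variable. */
size_t date_size(void)
{
    return sizeof(date_t);
}
```

Since the stack pointer points to the beginning of the variable in both routines, these offsets can be used directly as the immediate of the lw instruction, e.g., 16(sp) for my_array[4] and 8(sp) for d.day.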
1 exchange:
2 lw a2, (a1) # tmp = *b
3 lw a3, (a0) # a3 = *a
4 sw a3, (a1) # *b = a3
5 sw a2, (a0) # *a = tmp
6 ret
Notice that the assembly code uses registers a2 and a3 to perform the computation.
In this case, the contents of these two registers are destroyed by the lw instructions
(lines 2 and 3).
Now, let us assume that the mix routine loads some “important information” into register
a2 and, before using this information, it invokes the exchange routine. The
following code illustrates this situation. First, the mix routine loads the important
information into register a2 (line 2). Then, it sets the parameters and invokes the
exchange routine (lines 3-5). Finally, the mix routine returns the important information
by copying it from register a2 to register a0 (line 6) and executing the ret
instruction (line 7).
1 mix:
2 lw a2, (a0) # load important information into a2
3 la a0, x # Sets parameter 0 with address of var. x
4 la a1, y # Sets parameter 1 with address of var. y
5 jal exchange # Invokes exchange to swap x and y values
6 mv a0, a2 # Move important information into a0 to return
7 ret
Notice, however, that the exchange routine destroys the contents of register a2.
Consequently, the value returned by the mix routine is not the “important information”
that was loaded into register a2 at line 2.
To solve the problem, the mix routine could save the contents of register a2 on the
program stack before invoking the exchange routine and restore it after the exchange
routine returns. The following code illustrates this situation. Notice that the contents
of register a2 are saved into the program stack (lines 3 and 4) before invoking the
exchange routine and restored (lines 8 and 9) after the exchange routine returns.
1 mix:
2 lw a2, (a0) # load important information into a2
3 addi sp, sp, -4 # Saves a2: Allocate stack space
4 sw a2, (sp) # Store a2 into the stack
5 la a0, x # Sets parameter 0 with address of var. x
6 la a1, y # Sets parameter 1 with address of var. y
7 jal exchange # Invokes exchange to swap x and y values
8 lw a2, (sp) # Restores a2: Loads a2 from the stack
9 addi sp, sp, 4 # Deallocate the stack space
10 mv a0, a2 # Move important information into a0 to return
11 ret
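The problem and this first solution can be reproduced with a small “C” model in which registers are global variables shared by all routines. This is only a sketch written for this illustration: the memory is reduced to two words (x at index 0 and y at index 1), addresses are array indices, the important information is received as a parameter instead of being loaded from memory, and all names other than exchange are hypothetical.

```c
#include <stdint.h>

int32_t a0, a1, a2, a3;             /* modeled registers, shared by all routines */
int32_t mem[2] = { 1, 2 };          /* modeled variables x (index 0) and y (index 1) */
int32_t stack[8];                   /* modeled program stack */
int sp = 8;                         /* modeled stack pointer (index of the top) */

void exchange(void)
{
    a2 = mem[a1];                   /* lw a2, (a1): destroys a2 */
    a3 = mem[a0];                   /* lw a3, (a0): destroys a3 */
    mem[a1] = a3;                   /* sw a3, (a1) */
    mem[a0] = a2;                   /* sw a2, (a0) */
}

int32_t mix_without_save(int32_t important)
{
    a2 = important;                 /* important information on a2 */
    a0 = 0;                         /* la a0, x */
    a1 = 1;                         /* la a1, y */
    exchange();                     /* jal exchange */
    a0 = a2;                        /* mv a0, a2: a2 was overwritten */
    return a0;
}

int32_t mix_with_save(int32_t important)
{
    a2 = important;
    sp = sp - 1;                    /* addi sp, sp, -4 */
    stack[sp] = a2;                 /* sw a2, (sp) */
    a0 = 0;
    a1 = 1;
    exchange();
    a2 = stack[sp];                 /* lw a2, (sp) */
    sp = sp + 1;                    /* addi sp, sp, 4 */
    a0 = a2;
    return a0;
}
```

In this model, mix_without_save returns the value that exchange left on a2 (the old value of y), while mix_with_save returns the important information that was received.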
Another way to solve this problem is to modify the exchange routine to save
and restore all the registers that it might change. The following code illustrates
this situation. Notice that the contents of registers a2 and a3 are saved into the
program stack (lines 2-4) at the beginning of the routine and restored (lines 9-11)
before returning from the routine.
1 exchange:
2 addi sp, sp, -8 # Allocate stack space
3 sw a2, 4(sp) # Save contents of a2
4 sw a3, 0(sp) # Save contents of a3
5 lw a2, (a1) # tmp = *b
6 lw a3, (a0) # a3 = *a
7 sw a3, (a1) # *b = a3
8 sw a2, (a0) # *a = tmp
9 lw a3, 0(sp) # Restore contents of a3
10 lw a2, 4(sp) # Restore contents of a2
11 addi sp, sp, 8 # Deallocate stack space
12 ret
the caller routine after the call site. In the previous example, the mix routine must
save a2 because it needs the value of a2 after the call site.
Also, it is important to notice that there is no need for the callee routine to
save and restore all the callee-saved registers, only the ones that are modified by the
routine. As an example, there is no need for the exchange routine to save registers
s0-s11 since it does not modify these registers.
1 mix:
2 addi sp, sp, -4 # Saves ra: Allocate stack space
3 sw ra, (sp) # Store ra into the stack
4 lw a2, (a0) # load important information into a2
5 addi sp, sp, -4 # Saves a2: Allocate stack space
6 sw a2, (sp) # Store a2 into the stack
7 la a0, x # Sets parameter 1 with address of var. x
8 la a1, y # Sets parameter 2 with address of var. y
9 jal exchange # Invokes exchange to swap x and y values
10 lw a2, (sp) # Restores a2: Loads a2 from the stack
11 addi sp, sp, 4 # Deallocate the stack space
12 mv a0, a2 # Move important information into a0 to return
13 lw ra, (sp) # Restores ra: Loads ra from the stack
14 addi sp, sp, 4 # Deallocate the stack space
15 ret
Leaf routines are routines that do not call other routines. Since they do
not call other routines, the contents of register ra are not modified. Hence, there is
no need to save the return address on the stack when implementing leaf routines. In
the previous examples, the exchange routine is a leaf routine, hence, there is no need
to save and restore the contents of the return address.
Finally, the standard ABI specifies that routines should not modify the integer
registers tp and gp, because signal handlers may rely upon their values.
Figure 8.3: Program stack with data from three active routines A, B, and C.
The following code shows the addijx routine implemented in assembly. Notice
that the return address is saved to (lines 2 and 3) and restored from (lines 9 and 10)
the program stack. At the entry point, the stack pointer points to the 9th parameter
(i), however, after the return address is saved on (pushed into) the stack, the stack
pointer points to the return address. Hence, to access the 9th parameter after this
point, the code must add four to the stack pointer (line 5).
1 addijx:
2 addi sp, sp, -4 # Saves the
3 sw ra, (sp) # return address
4 jal get_x # Invoke the get_x routine
5 lw a1, 4(sp) # Loads i from the program stack
6 lw a2, 8(sp) # Loads j from the program stack
7 add a0, a0, a1 # a0 = get_x() + i
8 add a0, a0, a2 # a0 = get_x() + i + j
9 lw ra, (sp) # Restore the
10 addi sp, sp, 4 # return address
11 ret # Returns
12 Notice that the same routine may be invoked multiple times before returning; hence, this routine
may have multiple activations and, consequently, multiple stack frames. This is usually the case for
recursive routines.
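For reference, the addijx assembly routine corresponds roughly to the following C sketch. Only i, j, and get_x come from the text; the names of the first eight parameters (a to h) and the body of get_x are assumptions made purely for illustration.

```c
/* Hypothetical stand-in for the get_x routine invoked by addijx. */
static int x = 10;

int get_x(void)
{
    return x;
}

/* Ten integer parameters: under the ilp32 ABI, a..h are passed in
   registers a0-a7, while i and j (the 9th and 10th parameters) are
   passed on the program stack. */
int addijx(int a, int b, int c, int d, int e, int f, int g, int h,
           int i, int j)
{
    /* The first eight parameters are not used in this sketch. */
    (void)a; (void)b; (void)c; (void)d;
    (void)e; (void)f; (void)g; (void)h;
    return get_x() + i + j;
}
```

Because i and j are the only parameters that do not fit in registers a0 to a7, they are the ones the assembly code must load from the program stack.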
The more information is added to the stack, the harder it may get to track the
addresses of all parameters and local variables across the routine. One way of mit-
igating this problem is to keep a fixed pointer to the stack so that all parameters
and local variables can be accessed using this pointer plus a fixed offset. The frame
pointer points to the beginning of the stack frame of the currently execut-
ing routine. As a consequence, it provides a fixed pointer to the stack across the
execution of a routine and may be used as a fixed reference to access parameters and
local variables.
In the RISC-V ilp32 ABI, the frame pointer is stored in the frame pointer register,
or fp. The fp register must be initialized at the beginning of the routine; however,
its previous contents must be saved so that they can be restored before returning from
the routine. Also, in most cases, instead of pushing information one by one on the
program stack, each stack frame can be allocated with a single instruction at the
beginning of the routine and deallocated with a single instruction before returning.
The following code shows an example in which the stack frame is allocated (deallocated)
at the beginning (end) of the routine (lines 2 and 15) and the frame pointer is used
to access the parameters using a fixed offset (lines 8 and 9).
1 addijx:
2 addi sp, sp, -8 # Allocates the stack frame
3 sw ra, 4(sp) # Saves return address
4 sw fp, 0(sp) # Saves previous frame pointer
5 addi fp, sp, 8 # Adjust frame pointer.
6
In the previous example, the addijx stack frame had 8 bytes and stored the return
address and the previous frame pointer. In case more registers need to be saved or
local variables need to be stored on the program stack, the stack frame may be easily
increased by changing the constant (8) in lines 2 and 15.
• Include a label to define the routine entry point. When translating “C” code to
assembly code, the label must match the “C” routine name;
• Use the return instruction (ret) to return from the routine. This instruction
jumps to the address that is stored in the return address register (ra);
• Parameters must be accessed according to the RISC-V ilp32 ABI. Considering
scalar parameters smaller than or equal to 32 bits, the first eight parameters
are expected in registers a0 to a7, and the remaining ones on the stack. Also,
integer scalars narrower than 32 bits (e.g., char or short) are widened according
to the sign of their type up to 32 bits.
Parameters passed on the stack are organized so that the last parameter, i.e.,
the Nth parameter, is pushed first and the 9th is pushed last. As a consequence,
upon the routine entrance, the stack pointer points to the 9th parameter, sp+4
points to the 10th parameter, and so on. Parameters are allocated on the program
stack by the caller routine and must also be deallocated by the caller
routine. The callee must not deallocate parameters allocated by the caller;
• In case the routine needs to store information on the program stack, a stack
frame should be allocated at the beginning of the routine and deallocated before
returning. The size of the stack frame must be a multiple of 16 to ensure the
stack pointer remains aligned to a 128-bit boundary, as required by the standard
ABI;
• The routine may use registers to implement its functionality, however, callee-
saved registers that are modified by the routine must be saved in the beginning
of the routine and restored before returning from it. These registers must be
saved on the stack frame;
• The routine may modify and use caller-saved registers without saving them;
however, in case their values need to be preserved across a call site, the routine
must save (restore) their contents before (after) the call site. Routines that call
other routines must save and restore the return address register to preserve its
contents across call sites. These registers must be saved on the stack frame;
• Local variables may be allocated on registers or on memory. Local variables
that need to be allocated on memory must be allocated on the stack frame;
• Optionally, the frame pointer register (fp) may be used to keep a pointer to the
beginning of the stack frame and provide a fixed reference to access parameters
and local variables. In this case, the previous frame pointer must be preserved
when returning from the routine, hence, the contents of the frame pointer reg-
ister must be saved in the stack frame at the beginning of the function and
restored before returning.
• The standard ABI specifies that routines should not modify the integer registers
tp and gp.
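The rule that the stack frame size must be a multiple of 16 can be illustrated with a small arithmetic sketch in C. The helper name frame_size is hypothetical and is not part of the ABI; it only shows how a byte count may be rounded up to the next multiple of 16.

```c
#include <stdint.h>

/* Rounds the number of bytes a routine needs to store on the stack up
   to the next multiple of 16, so that the stack pointer remains aligned
   to a 128-bit boundary. (Hypothetical helper, for illustration only.) */
uint32_t frame_size(uint32_t bytes_needed)
{
    return (bytes_needed + 15u) & ~15u;
}
```

For example, a routine that only needs to save ra and fp (8 bytes) would still allocate a 16-byte frame, and a routine that needs 20 bytes would allocate 32.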
8.10 Examples
This section presents examples of assembly code generated for “C” routines.
1 int factorial(int n)
2 {
3 if (n>1)
4 return n * factorial(n-1);
5 else
6 return 1;
7 }
Notice that, if the parameter n is greater than one, then factorial of n is computed
by multiplying the value of n by the factorial of n-1, which is computed by a recursive
call to the factorial routine.
Generating code for a recursive routine is as simple as generating code for any
non-leaf routine. The following code shows how the previous recursive routine can be
implemented in assembly. First, the stack frame is allocated and the return address
is saved on it (lines 2 and 3). Then, n is compared with 1 (lines 4 and 5) and, if n is
less than or equal to one, the code jumps to the “else block” (lines 12 and 13); otherwise,
it proceeds to the “then block” (lines 6 to 11). The “else block” simply sets a0 to 1
for return and proceeds with the routine finalization code, i.e., the code that restores
the return address, deallocates the stack frame, and returns (lines 15 to 17). The
“then block” (lines 6 to 11) implements the code that performs the recursive call
(i.e., n * factorial(n-1)). First, it saves the value of a0 (n) on the stack frame to
preserve it across the call site (line 6). Then, it sets the parameter for the recursive
call and invokes the routine¹³ (lines 7 and 8). Next, it recovers the value of n from
the stack frame into register a1 (line 9) and multiplies it by the value returned by
the recursive call, which is located in a0 (line 10). Finally, it jumps to the fact_end
label to execute the routine finalization code.
1 factorial:
2 addi sp, sp, -16 # Allocates the routine frame
3 sw ra, 0(sp) # Saves the return address
4 li a1, 1
5 ble a0, a1, else # if (n>1)
6 sw a0, 4(sp) # Saves n (a0) on the routine frame
7 addi a0, a0, -1 # Set the parameter (n-1)
8 jal factorial # Perform the recursive call
9 lw a1, 4(sp) # Loads n from the routine frame (into a1)
10 mul a0, a0, a1 # a0 = factorial(n-1) * n
11 j fact_end # Jumps to end
12 else:
13 li a0, 1 # Set the return value to 1
14 fact_end:
15 lw ra, 0(sp) # Restores the return address
16 addi sp, sp, 16 # Deallocate the routine frame
17 ret # Return
This routine takes three parameters: the file descriptor (fildes), a pointer to the
buffer that contains the information that must be written to the file (buf), and the
number of bytes that must be written (nbyte). Also, it returns the number of bytes
written to the file.
13 The only difference between a recursive routine and a regular non-leaf routine is that the
recursive one invokes itself, while other non-leaf routines invoke other routines.
The following assembly code shows a possible implementation for the write rou-
tine. This routine receives parameters fildes, buf, and nbyte on registers a0, a1,
and a2, respectively. These parameters are the same parameters that must be passed
to the syscall and are already placed on the correct registers, hence, there is no need
to adjust the parameters when invoking the system call on line 5.
1 write:
2 addi sp, sp, -16 # Allocates the stack frame
3 sw ra, 12(sp) # Saves the return address
4 li a7, 64 # Sets the syscall code (64 = write)
5 ecall # Invokes the operating system
6 lw ra, 12(sp) # Restores the return address
7 addi sp, sp, 16 # Deallocates the stack frame
8 ret # Returns
System-level programming
Chapter 9
Accessing peripherals
As discussed in the previous chapters, the CPU executes programs that are stored on
the main memory. In this process, the CPU fetches the program’s instructions from
the main memory and executes them, which may cause the CPU to load or store data
on the main memory. The previous chapters also explain that user-level programs
perform input and output operations by invoking the operating system.
This chapter discusses how programs may directly interact with input and output
hardware devices to perform input and output operations. This task is useful when
developing software for a system that does not contain an operating system or when
implementing operating systems’ components, such as device drivers.
The remainder of the chapter is organized as follows: Section 9.1 introduces the
concept of peripherals and discusses how they are connected to the CPU. Section 9.2
presents the two main methods for programs to interact with peripherals: port-
mapped I/O and memory-mapped I/O. Section 9.3 discusses how I/O operations
are performed on RISC-V-based computing systems. Finally, Section 9.4 discusses
the busy waiting concept.
9.1 Peripherals
Peripherals are input/output, or I/O, devices that are connected to the
computer. There are several kinds of peripherals. Mice, keyboards, image scanners,
barcode readers, game controllers, microphones, webcams, and read-only memories
are examples of input devices. Monitors, projectors, printers, headphones, and computer
speakers are examples of output devices. There are also devices that perform
both input and output operations, such as data storage devices (including disk drives,
USB flash drives, memory cards, and tape drives) and network cards.
Input and output devices interface with the CPU through a bus, which is a com-
munication system that transfers information between the computer components.
This system is usually composed of wires that are responsible for transmitting the
information and associated circuitries, which orchestrate communication. Figure 9.1
illustrates a computer system in which a system bus connects the CPU, the main
memory, a persistent storage device (HDD), an input device, and an output device.
[Figure 9.1: diagram of a system bus connecting the CPU, the main memory, the HDD, the input device, and the output device.]
the CPU to perform input and output operations. To discuss this concept, let us
consider a hypothetical computing system that has a seven-segment display (an output
peripheral) attached to a display controller, which, in turn, is connected to the CPU
through the bus, as illustrated in Figure 9.2.
Figure 9.2: Computing system with a seven-segment display and a display controller.
(In the figure, the display controller’s control register is shown at address 0x40.)
Seven-segment displays are devices that contain seven segments and one dot that
can be lit up individually. Modern seven-segment displays are implemented using
one light-emitting diode (LED) per segment and one for the dot. The segments and
the dot are positioned on the display so that it is possible to display patterns that
resemble decimal digits by lighting up a subset of the display segments. For example,
one may turn on segments a, f, g, c, and d to show a pattern that resembles the
decimal digit ‘5’, as illustrated in Figure 9.2.
The display controller is the device responsible for controlling the seven-segment
display. It is connected to the seven-segment display LEDs using electrical wires, and
it turns on or off each one of the segments and the dot according to the contents of
an eight-bit register called control register. Each bit of the control register (Control
Reg.) controls whether each display segment or dot must be turned on or off. In this
case, bits 7, 6, 5, 4, 3, 2, 1, and 0 (the rightmost one) control the dot (p), and the
segments a, b, c, d, e, f, and g, respectively. Figure 9.3 shows the value that must be
written into the control register (0x5b) to turn on the segments that show a pattern
that resembles the decimal digit ‘5’.
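The bit assignment described above can be checked with a short C sketch. The enumerator names (SEG_A to SEG_G and SEG_P) and the function name are hypothetical; the bit positions follow the text, i.e., bit 7 controls the dot and bits 6 down to 0 control segments a to g.

```c
#include <stdint.h>

/* Bit masks for the control register, following the text:
   bit 7 = dot (p), bit 6 = a, bit 5 = b, bit 4 = c,
   bit 3 = d, bit 2 = e, bit 1 = f, bit 0 = g. */
enum segment {
    SEG_G = 1 << 0,
    SEG_F = 1 << 1,
    SEG_E = 1 << 2,
    SEG_D = 1 << 3,
    SEG_C = 1 << 4,
    SEG_B = 1 << 5,
    SEG_A = 1 << 6,
    SEG_P = 1 << 7
};

/* Control byte that lights up segments a, f, g, c, and d, which
   together resemble the decimal digit 5. */
uint8_t pattern_for_digit_five(void)
{
    return (uint8_t)(SEG_A | SEG_F | SEG_G | SEG_C | SEG_D);
}
```

Combining the five masks yields 0b01011011, which is the value 0x5b mentioned in the text.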
address and wait for the main memory to place the data on the bus so it can copy the
data into one of its internal registers. The CPU employs the same process to interact
with peripherals’ controllers, i.e., the CPU sends/receives information (commands,
addresses, and data) to/from controllers through the bus.
CPUs are usually connected physically to the main memory and to peripherals’
controllers through one or more buses. There are several kinds of buses and their
organization and implementation may vary dramatically. For example, some buses
may employ a single set of wires to transmit addresses, data, and commands, while
others may use dedicated wires for each one of these tasks. Also, the number of buses
and their disposition on the system may vary significantly across computing systems.
Even though buses’ implementation and organization may vary dramatically, their
characteristics are usually transparent to the programmer, i.e., they do not affect
how the programmer generates code that interacts with peripherals’ controllers nor
the main memory. The CPU ISA usually provides the programmer with instructions
that hide the details (e.g., their protocols and inner workings) of how the CPU or
the peripherals interact with each other or the bus. These instructions allow the
programmer to instruct the CPU to write/read data to/from peripherals’ registers
and their internal memories in a simple way. For example, the RV32I ISA contains
load and store instructions (e.g., lw and sw¹) that allow the programmer to instruct
the CPU to read/write data from/to the main memory without worrying how the bus
that connects the CPU to the main memory works.
There may be several peripherals on the system, and each one of them may have
multiple registers or internal memories. Hence, there must be a way for programmers
to specify the proper peripheral register or internal memory position to be accessed
by the instruction. This is usually performed by associating each peripheral register
and internal memory position with a different identifier, often an integer number,
which may be known as an address or an I/O port. In this context, instructions used
for interacting with peripherals usually identify the peripheral register or memory
position by its address or I/O port.
Sections 9.2.1 and 9.2.2 discuss the two main methods of accessing peripheral
registers and their internal memories by executing CPU instructions.
1 in 0x71, %al
1 See Section 6.6 for more information on RV32I load and store instructions.
2 The IA-32 ISA family is a set of ISAs developed by Intel and based on the ISA used on the 8086
microprocessor.
3 al, ax, and eax are 8-bit, 16-bit, and 32-bit CPU registers on the IA-32 ISA.
The output to port instruction, or out, also takes two operands; however, the
first one specifies the target I/O port and the second one the source CPU register.
The out instruction copies the value from the source CPU register into the peripheral
register, or internal memory word, identified by the I/O port operand. The following
code shows an example in which the out instruction is used to write the 8-bit value
stored at the CPU register al into the peripheral register (or internal memory word)
identified by the I/O port 0x70.
Address space (hexadecimal)   Device                         Size
0000 0000 - 0001 0000         Internal RAM (boot memory)     64 KB
...                           ...
53FB C000 - 53FB FFFF         UART-1                         16 KB
...                           ...
53FA 0000 - 53FA 3FFF         GPT                            16 KB
...                           ...
53F8 4000 - 53F8 400B         GPIO                           12 B
...                           ...
0FFF C000 - 0FFF FFFF         TZIC                           16 KB
...                           ...
7000 0000 - 8000 0000         Main Memory (off-chip DDR2)    256 MB
Figure 9.4: A single address space mapped to main memory (addresses 0x70000000
to 0x80000000) and multiple peripherals.
4 This is the address mapping employed on the Freescale i.MX53 platform. The UART (Universal
asynchronous receiver-transmitter), the GPT (General Purpose Timer), the GPIO (General-Purpose
Input/Output), and the TZIC (Trusted-Zone Interrupt Controller) are peripherals in the system.
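To make the idea of a single address space more concrete, the following C sketch models the address map of Figure 9.4 as a lookup table. The type and function names are hypothetical, and the sketch assumes that the upper labels of the Internal RAM and Main Memory ranges (0001 0000 and 8000 0000) are exclusive boundaries, which is consistent with the sizes shown in the figure (64 KB and 256 MB).

```c
#include <stddef.h>
#include <stdint.h>

enum device {
    DEV_NONE, DEV_INTERNAL_RAM, DEV_TZIC, DEV_GPIO,
    DEV_GPT, DEV_UART1, DEV_MAIN_MEMORY
};

struct region {
    uint32_t first;   /* first address of the range (inclusive) */
    uint32_t last;    /* last address of the range (inclusive)  */
    enum device dev;  /* device mapped to the range             */
};

/* Address ranges taken from Figure 9.4. */
static const struct region address_map[] = {
    { 0x00000000u, 0x0000FFFFu, DEV_INTERNAL_RAM },
    { 0x0FFFC000u, 0x0FFFFFFFu, DEV_TZIC },
    { 0x53F84000u, 0x53F8400Bu, DEV_GPIO },
    { 0x53FA0000u, 0x53FA3FFFu, DEV_GPT },
    { 0x53FBC000u, 0x53FBFFFFu, DEV_UART1 },
    { 0x70000000u, 0x7FFFFFFFu, DEV_MAIN_MEMORY }
};

/* Returns the device that responds to a given address, or DEV_NONE
   if the address is not mapped to any device in the table. */
enum device device_at(uint32_t address)
{
    size_t n = sizeof address_map / sizeof address_map[0];
    for (size_t i = 0; i < n; i++) {
        if (address >= address_map[i].first &&
            address <= address_map[i].last) {
            return address_map[i].dev;
        }
    }
    return DEV_NONE;
}
```

In a real system this decoding is performed by hardware; the table only illustrates that the same kind of address may select either a memory word or a peripheral register.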
Figure 9.5: RV32I-based computing system with a seven-segment display and a floor
sensor. (In the figure, the display controller’s control register is mapped to address
0x00000040 and holds 00110011, and the floor sensor controller’s data register is mapped
to address 0x00000080 and holds 00000100.)
1 .section .text
2 .set DISPLAY_CONTROL_REG_PORT, 0x00000040
3 .set FLOOR_DATA_REG_PORT, 0x00000080
4
5 update_display:
6 li a0, FLOOR_DATA_REG_PORT # Reads the floor number and
7 lb a1, (a0) # store into a1
8 la a0, floor_to_pattern_table # Converts the floor number
9 add t0, a0, a1 # into a configuration
10 lb a1, (t0) # byte
11 li a0, DISPLAY_CONTROL_REG_PORT # Sets the display controller
12 sb a1, (a0) # with the configuration byte
13 ret # Returns
14
15 .section .rodata
16 floor_to_pattern_table:
17 .byte 0x7e,0x30,0x6d,0x79,0x33,0x5b,0x5f,0x70,0x7f,0x7b
As discussed in Section 9.1, each bit of the configuration byte controls whether
each segment (or the dot) is turned on or off. In this context, the code must convert
the floor number to a configuration byte that turns on a subset of the segments so
the pattern displayed resembles the floor number. For example, if the elevator is
located on the fourth floor (floor number = 4), the code must write the value 0x33
(0b00110011) to turn on segments b, c, f, and g, as illustrated in Figure 9.5. Notice
that the code employs a table (floor_to_pattern_table) that can be indexed by the
floor number to retrieve the proper configuration byte.
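The same table lookup can be sketched in C. To keep the sketch runnable on an ordinary computer, the two peripheral registers are passed in as pointers instead of being hard-coded as addresses 0x00000080 and 0x00000040; this parameterization and the bounds check are assumptions of the sketch and are not present in the assembly code.

```c
#include <stdint.h>

/* Configuration bytes for the digits 0 to 9 (same values as the
   floor_to_pattern_table in the assembly code). */
static const uint8_t floor_to_pattern_table[10] = {
    0x7e, 0x30, 0x6d, 0x79, 0x33, 0x5b, 0x5f, 0x70, 0x7f, 0x7b
};

/* Reads the floor number from the floor sensor data register and writes
   the matching configuration byte to the display control register.
   The volatile qualifier tells the compiler that these locations may
   change (or have side effects) outside the program's control. */
void update_display(volatile const uint8_t *floor_data_reg,
                    volatile uint8_t *display_control_reg)
{
    uint8_t floor = *floor_data_reg;
    if (floor < 10) {  /* bounds check added by this sketch */
        *display_control_reg = floor_to_pattern_table[floor];
    }
}
```

On the system of Figure 9.5, the pointers would presumably hold the addresses of the memory-mapped registers; here they may simply point to ordinary variables that simulate them.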
[Figure: keypad controller with a data register (address 0x00000050, holding 00001000) and a status register (address 0x00000054, holding 00000000), connected to the keypad through electrical wires.]
following code shows an implementation of the read_keypad routine that employs the
busy waiting technique to wait for the keypad to be pressed before reading the keypad
controller’s data register. First, it reads the contents of the keypad status register
into register a0 (lines 8 and 9) and then it checks whether the READY bit is set by
performing a bitwise AND operation with the mask defined by the READY_MASK symbol
(line 10) and jumping back to the beginning of the routine in case the result is zero
(line 11). Next, it checks if the keypad was pressed more than once by performing a
bitwise AND operation with the mask defined by the OVRN_MASK symbol (line 12) and
jumping to the ovrn_occured label in case the result is not zero (line 13). Finally, it
reads the key value from the keypad controller’s data register (lines 14 and 15) and
returns.
1 .text
2 .set DATA_REG_PORT, 0x00000050
3 .set STAT_REG_PORT, 0x00000054
4 .set READY_MASK, 0b00000001
5 .set OVRN_MASK, 0b00000010
6
7 read_keypad:
8 li a0, STAT_REG_PORT # Reads the keypad
9 lb a0, 0(a0) # status into a0
10 andi t0, a0, READY_MASK # Checks the READY bit and loops
11 beqz t0, read_keypad # until it is equal to 1
12 andi t0, a0, OVRN_MASK # Checks the OVRN bit and jumps
13 bnez t0, ovrn_occured # to ovrn_occured if it is equal to 1
14 li a0, DATA_REG_PORT # Reads the key from the
15 lb a0, 0(a0) # data register into a0
16 ret # Return
17 ovrn_occured:
18 li a0, -1 # Returns -1
19 ret # Return
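The busy waiting logic of read_keypad can also be sketched in C. As an assumption of this sketch, the status and data registers are passed in as pointers (so the code can be exercised with simulated registers) rather than being fixed at addresses 0x00000054 and 0x00000050.

```c
#include <stdint.h>

#define READY_MASK 0x01u  /* bit 0 of the status register */
#define OVRN_MASK  0x02u  /* bit 1 of the status register */

/* Waits until the READY bit is set, then returns the key stored in the
   data register, or -1 if the OVRN bit indicates a data overrun. */
int read_keypad(volatile const uint8_t *status_reg,
                volatile const uint8_t *data_reg)
{
    uint8_t status;

    /* Busy waiting: keep re-reading the status register until READY. */
    do {
        status = *status_reg;
    } while ((status & READY_MASK) == 0);

    if (status & OVRN_MASK) {
        return -1;  /* the keypad was pressed more than once */
    }
    return *data_reg;
}
```

The volatile qualifier matters here: without it, a compiler could read the status register once and loop forever on the stale value.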
External Interrupts
10.1 Introduction
As discussed in previous chapters, the CPU fetches and executes instructions from the
main memory. In this context, most of the actions that happen in the system are
initiated by the CPU, as a result of executing instructions. For example, reading/writing
data from/to the main memory and from/to peripherals are events triggered by the
CPU when executing instructions. However, there are some events that are initiated
by other hardware components, such as peripherals. For example, in the system
discussed in Section 9.4, when a keypad key is pressed, the keypad controller records
this information in its registers. Even though this event was not
initiated by the CPU, it might require the CPU’s attention, i.e., it might require the
CPU to perform some action. Hence, there must be a way to inform the CPU that
the peripheral needs its attention.
To illustrate this concept, let us consider the computing system depicted in Figure 10.1,
which contains an RV32I CPU, the main memory, and a keypad.
[Figure 10.1: RV32I-based computing system with a keypad. The keypad controller’s data register is mapped to address 0x00000050 and its status register to address 0x00000054.]
The longer the program takes to read the data register contents, the higher
the chance of a data overrun. To prevent data overruns, it is customary to copy the
data register value to a first-in first-out (FIFO) queue¹ located in main memory
as soon as the keypad is pressed. This approach is illustrated in Figure 10.2, which
implements the FIFO queue using an 8-element circular buffer and two pointers, one
that points to the queue head (oldest element inserted) and another that points to
the queue tail (last element inserted). In this example, the keypad keys ‘1’, ‘9’, and
‘6’ have been pressed and stored on the queue.
Figure 10.2: Storing the contents of the data register on a queue located at the main
memory.
In this approach, whenever a key is pressed, its value is pushed into the queue’s
tail, and whenever the user program needs to read a key, it pops it from the queue’s
head, instead of reading from the keypad data register. Notice that the queue works
as a buffer that is capable of storing multiple key values, allowing the program to
perform longer computations before reading each key value. Figure 10.3 illustrates
what happens when the keypad key ‘9’ is pressed. First, the key ‘9’ is pressed (step 1).
Then, the keypad controller registers this information on the data and the status
registers (step 2). Finally, the CPU executes a routine that pushes the data register value
on the queue’s tail (step 3).
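The FIFO queue described above may be sketched in C as an 8-element circular buffer. The type and function names are hypothetical. Also, as a simplification, this sketch keeps the tail index pointing to the next free slot (rather than to the last element inserted) and uses an element counter to tell a full queue from an empty one.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 8

struct key_queue {
    uint8_t  items[QUEUE_SIZE];
    unsigned head;   /* index of the oldest element          */
    unsigned tail;   /* index of the next free slot          */
    unsigned count;  /* number of elements currently stored  */
};

/* Pushes a key on the queue's tail. Returns false if the queue is full. */
bool queue_push(struct key_queue *q, uint8_t key)
{
    if (q->count == QUEUE_SIZE) {
        return false;
    }
    q->items[q->tail] = key;
    q->tail = (q->tail + 1) % QUEUE_SIZE;  /* wrap around the buffer */
    q->count++;
    return true;
}

/* Pops the oldest key from the queue's head. Returns false if empty. */
bool queue_pop(struct key_queue *q, uint8_t *key)
{
    if (q->count == 0) {
        return false;
    }
    *key = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_SIZE;  /* wrap around the buffer */
    q->count--;
    return true;
}
```

With this sketch, pushing the keys 1, 9, and 6 and then popping three times returns them in the same order, which is the first-in first-out behavior the text relies on.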
Figure 10.3: Storing the key value on the queue when the keypad key ‘9’ is pressed.
Copying the value from the keypad’s data register to the queue located at the
main memory is usually performed by the CPU, through the execution of a routine.
In this context, whenever the keypad is pressed, the CPU must execute this routine as
soon as possible to prevent data overruns. There are two main methods to direct the
CPU attention to handle events caused by external hardware: Polling and Hardware
Interrupts.
10.1.1 Polling
Polling is a method in which the program is designed so that the CPU
periodically checks whether peripherals need attention. In this approach, the
program has to be designed so that it checks the peripherals that may need CPU
attention from time to time. For example, the program may contain a main loop that
repeatedly checks the peripherals and performs some computation. Whenever there
is a peripheral that needs attention, the program invokes a routine to handle the
peripheral. Algorithm 3 illustrates a program that employs polling to handle peripherals.
It is composed of a main loop (the outer while loop) that alternately checks
peripherals for attention and performs some computation.
Algorithm 3: Handling peripherals with polling.
1 while True do
2 // Handle peripherals
3 for p in Peripherals do
4 if needsAttention(p) then
5 handlePeripheral(p) ;
6 end
7 end
8 PerformSomeComputation();
9 end
Algorithm 4 illustrates code that employs polling to check and handle the keypad
periodically. In this case, the keypadPressed() function checks whether the keypad
READY bit is set. If so, it returns true and the program invokes the getKey()
and the pushKeyOnQueue() routines to read the contents of the data register and
push it to the queue’s tail. The Compute() routine represents the work that is done
by the program in the meantime.
Algorithm 4: Handling the keypad with polling.
1 while True do
2 if keypadPressed() then
3 k ← getKey() ;
4 pushKeyOnQueue(k) ;
5 end
6 Compute() ;
7 end
Notice that the amount of work performed by the Compute() routine affects the
frequency with which the keypad is checked. On the one hand, the longer the Compute()
routine takes to execute, the higher the chance of a data overrun. On the
other hand, breaking the computation so that each call to Compute() executes quickly
(e.g., performing just a small fraction of the computation every time it is invoked)
may cause a large overhead (checking peripherals may take a long time) and may make
the program hard to design and implement. As a consequence, polling is usually not
the best approach to check for and handle peripheral events.
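One iteration of the polling loop of Algorithm 4 may be sketched in C as follows. The keypad is simulated by a structure with two fields, and the sketch assumes that reading the key clears the READY bit; both are assumptions made for illustration, since the behavior of the real controller is not specified here. The names poll_once, keypad_pressed, get_key, and compute are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define READY_MASK 0x01u

/* Simulated keypad controller registers. */
struct keypad {
    volatile uint8_t status;
    volatile uint8_t data;
};

static unsigned work_done;  /* counts calls to compute() */

static bool keypad_pressed(const struct keypad *k)
{
    return (k->status & READY_MASK) != 0;
}

/* Assumption: reading the key clears the READY bit. */
static uint8_t get_key(struct keypad *k)
{
    k->status = (uint8_t)(k->status & ~READY_MASK);
    return k->data;
}

/* Stands for the work the program performs between checks. */
static void compute(void)
{
    work_done++;
}

/* One iteration of the main loop: checks the keypad, stores the key if
   there is one, and then performs a slice of computation. Returns true
   if a key was handled in this iteration. */
bool poll_once(struct keypad *k, uint8_t *keys, unsigned *num_keys,
               unsigned capacity)
{
    bool handled = false;
    if (keypad_pressed(k) && *num_keys < capacity) {
        keys[(*num_keys)++] = get_key(k);
        handled = true;
    }
    compute();
    return handled;
}
```

The longer compute() takes, the longer the keypad goes unchecked between two calls to poll_once, which is the trade-off discussed in the text.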
memory.
To illustrate this concept, let us consider the computing system depicted in Fig-
ure 10.4.
Figure 10.4: RV32I-based computing system with a keypad connected to the CPU
interrupt pin.
This system is very similar to the one presented in Figures 10.1, 10.2, and 10.3. The
main difference is that the CPU contains an interrupt pin and the keypad controller is
connected to the CPU interrupt pin (red arrow). The interrupt pin is an input pin
that informs the CPU whether or not there is an external interrupt. In this
context, whenever a key is pressed, the keypad controller sends a signal to the CPU
through the interrupt pin. The CPU hardware constantly monitors the interrupt
pin and, in case it receives an interrupt signal, it interrupts the current execution
flow to execute an interrupt service routine. The interrupt service routine³, or
ISR, is a software routine that handles the interrupt. There are several ways of
implementing ISRs; however, in general, they save the context of the executing
program (e.g., the contents of the CPU registers) on main memory, interact with the
peripheral that sent the interrupt signal, and, finally, restore the saved context so
that the CPU continues executing the program that was interrupted.
Algorithm 5 illustrates how the CPU instruction execution cycle presented in
Section 1.2 may be adapted to detect external interrupts. In this example, before fetching
an instruction for execution, the CPU verifies if the interrupt pin is set, i.e., if the CPU
received an interrupt signal, and if interrupts are enabled, i.e., if the interrupts_enabled
register is set. If both conditions are met, it saves the contents of the program
counter (PC) into the SAVED_PC register, sets the PC register with the address of
the interrupt service routine (ISR_ADDRESS), and disables interrupts by clearing the
interrupts_enabled register. As a result, the next instruction that will be fetched
NOTE: CPUs usually disable interrupts on power-up to allow the boot software
to configure the hardware and register the proper ISRs before the system
tries to handle interrupts.
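The interrupt check performed before each instruction fetch can be simulated with a small C sketch. The structure, the function name, and the value chosen for ISR_ADDRESS are hypothetical; the sketch also assumes 4-byte instructions, so the PC advances by four after each fetch.

```c
#include <stdbool.h>
#include <stdint.h>

#define ISR_ADDRESS 0x00000100u  /* hypothetical address of the ISR */

/* Simulated CPU state relevant to interrupt detection. */
struct cpu {
    uint32_t pc;
    uint32_t saved_pc;
    bool interrupts_enabled;
    bool interrupt_pin;
};

/* Performs the check that precedes a fetch and returns the address of
   the instruction that will be fetched next. */
uint32_t next_fetch_address(struct cpu *c)
{
    if (c->interrupt_pin && c->interrupts_enabled) {
        c->saved_pc = c->pc;            /* remember where to resume  */
        c->pc = ISR_ADDRESS;            /* redirect execution to ISR */
        c->interrupts_enabled = false;  /* disable interrupts        */
    }
    uint32_t fetch_address = c->pc;
    c->pc += 4;  /* assumption: every instruction is 4 bytes long */
    return fetch_address;
}
```

Because interrupts are disabled as soon as the ISR is entered, a pin that is still set does not redirect the CPU again while the ISR is executing.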
SW-only design
In the SW-only design, the ISR is responsible for identifying which peripheral inter-
rupted the CPU, and invoking the proper routine to handle the interrupt. In this
approach, upon an interrupt, the CPU invokes a generic ISR that must perform both
tasks. Since there is no hardware support to identify which peripheral interrupted
the CPU, the ISR may have to interact with all peripherals to find out which one is
requiring the CPU attention. Once the ISR finds out which peripheral interrupted
the CPU, it can invoke the proper interrupt service routine to handle the peripheral
interrupt.
The main advantage of this approach is that it simplifies the CPU hardware design,
which is usually an important goal since hardware bugs are hard to find and do not
allow easy patching once the CPU is manufactured and sold. Nonetheless, in case
there are several peripherals or peripherals are slow, the ISR may take a long time
trying to figure out which peripheral interrupted the CPU. This may affect overall
system performance and may even cause the system to lose data due to data overruns,
as discussed in previous sections.
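A generic ISR for the SW-only design may be sketched in C as a loop that queries each peripheral in turn. The peripheral structure, the READY_MASK convention, and the handler shown here are assumptions made for illustration; real peripherals expose their own status registers and protocols.

```c
#include <stddef.h>
#include <stdint.h>

#define READY_MASK 0x01u

/* Simulated peripheral: a status register plus the routine that
   handles its interrupts. */
struct peripheral {
    volatile uint8_t status;
    void (*handler)(struct peripheral *);
    unsigned handled;  /* counts handler invocations (for the sketch) */
};

/* Hypothetical handler: acknowledges the peripheral by clearing READY. */
static void default_handler(struct peripheral *p)
{
    p->status = (uint8_t)(p->status & ~READY_MASK);
    p->handled++;
}

/* Generic ISR: interacts with every peripheral until it finds the one
   that needs attention, then invokes its handler. Returns the index of
   the peripheral that was handled, or -1 if none needed attention. */
int generic_isr(struct peripheral *peripherals, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (peripherals[i].status & READY_MASK) {
            peripherals[i].handler(&peripherals[i]);
            return (int)i;
        }
    }
    return -1;
}
```

The cost of the loop grows with the number of peripherals, which is the performance concern raised in the text.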
SW/HW design
In the SW/HW design, the ISR is also responsible for performing both tasks; however,
the hardware provides some support to identify which peripheral interrupted the
CPU. In this case, upon an interrupt, the hardware sets a register⁴ with a value
that indicates which peripheral generated the interrupt. Consequently, the ISR may
simply read this register to find out which peripheral generated the interrupt. Once
the ISR finds out which peripheral interrupted the CPU, it can invoke the proper
interrupt service routine to handle the peripheral interrupt.
Both the SW-only and the SW/HW designs jump to a single generic ISR. This
approach is known as “direct mode” in RISC-V terminology. Algorithm 5 illustrates
how this approach may be implemented in hardware.
The CPU hardware design may not be as simple as in the SW-only approach;
however, in this approach, the ISR takes very little time (usually the time required
to execute one or two instructions) to figure out which peripheral sent the interrupt
signal.
HW-only design
In the HW-only design, the hardware is responsible for identifying which peripheral
interrupted the CPU and for invoking the proper ISR. In this case, each peripheral is
associated with an interrupt identifier⁵ and the CPU must automatically map this
identifier to its respective ISR. This is usually performed with a table, often called
interrupt vector table, that maps the interrupt identifier to the address of the ISR⁶.
To illustrate this concept, let us consider a system in which each peripheral is
associated with a unique interrupt identifier that may range from 0 to 15, and in which
the CPU automatically registers the interrupt identifier on the INTERRUPT_ID register
whenever an interrupt signal is received. Also, there is an array on main memory,
called interrupt vector table, that contains in position i the address of the ISR that
must be invoked to handle interrupts from the peripheral that is associated with
interrupt identifier i. The system also contains a register called INT_TABLE_BASE that
stores the interrupt vector table base address. In this context, to invoke the proper
ISR, the CPU may load the ISR address from the interrupt vector table using the
interrupt identifier. Algorithm 6 illustrates how a CPU may automatically load the
address of the proper ISR by accessing the interrupt vector table. The CPU multiplies
the contents of the INTERRUPT_ID register by four because each entry in the interrupt
vector table holds a 32-bit address, i.e., it occupies four bytes.
4 This register may be an internal CPU register or a register on an interrupt controller, which is
On power-up, before enabling interrupts, the boot software must write the interrupt
vector table on main memory and set the INT_TABLE_BASE register with its base
address.
The main advantage of this approach is performance, since the CPU directly invokes
the proper ISR upon an interrupt. However, the CPU hardware design usually
becomes more complicated.
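The table lookup described above can be sketched as a short Python model. Main memory is modeled as a dictionary from byte addresses to 32-bit words, and every address used here (the table base and the ISR addresses) is made up for the example.

```python
# Model of the vector table lookup: PC <- Mem[INT_TABLE_BASE + 4 * INTERRUPT_ID].
# All addresses below are hypothetical.

ENTRY_SIZE = 4  # each table entry holds one 32-bit ISR address

def install_vector_table(memory, base, isr_addresses):
    """Boot-time step: write one ISR address per interrupt identifier."""
    for i, addr in enumerate(isr_addresses):
        memory[base + ENTRY_SIZE * i] = addr

def isr_address(memory, int_table_base, interrupt_id):
    """What the CPU does on an interrupt: load the entry for this identifier."""
    if not 0 <= interrupt_id <= 15:
        raise ValueError("interrupt identifier out of range")
    return memory[int_table_base + ENTRY_SIZE * interrupt_id]

memory = {}
INT_TABLE_BASE = 0x8000
isrs = [0x1000 + 0x40 * i for i in range(16)]  # 16 hypothetical ISRs
install_vector_table(memory, INT_TABLE_BASE, isrs)

pc = isr_address(memory, INT_TABLE_BASE, 3)
print(hex(pc))
```

Because the table lives in main memory, the boot software can point any identifier at a different ISR simply by rewriting one entry.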
Bit(s)   Field      Width (bits)
31       SD         1
30-23    WPRI       8
22       TSR        1
21       TW         1
20       TVM        1
19       MXR        1
18       SUM        1
17       MPRV       1
16-15    XS[1:0]    2
14-13    FS[1:0]    2
12-11    MPP[1:0]   2
10-9     WPRI       2
8        SPP        1
7        MPIE       1
6        WPRI       1
5        SPIE       1
4        UPIE       1
3        MIE        1
2        WPRI       1
1        SIE        1
0        UIE        1
Bit fields of the mstatus CSR.
• mcause (Machine Interrupt Cause): The machine interrupt cause CSR stores
the interrupt cause, i.e., a value that identifies why an interrupt was generated.
It has two fields: mcause.EXCCODE (bits 0 to 30) and mcause.INTERRUPT (bit
31). The mcause.INTERRUPT subfield specifies whether the interrupt is an actual
interrupt (1) or an exception (0)⁷. The mcause.EXCCODE subfield specifies
the interrupt (or exception) identifier. In Machine mode, interrupts caused by
peripherals are classified as “Machine external interrupt” and the values registered
on the mcause.INTERRUPT and mcause.EXCCODE subfields are 0x1 and 0xB,
respectively.
• mtvec (Machine Trap Vector): The machine trap vector CSR stores information
that allows the CPU to identify the proper interrupt service routine address
when an interrupt occurs. It has two fields: mtvec.MODE (bits 0 to 1) and
mtvec.BASE (bits 2 to 31). The mtvec.MODE subfield specifies whether the CPU is
working in the direct (00) or vectored (01) mode. In the direct mode, upon an
interrupt, the CPU sets the PC with the contents of the mtvec.BASE subfield. In the
vectored mode, upon an interrupt, the CPU sets the PC with mtvec.BASE + (4
× mcause.EXCCODE).
7 In RISC-V terminology, exceptions are interrupts caused by the CPU. We will discuss these kind
– mie.MEIE: The Machine External Interrupt Enabled subfield (bit 11) con-
trols whether the CPU must accept or ignore external interrupts.
– mie.MTIE: RISC-V CPUs contain an internal timer that may be configured
to generate interrupts. The Machine Timer Interrupt Enabled subfield (bit
7) controls whether the CPU must accept or ignore interrupts from this
timer.
– mie.MSIE: The Machine Software Interrupt Enabled subfield (bit 3) controls
whether the CPU must accept or ignore software interrupts⁸ in Machine
mode.
• mip (Machine Interrupt Pending): The machine interrupt pending CSR registers
which interrupts are pending, i.e., they have been signaled but not handled by
the CPU yet. The following subfields indicate the status of pending interrupts
in Machine mode:
– mip.MEIP: The Machine External Interrupt Pending subfield (bit 11) indi-
cates whether an external interrupt is pending.
– mip.MTIP: The Machine Timer Interrupt Pending subfield (bit 7) indicates
whether a timer interrupt is pending.
– mip.MSIP: The Machine Software Interrupt Pending subfield (bit 3) indi-
cates whether a software interrupt is pending.
• mepc (Machine Exception Program Counter): Upon an interrupt, the CPU saves
the contents of the PC register into the machine exception program counter
CSR.
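To make the bit positions listed above concrete, the following Python sketch splits an mcause value into its two subfields and combines mip and mie to find interrupts that are both pending and enabled. The function names are ours; the bit positions are the ones given in the CSR descriptions.

```python
# Bit positions of the machine-level external, timer, and software
# interrupt bits, shared by the mie and mip CSRs.
MEI_BIT, MTI_BIT, MSI_BIT = 11, 7, 3

def decode_mcause(mcause):
    """Split a 32-bit mcause value into (INTERRUPT, EXCCODE)."""
    interrupt = (mcause >> 31) & 0x1   # bit 31
    exccode = mcause & 0x7FFFFFFF      # bits 0 to 30
    return interrupt, exccode

def deliverable(mip, mie):
    """Bits of the interrupts that are both pending (mip) and enabled (mie)."""
    return mip & mie

machine_external = (1 << 31) | 0xB     # value set for a peripheral interrupt
fields = decode_mcause(machine_external)
print(fields)

mip = (1 << MEI_BIT) | (1 << MTI_BIT)  # external and timer interrupts pending
mie = 1 << MEI_BIT                     # only external interrupts enabled
ready = deliverable(mip, mie)
print(bool(ready & (1 << MEI_BIT)), bool(ready & (1 << MTI_BIT)))
```

In this example the timer interrupt stays pending in mip but is not delivered, because its enable bit in mie is clear.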
subfield (line 2). If it is set (‘1’), then it checks for interrupts (lines 3-20), otherwise, it
ignores interrupts and proceeds with the normal instruction execution cycle (line 22).
Assuming mstatus.MIE = ‘1’, in case there is an external interrupt pending (mip.MEIP
= ‘1’) and external interrupts are enabled (mie.MEIE = ‘1’) (line 4), then the CPU
handles the interrupt (lines 5-19). When handling an interrupt, the CPU first saves
the value of the mstatus.MIE subfield into mstatus.MPIE and clears it so that new
interrupts are ignored⁹ (lines 6 and 7). Then, the CPU saves the contents of the PC
into the mepc CSR (line 8), and sets the mcause CSR (lines 10 and 11). Finally, it
changes the PC register so it points to the first instruction of the interrupt service
routine (lines 13-19).
Algorithm 7: RV32I CPU external interrupt handling flow.
 1 while True do
 2   if mstatus.MIE = ‘1’ then
 3     // Check for external interrupts
 4     if (mip.MEIP = ‘1’) and (mie.MEIE = ‘1’) then
 5       // Save part of the context and ignore new interrupts
 6       mstatus.MPIE ← mstatus.MIE ;
 7       mstatus.MIE ← ‘0’ ;
 8       mepc ← PC ;
 9       // Sets the interrupt cause
10       mcause.INTERRUPT ← ‘1’ ;
11       mcause.EXCCODE ← ‘0xB’ ;
12       // Change PC to execute the ISR
13       if mtvec.MODE = ‘0’ then
14         // Direct mode (0)
15         PC ← mtvec.BASE ;
16       else
17         // Vectored mode (1)
18         PC ← mtvec.BASE + (4 × mcause.EXCCODE) ;
19       end
20     end
21   end
22   // Fetch instruction and update PC
23   IR ← MainMemory[PC] ;
24   PC ← PC+4;
25   ExecuteInstruction(IR);
26 end
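The interrupt check of Algorithm 7 (lines 2-21) can be exercised with a small Python model. The CPU state is a plain dictionary whose keys are our own naming, and mtvec_base is assumed to hold a byte address (the BASE field with the two low bits taken as zero); the fetch and execute steps are left out.

```python
def make_cpu(pc=0):
    """CPU state holding only the fields Algorithm 7 touches."""
    return {"pc": pc, "MIE": 0, "MPIE": 0, "MEIP": 0, "MEIE": 0,
            "mepc": 0, "mcause": 0, "mtvec_base": 0, "mtvec_mode": 0}

def check_external_interrupt(cpu):
    """Lines 2-21 of Algorithm 7. Returns True if the interrupt was taken."""
    if not (cpu["MIE"] and cpu["MEIP"] and cpu["MEIE"]):
        return False
    # Save part of the context and ignore new interrupts
    cpu["MPIE"] = cpu["MIE"]
    cpu["MIE"] = 0
    cpu["mepc"] = cpu["pc"]
    # Set the interrupt cause: INTERRUPT = 1, EXCCODE = 0xB
    exccode = 0xB
    cpu["mcause"] = (1 << 31) | exccode
    # Change the PC to execute the ISR
    if cpu["mtvec_mode"] == 0:
        cpu["pc"] = cpu["mtvec_base"]                # direct mode
    else:
        cpu["pc"] = cpu["mtvec_base"] + 4 * exccode  # vectored mode
    return True

cpu = make_cpu(pc=0x100)
cpu.update(MIE=1, MEIE=1, MEIP=1, mtvec_base=0x2000, mtvec_mode=1)
taken = check_external_interrupt(cpu)
taken_again = check_external_interrupt(cpu)  # MIE is now 0, so it is ignored
print(taken, taken_again, hex(cpu["pc"]), hex(cpu["mepc"]))
```

The second call returns False because the first one cleared mstatus.MIE, which is how the hardware keeps a new interrupt from preempting the ISR before it has saved the context.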
There are several strategies that may be employed to save the registers’ contents
in the main memory. In our discussion, we will assume there is a dedicated stack
for interrupt service routines. This stack, called here ISR stack, is allocated on main
memory on a set of addresses that does not collide with the addresses used by other
programs running on the system. In this way, whenever an interrupt occurs, the
interrupt service routine can safely save the context of the currently executing program
into the ISR stack.
To push values into the ISR stack, we must first make the SP register point to the
top of the ISR stack. In the RV32I ISA, this task can be performed with help from
the mscratch CSR. To do so, we first configure the system so that the mscratch CSR
points to the top of the ISR stack on power-up. Then, at the beginning/end of the
interrupt service routine, we exchange the value of mscratch and SP by executing the
csrrw instruction. The following code illustrates this process. First (line 3), the ISR
swaps the sp and the mscratch registers’ contents so that the sp register points to the
top of the ISR stack and the mscratch points to the top of the previous program stack.
Then, the ISR allocates space on the ISR stack and saves all the necessary context
(lines 4-7). After this, it identifies the interrupt source by inspecting the mcause CSR
and invokes the specialized ISR to handle the interrupt (lines 9-11). Finally, the ISR
restores the context by loading the registers’ values from the ISR stack, swapping the
mscratch and sp registers’ contents, and executing the mret instruction.
1 main_isr:
2   # Saves the context
3   csrrw sp, mscratch, sp   # Exchange sp with mscratch
4   addi sp, sp, -64         # Allocates space at the ISR stack
5   sw a0, 0(sp)             # Saves a0
6   sw a1, 4(sp)             # Saves a1
7   ...
8
The mret instruction is a special instruction that recovers the context that was
automatically saved by the CPU hardware. More specifically, it recovers the mstatus.MIE
subfield contents by copying the value from mstatus.MPIE and the PC register’s
contents by copying the value from the mepc register¹⁰.
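The stack swap and the effect of mret can be modeled in a few lines of Python. This is a simplified sketch: the dictionaries and addresses are assumptions for the example, and mret is modeled with only the two effects described above (the real instruction also updates other mstatus fields).

```python
def csrrw_swap(regs, csrs, reg, csr):
    """Model of `csrrw reg, csr, reg`: exchange a register with a CSR."""
    regs[reg], csrs[csr] = csrs[csr], regs[reg]

def mret(state):
    """Simplified mret: MIE <- MPIE and PC <- mepc."""
    state["MIE"] = state["MPIE"]
    state["pc"] = state["mepc"]

regs = {"sp": 0x7FF0}        # stack pointer of the interrupted program
csrs = {"mscratch": 0x9000}  # top of the ISR stack, configured at boot

csrrw_swap(regs, csrs, "sp", "mscratch")  # ISR entry: sp now uses the ISR stack
isr_sp, saved_sp = regs["sp"], csrs["mscratch"]
csrrw_swap(regs, csrs, "sp", "mscratch")  # ISR exit: original sp is back

state = {"MIE": 0, "MPIE": 1, "pc": 0x2000, "mepc": 0x100}
mret(state)
print(hex(isr_sp), hex(regs["sp"]), hex(state["pc"]), state["MIE"])
```

Note that the ISR must release the space it allocated on the ISR stack before the second swap; otherwise mscratch would no longer point to the top of the ISR stack for the next interrupt.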
on the mtvec CSR. Assuming the main_isr routine starts on an address that is a
multiple of four¹¹, the following code shows how to write the address of the main_isr
routine on the mtvec CSR and configure it to work in direct mode. Since main_isr
starts on an address that is a multiple of four, the two least significant bits of the
address are zero; hence, by writing this value into the mtvec CSR we are configuring
the mtvec.MODE subfield to work in the direct mode.
To configure the system to work with the vectored mode, the base address of the
interrupt vector table may be loaded into a register and its least significant bit set
before writing the register’s value into the mtvec CSR. The following code illustrates
this process. In this case, the base address of the interrupt vector table, represented by
the ivt label, is first loaded into register t0. Then, its least significant bit is set by
the ori instruction and the final value is written into the mtvec CSR using the csrw
instruction.
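The bit manipulation behind both configurations can be checked with a short Python model. It shows why a four-byte aligned address selects direct mode as is, and how setting the least significant bit (the effect of the ori step) selects vectored mode. The addresses chosen for main_isr and ivt are hypothetical.

```python
MODE_DIRECT, MODE_VECTORED = 0, 1

def mtvec_value(base_address, mode):
    """Value to write into mtvec: aligned base address plus the MODE bits."""
    if base_address % 4 != 0:
        raise ValueError("base address must be a multiple of four")
    return base_address | mode

def trap_target(mtvec, exccode):
    """PC chosen by the CPU for an interrupt with the given EXCCODE."""
    base = mtvec & ~0x3 & 0xFFFFFFFF  # clear the two MODE bits
    mode = mtvec & 0x3
    return base if mode == MODE_DIRECT else base + 4 * exccode

MAIN_ISR = 0x1000  # hypothetical address of main_isr
IVT = 0x3000       # hypothetical address of the ivt label

direct = mtvec_value(MAIN_ISR, MODE_DIRECT)
vectored = mtvec_value(IVT, MODE_VECTORED)
print(hex(direct), hex(vectored))
print(hex(trap_target(direct, 0xB)), hex(trap_target(vectored, 0xB)))
```

With these values, a machine external interrupt (EXCCODE 0xB) jumps to main_isr in direct mode and to the twelfth entry of the table in vectored mode.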
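.section .bss
.align 4                     # Aligns the ISR stack to 16 bytes (2^4)
isr_stack:                   # Lowest address of the ISR stack
    .skip 1024               # Reserves 1024 bytes for the ISR stack
isr_stack_end:               # Initial top of the ISR stack

.section .text
.align 2                     # Aligns the code to 4 bytes (2^2)
start:
    la   t0, isr_stack_end   # t0 <- address of the top of the ISR stack
    csrw mscratch, t0        # mscratch <- t0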
Enabling interrupts
Once peripherals that generate interrupt signals are properly configured, and the
interrupt service routine and the ISR stack are set, the initialization code must enable
the mie.MEIE and the mstatus.MIE subfields to allow the CPU to handle external
interrupts. The following code shows how this process can be performed.
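As a sanity check on the bit positions involved, the following Python sketch models the two CSR updates as bitwise OR operations on plain integers. The dictionary of CSRs and the function name are assumptions made for the example; the bit positions come from the mie and mstatus descriptions.

```python
MIE_MEIE = 1 << 11     # mie.MEIE: machine external interrupt enable (bit 11)
MSTATUS_MIE = 1 << 3   # mstatus.MIE: global machine interrupt enable (bit 3)

def enable_external_interrupts(csrs):
    """Set both enable bits, leaving every other bit untouched."""
    csrs["mie"] |= MIE_MEIE
    csrs["mstatus"] |= MSTATUS_MIE  # global enable last
    return csrs

csrs = enable_external_interrupts({"mie": 0x0, "mstatus": 0x1800})
print(hex(csrs["mie"]), hex(csrs["mstatus"]))
```

Setting the global mstatus.MIE bit last is a design choice in this sketch: it ensures that no interrupt is accepted before the individual enable bits are in their final state.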
specifies that RV32I instructions must be stored on addresses that are multiples of four.
12 The ilp32 ABI specifies that the stack pointer must always contain an address that is a multiple
of 16.
As discussed in Section 5, many computer systems are organized so that the software
is divided into user and system software. The system software (e.g., the operating
system kernel and device drivers) is the software responsible for protecting and man-
aging the whole system, including interacting with peripherals to perform input and
output operations and loading and scheduling user applications for execution. The
user software is usually limited to performing operations with data that is located
on registers and the main memory. Whenever the user software needs to perform a
procedure that requires interacting with other parts of the system, such as reading
data from a file or showing information on the computer display, it invokes the system
software to perform the procedure on its behalf.
This chapter discusses the hardware mechanisms that protect the system from
faulty or malicious user programs and how to program these mechanisms.
• U: User/Application;
• S: Supervisor; and
• M: Machine
The Machine privilege level has the highest privileges, allowing full access to the
hardware. The Supervisor privilege level has the second-highest privileges, and the
User/Application privilege level has the least privileges.
A RISC-V hardware platform may implement a subset or all of these privilege
levels. For example, when implementing a hardware platform for a compact and
straightforward embedded system, only the Machine privilege level may be required.
On the other hand, when implementing a hardware platform for a system that re-
lies on an operating system to manage applications (e.g., a computer desktop), it is
usually useful to include all three privilege levels to facilitate the operating system
implementation.
CHAPTER 11. SOFTWARE INTERRUPTS AND EXCEPTIONS
The RISC-V privilege mode defines the privilege level for the currently executing
software. For example, when the Machine privilege mode is active, the currently
executing software has the Machine privilege level and, hence, full access to the hardware.
The unprivileged mode is the privilege mode with the least privileges. In RISC-V,
the unprivileged mode is the User/Application privilege mode, also known as
user-mode or U-mode. The unprivileged ISA is the subset of the Instruction
Set Architecture accessible by the software running in unprivileged mode.
To simplify the discussion, the remainder of this chapter will focus on RISC-V
processors that have only two privilege modes: User/Application and Machine mode.
• Configuring the system: on power on, the hardware automatically sets the
privilege mode to Machine mode and starts executing the boot code. The boot
code loads the operating system software into memory and invokes its initial-
ization code in Machine mode, which allows the operating system to configure
the whole system.
• Executing user code: once the system is set, the operating system may load
user programs into main memory and execute them. However, before transfer-
ring control to execute the user code, it sets the privilege mode as User/Appli-
cation mode.
• Handling illegal operations: in case the user software tries to perform a
privileged operation, such as interacting with peripherals, the hardware stops
executing the user code and invokes the operating system so it can handle the
illegal operation. The hardware transfers control to the operating system
using the exception handling mechanism, as discussed in Section 11.3.
• Invoking the operating system: if the user program needs to perform a
sensitive procedure, such as an output to a peripheral, it must invoke the operating
system, which will perform the procedure on the user program’s behalf.
When transferring control to the operating system, the hardware must change
the privilege mode to Supervisor or Machine mode so the operating system may
execute with proper privilege. To do so, ISAs usually include a mechanism,
called software interrupt, that allows code running in unprivileged mode to invoke
the system code and change the privilege mode at the same time. This
mechanism is designed so that the user code cannot change the privilege mode
and then execute its own code. Once the operating system finishes performing the
procedure on the user program’s behalf, it may change the privilege mode back
to the User/Application mode and return to the user program.
• Handling external interrupts: upon an external interrupt, the hardware
sets the privilege mode as Machine mode, so the interrupt service routine has
enough privilege to handle the interrupt. Notice that the interrupt service
routines belong to the system software.
11.3 Exceptions
Exceptions are events generated by the CPU in response to exceptional
conditions when executing instructions. Trying to execute an illegal instruction¹,
for example, is a condition that causes a RISC-V CPU to generate an exception.
1 An illegal instruction is an instruction that is not recognized by the CPU.
Notice that the exception handling flow is very similar to the one employed to
handle hardware interrupts. In fact, RISC-V CPUs use the same mechanism to
handle both interrupts and exceptions, i.e., they save part of the current context (e.g.,
PC contents), set the mcause CSR, and redirect the execution flow to an interrupt
service routine. As discussed in Section 10.3.4, the interrupt service routine can
distinguish between an interrupt and an exception by inspecting the mcause Control
and Status Register. More specifically, the mcause.INTERRUPT CSR field indicates
whether the CPU is handling an interrupt or an exception. Also, the mcause.EXCCODE
CSR field indicates the source of the interrupt or the exception. There may be several
sources of exceptions and interrupts on RISC-V. Table 11.1 shows the sources of
interrupts and exceptions³ and their respective codes on the mcause CSR.
The exception handling mechanism is usually employed to protect the system from
illegal user code operations. In this context, the hardware is configured by the system
software to generate exceptions in case the privilege mode is set as User/Application
and the CPU tries to execute certain operations, such as accessing addresses that
are mapped to peripheral devices or accessing Control and Status Registers that can
only be accessed in Machine mode. Upon an exception, the interrupt service routine,
which belongs to the system software, may decide what to do with the user program.
mcause.INTERRUPT   mcause.EXCCODE   Cause
1                  0                User software interrupt
1                  1                Supervisor software interrupt
1                  2                Reserved for future standard use
1                  3                Machine software interrupt
1                  4                User timer interrupt
1                  5                Supervisor timer interrupt
1                  6                Reserved for future standard use
1                  7                Machine timer interrupt
1                  8                User external interrupt
1                  9                Supervisor external interrupt
1                  10               Reserved for future standard use
1                  11               Machine external interrupt
1                  12-15            Reserved for future standard use
1                  ≥16              Reserved for platform use
0                  0                Instruction address misaligned
0                  1                Instruction access fault
0                  2                Illegal instruction
0                  3                Breakpoint
0                  4                Load address misaligned
0                  5                Load access fault
0                  6                Store/AMO address misaligned
0                  7                Store/AMO access fault
0                  8                Environment call from U-mode
0                  9                Environment call from S-mode
0                  10               Reserved
0                  11               Environment call from M-mode
0                  12               Instruction page fault
0                  13               Load page fault
0                  14               Reserved for future standard use
0                  15               Store/AMO page fault
0                  16-23            Reserved for future standard use
0                  24-31            Reserved for custom use
0                  32-47            Reserved for future standard use
0                  48-63            Reserved for custom use
0                  ≥64              Reserved for future standard use
Table 11.1: Sources of interrupts and exceptions and their codes on the mcause CSR.
The following code shows how the system software may change the privilege mode
to User/Application mode and simultaneously invoke a user program. First, it sets
the mstatus.MPP CSR field with “00”, the User/Application mode code. Then, it
loads the user software entry point into the mepc CSR. Finally, it executes the mret
instruction, which changes the mode using the code in the mstatus.MPP CSR field
and changes the program counter using the value in the mepc CSR at the same time.
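The three steps described above can be traced with a small Python model. The state dictionary, the entry point address, and the reduced mret (which only applies the two effects named in the text) are assumptions made for this sketch; the position of the MPP field (bits 11 and 12) comes from the mstatus layout.

```python
MPP_SHIFT = 11
MPP_MASK = 0b11 << MPP_SHIFT  # mstatus.MPP occupies bits 11 and 12
USER, MACHINE = 0b00, 0b11    # privilege mode codes

def set_mpp(mstatus, mode):
    """Return mstatus with the MPP field replaced by `mode`."""
    return (mstatus & ~MPP_MASK) | (mode << MPP_SHIFT)

def mret(cpu):
    """Simplified mret: mode <- mstatus.MPP and PC <- mepc, in one step."""
    cpu["mode"] = (cpu["mstatus"] & MPP_MASK) >> MPP_SHIFT
    cpu["pc"] = cpu["mepc"]

# System software running in Machine mode; MPP initially holds "11".
cpu = {"mode": MACHINE, "pc": 0x80, "mepc": 0x0, "mstatus": MPP_MASK | 0x8}

cpu["mstatus"] = set_mpp(cpu["mstatus"], USER)  # 1. MPP <- "00"
cpu["mepc"] = 0x4000                            # 2. hypothetical user entry point
mret(cpu)                                       # 3. switch mode and jump
print(cpu["mode"], hex(cpu["pc"]), hex(cpu["mstatus"]))
```

Because the mode change and the jump happen in the same instruction, there is no point at which user code runs with Machine privileges or system code runs with User privileges.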
from a protected address, the system generates an “Instruction access fault” exception
(mcause.EXCCODE=1).
Whenever an exception is generated, the RISC-V CPU:
• Saves the current program counter into the mepc CSR.
• Sets the mcause CSR with the code that identifies the exception source.
• Saves the current mode into the mstatus.MPP CSR field.
• Changes the mode to Machine mode.
• Sets the program counter to redirect the execution to the exception handling
routine.
Some exceptions may also set the Machine Trap Value (mtval) CSR with extra
information about the exception. For example, when a load or store access exception
occurs, the mtval CSR is set with the faulting virtual address.
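The steps listed above can be sketched as a Python function that updates a dictionary holding the CPU state. The key names, the addresses, and the assumption that mtvec is configured in direct mode are ours; the exception code used in the example (5, load access fault) comes from Table 11.1.

```python
USER, MACHINE = 0b00, 0b11  # privilege mode codes

def raise_exception(cpu, exccode, tval=None):
    """Model of what the CPU does when an exception is generated."""
    cpu["mepc"] = cpu["pc"]        # save the current program counter
    cpu["mcause"] = exccode        # INTERRUPT bit (31) is 0 for exceptions
    cpu["mpp"] = cpu["mode"]       # save the current mode in mstatus.MPP
    cpu["mode"] = MACHINE          # switch to Machine mode
    if tval is not None:
        cpu["mtval"] = tval        # extra information, e.g. faulting address
    cpu["pc"] = cpu["mtvec_base"]  # jump to the handler (direct mode assumed)

# User code at 0x4010 tries to load from a protected, hypothetical address.
cpu = {"pc": 0x4010, "mode": USER, "mpp": MACHINE, "mepc": 0,
       "mcause": 0, "mtval": 0, "mtvec_base": 0x1000}
raise_exception(cpu, exccode=5, tval=0x10000000)
print(hex(cpu["mepc"]), cpu["mcause"], cpu["mode"], hex(cpu["pc"]))
```

After this sequence the handler has everything it needs to decide what to do: where the fault happened (mepc), why (mcause), which address was involved (mtval), and which mode was running (mstatus.MPP).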
Depending on the exception, it may make sense to fix the problem that caused
the exception and return the execution to the software that caused the exception to
continue its execution. Page faults are examples of exceptions that can be handled by
the system, so the software that caused the exception may continue its execution. In
these cases, handling the exception is usually similar to handling an external interrupt,
i.e. the exception handling routine must save the context, handle the exception, and,
finally, recover the context so the software running on the CPU may continue its
execution. In cases where illegal operations generate exceptions, and there is no
known way to recover from the problem, the operating system may kill the offending
process, i.e., the process that tried to execute the illegal operation.
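The decision described above can be sketched as a Python function. The two callbacks stand in for operating system services and are purely hypothetical; the page fault codes come from Table 11.1.

```python
PAGE_FAULTS = {12, 13, 15}  # instruction, load, and store/AMO page faults

def handle_exception(exccode, mepc, map_page, kill_process):
    """Return the address to resume at, or None if the process was killed."""
    if exccode in PAGE_FAULTS:
        map_page()      # fix the cause of the exception
        return mepc     # re-execute the instruction that faulted
    kill_process()      # no known way to recover
    return None

events = []
resume = handle_exception(13, 0x4010,
                          map_page=lambda: events.append("mapped"),
                          kill_process=lambda: events.append("killed"))
fatal = handle_exception(2, 0x4020,
                         map_page=lambda: events.append("mapped"),
                         kill_process=lambda: events.append("killed"))
print(resume, fatal, events)
```

For a recoverable exception the handler returns to the address saved in mepc, so the faulting instruction runs again, this time successfully.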
[1] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual.
Volume 1: Basic Architecture, September 2016.
[2] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual.
Volume 2A: Instruction Set Reference, A-L, September 2016.
[3] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual.
Volume 2B: Instruction Set Reference, M-U, September 2016.
[4] Andrew Waterman and Krste Asanović. The RISC-V Instruction Set Manual.
Volume I: Unprivileged ISA, version 20191213. Technical report, SiFive Inc., 2019.
[5] Andrew Waterman and Krste Asanović. The RISC-V Instruction Set Manual.
Volume II: Privileged Architecture, document version 20190608-priv-msu-ratified.
Technical report, SiFive Inc., 2019.
Appendix A
The next pages contain a reference card for the RV32IM ISA.
RV32IM assembly instructions reference card
Prof. Edson Borin
Institute of Computing - Unicamp