Low-Level Programming
C, Assembly, and Program Execution on Intel® 64 Architecture
Igor Zhirkov
Low-Level Programming: C, Assembly, and Program Execution on Intel® 64 Architecture
Igor Zhirkov
Saint Petersburg, Russia
ISBN-13 (pbk): 978-1-4842-2402-1 ISBN-13 (electronic): 978-1-4842-2403-8
DOI 10.1007/978-1-4842-2403-8
Library of Congress Control Number: 2017945327
Copyright © 2017 by Igor Zhirkov
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material
contained herein.
Cover image designed by Freepik
Managing Director: Welmoed Spahr
Editorial Director: Todd Green
Acquisitions Editor: Robert Hutchinson
Development Editor: Laura Berendson
Technical Reviewer: Ivan Loginov
Coordinating Editor: Rita Fernando
Copy Editor: Lori Jacobs
Compositor: SPi Global
Indexer: SPi Global
Distributed to the book trade worldwide by Springer Science+Business Media New York,
233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail
orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC
and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc).
SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail rights@apress.com, or visit http://www.apress.com/
rights-permissions.
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and
licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web
page at http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to readers
on GitHub via the book’s product page, located at www.apress.com/9781484224021. For more detailed
information, please visit http://www.apress.com/source-code.
Printed on acid-free paper
Acknowledgments
I was blessed to meet a great number of people, both very gifted and extremely dedicated, who helped me
and often guided me toward areas of knowledge I could never have imagined by myself.
I thank Vladimir Nekrasov, my most beloved math teacher, for his course and his influence on me,
which enabled me to think better and more logically.
I thank Andrew Dergachev, who entrusted me to create and teach my course and helped me so much
during these years, Boris Timchenko, Arkady Kluchev, Ivan Loginov (who also kindly agreed to be the
technical reviewer for this book), and all my colleagues from ITMO university, who helped me to shape this
course in one way or another.
I thank all my students who provided feedback or even helped me in teaching. You are the very reason
I am doing this. Several students helped by reviewing the draft of this book; I want to note the most useful
remarks of Dmitry Khalansky and Valery Kireev.
For me, the years I have spent in Saint-Petersburg Academic University are easily the best of my life.
Never have I had more opportunities to study with world-class specialists working in the leading companies
along with other students, much smarter than me. I want to express my deepest gratitude to Alexander
Omelchenko, Alexander Kulikov, Andrey Ivanov, and everyone contributing to the quality of computer
science education in Russia. I also thank Dmitry Boulytchev, Andrey Breslav, and Sergey Sinchuk from
JetBrains, my supervisors who have taught me a lot.
I am also very grateful to my French colleagues: Ali Ed-Dbali, Frédéric Loulergue, Rémi Douence, and
Julien Cohen.
I also want to thank Sergei Gorlatch and Tim Humernbrum for providing much necessary feedback on
Chapter 17, which helped me shape it into a much more consistent and understandable version. Special
thanks go to Dmitry Shubin for his most useful impact on fixing the imperfections of this book.
I am very grateful to my friend Alexey Velikiy and to his agency CorpGlory.com, which focused on data
visualizations and infographics and crafted the best illustrations in this book.
Behind every little success of mine is an infinite amount of support from my family and friends. I would
not have achieved anything without you.
Last, but not least, I thank the Apress team, including Robert Hutchinson, Rita Fernando, Laura
Berendson, and Susan McDermott, for putting their trust in me and this project and doing everything they
could to bring this book into reality.
Introduction
This book aims to help you develop a consistent vision of the domain of low-level programming. We want to
enable a careful reader to
• Freely write in assembly language.
• Understand the Intel 64 programming model.
• Write maintainable and robust code in C11.
• Understand the compilation process and decipher assembly listings.
• Debug errors in compiled assembly code.
• Use appropriate models of computation to greatly reduce program complexity.
• Write performance-critical code.
There are two kinds of technical books: those used as a reference and those used to learn. This book
is, without doubt, the second kind. It is pretty dense on purpose, and in order to successfully digest the
information we highly suggest continuous reading. To quickly memorize new information you should try to
connect it with the information with which you are already familiar. That is why we tried, whenever possible,
to base our explanation of each topic on the information you received from previous topics.
This book is written for programming students, intermediate-to-advanced programmers, and low-level
programming enthusiasts. The prerequisites are a basic understanding of binary and hexadecimal systems
and a basic knowledge of Unix commands.
■ Questions and Answers Throughout this book you will encounter numerous questions. Most of them
are meant to make you think again about what you have just learned, but some of them encourage you to do
additional research, pointing to the relevant keywords.
We provide the answers to these questions on our GitHub page, which also hosts all listings and starting
code for assignments, updates, and other goodies.
Refer to the book's page on the Apress site for additional information: http://www.apress.com/us/
book/9781484224021.
There you can also find several preconfigured virtual machines with Debian Linux installed, with and
without a graphical user interface (GUI), which allow you to start practicing right away without spending
time setting up your system. You can find more information in section 2.1.
We start with the very simple core ideas of what a computer is, explaining the concepts of model of
computation and computer architecture. We expand the core model with extensions until it becomes
adequate enough to describe a modern processor as a programmer sees it. From Chapter 2 onward we start
programming in real assembly language for Intel 64 without resorting to the older 16-bit architectures
that are often taught for historical reasons. This allows us to see the interactions between applications and the operating
system through the system call interface, as well as specific architecture details such as endianness. After a
brief overview of legacy architecture features, some of which are still in use, we study virtual memory in great
detail and illustrate its usage with the help of procfs and examples of using the mmap system call in assembly.
Then we dive into the process of compilation, overviewing preprocessing and static and dynamic linking. After
exploring the interrupt and system call mechanisms in greater detail, we finish the first part with a chapter
about different models of computation, studying examples of finite state machines and stack machines, and
implementing a fully functional compiler for the Forth language in pure assembly.
The second part is dedicated to the C language. We start with a language overview, building the core
understanding of its model of computation necessary to start writing programs. In the next chapter we study
the type system of C and illustrate different kinds of typing, ending with a discussion of polymorphism
and providing exemplary implementations of different kinds of polymorphism in C. Then we study the
ways of correctly structuring a program by splitting it into multiple files, and we also examine the effect this has on the
linking process. The next chapter is dedicated to memory management and input and output. After that,
we elaborate on the three facets of each language: syntax, semantics, and pragmatics, concentrating on the first
and the third. We see how language propositions are transformed into abstract syntax trees, the
difference between undefined and unspecified behavior in C, and the effect of language pragmatics on
the assembly code produced by the compiler. At the end of the second part, we dedicate a chapter to
good code practices to give readers an idea of how code should be written depending on its specific
requirements. The sequence of assignments for this part ends with the rotation of a bitmap file and a
custom memory allocator.
The final part is a bridge between the two previous ones. It dives into translation details such as
calling conventions and stack frames, and into advanced C language features that require a certain understanding
of assembly, such as the volatile and restrict keywords. We provide an overview of several classic low-level
bugs, such as stack buffer overflow, which can be exploited to induce unwanted behavior in a program.
The next chapter describes shared objects in great detail and studies them on the assembly level, providing
minimal working examples of shared libraries written in C and assembly. Then we discuss the relatively
rare topic of code models. The chapter that follows studies the optimizations that modern compilers are capable of
and how that knowledge can be used to produce readable and fast code. We also provide an overview of
performance-amplifying techniques such as specialized assembly instructions usage and cache usage
optimization. This is followed by an assignment in which you will implement a sepia filter for an image using
specialized SSE instructions and measure its performance. The last chapter introduces multithreading via
the pthreads library, memory models, and reorderings, which anyone doing multithreaded programming
should be aware of, and elaborates on the need for memory barriers.
The appendices include short tutorials on gdb (a debugger) and make (an automated build system), a table
of the most frequently used system calls for reference, and system information that makes the performance tests
given throughout the book easier to reproduce. They should be read as needed, but we recommend
that you get used to gdb as soon as you start assembly programming in Chapter 2.
Most illustrations were produced using VSVG library aimed to produce complex interactive vector
graphics, written by Alexey Velikiy (http://www.corpglory.com). The sources for the library and book
illustrations are available at VSVG Github page: https://github.com/corpglory/vsvg.
We hope that you find this book useful and wish you an enjoyable read!
PART I
This chapter is going to give you a general understanding of the fundamentals of computer functioning. We
will describe a core model of computation, enumerate its extensions, and take a closer look at two of them,
namely, registers and hardware stack. It will prepare you to start assembly programming in the next chapter.
One type of machine soon became dominant: the von Neumann architecture computer.
Computer architecture describes the functionality, organization, and implementation of computer
systems. It is a relatively high-level description, compared to a model of computation, which does not omit
even the slightest detail.
The von Neumann architecture had two crucial advantages: it was robust (in a world where electronic
components were highly unstable and short-lived) and easy to program.
In short, this is a computer consisting of one processor and one memory bank, connected to a common
bus. A central processing unit (CPU) can execute instructions, fetched from memory by a control unit.
The arithmetic logic unit (ALU) performs the needed computations. The memory also stores data. See
Figures 1-1 and 1-2.
Following are the key features of this architecture:
• Memory stores only bits (a unit of information, a value equal to 0 or 1).
• Memory stores both encoded instructions and data to operate on. There are no means
to distinguish data from code: both are in fact bit strings.
• Memory is organized into cells, which are labeled with their respective indices in
a natural way (e.g., cell #43 follows cell #42). The indices start at 0. Cell size may
vary (John von Neumann thought that each bit should have its address); modern
computers take one byte (eight bits) as a memory cell size. So, the 0-th byte holds the
first eight bits of the memory, etc.
• The program consists of instructions that are fetched one after another. Their
execution is sequential unless a special jump instruction is executed.
Assembly language for a chosen processor is a programming language consisting of mnemonics for
each possible binary encoded instruction (machine code). It makes programming in machine codes much
easier, because the programmer then does not have to memorize the binary encoding of instructions, only
their names and parameters.
Note that instructions can have parameters of different sizes and formats.
An architecture does not always define a precise instruction set, unlike a model of computation.
The common modern personal computer has evolved from old von Neumann architecture computers,
so we are going to investigate this evolution and see what distinguishes a modern computer from the simple
schematic in Figure 1-2.
CHAPTER 1 ■ BASIC COMPUTER ARCHITECTURE
■ Note Memory state and values of registers fully describe the CPU state (from a programmer’s point of
view). Understanding an instruction means understanding its effects on memory and registers.
1.2 Evolution
1.2.1 Drawbacks of von Neumann Architecture
The simple architecture described previously has serious drawbacks.
First of all, this architecture is not interactive at all. A programmer is limited by manual memory editing
and visualizing its contents somehow. In the early days of computers, it was pretty straightforward, because
the circuits were big and bits could have been flipped literally with bare hands.
Moreover, this architecture is not multitask friendly. Imagine your computer is performing a very
slow task (e.g., controlling a printer). It is slow because a printer is much slower than the slowest CPU. The
CPU then has to wait for the device's reaction close to 99% of the time, which is a waste of resources
(namely, CPU time).
Then, when any program can execute any kind of instruction, all sorts of unexpected behavior can occur.
The purpose of an operating system (OS) is (among others) to manage the resources (such as external
devices) so that user applications will not cause chaos by interacting with the same devices concurrently.
Because of this we would like to prohibit all user applications from executing some instructions related to
input/output or system management.
Another problem is that memory and CPU performance differ drastically.
Back in the old times, computers were not only simpler: they were designed as integral entities.
Memory, bus, network interfaces—everything was created by the same engineering team. Every part was
specialized for use in that specific model, so parts were not meant to be interchangeable. In these
circumstances no one tried to create a part capable of higher performance than the other parts, because it could
not possibly increase overall computer performance.
But as the architectures became more or less stable, hardware developers started to work on different
parts of computers independently. Naturally, they tried to improve their performance for marketing
purposes. However, not all parts were easy and cheap¹ to speed up. This is the reason CPUs soon became
much faster than memory. It is possible to speed up memory by choosing other types of underlying circuits,
but it would be much more expensive [12].
¹ Note how often the solutions engineers come up with are dictated by economic reasons rather than technical limitations.
When a system consists of different parts and their performance characteristics differ a lot, the slowest
part can become a bottleneck: if the slowest part is replaced with a faster analog, the
overall performance will increase significantly. That is where the architecture had to be heavily modified.
² Also known as x86_64 and AMD64.
Problem                                                        Solution
Nothing is possible without querying slow memory               Registers, caches
Lack of interactivity                                          Interrupts
No support for code isolation in procedures or context saving  Hardware stack
Multitasking: any program can execute any instruction          Protection rings
Multitasking: programs are not isolated from one another       Virtual memory
■ Sources of information No book should cover the instruction set and processor architecture completely.
Many books try to include exhaustive information about the instruction set; such information gets outdated quite
soon and, moreover, bloats the book unnecessarily.
We will often refer you to Intel® 64 and IA-32 Architectures Software Developer’s Manual available online:
see [15]. Get it now!
There is no virtue in copying instruction descriptions from the "original" place they appear in; it is much
more mature to learn to work with the source.
The second volume covers the instruction set completely and has a very useful table of contents. Please always use
it to get information about the instruction set: it is not only very good practice but also a quite reliable source.
Note that many educational resources devoted to assembly language on the Internet are heavily outdated
(as few people program in assembly these days) and do not cover 64-bit mode at all. The instructions present
in older modes often have updated counterparts in long mode that work in a different way. This
is the reason we strongly discourage using search engines to find instruction descriptions, tempting as that might be.
1.3 Registers
The data exchange between the CPU and memory is a crucial part of computations in a von Neumann
computer. Instructions have to be fetched from memory, operands have to be fetched from memory, and some
instructions store results in memory as well. This creates a bottleneck and leads to wasted CPU time as the CPU waits
for a data response from the memory chip. To avoid constant waiting, the processor was equipped with its own
memory cells, called registers. These are few but fast. Programs are usually written in such a way that most of
the time the working set of memory cells is small. This fact suggests that programs can be written so
that most of the time the CPU will be working with registers.
Registers are based on transistors, while main memory uses capacitors. We could have implemented
main memory on transistors and gotten a much faster circuit. There are several reasons engineers prefer
other ways of speeding up computations.
• Registers are more expensive.
• Instructions encode the register’s number as part of their codes. To address more
registers the instructions have to grow in size.
• Registers add complexity to the circuits that address them. More complex circuits are
harder to speed up: it is not easy to make a large register file run at 5 GHz.
Naturally, register usage slows down computers in the worst case. If everything has to be fetched into
registers before the computations are made and flushed into memory after, where’s the profit?
Programs are usually written in such a way that they have one particular property. It is a result of
using common programming patterns such as loops, functions, and data reuse, not some law of nature.
This property is called locality of reference, and there are two main types of it: temporal and spatial.
Temporal locality means that accesses to one address are likely to be close in time.
Spatial locality means that after accessing an address X, the next memory access is likely to be close
to X (like X − 16 or X + 28).
These properties are not binary: you can write a program exhibiting stronger or weaker locality.
Typical programs use the following pattern: the data working set is small and can be kept inside
registers. After fetching the data into registers once, we work with it for quite some time, and then the
results are flushed into memory. Data stored in memory will rarely be used by the program. In case
we need to work with that data, we lose performance because
• We need to fetch the data into registers.
• If all registers are occupied with data we still need later on, we will have to spill some of
them, which means saving their contents into temporarily allocated memory cells.
■ Note A widespread situation for an engineer: decreasing performance in the worst case to improve it in the average
case. This works quite often, but it is prohibited when building real-time systems, which impose constraints on the
worst-case system reaction time. Such systems are required to react to events in no more than a certain amount
of time, so decreasing performance in the worst case to improve it in other cases is not an option.
■ Note Unlike the hardware stack, which is implemented on top of the main memory, registers are a
completely different kind of memory. Thus they do not have addresses, as the main memory’s cells do!
The alternate names are in fact more common for historical reasons. We will provide both for reference
and give a tip for each one. These semantic descriptions are given for a reference; you don’t have to
memorize them right now.
You usually do not want to use the rsp and rbp registers for your own purposes because of their very special
meaning (later we will see how careless use of them corrupts the stack and stack frame). However, you can
perform arithmetic operations on them directly, which is what makes them general purpose.
Table 1-3 shows registers sorted by their names following an indexing convention.
r0 r1 r2 r3 r4 r5 r6 r7
rax rcx rdx rbx rsp rbp rsi rdi
Addressing a part of a register is possible. For each register you can address its lowest 32 bits, lowest 16
bits, or lowest 8 bits.
When using the names r0,...,r15 it is done by adding an appropriate suffix to a register’s name:
• d for double word—lower 32 bits;
• w for word—lower 16 bits;
• b for byte—lower 8 bits.
For example,
• r7b is the lowest byte of register r7;
• r3w consists of the lowest two bytes of r3; and
• r0d consists of the lowest four bytes of r0.
The alternate names also allow addressing the smaller parts.
Figure 1-4 shows decomposition of wide general purpose registers into smaller ones.
The naming convention for accessing parts of rax, rbx, rcx, and rdx follows the same pattern; only the
middle letter (a for rax) changes. The other four registers do not allow access to their second lowest
bytes (the way rax does via the name ah). The lowest byte naming differs slightly for rsi, rdi, rsp, and rbp.
• The smallest parts of rsi and rdi are sil and dil (see Figure 1-5).
• The smallest parts of rsp and rbp are spl and bpl (see Figure 1-6).
In practice, the names r0-r7 are rarely seen. Usually programmers stick with the alternate names for the
first eight general purpose registers. This is done for both legacy and semantic reasons: rsp conveys a lot more
information than r4. The other eight (r8-r15) can only be named using the indexed convention.
■ Inconsistency in writes All reads from smaller registers act in an obvious way. Writes to the 32-bit
parts, however, zero the upper 32 bits of the full register. For example, zeroing eax will zero the
entire rax, and storing -1 into eax will set rax to 0x00000000ffffffff. Other writes (e.g., to 16-bit parts) act as
intended: they leave all other bits unaffected. See section 3.4.2 "CISC and RISC" for the explanation.
A programmer has access to the rip register. It is a 64-bit register that always stores the address of the
next instruction to be executed. Branching instructions (e.g., jmp) in fact modify it. So, every time any
instruction is being executed, rip stores the address of the next instruction to be executed.
Another accessible register is called rflags. It stores flags, which reflect the current program state—for
example, what the result of the last arithmetic instruction was: was it negative, did an overflow happen,
etc. Its smaller parts are called eflags (32 bits) and flags (16 bits).
■ Question 1 It is time to do preliminary research based on the documentation [15]. Refer to section 3.4.3
of the first volume to learn about register rflags. What is the meaning of flags CF, AF, ZF, OF, SF? What is the
difference between OF and CF?
In addition to these core registers there are also registers used by instructions working with floating
point numbers or special parallelized instructions able to perform similar actions on multiple pairs of
operands at the same time. These instructions are often used for multimedia purposes (they help speed up
multimedia decoding algorithms). The corresponding registers are 128-bit wide and named xmm0 - xmm15.
We will talk about them later.
Some registers have appeared as non-standard extensions but became standardized shortly after. These
are so-called model-specific registers. See section 6.3.1 “Model specific registers” for more details.
• cr8 (aliased as tpr) is used to perform a fine tuning of the interrupts mechanism
(see section 6.2 “Interrupts”).
• efer is another flag register used to control processor modes and extensions
(e.g., long mode and system calls handling).
• idtr stores the address of the interrupt descriptors table (see section 6.2 “Interrupts”).
• gdtr and ldtr store the addresses of the descriptor tables (see section 3.2 “Protected
mode”).
• cs, ds, ss, es, gs, fs are so-called segment registers. The segmentation mechanism
they provide is considered legacy for many years now, but a part of it is still used to
implement privileged mode. See section 3.2 “Protected mode”.
The hardware stack is most useful to implement function calls in higher-level languages. When a
function A calls another function B, it uses the stack to save the context of computations to return to it after B
terminates.
Here are some important facts about the hardware stack, most of which follow from its description:
1. There is no such thing as an empty stack, even if we performed push zero times.
A pop can be executed anyway, probably returning a garbage "topmost"
stack element.
2. Stack grows toward zero address.
3. Almost all kinds of its operands are considered signed integers and thus can be
expanded with the sign bit. For example, performing push with an argument of 0xB9 will
result in the data unit 0xff ff ff ff ff ff ff b9 being stored on the stack.
By default, push uses an 8-byte operand size. Thus an instruction push -1 will
store 0xff ff ff ff ff ff ff ff on the stack.
4. Most architectures that support stack use the same principle with its top defined
by some register. What differs, however, is the meaning of the respective address.
On some architectures it is the address of the next element, which will be written
on the next push. On others it is the address of the last element already pushed into
the stack.
■ Working with Intel docs: How to read instruction descriptions Open the second volume of [15]. Find
the page corresponding to the push instruction. It begins with a table. For our purpose we will only investigate
the columns OPCODE, INSTRUCTION, 64-BIT MODE, and DESCRIPTION. The OPCODE field defines the machine
encoding of an instruction (operation code). As you see, there are options and each option corresponds to a different
DESCRIPTION. It means that sometimes not only the operands vary but also the operation codes themselves.
INSTRUCTION describes the instruction mnemonics and allowed operand types. Here R stands for any general
purpose register, M stands for a memory location, and IMM stands for an immediate value (e.g., an integer constant like 42
or 1337). A number defines the operand size. If only specific registers are allowed, they are named. For example:
• push r/m16—push a general purpose 16-bit register or a 16-bit number taken from
memory into the stack.
• push CS—push a segment register cs.
The DESCRIPTION column gives a brief explanation of the instruction’s effects. It is often enough to understand
and use the instruction.
• Read the further explanation of push. When is the operand not sign extended?
• Explain all effects of the instruction push rsp on memory and registers.
1.6 Summary
In this chapter we provided a quick overview of the von Neumann architecture. We then started adding features to this
model to make it more adequate for describing modern processors. So far we have taken a closer look at registers and the
hardware stack. The next step is to start programming in assembly, and that is what the next chapter is dedicated
to. We are going to view some sample programs, pinpoint several new architectural features (such as endianness
and addressing modes), and design a simple input/output library for *nix to ease interaction with the user.
■ Question 6 What are the main problems that the modern extensions of the von Neumann model are trying
to solve?
■ Question 7 What are the main general purpose registers of Intel 64?
CHAPTER 2
Assembly Language
In this chapter we will start practicing assembly language by gradually writing more complex programs for Linux.
We will observe some architecture details that impact the writing of all kinds of programs (e.g., endianness).
We have chosen a *nix system in this book because it is much easier to program in assembly compared
to doing so in Windows.
section .data
message: db 'hello, world!', 10
section .text
global _start
_start:
mov rax, 1       ; system call number should be stored in rax
mov rdi, 1       ; argument #1 in rdi: where to write (descriptor)?
mov rsi, message ; argument #2 in rsi: where does the string start?
mov rdx, 14      ; argument #3 in rdx: how many bytes to write?
syscall          ; this instruction invokes a system call
This program invokes a write system call with correct arguments on lines 6-9. It is really the only thing it
does. The next sections will explain this sample program in greater detail.
■ Note Assembly language is, in general, case insensitive, but label names are not!
mov, mOV, Mov are all the same thing, but global _start and global _START are not! Section names are case
sensitive too: section .DATA and section .data differ!
The db directive is used to create byte data. Usually data is defined using one of these directives, which
differ by data format:
• db—bytes;
• dw—so-called words, equal to 2 bytes each;
• dd—double words, equal to 4 bytes; and
• dq—quad words, equal to 8 bytes.
Let’s see an example, in Listing 2-2.
¹ The NASM manual also uses the name "pseudo instruction" for a specific subset of directives.
times n cmd is a directive that repeats cmd n times in the program code, as if you had copy-pasted it n times.
It also works with central processing unit (CPU) instructions.
Note that you can create data inside any section, including .text. As we said earlier, to a CPU, data
and instructions look alike, and the CPU will try to interpret data as encoded instructions when asked to.
These directives allow you to define several data objects one by one, as in Listing 2-3, where a sequence
of characters is followed by a single byte equal to 10.
Letters, digits, and other characters are encoded in ASCII. Programmers have agreed upon a table in which
each character is assigned a unique number, its ASCII code. We start at the address corresponding to the
label message: we store the ASCII codes for all the letters of the string "hello, world!", then we add a byte equal to
10. Why 10? By convention, to start a new line we output a special character with code 10.
■ Terminological chaos It is quite common to refer to the integer format most native to the computer as
the machine word. As we are programming a 64-bit computer, where addresses and general purpose
registers are 64-bit, it is convenient to take the machine word size as 64 bits, or 8 bytes.
In assembly programming for the Intel architecture, however, the term word was used to describe a 16-bit data entry,
because on the older machines it was exactly the machine word. Unfortunately, for legacy reasons, it is still
used in the old sense. That is why 32-bit data is called a double word and 64-bit data is referred to as a quad word.
To compile our first program, save the code in hello.asm² and then launch these commands in the shell:
The details of the compilation process, along with its stages, will be discussed in Chapter 5. Let's
launch "Hello, world!":
> ./hello
hello, world!
Segmentation fault
We have clearly output what we wanted. However, the program seems to have caused an error. What
did we do wrong? After executing a system call, the program continues its work. We did not write any
instructions after syscall, but the memory cells that follow do hold some values nonetheless.
■ Note If you did not put anything at some memory address, it will likely hold some kind of garbage, not
zeroes or any kind of valid instructions.
A processor has no idea whether these values were intended to encode instructions or not. So, following
its very nature, it tries to interpret them, because the rip register points at them. It is highly unlikely that these
values encode valid instructions, so an interrupt with code 6 will occur (invalid opcode).³
So what do we do? We have to use the exit system call, which terminates the program in a correct way,
as shown in Listing 2-4.
section .text
global _start
_start:
mov rax, 1 ; 'write' syscall number
mov rdi, 1 ; stdout descriptor
mov rsi, message ; string address
mov rdx, 14 ; string length in bytes
syscall
² Remember: all source code, including listings, can be found on www.apress.com/us/book/9781484224021 and is also
stored in the home directory of the preconfigured virtual machine!
³ Even if not, sequential execution will soon lead the processor to the end of the allocated virtual addresses (see section
4.2). In the end, the operating system will terminate the program, because it is unlikely to recover from that.
section .text
global _start
_start:
; number 1122... in hexadecimal format
mov rax, 0x1122334455667788
mov rdi, 1
mov rdx, 1
mov rcx, 64
; Each 4 bits should be output as one hexadecimal digit
; Use shift and bitwise AND to isolate them
; the result is the offset in 'codes' array
.loop:
push rax
sub rcx, 4
; cl is a register, smallest part of rcx
; rax -- eax -- ax -- ah + al
; rcx -- ecx -- cx -- ch + cl
sar rax, cl
and rax, 0xf
pop rax
; test can be used for the fastest 'is it a zero?' check
; see docs for 'test' command
test rcx, rcx
jnz .loop
By shifting the rax value and ANDing it with the mask 0xF, we transform the whole number into one of
its hexadecimal digits. Each digit is a number from 0 to 15; we use it as an index, adding it to the address of the
label codes to get the character representing it.
For example, given rax = 0x4A we will use the indices 0x4 = 4₁₀ and 0xA = 10₁₀.⁴ The first one gives us the
character '4', whose code is 0x34. The second results in the character 'a', whose code is 0x61.
■ Question 14 Check that the ASCII codes mentioned in the last example are correct.
We can use the hardware stack to save and restore register values, for example, around the syscall instruction.
■ Question 15 What is the difference between sar and shr? Check Intel docs.
■ Question 16 How do you write numbers in different number systems in a way understandable to NASM?
Check NASM documentation.
■ Note When a program starts, the values of most registers are not well defined (they can be absolutely random).
This is a great source of rookie mistakes, as one tends to assume that they are zeroed.
Square brackets denote indirect addressing; the address is written inside them.
• mov rsi, rax—copies rax into rsi.
• mov rsi, [rax]—copies the memory contents (8 sequential bytes) starting at the address stored
in rax into rsi. How do we know that we have to copy exactly 8 bytes? As we know, mov
operands are of the same size, and the size of rsi is 8 bytes. Knowing these facts, the
assembler is able to deduce that exactly 8 bytes should be taken from memory.
⁴ The subscript denotes the number system's base.
The instructions lea and mov have a subtle difference between their meanings. lea means “load
effective address.”
It allows you to calculate an address of a memory cell and store it somewhere. This is not always trivial,
because there are tricky address modes (as we will see later): for example, the address can be a sum of
several operands.
Listing 2-7 provides a quick demonstration of what lea and mov are doing.
⁵ This action is impossible to encode using the mov command. Check the Intel docs to verify that it is not implemented.
Testing a register value for zero with the test reg, reg instruction is a common (and fast) idiom.
At least two jump commands exist for each arithmetic flag F: jF and jnF. For example, for the sign flag: js and jns.
Other useful commands include
1. ja (jump if above)/jb (jump if below) for a jump after a comparison of unsigned
numbers with cmp.
2. jg (jump if greater)/jl (jump if less) for signed.
3. jae (jump if above or equal), jle (jump if less or equal) and similar. Some of
common jump instructions are shown in Listing 2-9.
push rip
jmp <address>
The address now stored on the stack (the former rip contents) is called the return address.
Any function can accept an unlimited number of arguments. The first six arguments are passed in rdi,
rsi, rdx, rcx, r8, and r9, respectively. The rest are passed on the stack in reverse order.
What counts as the end of a routine? The most straightforward answer is that the ret
instruction denotes the function's end. Its semantics are fully equivalent to pop rip.
Apparently, the fragile mechanism of call and ret only works when the state of the stack is carefully
managed. One should not invoke ret unless the stack is exactly in the same state as when the function
started. Otherwise, the processor will take whatever is on top of the stack as a return address and use it as the
new rip content, which will certainly lead to executing garbage.
Now let’s talk about how functions use registers. Obviously, executing a function can change registers.
There are two types of registers.
• Callee-saved registers must be restored by the procedure being called. So, if it needs
to change them, it has to change them back.
These registers are callee-saved: rbx, rbp, rsp, r12-r15, a total of seven registers.
• Caller-saved registers should be saved before invoking a function and restored after. One
does not have to save and restore them if their value will not be of importance after.
All other registers are caller-saved.
These two categories are a convention. That is, a programmer must follow this agreement by
• Saving and restoring callee-saved registers.
• Being always aware that caller-saved registers can be changed during function execution.
■ A source of bugs A common mistake is failing to save caller-saved registers before a call and then using them
after returning from the function. Remember:
1. If you change rbx, rbp, rsp, or r12-r15, change them back!
2. If you need any other register to survive a function call, save it yourself before calling!
Some functions can return a value. This value is usually the very essence of why the function is written and
executed. For example, we can write a function that accepts a number as its argument and returns it squared.
Implementation-wise, we are returning values by storing them in rax before the function ends its
execution. If you need to return two values, you are allowed to use rdx for the second one.
So, the pattern of calling a function is as follows:
• Save all caller-saved registers you want to survive function call
(you can use push for that).
• Store arguments in the relevant registers (rdi, rsi, etc.).
• Invoke function using call.
• After function returns, rax will hold the return value.
• Restore caller-saved registers stored before the function call.
■ Why do we need conventions? A function is used to abstract a piece of logic, forgetting completely about
its internal implementation and changing it when necessary. Such changes should be completely transparent to
the outside program. The convention described previously allows you to call any function from any given place
and be sure about its effects (may change any caller-saved register; will keep callee-saved registers intact).
Some system calls also return values—be careful and read the docs!
You should never use rbp and rsp for your own purposes: they are used implicitly during execution. As you
already know, rsp is used as the stack pointer.
■ On system call arguments The arguments for system calls are stored in a different set of registers than
those for functions. The fourth argument is stored in r10, while a function accepts its fourth argument in rcx!
The reason is that the syscall instruction implicitly overwrites rcx. System calls cannot accept more than six
arguments.
If you do not follow the described convention, you will be unable to change your functions without
introducing bugs in places where they are called.
Now it is time to write two more functions: print_newline will print the newline character; print_hex
will accept a number and print it in hexadecimal format (see Listing 2-10).
newline_char: db 10
codes: db '0123456789abcdef'
section .text
global _start
print_newline:
mov rax, 1 ; 'write' syscall identifier
mov rdi, 1 ; stdout file descriptor
mov rsi, newline_char ; where do we take data from
mov rdx, 1 ; the amount of bytes to write
syscall
ret
print_hex:
mov rax, rdi
mov rdi, 1
mov rdx, 1
mov rcx, 64 ; how far are we shifting rax?
iterate:
push rax ; Save the initial rax value
sub rcx, 4
sar rax, cl ; shift to 60, 56, 52, ... 4, 0
; the cl register is the smallest part of rcx
and rax, 0xf ; clear all bits but the lowest four
lea rsi, [codes + rax]; take a hexadecimal digit character code
mov rax, 1 ; 'write' syscall number again
push rcx   ; syscall destroys rcx, so save it
syscall
pop rcx
pop rax    ; restore the initial rax value
test rcx, rcx ; loop until all 16 digits are printed
jnz iterate
ret
_start:
mov rdi, 0x1122334455667788
call print_hex
call print_newline
mov rax, 60
xor rdi, rdi
syscall
section .text
_start:
mov rdi, [demo1]
call print_hex
call print_newline
mov rax, 60
xor rdi, rdi
syscall
When we launch it, to our surprise, we get completely different results for demo1 and demo2.
> ./main
1122334455667788
8877665544332211
The bits in each byte are stored in a straightforward way, but the bytes are stored from the least
significant to the most significant.
This applies only to memory operations: in registers, the bytes are stored in a natural way. Different
processors have different conventions on how the bytes are stored.
• Big endian multibyte numbers are stored in memory starting with the most
significant bytes.
• Little endian multibyte numbers are stored in memory starting with the least
significant bytes.
As the example shows, Intel 64 follows the little endian convention. In general, choosing one
convention over the other is up to the hardware engineers.
These conventions do not concern arrays and strings. However, if each character is encoded using 2
bytes rather than just 1, those bytes will be stored in reverse order.
The advantage of little endian is that we can discard the most significant bytes, effectively converting the
number from a wider format to a narrower one (e.g., from 8 bytes to 4) without changing its address.
For example, take demo3: dq 0x1234. To read this number as a word (dw-sized), we read 2 bytes
starting at the same address demo3. See Table 2-1 for the complete memory layout.
Table 2-1. Little Endian and Big Endian for the quad word number 0x1234
offset from demo3:  +0 +1 +2 +3 +4 +5 +6 +7
little endian:      34 12 00 00 00 00 00 00
big endian:         00 00 00 00 00 00 12 34
Big endian is a native format often used inside network packets (e.g., TCP/IP). It is also an internal
number format for Java Virtual Machine.
Middle endian is a lesser-known notion. Assume we want to create a set of routines to perform
arithmetic on 128-bit numbers. Then the bytes can be stored as follows: first the 8 least significant
bytes in reversed order, and then the 8 most significant bytes, also in reversed order:
7 6 5 4 3 2 1 0, 15 14 13 12 11 10 9 8
2.5.2 Strings
As we already know, characters are encoded using the ASCII table: a code is assigned to each character.
A string is then simply a sequence of character codes. However, this alone says nothing about how to
determine the string's length. There are two common conventions:
1. Strings start with their explicit length.
2. A special character denotes the string ending. Traditionally, the zero code is used.
Such strings are called null-terminated.
lab: db 0
...
mov rax, lab + 1 + 2*3
NASM supports arithmetic expressions with parentheses and bit operations. Such expressions can only
include constants known to the compiler; this way it can precompute them and insert the
results (as constant numbers) into the executable code. So, such expressions are NOT calculated at
runtime.
A runtime analogue would need to use instructions such as add or mul.
section .data
test: dq -1
section .text
mov byte[test], 1 ;1
mov word[test], 1 ;2
mov dword[test], 1 ;4
mov qword[test], 1 ;8
■ Question 18 What is test equal to after each of the commands listed previously?
mov rax, 10
2. Through a register:
mov rax, rbx
This instruction transfers the rbx value into rax.
mov r9, 10
mov rax, [r9]
We can use precomputation:
The address inside this instruction was precomputed, because both the base and
the offset are constants under the compiler's control. Now it is just a number.
4. Base-indexed with scale and displacement
Most addressing modes are generalized by this one. The address here is
calculated from the following components:
Address = base + index ∗ scale + displacement
• Base is a register (or omitted);
• Scale can only be an immediate equal to 1, 2, 4, or 8;
• Index is a register (or omitted); and
• Displacement is always an immediate (a constant, such as a label's address).
Listing 2-12 shows examples of different addressing types.
■ A big picture You can think of byte, word, etc. as type specifiers. For instance, you can push 16-, 32-, or
64-bit numbers onto the stack, so the instruction push 1 alone is unclear about how many bits wide the
operand is. In the same way that mov word [test], 1 signifies that [test] is a word, push word 1 carries the
information about the operand's size.
> true
> echo $?
0
> false
> echo $?
1
Let’s write an assembly program that mimics the false shell command, as shown in Listing 2-13.
section .text
_start:
mov rdi, 1
mov rax, 60
syscall
Now we have everything needed to calculate string length. Listing 2-14 shows the code.
section .data
test_string: db "abcdef", 0
section .text
.end:
ret ; When we hit 'ret', rax should hold return value
_start:
mov rax, 60
syscall
The important part (and the only part we will leave) is the strlen function. Notice that
1. strlen changes registers, so after performing call strlen the caller-saved registers
may have changed their values.
2. strlen does not change rbx or any other callee-saved register.
■ Question 19 Can you spot a bug or two in Listing 2-15? When will they occur?
section .data
test_string: db "abcdef", 0
section .text
strlen:
.loop:
cmp byte [rdi+r13], 0
je .end
inc r13
jmp .loop
.end:
mov rax, r13
ret
_start:
mov rdi, test_string
call strlen
mov rdi, rax
mov rax, 60
syscall
Function Definition
exit Accepts an exit code and terminates current process.
string_length Accepts a pointer to a string and returns its length.
print_string Accepts a pointer to a null-terminated string and prints it to stdout.
print_char Accepts a character code directly as its first argument and prints it to stdout.
print_newline Prints a character with code 0xA.
print_uint Outputs an unsigned 8-byte integer in decimal format.
We suggest you create a buffer on the stack⁶ and store the division results there. Each
time, you divide the last value by 10 and store the corresponding digit in the
buffer. Do not forget that you should transform each digit into its ASCII code
(e.g., 0x04 becomes 0x34).
print_int Outputs a signed 8-byte integer in decimal format.
read_char Reads one character from stdin and returns it. If the end of the input stream occurs, returns 0.
read_word Accepts a buffer address and size as arguments. Reads the next word from stdin
(skipping whitespace⁷) into the buffer. Stops and returns 0 if the word is too big for the
buffer specified; otherwise returns the buffer address.
This function should null-terminate the accepted string.
This function should null-terminate the accepted string.
parse_uint Accepts a null-terminated string and tries to parse an unsigned number from its start.
Returns the number parsed in rax, its characters count in rdx.
parse_int Accepts a null-terminated string and tries to parse a signed number from its start.
Returns the number parsed in rax; its characters count in rdx (including sign if any).
No spaces between sign and digits are allowed.
string_equals Accepts two pointers to strings and compares them. Returns 1 if they are equal,
otherwise 0.
string_copy Accepts a pointer to a string, a pointer to a buffer, and buffer’s length. Copies string
to the destination. The destination address is returned if the string fits the buffer;
otherwise zero is returned.
Use test.py to perform automated tests of correctness. Just run it and it will do the rest.
Remember that a string of n characters needs n + 1 bytes to be stored in memory because of the null terminator.
Read Appendix A to see how you can execute the program step by step, observing the changes in register
values and memory state.
2.7.1 Self-Evaluation
Before testing or when facing an unexpected result, check the following quick list:
1. Labels denoting functions should be global; others should be local.
2. You do not assume that registers hold zero “by default.”
3. You save and restore callee-saved registers if you are using them.
⁶ In fact, by decreasing rsp you allocate memory on the stack.
⁷ We consider spaces, tabulation, and line breaks whitespace characters. Their codes are 0x20, 0x9, and 0xA, respectively.
4. You save caller-saved registers you need before call and restore them after.
5. You do not use buffers in .data. Instead, you allocate them on the stack, which
allows you to adopt multithreading if needed.
6. Your functions accept arguments in rdi, rsi, rdx, rcx, r8, and r9.
7. You do not print numbers digit after digit. Instead you transform them into strings of
characters and use print_string.
8. parse_int and parse_uint are setting rdx correctly. It will be really important in the
next assignment.
9. All parsing functions and read_word work when the input is terminated via Ctrl-D.
Done right, the code will not take more than 250 lines.
■ Question 20 Try to rewrite print_newline without calling print_char or copying its code. Hint: read
about tail call optimization.
■ Question 21 Try to rewrite print_int without calling print_uint or copying its code. Hint: read about tail
call optimization.
■ Question 22 Try to rewrite print_int without calling print_uint, copying its code, or using jmp. You will
only need one instruction and a careful code placement.
2.8 Summary
In this chapter we started to do real things and apply our basic knowledge of assembly language. We
hope that you have overcome any possible fear of assembly. Despite being verbose in the extreme, it is not
a hard language to use. We have learned to make branches and loops and to perform basic arithmetic and
system calls; we have also seen the different addressing modes and the little and big endian byte orders. The following
assembly assignments will use the little library we have built to facilitate interaction with the user.
■ Question 23 What is the connection between rax, eax, ax, ah, and al?
■ Question 25 How can you work with a hardware stack? Describe the instructions you can use.
mov [rax], 0
cmp [rdx], bl
mov bh, bl
mov al, al
add bpl, 9
add [9], spl
mov r8d, r9d
mov r3b, al
mov r9w, r2d
mov rcx, [rax + rbx + rdx]
mov r9, [r9 + 8*rax]
mov [r8+r7+10], 6
mov [r8+r7+10], r6
■ Question 34 How do you check whether an integer number is contained in a certain range (x, y)?
■ Question 37 How do you test whether rax is zero without the cmp command?
■ Question 40 By using exactly two instructions (the first is neg), take an absolute value of an integer
stored in rax.
■ Question 44 rax = 0x1122334455667788. We have performed push rax. What will be the contents of the
byte at address [rsp+3]?
CHAPTER 3
Legacy
This chapter will introduce you to the legacy processor modes, which are no longer used, and to their mostly
legacy features, some of which are still relevant today. You will see how processors evolved and learn the details
of the protection rings implementation (privileged and user modes). You will also understand the meaning of the
Global Descriptor Table. While this information helps you understand the architecture better, it is not
crucial for assembly programming in user space.
As processors evolved, each new mode increased the machine word’s length and added new features.
A processor can function in one of the following modes:
• Real mode (the most ancient, 16-bit one);
• Protected mode (commonly referred to as the 32-bit one);
• Virtual mode (to emulate real mode inside protected mode);
• System management mode (for sleep mode, power management, etc.); and
• Long mode, with which we are already a bit familiar.
We are going to take a closer look at real and protected mode.
• Each logical address consists of two components. One is taken from a segment register
and encodes the segment start. The other is an offset inside this segment. The hardware
calculates the physical address from these components the following way:
physical address = segment base * 16 + offset
You can often see addresses written in form of segment:offset, for example:
4a40:0002, ds:0001, 7bd3:ah.
As we already stated, programmers want to separate code from data (and stack), so they use
different segments for them. Segment registers are specialized for that: cs stores the code
segment's starting address, ds corresponds to the data segment, and ss to the stack segment. The other segment
registers are used to store additional data segments.
Note that, strictly speaking, the segment registers do not hold the segments' starting addresses but rather
parts of them (the four most significant hexadecimal digits). By appending another zero digit, that is, multiplying
by 16, we get the real segment starting address.
Each instruction referencing memory implicitly assumes usage of one of segment registers.
Documentation clarifies the default segment registers for each instruction. However, common sense can
help as well. For instance, mov is used to manipulate data, so the address is relative to the data segment.
When the program is loaded, the loader sets the ip, cs, ss, and sp registers so that cs:ip corresponds to the
entry point and ss:sp points to the top of the stack.
The central processing unit (CPU) always starts in real mode; the boot loader then usually executes
code to explicitly switch it to protected mode and later to long mode.
Real mode has numerous drawbacks.
• It makes multitasking very hard. The same address space is shared among all
programs, so they must be loaded at different addresses, and their relative placement
usually has to be decided during compilation.
• Programs can overwrite each other's code, or even the operating system's, as they all live in the
same address space.
• Any program can execute any instruction, including those used to set up the
processor’s state. Some instructions should only be used by the operating system
(like those used to set up virtual memory, perform power management, etc.) as their
incorrect usage can crash the whole system.
The protected mode was intended to solve these problems.
The way of obtaining a segment's starting address has changed compared to real mode. Now the start is
taken from an entry in a special table rather than computed by direct multiplication of the segment register contents:
Linear address = segment base (taken from the system table) + offset
Each of the segment registers cs, ds, ss, es, gs, and fs stores a so-called segment selector, containing
an index into a special segment descriptor table and a little additional information. There are two types of
segment descriptor tables: the possibly numerous LDTs (Local Descriptor Tables) and the single GDT (Global
Descriptor Table).
LDTs were intended for the hardware task-switching mechanism; however, operating system
developers did not adopt it. Today programs are isolated by virtual memory, and LDTs are not used.
GDTR is the register that stores the GDT address and size.
Segment selectors are structured as shown in Figure 3-1.
Index denotes the descriptor position in either the GDT or an LDT. The T bit selects between LDT and GDT; as LDTs
are no longer used, it will be zero in all practical cases.
The table entries in the GDT/LDT also store the privilege level assigned to the
described segment. When a segment is accessed through a segment selector, the Requested Privilege
Level (RPL) value (stored in the selector, that is, in the segment register) is checked against the Descriptor Privilege
Level (DPL, stored in the descriptor table). If the RPL is not privileged enough to access a more privileged segment, an error
occurs. This way we can create numerous segments with various permissions and use the RPL values in
segment selectors to define which of them are accessible to us right now (given our privilege level).
Privilege levels are the same thing as protection rings!
It is safe to say that the current privilege level (i.e., the current ring) is stored in the lowest two bits of cs and ss
(these two values should be equal). This is what affects the ability to execute certain critical instructions
(e.g., changing the GDT itself).
For ds, changing these bits lets us deliberately lower our privilege specifically for data accesses
through that register.
For example, suppose we are currently in ring 0 and ds = 0x02 (RPL = 2). Even though the lowest two bits of cs and ss are 0
(as we are inside ring 0), we cannot access data in a segment more privileged than level 2 (that is, with DPL 1 or 0).
In other words, the RPL field stores how privileged we are when requesting access to a segment.
Segments, in turn, are assigned to one of the four protection rings. When requesting access, the effective
privilege level must be at least as privileged as (numerically no greater than) the level attributed to the segment itself.
¹ In this book we are approximating things a bit, because certain data structures can have a different format depending on page
size, etc. The documentation will give you the most precise answers (read volume 3, chapter 3 of [15]).
jmp 0x08:addr
Listing 3-1 shows a small snippet of how we can turn on protected mode (assuming start32 is a label
at the start of the 32-bit code).
align 16
_gdtr: ; stores GDT's last entry index + GDT address
dw 47
dq _gdt
align 16
_gdt:
; Null descriptor (should be present in any GDT)
dd 0x00, 0x00
Align directives control alignment, the essence of which we explain later in this book.
You might think that every memory transaction now needs another one to read the GDT contents. This is
not true: for each segment register there is a so-called shadow register, which cannot be referenced directly
and serves as a cache for GDT contents. Once a segment selector is changed, the corresponding
shadow register is loaded with the matching descriptor from the GDT, and from then on it serves as the
source of all information needed about this segment.
The D flag needs a little explanation, because its meaning depends on the segment type.
• For a code segment it selects the default address and operand sizes. One means 32-bit addresses
and 32-bit or 8-bit operands; zero corresponds to 16-bit addresses and 16-bit or 8-bit
operands. We are talking about the encoding of machine instructions here. This behavior
can be altered by preceding an instruction with the prefix 0x66 (to alter operand size) or
0x67 (to alter address size).
• For a stack segment (a data segment, specifically the one selected by ss),² it is
again the default operand size for call, ret, push/pop, etc. If the flag is set, operands are
32 bits wide and the instructions affect esp; otherwise operands are 16 bits wide and sp is
affected.
• For data segments growing toward lower addresses, it denotes their limit (0 for 64 KB, 1
for 4 GB). This bit should always be set in long mode.
As you can see, segmentation is quite a cumbersome beast. There are reasons it was never widely adopted
by operating systems and programmers alike (and is now pretty much abandoned):
• A flat address space without segmentation is easier for programmers.
• No commonly used programming language includes segmentation in its memory
model; memory is always flat, so it would be the compiler's job to set up segments (which is
hard to implement).
• Segments make memory fragmentation a disaster.
• A descriptor table can hold at most 8192 segment descriptors. How could such a small
number be used efficiently?
With the introduction of long mode, segmentation was purged from the processor, but not completely. It is
still used for protection rings, and thus a programmer should understand it.
2 In this case, the documentation names this flag B.
CHAPTER 3 ■ LEGACY
■ Why do we need separate descriptors for code and data? No combination of descriptor flags allows a
programmer to set up read/write permissions and execution permission simultaneously.
Even with the very small experience in assembly language we already have, it is not hard to decipher
this loader fragment showing an exemplary GDT. It is taken from Pure64, an open source operating system
loader. As it is executed before the operating system, it contains no user-level code or data descriptors
(see Listing 3-2).
As you see, writing to 8-bit or 16-bit parts leaves the rest of the bits intact. Writing to 32-bit parts, however,
fills the upper half of the wide register with zeros!
The reason is that the way programmers are used to perceiving a processor is quite different from how
things really work inside. In reality, the registers rax, eax, and all the others do not exist as fixed physical entities.
To explain this inconsistency, we first have to elaborate on two types of instruction sets: CISC and RISC.
The Intel 64 instruction set is indeed a CISC one. It has thousands of instructions—just look at the
second volume of [15]! However, these instructions are decoded and translated into a stream of simpler
microcode instructions. Here various optimizations take effect; the microcode instructions are reordered
and some of them can even be executed simultaneously. This is not a native feature of processors but rather
an adaptation aimed at better performance together with backward compatibility with older software.
It is quite unfortunate that there is not much information available on the microcode-level details of
modern processors. By reading technical reviews such as [17] and optimization manuals such as the one
provided by Intel, you can develop a certain intuition about it.
3.4.3 Explanation
Now back to the example shown in Listing 3-3. Let’s think about instruction decoding. The part of the CPU
called the instruction decoder constantly translates commands from the older CISC system into a more
convenient RISC one. Pipelines allow for simultaneous execution of up to six smaller instructions.
To achieve that, however, the notion of registers has to be virtualized. During microcode execution,
the decoder chooses an available register from a large bank of physical registers. As soon as the bigger
instruction ends, the effects become visible to the programmer: the values of some physical registers may be
copied to those currently assigned to be, let’s say, rax.
The data interdependencies between instructions stall the pipeline, decreasing performance. The worst
cases occur when the same register is read and modified by several consecutive instructions (think about
rflags!).
If modifying eax meant keeping the upper bits of rax intact, it would introduce an additional dependency
between the current instruction and whatever instruction modified rax or its parts before. By discarding the upper
32 bits on each write to eax we eliminate this dependency, because we no longer care about the previous
value of rax or its parts.
This new behavior was introduced with the latest growth of general purpose registers to 64 bits
and, for the sake of compatibility, does not affect operations on their smaller parts. Otherwise, most older
binaries would have stopped working, because assigning to, for example, bl would have modified the entire
ebx, which was not the case before 64-bit registers were introduced.
3.5 Summary
This chapter was a brief historical note on processor evolution over the last 30 years. We have also elaborated
on the intended use of segments back in the 32-bit era, as well as which leftovers of segmentation we are
stuck with for legacy reasons. In the next chapter we are going to take a closer look at the virtual memory
mechanism and its interaction with protection rings.
CHAPTER 4
Virtual Memory
This chapter covers virtual memory as implemented in Intel 64. We are going to start by motivating an
abstraction over physical memory and then get a general understanding of what it looks like from a
programmer’s perspective. Finally, we will dive into implementation details to achieve a more complete
understanding.
4.1 Caching
Let’s start with a truly omnipresent concept called caching.
The Internet is a big data storage. You can access any part of it, but the delay after you make a query can
be significant. To smooth your browsing experience, a web browser caches web pages and their elements
(images, style sheets, etc.). This way it does not have to download the same data over and over again. In
other words, the browser saves the data on the hard drive or in RAM (random access memory) to give much
faster access to a local copy. However, downloading the whole Internet is not an option, because the storage
on your computer is very limited.
A hard drive is much bigger than RAM but also a great deal slower. This is why all work with data is done
after preloading it into RAM. Thus, main memory is used as a cache for data from external storage.
And a hard drive has a cache of its own, too...
On the CPU die there are several levels of data caches (usually three: L1, L2, L3). Their size is much
smaller than the size of main memory, but they are much faster too (the level closest to the CPU is almost as
fast as registers). Additionally, CPUs possess at least an instruction cache (a queue storing instructions) and
a Translation Lookaside Buffer to improve virtual memory performance.
Registers are even faster than caches (and smaller), so they are a cache of their own.
Why is this situation so pervasive? In an information system that does not need to give strict guarantees
about its performance levels, introducing caches often decreases the average access time (the time between
a request and a response). To make it work we need our old friend locality: at each moment of time we only
have a small working set of data.
The virtual memory mechanism allows us, among other things, to use physical memory as a cache for
chunks of program code and data.
4.2 Motivation
Naturally, in a single-task system where only one program is running at any moment of time, it is
wise just to put it directly into physical memory starting at some fixed address. Other components (device
drivers, libraries) can also be placed into memory in some fixed order.
4.4 Features
Virtual memory is an abstraction over physical memory. Without it we would work directly with physical
memory addresses.
In the presence of virtual memory we can pretend that every program is the only memory consumer,
because it is isolated from others in its own address space.
The address space of a single process is split into pages of equal length (usually 4 KB). These pages are
then dynamically managed. Some of them can be backed up to external storage (in a swap file) and brought
back when needed.
Virtual memory offers some useful features by assigning an unusual meaning to memory operations
(read, write) on certain memory pages.
• We can communicate with external devices by means of Memory Mapped Input/
Output (e.g., by writing to the addresses, assigned to some device, and reading from
them).
• Some pages can correspond to files, taken from external storage with the help of the
operating system and file system.
• Some pages can be shared among several processes.
• Most addresses are forbidden—their value is not defined, and an attempt to access
them results in an error.1 This situation usually results in abnormal termination of
the program.
Linux and other Unix-based systems use a signal mechanism to notify
applications of exceptional situations. It is possible to assign a handler to almost all
types of signals.
Accessing a forbidden address will be intercepted by the operating system, which
will throw a SIGSEGV signal at the application. It is quite common to see an error
message, Segmentation fault, in this situation.
• Some pages correspond to files taken from storage (the executable file itself, libraries,
etc.), but some do not. These anonymous pages correspond to the memory regions of
stack and heap—dynamically allocated memory. They are called so because there
are no names in the file system to which they correspond. On the contrary, the image of the
running executable, data files, and devices (which are abstracted as files too) all have
names in the file system.
1 An interrupt #PF (Page Fault) occurs.
■ Question 47 Read about different replacement strategies. What other strategies exist?
Each process has a working set of pages. It consists of its pages that are present in physical memory.
■ Allocation What happens when a process needs more memory? It cannot get more pages on its own, so it
asks the operating system for more pages. The system provides it with additional addresses.
Dynamic memory allocation in higher-level languages (C++, Java, C#, etc.) eventually ends up querying pages
from the operating system, using the allocated pages until the process runs out of memory and then querying
more pages.
2 To find the process identifier, use standard programs such as ps or top.
Let’s write a simple program that enters a loop (and thus does not terminate) (Listing 4-1). It will
allow us to see its memory layout while it is running.
global _start
_start:
jmp _start
Now we have to examine the file /proc/?/maps, where ? is the process ID. See the complete terminal
contents in Listing 4-2.
The left column defines the memory region range. As you may notice, all regions are bounded by
addresses ending with three hexadecimal zeros. The reason is that they are composed of pages, each
4 KB (= 0x1000 bytes) in size.
We observe that different sections defined in the assembly file were loaded as different regions. The first
region corresponds to the code section and holds encoded instructions; the second corresponds to data.
As you see, the address space is huge and spans from the 0th to the (2^64 − 1)-th byte. However, only a few
addresses are allocated; the rest are forbidden.
The second column shows read, write, and execution permissions on pages. It also indicates whether
the page is shared among several processes or it is private to this specific process.
■ Question 48 Read about the meaning of the fourth (08:01) and fifth (144225) columns in man procfs.
So far we did nothing wrong. Now let’s try to write into a forbidden location.
; exit
mov rax, 60
xor rdi, rdi
syscall
We are accessing memory at address 0x3fffff, which is one byte before the code segment start. This
address is forbidden and hence the writing attempt results in a segmentation fault, as the message suggests.
4.6 Efficiency
Loading a missing page into physical memory from a swap file is a very costly operation, involving a huge
amount of work from the operating system. How come this mechanism turned out not only to be effective
memory-wise but also to perform adequately? The key success factors are as follows:
1. Thanks to locality, the need to load additional pages occurs rarely. In the worst
case we have indeed very slow access; however, such cases are extremely rare.
Average access time stays low.
In other words, we rarely try to access a page which is not loaded in physical
memory.
2. It is clear that efficiency could not be achieved without the help of special
hardware. Without a cache of translated page addresses, the TLB (Translation
Lookaside Buffer), we would have to use the translation mechanism all the time.
The TLB stores the starting physical addresses of pages we are likely to work
with. If we translate a virtual address inside one of these pages, the page start will
be immediately fetched from the TLB.
In other words, we rarely try to translate an address from a page whose location
in physical memory we have not recently looked up.
Remember that a program that uses less memory can be faster because it produces fewer page faults.
4.7 Implementation
Now we are going to dive into details and see how exactly the translation happens.
■ Note For now we are only talking about the dominant case of 4 KB pages. Page size can be tuned, and other
parameters will change accordingly; refer to section 4.7.3 and [15] for additional details.
The address itself is in fact only 48 bits wide; it is sign-extended to a 64-bit canonical address. Its
characteristic feature is that its 17 leftmost bits are equal. If this condition is not satisfied, the address is rejected
immediately when used.
Then the 48 bits of the virtual address are transformed into 52 bits of the physical address with the help of special
tables.3
■ Bus Error When accidentally using a non-canonical address, you will see another error message:
Bus error.
The physical address space is divided into slots to be filled with virtual pages. These slots are called page
frames. There are no gaps between them, so they always start at an address ending with 12 zero bits.
The least significant 12 bits of the virtual address and of the physical address correspond to the offset
inside the page, so they are equal.
The other four parts of the virtual address represent indexes into translation tables. Each table occupies
exactly 4 KB, filling an entire memory page. Each record is 64 bits wide; it stores a part of the next table’s
starting address as well as some service flags.
3 Theoretically we could support all 64 bits of physical addresses, but we do not need that many addresses yet.
First, we take the first table starting address from cr3. The table is called Page Map Level 4 (PML4).
Fetching elements from PML4 is performed as follows:
• Bits 51:12 are provided by cr3.
• Bits 11:3 are bits 47:39 of the virtual address.
• The last three bits are zeroes.
The entries of PML4 are referred to as PML4E. The next step, fetching an entry from the Page Directory
Pointer table, mimics the previous one:
• Bits 51:12 are provided by selected PML4E.
• Bits 11:3 are bits 38:30 of the virtual address.
• The last three bits are zeroes.
The process iterates through two more tables until at last we fetch the page frame address (to be precise,
its bits 51:12). The physical address will use them, and the remaining 12 bits will be taken directly from the
virtual address.
Are we going to perform so many memory reads instead of one now? Yes, it does look bulky. However,
thanks to the page address cache, the TLB, we usually access memory on pages that are already translated and
memorized. We only need to add the correct offset inside the page, which is blazingly fast.
The TLB is an associative cache: it quickly provides us with a translated page address given the starting
virtual address of the page.
Note that translation tables can themselves be cached for faster access. Figure 4-3 specifies the Page Table Entry
format.
■ On segmentation faults In general, segmentation faults occur when there is an attempt to access
memory with insufficient permissions (e.g., writing into read-only memory). Forbidden addresses can be
considered to have no valid permissions, so accessing them is just a particular case of memory access
with insufficient permissions.
The EXB (also called NX) bit forbids code execution. The DEP (Data Execution Prevention) technology is based
on it. When a program is being executed, parts of its input can be stored in the stack or in its data section. A malicious
user can exploit its vulnerabilities to mix encoded instructions into the input and then execute them. However,
if the data and stack section pages are marked with EXB, no instructions can be executed from them. The .text
section will remain executable, but it is usually protected from any modifications by the W bit anyway.
Figure 4-4. Page Directory Pointer table and Page Directory table entry format
This is controlled by the 7th bit in the respective PDP or PD entry. If it is set, the respective table maps
pages directly; otherwise, it stores the addresses of the next-level tables.
After a call to mmap, rax will hold a pointer to the newly allocated pages.
#define NAME 42
defines a substitution performed at compile time. Whenever a programmer writes NAME, the compiler
substitutes it with 42. This is useful for creating mnemonic names for various constants. NASM provides similar
functionality through the %define directive:
%define NAME 42
See section 5.1 “Preprocessor” for more details on how such substitutions are made.
Let’s take a look at a man page for mmap system call, describing its third argument prot.
The prot argument describes the desired memory protection of the mapping (and must not conflict
with the open mode of the file). It is either PROT_NONE or the bitwise OR of one or more of the following flags:
57
CHAPTER 4 ■ VIRTUAL MEMORY
PROT_NONE and its friends are examples of such mnemonic names for integers used to control mmap
behavior. Remember that both C and NASM allow you to perform compile-time computations on constant
values, including bitwise AND and OR operations. Following is an example of such computation:
Unless you are writing in C or C++, you will have to look these predefined values up somewhere and copy
them into your program.
Following is how to know the specific values of these constants for Linux:
1. Search them in header files of the Linux API in /usr/include.
2. Use one of the Linux Cross Reference (lxr) sites online, such as http://lxr.free-electrons.com.
We recommend the second way for now, as we do not know C yet. You may even use a search engine
such as Google and type lxr PROT_READ as a search query to get relevant results immediately after following the
first link.
For example, here is what LXR shows when queried for PROT_READ:
PROT_READ
So, we can write %define PROT_READ 0x01 at the beginning of the assembly file to use this constant
without memorizing its value.
%define O_RDONLY 0
%define PROT_READ 0x1
%define MAP_PRIVATE 0x2
section .data
; This is the file name. You are free to change it.
fname: db 'test.txt', 0
section .text
global _start
_start:
; call open
mov rax, 2
mov rdi, fname
mov rsi, O_RDONLY ; Open file read only
mov rdx, 0 ; We are not creating a file
; so this argument has no meaning
syscall
; mmap
mov r8, rax ; rax holds opened file descriptor
; it is the fourth argument of mmap
mov rax, 9 ; mmap number
mov rdi, 0 ; operating system will choose mapping destination
mov rsi, 4096 ; page size
mov rdx, PROT_READ ; new memory region will be marked read only
mov r10, MAP_PRIVATE ; pages will not be shared
4.10 Summary
In this chapter we have studied the concept and the implementation of virtual memory. We have elaborated
on it as a particular case of caching. Then we reviewed the different types of address spaces (physical,
virtual) and the connection between them through a set of translation tables. Then we dived into the virtual
memory implementation details.
Finally, we have provided a minimal working example of memory mapping using Linux system
calls. We will use it again in the assignment for Chapter 13, where we will base our dynamic memory
allocator on it. In the next chapter we are going to study the process of translation and linkage to see how an
operating system uses the virtual memory mechanism to load and execute programs.
■ Question 57 What is the virtual address space? How is it different from the physical one?
■ Question 65 Can we write a string in .text section? What happens if we read it? And if we overwrite it?
■ Question 66 Write a program that will call stat, open, and mmap system calls (check the system calls table
in Appendix C). It should output the file length and its contents.
■ Question 67 Write the following programs, which all map a text file input.txt containing an integer x in
memory using a mmap system call, and output the following:
CHAPTER 5
Compilation Pipeline
This chapter covers the compilation process. We divide it into three main stages: preprocessing, translation,
and linking. Figure 5-1 shows an exemplary compilation process. There are two source files: first.asm and
second.asm. Each is treated separately before the linking stage.
The preprocessor transforms the program source to obtain another program in the same language. The
transformations are usually substitutions of one string for another.
The compiler transforms each source file into a file with encoded machine instructions. However, such a
file is not yet ready to be executed, because it lacks the right connections with the other separately compiled
files. We are talking about cases in which instructions address data or instructions declared in
other files.
The linker establishes connections between files and makes an executable file. After that, the program is
ready to be run. Linkers operate on object files, whose typical formats are ELF (Executable and Linkable
Format) and COFF (Common Object File Format).
The loader accepts an executable file. Such files usually have a structured view with metadata included. It
then fills the fresh address space of a newborn process with its instructions, stack, globally defined data, and
runtime code provided by the operating system.
5.1 Preprocessor
Each program is created as text. The first stage of compilation is called preprocessing. During this stage,
a special program evaluates preprocessor directives found in the program source. According to them,
textual substitutions are made. As a result we get a modified source code, without preprocessor directives,
written in the same programming language. In this section we are going to discuss the usage of the NASM
macro processor.
To see the preprocessing results for an input file.asm, run nasm -E file.asm. It is often very useful for
debugging purposes. Let’s see the result in Listing 5-2 for the file in Listing 5-1.
mov rax, 42
The commands to declare substitutions are called macros. During a process called macro expansion
their occurrences are replaced with pieces of text. The resulting text fragments are called macro instances.
In Listing 5-2, a number 42 in line mov rax, cat_count is a macro instance. Names such as cat_count are
often referred to as preprocessor symbols.
It is important to understand that the preprocessor knows little to nothing about the programming language syntax.
The latter defines valid language constructions.
For example, the code shown in Listing 5-3 is correct. It doesn’t matter that neither a nor b alone
constitutes a valid assembly construction; as long as the final result of substitutions is syntactically valid, the
compiler is fine with it.
a b
In assembly and C people usually define global constants using macro definitions.
1 D. Knuth takes this idea to the extreme in his approach called Literate Programming.
Its action is simple: for each argument it creates a quad word data entry. As you see, arguments
are referred to by their indices starting at 1. When this macro is defined, the line test 666, 555, 444 will be
replaced by the lines shown in Listing 5-5.
■ Question 68 Find more examples of %define and %macro usage in NASM documentation.
%if x == 10
%elif x == 15
%elif x == 200
mov rax, 0
%else
mov rax, rbx
%endif
Listing 5-7 shows an instantiated macro. Remember, you can check the preprocessing result using nasm -E.
The condition is an expression similar to what you might see in high-level languages: arithmetic and
logical connectives (and, or, not).
As you can see, the symbol flag is not defined here using the %define directive. Thus, we get the line
labeled hellostring.
It is worth mentioning that preprocessor symbols can be defined directly when calling NASM, thanks to
the -d key. For example, the macro condition in Listing 5-8 will be satisfied when NASM is called with the
-d myflag argument.
In the next sections we are going to see more preprocessor directives similar to %if.
pushr rax
pushr rflags
Listing 5-10 shows what the two macros in Listing 5-9 become after instantiation.
push rax
pushf
As you can see, the macro adjusts its behavior based on the argument’s text representation. Notice that
%else clauses are allowed just as for a regular %if. To make the comparison case insensitive, use the %ifidni
directive instead.
%ifnum %1
mov rdi, %1
call print_uint
%else
%error "String literals are not supported yet"
%endif
%endif
%endmacro
The indentation is completely optional and is done for the sake of readability.
In case the argument is neither a string nor an identifier, we use the %error directive to force NASM to
throw an error. If we had used %fatal instead, assembling would have stopped completely and
any further errors would be ignored; a simple %error, however, gives NASM a chance to signal
subsequent errors too before it stops processing input files.
Let’s observe the macro instantiations in Listing 5-12.
%define d i * 3
%xdefine xd i * 3
%assign a i * 3
mov rax, d
mov rax, xd
mov rax, a
; let's redefine i
%define i 100
mov rax, d
mov rax, xd
mov rax, a
mov rax, 1 * 3
mov rax, 1 * 3
mov rax, 3
5.1.8 Repetition
The times directive is executed after all macro definitions are fully expanded and thus cannot be used to
repeat pieces of macros.
But there is another way NASM can make macro loops: by placing the loop body between %rep and
%endrep directives. Loops can be executed only a fixed number of times, specified as the %rep argument.
The example in Listing 5-15 shows how the preprocessor calculates the sum of integers from 1 to 10 and then uses
this value to initialize a global variable result.
result: dq a
After preprocessing, the result value is correctly initialized to 55 (see Listing 5-16). You can check it
manually.2
result: dq 55
We can use %exitrep to leave the loop immediately. It is thus analogous to the break statement in
high-level languages.
2 A simple formula for the sum of the first n natural numbers is n(n + 1)/2.
%endif
%endrep
db current ; n
%assign n n+1
%endrep
By accessing the n-th element of the is_prime array we can find out whether n is a prime number.
After preprocessing, the code shown in Listing 5-18 will be generated:
db 1
■ Question 70 Modify the macro so that it produces a bit table, taking eight times less space in
memory. Add a function that checks a number for primality and returns 0 or 1, based on this precomputed table.
■ Hint For the macro you will probably have to copy and paste a lot.
mymacro
mymacro
mymacro
The macro mymacro is instantiated three times. Each time the local label gets a unique name: the base
name (after the double percent sign) is prepended with a numerical prefix that differs in each instance. The first
prefix is ..@0., the second is ..@1., and so on.
..@0.labelname:
%line 6+0 macro_local_labels/macro_local_labels.asm
..@0.labelname:
%line 7+1 macro_local_labels/macro_local_labels.asm
..@1.labelname:
%line 8+0 macro_local_labels/macro_local_labels.asm
..@1.labelname:
%line 9+1 macro_local_labels/macro_local_labels.asm
..@2.labelname:
%line 10+0 macro_local_labels/macro_local_labels.asm
..@2.labelname:
5.1.11 Conclusion
You can think of macros as a programming meta-language executed during compilation. It can do
quite complex computations but is limited in two ways:
• These computations cannot depend on user input (so they can only operate on
constants).
• Loops can be executed no more than a fixed number of times. This means that while-
like constructions are impossible to encode.
5.2 Translation
A compiler usually translates source code from one language into another. In the case of translation
from high-level programming languages into machine code, this process incorporates multiple inner steps.
During these stages the code’s IR (Intermediate Representation) is gradually pushed toward the target
language; each successive IR is closer to it. Right before producing machine code the IR is
very close to assembly, so instead of encoding instructions we can flush it into a readable assembly listing.
Not only is translation a complex process, it also loses information about source code structure, so
reconstructing readable high-level code from the assembly file is impossible.
A compiler works with atomic code entities called modules. A module usually corresponds to a source
code file (but not a header or include file). Each module is compiled independently of the other
modules. An object file is produced from each module. It contains binary encoded instructions but usually
cannot be executed right away. There are several reasons for that.
For instance, the object file is compiled separately from other files but refers to outside code and data.
It is not yet known where that code or data will reside in memory, nor the position of the object file itself.
The assembly language translation is quite straightforward because the correspondence between
assembly mnemonics and machine instructions is almost one to one. Apart from label resolution there is
not much nontrivial work. Thus, for now we will concentrate on the following compilation stage, namely,
linking.
5.3 Linking
Let’s return to our first examples of assembly programs. To transform a “Hello, world!” program from source
code to executable file, we used the following two commands:
We used NASM first to produce an object file. Its format, elf64, was specified by the -f key. Then we
used another program, ld (a linker), to produce a file ready to be executed. We will take this file format as an
example to show you what the linker really does.
3. Shared object files can be loaded when needed by the main program. They are
linked to it dynamically. In Windows these are the well-known dll files; in *nix
systems their names often end with .so.
The purpose of any linker is to make an executable (or shared) object file, given a set of relocatable
ones. To do that, a linker must perform the following tasks:
• Relocation.
• Symbol resolution. Each time a symbol (function, variable) is referenced, the linker
has to modify the object file and fill the part of the instruction corresponding to the operand
address with the correct value.
5.3.1.1 Structure
An ELF file starts with the main header, which stores global meta-information.
See Listing 5-21 for a typical ELF header. The hello file is a result of compiling a “Hello, world!”
program shown in Listing 2-4.
ELF files then provide information about a program that can be observed from two points of view:
• Linking view, consisting of sections.
It is described by section table, which is accessible through readelf -S.
Each section in turn can be:
– Raw data to be loaded into memory.
– Formatted metadata about other sections, used by loader (e.g., .bss), linker (e.g., relocation
tables), or debugger (e.g., .line).
3 Not to be confused with preprocessor symbols!
section .bss
bssvar1: resq 4*1024*1024
bssvar2: resq 1
section .text
extern somewhere
global _start
mov rax, datavar1
mov rax, bssvar1
mov rax, bssvar2
mov rdx, datavar2
_start:
jmp _start
ret
textlabel: dq 0
This program uses the extern and global directives to mark symbols in different ways. These two
directives control the creation of the symbol table. By default, all symbols are local to the current module.
extern declares a symbol that is defined in another module but referenced in the current one. On the other
hand, global defines a globally available symbol that other modules can refer to by declaring it extern
inside them.
■ Avoid confusion Do not confuse global and local symbols with global and local labels!
GNU binutils is a collection of binary tools used to work with object files. It includes several tools
for exploring object file contents. Several of them are of particular interest to us.
• If you only need to look up the symbol table, use nm.
• Use objdump as a universal tool to display general information about an object file. In
addition to ELF, it supports other object file formats.
• If you know that the file is in ELF format, readelf is often the best and most
informative choice.
Let’s feed this program to objdump to produce the results shown in Listing 5-23.
Listing 5-23. The symbol table shown by objdump
SYMBOL TABLE:
0000000000000000 l df *ABS* 0000000000000000 main.asm
0000000000000000 l d .data 0000000000000000 .data
0000000000000000 l d .bss 0000000000000000 .bss
0000000000000000 l d .text 0000000000000000 .text
0000000000000000 l .data 0000000000000000 datavar1
0000000000000008 l .data 0000000000000000 datavar2
0000000000000000 l .bss 0000000000000000 bssvar1
0000000002000000 l .bss 0000000000000000 bssvar2
0000000000000029 l .text 0000000000000000 textlabel
0000000000000000 *UND* 0000000000000000 somewhere
0000000000000028 g .text 0000000000000000 _start
We are shown a symbol table, where each symbol is annotated with useful information. What do its
columns mean?
1. The virtual address of the given symbol. For now we do not know the section
starting addresses, so all virtual addresses are given relative to the section start.
For example, datavar1 is the first variable stored in .data; its address is 0, and its
size is 8 bytes. The second variable, datavar2, is located in the same section at a
greater offset of 8, next to datavar1. As somewhere is declared extern, it is located
in some other module, so for now its address has no meaning and is left zero.
2. A string of seven letters and spaces; each letter characterizes the symbol in some
way. Some of them are of interest to us.
(a) l, g, or blank: local, global, or neither.
(b) …
(c) …
(d) …
(e) I or blank: a link to another symbol, or an ordinary symbol.
(f) d, D, or blank: a debug symbol, a dynamic symbol, or an ordinary symbol.
(g) F, f, O, or blank: a function name, a file name, an object name, or an ordinary symbol.
3. The section this symbol corresponds to: *UND* means an undefined section (the
symbol is referenced but not defined here); *ABS* means no section at all.
4. Usually, this number shows an alignment (or its absence).
5. Symbol name.
For example, let’s investigate the first symbol shown in Listing 5-23. It is:
f: a file name,
d: needed only for debugging purposes,
l: local to this module.
The global label _start (which is also an entry point) is marked with the letter g in the second column.
■ Note Symbol names are case sensitive: _start and _STaRT are different.
As the addresses in the symbol table are not yet real virtual addresses but addresses relative to sections,
we might ask ourselves: how do these look in machine code? NASM has already performed its duty, and
the machine instructions have already been assembled. We can look inside interesting sections of object files by
invoking objdump with the parameter -D (disassemble) and, optionally, -M intel-mnemonic (to make it show
Intel-style syntax rather than AT&T). Listing 5-24 shows the results.
■ How to read disassembly dumps The left column is usually the absolute address where the data will be
loaded. Before linking, it is an address relative to the section start.
The mov operands in section .text at offsets 0 and 14 relative to the section start should be the datavar1
address, but they are equal to zero! The same thing happens with bssvar1 and bssvar2. It means that the linker
has to change the compiled machine code, filling in the right absolute addresses in instruction arguments. To
achieve that, all references to each symbol are remembered in a relocation table. As soon as the linker knows
what a symbol's true virtual address will be, it goes through the list of the symbol's occurrences and fills in the holes.
A separate relocation table exists for each section in need of one.
To see the relocation tables, use readelf --relocs. See Listing 5-25.
An alternative way to display the symbol table is to use the more lightweight and minimalistic nm utility.
For each symbol it shows the symbol’s virtual address, type, and name. Note that the type flag is in a different
format than objdump's. See Listing 5-26 for a minimal example.
Listing 5-26. nm
> nm main.o
0000000000000000 b bssvar
0000000000000000 d datavar
U somewhere
000000000000000a T _start
000000000000000b t textlabel
Listing 5-27. executable_object.asm
section .data
somewhere: dq 999
private: dq 666
section .text
func:
mov rax, somewhere
ret
We are going to compile it as usual using nasm -f elf64 and then link it using ld with the previous object
file, obtained by compiling the file shown in Listing 5-22. Listing 5-28 shows the changes in the objdump output.
SYMBOL TABLE:
00000000004000b0 l d .code 0000000000000000 .code
00000000006000bc l d .data 0000000000000000 .data
0000000000000000 l df *ABS* 0000000000000000 executable_object.asm
00000000006000c4 l .data 0000000000000000 private
00000000006000bc g .data 0000000000000000 somewhere
0000000000000000 *UND* 0000000000000000 _start
00000000006000cc g .data 0000000000000000 __bss_start
00000000004000b0 g F .code 0000000000000000 func
00000000006000cc g .data 0000000000000000 _edata
00000000006000d0 g .data 0000000000000000 _end
The flags are different: now the file can be executed (EXEC_P), and there are no more relocation tables
(the HAS_RELOC flag is cleared). The virtual addresses are now final, and so are the addresses in code. This file
is ready to be loaded and executed. It retains a symbol table; if you want to cut it out to make the
executable smaller, use the strip utility.
■ Question 71 Why does ld issue a warning if _start is not marked global? Look up the entry point address in
this case by using readelf with appropriate arguments.
■ Question 72 Find out the ld option to automatically strip the symbol table after linking.
While static libraries are just undercooked executables without entry points, dynamic libraries have
some differences, which we are going to look at now.
Dynamic libraries are loaded when they are needed. As they are object files in their own right, they carry all
kinds of meta-information about which code they provide for external use. This information is used by a
loader to determine the exact addresses of exported functions and data.
Dynamic libraries can be shipped separately and updated independently. This is both good and bad.
While the library manufacturer can provide bug fixes, they can also break backward compatibility by, for
example, changing function arguments, effectively shipping a delayed-action mine.
A program can work with any number of shared libraries. Such libraries should be loadable at any
address. Otherwise, every one of them would be stuck at a fixed address, which puts us in exactly the same situation as
when we try to execute multiple programs in the same physical memory address space. There are two
ways to achieve that:
• We can perform relocation at runtime, when the library is being loaded. However,
this steals a very attractive feature from us: the possibility of reusing library code in
physical memory without duplication when several processes are using it. If each
process relocates the library to a different address, the corresponding pages
become patched with different address values and thus become different for different
processes.
Effectively, the .data section would be relocated anyway because of its mutable
nature. Renouncing global variables allows us to throw away both the section and
the need to relocate it.
Another problem is that the .text section must be left writable in order to modify
it during the relocation process. This introduces certain security risks,
leaving it open to modification by malicious code. Moreover, changing the .text of
every shared object when multiple libraries are required for an executable to run
can take a great deal of time.
• We can write PIC (position-independent code). It is now possible to write code
that can be executed no matter where it resides in memory. For that we have to
get rid of absolute addresses completely. Modern processors support rip-relative
addressing, such as mov rax, [rip + 13]; this feature facilitates PIC generation.
This technique allows for .text section sharing. Today programmers are strongly
encouraged to use PIC instead of relocations.
■ Note Whenever you use non-constant global variables, you prevent your code from being
reentrant, that is, executable by multiple threads simultaneously without changes. Consequently,
you will have difficulties reusing it in a shared library. This is one of many arguments against global mutable
state in programs.
Dynamic libraries spare disk space and memory. Remember that pages may be marked either private or
shared among several processes. If a library is used by multiple processes, most parts of it are not duplicated
in physical memory.
We will show you how to build a minimal shared object now. However, we will defer the explanation of
things like Global Offset Tables and Procedure Linkage Tables until Chapter 15.
Listing 5-29 shows minimal shared object contents. Notice the external symbol _GLOBAL_OFFSET_TABLE_
and the :function specification for the global symbol func. Listing 5-30 shows a minimal launcher that calls a
function in a shared object file and exits correctly.
global func:function
section .rodata
message: db "Shared object wrote this", 10, 0
section .text
func:
mov rax, 1
mov rdi, 1
mov rsi, message
mov rdx, 25 ; message length: 24 characters plus the newline
syscall
ret
extern func
section .text
_start:
mov rdi, 10
call func
mov rdi, rax
mov rax, 60
syscall
Listing 5-31 shows the build commands and two views of an ELF file.
Notice that the dynamic library has additional sections such as .dynsym. The sections .hash, .dynsym, and
.dynstr are necessary for relocation.
• .dynsym stores symbols visible from outside the library.
• .hash is a hash table, needed to decrease the symbol search time in .dynsym.
• .dynstr stores strings, requested by their indices from .dynsym.
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .hash HASH 00000000000000e8 000000e8
000000000000002c 0000000000000004 A 2 0 8
[ 2] .dynsym DYNSYM 0000000000000118 00000118
0000000000000090 0000000000000018 A 3 2 8
[ 3] .dynstr STRTAB 00000000000001a8 000001a8
000000000000001e 0000000000000000 A 0 0 1
[ 4] .rela.dyn RELA 00000000000001c8 000001c8
0000000000000018 0000000000000018 A 2 0 8
[ 5] .text PROGBITS 00000000000001e0 000001e0
000000000000001c 0000000000000000 AX 0 0 16
[ 6] .rodata PROGBITS 00000000000001fc 000001fc
000000000000001a 0000000000000000 A 0 0 4
[ 7] .eh_frame PROGBITS 0000000000000218 00000218
0000000000000000 0000000000000000 A 0 0 8
[ 8] .dynamic DYNAMIC 0000000000200218 00000218
00000000000000f0 0000000000000010 WA 3 0 8
[ 9] .got.plt PROGBITS 0000000000200308 00000308
0000000000000018 0000000000000008 WA 0 0 8
[10] .shstrtab STRTAB 0000000000000000 00000320
0000000000000065 0000000000000000 0 0 1
[11] .symtab SYMTAB 0000000000000000 00000388
00000000000001c8 0000000000000018 12 15 8
[12] .strtab STRTAB 0000000000000000 00000550
000000000000004f 0000000000000000 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), l (large)
I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
O (extra OS processing required) o (OS specific), p (processor specific)
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .interp PROGBITS 0000000000400158 00000158
000000000000000f 0000000000000000 A 0 0 1
[ 2] .hash HASH 0000000000400168 00000168
0000000000000028 0000000000000004 A 3 0 8
[ 3] .dynsym DYNSYM 0000000000400190 00000190
0000000000000078 0000000000000018 A 4 1 8
[ 4] .dynstr STRTAB 0000000000400208 00000208
0000000000000027 0000000000000000 A 0 0 1
[ 5] .rela.plt RELA 0000000000400230 00000230
0000000000000018 0000000000000018 AI 3 6 8
■ Question 73 Study the symbol tables for an obtained shared object using readelf --dyn-syms and
objdump -ft.
■ Question 75 Separate the first assignment into two modules. The first module will store all functions
defined in lib.inc. The second will have the entry point and will call some of these functions.
■ Question 76 Take one of the standard Linux utilities (from coreutils). Study its object file structure using
readelf and objdump.
The things we observed in this section apply in most situations. However, there is a bigger picture of
different code models that affect addressing. We will dive into those details in Chapter 15 after getting
more familiar with assembly and C. There we will also revisit dynamic libraries and introduce the
notions of the Global Offset Table and the Procedure Linkage Table.
5.3.5 Loader
The loader is the part of the operating system that prepares an executable file for execution. This includes
mapping its relevant sections into memory, initializing .bss, and sometimes mapping other files from disk.
The program headers for the file symbols.asm, shown in Listing 5-22, are given in Listing 5-32.
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000000e3 0x00000000000000e3 R E 200000
LOAD 0x00000000000000e4 0x00000000006000e4 0x00000000006000e4
0x0000000000000010 0x000000000200001c RW 200000
As we can see, the program header is telling us the truth about section placement.
■ Note In some cases, you will find that the linker needs to be finely tuned. The section loading addresses
and relative placement can be adjusted by using linker scripts, which describe the resulting file. Such cases
usually occur when you are programming an operating system or a microcontroller firmware. This topic is
beyond the scope of this book, but we recommend that you look at [4] in case you encounter such a need.
x1:
dq x2
dq 100
x2:
dq x3
dq 200
x3:
dq 0
dq 300
Linked lists are often useful in situations with numerous insertions and removals in the middle
of the list. Accessing elements by index, however, is hard, because it does not boil down to simple pointer
addition. Linked list elements’ mutual positions in flat memory are usually not predictable.
In this assignment the dictionary will be constructed statically as a list, and each newly defined element
will be prepended to it. You have to use macros with local labels and symbol redefinition to automate the
linked list creation. We explicitly instruct you to make a macro colon with two arguments, where the first
will hold a dictionary key string and the second will hold the internal element representation name. This
differentiation is needed because key strings can sometimes contain characters that are not part of valid
label names (spaces, punctuation, arithmetic signs, etc.). Listing 5-35 shows an example of such a dictionary.
5.5 Summary
In this chapter we have looked at the different compilation stages. We have studied the NASM
macro processor in detail and learned about its conditionals and loops. Then we talked about three object file types:
relocatable, executable, and shared. We elaborated on the ELF file structure and observed the relocation
process performed by the linker. We have touched on shared object files, and we will revisit them
in Chapter 15.
■ Question 95 Is there a difference between a static library and a relocatable object file?
CHAPTER 6
Interrupts and System Calls
The IOPL field in the rflags register works as follows: if the current privilege level is less than or equal to the
IOPL, the following instructions are allowed to be executed:
• in and out (normal input/output).
• ins and outs (string input/output).
• cli and sti (clear/set interrupt flag).
Thus, setting IOPL individually for an application allows us to forbid it from performing input/output even
if it is working at a higher privilege level than other user applications.
Additionally, Intel 64 allows an even finer permission control through an I/O permission bit map. If the
IOPL check has failed, the processor checks the bit corresponding to the used port. The operation proceeds
only if this bit is not set.
The I/O permission bit map is a part of the Task State Segment (TSS), which was created to be an entity
unique to a process. However, as the hardware task-switching mechanism is considered obsolete, only one
TSS (and hence one I/O permission bit map) is used in long mode.
CHAPTER 6 ■ INTERRUPTS AND SYSTEM CALLS
The first 16 bits store an offset to an Input/Output Port Permission Map, which we already discussed in
section 6.1. The TSS then holds eight pointers to special interrupt stack tables (ISTs) and stack pointers for
different rings. Each time a privilege level changes, the stack is automatically changed accordingly. Usually,
the new rsp value will be taken from the TSS field corresponding to the new protection ring. The meaning of
ISTs is explained in section 6.2.
6.2 Interrupts
Interrupts allow us to change the program control flow at an arbitrary moment in time. While a program
is executing, external events (a device requires CPU attention) or internal events (division by zero, insufficient
privilege level to execute an instruction, a non-canonical address) may provoke an interrupt, which results
in some other code being executed. This code is called an interrupt handler and is part of the operating
system or driver software.
In [15], Intel separates external asynchronous interrupts from internal synchronous exceptions, but
both are handled alike.
Each interrupt is labeled with a fixed number, which serves as its identifier. For us it is not important
exactly how the processor acquires the interrupt number from the interrupt controller.
When the n-th interrupt occurs, the CPU consults the Interrupt Descriptor Table (IDT), which resides in
memory. Analogously to the GDT, its address and size are stored in idtr. Figure 6-2 describes the idtr.
Each entry in the IDT takes 16 bytes, and the n-th entry corresponds to the n-th interrupt. An entry
incorporates some utility information as well as the address of the interrupt handler. Figure 6-3 describes the
interrupt descriptor format.
■ Question 96 What are non-maskable interrupts? What is their connection with the interrupt with code 2
and IF flag?
The application code is executed with low privileges (in ring3). Direct device control is only possible
at higher privilege levels. When a device requires attention by sending an interrupt to the CPU, the handler
should be executed in a higher-privilege ring, which requires changing the segment selector.
What about the stack? The stack should also be switched. Here we have several options, based on how
we set up the IST field of the interrupt descriptor.
• If the IST is 0, the standard mechanism is used. When an interrupt occurs, ss is
loaded with 0, and the new rsp is loaded from the TSS. The RPL field of ss is then set to an
appropriate privilege level. Then the old ss and rsp are saved on this new stack.
• If an IST is set, one of the seven ISTs defined in the TSS is used. The reason ISTs exist is
that some serious faults (non-maskable interrupts, double fault, etc.) profit from
being executed on a known good stack. So, a system programmer might create several
stacks even for ring0 and use some of them to handle specific interrupts.
There is a special int instruction, which accepts an interrupt number. It invokes the interrupt handler
manually, respecting its descriptor contents. It ignores the IF flag: whether it is set or cleared, the handler
will be invoked. To control the execution of privileged code via the int instruction, a DPL field exists.
Before an interrupt handler starts its execution, some registers are automatically saved on the stack. These
are ss, rsp, rflags, cs, and rip. See the stack diagram in Figure 6-4. Note how segment selectors are padded
to 64 bits with zeros.
Sometimes an interrupt handler needs additional information about the event. An interrupt error
code is then pushed onto the stack. This code contains information specific to this type of interrupt.
Many interrupts are described using special mnemonics in the Intel documentation. For example, the
13th interrupt is referred to as #GP (general protection).¹ You will find a short description of some
interesting interrupts in Table 6-1.
Not all binary code corresponds to correctly encoded machine instructions. When rip does not address
a valid instruction, the CPU generates the #UD interrupt.
The #GP interrupt is very common. It is generated when you try to dereference a forbidden address
(one that does not correspond to any allocated page), when you try to perform an action requiring a higher
privilege level, and so on.
The #PF interrupt is generated when addressing a page whose present flag is cleared in the
corresponding page table entry. This interrupt is used to implement the swapping mechanism and file
mapping in general. The interrupt handler can load missing pages from disk.
Debuggers rely heavily on the #DB interrupt. When the TF flag is set in rflags, the interrupt with
this code is generated after each instruction is executed, allowing step-by-step program execution.
Evidently, this interrupt is handled by the OS. It is thus the OS’s responsibility to provide an interface for user
applications that allows programmers to write their own debuggers.
To sum up, when the n-th interrupt occurs, the following actions are performed from a programmer’s
point of view:
1. The IDT address is taken from idtr.
2. The interrupt descriptor is located starting from the 16 × n-th byte of the IDT.
3. The segment selector and the handler address are loaded from the IDT entry into
cs and rip, possibly changing the privilege level. The old ss, rsp, rflags, cs, and rip
are stored on the stack as shown in Figure 6-4.
4. For some interrupts, an error code is pushed on top of the handler’s stack. It provides
additional information about the interrupt cause.
5. If the descriptor’s type field defines it as an Interrupt Gate, the interrupt flag IF is
cleared. A Trap Gate, however, does not clear it automatically, allowing nested
interrupt handling.
¹ See section 6.3.1 of the third volume of [15].
If the interrupt flag were not cleared immediately when the interrupt handler starts, we could have no
guarantee of executing even its first instruction before another interrupt appeared asynchronously and
required our attention.
■ Question 97 Is the TF flag cleared automatically when entering interrupt handlers? Refer to [15].
The interrupt handler ends with an iretq instruction, which restores all the registers saved on the stack as
shown in Figure 6-4, in contrast to the simple ret instruction, which restores only rip.
• STAR (MSR number 0xC0000081), which holds two pairs of cs and ss values: one for
the system call handler and one for the sysret instruction. Figure 6-5 shows its structure.
• LSTAR (MSR number 0xC0000082) holds the system call handler address (the new rip).
• SFMASK (MSR number 0xC0000084) shows which bits in rflags should be cleared in
the system call handler.
The syscall instruction performs the following actions:
• Loads cs and ss from STAR;
• Saves rflags into r11 and then clears the bits specified by SFMASK;
• Saves rip into rcx; and
• Initializes rip with the LSTAR value.
Note that now we can explain why system calls and procedures accept arguments in slightly different
sets of registers. Procedures accept their fourth argument in rcx, which, as we know, is used by syscall to
store the old rip value; system calls therefore take their fourth argument in r10 instead.
Contrary to interrupts, even if the privilege level changes, the stack pointer must be changed by
the handler itself.
System call handling ends with the sysret instruction, which loads cs and ss from STAR and rip from rcx.
As we know, a segment selector change leads to a read from the GDT to update its paired shadow
register. However, when executing syscall, these shadow registers are loaded with fixed values, and no
reads from the GDT are performed.
Here are these two fixed values in deciphered form:
• Code Segment shadow register:
– Base = 0
– Limit = FFFFFH
– Type = 11 (can be executed, was accessed)
– S = 1 (a code or data descriptor, as opposed to a system one)
– DPL = 0
– P = 1
– L = 1 (long mode)
– D = 0
– G = 1 (always the case in long mode)
6.4 Summary
In this chapter we have provided an overview of interrupts and system call mechanisms. We have studied
their implementation down to the system data structures residing in memory. In the next chapter we
are going to review different models of computation, including stack machines akin to Forth and finite
automatons, and finally work on a Forth interpreter and compiler in assembly language.
■ Question 103 How is #PF error related to the swapping? How does the operating system use it?
■ Question 106 Why does the interrupt handler need a DPL field?
■ Question 108 Does a single thread application have only one stack?
■ Question 112 How are the model-specific registers used in the system call mechanism?
CHAPTER 7
Models of Computation
In this chapter we are going to study two models of computation: finite state machines and stack machines.
A model of computation is akin to the language you use to describe the solution to a problem.
Typically, a problem that is really hard to solve correctly in one model of computation can be close to
trivial in another. This is the reason programmers who are knowledgeable about many different models
of computation can be more productive. They solve a problem in the model of computation that is most
suitable and then implement the solution with the tools they have at their disposal.
When you are trying to learn a new model of computation, do not think about it from the “old” point of
view, like trying to think about finite state machines in terms of variables and assignments. Try to start fresh
and logically build the new system of notions.
We already know much about Intel 64 and its model of computation, derived from von Neumann’s. This
chapter will introduce finite state machines (used to implement regular expressions) and stack machines
akin to the Forth machine.
Why bother with automatons? Some tasks are particularly easy to solve when applying such a paradigm
of thinking. Such tasks include controlling embedded devices and searching for substrings that match a certain
pattern.
For example, suppose we are checking whether a string can be interpreted as an integer number. Let’s draw a
diagram, shown in Figure 7-1. It defines several states and shows possible transitions between them.
• The alphabet consists of letters, spaces, digits, and punctuation signs.
• The set of states is {A, B, C}.
• The initial state is A.
• The final state is C.
We start execution from state A. Each input symbol causes us to change the current state based on the
available transitions.
■ Note Arrows labeled with symbol ranges like 0…9 actually denote multiple rules. Each of these rules
describes a transition for a single input character.
Table 7-1 shows what will happen when this machine is being executed with an input string +34. This is
called a trace of execution.
Table 7-1. Tracing a finite state machine shown in Figure 7-1, input is: +34
The machine has arrived at the final state C. However, given the input idkfa, we could not have
arrived at any state, because there are no rules to react to such input symbols. This is where the
automaton’s behavior is undefined. To make it total and always arrive at either a yes-state or a no-state, we have
to add one more final state and add rules to all existing states. These rules should direct the execution into
the new state in case no old rule matches the input symbol.
The empty string has zero ones, and zero is an even number. Because of this, the state A is both the starting
and the final state.
All zeros are ignored no matter the state. However, each one occurring in the input changes the state to the
opposite one. If, given an input string, we arrive at the final state A, then the number of ones is even. If we
arrive at the final state B, then it is odd.
To implement the exemplary automaton in assembly, we will make it total first, as shown in Figure 7-3.
We will then modify this automaton a bit to force the input string to be null-terminated, as shown in Figure 7-4.
Listing 7-1 shows a sample implementation.
Figure 7-4. Check if the string is a number: a total automaton for a null-terminated string
_A:
call getsymbol
cmp al, '+'
je _B
cmp al, '-'
je _B
; The indices of the digit characters in ASCII
; tables fill a range from '0' = 0x30 to '9' = 0x39
; This logic implements the transitions to labels
; _E and _C
cmp al, '0'
jb _E
cmp al, '9'
ja _E
jmp _C
_B:
call getsymbol
cmp al, '0'
jb _E
cmp al, '9'
ja _E
jmp _C
_C:
call getsymbol
; check for the null terminator before the digit range test,
; otherwise the zero byte would be rejected as "below '0'"
test al, al
jz _D
cmp al, '0'
jb _E
cmp al, '9'
ja _E
jmp _C
_D:
; code to notify about success
_E:
; code to notify about failure
This automaton arrives at state D or E; the control will be passed to the instructions at either the
_D or the _E label.
The code can be isolated inside a function returning either 1 (true) in state _D or 0 (false) in state _E.
■ Question 114 Draw a finite state machine to count the words in the input string. The input length is no
more than eight symbols.
Finite state machines are often used to describe embedded systems, such as coffee machines. The
alphabet consists of events (buttons pressed); the input is a sequence of user actions.
Network protocols can often also be described as finite state machines. Every rule can be annotated
with an optional output action: “if a symbol X is read, change state to Y and output a symbol Z.” The input
consists of packets received and global events such as timeouts; the output is a sequence of packets sent.
There are also several verification techniques, such as model checking, that allow one to prove certain
properties of finite automatons: for example, “if the automaton has reached state B, it will never reach
state C.” Such proofs can be of great value when building systems required to be highly reliable.
■ Question 115 Draw a finite state machine to check whether there is an even or an odd number of words in
the input string.
■ Question 116 Draw and implement a finite state machine to answer whether a string should be trimmed
from left, right, or both or should not be trimmed at all. A string should be trimmed if it starts or ends with
consecutive spaces.
These rules allow us to define complex search patterns. The regular expression engine will try to
match the pattern starting at every position in the text.
Regular expression engines usually follow one of two approaches:
• Using a straightforward approach, trying to match all described symbol sequences. For
example, matching a string ab against the regular expression aa?a?b may result in the following
sequence of events:
1. Trying to match against aaab: failure.
2. Trying to match against aab: failure.
3. Trying to match against ab: success.
So, we try out different branches of decisions until we hit a successful one
or until we see definitively that all options lead to a failure.
This approach is usually quite fast and also simple to implement. However, there
is a worst-case scenario in which the complexity starts growing exponentially.
Imagine matching a string:
aaa...a (repeat a n times)
against a regular expression:
a?a?a?...a?aaa...a (repeat a? n times, then repeat a n times)
The given string will surely match the regular expression. However, when
applying the straightforward approach the engine will have to go through all
possible strings that match this regular expression. To do that, it will consider
two possible options for each a? expression, namely, those containing a and
those not containing it. There will be 2^n such strings, as many as there are
subsets in a set of n elements. You do not need more symbols than there are
in this line of text to write a regular expression that a modern computer will
evaluate for days or even years. Even for a length n = 50 the number of options
will hit 2^50 = 1125899906842624.
Such regular expressions are called “pathological” because, due to the nature of the matching
algorithm, they are handled extremely slowly.
• Constructing a finite state machine based on a regular expression.
It is usually an NFA (Non-deterministic Finite Automaton). As opposed to a DFA
(Deterministic Finite Automaton), an NFA can have multiple rules for the same state
and input symbol. When such a situation occurs, the automaton performs both
transitions and is now in several states simultaneously. In other words, there is no
single state but a set of states the automaton is in.
This approach is a bit slower in general but has no worst-case scenario with exponential
running time. Standard Unix utilities such as grep use this approach.
How do we build an NFA from a regular expression? The rules are fairly
straightforward:
– A character corresponds to an automaton which accepts a string of one such
character, as shown in Figure 7-5.
– We can enlarge the alphabet with additional symbols, which we put at the beginning
and end of each line.
– An asterisk has a transition to itself and a special kind of rule called an ϵ-rule,
which can fire at any moment without consuming an input symbol. Figure 7-7 shows the
automaton for an expression a*b.
■ Question 117 Using any language you know, implement a grep analogue based on NFA construction. You
can refer to [11] for additional information.
■ Question 118 Study this regular expression: ˆ1?$|ˆ(11+?)\1+$. What might be its purpose? Imagine that
the input is a string consisting solely of the character 1. How does the result of matching this regular expression
correlate with the string length?
7.2.1 Architecture
Let’s start by studying a Forth abstract machine. It consists of a processor, two separate stacks for data and
return addresses, and linear memory, as shown in Figure 7-8.
The stacks do not necessarily have to be part of the same memory address space.
The Forth machine has a parameter called cell size. Typically, it is equal to the machine word size of the
target architecture. In our case, the cell size is 8 bytes. The stack consists of elements of the same size.
Programs consist of words separated by spaces or newlines. Words are executed consecutively.
Integer words denote pushing a number onto the data stack. For example, to push the numbers 42, 13, and 9
onto the data stack you can simply write 42 13 9.
There are three types of words:
1. Integer words, described previously.
2. Native words, written in assembly.
3. Colon words, written in Forth as a sequence of other Forth words.
The return stack is necessary to be able to return from the colon words, as we will see later.
Most words manipulate the data stack. From now on when speaking about the stack in Forth we will
implicitly consider the data stack unless specified otherwise.
The words take their arguments from the stack and push the result there. All instructions operating on
the stack consume their operands. For example, words +, -, *, and / consume two operands from the stack,
perform an arithmetic operation, and push its result back in the stack. A program 1 4 8 8 + * + computes
the expression (8 + 8) * 4 + 1.
We will follow the convention that the second operand is popped from the stack first. It means that the
program '1 2 -' evaluates to −1, not 1.
The word : is used to define new words. It is followed by the new word’s name and a list of other words
terminated by the word ;. Both semicolon and colon are words on their own and thus should be separated
by spaces.
A word sq, which takes an argument from the stack and pushes its square back, will look as follows:
: sq dup * ;
Each time we use sq in the program, two words will be executed: dup (duplicate cell in top of the stack)
and * (multiply two words on top of the stack).
To describe the word’s actions in Forth it is common to use stack diagrams:
swap (a b -- b a)
In parentheses you see the stack state before and after word execution. The stack cells are named to
highlight the changes in stack contents. So, the swap word swaps the two topmost elements of the stack.
The topmost element is on the right, so the diagram 1 2 corresponds to Forth pushing first 1, then 2 as a
result of execution of some words.
rot places the third number from the stack on top:
rot (a b c -- b c a)
Now we are going to execute discr a b c step by step for some numbers a, b, and c. The stack state at
the end of each step is shown on the right.
a ( a )
b ( a b )
c ( a b c )
rot ( b c a )
4 ( b c a 4 )
* ( b c (a*4) )
* ( b (c*a*4) )
swap ( (c*a*4) b )
sq ( (c*a*4) (b*b) )
swap ( (b*b) (c*a*4) )
- ( (b*b - c*a*4) )
1 ( 1 )
2 ( 1 2 )
3 ( 1 2 3 )
rot ( 2 3 1 )
4 ( 2 3 1 4 )
* ( 2 3 4 )
* ( 2 12 )
swap ( 12 2 )
sq ( 12 4 )
swap ( 4 12 )
- ( -8 )
7.2.3 Dictionary
A dictionary is a part of a Forth machine that stores word definitions. Each word is a header followed by a
sequence of other words.
The header stores the link to the previous word (as in linked lists), the word name itself as a
null-terminated string, and some flags. We have already studied a similar data structure in the assignment
described in section 5.4. You can reuse a great part of its code to facilitate defining new Forth words. See
Figure 7-9 for the word header generated for the discr word described in section 7.2.2.
Each word stores the address of its native implementation (assembly code) immediately after the
header. For colon words the implementation is always the same: docol. The implementation is called using
the jmp instruction.
An execution token is the address of this cell, which points to an implementation. So, an execution token is an
address of an address of the word implementation. In other words, given the address A of a word entry in the
dictionary, you can obtain its execution token by simply adding the total header size to A.
Listing 7-3 provides us with a sample dictionary. It contains two native words (starting at w_plus and
w_dup) and a colon word (w_sq).
dq xt_plus
dq xt_exit
last_word: dq w_double
section .text
plus_impl:
pop rax
add rax, [rsp]
mov [rsp], rax
jmp next
dup_impl:
push qword [rsp]
jmp next
The core of the Forth engine is the inner interpreter. It is a simple assembly routine fetching code from
memory. It is shown in Listing 7-4.
exit:
mov pc, [rstack]
add rstack, 8
jmp next
docol saves PC on the return stack and sets the new PC to the first execution token stored inside the
current word. The return is performed by exit, which restores PC from the stack.
This mechanism is akin to the call/ret instruction pair.
■ Question 119 Read [32]. What is the difference between our approach (indirect threaded code) and direct
threaded code and subroutine threaded code? What advantages and disadvantages can you name?
To better grasp the concept of indirect threaded code and the innards of Forth, we prepared a
minimal example shown in Listing 7-6. It uses routines developed in the first assignment from section 2.7.
Take your time to launch it (the source code is shipped with the book) and check that it really reads a
word from input and outputs it back.
global _start
%define pc r15
%define w r14
%define rstack r13
section .bss
resq 1023
rstack_start: resq 1
input_buf: resb 1024
section .text
; Initializes registers
xt_init: dq i_init
i_init:
mov rstack, rstack_start
mov pc, main_stub
jmp next
; Exits program
xt_bye: dq i_bye
i_bye:
mov rax, 60
xor rdi, rdi
syscall
7.2.5 Compiler
Forth can work in either interpreter or compiler mode. The interpreter just reads commands and executes them.
When executing the colon word :, Forth switches into compiler mode. Additionally, the colon : reads
the next word and uses it to create a new entry in the dictionary with docol as the implementation. Then Forth
reads words, locates them in the dictionary, and adds them to the current word being defined.
So, we have to add another variable, called here, which stores the address of the current position where
words are written in compile mode. Each write advances here by one cell.
To quit compiler mode we need special immediate words. They are executed no matter which mode
we are in. Without them we would never be able to exit compiler mode. The immediate words are marked
with an immediate flag.
The interpreter pushes numbers onto the stack. The compiler cannot embed them in words directly,
because they would be treated as execution tokens. Trying to launch a command by an execution token 42
will most certainly result in a segmentation fault. The solution is to use a special word lit followed
by the number itself. The purpose of lit is to read the next integer that PC points at and advance PC one
cell further, so that PC never points at the embedded operand.
■ Question 120 Look up the documentation for the commands sete, setl, and their counterparts.
It is convenient to store PC and W in general purpose registers, especially the ones that are
guaranteed to survive function calls unchanged (callee-saved): r13, r14, or r15.
Compare two ways of defining Forth dictionary: without macros (shown in Listing 7-8) and with them
(shown in Listing 7-9).
section .text
plus_impl:
pop rax
add [rsp], rax
jmp next
Listing 7-9. forth_dict_example_macro.asm
native '+', plus
pop rax
add [rsp], rax
jmp next
Then define a macro colon, analogous to the previous one. Listing 7-10 shows its usage.
Do not forget about docol address in every colon word! Then create and test the following assembly
routines:
• find_word, which accepts a pointer to a null-terminated string and returns the address
of the word header start. If there is no word with such a name, zero is returned.
• cfa (code field address), which takes the word header start and skips the whole
header until it reaches the XT value.
Using these two functions and the ones you have already written in section 2.7, you can write an
interpreter loop. The interpreter will either push a number into the stack or fill the special stub, consisting of
two cells, shown in Listing 7-11.
It should write the freshly found execution token to program_stub. Then it should point PC at the
stub start and jump to next. It will execute the word we have just parsed and then pass control back to the
interpreter.
Remember that an execution token is just an address of an address of assembly code. This is why the
second cell of the stub points at the third, and the third stores the interpreter address—we simply feed this
data to the existing Forth machinery.
Remember that the Forth machine also has memory. We are going to pre-allocate 65536 Forth cells for it.
■ Question 122 Should we allocate these cells in .data section, or are there better options?
To let Forth know where the memory is, we are going to create the word mem, which will simply push the
memory starting address on top of the stack.
rot (a b c -- b c a)
swap (a b -- b a)
dup (a -- a a)
drop (a -- )
• Input/output:
key ( -- c)—reads one character from stdin. The top cell of the stack stores
8 bytes: a zero-extended character code.
emit ( c -- )—writes one symbol to stdout.
number ( -- n )—reads a signed integer number from stdin (guaranteed to fit
into one cell).
• mem—stores the user memory starting address on top of the stack.
• Working with memory:
! (address data -- )—stores data from stack starting at address.
c! (address char -- )—stores a single byte by address.
@ (address -- value)—reads one cell starting from address.
c@ (address -- charvalue)—reads a single byte starting from address.
Then test the resulting interpreter.
Then create a memory region for the return stack and implement docol and exit. We recommend
allocating a register to point at the return stack’s top.
Implement colon-words or and greater using macro colon and test them.
7.3.2 Compilation
Now we are going to implement compilation. It is easy!
1. We need to allocate another 65536 Forth cells for the extensible part of the dictionary.
2. Add a variable state, which is equal to 1 when in compilation mode, 0 for
interpretation mode.
3. Add a variable here, which points at the first free cell in the preallocated dictionary
space.
4. Add a variable last_word, which stores the address of the last word defined.
5. Add two new colon words, namely, : and ;.
Colon:
1: word ← stdin
2: Fill the new word’s header starting at here. Do not forget to update it!
3: Add docol address immediately at here; update here.
4: Update last_word.
5: state ← 1;
6: Jump to next.
■ Question 123 Why do we need a separate case for branch and 0branch?
• word ( addr -- len ) Reads a word from stdin and stores it starting at address addr.
The word length is pushed onto the stack.
• number ( str -- num len) Parses an integer from a string.
• prints ( addr -- ) Prints a null-terminated string.
• bye Exits Forthress.
• syscall ( call num a1 a2 a3 a4 a5 a6 -- new rax ) Executes a syscall. The
following registers store the arguments (according to the ABI): rdi, rsi, rdx, r10, r8, and r9.
• branch <offset> Jumps to a location. The location is an offset relative to the argument's end.
For example:
7.4 Summary
This chapter has introduced us to two new models of computation: finite state machines (also known as
finite automata) and stack machines akin to the Forth machine. We have seen the connection between
finite state machines and regular expressions, used in multiple text editors and other text processing utilities.
We have completed the first part of our journey by building a Forth interpreter and compiler, which we
consider a wonderful summary of our introduction to assembly language. In the next chapter we are going
to switch to the C language to write higher-level code. Your knowledge of assembly will serve as a foundation
for your understanding of C, because its model of computation is close to the classical von Neumann
model of computation.
■ Question 134 What is the implementation difference between embedded and colon words?
■ Question 135 Why are two stacks used in Forth?
■ Question 136 Which are the two distinct modes that Forth is operating in?
■ Question 137 Why does the immediate flag exist?
■ Question 138 Describe the colon word and the semicolon word.
■ Question 143 When an integer literal is encountered, do interpreter and compiler behave alike?
■ Question 144 Add an embedded word to check the remainder of a division of two numbers. Write a word
to check that one number is divisible by another.
■ Question 145 Add an embedded word to check the remainder of a division of two numbers. Write a word
to check a number for primality.
■ Question 146 Write a Forth word to output the first n numbers of the Fibonacci sequence.
■ Question 147 Write a Forth word to perform system calls (it will take the register contents from stack).
Write a word that will print “Hello, world!” in stdout.
PART II
Basics
In this chapter we are going to start exploring another language called C. It is a low-level language with
quite minimal abstractions over assembly. At the same time it is expressive enough to let us illustrate
some very general concepts and ideas applicable to all programming languages (such as type systems or
polymorphism).
C provides almost no abstraction over memory, so the memory management task is the programmer’s
responsibility. Unlike in higher-level languages, such as C# or Java, the programmer must allocate and free
the reserved memory himself, instead of relying on an automated system of garbage collection.
C is a portable language, so if you write your code correctly, it can often be executed on other
architectures after a simple recompilation. The reason is that the model of computation in C is practically
the same old von Neumann model, which makes it close to the programming models of most processors.
When learning C remember that despite the illusion of being a higher-level language, it does not tolerate
errors, nor will the system be kind enough to always notify you about things in your program that were
broken. An error can show itself much later, on another input, in a completely irrelevant part of the program.
■ Language standard described The most important document describing the language is the C language
standard. You can acquire a PDF file of the standard draft online for free [7]. This document is just as important
for us as the Intel Software Developer’s Manual [15].
8.1 Introduction
Before we start, we need to state several important points.
• C is always case sensitive.
• C does not care about spacing as long as the parser can separate lexemes from one
another. The programs shown in Listing 8-1 and Listing 8-2 are equivalent.
• There are different C language standards. We do not study the GNU C (a version
possessing various extensions), which is supported mostly by GCC. Instead, we
concentrate on C89 (also known as ANSI C or C90) and C99, which are supported by
many different compilers. We will also mention several new features of C11, some of
which are not mandatory to implement in compilers.
Unfortunately C89 still remains the most pervasive standard, so there are
compilers that support C89 for virtually every existing platform. This is why we will
focus on this specific revision first and then extend it with the newer features.
To force the compiler to use only those features supported by a certain standard
we use the following set of flags:
– -std=c89 or -std=c99 to select either the C89 or C99 standard.
– -pedantic-errors to disable non-standard language extensions.
– -Wall to show all warnings no matter how important they are.
– -Werror to transform warnings into errors so you would not be able to compile code with
warnings.
■ Warnings are errors It is a very bad practice to ship code that does not compile without warnings.
Warnings are emitted for a reason.
Sometimes there are very specific cases in which people are forced to do non-standard things, such as calling a
function with more arguments than it accepts, but such cases are extremely rare. In these cases it is much better
to turn off one specific warning type for one specific file via a corresponding compiler flag. Sometimes compiler
directives can make the compiler omit a certain warning for a selected code region, which is even better.
For example, to compile an executable file main from source files file1.c and file2.c you could use
the following command:
gcc -std=c89 -pedantic-errors -Wall -Werror -o main file1.c file2.c
This command will make a full compilation pass, including object file generation and linking.
• Global variables (declared outside functions). For example, we can create a global
variable i_am_global of type int initialized to 42 outside all function scopes. Note that
global variables can only be initialized with constant values.
• Comments starting at // until the end of the line (in C99 and more recent).
int x; // this is a one line comment, which ends at the end of the line
#define CATS_COUNT 42
#define ADD(x, y) (x) + (y)
Inside functions, we can define variables or data types local to this function, or perform actions. Each
action is a statement; these are usually separated by a semicolon. The actions are performed sequentially.
You cannot define functions inside other functions.
Statements will declare variables, perform computations and assignments, and execute different
branches of code depending on conditions. A special case is a block between curly braces {}, which is used
to group statements.
Listing 8-3 shows an exemplary C program. It outputs Hello, world! y=42 x=43. It defines a function
main, which declares two variables x and y; the first is equal to 43, and the second is computed as the value
of x minus one. Then a call to the function printf is performed.
The function printf is used to output strings to stdout. Parts of the string (so-called format
specifiers) are replaced by the following arguments. The format specifier, as its name suggests, provides
information about the nature of the argument, usually including its size and the presence of a sign. For now,
we will use very few format specifiers.
• %d for int arguments, as in the example.
• %f for float arguments.
Variable declarations, assignments, and function calls, each ended by a semicolon, are statements.
■ Spare printf for format output Whenever possible, use puts instead of printf. This function can only
output a single string (and ends it with a newline); no format specifiers are taken into account. Not only is it
faster, but it also works uniformly with all strings and lacks the security flaws described in section 14.7.3.
CHAPTER 8 ■ BASICS
For now, we will always start our programs with the line #include <stdio.h>. It allows us to access a part
of the standard C library. However, we state firmly that this is not a library import of any sort and should never be
treated as one.
/* `main` is the entry point for the program, like _start in assembly
* Actually, the hidden function _start is calling `main`.
* `main` returns the `return code` which is then given to the `exit` system
* call.
* The `void` keyword instead of argument list means that `main` accepts no
* arguments */
int main(void) {
/* A variable local to `main`. It will be destroyed as soon as `main` ends */
int x = 43;
int y;
y = x - 1;
/* Calling a standard function `printf` with three arguments.
 * It will print 'Hello, world! y=42 x=43'.
 * Each %d will be replaced by the consecutive arguments */
printf( "Hello, world! y=%d x=%d\n", y, x);
return 0;
}
A literal is a sequence of characters in the source code which represents an immediate value. In C,
literals exist for
• Integers, for example, 42.
• Floating point numbers, for example, 42.0.
• ASCII-code of characters, written in single quotes, for example, 'a'.
• Pointers to null-terminated strings, for example, "abcde".
The execution of any C program is essentially data manipulation.
The C abstract machine has a von Neumann architecture. This is on purpose, because C is a
language that should be as close to the hardware as possible. Variables are stored in the linear memory
and each of them has a starting address.
You can think of variables as labels in assembly.
Static typing means that all types are known at compile time. There can be absolutely no uncertainty about
data types. Whether you are using a variable, a literal, or a more complex expression which evaluates to
some data, its type will be known.
Weak typing means that sometimes a data element can be implicitly converted to another type when
appropriate.
For example, when evaluating 1 + 3.0 it is apparent that these two numbers have different types. One
of them is an integer; the other is a real number. You cannot directly add one to the other, because their binary
representations differ. You need to convert them both to the same type (probably a floating point number).
Only then will you be able to perform an addition. In strongly typed languages, such as OCaml, this
operation is not permitted; instead, there are two separate operations to add numbers: one acts on integers
(and is written +), the other on real numbers (written +. in OCaml).
Weak typing is present in C for a reason: in assembly, it is absolutely possible to take virtually any data and
interpret it as data of another type (a pointer as an integer, part of a string as an integer, etc.).
Let’s see what happens when we try to output a floating point value as an integer (see Listing 8-4). The
result will be the floating point value reinterpreted as an integer, which does not make much sense.
int main(void) {
printf("42.0 as an integer %d \n", 42.0);
return 0;
}
This program’s output depends on the target architecture. In our case, the output was
For this brief introductory section, we will consider that all types in C fall into one of these categories:
• Integer numbers (int, char, …).
• Floating point numbers (double and float).
• Pointer types.
• Composite types: structures and unions.
• Enumerations.
In Chapter 9 we are going to explore the type system in more detail. If you come from a background in a
higher-level language, you might find some commonly known items missing from this list. Unfortunately,
there are no string and Boolean types in C89. An integer value equal to zero is considered false; any non-zero
value is considered true.
8.3.1 if
Listing 8-5 shows an if statement with an optional else part. If the condition is satisfied, the first block
is executed; otherwise, the second block is executed. The second block is not mandatory.
if (x > 3) {
puts("X is greater than 3");
}
else
{
puts("X is less than or equal to 3");
}
The braces are optional. Without braces, only one statement will be considered part of each branch, as
shown in Listing 8-6.
Notice that there is a syntactic ambiguity, called the dangling else. Check Listing 8-7 and see whether you can
say for certain that the else branch belongs to the first or to the second if. To resolve this ambiguity in the
case of nested ifs, use braces.
if (x == 0) {
if (y == 0) { puts("A"); }
else { puts("B"); }
}
if (x == 0) {
if (y == 0) { puts("A"); }
} else { puts("B"); }
8.3.2 while
A while statement is used to make loops.
If the condition is satisfied, then the body is executed. Then the condition is checked once again, and if
it is satisfied, then the body is executed again, and so on.
An alternative form do ... while ( condition ); allows you to check conditions after executing the
loop body, thus guaranteeing at least one iteration. Listing 8-9 shows an example.
Notice that a body can be empty, as follows: while (x == 0);. The semicolon after the parentheses
ends this statement.
8.3.3 for
A for statement is ideal for iterating over finite collections, such as linked lists or arrays. It has the following
form: for ( initializer ; condition ; step ) body. Listing 8-10 shows an example.
First, the initializer is executed. Then there is a condition check, and if it holds, the loop body is
executed, and then the step statement.
In this case, the step statement is an increment operator ++, which modifies a variable by increasing its
value by one. After that, the loop begins again by checking the condition, and so on. Listing 8-11 shows two
equivalent loops.
/* as a `while` loop */
i = 0;
while ( i < 10 ) {
puts("Hello!");
i = i + 1;
}
/* as a `for` loop */
for( i = 0; i < 10; i = i + 1 ) {
puts("Hello!");
}
The break statement is used to end the loop prematurely and fall through to the next statement in the code.
continue ends the current iteration and starts the next iteration right away. Listing 8-12 shows an example.
Note also that in the for loop, the initializer, step, or condition expressions can be left empty.
Listing 8-13 shows an example.
8.3.4 goto
A goto statement allows you to make jumps to a label inside the same function. As in assembly, labels can
mark any statement, and the syntax is the same: label: statement. This is often described as bad code style;
however, it can be quite handy when encoding finite state machines. What you should not do is
abandon well-thought-out conditionals and loops for goto-spaghetti.
The goto statement is sometimes used as a way to break out of several nested loops. However, this is
often a symptom of bad design, because the inner loops can be abstracted away inside a function (thanks
to compiler optimizations, probably at no runtime cost at all). Listing 8-14 shows how to use goto to
break out of all inner loops.
The goto statement mixed with the imperative style makes analyzing the program behavior harder for both
humans and machines (compilers), so the clever optimizations modern compilers are capable of become
less likely, and the code becomes harder to maintain. We advocate restricting goto usage to pieces of code that
perform no assignments, like implementations of finite state machines. This way you won’t have to trace all the
possible program execution routes and how the values of certain variables change when the program executes one
way or another.
8.3.5 switch
A switch statement is used like multiple nested ifs when the condition tests some integer variable
for equality against one of several values. Listing 8-15 shows an example.
default: /* otherwise... */
puts( "It is not one nor two" );
break;
}
Every case is, in fact, a label. The cases are not limited by anything but an optional break statement to
leave the switch block. This allows for some interesting hacks.1 However, a forgotten break is usually a source of
bugs. Listing 8-16 shows these two behaviors: first, several labels are attributed to the same case, meaning that no
matter whether x is 0, 1, or 10, the code executed will be the same. Then, as no break ends this case,
after executing the first printf the control falls through to the next instruction, labeled case 15, another printf.
int main(void) {
int i;
for( i = 1; i < 11; i++ )
printf( "%d \n", first_divisor( i ) );
return 0;
}
The nature of a Fibonacci sequence implies that it is ascending, so if we have found a member greater than n and still
have not encountered n, we conclude that n is not in the sequence. The function is_fib accepts an integer
n and calculates all elements less than or equal to n. If the last element of this sequence is n, then n is a Fibonacci
number and it returns 1; otherwise, it returns 0.
int is_fib(int n) {
    int a = 1;
    int b = 1;
    if ( n == 1 ) return 1;
    while ( a <= n ) {
        int t;
        if (n == a || n == b) return 1;
        t = b;
        b = a;
        a = t + a;
    }
    return 0;
}
int main(void) {
int i;
for( i = 1; i < 11; i = i + 1 ) {
check( i );
}
return 0;
}
Expressions are data, so they can be used on the right side of the assignment operator =. Some
expressions can also be used on the left side of the assignment. They should correspond to data entities
having an address in memory.2
Such expressions are called lvalues; all other expressions, which have no address, are called rvalues. This
difference is actually very intuitive as long as you think in terms of the abstract machine. Expressions such as
those shown in Listing 8-20 make no sense on the left side of an assignment, because an assignment means a
memory change.
1 + 3;
42;
square(3);
3. Control flow statements: if, while, for, switch. They do not require a semicolon.
2 We are talking about the abstract C machine memory here. Of course, the compiler has the right to optimize variables and
never allocate real memory for them at the assembly level. The programmer, however, is not constrained by this and can
think of every variable as an address of a memory cell.
We have already talked about assignments; the evil truth is that assignments are expressions
themselves, which means that they can be chained. For example, a = b = c means
• Assign c to b;
• Assign the new b value to a.
A typical assignment is thus a statement from the first category: an expression ended by a semicolon.
Assignment is a right-associative operation. It means that when being parsed by a compiler (or your
eye) the parentheses are implicitly placed from right to left, the rightmost part becoming the most deeply
nested. Listing 8-22 provides an example of two equivalent ways to write a complex assignment.
On the other hand, left-associative operations assume the opposite nesting order, as shown in
Listing 8-23.
Most operators have an evident meaning. We will mention some of the less used and more obscure ones.
• The increment and decrement operators can be used in either prefix or postfix
form: for a variable i, the increment is written i++ or ++i. Both expressions have
an immediate effect on i: it is incremented by 1. However, the value of i++
is the "old" i, while the value of ++i is the "new," incremented i.
• There is a difference between logical and bit-wise operators. For logical operators,
any non-zero number is essentially the same in its meaning, while the bit-wise
operations are applied to each bit separately. For example, 2 & 4 is equal to zero,
because no bits are set in both 2 and 4. However, 2 && 4 will return 1, because
both 2 and 4 are non-zero numbers (truth values).
• Logical operators are evaluated in a lazy way. Consider the logical and operator
&&. When applied to two expressions, the first expression will be computed. If
its value is zero, the computation ends immediately, because of the nature of
AND operation. If any of its operands is zero, the result of the big conjunction
will be zero as well, so there is no need to evaluate it further. It is important for
us because this behavior is noticeable. Listing 8-24 shows an example where the
program will output F and will never execute the function g.
#include <stdio.h>
int f(void) { puts("F"); return 0; }
int g(void) { puts("G"); return 1; }
int main(void) {
    f() && g();
    return 0;
}
8.5 Functions
We can draw a line between procedures (which do not return a value) and functions (which return a value
of a certain type). The procedure call cannot be embedded into a more complex expression, unlike the
function call.
Listing 8-25 shows an exemplary procedure. Its name is myproc; it returns void, so it does not return
anything. It accepts two integer parameters named a and b.
Listing 8-26 shows an exemplary function. It accepts two arguments and returns a value of type int.
A call to this function is used as a part of a more complex expression later.
Every function's execution should end with a return statement; otherwise, the value it returns is
undefined. Procedures can omit the return keyword; it may still be used without an operand to
return from the procedure immediately.
When there are no arguments, the keyword void should be used in the function declaration, as shown in
Listing 8-27.
The body of a function is a block statement: it is enclosed in braces and not ended with a semicolon.
Each block defines a lexical scope for variables.
All variables must be declared at the block start, before any statements. This restriction is present in
C89 but not in C99. We will adhere to it to make the code more portable.
Additionally, it enforces a certain self-discipline. If you have a large number of local variables declared at
the scope start, it will look cluttered. At the same time, that is usually a sign of bad program decomposition and/
or a poor choice of data structures.
Listing 8-28 shows examples of good and bad variable declarations.
/* Bad: in C89, all declarations must precede the statements of a block */
void f(void) {
int y = 12;
printf( "%d", y);
int x = 10; /* declaration after a statement: invalid in C89 */
...
}
/* Good: any block can have additional variables declared in its beginning */
/* `x` is local to one `for` iteration and is always reinitialized to 10 */
for( i = 0; i < 10; i++ ) {
int x = 10;
}
If a variable in some scope has the same name as a variable declared in an enclosing scope, the
more recent variable hides the older one. There is no way to refer to the hidden variable syntactically
(other than having stored its address somewhere beforehand and using that address).
The local variables of different functions can, of course, have the same names.
■ Note The variables are visible until the end of their respective blocks. So the commonly used notion of 'local'
variables is in fact block-local, not function-local. The rule of thumb is: make variables as local as you can (including
variables local to loop bodies, for example). It greatly reduces program complexity, especially in large projects.
8.6 Preprocessor
The C preprocessor is acting similar to the NASM preprocessor. Its power, though, is much more limited. The
most important preprocessor directives you are going to see are
• #define
• #include
• #ifndef
• #endif
The #define directive is very similar to its NASM %define counterpart. It has three main usages.
• Defining global constants (see Listing 8-29 for an example).
As you see, the value of x will not be 25 but 4+(1*4)+1 = 9, because multiplication has a higher precedence
than addition.
The #include directive pastes the given file contents in place of itself. The file name is enclosed in either
quotes (#include "file.h") or angle brackets (#include <stdio.h>).
• In case of angle brackets, the file is searched in a set of predefined directories. For GCC
it is usually:
– /usr/local/include
– <libdir>/gcc/target/version/include
Here <libdir> stands for the directory that holds libraries (a GCC setting) and
is usually /usr/lib or /usr/local/lib by default.
– /usr/target/include
– /usr/include
Using the -I flag one can add directories to this list. You can make a special include/
directory in your project root and add it to the GCC include search list.
• In case of quotes, the files are also searched in the current directory.
You can get the preprocessor output for a file filename.c in the same way as when working
with NASM: gcc -E filename.c. This will execute all preprocessor directives and write the result to
stdout without compiling anything.
8.7 Summary
In this chapter we have covered the C basics. All variables are labels in the memory of the C language abstract
machine, whose architecture greatly resembles the von Neumann architecture. After describing a universal
program structure (functions, data types, global variables, . . . ), we have defined two syntactical categories:
statements and expressions. We have seen that expressions are either lvalues or rvalues and learned to
control the program execution using function calls and control statements such as if and while. We are
already able to write simple programs which perform computations on integers. In the next chapter we are
going to discuss the type system in C and the types in general to get a bigger picture of how types are used
in different programming languages. Thanks to the notion of arrays our possible input and output data will
become much more diverse.
■ Question 150 What is the difference between statements and expressions?
■ Question 154 How are true and false values encoded in C89?
CHAPTER 9
Type System
The notion of type is one of the key notions in programming. A type is essentially a tag assigned to a data entity. Every data
transformation is defined for specific data types, which helps ensure correctness (you would not want to add
the number of active Reddit users to the average temperature at noon in the Sahara, because it makes no sense).
This chapter will study the C type system in depth.
• Despite the name making a direct reference to the word “character,” this is an
integer type and should be treated as such. It is often used to store the ASCII code
of a character, but it can be used to store any 1-byte number.
• A literal 'x' corresponds to the ASCII code of the character "x." Its type is int,
but it is safe to assign it to a variable of type char.1
Listing 9-1 shows an example.
2. int
• An integer number.
• Can be signed and unsigned. It is signed by default.
• It can be aliased simply as: signed, signed int (similar for unsigned).
• Can be short (2 bytes), long (4 bytes on 32-bit architectures, 8 bytes in Intel 64). Most
compilers also support long long, but up to C99 it was not part of standard.
• Other aliases: short, short int, signed short, signed short int.
• The size of int without modifiers varies depending on architecture. It was designed
to be equal to the machine word size. In the 16-bit era the int size was obviously
2 bytes, in 32-bit machines it is 4 bytes. Unfortunately, this did not prevent
programmers from relying on an int of size 4 in the era of 32-bit computing.
Because of the large pool of software that would break if we change the size of int,
its size is left untouched and remains 4 bytes.
• It is important to note that all integer literals have type int by default. By adding
the suffixes L or UL we explicitly state that these numbers are of type long int or
unsigned long int. Sometimes it is of utter importance not to forget these suffixes.
Consider the expression 1 << 48. Its value is not 2^48 as you might have thought,
but 0. Why? The reason is that 1 is a literal of type int, which occupies 4 bytes
and thus can vary from −2^31 to 2^31 − 1. By shifting 1 to the left 48 times, we are
moving the only set bit outside of the int format. Thus the result is zero. However,
if we add the correct suffix, the answer becomes more evident. The expression 1L
<< 48 evaluates to 2^48, because 1L is 8 bytes long.
3. long long
• In x64 architecture it is the same as a long (except for Windows, where long is
4 bytes).
• Its size is 8 bytes.
• Its range is: −2^63 … 2^63 − 1 for signed and 0 … 2^64 − 1 for unsigned.
1
This language design flaw is corrected in C++, where 'x' has type char.
4. float
• Floating point number.
• Its size is 4 bytes.
• Its range is: ±1.17549 × 10^−38 … ±3.40282 × 10^38 (approximately six digits of precision).
5. double
• Floating point number.
• Its size is 8 bytes.
• Its range is: ±2.22507 × 10^−308 … ±1.79769 × 10^308 (approximately 15 digits of precision).
6. long double
• Floating point number.
• Its size is usually 80 bits.
• Unlike long long, it has been part of the language since C89.
First of all, remember that floating point types are a very rough approximation of the real numbers. For
example, they are more precise near 0 and less precise for big values. This is exactly the reason their range is
so great compared even to long integers.
As a consequence, doing floating point arithmetic with values closer to zero yields more precise results.
Finally, in certain contexts (e.g., kernel programming) floating point arithmetic is not available. As a rule of
thumb, avoid it when you do not need it. For example, if your computations can be performed by manipulating a
quotient and a remainder, calculated using the / and % operators, you should stick with them.
int b = 129;
char k = (char)b; //???
Surely, this wonderful open world of possibilities is better controlled by your benevolent dictatorship,
because these implicit conversions often lead to subtle bugs when an expression does not evaluate to what it
"should" evaluate to.
For example, as char is a (usually signed) number in the range −128 … 127, the number 129 is too big to
fit into this range. The result of the action shown in Listing 9-2 is not described in the language standard,
but given how typical processors and compilers function, the result will probably be a negative number
consisting of the same bits as the unsigned representation of 129.
■ Question 158 What will be the value of k? Try to compile the example and see on your own computer.
■ Note Remember that long long appeared only in C99 (long double, in contrast, dates back to C89). It is,
however, supported as a language extension by many compilers that do not support C99 yet.
The “convert to int first” rule means that the overflows in lesser types can be handled differently than in
int type itself. The example shown in Listing 9-3 assumes that sizeof(int) == 4.
2
The keyword is usual arithmetic conversions.
In the last line, neither x, y, nor z is promoted to long, because the standard does not require it. The
arithmetic will be performed within the int type, and then the result will be converted to long.
■ Be understood As a rule of thumb, when uncertain, always provide the types explicitly! For example, you
can write long x = (long)a + (long)b + (long)c.
While the code might seem more verbose after that, it will at least work as intended.
Let’s look at an example shown in Listing 9-4. The expression in the third line will be computed as follows:
1. The value of i will be converted to float (of course, the variable itself will not
change);
2. This value is added to the value of f, the resulting type is float again; and
3. This result is converted to double to be stored in d.
All these operations are not free: they are encoded as assembly instructions. It means that whenever you
operate on numbers of different formats, it probably has runtime costs. Try to avoid it, especially in loops.
9.1.5 Pointers
Given a type T, one can always construct the type T*. This new type corresponds to data units which hold the
address of another entity of type T.
As all addresses have the same size, all pointer types have the same size as well. It is architecture
specific and, in our case, is 8 bytes.
Using the operators & and * one can take the address of a variable or dereference a pointer (look into
memory at the address this pointer stores). Listing 9-5 shows an example.
In section 2.5.4 we discussed a subtle problem: if a pointer is just an address, how do we know the size
of the data entity we are trying to read starting from this address? In assembly, it was straightforward: either the
size could have been deduced based on the fact that two mov operands should have the same size or the size
should have been explicitly given, for example, mov qword [rax], 0xABCDE. Here the type system takes care
of it: if a pointer is of a type int*, we surely know that dereferencing it produces a value of size sizeof(int).
When you program in C, pointers are your bread and butter. As long as you do not introduce a pointer
to non-existing data, the pointers will serve you well.
A special pointer value is 0. When used in a pointer context (e.g., assignment to a pointer or comparison
with one), 0 signifies "a special value for a pointer to nowhere." In place of 0 you can also write NULL, and you
are advised to do so. It is a common practice to assign NULL to pointers which are not yet initialized with a
valid object address, or to return NULL from functions returning the address of something to make the caller
aware of an error.
■ Is zero a zero? There are two contexts in which you might use the 0 expression in C. The first context
expects just a normal integer number. The second one is a pointer context, when you assign a pointer to 0 or
compare it with 0. In the second context 0 does not always mean an integer value with all bits cleared, but will
always be equal to this “invalid pointer” value. In some architectures it can be, for example, a value with all bits
set. But this code will work no matter the architecture because of this rule:
int* px = ... ;
There is a special kind of pointer type: void*. This is a pointer to any kind of data. C allows us to
assign any type of pointer to a variable of type void*; however, this variable cannot be dereferenced. Before
we do that, we need to take its value and convert it to a real pointer type (e.g., int*). A simple cast is used to do it
(see section 9.1.2). Listing 9-6 shows an example.
You can also pass a pointer of type void* to any function that accepts a pointer to some other type.
Pointers have many purposes, and we are going to list a couple of them.
• Changing a variable created outside a function.
• Creating and navigating complex data structures (e.g., linked lists).
• Calling functions by pointer: by changing the pointer we switch between
different functions being called. This allows for pretty elegant architectural solutions.
Pointers are closely tied with arrays, which are discussed in the next section.
9.1.6 Arrays
In C, an array is a data structure that holds a fixed number of elements of the same type. So, to work with an array
we need to know its start, the size of a single element, and the number of elements it can store. Refer to
Listing 9-7 to see several variations of array declaration.
As the number of elements must be fixed, it cannot be read from a variable.3 To allocate memory for
arrays whose dimensions we do not know in advance, memory allocators are used (which are not
always at your disposal, for example, when programming kernels). We will learn to use the standard C
memory allocator (malloc / free) and will even write our own.
You can address elements by index. Indices start from 0. The origin of this convention lies in the nature of
the address space: the zero-th element is located at the array's starting address plus 0 times the element size.
Listing 9-8 shows an array declaration, two reads, and one write.
myarray[10] = 42;
If we think for a bit about the C abstract machine, arrays are just contiguous memory regions
holding data of the same type. There is no information about the type itself or about the array length. It is
fully the programmer's responsibility never to address an element outside an allocated array.
Whenever you write an array's name, you are actually referring to its address. You can think
of it as a constant pointer value. Here the analogy between assembly labels and
variables is at its strongest. So, in Listing 9-8, the expression myarray actually has type int*, because it is a
pointer to the first array element!
It also means that the expression *myarray evaluates to the first element, just as myarray[0] does.
3
Until C99; but even nowadays variable length arrays are discouraged by many because if the array size is big enough,
the stack will not be able to hold it and the program will be terminated.
Unsurprisingly, the same function can be rewritten keeping the same behavior, as shown in Listing 9-10.
But that’s not all. You can actually mix these and use the indexing notation with pointers, as shown in
Listing 9-11.
The compiler immediately demotes constructions such as int array[] in the argument list to a
pointer int* array, and then works with it as such. Syntactically, however, you can still specify the array
length, as shown in Listing 9-12. This number indicates that the given array should have at least that many
elements; the compiler, however, treats it as a comment and performs no runtime or compile-time checks.
C99 introduced a special syntax (the static keyword inside the brackets), which corresponds essentially
to a promise given to the compiler that the corresponding array will have at least that many elements.
It allows the compiler to perform some specific optimizations based on this assumption. Listing 9-13 shows an example.
The initialization order is irrelevant. It is often useful to use enum values or character values as indices.
Listing 9-14 shows an example.
enum colors {
RED,
GREEN,
BLUE,
MAGENTA,
YELLOW
};
You can see the suffix _t in type names quite often. All names ending with _t are reserved by the POSIX
standard.4
This way, newer standards will be able to introduce new types without the fear of colliding with type names
in existing projects. Consequently, defining your own names ending in _t is discouraged. We will speak about
practical naming conventions later.
What are these new types for?
1. Sometimes they improve the readability of code.
2. They may enhance portability: to change the format of all variables of
your custom type, you only need to change the typedef.
3. Types are essentially another way of documenting a program.
4. Type aliases are extremely useful when dealing with function pointer types
because of their cumbersome syntax.
4
POSIX is a family of standards specified by the IEEE Computer Society. It includes the description of utilities,
application programming interface (API), etc. Its purpose is to ease the portability of software, mostly between different
branches of UNIX-derived systems.
A very important example of a type alias is size_t. This is a type defined in the language standard (it
requires including one of the standard library headers, for example, #include <stddef.h>). Its purpose is to
hold array lengths and array indices. It is usually an alias for unsigned long; thus, in Intel 64 it typically is an
unsigned 8-byte integer.
■ Never use int for array indices Unless you are dealing with a poorly designed library which forces you to
use int as an index, always favor size_t.
Always use types appropriately. Most standard library functions that deal with sizes return a value of type
size_t (even the sizeof() operator returns size_t!). Let's take a look at the example shown in Listing 9-16.
The expression s of type size_t could have been obtained from one of the library calls such as strlen. Several
problems arise from the use of int:
• int is 4 bytes long and signed, so its maximal value is 2^31 − 1. What if i is used as an
array index? It is more than possible to create a bigger array on modern systems, so
not all elements could be indexed. The standard says that arrays are limited in size only by
the number of elements encodable in a size_t variable (an unsigned 64-bit integer).
• Every iteration is only performed if the current i value is less than s. Thus a
comparison is needed, but these two variables have different formats! Because of this,
special number conversion code will be executed on each iteration, which can be quite
significant for small loops with many iterations.
• When dealing with bit arrays (not so uncommon) a programmer is likely to compute i/8
for a byte offset in a byte array and i%8 to see which specific bit we are referring to. These
operations can be optimized into shifts instead of actual division, but only for unsigned
integers. The performance difference between shifts and “fair” division is radical.
Index    String
0        "ls"
1        "-l"
2        "-a"
The shell splits the whole command line into pieces by spaces, tabs, and newline
characters, and the loader and the C standard library ensure that main gets this
information.
• argc will be equal to 3, as it is the number of elements in argv.
Listing 9-17 shows an example. This program prints all given arguments, each in a separate line.
long array[] = { 1, 2, 3 };
int main(void) {
printf( "%zu \n", sizeof( array ) ); /* output: 24 */
printf( "%zu \n", sizeof( array[0] ) ); /* output: 8 */
return 0;
}
Notice how you cannot use sizeof to get the size of an array accepted by a function as an argument.
Listing 9-19 shows an example. This program will output 8 on our architecture.
■ Which format specifier? Starting with C99 you can use the format specifier %zu for size_t. In earlier versions
you should use %lu, which stands for unsigned long.
■ Question 159 Create sample programs to study the values of these expressions:
• sizeof(void)
• sizeof(0)
• sizeof('x')
• sizeof("hello")
■ Question 161 How do you compute how many elements an array stores using sizeof?
...
...
It is interesting to note how the const modifier interacts with the asterisk * modifier. The type is read from
right to left and so the const modifiers as well as the asterisk are applied in this order. Following are the options:
• int const* x means “a mutable pointer to an immutable int.” Thus, *x = 10 is not
allowed, but modifying x itself is allowed.
An alternate syntax is const int* x.
• int* const x = &y; means “an immutable pointer to a mutable int y.” In other
words, x will never be pointing at anything but y.
• A superposition of the two cases: int const* const x = &y; is “an immutable
pointer to an immutable int y.”
■ Simple rule The const modifier on the left of the asterisk protects the data we point at; the const modifier
on the right protects the pointer itself.
Making a variable constant is not foolproof. There is still a way to modify it. Let’s demonstrate it for a
variable const int x (see Listing 9-21).
• Take a pointer to it. It will have type const int*.
• Cast this pointer to int*.
• Dereference this new pointer. Now you can assign a new value to x.
int main(void) {
const int x = 10;
*( (int*)&x ) = 30;
printf( "%d\n", x );
return 0;
}
This technique is strongly discouraged, but you might need it when dealing with poorly designed
legacy code. const modifiers exist for a reason, and the fact that your code does not compile with them is by no
means a justification for such hacks.
Note that you cannot assign an int const* pointer to an int* one (this is true for all types). The first pointer
guarantees that its contents will never be changed, while the second one does not. Listing 9-22 shows an example.
■ Should I use const at all? It is cumbersome, yes, but in large projects it can save you a lifetime of
debugging. I myself recall several very subtle bugs that were caught by the compiler and resulted in a compilation
error. Without the variables being protected by const, the compiler would have accepted the program, which
would have resulted in wrong behavior.
Additionally, the compiler may use this information to perform useful optimizations.
9.1.13 Strings
In C, strings are null-terminated. A single character is represented by its ASCII code of type char. A string is
defined by a pointer to its start, which means that the equivalent of a string type would be char*. Strings can
also be thought of as character arrays, whose last element is always equal to zero.
The type of string literals is char*. Modifying them, however, while being syntactically possible (e.g.,
"hello"[1] = 32), yields an undefined result. It is one of the cases of undefined behavior in C. This usually
results in a runtime error, which we will explain in the next chapter.
When two string literals are written one after another, they are concatenated (even if they are separated
with line breaks). Listing 9-23 shows an example.
■ Note The C++ language (unlike C) forces the string literal type to char const*, so if you want your code
to be portable, consider it. Additionally, it forces the immutability of the strings (which is what you will often
want) on the syntax level. So whenever you can, assign string literals to const char* variables.
The syntax, as you see, is quite particular. The type declaration is mixed with the argument name itself,
so the general pattern is:
What are these types useful for? As function pointer types are rather difficult to write and read, they
are often hidden behind a typedef. A bad (but very common) practice is to include the asterisk in the type alias
declaration. Listing 9-26 shows an example where a type for a pointer to a procedure returning nothing is created.
In this case you can write directly proc my_pointer = &some_proc. However, this hides the fact that
proc is a pointer: you can deduce it, but you do not see it right away, which is bad. The nature of
the C language is, of course, to abstract things as much as you can, but pointers are such a fundamental
and pervasive concept in C that you should not abstract them away, especially in the presence of weak typing.
So, a better solution would be to write down what is shown in Listing 9-27.
...
Additionally, these types can be used to write function declarations. Listing 9-28 shows an example.
/* declaration */
proc myproc;
/* ... */
/* definition */
double myproc( int x ) { return 42.0 + x; }
Before we start polishing the code, we can immediately spot a bug: the starting value of sum is not
defined and can be arbitrary. Local variables in C are not initialized by default, so you have to do it by hand.
Check Listing 9-30.
First of all, this code is not reusable at all. Let's extract a piece of logic into an array_sum procedure,
shown in Listing 9-31.
What is this magic number 5? Every time we change the array we have to change this number as well, so
we probably want to calculate it dynamically, as shown in Listing 9-32.
But why are we dividing the array size by 4? The size of int varies depending on the architecture, so we
have to calculate it too (at compile time), as shown in Listing 9-33.
We immediately face a problem: sizeof returns a number of type size_t, not int. So, we have to
change the type of i, and we are doing it for a good reason (see section 9.1.9). Listing 9-34 shows the result.
Right now, array_sum works only on statically defined arrays, because they are the only ones whose size
can be calculated by sizeof. Next we want to add enough parameters to array_sum so that it is able to
sum any array. You cannot pass only a pointer to an array, because the array size is then unknown, so
you give it two parameters: the array itself and the number of elements in it, as shown in Listing 9-35.
This code is much better, but it still breaks the rule of not mixing input/output and logic. You cannot use
array_sum in graphical programs, and you can do nothing with its result. We are going to remove
the output from the summation function and make it return its result. Check Listing 9-36.
For convenience, we renamed the global array variable to g_array, but it is not necessary.
Finally, we have to think about adding const qualifiers. The most important place is function arguments
of pointer types. We really want to declare that array_sum will never change the array its argument
points at. We may also like the idea of protecting the global array itself from being changed, by adding a
const qualifier.
Remember that if we make g_array itself constant but do not mark array in the argument list as such,
we will not be able to pass g_array to array_sum, because there are no guarantees that array_sum will not
change the data its argument points at. Listing 9-37 shows the final result.
When you write a solution for an assignment in this book, remember all the points stated previously
and check whether your program conforms to them, and if not, how it can be improved.
Can this program be improved further? Of course, and we are going to give you some hints.
• Can the pointer array be NULL? If so, how do we signal that without dereferencing a
NULL pointer, which would probably result in a crash?
• Can sum overflow?
∑_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + … + a_n b_n

1·4 + 2·5 + 3·6 = 4 + 10 + 18 = 32
9.2.1 Structures
Abstraction is absolutely key to all programming. It replaces lower-level, more verbose concepts with
those closer to our thinking: higher-level, less verbose. When you are thinking about visiting your favorite
pizzeria and planning an optimal route, you do not think about "moving your right foot X centimeters forward,"
but rather about "crossing the road" or "turning to the right." While for program logic the abstraction
mechanism is implemented using functions, data abstraction is implemented using complex data types.
A structure is a data type which packs together several fields. Each field is a variable of its own type.
Mathematicians would probably be happy calling structures "tuples with named fields."
To create a variable of a structural type we can refer to the example shown in Listing 9-38. There we
define a variable d which has two fields: a and b of types int and char, respectively. Then d.a and d.b
become valid expressions that you can use just as you use variable names.
This way, however, you only create a one-time structure. In fact, you are describing the type of d, but you
are not creating a new named structural type. The latter can be done using the syntax shown in Listing 9-39.
...
struct pair d;
d.a = 0;
d.b = 1;
Be very aware that the type name is not pair but struct pair, and you cannot omit the struct keyword
without confusing the compiler. The C language has a concept of namespaces quite different from the
namespaces in other languages (including C++). There is a global type namespace, and then there is a tag
namespace, shared between struct, union, and enum datatypes. The name following the struct keyword is a
tag. You can define a structural type whose name is the same as another type's, and the compiler will distinguish
them based on the presence of the struct keyword.
The example shown in Listing 9-40 demonstrates two variables of types struct type and type, which
are perfectly accepted by the compiler.
It does not mean, though, that you really should create types with identical names.
However, as struct type is a perfectly fine type name, it can be aliased as type using the
typedef keyword, as shown in Listing 9-41. Then the type and struct type names will be completely
interchangeable.
■ Please do not do it  It is not good practice to alias structural types using typedef, because it hides
information about the nature of the type.
You can also assign 0 to all fields of a structure, as shown in Listing 9-43.
...
struct pair p = { 0 };
In C99, there is a better syntax for structure initialization, which allows you to name the fields to
initialize. The unmentioned fields will be initialized to zero. Listing 9-44 shows an example.
The fields of structures are guaranteed not to overlap; however, unlike arrays, structures are not
contiguous, in the sense that there can be free space between their fields. Thus, sizeof of a structural type can
be greater than the sum of the element sizes because of these gaps. We will talk about this in Chapter 12.
9.2.2 Unions
Unions are very much like structures, but their fields are always overlapping. In other words, all union fields
start at the same address. The unions share their namespace with structures and enumerations.
Listing 9-45 shows an example.
...
dword test;
test.integer = 0xAABBCCDD;
We have just defined a union which stores a 4-byte number (on x86 or x64 architectures). At the
same time it stores an array of two numbers, each of which is 2 bytes wide. These two fields (a 4-byte number
and a pair of 2-byte numbers) overlap. By changing the .integer field we are also modifying the .shorts array. If
we assign .integer = 0xAABBCCDD and then output shorts[0] and shorts[1], we will see ccdd aabb.
■ Question 162 Why do these shorts seem reversed? Will it always be the case, or is it architecture
dependent?
By mixing structures and unions we can achieve interesting results. The example shown in Listing 13-17
demonstrates how one can address parts of a 3-byte structure using indices.5
Remember that if you assign a value to a union field, the standard does not guarantee you anything
about the values of the other fields. An exception is made for structures that share the same initial sequence
of fields.
Listing 9-47 shows an example.
struct sb {
int x;
char y;
int notz;
};
union test {
struct sa as_sa;
struct sb as_sb;
};
5 Note that this might not work out of the box for wider types due to possible gaps between struct fields.
Now, in the next example, shown in Listing 9-49, we got rid of the name of the first field (named). This is
an anonymous structure, and now we can access its fields as if they were the fields of vec itself: vec.x.
union vec {
    struct {
        double x;
        double y;
        double z;
    };
    double raw[3];
};
9.2.4 Enumerations
Enumerations are a simple data type based on int type. It fixes certain values and gives them names, similar
to how DEFINE works.
For example, the traffic light can be in one of the following states (based on which lights are turned on):
• Red.
• Red and yellow.
• Yellow.
• Green.
• No lights.
This can be encoded in C as shown in Listing 9-50.
NOTHING
};
...
enum light l = nothing;
...
When is it useful? It is often used to encode the state of an entity, for example, as part of a finite
automaton; it can also serve as a bag of error codes or instruction mnemonics.
The constant value 0 was named RED, RED_AND_YELLOW stands for 1, and so on.
4 +. 1.0
We used data of type int where the compiler expected a float and, unlike in C, where a conversion
would have occurred, it threw an error. This is the essence of very strong typing.
Now let's try to evaluate the expression 1 if True else "3" + 2. This expression evaluates to 1 if
True is true (which obviously holds); otherwise its value is the result of the invalid operation "3" + 2.
However, as we never reach the else branch, no error is raised even at runtime.
Listing 9-52 shows the terminal dump. When applied to two strings, the plus acts as a concatenation
operator.
Listing 9-52. Python Typing: No Error Because the Statement Is Not Executed
>>> 1 if True else "3" + 2
1
>>> "1" + "2"
'12'
true
By studying this example alone we can deduce that when a number and a string are compared, both
sides are apparently converted to numbers and then compared. It is not clear whether the numbers are
integers or reals, but the number of implicit operations in action here is quite astonishing.
9.3.2 Polymorphism
Now that we have a general understanding of typing, let’s go after one of the most important concepts
related to the type systems, namely, polymorphism.
Polymorphism (from Greek polys, “many, much” and morphe, “form, shape”) is the possibility of
invoking different actions for different types in a uniform way. You can also think about it another way: data
entities can take different types.
There are four different kinds of polymorphism [8], which we can also divide into two categories:
1. Universal polymorphism, when a function accepts an argument of an infinite
number of types (maybe even including those that are not defined yet) and
behaves in a similar way for each of them.
• Parametric polymorphism, where a function accepts an additional argument,
defining the type of another argument.
In languages such as Java or C#, the generic functions are an example of
parametric compile-time polymorphism.
• Inclusion, where some types are subtypes of other types. So, when given an
argument of a child type, the function will behave in the same way as when the
parent type is provided.
2. Ad hoc, where functions accept a parameter from a fixed set of types and these
functions may operate differently on each type.
• Overloading, several functions exist with the same name and one of them is called
based on an argument type.
• Coercion, where a conversion exists from type X to type Y and a function accepting
an argument of type Y is called with an argument of type X.
The object-oriented programming paradigm has popularized the notion of polymorphism, but
in a very particular way. Object-oriented programming usually refers to only one kind of polymorphism,
namely subtyping, which is essentially the same as inclusion, because the objects of the child type form a
subset of the objects of the parent type.
Sometimes it is hard to say which type of polymorphism is used in a certain place. Consider the
following four lines:
3 + 4
3 + 4.0
3.0 + 4
3.0 + 4.0
The “plus” operation here is obviously polymorphic, because it is used in the same way with all kinds of
int and double operands. But how is it really implemented? We can think of different options, for example,
• This operator has four overloads for all combinations.
• This operator has two overloads for int + int and double + double cases.
Additionally, a coercion from int to double is defined.
• This operator can only add up two reals, and all ints are coerced to double.
9.4 Polymorphism in C
The C language allows for different kinds of polymorphism, some of which can be emulated through little tricks.
The ## operator is even more interesting: it allows us to form symbol names dynamically. Listing 9-55
shows an example.
Some higher-level language features can be boiled down to compiler logic performing program
analysis and calling one or another function, using one or another data structure, and so on. In C we can
imitate this by relying on the preprocessor.
Listing 9-56 shows an example.
DEFINE_PAIR(int)
First, we included the stdbool.h file to get access to the bool type, as we said in section 9.1.3.
• pair(T), when called as pair(int), will be replaced by the string pair_int.
• DEFINE_PAIR is a macro which, when called as DEFINE_PAIR(int), will be
replaced by the code shown in Listing 9-57.
Notice the backslashes at the end of each line: they are used to escape the newline
character, thus making this macro span across multiple lines. The last line of the
macro is not ended by the backslash.
This code defines a new structural type called struct pair_int, which essentially
contains two integers as fields. If we instantiated this macro with a parameter other
than int, we would get a pair of elements of a different type.
Then a function is defined, which will have a specific name for each macro
instantiation, since the parameter name T is encoded into its name. In our case
it is pair_int_any, whose purpose is to check whether any of two elements in
the pair satisfies the condition. It accepts the pair itself as the first argument and
the condition as the second. The condition is essentially a pointer to a function
accepting T and returning bool, a predicate, as its name suggests.
pair_int_any launches the condition function on the first element and then on the
second element.
When used, DEFINE_PAIR defines the structure that holds two elements of a given
type, and the functions to work with it. We can have only one copy of these functions
and of the structure definition per type, so we instantiate DEFINE_PAIR once for
every type we want to work with.
• Then a macro #define any(T) pair_##T##_any is defined. Notice that its sole
purpose is apparently just to form a valid function name depending on the type. It allows
us to call pair_##T##_any in a rather elegant way: any(int), as if it were a function
returning a pointer to a function.
So, syntactically we got very close to the concept of parametric polymorphism: we are providing an
additional argument (int) which serves to determine the type of the other argument (struct pair_int). Of
course, it is not as good as the type arguments in functional languages or even generic type parameters in C#
or Scala, but it is something.
9.4.2 Inclusion
Inclusion is fairly easy to achieve in C for pointer types. The idea is that every structure's address is the same
as the address of its first member.
Take a look at the example shown in Listing 9-58.
struct parent {
const char* field_parent;
};
struct child {
struct parent base;
const char* field_child;
};
return 0;
}
The function parent_print accepts an argument of type struct parent*. As the definition of child suggests,
its first field has type struct parent. So, every time we have a valid pointer of type struct child*, there exists a
pointer to an instance of struct parent which is equal to it. Thus it is safe to pass a pointer to a child where a
pointer to the parent is expected.
The type system, however, is not aware of this; thus you have to convert the pointer explicitly,
as seen in the call parent_print( (struct parent*) &c );. We could replace the type struct parent*
with void* in this case, because any pointer type can be converted to void* (see section 9.1.5).
9.4.3 Overloading
Automated overloading was not possible in C until C11. Until recently, people included the argument type names
in the function names to provide different “overloads” of some base name. The newer standard has
introduced a special macro which expands based on the argument type: _Generic. It has a wide range of uses.
The _Generic macro accepts an expression E and then a number of association clauses, separated by commas.
Each clause has the form type name: string. When instantiated, the type of E is checked against all types in
the association list, and the corresponding string to the right of the colon becomes the instantiation result.
In the example shown in Listing 9-59, we are going to define a macro print_fmt, which chooses an
appropriate printf format specifier based on the argument type, and a macro print, which forms a valid call to
printf and then outputs a newline.
print_fmt matches the type of the expression x against one of two types: int and double. In case the
type of x is not in this list, the default case is chosen, providing a fairly generic %x specifier. However,
in the absence of the default case, the program would not compile should you provide print_fmt with an
expression of another type, say, long double. So in some situations it would probably be wise to omit the default
case, forcing the compilation to abort when we do not really know what to do.
int main(void) {
int x = 101;
double y = 42.42;
print(x);
print(y);
return 0;
}
We can use _Generic to write a macro that will wrap a function call and select one of differently named
functions based on an argument type.
9.4.4 Coercions
C has several coercions embedded into the language itself: essentially, pointer conversions to void* and
back, and the integer conversions described in section 9.1.4. To our knowledge, there is no way to add
user-defined coercions or anything even remotely similar, akin to Scala's implicit functions or C++'s
implicit conversions.
As you see, in some form, C allows for all four types of polymorphism.
9.5 Summary
In this chapter we have made an extensive study of the C type system: arrays, pointers, and constant types. We
learned to make simple function pointers, saw the caveats of sizeof, revisited strings, and started to get used
to better coding practices. Then we learned about structures, unions, and enumerations. Finally, we talked
briefly about type systems in mainstream programming languages and polymorphism and provided some
advanced code samples to demonstrate how to achieve similar results in plain C. In the next chapter we
are going to take a closer look at ways of organizing your code into a project and the language properties
that are important in this context.
■ Question 166 How do we create a literal of types unsigned long, long, and long long?
■ Question 175 What happens when trying to access an element outside the array’s bounds?
■ Question 179 How are the arguments passed to the main function?
■ Question 183 What are structure types and why do we need them?
■ Question 184 What are union types? How do they differ from the structure types?
■ Question 185 What are enumeration types? How do they differ from the structure types?
■ Question 187 What kinds of polymorphism exist and what is the difference between them?
CHAPTER 10
Code Structure
In this chapter we are going to study how to better split your code into multiple files and which relevant
language features exist. Having a single file with a mess of functions and type definitions is far from
convenient for large projects. Most programs are split into multiple modules. We are going to study what
benefits this brings and what each module looks like before linkage.
void g(void) {
f();
}
In the case of structures, we are talking about two structural types, each having a field of pointer type
pointing to an instance of the other structure. Listing 10-2 shows an example.
The solution is to split declarations from definitions. When a declaration precedes the definition, it
is called a forward declaration.
/* This is definition */
void f( int x ) {
puts( "Hello!" );
}
Such declarations are sometimes called function prototypes. Every time you use a function
whose body is not yet defined or is defined in another file, you should write its prototype first.
In a function prototype the argument names can be omitted, as shown in Listing 10-4.
...
int z = square(5);
2. Prototype first, then call, then the function is defined (see Listing 10-6).
...
int z = square(5);
...
Listing 10-7 shows a typical error situation, where the function body is defined after the call, but no
declaration precedes the call.
However, in the case of two mutually recursive structures, you have to add a forward declaration for at least
one of them. Listing 10-9 shows an example.
If there is no definition of a tagged type but only a declaration, it is called an incomplete type. In this
case we can work freely with pointers to it, but we can never create a variable of such a type, dereference it, or
work with arrays of it. Functions must not return an instance of such a type but, similarly, they
can return a pointer to one. Listing 10-10 shows an example.
These types have a very specific use case which we will elaborate in Chapter 13.
int main(void) {
printf( "%d\n", square( 5 ) );
return 0;
}
Each code file is a separate module and thus is compiled independently, just as in assembly. A .c file
is translated into an object file. Since for our educational purposes we stick with ELF (Executable and Linkable
Format) files, let's crack the resulting object files open and see what's inside. Refer to Listing 10-13 to see the
symbol table inside the main_square.o object file, and to Listing 10-14 for the file square.o. Refer to section
5.3.2 for an explanation of the symbol table format.
SYMBOL TABLE:
0000000000000000 l df *ABS* 0000000000000000 main.c
0000000000000000 l d .text 0000000000000000 .text
0000000000000000 l d .data 0000000000000000 .data
0000000000000000 l d .bss 0000000000000000 .bss
0000000000000000 l d .note.GNU-stack 0000000000000000 .note.GNU-stack
0000000000000000 l d .eh_frame 0000000000000000 .eh_frame
0000000000000000 l d .comment 0000000000000000 .comment
0000000000000000 g F .text 000000000000001c main
0000000000000000 *UND* 0000000000000000 square
SYMBOL TABLE:
0000000000000000 l df *ABS* 0000000000000000 square.c
0000000000000000 l d .text 0000000000000000 .text
0000000000000000 l d .data 0000000000000000 .data
0000000000000000 l d .bss 0000000000000000 .bss
0000000000000000 l d .note.GNU-stack 0000000000000000 .note.GNU-stack
0000000000000000 l d .eh_frame 0000000000000000 .eh_frame
0000000000000000 l d .comment 0000000000000000 .comment
0000000000000000 g F .text 0000000000000010 square
As you see, all functions (namely, square and main) have become global symbols, as the letter g in the
second column suggests, despite not being marked in any special way. It means that all functions are like
labels marked with the global keyword in assembly; in other words, they are visible to other modules.
The function prototype for square, located in main_square.c, is attributed to an undefined section.
GCC provides access to the whole compiler toolchain, which means that it not only
translates files but also calls the linker with appropriate arguments. It also links files against the standard C library.
After linking, the symbol table becomes more populated due to standard library and utility symbols,
such as .gnu.version.
■ Question 188 Compile the file main using the line gcc -o main main_square.o square.o. Study its
symbol table using objdump -t main. What can you tell about the functions main and square?
int main(void) {
printf( "%d\n", square( 5 ) );
return 0;
}
The C standard marks the extern keyword as optional. We recommend that you never omit the extern
keyword, so that you can easily distinguish in which file exactly you want the variable to be created.
However, in case you do omit the extern keyword, how does the compiler distinguish between a variable
definition and a declaration when no initialization is provided? It is especially interesting given that the files are
compiled separately.
In order to study this question, we are going to take a look at the symbol tables for object files using the
nm utility.
We write down the files main.c and other.c, and then we compile them into .o files using the -c flag and
link them. Listing 10-17 shows the command sequence.
There is one global variable called x. It is not assigned a value in main.c, but it is initialized in
other.c.
Using nm we can quickly view the symbol tables, as shown in Listing 10-18. We have shortened the table
for the main executable file on purpose to avoid cluttering the listing with service symbols.
> nm other.o
0000000000000000 D x
> nm main
0000000000400526 T main
U printf@@GLIBC_2.2.5
0000000000601038 D x
As we see, in main.o the symbol x, corresponding to the variable int x, is marked with the flag C (global
common), while in the object file other.o it is marked D (global data). There can be as many similar
global common symbols as you like, and in the resulting executable file they will all be squashed into one.
However, you cannot have multiple definitions of the same symbol; in a single source file you are
limited to a maximum of one definition (repeated declarations are allowed).
print_two();
return 0;
}
void print_one(void) {
puts( "One" );
}
void print_two(void) {
puts( "Two" );
}
Here is the real-world scenario. In order to use a function from the file printer.c in some file other.c,
you have to write down the prototypes of the functions defined in printer.c somewhere at the beginning
of other.c. To use them in a third file, you would have to write their prototypes in the third file too. So,
why do it by hand when we can create a separate file that contains only function and global variable
declarations, but not definitions, and then include it with the help of the preprocessor?
We are going to modify this example by introducing a new header file printer.h, containing all
declarations from printer.c. Listing 10-21 shows the header file.
Now, every time you want to use the functions defined in printer.c, you just have to put the following line
at the beginning of the current code file:
#include "printer.h"
The preprocessor will replace this line with the contents of printer.h. Listing 10-22 shows the new
main file.
int main(void) {
print_one();
print_two();
return 0;
}
■ Note The header files are not compiled themselves. The compiler only sees them as parts of .c files.
This mechanism, which looks similar to module or library imports in languages such as Java
or C#, is by its nature very different. So, saying that the line #include "some.h" means “importing a library
called some” is very wrong. Including a text file is not importing a library! Static libraries, as we know, are
essentially the same kind of object files as the ones produced by compiling .c files. So, the picture for an exemplary
file f.c looks as follows:
• Compilation of f.c starts.
• The preprocessor encounters the #include directives and includes corresponding .h
files “as is.”
• Each .h file contains function prototypes, which will become entries in the symbol
table after the code translation.
• For each such import-like entry, the linker will search through all object files in its
input for a defined symbol (in section .data, .bss, or .text). In one place, it will find
such a symbol and link the import-like entry with it.
This symbol might be found in the C standard library.
But wait, are we giving to the linker the standard library as input? We are going to discuss it in the next
section.
We won’t speak about the restrict keyword yet, so let’s pretend it is not here. The file stdio.h,
included in our test file p.c, obviously contains the function prototype of printf (pay attention to the
semicolon at the end of the line!), which has no body. Three dots in place of the last argument mean an
arbitrary arguments count. This feature will be discussed in Chapter 14. The same experiment can be
conducted for any function that you gain access to by including stdio.h.
GCC is a universal interface of sorts: you can use it to compile single files separately without linkage
(the -c flag), you can perform the whole compilation cycle including linkage on several files, but you can also
call the linker indirectly by providing GCC with .o files as input.
When performing linkage, GCC does not just call ld blindly. It also provides it with the correct version
of the C library, or libraries. Additional libraries can be specified with the help of the -l flag.
In the most common scenario, the C library consists of two parts:
• The static part (usually called crt0, for C RunTime; zero stands for “the very beginning”)
contains the _start routine, which performs initialization of the standard utility
structures required by this specific library implementation. Then it calls the main
function. In Intel 64, the command-line arguments are passed on the stack. It means
that _start should copy argc and argv from the stack to rdi and rsi in order to
respect the function calling convention.
If you link a single file and check its symbol table before and after linkage, you will see
quite a lot of new symbols, which originate in crt0, for example, a familiar _start,
which is the real entry point.
• The dynamic part, which contains the functions and global variables themselves. As these
are used by a vast majority of running applications, it is wise not to copy them but to share
them for the sake of smaller overall memory consumption and better
locality. We are going to prove its existence by using the ldd utility on a compiled
sample file main_ldd.c, shown in Listing 10-24. It will help us locate the standard C
library. Listing 10-25 shows the ldd output.
■ Question 189 Try to find the same symbols using nm utility instead of readelf.
10.4 Preprocessor
Apart from defining global constants with #define, the preprocessor is also used as a workaround for the
multiple inclusion problem. First, we are going to briefly review the relevant preprocessor features.
The #define directive is used in the following typical forms:
• #define FLAG means that the preprocessor symbol FLAG is defined, but its value is
an empty string (or, you could say it has no value). This symbol is mostly useless in
substitutions, but we can check whether a definition exists at all and include some
code based on it.
• #define MY_CONST 42 is a familiar way to define global constants. Every time
MY_CONST occurs in the program text, it is substituted with 42.
• #define MAX(a, b) ((a)>(b))?(a):(b) is a macro substitution with parameters.
A line int x = MAX(4+3, 9) will then be replaced with int x = ((4+3)>(9))?(4+3):(9).
■ Macro parameters in parentheses  Note that all parameters in a macro body should be surrounded by
parentheses. This ensures that complex expressions, given to the macro as parameters, are parsed correctly.
Imagine a simple macro SQ whose body multiplies its parameter by itself without parentheses. Instantiating it
as SQ(4+3) produces
int z = 4 + 3 * 4 + 3
which, due to multiplication having a higher priority than addition, will be parsed as 4 + (3*4) + 3, which is
not quite the expression we intended to form.
If you want additional preprocessor symbols to be defined, you can also provide them when launching
GCC with the -D key. For example, instead of writing #define SYM VALUE, you can launch gcc -DSYM=VALUE,
or just gcc -DSYM for a simple #define SYM.
Finally, we need a macro conditional: #ifdef. This directive allows us to either include or exclude some
text fragment from the preprocessed file, based on whether a symbol is defined or not.
You can include the lines between #ifdef SYMBOL and #endif if the SYMBOL is defined, as shown in
Listing 10-27.
You can include the lines between #ifdef SYMBOL and #endif if the SYMBOL is defined, OR ELSE include
other code, as shown in Listing 10-28.
#endif
You can also state that some code will only be included if a certain symbol is not defined, as shown in
Listing 10-29.
/* b.h */
#include "a.h"
void b(void);
/* main.c */
#include "a.h"
#include "b.h"
What will the preprocessed main.c file look like? We are going to launch gcc -E main.c. Listing 10-31
shows the result.
void b(void);
# 2 "main.c" 2
Now main.c contains a duplicate function declaration void a(void). For plain prototypes this is
tolerated, but once the header contains a type definition, such duplication results in a compilation
error. The first declaration comes from the a.h file directly; the second one comes from the file b.h, which
includes a.h on its own.
There are two common techniques to prevent that.
• Using the directive #pragma once at the start of the header. This is a non-standard way of
forbidding multiple inclusion of a header file. Many compilers support it, but
because it is not part of the C standard, its usage is discouraged.
• Using so-called Include guards.
void a(void);
#endif
The text between the directives #ifndef _FILE_H_ and #endif will only be included if the symbol _FILE_H_ is not
defined. As we see, the very first line in this text is #define _FILE_H_. It means that the next time this
text is about to be included as a result of an #include directive, the same #ifndef _FILE_H_ check will
prevent the file contents from being included a second time.
Usually, people name such preprocessor symbols after the file name; one such convention was
shown above.
#include <stdio.h>
struct pair {
int x;
int y;
};
#endif
The include guard is the first thing we observe in this file. Then come other includes. Why would you need
to include files in header files? Sometimes your functions or structures rely on external types defined
elsewhere. In this example, the function pair_tofile accepts an argument of type FILE*, which is defined in
the stdio.h standard header file (or in one of the headers it includes on its own). The type definition comes
after that, and then the function prototypes.
2. You have to find all occurrences of the min macro, which is defined as shown earlier.
As you have seen in the previous example, to parse the program you first have to
perform the preprocessing passes; otherwise the tool might not even understand
the function boundaries. Once you perform preprocessing, all min macros are
substituted and thus become untraceable and indistinguishable from ordinary comparison expressions.
3. Static analysis (and even your own understanding of the program) will suffer because
of macro usage. Syntactically, macro instantiations with parameters are
indistinguishable from function calls. However, while function arguments are
evaluated before a function call is performed, macro arguments are substituted
and then the resulting lines of code are executed.
For example, take the same macro #define min(x,y) ((x) < (y) ? (x) : (y)).
The instantiation with arguments a and b-- will look like ((a) < (b--) ? (a) : (b--)).
As you see, if a >= b, then the variable b will be decremented twice. If min
were a function, b-- would have been evaluated only once.
10.5.2 Example
Listing 10-36 shows the example. It contains three functions of interest:
• array_read to read an array from stdin. The memory allocation happens here.
Notice the usage of the scanf function to read from stdin. Do not forget that it accepts not the variable
values but their addresses, so that it can actually write into them.
• array_print to print a given array to stdout.
• array_sum to sum all elements in an array.
Notice that an array allocated using malloc persists until the moment free is called on its
starting address. Freeing an already freed array is an error.
#include <malloc.h>
*out_count = cnt;
return array;
}
In fact, for a list of length N, we can calculate the number of times elements will be addressed to
compute a sum.
1 + 2 + 3 + ... + N = N(N + 1) / 2
We start with a sum equal to 0. Then we add the first element, for that we need to address it alone (1).
Then we add the second element, addressing the first and the second (2). Then we add the third element,
addressing the first, the second, and the third as we look through the list from its beginning. In the end what
we get is O(N²), for those familiar with the O-notation. Essentially it means that increasing
the list size by 1 adds about N more element accesses to the total time.
In such a case it is indeed wiser to just pass through the list once, adding the current element to an accumulator.
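The two approaches can be sketched as follows (the node layout and helper names are assumptions, not the assignment's actual definitions):

```c
#include <stddef.h>
#include <stdlib.h>

struct list { int value; struct list* next; };   /* assumed node layout */

static struct list* cons(int v, struct list* tail) {
    struct list* n = malloc(sizeof *n);          /* NULL check omitted for brevity */
    n->value = v; n->next = tail;
    return n;
}

/* Quadratic: re-walk the list from the head for every index. */
static int element_at(const struct list* l, size_t i) {
    while (i--) l = l->next;
    return l->value;
}
static int sum_quadratic(const struct list* l, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) sum += element_at(l, i);
    return sum;
}

/* Linear: one pass with an accumulator. */
static int sum_linear(const struct list* l) {
    int acc = 0;
    for (; l; l = l->next) acc += l->value;
    return acc;
}
```

Both compute the same sum; only the number of node visits differs.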
• Writing small functions is very good most of the time.
• Consider writing separate functions to: add an element to the front, add to the back,
create a new linked list node.
• Do not forget to extensively use const, especially in functions accepting pointers as
arguments!
What we see is that all symbol names are marked global except for those marked static in C. At the
assembly level it means that most labels are global, and to prevent that we have to be explicit and use
the static keyword.
0000000000000000 t module_function
0000000000000000 b module_int
0000000000000004 b static_local_var.1464
...
demo(); //outputs 42
demo(); //outputs 43
demo(); //outputs 44
10.8 Linkage
The concept of linkage is defined in the C standard and systematizes what we have studied in this chapter
so far. According to it, “an identifier declared in different scopes or in the same scope more than once can be
made to refer to the same object or function by a process called linkage” [7].
So, each identifier (variable or a function name) has an attribute called linkage. There are three types
of linkage:
• No linkage, which corresponds to local (to a block) variables.
• External linkage, which makes an identifier available to all modules that might want to
touch it. This is the case for global variables and any functions.
– All instances of a particular name with external linkage refer to the same object in
the program.
– All objects with external linkage must have one and only one definition. However,
the number of declarations in different files is not limited.
• Internal linkage, which restricts the visibility of the identifier to the .c file where it was
defined.
It’s easy for us to map the kinds of language entities we know to the linkage types:
• Regular functions and global variables—external linkage.
• Static functions and global variables—internal linkage.
• Local variables (static or not)—no linkage.
While important to grasp in order to read the standard freely, this concept is rarely encountered
in everyday programming activities.
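A single-file sketch of the three linkage kinds (identifier names are mine, for illustration only):

```c
/* sketch of the three kinds of linkage inside one module */
int shared_total = 0;     /* external linkage: visible to other .c files     */
static int calls = 0;     /* internal linkage: restricted to this file       */

int bump(void) {          /* external linkage, like any non-static function  */
    static int delta = 1; /* no linkage: a block-scope name, even though its
                             storage persists between calls                  */
    calls++;
    shared_total += delta;
    return shared_total;
}
```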
10.9 Summary
In this chapter we learned how to split code into separate files. We have reviewed the concept of header files,
studied include guards, and learned to isolate functions and variables inside a file. We have also seen what
the symbol tables look like for basic C programs and the effects the keyword static produces on object
files. We have completed an assignment and implemented linked lists (one of the most fundamental data
structures). In the next chapter we are going to study memory from the C perspective in greater detail.
■ Question 194 How can the functions defined in other files be called?
■ Question 195 What effect does a function declaration make on the symbol table?
■ Question 197 What is the concept of header files? What are they typically used for?
■ Question 198 Which parts does the standard C library consist of?
■ Question 199 How does the program accept command-line arguments?
■ Question 200 Write a program in assembly that will display all command-line arguments, each on a
separate line.
■ Question 201 How can we use the functions from the standard C library?
■ Question 202 Describe the machinery that allows the programmer to use external functions by including
relevant headers.
■ Question 203 Read about ld-linux.
■ Question 205 What is the include guard used for and how do we write it?
■ Question 206 What is the effect of static global variables and functions on the symbol table?
CHAPTER 11
Memory
Memory is a core part of the model of computation used in C. It stores all types of variables as well as
functions. This chapter will study the C memory model and related language features closely.
■ Note You can only apply & once, because for any x the expression &x is not an lvalue.
• Take its own address. If the pointer is a variable, it is located somewhere in memory
too. So, it has an address on its own! Use the & operator to take it.
• Dereference, which is a basic operation that we have also seen. We take a
data entry from memory starting at the address stored in the given pointer.
The * operator does it. Listing 11-3 shows an example.
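A compact illustration of both operations (a sketch of my own, independent of the listing):

```c
int pointer_basics(void) {
    int x = 42;
    int* p = &x;      /* p holds the address of x */
    int** pp = &p;    /* a pointer lives in memory too, so it has an address */
    *p = 43;          /* dereference: write through p into x */
    return **pp;      /* two dereferences lead back to x */
}
```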
CHAPTER 11 ■ MEMORY
In all other cases (subtracting a greater pointer from a lesser one, subtracting pointers pointing into
different areas, etc.) the result is undefined.
Addition, multiplication, and division of two pointers are not allowed at all; they trigger an
immediate compilation error.
11.1.4 NULL
C defines a special preprocessor constant NULL equal to 0. It denotes a pointer “pointing to nowhere,” an
invalid pointer. By writing this value into a pointer, we mark it as not yet initialized to a valid address.
Otherwise, we would not be able to distinguish initialized pointers from uninitialized ones.
In most architectures people reserve a special value for invalid pointers, assuming no program will
actually hold a useful value by this address.
As we already know, 0 in a pointer context does not always mean a binary number with all bits cleared.
The null pointer can be represented by all zero bits, but this is not enforced by the standard. History knows
architectures where the null pointer was chosen in a rather exotic way. For example, some Prime 50 series
computers used segment 07777, offset 0 for the null pointer; some Honeywell-Bull mainframes used the bit
pattern 06000 for a kind of null pointer.
Listing 11-6 shows the correct ways to check whether the pointer is NULL or not.
What happens if cur > max? It implies that the difference between cur and max is negative. Its type is
ptrdiff_t. Comparing it with a value of type unsigned int is an interesting case to study.
ptrdiff_t has as many bits as an address on the target architecture. Let’s study two cases:
• 32-bit system, where sizeof( unsigned int ) == 4 and sizeof( ptrdiff_t ) == 4.
In this case, the types in our comparison will pass through the usual arithmetic
conversions. The compiler will issue a warning, because the conversion from int to
unsigned int does not always preserve values. You cannot freely map values in the
range −2³¹ … 2³¹ − 1 to the range 0 … 2³² − 1.
For example, in case the left-hand side was equal to -1, after the conversion to
unsigned int it will become the maximal value representable in unsigned
int (2³² − 1). The result of the comparison will then almost always be
0 (false), which is wrong, because -1 is smaller than any unsigned integer.
• 64-bit system, where sizeof( unsigned int ) == 4 and sizeof( ptrdiff_t ) == 8.
Here the right-hand side is going to be cast. This cast preserves information, so the
compiler will issue no warning.
As you see, the behavior of this code depends on the target architecture, which is a serious problem. To avoid it,
ptrdiff_t should always be paired with size_t, because only then are their sizes guaranteed to be the same.
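The pitfall and the fix, side by side (function names are mine; the values are chosen for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

bool wrong_compare(void) {
    ptrdiff_t diff = -1;           /* e.g., cur is one element past max */
    size_t bound = 100;
    /* diff is converted to the unsigned type: -1 wraps around to the
       maximum size_t value, so the "obviously true" comparison is false. */
    return (size_t)diff < bound;
}

bool right_compare(void) {
    ptrdiff_t diff = -1;
    size_t bound = 100;
    /* Compare in the signed domain (safe while bound fits in ptrdiff_t). */
    return diff < (ptrdiff_t)bound;
}
```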
We have described the pointer fptr of type “a pointer to a function, that accepts int and returns
double.” Then we assigned the doubler function address to it and performed a call by this pointer with an
argument 10, storing the returned value in the variable a.
typedef works, and is sometimes a great help. The previous example can be rewritten as shown in
Listing 11-9.
...
double a;
megapointer_type* variable = &doubler;
a = variable(10); /* a = 20.0 */
Now by means of typedef we have created a function type, which cannot be instantiated directly.
However, we can create variables of the corresponding pointer type. Since we cannot create variables of
function types directly, we add an asterisk.
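The whole picture fits in a few lines. A self-contained sketch (I assume doubler multiplies its argument by two, as its name suggests; the typedef names are mine):

```c
#include <assert.h>

double doubler(int x) { return x * 2.0; }    /* assumed definition */

typedef double fun_type(int);     /* a function type: cannot be instantiated */
typedef double (*ptr_type)(int);  /* a pointer-to-function type              */

double call_through_pointer(void) {
    fun_type* p = doubler;        /* a function name decays to its address   */
    ptr_type  q = &doubler;       /* & is allowed here but redundant         */
    assert(p == q);               /* both hold the same function address     */
    return p(10) + (*q)(10);      /* both call syntaxes are accepted         */
}
```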
First-class objects in programming languages are the entities that can be passed as a parameter,
returned from functions, or assigned to a variable.
As we see, functions are not first-class objects in C. Sometimes they are called “second-class objects”
because the pointers to them are first-class objects.
■ Note Never return pointers to local variables from functions! They point to the data that no longer exists.
• Static memory allocation happens during compilation in the data or constant data
region. These variables exist until the program terminates. Variables that are not
explicitly initialized are filled with zeros and thus end up in the .bss section. The constant
data is allocated in .rodata; the mutable initialized data is allocated in .data.
• Dynamic memory allocation is needed when we do not know the size of the memory
we need to allocate until some external events happen. This type of allocation relies
on an implementation in the standard C library. It means that when the C standard
library is not available (e.g., bare metal programming), this type of memory allocation
is also unavailable.
This type of memory allocation uses the heap.
A part of the standard library keeps track of the reserved and available memory
addresses. This part’s interface consists of the following functions, whose
prototypes are located in the stdlib.h header file (the nonstandard malloc.h also provides them on many systems).
– void* malloc(size_t size) allocates size bytes in the heap and returns the address of
the first one. It returns NULL if it fails.
...
int* a = malloc(200);
a[4] = 2;
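A slightly fuller sketch of the same allocation, with the NULL check that real code needs (the function name is mine):

```c
#include <stdlib.h>

int* make_array(size_t count) {
    /* malloc may fail: always check the result before using it.
       sizeof *a keeps the element size in sync with the pointer type. */
    int* a = malloc(count * sizeof *a);
    if (!a) return NULL;
    for (size_t i = 0; i < count; i++) a[i] = (int)i;
    return a;                 /* the caller must free() this block */
}
```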
However, in C++, a popular language that was originally derived from C (and which tries to maintain
backward compatibility), the void* pointer should be explicitly cast to the type of the pointer you are
assigning it to. Listing 11-11 shows the difference.
■ Why some programmers recommend omitting the cast The older C standards had an “implicit int” rule
for function declarations. Lacking a valid declaration, a function’s first usage was considered its declaration: if
a name that has not been previously declared occurs in an expression and is followed by a left parenthesis, it
declares a function name. This function is also assumed to return an int value, and the compiler accepts the
call even if it never sees the definition.
In case you do not include a valid header file containing a malloc declaration, this line will trigger an error,
because a pointer is assigned the integer value returned by malloc:
int* x = malloc( 40 );
However, the explicit cast will hide this error, because in C we can cast whatever we want to whatever type we want.
int* x = (int*)malloc( 40 );
The modern versions of the C standard (starting with C99) drop this rule and declarations become mandatory,
so this reasoning no longer applies.
a[i] = 2;
*(a+i) = 2;
The address of the i-th element can be obtained by one of the following constructions:
&a[i];
a+i;
As we see, every operation with pointers can be rewritten using the array syntax! And it goes even
further. In fact, the bracket syntax a[i] is translated directly into *(a + i), which is the same thing as
*(i + a). Because of this, exotic constructions such as 4[a] are also possible (because *(4 + a) is legitimate).
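All four spellings touch the same element. A quick sketch:

```c
#include <assert.h>

int bracket_games(void) {
    int a[10] = {0};
    *(a + 4) = 7;            /* exactly the same as a[4] = 7        */
    assert(&a[4] == a + 4);  /* both denote the 5th element address */
    return 4[a];             /* 4[a] is *(4 + a), i.e., a[4]        */
}
```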
Arrays can be initialized with zeros using the following syntax:
Arrays have a fixed size. However, there are two notable exceptions to this rule, which are valid in C99
and newer versions.
• Stack allocated arrays can be of a size determined in runtime. These are called
variable length arrays. It is evident that these cannot be marked static because the
latter implies allocation in the .data section.
• Starting from C99, you can add a flexible array member as the last member of a
structure, as shown in Listing 11-12.
In this case, the sizeof operator, applied to a structure instance, will return the
structure size without the array. The array will refer to the memory immediately
following the structure instance. So, in the example given in Listing 11-12,
sizeof(struct char_array) == sizeof(size_t). Assuming it’s equal to 8,
data[0] refers to the 8-th byte (counting from 0) from the structure instance
starting address.
Listing 11-13 shows an example.
struct int_array {
size_t size;
int array[];
};
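The point of a flexible array member is that one malloc call reserves both the header and the elements. A sketch of a constructor for the structure above (the function name is mine):

```c
#include <stdlib.h>
#include <string.h>

struct int_array {
    size_t size;
    int array[];   /* flexible array member: not counted by sizeof */
};

struct int_array* int_array_create(size_t size) {
    /* allocate the header plus room for `size` elements in one block */
    struct int_array* a = malloc(sizeof *a + size * sizeof a->array[0]);
    if (!a) return NULL;
    a->size = size;
    memset(a->array, 0, size * sizeof a->array[0]);
    return a;      /* one free() releases header and elements together */
}
```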
int a,b = 4, c;
To declare several pointers, however, you have to add an asterisk before every pointer.
Listing 11-14 shows an example: a and b are pointers, but the type of c is int.
This rule can be worked around by creating a type alias for int* using typedef, hiding an asterisk.
Defining multiple variables in a row is a generally discouraged practice as in most cases it makes the
code harder to read.
It is possible to create rather complex type definitions by mixing function pointers, arrays, pointers, etc.
You can use the following algorithm to decipher them:
1. Find an identifier, and start from it.
2. Go to the right until the first closing parenthesis. Find its pair on the left. Interpret
an expression between these parentheses.
3. Go “up” one level, relative to the expression we have parsed during the previous
step. Find outer parentheses and repeat step 2.
We will illustrate this algorithm in an example shown in Listing 11-15. Table 11-1 describes the
parsing process.
Expression                    Interpretation
fp                            First identifier.
(*fp)                         Is a pointer.
(* (*fp) (int))               A function accepting int and returning a pointer…
int* (* (*fp) (int)) [10]     … to an array of ten pointers to int.
As you see, the process of deciphering complex declarations is not a breeze. It can be made simpler by
using typedefs for parts of the declarations.
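Applying that advice to the declaration from Listing 11-15, the monster int* (* (*fp)(int) )[10] decomposes into two readable typedefs (the alias names and the illustrative function are mine):

```c
typedef int* ptr_array[10];          /* an array of ten pointers to int      */
typedef ptr_array* (*fun_ptr)(int);  /* a pointer to a function accepting an
                                        int and returning a pointer to such
                                        an array                             */

/* A function with a matching signature, for illustration only: */
static int* storage[10];
static ptr_array* get_storage(int unused) { (void)unused; return &storage; }
```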
In C++, string literals have the type char const[] (decaying to char const*), which reflects their immutable nature.
Consider using variables of type char const* whenever the strings you are dealing with are
not intended to be mutated.
The constructions shown in Listing 11-18 are also correct, although you are most probably never going to
use the second one.
¹ To be precise, the result of such an operation is not well defined.
When manipulating strings, there are several common scenarios based on where the string is allocated.
1. We can create a string among global variables. It will be mutable, and under no
circumstances will it be duplicated in the constant data region. Listing 11-19 shows an example.
free( str );
}
■ Question 211 Read man for the functions: memcpy, memset, strcpy.
String interning would be impossible if string literals were not protected from rewriting. Otherwise, by
changing such a string in one place of a program we would introduce an unpredictable change in data used in
another place, as both share the same copy of the string.
As we have chosen the GNU/Linux 64-bit system for study purposes, our data model is LP64. When
you develop for a 64-bit Windows system, the size of long will differ.
Everyone wants to write portable code that can be reused across different platforms, and fortunately
there is a standard-conforming way to never run into data model changes.
Before C99, it was a common practice to make a set of type aliases of form int32 or uint64 and use
them exclusively across the program in lieu of ever-changing ints or longs. Should the target architecture
change, the type aliases were easy to fix. However, it created chaos because everyone created their own set
of types.
C99 introduced platform-independent types. To use them, you should just include the header stdint.h.
It gives access to integer types of fixed size. Each type name has the following form:
• u, if the type is unsigned;
• int;
• Size in bits: 8, 16, 32 or 64; and
• _t.
For example, uint8_t, int64_t, int16_t.
The printf function (and similar formatted input/output functions) has been given
a similar treatment by introducing special macros to select the correct format
specifiers. These are defined in the file inttypes.h.
In the common cases, you want to read or write integer numbers or pointers. Then
the macro name will be formed as follows:
• PRI for output (printf, fprintf etc.) or SCN for input (scanf, fscanf etc.).
• Format specifier:
– d for decimal formatting.
– x for hexadecimal formatting.
– o for octal formatting.
– u for unsigned int formatting.
– i for integer formatting.
• Additional information includes one of the following:
– N for fixed-width N-bit integers.
– PTR for pointers.
– MAX for the maximum supported bit size.
– FASTN for the fastest integer type of at least N bits (the concrete type is implementation defined).
Here we use the fact that adjacent string literals, separated only by whitespace, are concatenated
automatically. The macro produces a string containing the correct format specifier, which is then
concatenated with whatever is around it.
Listing 11-23 shows an example.
void f( void ) {
int64_t i64 = -10;
uint64_t u64 = 100;
printf( "Signed 64-bit integer: %" PRIi64 "\n", i64 );
printf( "Unsigned 64-bit integer: %" PRIu64 "\n", u64 );
}
For example, I have seen a situation in which reading a large block from a picture file
on Windows (the compiler was MSVC) ended prematurely because the picture was
obviously binary, while the associated stream was created in text mode.
The standard library provides machinery to create and work with streams. Some functions it defines
should only be used on text streams (like fscanf). The relevant header file is called stdio.h.
Let’s analyze the example shown in Listing 11-24.
/* This line is optional. By means of `fseek` function we can navigate the file */
fseek( f, 0, SEEK_SET );
• The instance of FILE is created via a call to the fopen function. The latter accepts the path
to a file and a set of flags, squashed into a string.
The important flags of fopen are listed here.
– b - open file in a binary mode. That is what makes a real distinction between
text and binary streams. By default, files are opened in text mode.
– w - open a stream with a possibility to write into it.
– r - open a stream with a possibility to read from it.
– + - open the stream for update, that is, for both reading and writing. Note that
w+ still truncates the file; to append to the end of the file, use the a mode instead.
If the file does not exist, it will be created.
The file hello.img is opened in binary mode for both reading and writing.
The file contents will be overwritten.
• After being created, the FILE holds a kind of a pointer to a position inside the file,
a cursor of sorts. Reads and writes move this cursor further.
• The fseek function is used to move the cursor without performing reads or writes.
It allows moving the cursor relative to either its current position or the file start.
• fwrite and fread functions are used to write and read data from the opened FILE
instance.
Taking fread, for example: it accepts the memory buffer to read into. The two integer parameters are
the size of an individual block and the number of blocks to read. The return value is the number of blocks
successfully read from the file. Every block’s read is atomic: either it is completely read, or not read at all.
In this example, the block size equals sizeof(int), and the number of blocks is one.
The fwrite usage is symmetrical.
• fclose should be called when the work with file is complete.
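The whole fopen/fwrite/fseek/fread/fclose cycle fits in one small function. A sketch (the file name and function name are mine; note the fseek between the write and the read, which the standard requires on update streams):

```c
#include <stdio.h>
#include <stdbool.h>

/* Write an int to a binary file and read it back.
   Returns true when the round trip succeeds. */
bool roundtrip(const char* path) {
    int out = 42, in = 0;
    FILE* f = fopen(path, "wb+");           /* binary, read+write, truncate */
    if (!f) return false;
    if (fwrite(&out, sizeof out, 1, f) != 1) { fclose(f); return false; }
    fseek(f, 0, SEEK_SET);                  /* rewind the cursor            */
    if (fread(&in, sizeof in, 1, f) != 1)   { fclose(f); return false; }
    fclose(f);
    return in == out;
}
```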
There exists a special constant EOF. When it is returned by a function that works with a file, it means that
the end of the file has been reached.
Another constant BUFSIZ stores the buffer size that works best in the current environment for input and
output operations.
Streams can use buffering. It means that they have an internal buffer that proxies all reads and writes. This
allows for rarer system calls (which are expensive performance-wise due to context switching). Typically,
only when the buffer is full does a write actually trigger a write system call. A buffer can be flushed manually
using the fflush function. Any delayed writes will be executed and the buffer will be reset.
When the program starts, three FILE* instances are created and attached to the streams with descriptors 0,
1, and 2. They can be referred to as stdin, stdout, and stderr. All three usually use a buffer, but stderr flushes
its buffer after every write. This is necessary in order not to delay or lose error messages.
■ Note Again, descriptors are integers, FILE instances are not. The int fileno( FILE* stream ) function
is used to get the underlying descriptor for the file stream.
■ Question 212 Read man for functions: fread, fwrite, fprintf, fscanf, fopen, fclose, fflush.
■ Question 213 Do research and find out what will happen if the fflush function is applied to a bidirectional
stream (opened for both reading and writing) when the last action on the stream before it was reading.
3. We repeat the process until the list is consumed. In the end, the accumulator
value is the final result.
For example, let’s take f (x, a) = x * a. By launching foldl with the accumulator value 1 and this function
we will compute the product of all elements in the list.
• iterate accepts the initial value s, list length n, and function f. It then generates a list
of length n as follows:
[ s, f(s), f(f(s)), f(f(f(s))), … ]
The functions described above are called higher-order functions, because they accept other
functions as arguments. Another example of such a function is the array sorting function qsort.
It accepts the array starting address base, the element count nmemb, the size of an individual element
size, and the comparator function compar. The latter is the decision maker that tells which of two given
elements should be closer to the beginning of the array.
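A minimal qsort usage for an int array (the wrapper and comparator names are mine):

```c
#include <stdlib.h>

/* comparator contract: negative if a precedes b, zero if equal,
   positive otherwise */
static int cmp_int(const void* a, const void* b) {
    int x = *(const int*)a, y = *(const int*)b;
    return (x > y) - (x < y);   /* avoids the overflow that x - y could cause */
}

void sort_ints(int* arr, size_t n) {
    qsort(arr, n, sizeof arr[0], cmp_int);
}
```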
11.7.2 Assignment
The input contains an arbitrary number of integers.
1. Save these integers in a linked list.
2. Transfer all functions written in the previous assignment into separate .h and .c files.
Do not forget to put in an include guard!
3. Implement foreach; using it, output the initial list to stdout twice: the first time,
separate elements with spaces; the second time, output each element on a new line.
4. Implement map; using it, output the squares and the cubes of the numbers from list.
5. Implement foldl; using it, output the sum and the minimal and maximal element
in the list.
6. Implement map_mut; using it, output the absolute values of the input numbers.
7. Implement iterate; using it, create and output the list of the powers of two (first
10 values: 1, 2, 4, 8, …).
8. Implement a function bool save(struct list* lst, const char* filename);,
which will write all elements of the list into a text file filename. It should return
true in case the write is successful, false otherwise.
9. Implement a function bool load(struct list** lst, const char* filename);,
which will read all integers from a text file filename and write the resulting list into
*lst. It should return true in case the read is successful, false otherwise.
10. Save the list into a text file and load it back using the two functions above. Verify
that the save and load are correct.
11. Implement a function bool serialize(struct list* lst, const char*
filename);, which will write all elements of the list into a binary file filename. It
should return true in case the write is successful, false otherwise.
12. Implement a function bool deserialize(struct list** lst, const char*
filename);, which will read all integers from a binary file filename and write
the resulting list into *lst. It should return true in case the read is successful, false
otherwise.
13. Serialize the list into a binary file and load it back using the two functions above. Verify
that the serialization and deserialization are correct.
14. Free all allocated memory.
You will have to learn to use
• Function pointers.
• limits.h and constants from it. For example, in order to find the minimal element in
an array, you have to use foldl with the maximal possible int value as an accumulator
and a function that returns a minimum of two elements.
• The static keyword for functions that you only want to use in one module.
You are guaranteed that
• The input stream contains only integer numbers separated by whitespace characters.
• All numbers from the input fit into an int.
It is probably wise to write a separate function to read a list from a FILE.
The solution takes about 150 lines of code, not counting the functions defined in the previous
assignment.
■ Question 215 In languages such as C#, code like the following is possible:
var count = 0;
Here we launch an anonymous function (i.e., a function that has no name, but whose address can be
manipulated, for example, passed to another function) for each element of a list. The function is written as x =>
count += 1 and is the equivalent of
The interesting thing about it is that this function is aware of some of the local variables of the caller and thus
can modify them.
Can you rewrite the function forall so that it accepts a pointer to a “context” of sorts, which can hold an
arbitrary number of variable addresses, and then pass the context to the function called for each element?
11.8 Summary
In this chapter we have studied the memory model. We have gotten a better understanding of type sizes
and data models, studied pointer arithmetic, and learned to decipher complex type declarations. Additionally,
we have seen how to use the standard library functions to perform the input and output. We have practiced it by
implementing several higher-order functions and doing a little file input and output.
We will further deepen our understanding of memory layout in the next chapter, where we will
elaborate on the difference between the three “facets” of a language (syntax, semantics, and pragmatics), study the
notions of undefined and unspecified behavior, and show why data alignment is important.
■ Question 216 What arithmetic operations can you perform with pointers, and on what conditions?
■ Question 217 What is the purpose of void*?
■ Question 218 What is the purpose of NULL?
■ Question 219 What is the difference between 0 in pointer context and 0 as an integer value?
■ Question 220 What is ptrdiff_t and how is it used?
■ Question 221 What is the difference between size_t and ptrdiff_t?
■ Question 222 What are first-class objects?
■ Question 223 Are functions first-class objects in C?
■ Question 224 What data regions does the C abstract machine contain?
■ Question 225 Is the constant data region usually write-protected by hardware?
■ Question 226 What is the connection between pointers and arrays?
■ Question 227 What is the dynamic memory allocation?
■ Question 228 What is the sizeof operator? When is it computed?
■ Question 229 When are the string literals stored in .rodata?
■ Question 230 What is string interning?
■ Question 231 Which data model are we using?
■ Question 232 Which header contains platform-independent types?
■ Question 233 How do we concatenate string literals in compile time?
■ Question 234 What is the data stream?
■ Question 235 Is there a difference between a data stream and a descriptor?
■ Question 236 How do we get the descriptor from stream?
■ Question 237 Are there any streams opened when the program starts?
■ Question 238 What is the difference between binary and text streams?
■ Question 239 How do we open a binary stream? A text stream?
CHAPTER 12
Syntax, Semantics, and Pragmatics
In this chapter we are going to revise the very essence of what a programming language is. These
foundations will allow us to better understand the language structure, program behavior, and the details
of translation that you should be aware of.
As we see, this is exactly the description of a nonterminal complex structure. We can write multiple
possible rules for the same nonterminal, and the appropriate one will be applied. To make it less verbose, we
will use the notation with the symbol | to denote “or,” just as in regular expressions.
This way of describing grammar rules is called BNF (Backus-Naur form): the terminals are denoted
using quoted strings, the production rules are written using ::= characters, and the nonterminal names are
written inside brackets.
Sometimes it is also quite convenient to introduce a terminal ϵ, which, during parsing, will be matched
with an empty (sub)string.
So, grammars are a way to describe language structure. They allow you to perform the following kinds
of tasks:
• Test a language statement for syntactical correctness.
• Generate correct language statements.
• Parse language statements into hierarchical structures where, for example, the if
condition is separated from the code around it and unfolded into a tree-like structure
ready to be evaluated.
CHAPTER 12 ■ SYNTAX, SEMANTICS, AND PRAGMATICS
However, as it is very cumbersome and not so easy to read, we will use a different notation to describe
exactly the same rules:
<notzero> ::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Then we define the nonterminal <raw> to encode all digit sequences. A sequence of digits is defined in a
recursive way as either one digit or a digit followed by another sequence of digits.
The <number> will serve us as a starting symbol. Either we deal with a one-digit number, which has no
constraints on itself, or we have multiple digits, and then the first one should not be zero (otherwise it is a
leading zero we do not want to see); the rest can be arbitrary.
Listing 12-1 shows the final result.
The grammar allows us to build a tree-like structure on top of the text, where each leaf is a terminal, and
each other node is a nonterminal. For example, let’s apply the current set of rules to a string 1+42 and see
how it is deconstructed. Figure 12-1 shows the result.
The first expansion is performed according to the rule <expr> ::= number '+' <expr>. The latter
expression is just a number, which in turn is a digit followed by a number.
People usually operate with the notion of a stream when parsing with grammar rules. A stream
is a sequence of whatever we consider to be symbols. Its interface consists of two functions:
• bool expect(symbol) accepts a single terminal and returns true if the stream contains
exactly this kind of terminal in the current position.
• bool accept(symbol) does the same and then advances the stream position by one in
case of success.
Up to now, we have operated with abstractions such as symbols and streams. We can map all the abstract
notions to concrete instances. In our case, a symbol will correspond to a single char.¹
Listing 12-4 shows an example text processor built from grammar rule definitions. It is a
syntactic checker, which verifies whether the string holds a natural number without leading zeroes and
nothing else (such as spaces around the number).
bool accept(char c) {
if (*stream == c) {
stream++;
return true;
}
else return false;
}
bool notzero( void ) {
return accept( '1' ) || accept( '2' ) || accept( '3' )
|| accept( '4' ) || accept( '5' ) || accept( '6' )
|| accept( '7' ) || accept( '8' ) || accept( '9' );
}
bool digit( void ) {
return accept('0') || notzero();
}
¹ For parsers of programming languages it is much simpler to pick keywords and word classes (such as identifiers or
literals) as terminal symbols. Breaking them into single characters introduces unnecessary complexity.
This example shows how each nonterminal is mapped to a function with the same name that tries to
apply the relevant grammar rules. The parsing occurs in a top-down manner: we start with the most general
starting symbol and try to break it into parts and parse them.
When rules start alike, we factor them by applying the common part first and then trying to
consume the rest, as in the number function. The two branches start with the overlapping nonterminals
<digit> and <notzero>: each of them covers the range 1…9, the only difference being that <digit>’s range also
includes zero. So, if we find a terminal in the range 1…9, we try to consume as many digits after it as we can,
and we succeed either way. If not, we check whether the first digit is 0 and, if so, stop, consuming no more terminals.
The notzero function succeeds if at least one of the symbols in the range 1–9 is found. Due to the
lazy evaluation of ||, not all accept calls will be performed. The first one that succeeds ends the
evaluation of the expression, so the stream is advanced only once.
The digit function succeeds if a zero is found or if notzero succeeded, which is a literal translation
of the rule <digit> ::= '0' | <notzero>.
The other functions work in the same manner. Had we not required the null terminator at the end, the
parser would instead answer the question: “does this sequence of symbols start with a valid
language sentence?”
In Listing 12-4 we used a global variable on purpose, to facilitate understanding. We still
strongly advise against using globals in real programs.
Parsers for real programming languages are usually quite complex. To write them, programmers use
special tools that generate parsers from a declarative description close to BNF. If you need to write a
parser for a complex language, we recommend taking a look at the ANTLR or yacc parser generators.
Another popular technique for handwriting parsers is called parser combinators. It encourages creating
parsers for the most basic generic text elements (a single character, a number, a name of a variable, etc.).
These small parsers are then combined (OR, AND, sequence…) and transformed (one or more occurrences,
zero or more occurrences…) to produce more complex parsers. This technique, however, is easiest to apply
when the language supports a functional style of programming, because it often relies on higher-order
functions.
■ On recursion in grammars Grammar rules can be recursive, as we have seen. However, depending on the
parsing technique, using certain types of recursion might be ill-advised. For example, a rule expr ::= expr '+'
expr, while valid, will not let us construct a parser easily. To write a grammar well in this sense,
you should avoid left-recursive rules such as the one listed previously because, encoded naively, such a rule only
produces infinite recursion: the expr() function would start its execution with another call to expr(). Rules
whose right-hand side begins with a different symbol, which is consumed before the recursive call, avoid this problem.
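The standard rewrite, sketched in BNF (this is the textbook transformation, not a listing from the book):

```
<expr> ::= <expr> '+' <term>     left-recursive: expr() would begin by calling expr()
<expr> ::= <term> '+' <expr>     right-recursive: expr() first consumes a <term>
<expr> ::= <term>
```

Both grammars describe the same strings, but only the second one can be encoded directly as a recursive descent parser.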
■ Question 240 Write a recursive descent parser for floating point arithmetic with multiplication, subtraction,
and addition. For this assignment, we assume no negative literals exist (so instead of writing -1.20 we will
write 0-1.20).
Without taking the multiplication priority into account, the parse tree for the expression 1*2+3 will look
as shown in Figure 12-2.
Figure 12-2. Parse trees without priorities for the expression 1*2+3
However, as we notice, multiplication and addition have equal standing here: they are expanded in order of
appearance. Because of this, the expression 1*2+3 is parsed as 1*(2+3), breaking the common evaluation
order, which is tied to the tree structure.
From a parser’s point of view, priority means that in the parse tree the “add” nodes should
be closer to the root than the “multiply” nodes, since addition is performed on the bigger parts of the
expression. The evaluation of an arithmetic expression proceeds, informally, from the leaves
toward the root.
How do we prioritize some operations over others? This is achieved by splitting the single syntactic category
<expr> into several classes, each of which is, in a sense, a refinement of the previous one. Listing 12-6 shows an
example.
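Since Listing 12-6 is not reproduced here, the split might look like the following BNF sketch (a standard layered grammar; the exact rule names are assumptions):

```
<expr>   ::= <term> '+' <expr>   | <term>
<term>   ::= <factor> '*' <term> | <factor>
<factor> ::= number
```

Because every <expr> must first be broken into <term>s, the “add” nodes necessarily end up closer to the root than the “multiply” nodes, so 1*2+3 parses as (1*2)+3.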
2. The context-free grammars have rules of the form:
<nonterminal> ::= <sequence of terminal and nonterminal symbols>
1. The context-sensitive grammars have rules of the form:
a A b ::= a y b
The difference between levels 2 and 1 is that the nonterminal on the left side is
substituted for y only when it occurs between a and b (which are left untouched).
Remember, both a and b can be rather complex.
0. The unrestricted grammars have rules of the form a ::= b, where both sides are
arbitrary sequences of symbols (the left side being nonempty).
As there are absolutely no restrictions on the left- and right-hand sides of the
rules, these grammars are the most powerful. It can be shown that these
grammars can encode any computer program, so they are Turing-complete.
Real programming languages are almost never truly context-free. For example, the usage of a variable
declared earlier is evidently a context-sensitive construction, because it is only valid when it follows a
corresponding variable declaration. For simplicity, however, such languages are often approximated with
context-free grammars, and then additional passes over the parse tree check whether the context-sensitive
conditions are satisfied.
Figure 12-3. Parse tree and abstract syntax tree of the expression 1 + 2*3
As we see, the tree on the right is much more concise and to the point. Such a tree can be evaluated
directly by an interpreter, or executable code that computes the expression can be generated from it.
12.3 Semantics
The language semantics is a correspondence between the sentences as syntactical constructions and their
meaning. Each sentence is usually described as a type of node in the program abstract syntax tree. This
description is performed in one of the following ways:
• Axiomatically. The current program state can be described with a set of logical
formulas. Then each step of the abstract machine will transform these formulas in a
certain way.
• Denotationally. Each language sentence is mapped into a mathematical object of
a certain theory (e.g., domain theory). The program effects can then be described
in terms of this theory. This is of particular interest when reasoning about the
behavior of programs written in different languages.
• Operationally. Each sentence produces a certain change of state in the abstract
machine, which is subject to description. The descriptions in the C standard are
informal but resemble the operational semantic description more than the other two.
The language standard is the language description in human-readable form. While more
comprehensible to an unprepared reader, it is more verbose and sometimes more ambiguous. To
write concise descriptions, the language of mathematical logic and lambda calculus is usually used. We will
not dive into the details in this book, because this topic demands a pedantic approach of its own. We refer you
to the books [29] and [35] for an immaculate study of type theory and language semantics.
However, this is much trickier than it might appear. In the example in Listing 12-8, we could have
assumed that, as no writes to the variable p are performed, it always holds the address of x. However, this is
not always true, as illustrated by the example shown in Listing 12-9.
So, solving this problem actually requires very complex analysis in the presence of pointer arithmetic.
Once a variable’s address is taken or, worse still, passed to a function, you have to analyze the
entire chain of function calls, taking into account function pointers, pointers to pointers, and so on.
Such analysis will not always yield correct results (in the most general case the problem is even
theoretically undecidable), and performance can suffer because of it. So, in accordance with the
laissez-faire spirit of C, the correctness of pointer dereferencing is left to the programmer’s responsibility.
In managed languages such as Java or C#, defined behavior of pointer dereferencing is much easier
to achieve. First, they usually run inside a framework, which provides the code for raising and handling
exceptions. Second, nullability analysis is much simpler in the absence of address arithmetic. Finally,
they are usually compiled just-in-time, which means that the compiler has access to runtime information
and can use it to perform optimizations unavailable to an ahead-of-time compiler. For example, after
the program has launched and received user input, the compiler may deduce that the pointer x is never NULL if
a certain condition P holds. It can then generate two versions of the function f containing this dereference:
one with a check and one without. Every time f is called, one of the two versions is chosen: if the
compiler can prove that P holds at the call site, the unchecked version is called; otherwise, the
checked one is.
Undefined behavior can be dangerous (and usually is). It leads to subtle bugs, because it guarantees an
error neither at compile time nor at runtime. The program can encounter a situation with undefined
behavior and continue execution silently; however, its behavior may change unpredictably after a certain
number of instructions have been executed.
A typical example is heap corruption. The heap is in fact structured; each block is delimited with
utility information used by the standard library. Writing out of a block’s bounds (but close to them) is likely to
corrupt this information, which will result in a crash during one of the future calls to malloc or free, making this
bug a time bomb.
Here are some cases of undefined behavior, explicitly specified by the C99 standard. We are not
providing the full list, because there are at least 190 cases.
• Signed integer overflow.
• Dereferencing an invalid pointer.
• Comparing the pointers to elements of two different memory blocks.
• Calling a function with arguments that do not match its signature (possible by
taking a pointer to it and casting it to a different function type).
• Reading from an uninitialized local variable.
• Division by 0.
• Accessing an array element out of its bounds.
• Attempting to change a string literal.
• Using the return value of a function that exited without executing a return statement.
What is i equal to? Unfortunately, the best answer we can give is the following: there is undefined
behavior in this code. We simply do not know whether i will be incremented before i*10
is assigned to i or after. There are two writes to the same memory location before the next sequence point, and it is
undefined in which order they will occur.
The cause is that, as we have seen in section 12.3.2, the subexpression evaluation order is not fixed.
As subexpressions might have effects on the memory state (think function calls or pre- or postincrement
operators), and there is no enforced order in which these effects occur, even the result of one subexpression
may depend on the effects of another.
12.4 Pragmatics
12.4.1 Alignment
From the point of view of the abstract machine, we work with bytes of memory, each with its own
address. The hardware protocols used on the chip are, however, quite different. It is quite common that the
processor can only read packs of, say, 16 bytes, which start at an address divisible by 16. In other words, it
can read the first 16-byte chunk of memory or the second one, but not a chunk that starts at an
arbitrary address.
We say that data is aligned on an N-byte boundary if it starts at an address divisible by N.
Evidently, if data is aligned on a kn-byte boundary, it is automatically aligned on an n-byte boundary. For
example, a variable aligned on a 16-byte boundary is simultaneously aligned on an 8-byte boundary.
What happens when the programmer requests a read of a multibyte value that spans two such
chunks (e.g., an 8-byte value whose first three bytes lie in one chunk while the rest lies in another)? Different
architectures give different answers to this question.
Some hardware architectures forbid unaligned memory access. An attempt to read a
value that is not aligned on, for example, an 8-byte boundary results in an interrupt. An example of such an
architecture is SPARC. Operating systems can emulate unaligned accesses by intercepting the generated
interrupt and placing the complex accessing logic into the handler. Such operations, as you might imagine,
are extremely costly, because interrupt handling is relatively slow.
Intel 64 adopts less strict behavior: unaligned accesses are allowed but bear an overhead. For
example, if we want to read 8 bytes starting from address 6 and we can only read chunks that are 8
bytes long, the CPU (central processing unit) will perform two reads instead of one and then compose the
requested value from parts of the two quad words.
So, aligned accesses are cheaper because they require fewer reads. Memory consumption is often
a lesser concern for a programmer than performance; thus compilers automatically adjust variables’
alignment in memory even if this creates gaps of unused bytes. This is commonly referred to as data structure
padding.
Alignment is a parameter of code generation and program execution, so it is usually viewed as
part of language pragmatics.
Assuming an alignment on an 8-byte boundary, the size of such a structure, as returned by sizeof, will be
16 bytes. The a field starts at an address divisible by 8, and then six bytes are wasted to align b on an 8-byte
boundary.
There are several situations in which we should be aware of this:
• You might want to shift the trade-off between memory consumption and
performance toward lower memory consumption. Imagine you are creating a million
copies of a structure and every copy wastes 30% of its size on alignment
gaps. Forcing the compiler to shrink these gaps will then reduce memory usage
by 30%, which is nothing to sneeze at. It also brings better locality,
which can be far more beneficial than the alignment of individual fields.
• Reading file headers or accepting network data into structures should take possible
gaps between structure fields into account. For example, suppose a file header contains a
field of 2 bytes followed by a field of 8 bytes, with no gaps between them. Now we
try to read this header into a structure, as shown in Listing 12-12.
The problem is that the structure’s layout has gaps inside it, while the file stores the fields contiguously.
Assuming the values in the file are a=0x1111 and b=0x2222222222222222, Figure 12-4 shows the
memory state after reading.
Figure 12-4. Memory layout structure and the data read from file
There are ways to control alignment; until C11 they were compiler-specific. We will study those first.
The #pragma keyword allows us to issue pragmatic commands to the compiler. It is supported
in MSVC, Microsoft’s C compiler, and is also understood by GCC for compatibility reasons.
Listing 12-13 shows how to use it to locally change the alignment strategy by means of the pack
pragma.
The second argument of pack is the presumed size of the chunk that the machine is able to read from
memory at the hardware level.
The first argument of pack is either push or pop. During translation, the compiler keeps
track of the current padding value by checking the top of a special internal stack. We can temporarily
override the current padding value by pushing a new value onto this stack and restore the old value
when we are done. Changing the padding value globally is possible with the following form of this pragma:
#pragma pack(2)
However, it is very dangerous because it leads to unpredictable subtle changes in other parts of
program, which are very difficult to trace.
Let’s see how the alignment value affects individual fields’ alignment by analyzing the example
shown in Listing 12-14.
The padding value tells us how many bytes a hypothetical target machine can fetch from memory in
one read. The compiler tries to minimize the number of reads for each field. There is no reason to skip bytes
between a and b here, because doing so would bring no benefit with regard to the padding value. Assuming that
a=0x1111 and b=0x2222222222222222, the memory layout will look like the following:
11 11 22 22 22 22 22 22 22 22
Listing 12-15 shows another example with the padding value equal to 4.
What if we adopted the same memory layout without gaps? As we can only read 4 bytes at a time, it would not
be optimal. Below, we have delimited the bounds of the memory chunks that are readable atomically.
Pack: 2
11 11 | 22 22 | 22 22 | 22 22 | 22 22 | ?? ??
Pack: 4, same memory layout
11 11 22 22 | 22 22 22 22 | 22 22 ?? ??
Pack: 4, memory layout really used
11 11 ?? ?? | 22 22 22 22 | 22 22 22 22
As we see, when the padding is set to 4, adopting a gapless memory layout forces the CPU to perform
three reads to access b. So, basically, the idea is to minimize the number of reads while placing struct
members as close together as possible.
The GCC-specific way of doing roughly the same thing is the packed attribute of the __attribute__
directive. In general, __attribute__ describes an additional specification of a code entity such as a
type or a function. The packed keyword means that the structure fields are stored consecutively in memory
with no gaps at all. Listing 12-16 shows an example.
Remember that packed structures are not part of the language, and unaligned accesses are not supported on some
architectures (such as SPARC) even at the hardware level, which can mean not only a performance hit but also
program crashes or reading invalid values.
#include <stdio.h>
#include <stdalign.h>

int main(void) {
    short x;
    printf("%zu\n", alignof(x));
    return 0;
}
In fact, alignof(x) returns the greatest power of two on which x is aligned, since aligning anything on, for
example, 8 implies alignment on 4, 2, and 1 as well (all its divisors).
Prefer alignof to _Alignof and alignas to _Alignas.
alignas accepts a constant expression and is used to force a certain alignment on a variable or array.
Listing 12-18 shows an example; once launched, it outputs 8.
By combining alignof and alignas, we can align variables on the same boundary as other variables.
Note that you cannot align variables to a value less than their size, and alignas cannot be used to produce
the same effect as __attribute__((packed)).
12.6 Summary
In this chapter we structured and expanded our knowledge about what a programming language is.
We saw the basics of writing parsers and studied the notions of undefined and unspecified behavior
and why they are important. We then introduced the notion of pragmatics and elaborated on one of its most
important instances: data alignment and structure padding.
We defer the assignment for this chapter to the next one, where we will elaborate the most important
good code practices. Assuming our readers are not yet very familiar with C, we want them to adopt good
habits as early as possible in the course of their C journey.
■ Question 245 How do we write a recursive descent parser having the grammar description in BNF?
■ Question 248 Why are regular languages less expressive than context-free grammars?
■ Question 252 What is unspecified behavior and how is it different from undefined behavior?
CHAPTER 13
In this chapter we want to concentrate on coding style. When writing code, a developer constantly
faces decisions. What kinds of data structures should he use? How should they
be named? Where and when should they be allocated? Experienced programmers make these decisions
differently than beginners do, and we find it extremely important to speak about this decision-making
process.
CHAPTER 13 ■ GOOD CODE PRACTICES
The rest of this section concentrates on different language features and associated naming and usage
conventions.
13.2.3 Types
• When possible (C99 or newer), prefer the types defined in stdint.h, such as uint64_t
or uint8_t.
• If you want to be POSIX-compliant, do not define your own types with the _t suffix. It
is reserved for standard types, so that new types introduced in future
revisions of the standard will not clash with custom types defined in programs.
• Types are often named with a prefix common to the project. For example, if you are
writing a calculator, the type tags might be prefixed with calc_.
• When you are defining structures and if you can choose the order of fields, define
them in the following order:
– First try to minimize the memory losses from data structure padding.
– Then order fields by size.
– Finally, sort them alphabetically.
– Sometimes structures have fields that should not be modified by the user directly.
For example, a library defines the structure shown in Listing 13-2.
The fields of such a structure can be modified directly using dot or arrow syntax.
Our convention, however, implies that only specific library functions should
modify the _refcount field; the library user should never do it by hand.
C lacks a concept of private structure fields, so this is as close as we can get
without resorting to more or less dirty hacks.
– Enumeration members should be written in uppercase, like constants. The
common prefix is suggested for the members of one enumeration. An example is
shown in Listing 13-3.
13.2.4 Variables
Choosing the right names for variables and functions is crucial.
• Use nouns for names.
• Boolean variables should have meaningful names too. Prefixing them with is_ is
advisable; then append the exact property being checked. is_good is probably
too broad to be a good name in most cases, unlike is_prime or is_before_last.
Prefer positive names to negative ones, as the human brain parses them more easily;
for example, is_even over is_not_odd.
• It is not advisable to use names that bear no meaning, like a, b, or x4. The notable
exception is the code that illustrates an article or a paper, which describes an
algorithm in pseudo code using such names. In this case, any naming change is more
likely to confuse readers than to bring more clarity. The indices are traditionally
named i and j and you will be understood if you stick to them.
– The program is parallelized and the function is being used in multiple threads
(which is often the case on modern computers).
In the case of a complex call hierarchy, knowing whether a function is reentrant requires
additional analysis.
• They introduce security risks, because their values usually have to be checked before
being modified or used. Programmers tend to forget these checks; if something can go
wrong, it will go wrong.
• They make testing functions harder because of the data dependency they introduce.
Writing code without tests, however, is always a practice to avoid.
Global static mutable variables are evil too, but at least they do not pollute the global namespace in other files.
Global static immutable variables (const static) are, however, perfectly fine and can often be inlined
by the compiler.
13.2.6 Functions
• Use verbs to name functions—for example, packet_checksum_calc.
• The prefix is_ is also quite common for functions checking conditions—for example,
int is_prime( long num ).
• The functions that operate on a struct with a certain tag are often prefixed with the
respective tag name—for example, bool list_is_empty(struct list* lst );.
As C does not allow fine namespace control, this seems to be the simplest way of controlling the
chaos that emerges when most functions are accessible from anywhere.
• Use the static modifier for all functions except those you want to be available to everyone.
• Probably the most important place to use const is for function arguments of type
“pointer to immutable data.” It ensures that the function does not accidentally change
them due to a programmer’s mistake.
There are many build systems; some of the most popular ones for C are make, cmake, and automake.
Different languages have different ecosystems and often have dedicated build tools (e.g., Gradle or
OCamlBuild).
We recommend you study the following projects, which, to our knowledge, are well organized:
• www.gnu.org/software/gsl/
• www.gnu.org/software/gsl/design/gsl-design.html
• www.kylheku.com/kaz/kazlib.html
Doxygen is the de facto standard for creating documentation for C and C++ programs. It allows us to
generate a fully structured set of HTML or LaTeX pages from the program source code. The descriptions of
functions and variables are taken from specially formatted comments. Listing 13-5 shows an example of a
source file accepted by Doxygen.
/** Change the constant pool by adding the other pool's contents in its end.
* @param[out] src The source pool which will be modified.
* @param fresh The pool to merge with the `src` pool.
*/
void const_merge(
struct vm_const_pool* src,
struct vm_const_pool const* fresh );
/**@} */
The specially formatted comments (starting with /** and containing commands such as @defgroup)
are processed by Doxygen to generate documentation for the respective code entities. For more information,
refer to Doxygen documentation.
13.4 Encapsulation
Abstraction is one of the fundamentals of thinking. In software engineering, it is the process of hiding
implementation details and data.
If we want to implement a certain behavior, such as image rotation, we would like to think only about
the image rotation. The input file format and the format of its headers are of little importance to us; what
really matters is being able to work with the dots that form the image and to know its dimensions. However, you
cannot write the program without handling all this information, even though it is actually independent of the
rotation algorithm itself.
We are going to split the program into parts, each of which serves one purpose and only that one. Each part’s
logic can be used by calling a set of exposed functions and/or accessing a set of exposed global variables.
Together they form an interface for this part of the program. To implement them, however, we usually have to
write more functions, which are better hidden from the end user.
■ Working with version control systems When working in a team where many people perform changes
simultaneously, making smaller functions is very important. If a function performs many actions, and its code is
huge, multiple independent changes will be harder to merge automatically.
In programming languages supporting packages or classes, those constructs are used to hide pieces of code and
create interfaces for them. Unfortunately, C has neither; furthermore, there is no concept of “private
fields” in structures: all fields are visible to everyone.
The support for separate code files, called translation units, is the only real language feature that helps us
isolate parts of program code. We use the notion of a module as a synonym for a translation unit, a .c file.
The C standard does not define a notion of module; in this book we use the terms interchangeably
because for C they are roughly equivalent.
As we know, functions and global variables become public symbols by default and thus accessible to
other files. What is reasonable is to mark all “private” functions and global variables as static in the .c file
and declare all “public” functions in the .h file.
As an example, we are going to write a module that implements a stack.
The header file will describe the structure and the functions that operate on its instances. This resembles
object-oriented programming without subtyping (no inheritance).
The interface will consist of the following functions:
• Create or destroy a stack.
• Push and pop elements from a stack.
• Check whether the stack is empty.
• Launch a function for each element in the stack.
The code file will define all functions and probably some more, which won’t be accessible outside of it
and are only created for the sake of decomposition and code reusability.
Listings 13-6 and 13-7 show the resulting code. stack.h describes an interface. It has an include guard,
enumerates all other headers (first standard headers, then custom ones), and defines custom types.
#ifndef _STACK_H_
#define _STACK_H_

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
struct list;
struct stack {
struct list* first;
struct list* last;
size_t count;
};
#endif /* _STACK_H_ */
There are two types defined: list and stack. The first is only used internally by the stack,
so we declared it as an incomplete type: only pointers to instances of such a type are allowed unless its
definition is provided later.
For everyone who includes stack.h, the type struct list will remain incomplete. The implementation
file stack.c, however, will define the structure, completing the type and allowing access to its fields
(but only within stack.c).
Then the struct stack is defined and the functions that work with it are declared (stack_push,
stack_pop, etc.) (see Listing 13-7).
This file defines all the functions declared in the header. It can be split into multiple .c files, which
sometimes benefits the project structure; what matters is that the compiler should accept them all
and that the compiled code then reaches the linker.
A static function list_new is defined to isolate the initialization of struct list instances. It is not
exposed to the outside world. During optimization, the compiler can not only inline it but even
delete the function itself, effectively eliminating any possible impact on performance.
Marking the function static is necessary (but not sufficient) for this optimization to occur. Additionally, the
instructions of static functions might be placed closer to their respective callers, improving locality.
By splitting the program into modules with well-described interfaces, you reduce the overall complexity
and achieve better reusability.
The need to create header files makes modifications a bit cumbersome because the consistency of
headers with the code itself is the programmer’s responsibility. However, we can benefit from it as well by
specifying a clear interface description, which lacks the implementation details.
13.5 Immutability
It is quite common to have to choose between creating a new modified copy of a structure and performing
modifications in place.
Here are some advantages and disadvantages of both choices.
• Creating copy:
– Easier to write: you won’t accidentally pass the wrong instance to a function.
– Easier to debug, because you don’t have to track changes to the variable.
– Can be optimized by the compiler.
– Friendly to parallelization.
– Can be slower.
13.6 Assertions
There is a mechanism that allows you to test certain conditions during program execution. When such a
condition is not satisfied, an error is produced and the program terminates abnormally.
To use the assertion mechanism, we #include <assert.h> and then use the assert macro.
Listing 13-8 shows an example.
The condition given to the assert macro is obviously false; hence the program terminates
abnormally and informs us about the failed assertion:
If the preprocessor symbol NDEBUG is defined (which can be achieved by using -D NDEBUG compiler
option or #define NDEBUG directive), the assert is replaced by an empty string and thus turned off. So,
assertions will produce zero overhead and the checks will not be performed.
You should use asserts to check for impossible conditions that signify the inconsistency of the program
state. Never use asserts to perform checks on user input.
Symmetrically, you can return the value as usual and set the error code through a pointer to a dedicated
variable.
Error codes can be described using an enum or with several #defines. Then you can use them as indices
into a static array of messages or in a switch statement. Listing 13-10 shows an example.
Never use global variables as error code holders (or to return a value from a function).
According to the C standard, a variable-like entity errno exists. It is a modifiable lvalue
and must not be explicitly declared. Its usage is akin to a global variable, although its value is thread-local. The
library functions use it as an error code holder, so after seeing a failure from a function (e.g., fopen returned
NULL), one should check errno for an error code. The man pages for the respective function
enumerate the possible errno values (e.g., EEXIST).
Despite this feature having sneaked into the standard library, it is largely considered an anti-pattern and
should not be imitated.
2. Using callbacks.
Callbacks are function pointers that are passed as arguments and called by the function that accepts
them. They can be used to isolate error handling code, but they often look strange to people who are more
accustomed to traditional return codes. Additionally, the execution order becomes less obvious.
Listing 13-11 shows an example.
int main(void) {
printf("%d %d\n",
div( 10, 2, div_by_zero ),
div( 10, 0, div_by_zero ) );
return 0;
}
{
if (!doA()) goto exit;
if (!doB()) goto revertA;
if (!doC()) goto revertB;
/* success: no cleanup needed */
return;
revertB:
undoB();
revertA:
undoA();
exit:
return;
}
In this example, three actions are performed, and each of them can fail. The nature of these actions is
such that we have to clean up after them. For example, doA might allocate dynamic memory. If
doA succeeded but doB did not, we have to free this memory to prevent a memory leak. This is what the code
labeled revertA does.
The recoveries are performed in reverse order. If doA and doB succeeded, but doC failed, we have to
revert B first and then A. So, we label the reverting stages and let the control fall through
them: goto revertB will undo the effects of doB first and then fall through to the code undoing doA. This trick can often
be seen in the Linux kernel. However, be wary: gotos usually make verification much harder, which is why they
are sometimes banned.
13.9 On Flexibility
We do advocate code reusability. However, taking this to the extreme results in an absurd number of
abstraction layers and boilerplate code that is only present to support a possible future need for additional
features (which might never materialize).
There is no silver bullet in the broad sense. Every programming style, every model of computation,
is good and concise in some cases and bulky and verbose in others. Analogously, the best tool is
specialized rather than a jack of all trades. You could transform an image viewer into a powerful editor,
capable of playing video and editing ID3 tags, but the image viewer facet will surely suffer, and so will the
user experience.
Writing more abstract code can bring benefits because such code is easier to adapt to new contexts. At
the same time, it introduces complexity that might be unnecessary. Only generalize to an extent that does no
harm. To know when to stop you need to answer several questions for yourself, such as
• What is the purpose of your program or library?
• What are the limits of functionality that you imagine for your program?
• Will it be easier to write, use, and/or debug this function if it is written in a more general
way?
While the first two questions are very subjective, the last one can be illustrated with an example. Let’s
take a look at the code shown in Listing 13-13.
Compare it to another version with the same logic, split into two functions, shown in Listing 13-14.
Listing 13-15 shows the same logic with error handling. As you can see, there is no error
handling for file opening and closing in the dump function; it is performed in fun instead.
enum stat {
STAT_OK,
STAT_ERR_OPEN,
STAT_ERR_CLOSE,
STAT_ERR_WRITE
};
If the dump function performed multiple writes, error handling inside it would encumber the function
and make it less readable.
uint32_t biWidth;
uint32_t biHeight;
uint16_t biPlanes;
uint16_t biBitCount;
uint32_t biCompression;
uint32_t biSizeImage;
uint32_t biXPelsPerMeter;
uint32_t biYPelsPerMeter;
uint32_t biClrUsed;
uint32_t biClrImportant;
};
■ Question 259 Read BMP file specifications to identify what these fields are responsible for.
The file format depends on the bit count per pixel. There are no color palettes when 16 or 24 bits per
pixel are used.
Each pixel is encoded by 24 bits, or 3 bytes, as shown in Listing 13-17. Each component is a number from
0 to 255 (one byte) which shows the amount of blue, green, or red in this pixel. The resulting color is a
superposition of these three base colors.
Every row of pixels is padded so that its length is a multiple of 4 bytes. For example, suppose the image width
is 15 pixels, which corresponds to 15 × 3 = 45 bytes. To pad it we skip 3 bytes (up to 48, the closest multiple of 4)
before starting the new row of pixels. Because of this, the real image size differs from the product of width,
height, and pixel size (3 bytes).
13.10.2 Architecture
We want to think about program architecture that is extensible and modular.
1. Define the pixel structure struct pixel so that you do not work with the raster table
directly, as with completely structureless data (which should always be avoided).
2. Separate the inner image representation from the input format. The rotation is
performed on the inner image format, which is then serialized back to BMP. The
BMP format might change, you might want to support other formats, and you
do not want to couple the rotation algorithm tightly to BMP.
To achieve that, define a structure struct image to store the pixel array (contiguous, now without
padding) and only the information that should really be kept. For example, there is absolutely no need to store
the BMP signature here, or any of the never-used header fields. We can get away with the image width and
height in pixels.
You will need to create functions to read an image from a BMP file and to write it to a BMP file (probably
also to generate a BMP header from the inner representation).
3. Separate file opening from its reading.
4. Make error handling unified and handle errors in exactly one place (for this very
program it is enough).
To achieve that, define a from_bmp function, which will read an image from a stream and return one
of the codes showing whether the operation completed successfully.
Remember the flexibility concerns. Your code should be easy to use in applications with a graphical user
interface (GUI) as well as in those without a GUI at all, so scattering prints to stderr all over the place is not
a good option: restrict them to the error handling part of the code. Your code should be easily adaptable to
different input formats as well.
Listing 13-18 shows several snippets of starting code.
struct image {
uint64_t width, height;
struct pixel_t* data;
};
/* deserializer */
enum read_status {
READ_OK = 0,
READ_INVALID_SIGNATURE,
READ_INVALID_BITS,
READ_INVALID_HEADER
/* more codes */
};
/* serializer */
enum write_status {
WRITE_OK = 0,
WRITE_ERROR
/* more codes */
};
■ Question 260 Implement blurring. It is done in a very simple way: for each pixel you compute its new
components as an average in a 3 × 3 pixels window (called kernel). The border pixels are left untouched.
■ Question 261 Implement rotation to an arbitrary angle (not only 90 or 180 degrees).
■ Question 262 Implement “dilate” and “erode” transformations. They are similar to the blur, but instead of
doing an average in a window, you have to compute the minimal (erode) or maximal (dilate) component values.
#define _DEFAULT_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
struct mem;
#pragma pack(push, 1)
struct mem {
struct mem* next;
size_t capacity;
bool is_free;
};
#pragma pack(pop)
#define DEBUG_FIRST_BYTES 4
#endif
Remember that complex logic begs for a well-thought-out decomposition into smaller functions.
You can use the code shown in Listing 13-21 to debug the heap state. Do not forget that you can also
wait for user input and check the /proc/PID/maps file to see the actual mappings of a process with the
identifier PID.
size_t i;
fprintf( f,
"start: %p\nsize: %lu\nis_free: %d\n",
(void*)address,
address->capacity,
address->is_free );
for ( i = 0;
i < DEBUG_FIRST_BYTES && i < address->capacity;
++i )
fprintf( f, "%hhX",
((char*)address)[ sizeof( struct mem ) + i ] );
putc( '\n', f );
}
13.12 Summary
In this chapter we have studied some of the most important recommendations concerning
coding style and program architecture. We have seen naming conventions and the reasons behind
common code guidelines. When we write code, we should adhere to certain restrictions derived from our
requirements for the code as well as for the development process itself. We have seen such important concepts
as encapsulation. Finally, we have provided two more advanced assignments, where you can apply your new
knowledge about program architecture. In the next part we are going to dive into the details of translation,
review some language features that are easier to understand on the assembly level, and talk about
performance and compiler optimizations.
PART III
Translation Details
In this chapter we are going to revisit the notion of calling convention to deepen our understanding and
work through translation details. This requires both an understanding of program functioning on the
assembly level and a certain degree of familiarity with C. We are also going to review some classic low-level
security vulnerabilities that might be introduced by a careless programmer. Understanding these low-level
translation details is sometimes crucial for eradicating very subtle bugs that do not reveal themselves at
every execution.
■ Question 263 Read about the movq, movdqa, and movdqu instructions in [15].
The first six arguments from the first list are passed in general purpose registers
(rdi, rsi, rdx, rcx, r8, and r9). The first eight arguments from the second list
are passed in registers xmm0 to xmm7. If there are more arguments from these lists
to pass, they are pushed onto the stack in reverse (right-to-left) order, so that the
first of the stack-passed arguments ends up on top of the stack before the call is performed.
While integers and floats are quite trivial to handle, structures are a bit trickier.
If a structure is bigger than 32 bytes, or has unaligned fields, it is passed in
memory.
A smaller structure is decomposed into fields, and each field is treated separately
(recursively, for fields of nested structures). So, a structure of two elements can be
passed the same way as two arguments. If one field of a structure is classified as
“memory,” that classification propagates to the whole structure.
The rbp register, as we will see, is used to address the arguments passed in
memory and local variables.
What about return values? Integer and pointer values are returned in rax and rdx.
Floating point values are returned in xmm0 and xmm1. Big structures are returned
through a pointer, provided as an additional hidden argument, in the spirit of the
following example:
CHAPTER 14 ■ TRANSLATION DETAILS
struct s {
char vals[100];
};
struct s f( int x ) {
struct s mys;
mys.vals[10] = 42;
return mys;
}
3. Then the call instruction is executed. Its operand is the address of the first
instruction of the called function. call pushes the return address onto the stack.
A program can have multiple instances of the same function active at the
same time, not only in different threads but also due to recursion. Each such
function instance keeps its state on the stack, because the stack’s principle, “last in, first
out,” corresponds to how functions are launched and terminated. If a function f
is launched and then invokes a function g, g terminates first (though it was invoked
last), and f terminates last (while being invoked first).
A stack frame is the part of the stack dedicated to a single function instance. It stores the
values of the local variables, temporary values, and saved registers.
The function code is usually enclosed between a prologue and an epilogue,
which are similar for all functions. The prologue initializes the stack frame; the
epilogue deinitializes it.
During the function execution, rbp stays unchanged and points at the beginning
of its stack frame. It is possible to address local variables and stack arguments
relative to rbp, as reflected in the function prologue shown in Listing 14-1.
The old rbp value is saved to be restored later in the epilogue. Then a new rbp is set
up to the current top of the stack (which, by the way, now stores the old rbp value).
Then the memory for the local variables is allocated on the stack by subtracting
their total size from rsp. This is the automatic memory allocation in C and the
technique we used in the very first assignment to allocate buffers on the stack.
The functions end with an epilogue shown in Listing 14-2.
By restoring the stack frame’s beginning address into rsp we can be sure that all
memory allocated on the stack is deallocated. Then the old rbp value is restored,
so that rbp points at the start of the previous stack frame again. Finally, ret pops the
return address from the stack into rip.
A fully equivalent alternative form is sometimes chosen by the compiler. It is
shown in Listing 14-3.
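Listings 14-1 through 14-3 are not reproduced in this excerpt; the canonical shapes the text describes are the following (framesize stands for the total size of the locals):

```nasm
; Prologue (Listing 14-1)
push rbp
mov  rbp, rsp
sub  rsp, framesize   ; reserve room for local variables

; Epilogue (Listing 14-2)
mov  rsp, rbp         ; free the locals
pop  rbp              ; restore the caller's frame pointer
ret

; Equivalent epilogue (Listing 14-3)
leave
ret
```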
The leave instruction is made especially for stack frame destruction. Its
counterpart, enter, is rarely used by compilers because it does more
than the instruction sequence shown in Listing 14-1: it is aimed at languages with
support for nested functions.
4. After leaving the function, our work is not always done. If some arguments were
passed in memory (on the stack), we have to get rid of them too.
int main(void) {
int x = maximum( 42, 999 );
return 0;
}
00000000004004eb <main>:
4004eb: 55 push rbp
4004ec: 48 89 e5 mov rbp,rsp
4004ef: 48 83 ec 10 sub rsp,0x10
4004f3: be e7 03 00 00 mov esi,0x3e7
4004f8: bf 2a 00 00 00 mov edi,0x2a
4004fd: e8 b4 ff ff ff call 4004b6 <maximum>
400502: 89 45 fc mov DWORD PTR [rbp-0x4],eax
After a bit of cleaning, we get a pure and more readable assembly code, which is shown in Listing 14-6.
...
maximum:
push rbp
mov rbp, rsp
sub rsp, 3984
leave
ret
■ Register assignment Refer to section 3.4.2 for an explanation of why changing esi implies a change
in the whole rsi.
We are going to trace the function call and its prologue (check Listing 14-6) and show the stack contents
immediately after its execution.
[Figures: stack contents immediately after call maximum and after push rbp; not reproduced in this excerpt.]
How does printf know the exact number of arguments? It knows for sure that at least one argument is
passed (char const* format). By analyzing this string and counting the format specifiers, it computes the total
number of arguments as well as their types (and hence in which registers they reside).
■ Note In case of a variable number of arguments, al should contain the number of xmm registers used by
the arguments.
As you can see, there is absolutely no way to know exactly how many arguments have been passed. The
function deduces it from the arguments that are certainly present (format in this case). If there are more
format specifiers than arguments, printf will not know about it and will naively fetch the contents of the
respective registers and memory.
Apparently, this functionality cannot be encoded directly in C by a programmer, because the registers
cannot be accessed directly. However, there is a portable mechanism for declaring functions with a variable
argument count that is part of the standard library. Each platform has its own implementation of this
mechanism. It can be used after the stdarg.h file is included and consists of the following:
• va_list–a structure that stores information about arguments.
• va_start–a macro that initializes va_list.
• va_end–a macro that deinitializes va_list.
• va_arg–a macro that fetches the next argument from the argument list, given an
instance of va_list and the argument’s type.
Listing 14-7 shows an example. The function printer accepts a number of arguments and an arbitrary
number of them.
#include <stdarg.h>
#include <stdio.h>
void printer( int count, ... ) {
va_list args;
va_start( args, count );
while ( count-- ) printf( "%d ", va_arg( args, int ) );
va_end( args );
}
int main(void) {
printer( 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 );
return 0;
}
First, va_list is initialized by va_start with the name of the last argument before the dots. Then, each call
to va_arg fetches the next argument; its second parameter is the name of that argument’s type. In the
end, va_list is deinitialized using va_end.
Since a type name becomes an argument, and va_list is passed by name but mutated, this example
can look confusing.
■ Question 264 Can you imagine a situation in which a function, not a macro, accepts a variable by name
(syntactically) and changes it? What should be the type of such variable?
They are used inside custom functions, which in their turn accept an arbitrary number of
arguments.
Listing 14-8 shows an example.
va_end( args );
}
14.2 volatile
The volatile keyword greatly affects the way the compiler optimizes the code.
The model of computation for C is a von Neumann one. It does not account for parallel program execution,
and the compiler tries to do as many optimizations as it can without changing the observable
program behavior. These might include reordering instructions and caching variables in registers. A read
of a memory value whose result is never used can be omitted entirely.
Reads from and writes to volatile variables, however, always happen, and their order is
preserved.
From the compiler’s point of view, however, this code has no observable effect, so it might be
optimized away completely. When the pointer is marked volatile, this is no longer the case.
Listing 14-10 shows an example.
■ Volatile pointers in the language standard If a volatile pointer points at non-volatile memory,
the standard gives no guarantees! They exist only when both the pointer and the memory are
volatile. So, strictly speaking, the example above is incorrect. However, as programmers use
volatile pointers with exactly this reasoning, the most used compilers (MSVC, GCC, Clang) do not optimize away
dereferences of volatile pointers. There is no standard-conforming way of doing this.
There are two variables: one is volatile, the other is not. Both are incremented and given to printf as
arguments. GCC will generate the following code (with -O2 optimization level), shown in Listing 14-12:
; vol = 4
mov DWORD PTR [rsp+0xc],0x4
; vol ++
mov eax,DWORD PTR [rsp+0xc]
add eax,0x1
mov DWORD PTR [rsp+0xc],eax
xor eax,eax
As we see, the contents of the volatile variable are really read and written each time the variable occurs
in the C source. The ordinary variable is not even created: the computation is performed at compile time,
and the final result is stored in rsi, waiting to be used as the second argument of the call.
#include <setjmp.h>
#include <stdio.h>

int main(void) {
jmp_buf jb;
int val;
val = setjmp( jb );
puts("Hello!");
if (val == 0) longjmp( jb, 1 );
else puts("End");
return 0;
}
Local variables that are not marked volatile and were modified after the setjmp call hold indeterminate
values after longjmp. This is a source of bugs as well as of memory freeing issues: it is hard to analyze the
control flow in the presence of longjmp and ensure that all dynamically allocated memory is freed.
In general, calling setjmp as part of a complex expression is allowed only in a few specific cases; in all
others it is undefined behavior. So, better not to do it.
It is important to remember that all this machinery is based on stack frames usage. It means that you
cannot perform longjmp in a function with a deinitialized stack frame. For example, the code, shown in
Listing 14-14, yields an undefined behavior for this very reason.
void g(void) {
f();
longjmp(jb, 1);
}
The function f has already terminated, but we are performing a longjmp into it. The program behavior is
undefined because we are trying to restore a context inside a destroyed stack frame.
In other words, you can only jump within the same function or into a function that is still running.
jmp_buf buf;
return 0;
}
We are going to compile it without optimizations (gcc -O0, Listing 14-16) and with optimizations
(gcc -O2, Listing 14-17).
Without optimizations,
; `argc` and `argv` are saved in stack to make `rdi` and `rsi` available
mov DWORD PTR [rbp-0x14],edi
mov QWORD PTR [rbp-0x20],rsi
; var = 0
mov DWORD PTR [rbp-0x4],0x0
; b = 0
mov DWORD PTR [rbp-0x8],0x0
; A fair increment
; b++
mov eax,DWORD PTR [rbp-0x8]
add eax,0x1
mov DWORD PTR [rbp-0x8],eax
; var++
add DWORD PTR [rbp-0x4],0x1
; `printf` call
mov eax,DWORD PTR [rbp-0x4]
mov esi,eax
mov edi,0x400684
; There are no floating point arguments, thus rax = 0
mov eax,0x0
call 400450 <printf@plt>
; calling `longjmp`
mov esi,0x1
mov edi,0x600a40
call 400490 <longjmp@plt>
.endlabel:
mov eax,0x0
leave
ret
1
2
3
With optimizations,
mov edi,0x600a40
; b = 0
mov DWORD PTR [rsp+0xc],0x0
; instructions are placed in the order different
; from C statements to make better use of pipeline and other inner
; CPU mechanisms.
call 400470 <_setjmp@plt>
; return 0
xor eax,eax
add rsp,0x18
ret
.branch:
; b = b + 1
add eax,0x1
mov DWORD PTR [rsp+0xc],eax
xor eax,eax
call 400450 <printf@plt>
; longjmp( buf, 1 )
mov esi,0x1
mov edi,0x600a40
call 400490 <longjmp@plt>
1
1
1
The volatile variable b, as you see, behaved as intended (otherwise, the loop would never have ended).
The variable var was always equal to 1, despite being “incremented” according to the program text.
■ Question 265 How do you implement “try–catch”-alike constructions using setjmp and longjmp?
14.4 inline
inline is a function qualifier introduced in C99. It mimics the behavior of its C++ counterpart.
Before you read the explanation, please do not assume that this keyword is used to force function
inlining!
Before C99, there was a static qualifier, which was often used in the following scenario:
• The header file includes not the function declaration but the full function definition,
marked as static.
• The header is then included in multiple translation units. Each of them receives a copy
of the emitted code, but as the corresponding symbol is object-local, the linker does
not see it as a multiple definition conflict.
In a big project, this gives the compiler access to the function’s source code, which enables it to really
inline the function if needed. Obviously, the compiler might also decide that the function is better left not
inlined. In that case we start getting clones of this function pretty much everywhere. Each file calls its
own copy, which is bad for locality and bloats the memory image as well as the executable itself.
The inline keyword addresses this issue. Its correct usage is as follows:
• Describe an inline function in a relevant header, for example,
• In exactly one translation unit (i.e., a .c file), add the external declaration
This file will contain the function code, which will be referenced by every other file, where the function
was not inlined.
■ Semantics change In GCC prior to 4.2.1 the inline keyword had a slightly different meaning. See the post
[14] for an in-depth analysis.
14.5 restrict
restrict is a keyword akin to volatile and const which first appeared in the C99 standard. It is used to
mark pointers and is thus placed to the right of the asterisk, as follows:
int x;
int* restrict p_x = &x;
If we create a restricted pointer to an object, we promise that all accesses to this object will go
through the value of this pointer. A compiler can either ignore this promise or use it for certain optimizations,
which is often possible.
In other words, a write through any other pointer will not affect the object a restricted pointer points to.
Breaking this promise leads to subtle bugs and is a clear case of undefined behavior.
Without restrict, every pointer is a source of possible memory aliasing, when you can access the same
memory cells by using different names for them. Consider a very simple example, shown in Listing 14-18. Is
the body of f equal to *x += 2 * (*add);?
The answer is, surprisingly, no, they are not equal. What if add and x are pointing to the same address?
In this case, changing *x changes *add as well. So, in case x == add, the function will add *x to *x making it
two times the initial value, and then repeat it making it four times the initial value. However, when x != add,
even if *x == *add the final *x will be three times the initial value.
The compiler is well aware of it, and even with optimizations turned on it will not optimize away two
reads, as shown in Listing 14-19.
However, add restrict, as shown in Listing 14-20, and the disassembly will demonstrate an
improvement, as shown in Listing 14-21. The second argument is read exactly once, multiplied by 2, and
added to the dereferenced first argument.
Only use restrict if you are sure of what you are doing. Writing a slightly inefficient program is much
better than writing an incorrect one.
It is important to use restrict also to document code. For example, the signature for memcpy, a function
that copies n bytes from some starting address s2 to a block starting with s1, has changed in C99:
void*
memcpy(void* restrict s1,
const void* restrict s2,
size_t n );
This reflects the fact that these two areas should not overlap; otherwise the correctness is not
guaranteed.
Restricted pointers can be copied from one to another to create a hierarchy of pointers. However, the
standard only permits this when the copy does not reside in the same block as the original pointer.
Listing 14-22 shows an example.
void f(void) {
struct s* restrict p_s = &inst;
int* restrict p_x = p_s->x; /* Bad */
{
int* restrict p_x2 = p_s->x; /* Fine, other block scope */
}
}
To our satisfaction, the compiler optimizes the reads away just as we wanted. Listing 14-24 shows the
disassembly.
In code targeting C99 and newer standards, we discourage exploiting aliasing rules for optimization
purposes, because restrict makes the intention more obvious and does not introduce unnecessary type names.
void f( void ) {
char buffer[16];
gets( buffer );
}
After being initialized, the layout of the stack frame will look as follows:
The gets function reads a line from stdin and places it in the buffer whose address it accepts as an
argument. Unfortunately, it does not check the buffer size at all and thus can overrun it.
If the line is too long, it will overwrite the rest of the buffer, then the saved rbp value, and then the return address.
When the ret instruction is executed, the program will most probably crash. Even worse, if the attacker
forms a clever input line, it can rewrite the return address with specific bytes forming a valid address.
Should the attacker choose to redirect the return address directly into the buffer being overrun, he can
transmit the executable code directly in this buffer. Such code is often called shellcode, because it is small
and usually only opens a remote shell to work with.
Obviously, this is not only a flaw in gets but a consequence of the language design itself. The moral is never to
use gets and always to provide a way to check the bounds of the target memory block.
14.7.2 return-to-libc
As we have already elaborated, a malevolent user can rewrite the return address if the program allows
a stack buffer overrun. The return-to-libc attack replaces the return address with the address
of a function in the standard C library. One function is of particular interest: int system(const char*
command). This function allows you to execute an arbitrary shell command. What’s even worse, it will be
executed with the same privileges as the attacked program.
When the current function terminates by executing ret, we start executing the
function from libc. How to form a valid argument for it is a separate question.
In the presence of ASLR (address space layout randomization), doing this attack is nontrivial (but still
possible).
Function Description
printf Outputs a formatted string.
fprintf Writes the formatted output to a file.
sprintf Prints into a string.
snprintf Prints into a string, checking the length.
vfprintf Prints to a file, taking the arguments from a va_list.
vprintf Prints to stdout, taking the arguments from a va_list.
vsprintf Prints to a string, taking the arguments from a va_list.
vsnprintf Prints to a string, checking the length, taking the arguments from a va_list.
Listing 14-26 shows an example. Suppose that the user inputs less than 100 symbols. Can you crash this
program or produce other interesting effects?
The vulnerability does not come from gets usage but from usage of the format string taken from the
user. The user can provide a string that contains format specifiers, which will lead to an interesting behavior.
We will mention several potentially unwanted types of behavior.
• The "%x" specifier and its kin can be used to view the stack contents. The first five "%x" will
take their arguments from registers (rdi is already occupied by the format string address);
the following ones will show the stack contents. Let’s compile the example
shown in Listing 14-26 and see its reaction to the input "%x %x %x %x %x %x %x %x
%x %x %x".
> %x %x %x %x %x %x %x %x %x %x
b1b6701d b19467b0 fbad2088 b1b6701e 0 25207825 20782520 78252078 25207825
As we see, it actually gave us four numbers that share a certain informal similarity, a 0 and two more
numbers. Our hypothesis is that the last two numbers are taken from the stack already.
Getting into gdb and exploring the memory near the stack top right after printf call we are going to get
results that prove our point. Listing 14-27 shows the output.
• The "%s" format specifier is used to print strings. As a string is defined by the address
of its start, this means addressing memory by a pointer. So, if no valid pointer is given,
the invalid pointer will be dereferenced.
■ Question 266 What will be the result of launching the code shown in Listing 14-26 on input "%s %s %s %s %s"?
• The "%n" format specifier is a bit exotic but still harmful. It allows one to write an integer
into memory: printf accepts a pointer to an integer, which is overwritten
with the number of characters written so far (before "%n" occurs). Listing 14-28 shows an
example of its usage.
int main(void) {
int count;
printf( "hello%n world\n", &count);
printf( "%d\n", count );
return 0;
}
This will output 5, because five characters were output before "%n". This is not simply a string literal's length,
because other format specifiers can come before it, resulting in output of variable length (e.g.,
printing an integer can emit seven or ten characters). Listing 14-29 shows an example.
To avoid this, do not use a string accepted from the user as a format string. You can always write
printf("%s", buffer), which is safe as long as buffer is not NULL and is a valid null-terminated string.
Do not forget about such functions as puts or fputs, which are not only faster but also safer.
Overrunning the buffer will overwrite the security cookie. Before the ret instruction, the compiler emits a
special check that verifies the integrity of the security cookie; if it has changed, the program is terminated.
The ret instruction never gets to be executed.
Both MSVC and GCC have this mechanism turned on by default.
14.8.3 DEP
We have already discussed Data Execution Prevention in Chapter 4. This technology protects some pages
from executing instructions stored in them. To enable it, programs should also be compiled with
support for it turned on.
The sad fact is that it does not work well with programs that use just-in-time compilation, which forms
executable code during the program's own execution. This is not as rare as it might seem; for example,
virtually all browsers use JavaScript engines that rely on just-in-time compilation.
14.9 Summary
In this chapter we have revisited the calling convention used in *nix on Intel 64. We have seen the example
usages of the more advanced C features, namely, volatile and restrict type qualifiers and non-local
jumps. Finally, we have given a brief overview of several classical vulnerabilities that are possible because
of the way stack frames are organized, and the compiler features that were designed to automatically cope
with them. The next chapter will explain more low-level details related to the creation and usage of dynamic
libraries to strengthen our understanding of them.
■ Question 267 What are the xmm registers? How many of them are there?
■ Question 271 When passing arguments to the function, why is rax sometimes used?
■ Question 274 Why aren’t we addressing the local variables relative to rsp?
■ Question 277 Describe in detail how the stack frame changes during function execution.
■ Question 279 How do we declare and use a function with a variable number of arguments?
■ Question 283 Why do only volatile stack variables persist after longjmp?
■ Question 289 How can we achieve the same result without using the restrict keyword?
■ Question 292 What is a security cookie? Does it solve program crashes on buffer overflow?
CHAPTER 15
Shared Objects and Code Models
Chapter 5 already provided a short overview of dynamic libraries (also known as shared objects). This
chapter will revisit dynamic libraries and expand our knowledge by introducing the concepts of the
Program Linkage Table and the Global Offset Table. As a result, we will be able to build a shared library
in pure assembly and C, compare the results, and study its structure. We will also study a concept of code
models, which is rarely discussed but gives a consistent view of several important details of assembly code
generation.
Determining dependencies and loading them is relatively easy: it boils down to searching the dependencies
recursively and checking whether each object has already been loaded. Initialization is also not particularly
mysterious. The relocation, however, is of interest to us.
There are two kinds of relocations:
• Links to locations in the same object. The static linker performs all such
relocations, since they are known at link time.
• Symbol dependencies, which usually point into a different object.
The second kind of relocation is more costly and is performed by the dynamic linker.
Before doing relocations, we need to do a lookup first to find the symbols we want to link. There is a
notion of lookup scope of an object file, which is an ordered list containing some other loaded objects. The
lookup scope of an object file is used to resolve symbols necessary for it. The way it is computed is described
in [24] and is rather complex, so we refer you to the relevant document in case of need.
The lookup scope consists of three parts, listed here in reverse order of search; that is, the symbol
is searched for in the third part of the scope first.
1. Global lookup scope, which consists of the executable file and all its dependencies,
including dependencies of the dependencies, etc. They are enumerated in a
breadth-first search fashion, that is:
• The executable itself.
• Its dependencies.
• The dependencies of its first dependency, then of the second, etc. Each object is
loaded only once.
2. The part constructed if DF_SYMBOLIC flag is set in the ELF executable file metadata.
It is considered legacy; its usage is discouraged, so we are not studying it here.
3. Objects loaded dynamically with all their dependencies by means of dlopen
function call. They are not searched for normal lookups.
Each object file contains a hash table, which is used for lookup.1 This table stores the symbol
information and is used to quickly find a symbol by its name. The first object in the lookup scope that
contains the needed symbol is linked, which allows for symbol overriding (interposition), for example, via the
LD_PRELOAD mechanism, which will be explored in section 15.5.
The hash table size and the number of exported symbols affect the lookup time. When the
-O flag is provided to the linker,2 it tries to optimize these parameters for better lookup speed. Remember
that in languages such as C++ symbol names are computed based not only on, say, the function
name: they also encode all enclosing namespaces (and class names), which can easily produce names
several hundred characters long. In the case of hash table collisions (which are usually frequent), a string
comparison must be performed between the symbol name we are looking for and every symbol in the
bucket chosen by computing the hash.
The modern GNU-style hash tables add a heuristic on top: a Bloom filter3 is used to
quickly answer the question "is this symbol even defined in this object file?" That makes unnecessary lookups
much less frequent, which positively impacts performance.
1. We will not provide the details of what hash tables are or how they are implemented, but if you do not know about
them, we highly advise you to read about them! This is an absolutely classic data structure used everywhere. A good
explanation can be found in [10].
2. Do not confuse this with the -O flag for the compiler!
3. A widely used probabilistic data structure. It allows us to quickly check whether an element is contained in a
certain set; the answer "yes" is subject to an additional check, while "no" is always certain.
CHAPTER 15 ■ SHARED OBJECTS AND CODE MODELS
As we see, there is a global variable in the main file that we want to share with the library; the
library explicitly declares it extern. The main file holds the declaration of the library function (which is
usually placed in a header file shipped with the compiled library).
First, we create the object files as usual. Then we build the dynamic library using the -shared flag. When we
build the executable, we list all dynamic libraries on which it depends, because this information
should be included in the ELF metadata. Notice the usage of the -fPIC flag, which forces the compiler to
generate position-independent code. We will see the effects of this flag on the assembly later.
Let’s check the file dependencies using ldd.
Our fresh library is present in the list of dependencies, but ldd cannot find it. An attempt to launch the
executable fails with the expected message:
The libraries are searched for in the default locations (such as /lib/). Ours is not there, so we have another
option: the environment variable LD_LIBRARY_PATH is parsed to get a list of additional directories where
libraries might be located. As soon as we set it to the current directory, ldd finds the library. Note that the
search starts with the directories listed in LD_LIBRARY_PATH and then proceeds to the standard directories.
> ./main
param: 42
global: 100
Let's see how the variable global, created in the main executable file, is addressed in the dynamic library.
To do that, we are going to study a fragment of objdump -D -Mintel-mnemonic output, shown in Listing 15-3.
# Function prologue
6d0: 55 push rbp
6d1: 48 89 e5 mov rbp,rsp
6d4: 48 83 ec 10 sub rsp,0x10
# Function epilogue
70d: 90 nop
70e: c9 leave
70f: c3 ret
Remember that the source code is shown in Listing 15-2. We are interested in seeing how the global
variables are accessed.
First, note that the first argument of printf (the address of the format string, residing in
.rodata) is not accessed in the typical way.
In such cases we used to see an absolute address value (which would have been filled in by the linker
during relocation, as explained in section 5.3.2). Here, however, an address relative to rip is used. As we
know, rdi, being the first argument, should hold the address of the format string. So, this address is stored
in memory at the address [rip + 0x32]. This place is a part of the GOT.
Now, let's see how global is accessed from the dynamic library code. The mechanism is essentially
the same, though one more memory read is needed: first we read the GOT entry to get the address of
global, then we read its value by accessing memory again.
That is quite simple for global variables. For functions, however, the implementation is a bit more complicated.
; PLT
PLT_0: ; the common part
call resolver
...
PLT_n: func@plt:
jmp [GOT_n]
PLT_n_first:
; here the arguments for resolver are prepared
jmp PLT_0
GOT:
...
GOT_n:
dq PLT_n_first
■ Question 293 Read in man ld.so about environment variables (such as LD_BIND_NOT), which can alter the
loader behavior.
00000000004006a6 <main>:
push rbp
mov rbp,rsp
mov edi,0x2a
call 400580 <libfun@plt>
mov eax,0x0
pop rbp
ret
Next, let's see what the PLT looks like. The PLT entry for libfun is called libfun@plt. Find it in Listing 15-5.
0000000000400550 <_init>:
sub rsp,0x8
mov rax,QWORD PTR [rip+0x200a9d] # 600ff8 <_DYNAMIC+0x1e0>
test rax,rax
je 400565 <_init+0x15>
call 4005a0 <__libc_start_main@plt+0x10>
add rsp,0x8
ret
Disassembly of section .plt:
0000000000400570 <libfun@plt-0x10>:
push QWORD PTR [rip+0x200a92] # 601008 <_GLOBAL_OFFSET_TABLE_+0x8>
jmp QWORD PTR [rip+0x200a94] # 601010 <_GLOBAL_OFFSET_TABLE_+0x10>
nop DWORD PTR [rax+0x0]
0000000000400580 <libfun@plt>:
jmp QWORD PTR [rip+0x200a92] # 601018 <_GLOBAL_OFFSET_TABLE_+0x18>
push 0x0
jmp 400570 <_init+0x20>
0000000000400590 <__libc_start_main@plt>:
jmp QWORD PTR [rip+0x200a8a] # 601020 <_GLOBAL_OFFSET_TABLE_+0x20>
push 0x1
jmp 400570 <_init+0x20>
...
Disassembly of section .got.plt:
0000000000601000 <_GLOBAL_OFFSET_TABLE_>:
...
The first instruction is a jump into the GOT, to the element at offset 0x18 (each entry is 8 bytes long, so
this is the entry with index 3). Then a push instruction is issued, whose operand is the function's number in
the PLT: for libfun it is 0x0, for __libc_start_main it is 0x1.
The next instruction in libfun@plt is a jump to _init+0x20, which looks strange, but if we check the actual
_init address, we will see that:
• _init is at 0x400550.
• _init+0x20 is at 0x400570.
• libfun@plt-0x10 is at 0x400570 as well, so they are the same address.
• This address is also the start of the .plt section and, according to the earlier
explanation, should correspond to the "common" code shared by all PLT entries. It
pushes one more GOT value onto the stack and takes the address of the dynamic loader
from the GOT to jump to it.
The comments emitted by objdump show that the last two values refer to the addresses 0x601008 and
0x601010. As we see, they should be stored somewhere in the .got.plt section, which is the part of the GOT
related to PLT entries. Listing 15-6 shows the contents of this section.
By looking carefully we see that, starting at the address 0x601018, the following bytes are located:
86 05 40 00 00 00 00 00
Remembering that Intel 64 uses little-endian byte order, we conclude that the quad word actually stored
here is 0x400586, which is exactly the address of libfun@plt + 6, in other words, the address of the push
0x0 instruction. That illustrates the fact that the initial values for functions in the GOT point at the second
instruction of their respective PLT entries.
15.5 Preloading
Setting the LD_PRELOAD variable allows you to preload shared objects before any other library (including
the C standard library). The functions from such a library take priority lookup-wise, so they can override
functions defined in the normally loaded shared objects.
The dynamic loader ignores the LD_PRELOAD value if the effective user ID and the real user ID do not
match. This is done for security reasons.
We are going to write and compile a simple program, shown in Listing 15-7.
#include &lt;stdio.h&gt;

int main(void) {
    puts("Hello, world!");
    return 0;
}
It does nothing spectacular, but it is important that it uses the puts function, defined in the C standard
library. We are going to override it with our own version of puts, which ignores its input and simply outputs a
fixed string.
When this program is launched, the standard puts function is executed.
Now let us make a simple dynamic library with the contents shown in Listing 15-8. It shadows the puts
function with an alternative, which ignores its argument and always outputs a fixed string.
Note that the executable was not linked against the dynamic library. Listing 15-9 shows the effect of
setting the LD_PRELOAD variable.
As we see, if LD_PRELOAD contains a path to a shared object that defines some functions, those functions
will override other functions of the same names present in the process address space.
■ Question 294 Refer to the assignment. Use this technique to test your malloc implementation against
some standard utilities from coreutils.
4. This is not always the case; for example, OS X recommends that all executables be made position independent.
2. Defined in the dynamic library and used only there, locally (unavailable to external
objects).
With PIC, this is done by using rip-relative addressing (for data) or
relative offsets (for function calls). The more general case will be discussed later in
section 15.10.
NASM uses the rel keyword to achieve rip-relative addressing. This does not
involve the GOT or PLT.
3. Defined in the executable and used globally.
This requires GOT usage (and also PLT for functions) if the user is external. For
internal usage the rules are the same: we do not need the GOT or PLT for addressing
inside the same object file.
4. Defined in the dynamic library and used globally.
15.7 Examples
It is entirely possible to write a dynamic library in assembly language that is position independent and
uses the GOT and PLT tables.
■ Linking with gcc The recommended way of linking libraries is by using GCC. However, in this chapter
we will sometimes use the more primitive ld to show in greater detail what is really being done. When the C
runtime is involved, never use ld directly.
We will also limit ourselves to Intel 64, as always. PIC code was a bit harder to write before rip-relative
addressing was introduced.
extern _GLOBAL_OFFSET_TABLE_
extern sofun
global _start
section .text
_start:
call sofun wrt ..plt
mov rax, 60          ; exit system call, restored from the fuller listing below
mov rdi, 0
syscall
The first thing we notice is that extern _GLOBAL_OFFSET_TABLE_ is usually imported in every file
that is dynamically linked.5
The main file imports the symbol called sofun. Then the call contains not only the function name but
also the wrt ..plt qualifier.
Referring to a symbol using wrt ..plt forces the linker to create a PLT entry. The corresponding
expression evaluates to the offset of the PLT entry relative to the current position in code. Before static
linkage this offset is unknown, but it will be filled in by the static linker. The type of this relocation is
a rip-relative relocation (like the one used in call or jmp-like instructions); the ELF structure provides
no means to address PLT entries by their absolute addresses.
global sofun:function

section .rodata
msg: db "SO function called", 10
.end:

section .text
sofun:
mov rax, 1
mov rdi, 1
lea rsi, [rel msg]
mov rdx, msg.end - msg
syscall
ret
Notice that the global symbol sofun is marked as :function (there should be no space before the colon).
It is very important to mark exported functions like this in case they should be accessed by other objects
dynamically.
The .end label allows us to calculate the string length statically to feed it to the write system call. The
other important change is the usage of the rel keyword.
5. This name is specific to ELF and should be changed for other systems. See section 9.2.1 of [27].
The code is position independent, so the absolute address of msg can be arbitrary; its offset relative to
this point in code (the lea rsi, [rel msg] instruction) is, however, fixed. So we can use lea to calculate its
address as an offset from rip. This line will be compiled to lea rsi, [rip + offset], where offset is a
constant filled in by the static linker.
Note that the latter form ([rip + offset]) is syntactically incorrect in NASM; it only describes the
resulting machine instruction.
Listing 15-12 shows the Makefile used to build this example. Before launching, make sure that the
environment variable LD_LIBRARY_PATH includes the current directory; for that, you can simply type
export LD_LIBRARY_PATH=.
main: main.o lib.so
	ld --dynamic-linker=/lib64/ld-linux-x86-64.so.2 main.o lib.so -o main
lib.so: lib.o
	ld -shared lib.o -o lib.so
lib.o: lib.asm
	nasm -felf64 lib.asm -o lib.o
main.o: main.asm
	nasm -felf64 main.asm -o main.o
■ Question 296 Perform an experiment. Omit the wrt ..plt construction for the call and recompile
everything. Then use objdump -D -Mintel-mnemonic on the resulting main executable to check whether the
PLT is still in the game or not. Try to launch it.
ld --dynamic-linker=/lib64/ld-linux-x86-64.so.2
If you do not specify it, ld will choose the default path, which might point to a nonexistent file on your system.
If the dynamic linker does not exist, an attempt to launch the executable results in a cryptic message that
does not make any sense. Suppose that you have built an executable main that uses a library so_lib, and
LD_LIBRARY_PATH is set correctly.
./main
bash: no such file or directory: ./main
> ldd ./main
linux-vdso.so.1 => (0x00007ffcf7f9f000)
so_lib.so => ./so_lib.so (0x00007f0e1cc0a000)
The problem is that the linkage was done without an appropriate dynamic linker provided, and the
ELF metadata does not hold a correct path to it. Relinking the object files with an appropriate dynamic
linker path solves this problem. For example, in the Debian Linux distribution installed on the virtual
machine shipped with this book, the dynamic linker is /lib64/ld-linux-x86-64.so.2.
extern sofun
global _start
global msg:data (msg.end - msg)
section .rodata
msg: db "SO function called -- message is stored in 'main'", 10
.end:
section .text
_start:
call sofun wrt ..plt
mov rdi, 0
mov rax, 60
syscall
extern msg
global sofun:function
section .text
sofun:
mov rax, 1
mov rdi, 1
mov rsi, [rel msg wrt ..got]
mov rdx, 50
syscall
ret
It is very important to mark a dynamically shared data declaration with its size. The size is given as
an expression, which may include labels and operations on them, such as subtraction. Without the size, the
symbol will be treated as global by the static linker (visible to other modules during the static linking phase)
but will not be exported by the dynamic library.
When the variable is declared as global with its size and type (:data), it will live in the .data section of
the executable file rather than in the library! Because of this, you will always have to access it through the GOT,
even in the same file.
The GOT, as we know, stores the addresses of the variables global to the process. So, if we want to
know the address of msg, we have to read an entry from the GOT. However, as the dynamic library is position
independent, we have to address its GOT relative to rip as well. If we want to read the variable's value, we
need an additional memory read after fetching its address from the GOT.
If a variable is declared in the dynamic library and accessed in the main executable file, it is
done with exactly the same construction: its address can be read from [rel varname wrt ..got]. If you
need to store an address of the GOT variable, use the following qualifier:
extern fun1
global _start
global mainfun:function
global commonmsg:data (commonmsg.end - commonmsg)

section .rodata
commonmsg: db "fun2", 10, 0
.end:
mainfunmsg: db "mainfun", 10   ; restored: 8 bytes, to match mov rdx, 8 below

section .text
_start:
call fun1 wrt ..plt
mov rax, 60
mov rdi, 0
syscall

mainfun:
mov rax, 1
mov rdi, 1
mov rsi, mainfunmsg
mov rdx, 8
syscall
ret
extern commonmsg
extern mainfun
global fun1:function
section .rodata
msg: db "fun1", 10
section .text
fun1:
mov rax, 1
mov rdi, 1
lea rsi, [rel msg]
mov rdx, 5
syscall
call fun2
call mainfun wrt ..plt
ret
fun2:
mov rax, 1
mov rdi, 1
mov rsi, [rel commonmsg wrt ..got]
mov rdx, 5
syscall
ret
In the main file, the external function sofun from the dynamic library is called. Its result is printed to
stdout by printf. Then the string taken from the dynamic library is output by puts. Note that the imported
global string is a global character buffer, not a pointer!
extern puts
global sofun:function
global sostr:data (sostr.end - sostr)

section .rodata
sostr: db "sostring", 10, 0
.end:
localstr: db "localstr", 0   ; definition restored; exact text not shown in the book

section .text
sofun:
lea rdi, [rel localstr]
call puts wrt ..plt
mov rax, 42
ret
In the library, sofun is defined, as well as the sostr global string. sofun calls puts, the standard
C library function, with the localstr address as an argument. As the library is written in a position-
independent way, the address has to be calculated as an offset from rip; hence the lea command is used.
This function always returns 42.
Listing 15-19 shows the relevant Makefile.
main: main.o lib.so
	gcc main.o lib.so -o main
lib.so: lib.o
	gcc -shared lib.o -o lib.so
lib.o: lib.asm
	nasm -felf64 lib.asm -o lib.o
main.o: main.c
	gcc -ansi -c main.c -o main.o
clean:
	rm -rf *.o *.so main
/usr/lib/gcc/x86_64-linux-gnu/4.9/collect2
-plugin
/usr/lib/gcc/x86_64-linux-gnu/4.9/liblto_plugin.so
-plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/4.9/lto-wrapper
-plugin-opt=-fresolution=/tmp/ccqEOGnU.res
-plugin-opt=-pass-through=-lgcc
-plugin-opt=-pass-through=-lgcc_s
-plugin-opt=-pass-through=-lc
-plugin-opt=-pass-through=-lgcc
-plugin-opt=-pass-through=-lgcc_s
--sysroot=/
--build-id
--eh-frame-hdr
-m elf_x86_64
--hash-style=gnu
-dynamic-linker /lib64/ld-linux-x86-64.so.2
-o main
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o
/usr/lib/gcc/x86_64-linux-gnu/4.9/crtbegin.o
-L/usr/lib/gcc/x86_64-linux-gnu/4.9
-L/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu
-L/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../../lib
-L/lib/x86_64-linux-gnu
-L/lib/../lib
-L/usr/lib/x86_64-linux-gnu
-L/usr/lib/../lib
-L/usr/lib/gcc/x86_64-linux-gnu/4.9/../../..
main.o
lib.so
-lgcc
--as-needed -lgcc_s
--no-as-needed -lc
-lgcc
--as-needed -lgcc_s
--no-as-needed /usr/lib/gcc/x86_64-linux-gnu/4.9/crtend.o
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o
The lto abbreviation stands for "link-time optimization", which is of no interest to us. The
interesting part is the set of additional object files being linked. These are:
• crti.o
• crtbegin.o
• crtend.o
• crtn.o
• crt1.o
ELF files support multiple sections, as we know. A separate section .init is used to store code that will be
executed before main; another section .fini stores code that is called when the program terminates.
These sections' contents are split across multiple files. crti.o and crtn.o contain the prologue and epilogue
of the _init function (and likewise for the _fini function). These two functions are called before and after the
program execution, respectively. crtbegin.o and crtend.o contain other utility code included in the .init and
.fini sections; they are not always present. We want to stress that their order is important. crt1.o contains the
_start function.
To prove these statements, we are going to disassemble the crti.o, crtn.o, and crt1.o files using the good old
objdump -D -Mintel-mnemonic.
0000000000000000 <_init>:
0: sub rsp, 0x8
4: mov rax, QWORD PTR [rip+0x0] # b <_init+0xb>
b: test rax, rax
e: je 15 <_init+0x15>
10: call 15 <_init+0x15>
0000000000000000 <_fini>:
0: sub rsp, 0x8
0000000000000000 <.init>:
0: add rsp,0x8
4: ret
0000000000000000 <.fini>:
0: add rsp,0x8
4: ret
As we see, these fragments form functions that end up in the executable. To see the complete linked and
relocated code, we take a part of the objdump -D -Mintel-mnemonic output for the resulting file, as shown in
Listing 15-23.
00000000004005d8 <_init>:
4005d8: sub rsp,0x8
4005dc: mov rax,QWORD PTR [rip+0x200a15] # 600ff8 <_DYNAMIC+0x1e0>
4005e3: test rax,rax
4005e6: je 4005ed <_init+0x15>
4005e8: call 400650 <__libc_start_main@plt+0x10>
4005ed: add rsp,0x8
4005f1: ret
0000000000400660 <_start>:
400660: xor ebp,ebp
400662: mov r9,rdx
400665: pop rsi
400666: mov rdx,rsp
400669: and rsp,0xfffffffffffffff0
40066d: push rax
40066e: push rsp
40066f: mov r8,0x400800
0000000000400804 <_fini>:
400804: sub rsp,0x8
400808: add rsp,0x8
40080c: ret
15.9 Optimizations
What impacts performance when working with a dynamic library?
First of all, never forget the -fPIC compiler option.6 Without it, even the .text section will be subject to
relocations, making dynamic libraries far less attractive to use. It is also crucial to disable optimizations that
might prevent dynamic libraries from working correctly.
As we have seen, when a function is declared static in the dynamic library, and thus is not exported,
it can be called directly, without the PLT overhead. Always use static to limit visibility to a single file.
It is also possible to control symbol visibility in a compiler-dependent way. For example, GCC
recognizes four types of visibility (default, hidden, internal, and protected), of which only the first two are of
interest to us. The visibility of all symbols at once can be controlled using the -fvisibility compiler
switch, as follows:
> gcc -fvisibility=hidden ... # will hide all symbols from shared object
The "default" visibility level implies that all non-static symbols are visible from outside the shared
object. By using the __attribute__ directive, we can finely control visibility on a per-symbol basis. Listing 15-24
shows an example.
A good practice is to hide all symbols of the shared object and explicitly mark only the interface
symbols with default visibility. This way you fully describe the interface. It is especially good because no
other symbols will be exposed, and you will be free to change the library internals without breaking binary
compatibility of any kind.
Data relocations can also slow things down a bit. Every time a variable in .data stores the address of
another variable, it must be initialized by the dynamic linker once the absolute address of the latter becomes
known. Avoid such situations when possible.
Since access to local symbols bypasses the PLT, you might want to reference only "hidden" functions
inside your code and make publicly available wrappers for the functions you want to export. Only the calls to
the wrappers will use the PLT. Listing 15-25 shows an example.
6. The -fpic option implies a limit on GOT size for some architectures, which is often faster.
void otherfunction(void) {
    printf(" %d \n", _function( 41 ) );
}
To eliminate possible overhead of the wrapper functions, a technique exists of writing symbol aliases
(which is also compiler specific). GCC handles it by using alias attribute. Listing 15-26 shows an example.
int tester(void) {
printf( "%d\n", global );
printf( "%d\n", global_alias );
fun();
fun_alias();
return 0;
}
When we compile it using gcc -shared -O3 -fPIC and disassemble it, we see the code shown in
Listing 15-27 (disassembly for tester function).
The variables global and global_alias are handled differently; the latter requires one less memory read.
The call of the aliased function is likewise handled more efficiently, bypassing the PLT and thus sparing an
extra jump.
Finally, remember that zero-initialized globals are always faster to initialize. However, we strongly
advocate against the use of global variables.
More information about shared object optimizations can be found in [13].
■ Note The common way of linking against libraries is by using the -l key, for example, gcc -lhello. The only
two differences from specifying the full file path are:
• -lhello will search for a library named libhello.so or libhello.a (so, prefixed with lib and
with a standard extension).
• The library is searched for in the standard list of directories. It is also searched in custom
directories, which can be supplied using the -L option. For example, to include the
directory /usr/libcustom and the current directory, you can type:
> gcc -lhello -L. -L/usr/libcustom main.c
A code model is a convention to which the programmer and the compiler both adhere; it describes the
constraints on the program that will use the object file currently being compiled. Code generation
depends on it. In short, when the program is relatively small, there is no harm in using 32-bit offsets. However,
when it can be large, the slower 64-bit offsets, which require multiple instructions to handle, should be used.
The 32-bit offsets correspond to the small code model; the 64-bit offsets correspond to the large code
model. There is also a compromise between them called the medium code model. All these models are treated
differently in the context of position-dependent and position-independent code, so we are going to review all six
possible combinations.
There are other code models, such as the kernel code model, but we will leave them out of this
volume. If you make your own operating system, you can invent one for your own pleasure.
The relevant GCC option is -mcmodel, for example, -mcmodel=large. The default model is the small model.7
The GCC manual says the following about the -mcmodel option8:
-mcmodel=small
Generate code for the small code model: the program and its symbols must be linked in
the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or
dynamically linked. This is the default code model.
-mcmodel=kernel
Generate code for the kernel code model. The kernel runs in the negative 2 GB of the
address space. This model has to be used for Linux kernel code.
-mcmodel=medium
Generate code for the medium model: the program is linked in the lower 2 GB of the
address space. Small symbols are also placed there. Symbols with sizes larger than -mlarge-
data-threshold are put into large data or BSS sections and can be located above 2GB.
Programs can be statically or dynamically linked.
-mcmodel=large
Generate code for the large model. This model makes no assumptions about addresses and
sizes of sections.
To illustrate the differences in compiled code when using different code models, we are going to use a
simple example shown in Listing 15-28.
int main(void) {
    glob_small[0] = 42;
    glob_big[0] = 42;
    loc_small[0] = 42;
    loc_big[0] = 42;
    global_f();
    local_f();
    return 0;
}

7 Not all compilers and GCC versions support the large model.
8 Note that there are different descriptions for different architectures.
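Listing 15-28 references arrays and functions whose declarations fall outside this excerpt. A compilable sketch of what they might look like (only the names come from the text; the sizes and bodies are our assumptions, with the "big" arrays sized to exceed the default -mlarge-data-threshold of 65536 bytes):

```c
#include <stdio.h>

/* Hypothetical declarations: names match the book's example, sizes are
   a guess. "Big" arrays must exceed -mlarge-data-threshold (65536 bytes
   by default) to be placed into the large data/BSS sections under the
   medium code model. */
char glob_small[100];           /* globally visible, small */
char glob_big[100000];          /* globally visible, big   */
static char loc_small[100];     /* file-local, small       */
static char loc_big[100000];    /* file-local, big         */

void global_f(void) { puts("global_f"); }
static void local_f(void) { puts("local_f"); }
```

The distinction between `static` (file-local) and globally visible symbols is what produces the different addressing patterns in the disassembly listings that follow.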
The -g flag adds debug information, such as the .line section, which describes the correspondence between
assembly instructions and source code lines.
In this example, there are bigger and smaller arrays. The distinction matters only for the medium code model, hence we
will omit the big array accesses from the other disassembly listings.
; loc_small[0] = 42;
4004fe: c6 05 3b a2 b8 00 2a mov BYTE PTR [rip+0xb8a23b],0x2a
; global_f();
40050c: e8 c5 ff ff ff call 4004d6 <global_f>
; local_f();
400511: e8 cb ff ff ff call 4004e1 <local_f>
The second column shows the hexadecimal encoding of each instruction. The array
accesses are performed explicitly relative to rip, and the calls accept offsets (which are also implicitly
relative to rip). We can see that the data-accessing instruction is 7 bytes long, of which 1 byte is the value
(0x2a) and 4 bytes encode the offset relative to rip. This illustrates the core idea of the small code model:
rip-relative addressing.
CHAPTER 15 ■ SHARED OBJECTS AND CODE MODELS
; glob_small[0] = 42;
4004f0: 48 b8 40 10 60 00 00 mov rax,0x601040
4004f7: 00 00 00
4004fa: c6 00 2a mov BYTE PTR [rax],0x2a
; loc_small[0] = 42;
40050a: 48 b8 40 a7 f8 00 00 mov rax,0xf8a740
400511: 00 00 00
400514: c6 00 2a mov BYTE PTR [rax],0x2a
; global_f();
400524: 48 b8 d6 04 40 00 00 mov rax,0x4004d6
40052b: 00 00 00
40052e: ff d0 call rax
; local_f();
400530: 48 b8 e1 04 40 00 00 mov rax,0x4004e1
400537: 00 00 00
40053a: ff d0 call rax
Both data accesses and calls are performed uniformly. We always start by moving an immediate value
into one of the general-purpose registers and then reference memory using the address stored in this
register.9
At the cost of more spacious (and probably somewhat slower) code, we take the safest road
possible, allowing us to reference anything in any part of the 64-bit virtual address space.
glob_small[0] = 42;
400530: c6 05 09 0b 20 00 2a mov BYTE PTR [rip+0x200b09],0x2a
glob_big[0] = 42;
400537: 48 b8 40 11 a0 00 00 movabs rax,0xa01140
40053e: 00 00 00
400541: c6 00 2a mov BYTE PTR [rax],0x2a
loc_small[0] = 42;
400544: c6 05 75 0b 20 00 2a mov BYTE PTR [rip+0x200b75],0x2a
9 If you encounter the movabs instruction, consider it equivalent to the mov instruction.
loc_big[0] = 42;
40054b: 48 b8 c0 a7 38 01 00 movabs rax,0x138a7c0
400552: 00 00 00
400555: c6 00 2a mov BYTE PTR [rax],0x2a
global_f();
400558: e8 b9 ff ff ff call 400516 <global_f>
local_f();
40055d: e8 bf ff ff ff call 400521 <local_f>
As we see, the generated code uses large-model addressing to access the big arrays and small-model
addressing for the rest. The medium model is quite clever and might save you if you only need to work with a big chunk of statically
allocated data.
glob_small[0] = 42;
4004f0: 48 8d 05 49 0b 20 00 lea rax,[rip+0x200b49]
# 601040 <glob_small>
glob_big[0] = 42;
4004fa: 48 8d 05 bf 0b 20 00 lea rax,[rip+0x200bbf]
# 6010c0 <glob_big>
loc_small[0] = 42;
400504: c6 05 35 a2 b8 00 2a mov BYTE PTR [rip+0xb8a235],0x2a
# f8a740 <loc_small>
loc_big[0] = 42;
40050b: c6 05 ae a2 b8 00 2a mov BYTE PTR [rip+0xb8a2ae],0x2a
# f8a7c0 <loc_big>
global_f();
400512: e8 bf ff ff ff call 4004d6 <global_f>
local_f();
400517: e8 c5 ff ff ff call 4004e1 <local_f>
The static arrays are accessed relative to rip, as expected. The globally visible arrays are
accessed through GOT, which implies an additional read from the table itself to obtain the variable's address.
# Standard prologue
400594: 55 push rbp
400595: 48 89 e5 mov rbp,rsp
# What is that?
400598: 41 57 push r15
40059a: 53 push rbx
40059b: 48 8d 1d f9 ff ff ff lea rbx,[rip+0xfffffffffffffff9]
# 40059b <main+0x7>
4005a2: 49 bb 65 0a 20 00 00 movabs r11,0x200a65
4005a9: 00 00 00
4005ac: 4c 01 db add rbx,r11
return 0;
40060f: b8 00 00 00 00 mov eax,0x0
}
This example needs to be studied carefully. First we want to break down the unusual code in the
function prologue.
We use rbx and r15 because they are callee-saved. They are used here to build up the GOT address out
of the following two components:
• The address of the current instruction, calculated by lea
rbx,[rip+0xfffffffffffffff9]. The operand is equal to -7, while the instruction
itself is 7 bytes long. When the instruction is being executed, rip points to the next
address after it, so rbx receives the address of the lea itself.
• Then the number 0x200a65 is added to rbx. This is done through another register,
because the add instruction does not support an immediate operand 64 bits wide
(check the instruction description in [15]!).
• This number is a displacement of GOT relative to the address of lea
rbx,[rip+0xfffffffffffffff9], which, as we know, is always known at link time in
position-independent code.10
The ABI requires that r15 hold the GOT address at all times; rbx is also used by GCC for its own
convenience.
The GOT absolute address is unknown at link time since the code is written to be position independent.
Now to the data accesses: the global symbol is accessed through GOT in the same way as in non-PIC code;
however, as the GOT address is stored in rbx, we have to compute the entry address using more instructions.
The entry is located at an offset of -24 relative to the rbx (r15) value. This displacement
can be arbitrarily wide, so we store it in a register to handle the cases where it cannot be contained
in 32 bits. Then we load the GOT entry into rax and use this address for our purposes (in this case, we store a
value at the start of the array).
10 Obviously, here r15 and rbx hold not the beginning of GOT but its end, but it does not matter.
The variables not visible to other objects are accessed using GOT as well. However, we are not reading
their addresses from GOT; rather, we use the rbx value as the base (as it points somewhere in the
data segment). Every such variable has a fixed offset from this base, so we can just take this offset and use
the base-indexed addressing mode.
This is obviously faster, so whenever you can, you should prefer limiting symbol visibility as explained
in section 15.9.
The local functions are called in the same manner: their addresses are calculated relative to GOT and
stored in a register. We cannot simply use the call instruction, because its immediate operand is limited to
32 bits (in its description in [15] there are only operand types rel16 and rel32, but no rel64).
Calling global functions is done in a more traditional way: the function's PLT entry is used, whose address
is also calculated as a fixed offset from a known GOT position.
int main(void) {
40057a: 55 push rbp
40057b: 48 89 e5 mov rbp,rsp
glob_small[0] = 42;
400585: 48 8d 05 b4 0a 20 00 lea rax,[rip+0x200ab4]
40058c: c6 00 2a mov BYTE PTR [rax],0x2a
glob_big[0] = 42;
40058f: 48 8b 05 62 0a 20 00 mov rax,QWORD PTR [rip+0x200a62]
400596: c6 00 2a mov BYTE PTR [rax],0x2a
loc_small[0] = 42;
400599: c6 05 20 0b 20 00 2a mov BYTE PTR [rip+0x200b20],0x2a
loc_big[0] = 42;
4005a0: 48 b8 c0 97 d8 00 00 movabs rax,0xd897c0
4005a7: 00 00 00
4005aa: c6 04 02 2a mov BYTE PTR [rdx+rax*1],0x2a
global_f();
4005ae: e8 a3 ff ff ff call 400556 <global_f>
local_f();
4005b3: e8 b0 ff ff ff call 400568 <local_f>
return 0;
4005b8: b8 00 00 00 00 mov eax,0x0
}
4005bd: 5d pop rbp
4005be: c3 ret
The GOT is also within reach of rip-relative addressing, so its address can be loaded with a single instruction.
It is thus not always necessary to dedicate a register to it, since this address will not be used everywhere.
Code references are assumed to be within reach of 32-bit rip-relative offsets, so calling any function
is trivial.
global_f();
4005ae: e8 a3 ff ff ff call 400556 <global_f>
local_f();
4005b3: e8 b0 ff ff ff call 400568 <local_f>
As for data accesses, global variables are accessed uniformly no matter their size. The GOT is involved
in any case, and it contains full 64-bit variable addresses, so we get the ability to address anything
for free.
glob_small[0] = 42;
400585: 48 8d 05 b4 0a 20 00 lea rax,[rip+0x200ab4]
40058c: c6 00 2a mov BYTE PTR [rax],0x2a
glob_big[0] = 42;
40058f: 48 8b 05 62 0a 20 00 mov rax,QWORD PTR [rip+0x200a62]
400596: c6 00 2a mov BYTE PTR [rax],0x2a
The local variables, however, differ. Small arrays can be accessed relative to rip.
loc_small[0] = 42;
400599: c6 05 20 0b 20 00 2a mov BYTE PTR [rip+0x200b20],0x2a
Big local arrays are found relative to the GOT start address, as in the large model.
loc_big[0] = 42;
4005a0: 48 b8 c0 97 d8 00 00 movabs rax,0xd897c0
4005a7: 00 00 00
4005aa: c6 04 02 2a mov BYTE PTR [rdx+rax*1],0x2a
15.11 Summary
In this chapter we have acquired the knowledge we need to understand the machinery behind dynamic
library loading and usage. We have written a library in assembly language and in C and successfully linked it
to an executable.
For further reading we refer you above all to a classic article [13] and to the ABI description [24].
In the next chapter we are going to speak about compiler optimizations and their effects on
performance, as well as about specialized instruction set extensions (SSE/AVX) aimed at speeding up certain
types of computations.
■ Question 297 What is the difference between static and dynamic linkage?
■ Question 299 Can we resolve all dependencies at link time? What kind of system would we need to be
working with for this to be possible?
■ Question 303 Can we share a .text section between processes when it is being relocated?
■ Question 304 Can we share a .data section between processes when it is being relocated?
■ Question 305 Can we share a .data section when it is being relocated?
■ Question 306 Why are we compiling dynamic libraries with an -fPIC flag?
■ Question 307 Write a simple dynamic library in C from scratch and demonstrate the calling function from it.
■ Question 313 How come position-independent code can address GOT directly but cannot address
global variables directly?
■ Question 316 Why don’t we use GOT to call functions from different objects (or do we)?
■ Question 317 What does the initial GOT entry for a function point at?
■ Question 318 How do we preload a library and what can it be used for?
■ Question 319 In assembly, how is the symbol addressed if it is defined in the executable and accessed
from there?
■ Question 320 In assembly, how is the symbol addressed if it is defined in the library and accessed from
there?
■ Question 321 In assembly, how is the symbol addressed if it is defined in the executable and accessed
from everywhere?
■ Question 322 In assembly, how is the symbol addressed if it is defined in the library and accessed from
everywhere?
■ Question 323 How do we control the visibility of a symbol in a dynamic library? How do we make it private
for the library but accessible from anywhere in it?
■ Question 324 Why do people sometimes write wrapper functions for those defined in a library?
■ Question 325 How do we link against a library that is stored in libdir?
■ Question 326 What is a code model and why do we care about code models?
■ Question 329 What is the compromise between large and small code models?
■ Question 331 How do large code models differ for PIC and non-PIC code?
■ Question 332 How do medium code models differ for PIC and non-PIC code?
CHAPTER 16
Performance
In this chapter we will study how to write faster code. In order to do that, we will look into SSE (Streaming
SIMD Extensions) instructions, study compiler optimizations, and examine how the hardware cache functions.
Note that this chapter is a mere introduction to the topic and will not make you an expert in
optimization.
There is no silver bullet technique to magically make everything fast. Hardware has become so complex
that even an educated guess about the code that is slowing down program execution might fail. Testing and
profiling should always be performed, and the performance should be measured in a reproducible way. It
means that everything about the environment should be described in such detail that anyone would be able
to replicate the conditions of the experiment and receive similar results.
16.1 Optimizations
In this section we want to discuss the most important optimizations that happen during the translation
process. They are crucial to understanding how to write quality code. Why? A common type of decision
making in programming is balancing code readability against performance. Knowing the optimizations
is necessary to make good decisions: otherwise, when choosing between two versions of code, we
might choose the less readable one because it "looks" like it performs fewer actions, while in reality both
versions are optimized into exactly the same sequence of assembly instructions. In that case, we have made
the code less readable for no benefit at all.
■ Note In the listings presented in this section we will often use an __attribute__ ((noinline)) GCC
directive. Applying it to a function definition suppresses inlining for the said function. Example functions
are often small, which encourages compilers to inline them; we suppress inlining to better show the various
optimization effects.
There are cases in which a program written in C can be outperformed by another program performing
similar actions but written in, say, Java. It has no connection with the language itself.
For example, a typical malloc implementation has a particular property: it is hard to predict its
execution time. In general, it depends on the current heap state: how many blocks exist, how
fragmented the heap is, and so on. In any case, it is most likely greater than the time needed to allocate memory on the stack. In a
typical Java Virtual Machine implementation, however, allocating memory is fast. It happens because Java
has a simpler heap structure. With some simplifications, it is just a memory region and a pointer inside it,
which delimits an occupied area from the free one. Allocating memory means moving this pointer further
into the free part, which is fast.
However, it has its cost: to get rid of the memory chunks we do not need anymore, garbage collection is
performed, which might stop the program for an unknown period of time.
Imagine a situation in which garbage collection never happens: for example, a program allocates
memory, performs computations, and exits, destroying its whole address space without invoking the garbage
collector. In this case a Java program may perform faster simply because it avoids the allocation
overhead imposed by malloc. However, by using a custom memory allocator, fitted to our specific needs for a
particular task, we can do the same trick in C, changing the outcome drastically.
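The "same trick in C" can be as simple as a bump (arena) allocator, working exactly like the JVM heap described above: a memory region and a pointer delimiting the occupied area from the free one. A minimal sketch (arena_alloc and arena_reset are our own names, not from the book):

```c
#include <stddef.h>

/* A bump allocator: allocation just moves the pointer further into
   the free part of the region, which is fast. */
static unsigned char arena[1 << 20];  /* 1 MiB region */
static size_t arena_top = 0;

void *arena_alloc(size_t size) {
    size = (size + 15) & ~(size_t)15;          /* keep 16-byte alignment */
    if (arena_top + size > sizeof(arena))
        return NULL;                           /* region exhausted */
    void *p = arena + arena_top;
    arena_top += size;                         /* "move the pointer further" */
    return p;
}

/* No per-block free: everything is released at once. */
void arena_reset(void) { arena_top = 0; }
```

Allocation is a pointer bump with no heap bookkeeping; the trade-off, as with the JVM, is that individual blocks cannot be freed, only the whole region at once.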
Additionally, as Java is usually interpreted and compiled at runtime, the virtual machine has access to
runtime optimizations based on how exactly the program is executed. For example, methods that
are often executed one after another can be placed near each other in memory, so that they end up in the
cache together. To do that, information about the program's execution trace has to be collected,
which is only possible at runtime.
What really distinguishes C from other languages is a very transparent cost model. Whatever you
are writing, it is easy to imagine which assembly instructions will be emitted. Contrary to that, languages
designed primarily to work inside a runtime (Java, C#), or providing multiple additional abstractions, such as
C++ with its virtual inheritance mechanism, are harder to predict. The only two real abstractions C provides
are structures/unions and functions.
Translated naively into machine instructions, a C program runs very slowly; it is no match for the
code generated by a good optimizing compiler. A programmer usually does not know more about
low-level architecture details than the compiler does (knowledge much needed to perform low-level
optimizations), so he will not be able to compete with it. Still, sometimes, for a particular platform and
compiler, one can change a program, usually reducing its readability and maintainability, in a way that
speeds up the code. Again, performance tests are mandatory.
The most important part of optimization is often choosing the right algorithm; low-level optimizations
at the assembly level are rarely as beneficial. For example, accessing elements of a linked list by index
is slow, because we have to traverse the list from the beginning, jumping from node to node. Arrays are
preferable when the program logic demands accessing elements by index. However, insertion into a linked
list is easy compared to an array, because to insert an element at the i-th position of an array we have to move
all following elements first (or maybe even reallocate the memory and copy everything).
A simple, clean code is often also the most efficient.
Then, if the performance is unsatisfactory, we have to locate the code that gets executed the most using a
profiler and try to optimize it by hand. Check for duplicated computations and try to memoize and reuse
computation results. Study the assembly listings and check whether forcing inlining for some of the functions used
is doing any good.
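The advice to memoize and reuse computation results can be sketched in C as follows (fib_naive and fib_memo are our own illustrative functions, not from the book):

```c
/* Naive recursive Fibonacci recomputes the same values
   exponentially many times. */
long fib_naive(int n) {
    return n < 2 ? n : fib_naive(n - 1) + fib_naive(n - 2);
}

/* Memoized version: each value is computed once, stored, and reused. */
long fib_memo(int n) {
    static long cache[64];          /* 0 means "not computed yet" */
    if (n < 2) return n;
    if (cache[n] == 0)
        cache[n] = fib_memo(n - 1) + fib_memo(n - 2);
    return cache[n];
}
```

The memoized version does linear work on the first call and constant work afterward, at the cost of a small table; the same idea applies to any pure computation repeated with the same arguments.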
General concerns about hardware such as locality and cache usage should be taken into account at this
time. We will speak about them in section 16.2.
Then compiler optimizations should be considered. We will study the basic ones later in this
section. Turning specific optimizations on or off for a dedicated file or a code region can have a positive
impact on performance. By default, they are usually all turned on when compiling with the -O3 flag.
Only then come lower-level optimizations: manually throwing in SSE or AVX (Advanced Vector
Extensions) instructions, inlining assembly code, writing data bypassing hardware cache, prefetching data
into caches before using it, etc.
Compiler optimizations are controlled in bulk using the compiler flags -O0, -O1, -O2, -O3, and -Os
(optimize for space, to produce the smallest executable file possible). The number after -O increases as the
set of enabled optimizations grows.
Specific optimizations can be turned on and off. Each optimization type has two associated compiler
options for that, for example, -fforward-propagate and -fno-forward-propagate.
section .rodata
format: db "%lx ", 10, 0

section .text
unwind:
    push rbx
    ; while (rbx != 0) {
    ;     print rbx; rbx = [rbx];
    ; }
    mov rbx, rbp
.loop:
    test rbx, rbx
    jz .end
    mov rdi, format
    mov rsi, rbx
    xor rax, rax      ; printf is variadic: rax = number of vector arguments
    call printf
    mov rbx, [rbx]
    jmp .loop
.end:
    pop rbx
    ret
How do we use it? Try it as a last resort to improve performance on code involving a huge amount of
non-inlineable function calls.
It calls itself recursively, but this call is special: once the call completes, the function immediately
returns.
In Chapter 2, we studied Question 20, which proposes a solution in the spirit of tail recursion. When the
last thing a function does is call another function, immediately followed by the return, we can instead perform
a jump to that function's start. In other words, the following pattern of instructions can be subject to
optimization:
; somewhere else:
call f
...
...
f:
...
call g
ret ; 1
g:
...
ret ; 2
The ret instructions in this listing are marked as the first and the second one.
Executing call g places the return address on the stack; this is the address of the first ret
instruction. When g completes its execution, it executes the second ret instruction, which pops the return
address, leaving us at the first ret. Thus, two ret instructions are executed in a row before control
passes to the function that called f. But why not return to the caller of f immediately? To do that,
we replace call g with jmp g. Now we will never return to function f, nor will we push a useless return
address onto the stack. The second ret picks up the return address pushed by call f, which happened
somewhere earlier, and returns us directly there.
; somewhere else:
call f
...
...
f:
...
jmp g
g:
...
ret ; 2
When g and f are the same function, this is exactly the case of tail recursion. When not optimized,
factorial(5, 1) will launch itself five times, polluting the stack with five stack frames. The last call will end
by executing ret five times in a row to pop all the return addresses.
Modern compilers are usually aware of tail-recursive calls and know how to optimize tail recursion
into a loop. The assembly listing produced by GCC for the tail-recursive factorial (Listing 16-3) is shown in
Listing 16-5.
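Listing 16-3 itself is not reproduced in this excerpt; a tail-recursive factorial of the shape the text describes might look like this (a sketch; the accumulator argument is what makes the recursive call a tail call):

```c
/* The recursive call is the very last action, and its result is
   returned as is, so the compiler can replace the call with a jump,
   turning the recursion into a loop. */
long factorial(long n, long acc) {
    if (n < 2) return acc;
    return factorial(n - 1, acc * n);   /* tail call */
}
```

Called as factorial(5, 1), the accumulator carries the partial product through the recursion, so no work remains to be done after the recursive call returns.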
struct llist {
struct llist* next;
int value;
};
Compiling with -Os will produce the non-recursive code, shown in Listing 16-7.
How do we use it? Never be afraid to use tail recursion if it makes the code more readable, for it brings no
performance penalty.
__attribute__ ((noinline))
void test(int x) {
printf("%d %d",
x*x + 2*x + 1,
x*x + 2*x - 1 );
}
As proof, Listing 16-9 shows the compiled code, which does not compute x*x + 2*x twice.
How do we use it? Do not be afraid to write beautiful formulas sharing common subexpressions: they
will be computed efficiently. Favor code readability.
It gets better when the compiler computes complex expressions for you (including function calls).
Listing 16-2 shows an example.
int main(void) {
printf("%d\n", fact( 4 ) );
return 0;
}
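The fact function called here is defined earlier in the book (outside this excerpt); a plain recursive sketch consistent with the described result:

```c
/* fact has no side effects and is called with a known argument,
   so GCC can replace fact(4) with the constant 24 at compile time. */
int fact(int n) {
    return n < 2 ? 1 : n * fact(n - 1);
}
```

Because the argument 4 is a compile-time constant and the function is pure, the optimizer can evaluate the whole call during compilation, which is exactly the substitution described below.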
Obviously, the factorial function will always compute the same result, because this value does not
depend on user input. GCC is smart enough to precompute this value, erasing the call and substituting
24 directly for fact(4), as shown in Listing 16-13. The instruction mov edx, 0x18 places 24 (0x18 in
hexadecimal) directly into edx (and thus rdx, as writes to a 32-bit register zero the upper half).
How do we use it? Named constants are not harmful, nor are constant variables. A compiler can and will
precompute as much as it is able to, including calls to functions without side effects made with known arguments.
However, multiple function copies specialized for each distinct argument value can be bad for locality and will make the
executable grow. Take that into account if you face performance issues.
__attribute__ ((noinline))
struct p f(void) {
struct p copy;
copy.x = 1;
copy.y = 2;
copy.z = 3;
return copy;
}
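struct p itself is declared earlier in the book, outside this excerpt; a self-contained sketch with guessed field types (three integer fields matching the values 1, 2, and 3):

```c
/* Guessed declaration: the exact field types in the book may differ. */
struct p { long x, y, z; };

__attribute__ ((noinline))
struct p f(void) {
    struct p copy;
    copy.x = 1;
    copy.y = 2;
    copy.z = 3;
    /* Returned by value; the caller receives it through a hidden pointer. */
    return copy;
}
```

Calling f and reading the fields back is enough to observe the behavior discussed in the following paragraphs.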
An instance of struct p called copy is created in the stack frame of f. Its fields are populated with the
values 1, 2, and 3, and then it is copied to the outside world through the pointer accepted by f as a
hidden argument.
Listing 16-15 shows the resulting assembly code.
The compiler can produce a more efficient code as shown in Listing 16-16.
00000000004004d1 <main>:
4004d1: 48 83 ec 20 sub rsp,0x20
4004d5: 48 89 e7 mov rdi,rsp
4004d8: e8 d9 ff ff ff call 4004b6 <f>
4004dd: b8 00 00 00 00 mov eax,0x0
4004e2: 48 83 c4 20 add rsp,0x20
4004e6: c3 ret
4004e7: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
4004ee: 00 00
We do not allocate a place in the stack frame for copy at all! Instead, we operate directly on the
structure passed to us through the hidden argument.
How do we use it? If you want to write a function that fills a certain structure, there is usually no benefit
in passing it a pointer to a preallocated memory area yourself (or in allocating the structure via malloc, which
is also slower): returning the structure by value is just as efficient.
1 We omit the talk about the instruction cache for brevity.
A part of the CPU that performs operations and calculations is called an execution unit. Execution
units implement the different kinds of operations the CPU has to handle: instruction fetching, arithmetic,
address translation, instruction decoding, and so on. CPUs can use them more or less independently
of one another. Different instructions are executed in a different number of stages, and each stage can be
performed by a different execution unit. This allows for interesting uses of the circuitry, such as the following:
• Fetching one instruction immediately after another was fetched (but has not yet
completed its execution).
• Performing multiple arithmetic actions simultaneously, despite their being described
sequentially in the assembly code.
CPUs of the Pentium 4 family were already capable of executing four arithmetic instructions
simultaneously in the right circumstances.
How do we use the knowledge that execution units exist? Let us look at the example shown in
Listing 16-17.
Can we make it faster? We see dependencies between the instructions, which hinder the CPU's
ability to execute them in parallel. What we are going to do is unroll the loop so that two iterations of the old loop
become one iteration of the new one. Listing 16-18 shows the result.
add rdi, 16
sub rcx, 2
jnz looper
Now the dependencies are gone: the instructions of the two iterations are interleaved. They will execute
faster in this order because it enhances the simultaneous usage of different CPU execution units. Dependent
instructions should be placed away from each other, allowing other instructions to execute in between.
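In C, the unrolling transformation described above can be sketched like this (sum_simple and sum_unrolled are our own illustrative names; the book's Listings 16-17 and 16-18 are in assembly and operate on 8-byte elements, two per unrolled iteration):

```c
#include <stddef.h>

/* One element per iteration: every addition depends on the
   previous value of sum, forming a serial dependency chain. */
long sum_simple(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled with two independent accumulators: the two additions in
   each iteration do not depend on each other, so different execution
   units can perform them simultaneously. Assumes n is even, as in
   the book's example. */
long sum_unrolled(const long *a, size_t n) {
    long s0 = 0, s1 = 0;
    for (size_t i = 0; i < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    return s0 + s1;
}
```

The key point is the two independent accumulators: simply doing two `sum += ...` statements per iteration would keep the serial dependency chain intact.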
We cannot tell you which execution units are in your CPU, because this is highly model-dependent. You
have to read the optimization manuals for a specific CPU, such as [16]. Additional sources are often helpful;
for example, the Haswell processors are well explained in [17].
16.2 Caching
16.2.1 How Do We Use Cache Effectively?
Caching is one of the most important mechanisms of performance boosting. We spoke about the general
concepts of caching in Chapter 4. This section will further investigate how to use these concepts effectively.
We want to start by noting that, contrary to the spirit of the von Neumann architecture, common
CPUs have been using separate caches for instructions and data for at least 25 years. Instructions and data
virtually always inhabit different memory regions, which explains why separate caches are more effective.
Here we are interested in the data cache.
By default, all memory operations involve the cache, except on pages marked with the cache-write-through
or cache-disable bits (see Chapter 4).
The cache contains small chunks of memory, 64 bytes each, called cache lines, aligned on a 64-byte boundary.
Cache memory differs from main memory at the circuit level. Each cache line is identified by a
tag: the address of the respective memory chunk. Special circuitry makes it possible to retrieve a cache
line by its address very fast (but only for small caches, such as 4 MB per processor; otherwise it is too expensive).
When trying to read a value from memory, the CPU first tries to read it from the cache. If the value is missing
there, the relevant memory chunk is loaded into the cache. This situation is called a cache miss and often has a
huge impact on program performance.
There are usually several levels of cache, each level bigger and slower than the previous one.
The LL (last-level) cache is the cache level closest to main memory.
For programs with good locality, caching works well. However, when locality is broken for a piece
of code, bypassing the cache makes sense. For example, writing values into a large chunk of memory that will
not be accessed any time soon is better performed without the cache.
The CPU tries to predict what memory addresses will be accessed in the near future and preloads the
relevant memory parts into cache. It favors sequential memory accesses.
This gives us two important empirical rules needed to use caches efficiently.
• Try to ensure locality.
• Favor sequential memory access (and design data structures with this point in mind).
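Both rules can be illustrated with a classic C example of our own (not from the book): traversing a two-dimensional array row by row versus column by column.

```c
#include <stddef.h>

enum { N = 512 };
static int m[N][N];   /* stored row-major: m[i][0..N-1] are adjacent */

/* Sequential access: each cache line loaded from memory is fully used,
   and the hardware prefetcher can easily follow the pattern. */
long sum_by_rows(void) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

/* Strided access: consecutive reads are N * sizeof(int) bytes apart,
   so each read may touch a different cache line. */
long sum_by_cols(void) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}
```

Both functions compute the same sum, but on large arrays the row-major traversal is typically much faster, which is the empirical content of the two rules above.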
16.2.2 Prefetching
It is possible to issue a special hint to the CPU to indicate that a certain memory area will be accessed soon.
In Intel 64 it is done using a prefetch instruction. It accepts an address in memory; the CPU will do its best
to preload it into cache in the near future. This is used to prevent cache misses.
Using prefetch can be quite effective, but it should be coupled with testing. It should be
executed before the data accesses themselves, but not too close to them. The cache preloading is performed
asynchronously, meaning that it runs at the same time as the following instructions are executed.
If the prefetch is too close to the data access, the CPU will not have enough time to preload the data into
the cache, and a cache miss will occur anyway.
Moreover, it is important to understand that "close to" and "far from" the data access refer to the
instruction's position in the execution trace. The prefetch does not have to be close in terms of program
structure (in the same function); we just have to choose a place that precedes the data access in time. It can
be located in an entirely different module, for example, in a logging module that just happens to usually
execute before the data access. This is of course very bad for code readability, introduces non-obvious
dependencies between modules, and is a "last resort" kind of technique.
To use prefetch in C, we can use one of GCC built-ins:
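The built-in in question is __builtin_prefetch (a GCC and Clang extension); a minimal sketch of its use, with an illustrative prefetch distance that would need tuning by measurement:

```c
#include <stddef.h>

/* Sum an array while hinting at data a few iterations ahead.
   The distance of 16 elements is an arbitrary example value. */
long sum_with_prefetch(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            /* arguments: address, rw (0 = read), locality (0..3) */
            __builtin_prefetch(&a[i + 16], 0, 1);
        sum += a[i];
    }
    return sum;
}
```

For a simple sequential scan like this the hardware prefetcher usually wins anyway; the built-in pays off for access patterns the hardware cannot predict, such as the binary search below.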
int main() {
    size_t i = 0;
    int NUM_LOOKUPS = SIZE;
    int *array;
    int *lookups;
    srand(time(NULL));
    array = malloc(SIZE * sizeof(int));
    lookups = malloc(NUM_LOOKUPS * sizeof(int));
    for (i = 0; i < SIZE; i++)
        array[i] = i;                 /* sorted, as binary search requires */
    for (i = 0; i < NUM_LOOKUPS; i++)
        lookups[i] = rand() % SIZE;
    for (i = 0; i < NUM_LOOKUPS; i++)
        binarySearch(array, SIZE, lookups[i]);
    free(array);
    free(lookups);
}
The memory access pattern of binary search is hard to predict. It is highly nonsequential: the search
jumps to the middle of the range, then to the middle of one of its halves, then of one of the quarters, and so on.
Let us see the difference in execution times. Listing 16-22 shows the results of execution with prefetching off.
Using the valgrind utility with its cachegrind module we can check the number of cache misses. Listing 16-24
shows the results without prefetching, while Listing 16-25 shows the results with prefetching
(I corresponds to the instruction cache, D to the data cache, and LL to the last-level cache). There are almost 100%
data cache misses, which is very bad.
#include <xmmintrin.h>
void _mm_stream_pi(__m64 *p, __m64 a);
void _mm_stream_ps(float *p, __m128 a);
#include <ammintrin.h>
void _mm_stream_sd(double *p, __m128d a);
void _mm_stream_ss(float *p, __m128 a);
Bypassing cache is useful if we are sure that we will not touch the related memory area for quite a long
time. For further information refer to [12].
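As a minimal sketch, here is a non-temporal store of four floats using _mm_stream_ps from the list above (stream_four is our own helper name; the destination must be 16-byte aligned, and _mm_sfence orders the streamed store before subsequent reads):

```c
#include <xmmintrin.h>

/* Write four floats to dst, bypassing the cache. dst must be
   16-byte aligned, as _mm_stream_ps requires. */
void stream_four(float *dst, float a, float b, float c, float d) {
    /* _mm_set_ps takes its arguments in reverse order:
       element 0 of the vector is the last argument. */
    __m128 v = _mm_set_ps(d, c, b, a);
    _mm_stream_ps(dst, v);   /* non-temporal (cache-bypassing) store */
    _mm_sfence();            /* make the streamed data globally visible */
}
```

Like prefetching, this only pays off for large write-only regions that will not be read again soon; for ordinary stores the cache is almost always the right path.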
We will use the time utility (not the shell built-in) again to test the execution time.
The execution is so much slower because of cache misses, which can be checked using the valgrind utility
with the cachegrind module, as shown in Listing 16-29.
■ Question 334 Take a look at the GCC man pages, section “Optimizations.”
■ Consistency We omit the description of the legacy dedicated floating point stack for brevity. However, we
want to point out that all program parts should be translated using the same method of floating point arithmetic:
either the floating point stack or SSE instructions.
int main() {
float x[4] = {1.0f, 2.0f, 3.0f, 4.0f };
float y[4] = {5.0f, 6.0f, 7.0f, 8.0f };
sse( x, y );
In this example there is a function sse, defined elsewhere, which accepts two arrays of floats.
Each of them should be at least four elements long. This function performs computations and modifies the
first array.
We call values packed if they fill an xmm register or consecutive memory cells of the same size. In
Listing 16-30, float x[4] is four packed single precision float values.
We will define the function sse in the assembly file shown in Listing 16-31.
; rdi = x, rsi = y
sse:
movdqa xmm0, [rdi]
mulps xmm0, [rsi]
addps xmm0, [rsi]
movdqa [rdi], xmm0
ret
This file defines the function sse. It performs four SSE instructions:
• movdqa (MOVe Double Qword Aligned) copies 16 bytes from the memory pointed to by rdi
into the register xmm0. We have seen this instruction in section 14.1.1.
• mulps (MULtiply Packed Single precision floating point values) multiplies the contents
of xmm0 by the four consecutive float values stored in memory at the address taken from rsi.
• addps (ADD Packed Single precision floating point values) adds the same four
consecutive float values, stored in memory at the address taken from rsi, to the contents of xmm0.
• movdqa copies xmm0 into the memory pointed to by rdi.
In other words, four pairs of floats are multiplied, and then the second element of each pair is added
to the product.
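The same computation can be sketched in C with SSE intrinsics (the function name sse_c is ours; the book defines sse in assembly):

```c
#include <xmmintrin.h>

/* C intrinsics equivalent of the assembly `sse` function:
 * x[i] = x[i] * y[i] + y[i] for i = 0..3.
 * Both arrays must be 16-byte aligned, as with movdqa. */
void sse_c(float x[4], const float y[4]) {
    __m128 vx = _mm_load_ps(x);   /* aligned load, like movdqa */
    __m128 vy = _mm_load_ps(y);
    vx = _mm_mul_ps(vx, vy);      /* mulps */
    vx = _mm_add_ps(vx, vy);      /* addps */
    _mm_store_ps(x, vx);          /* aligned store back into x */
}
```

For the arrays from Listing 16-30 this produces x = {10, 18, 28, 40}.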
The naming pattern is common: the action semantics (mov, add, mul…) with suffixes. The first suffix
is either P (packed) or S (scalar, for single values). The second one is either D for double precision values
(double in C) or S for single precision values (float in C).
We want to emphasize again that most SSE instructions accept only aligned memory operands.
In order to complete the assignment, you will need to study the documentation for the following
instructions using the Intel Software Developer Manual [15]:
• movsd–Move Scalar Double-Precision Floating-Point Value.
• movdqa–Move Aligned Double Quadword.
• movdqu–Move Unaligned Double Quadword.
• mulps–Multiply Packed Single-Precision Floating-Point Values.
• mulpd–Multiply Packed Double-Precision Floating-Point Values.
• addps–Add Packed Single-Precision Floating-Point Values.
• haddps–Packed Single-FP Horizontal Add.
C does not implement such arithmetic, so we will have to check for overflows manually. SSE contains
instructions that convert floating point values to single byte integers with saturation.
Performing the transformation in C is easy. It demands a direct encoding of the matrix-by-vector
multiplication, taking saturation into account. Listing 16-32 shows the code.
struct image {
uint32_t width, height;
struct pixel* array;
};
pixel->r = sat(
old.r * c[0][0] + old.g * c[0][1] + old.b * c[0][2]
);
pixel->g = sat(
old.r * c[1][0] + old.g * c[1][1] + old.b * c[1][2]
);
pixel->b = sat(
old.r * c[2][0] + old.g * c[2][1] + old.b * c[2][2]
);
}
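The sat helper is not shown in the fragment above; a plausible definition (our assumption, not the book's code) simply clamps the intermediate result into the unsigned byte range, mimicking the saturation performed by the SSE conversion instructions:

```c
#include <stdint.h>

/* Clamp an intermediate floating point result into [0, 255],
 * the range of a pixel component (saturating conversion). */
static uint8_t sat(double x) {
    if (x < 0.0)   return 0;
    if (x > 255.0) return 255;
    return (uint8_t)x;
}
```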
The last few pixels that did not fill the last chunk can be processed one by one using the C code provided
in Listing 16-32.
• Make sure that both C and assembly versions produce similar results.
• Compile two programs; the first should use a naive C approach and the second one
should use SSE instructions.
• Compare the time of execution of C and SSE using a huge image as an input
(preferably hundreds of megabytes).
• Repeat the comparison multiple times and calculate the average values for SSE
and C.
To make a noticeable difference, we have to perform as many operations in parallel as we can. Each pixel
consists of 3 bytes; after converting its components into floats it will occupy 12 bytes. Each xmm register is 16
bytes wide. If we want to be efficient, we will have to use the last 4 bytes as well. To achieve that, we use a frame
of 48 bytes, which corresponds to three xmm registers, to 12 pixel components, and to 4 pixels.
Let the subscript denote the index of a pixel. The image then looks as follows:
b1g1r1b2g2r2b3g3r3b4g4r4 …
We would like to compute the first four components. Three of them correspond to the first pixel; the
fourth corresponds to the second pixel.
To perform necessary transformations it is useful to first put the following values into the registers:
xmm0 = b1b1b1b2
xmm1 = g1g1g1g2
xmm2 = r1r1r1r2
We will store the matrix coefficients in either xmm registers or memory, but it is important to store the
columns, not the rows.
To demonstrate the algorithm, we will use the following start values:
xmm3 = c11|c21|c31|c11
xmm4 = c12|c22|c32|c12
xmm5 = c13|c23|c33|c13
We use mulps to multiply these packed values with xmm0…xmm2.
xmm3 = b1c11|b1c21|b1c31|b2c11
xmm4 = g1c12|g1c22|g1c32|g2c12
xmm5 = r1c13|r1c23|r1c33|r2c13
The next step is to add them using addps instructions.
Similar actions should be performed with the other two 16-byte-wide parts of the frame, containing
g2r2b3g3 and r3b4g4r4.
This technique of using a transposed coefficient matrix allows us to do without horizontal addition
instructions such as haddps. It is described in detail in [19].
To measure time, use getrusage(RUSAGE_SELF, &r) (read the man pages for getrusage first). It fills a structure r
of type struct rusage, whose field r.ru_utime is of type struct timeval. It contains, in turn,
a pair of values: the seconds spent and the microseconds spent. By comparing these values before the transformation
and after it, we can deduce the time spent on the transformation.
getrusage(RUSAGE_SELF, &r );
start = r.ru_utime;
getrusage(RUSAGE_SELF, &r );
end = r.ru_utime;
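A small helper (ours, not from the book) shows the arithmetic for turning two such snapshots into an elapsed time in microseconds:

```c
#include <sys/time.h>   /* struct timeval */
#include <stdint.h>

/* Elapsed time in microseconds between two struct timeval snapshots,
 * for example, r.ru_utime taken before and after the transformation. */
int64_t elapsed_us(struct timeval start, struct timeval end) {
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000
         + (end.tv_usec - start.tv_usec);
}
```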
Use a table to perform a fast conversion from unsigned char into float.
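Such a table can be sketched as follows (the names are ours): it is filled once, and every subsequent conversion becomes a single array lookup.

```c
/* One-time table mapping every unsigned char value to its float
 * equivalent, replacing a per-pixel int-to-float conversion. */
static float byte_to_float[256];

void init_byte_to_float(void) {
    for (int i = 0; i < 256; i++)
        byte_to_float[i] = (float)i;
}
```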
■ Question 335 Read about methods of calculating the confidence interval and calculate the 95%
confidence interval for a reasonably high number of measurements.
16.5 Summary
In this chapter we have talked about compiler optimizations and why they are needed. We have seen
how far the translated optimized code can drift from its initial version. Then we studied how to get the
most benefit from caching and how to parallelize floating point computations at the instruction level using
SSE instructions. In the next chapter we will see how to parallelize the execution of instruction sequences, create
multiple threads, and change our vision of memory in the presence of multithreading.
■ Question 336 What GCC options control the optimization options globally?
■ Question 337 What kinds of optimizations can potentially bring the most benefits?
■ Question 338 What kinds of benefits and disadvantages can omitting a frame pointer bring?
■ Question 339 How is a tail recursive function different from an ordinary recursive function?
■ Question 340 Can any recursive function be rewritten as a tail recursive one without using additional data
structures?
■ Question 341 What is common subexpression elimination? How does it affect the way we write code?
■ Question 343 Why should we mark functions static whenever we can to help the compiler optimizations?
■ Question 344 What benefits does named return value optimization bring?
■ Question 345 What is branch prediction?
■ Question 346 What are Dynamic Branch Prediction, Global and Local History Tables?
■ Question 347 Check the notes on branch prediction for your CPU in [16].
■ Question 348 What is an execution unit and why do we care about them?
■ Question 349 How are AVX instruction speed and the amount of execution units related?
■ Question 350 What kinds of memory accessing patterns are good?
■ Question 352 In which cases might prefetch bring performance gains and why?
CHAPTER 17
Multithreading
In this chapter we will explore the multithreading capabilities provided by the C language. Multithreading is
a topic for a book on its own, so we will concentrate on the language features and relevant properties of the
abstract machine rather than good practices and program architecture-related topics.
Until C11, support for multithreading was external to the language itself, provided via libraries and nonstandard
tricks. A part of it (atomics) is now implemented in many compilers and provides a standard-compliant way of
writing multithreaded applications. Unfortunately, to this day, support for threading itself is not implemented
in most toolchains, so we are going to use the pthreads library in our code examples. We will still use the
standard-compliant atomics.
This chapter is by no means an exhaustive guide to multithreaded programming, which is a beast worth
writing a dedicated book about, but it will introduce the most important concepts and relevant language
features. If you want to become proficient in it, we recommend lots of practice, specialized articles, books
such as [34], and code reviews from your more experienced colleagues.
x[5] = y[5];
x[10005] = y[10005];
CHAPTER 17 ■ MULTITHREADING
mov al, [rsi + 5]
mov [rdi + 5], al
However, it is evident that this code could be rewritten to ensure better locality; that is, first assign x[4]
and x[5], then assign x[10004] and x[10005], as shown in Listing 17-3.
The effects of these two instruction sequences are similar if the abstract machine only considers one
CPU: given any initial machine state, the resulting state after their execution will be the same. The second
translation often performs faster, so the compiler might prefer it. This is a simple case of memory
reordering, a situation in which memory accesses are reordered compared to the source code.
For single-threaded applications, which are executed "really sequentially," we can often expect the order of
operations to be irrelevant as long as the observable behavior is unchanged. This freedom ends as soon
as we start communicating between threads.
Most inexperienced programmers do not think much about this because they limit themselves to
single-threaded programming. These days, we can no longer afford not to think about parallelism, because
of how pervasive it is and how often it is the only thing that can really boost program performance. So, in
this chapter, we are going to talk about memory reorderings and how to deal with them correctly.
There are two extreme poles: weak and strong memory models. Just as with strong and weak
typing, most existing conventions fall somewhere in between, closer to one pole or the other. We have found a
classification made by Jeff Preshing [31] to be useful and will stick to it in this book.
According to it, memory models can be divided into four categories, enumerated from the more
relaxed ones to the stronger ones.
1. Really weak.
In these models, any kind of memory reordering can happen (as long as the
observable behavior of a single-threaded program is unchanged, of course).
2. Weak with data dependency ordering (such as hardware memory model of
ARM v7).
Here we speak about one particular kind of data dependency: the one between
loads. It occurs when we need to fetch an address from memory and then use it to
perform another fetch, for example,
mov rdx, [rbx]
mov rax, [rdx]
In C this is the situation when we use the -> operator to reach a field of a certain
structure through a pointer to that structure.
Really weak memory models do not guarantee data dependency ordering.
3. Usually strong (such as the hardware memory model of Intel 64).
This means that there is a guarantee that all stores are performed in the same order
as provided. Some loads, however, can be moved around.
Intel 64 usually falls into this category.
4. Sequentially consistent.
This can be described as what you see when you debug a non-optimized program step by step on a
single processor core. No memory reordering ever happens.
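The data dependency ordering mentioned in category 2 can be illustrated with a minimal C sketch (the type and names are ours): the address used by the second load is itself the result of the first load, exactly the dependent-load chain produced by the -> operator.

```c
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Two dependent loads: first fetch n->next from memory, then use
 * that freshly loaded address to fetch ...->value. */
int second_value(const struct node *n) {
    return n->next->value;
}
```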
void thread1(void) {
x = 1;
print(y);
}
void thread2(void) {
y = 1;
print(x);
}
Both threads share variables x and y. One of them performs a store into x and then loads the value of y,
while the other one does the same, but with y and x instead.
We are interested in two types of memory accesses: load and store. In our examples, we will often omit
all other actions for simplicity.
As these instructions are completely independent (they operate on different data), they can be
reordered inside each thread without changing observable behavior, giving us four options: store + load or
load + store for each of two threads. This is what a compiler can do for its own reasons. For each option six
possible execution orders exist. They depict how both threads advance in time relative to one another.
We show them as sequences of 1 and 2; if the first thread made a step, we write 1; otherwise the second
one made a step.
1. 1-1-2-2
2. 1-2-1-2
3. 2-1-1-2
4. 2-1-2-1
5. 2-2-1-1
6. 1-2-2-1
For example, 1-1-2-2 means that the first thread executed its two steps, and then the second thread
did the same. Each sequence corresponds to four different scenarios. For example, the sequence 1-2-1-2
encodes one of the traces shown in Table 17-1:
Table 17-1. Possible Instruction Execution Sequences If Processes Take Turns as 1-2-1-2
If we consider the possible traces for each execution order, we come up with 24 scenarios (some of
which are equivalent). As you see, even for small examples these numbers can get quite large.
We do not need all these possible traces anyway; we are interested in the position of the load relative to the store
for each variable. Even in Table 17-1 many possible combinations are present: both x and y can be stored, then
loaded, or loaded, then stored. Obviously, the result of a load depends on whether there was a store before it.
Were reorderings not in the game, we would be limited: each of the two specified loads would be
preceded by the store in its own thread, because that is how the source code is written; scheduling instructions
in a different manner cannot change that. However, as reorderings are present, we can sometimes achieve an
interesting outcome: if both threads have their instructions reordered, we arrive at the situation shown in Listing 17-5.
void thread1(void) {
print(y);
x = 1;
}
void thread2(void) {
print(x);
y = 1;
}
If the schedule 1-2-*-* (where * denotes either thread) is chosen, we execute load x and load y first,
which will make both appear equal to 0 to everyone who uses the results of these loads.
This is indeed possible in case the compiler reordered these operations. But even if compiler reorderings are
well controlled or disabled, the memory reorderings performed by the CPU can still produce such an effect.
This example demonstrates that the outcome of such a program is highly unpredictable. Later we
are going to study how to limit reorderings by the compiler and by the CPU; we will also provide code that
demonstrates this reordering in the hardware.
int x, y;
x = 1;
y = 2;
x = 3;
As we have already seen, the instructions can be reordered by the compiler. Even more, the compiler
can deduce that the first assignment is dead code, because it is followed by another assignment to the
same variable x. As it is useless, the compiler can even remove this statement.
The volatile keyword addresses this issue. It forces the compiler never to optimize away writes and
reads of the said variable and also suppresses any reordering of them. However, it enforces
these restrictions only for that single variable and gives no guarantee about the order in which writes to different
volatile variables are emitted. For example, in the previous code, changing the type of both x and y to
volatile int will impose an order on the assignments to each of them, but will still allow the
writes to be interleaved freely, as follows:
volatile int x, y;
x = 1;
x = 3;
y = 2;
Or like this:
volatile int x, y;
y = 2;
x = 1;
x = 3;
Obviously, these guarantees are not sufficient for multithreaded applications. You cannot use volatile
variables to organize access to shared data, because these accesses can still be moved around quite freely.
The asm directive is used to include inline assembly code directly into C programs. The volatile
keyword together with the "memory" clobber argument describes that this (empty) piece of inline assembly
cannot be optimized away or moved around and that it performs memory reads and/or writes. Because of
that, the compiler is forced to emit the code to commit all operations to memory (e.g., store the values of
the local variables, cached in registers). It does not prevent the processor from performing speculative reads
past this statement, so it is not a memory barrier for the processor itself.
Obviously, both compiler and CPU memory barriers are costly because they prevent optimizations. That
is why we do not want to use them after each instruction.
There are several kinds of memory barriers. We will speak about those that are defined in the Linux
kernel documentation, but this classification is applicable in most situations.
1. Write memory barrier.
It guarantees that all store operations specified in code before the barrier will
appear to happen before all store operations specified after the barrier.
GCC uses asm volatile(""::: "memory") as a general memory barrier. Intel 64
uses the instruction sfence.
asm ("mfence");
By combining it with the compiler barrier, we get a line that both prevents compiler reordering and also
acts as a full memory barrier.
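Such barriers are commonly written as macros with GCC-style inline assembly; here is a sketch (the macro names are ours, and the mfence variant assumes Intel 64 and a GCC-compatible compiler):

```c
/* Compiler-only barrier: forbids the compiler from moving memory
 * accesses across this point, but emits no machine instruction. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Combined barrier on Intel 64: mfence orders all loads and stores
 * in the CPU, while the "memory" clobber restrains the compiler. */
#define full_barrier() __asm__ __volatile__("mfence" ::: "memory")
```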
Any function call whose definition is not available in the current translation unit and that is not an
intrinsic (a cross-platform substitute for a specific assembly instruction) acts as a compiler memory barrier.
Using multithreading to speed up CPU-bound programs can be beneficial. A common pattern is to use
a queue of requests that are dispatched to worker threads from a thread pool: a set of created
threads that are either working or waiting for work, and that are not re-created each time there is a need for them.
Refer to Chapter 7 of [23] for more details.
As for how many threads we need, there is no universal recipe. Creating threads, switching between
them, and scheduling them produces overhead, which might make the whole program slower if there is not
much work for the threads to do. For computation-heavy tasks, some people advise spawning n − 1 threads, where
n is the total number of processor cores. In tasks that are sequential by their very nature (where every step
depends directly on the previous one), spawning multiple threads will not help. What we do recommend is to
always experiment with the number of threads under different workloads to find out which number suits the
given task best.
Note that code that uses the pthread library must be compiled with the -pthread flag.
Specifying -lpthread alone will not give the desired result: linking with libpthread.a is
not enough, because several preprocessor options (e.g., _REENTRANT) are enabled by -pthread. So,
whenever the -pthread option is available,1 use it.
1. This option is documented as platform specific, so it might be unavailable on some platforms.
Initially, there is only one thread, which starts executing the main function. A pthread_t instance stores all
the information about some other thread, so that we can control it using this instance as a handle. Further
threads are created with the pthread_create function, which has the following signature:
int pthread_create(
pthread_t *thread,
const pthread_attr_t *attr,
void *(*start_routine) (void *),
void *arg);
The first argument is a pointer to the pthread_t instance to be initialized. The second one is a collection
of attributes, which we will touch on later; for now, it is safe to pass NULL instead.
The thread starting function should accept a pointer and return a pointer. It accepts a void* pointer to
its argument. Only one argument is allowed; however, you can easily create a structure or an array, which
encapsulates multiple arguments, and pass a pointer to it. The return value of the start_routine is also a
pointer and can be used to return the work result of the thread.2 The last argument is the actual pointer to
the argument, which will be passed to the start_routine function.
In our example, each thread is implemented the same way: it accepts a pointer (to a string) and
then repeatedly outputs it with an interval of approximately one second. The sleep function, declared in
unistd.h, suspends the current thread for a given number of seconds.
After ten iterations, the thread returns. It is equivalent to calling the function pthread_exit with an
argument. The return value is usually the result of the computations performed by the thread; return NULL if
you do not need it. We will see later how it is possible to get this value from the parent thread.
■ Casting to void Constructions such as (void)argc have only one purpose: to suppress warnings about
an unused variable or argument argc. You can sometimes find them in source code.
However, naively returning from the main function will lead to process termination. What if other
threads still exist? The main thread should wait for their termination first! This is what pthread_exit does
when it is called in the main thread: it waits for all other threads to terminate and then terminates the
program. All the code that follows it will not be executed, so you will not see the bye message in stdout.
This program will output a pair of buzz and fizz lines in random order ten times and then exit. It is
impossible to predict whether the first or the second thread will be scheduled first, so each time the order
can differ. Listing 17-7 shows an exemplary output.
2. Remember not to return the address of a local variable!
buzzzz
buzzzz
fizz
buzzzz
fizz
buzzzz
fizz
buzzzz
fizz
buzzzz
fizz
As you see, the string bye is not printed, because the corresponding puts call is below the pthread_exit call.
■ Where are the arguments located? It is important to note that the pointer to an argument passed to a
thread should point to data that stays alive until the said thread is shut down. Passing a pointer to a stack-
allocated variable can be risky: after the stack frame of the function is destroyed, accessing the
deallocated variable yields undefined behavior.
Unless the arguments are guaranteed to be constant (or you intend to use them for synchronization purposes),
do not pass the same data to different threads.
In the example shown in Listing 17-6, the strings that are accepted by threadimpl are allocated in the global
read-only memory (.rodata). Thus passing a pointer to it is safe.
The maximum number of spawned threads depends on the implementation. In Linux, for example, you can
use ulimit -a to get the relevant information.
Threads can create other threads; there is no limitation on that.
It is indeed guaranteed by the pthreads implementation that a call to pthread_create acts as a full
compiler memory barrier as well as a full hardware memory barrier.
pthread_attr_init is used to initialize an instance of an opaque type pthread_attr_t (implemented
as an incomplete type). Attributes provide additional parameters for threads such as stack size or address.
The following functions are used to set attributes:
• pthread_attr_setaffinity_np–the thread will prefer to be executed on a specific
CPU core.
• pthread_attr_setdetachstate–determines whether we will be able to call pthread_join on this thread,
or whether it will be detached (as opposed to joinable). The purpose of pthread_join is
explained in the next section.
• pthread_attr_setguardsize–sets up the space before the stack limit as a region of
forbidden addresses of a given size, to catch stack overflows.
• pthread_attr_setinheritsched–determines whether the following two parameters are inherited from
the parent thread (the one where the creation happened) or taken from the attributes
themselves.
• pthread_attr_setschedparam–currently concerns only the scheduling priority, but
additional parameters may be added in the future.
■ Question 357 Read man pages for the functions listed earlier.
pthread_t t;
void* result;
pthread_join accepts two arguments: the thread itself and the address of a void* variable, which
will be set to the thread's execution result.
Thread joining acts as a full memory barrier, because no reads or writes that are intended to happen
after the join should be moved before it.
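A minimal sketch of joining a thread and retrieving its return value (the function names are ours): the worker returns a heap-allocated result, and the parent reads and frees it after pthread_join.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Worker: computes a value and returns it through a heap
 * allocation (never return the address of a local variable). */
static void* square_worker(void *arg) {
    int64_t *res = malloc(sizeof *res);
    int64_t x = (int64_t)(intptr_t)arg;
    *res = x * x;
    return res;
}

/* Parent: spawns the worker, joins it, and collects the result. */
int64_t squared_in_thread(int64_t x) {
    pthread_t t;
    void *result;
    pthread_create(&t, NULL, square_worker, (void*)(intptr_t)x);
    pthread_join(t, &result);   /* blocks until the worker returns */
    int64_t v = *(int64_t*)result;
    free(result);
    return v;
}
```

Remember that the program must be compiled with -pthread.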
By default, threads are created joinable, but one might create a detached thread instead. This can bring a
certain benefit: the resources of a detached thread can be released immediately upon its termination,
whereas a joinable thread waits to be joined before its resources can be released. To create a
detached thread
• Create an attribute instance pthread_attr_t attr;
• Initialize it with pthread_attr_init( &attr );
• Call pthread_attr_setdetachstate( &attr, PTHREAD_CREATE_DETACHED ); and
• Create the thread by using pthread_create with a &attr as the attribute argument.
A joinable thread can be explicitly made detached by calling pthread_detach.
It is impossible to do it the other way around.
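The four steps above can be sketched as follows (the wrapper and worker names are ours):

```c
#include <pthread.h>

/* Create a detached thread following the steps above; its resources
 * are released automatically when start_routine returns. */
int spawn_detached(void *(*start_routine)(void *), void *arg) {
    pthread_attr_t attr;
    pthread_t t;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    int rc = pthread_create(&t, &attr, start_routine, arg);
    pthread_attr_destroy(&attr);
    return rc;   /* 0 on success */
}

/* A trivial worker, used only to demonstrate the wrapper. */
static void* noop(void *arg) { (void)arg; return NULL; }
```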
The code is quite simple: we naively iterate over all numbers from 1 to the input and check whether
they are factors or not. Note that the input value is marked volatile to prevent the whole result from being
computed at compile time. Compile the code with the following command:
We will start parallelization with a dumbed-down version of the multithreaded code, which always
performs the computation in two threads and is not architecturally beautiful. Listing 17-10 shows it.
int input = 0;
int result1 = 0;
void* fact_worker1( void* arg ) {
result1 = 0;
for( uint64_t i = 1; i < input/2; i++ )
if ( input % i == 0 ) result1++;
return NULL;
}
int result2 = 0;
void* fact_worker2( void* arg ) {
result2 = 0;
for( uint64_t i = input/2; i <= input; i++ )
if ( input % i == 0 ) result2++;
return NULL;
}
What is this program doing? We split the range (0, n] into two halves. Two worker threads
compute the number of factors in their respective halves. When both of them have been joined, we are
guaranteed that they have already performed all their computations; the results just need to be summed up.
Then, in Listing 17-11 we show the multithreaded program that uses an arbitrary number of threads to
compute the same result. It has a better-thought-out architecture.
#define THREADS 4
struct fact_task {
uint64_t num;
uint64_t from, to;
uint64_t result;
};
uint64_t start = 1;
size_t step = num / threads_count;
uint64_t result = 0;
for ( size_t i = 0; i < threads_count; i++ ) {
pthread_join( threads[i], NULL );
result += tasks[i].result;
}
free( tasks );
free( threads );
return result;
}
int main( void ) {
uint64_t input = 2000000000;
printf( "Factors of %"PRIu64": %"PRIu64"\n",
input, factors_mp(input, THREADS ) );
return 0;
}
Suppose we are using t threads. To count the number of factors of n, we split the range from 1 to n into t
equal parts. We compute the number of factors in each of those intervals and then sum up the results.
We create a type called struct fact_task to hold the information about a single task. It includes the
number itself, the range bounds from and to, and a slot for the result, which will be the number of factors of
num between from and to.
All workers who calculate the number of factors are implemented alike, as a routine fact_worker,
which accepts a pointer to a struct fact_task, computes the number of factors, and fills the result field.
The code performing the thread launch and the collection of results is contained in the factors_mp function,
which, for a given number of threads, does the following:
• Allocating the task descriptions and the thread instances;
• Initializing the task descriptions;
• Starting all threads;
• Waiting for each thread to end its execution by using join and adding up its result to
the common accumulator result; and
• Freeing all allocated memory.
So, we put the thread creation into a black box, which allows us to benefit from the multithreading.
This code can be compiled with the following command:
Using multiple threads decreases the overall execution time of this CPU-bound task on a multicore
system.
To test the execution time, we will stick with the time utility again (a program, not a shell builtin
command). To ensure that the program is used instead of the shell builtin, we prepend it with a
backslash.
The multithreaded program took 6.5 seconds to be executed, while the single-threaded version took
almost 22 seconds. That is a big improvement.
In order to speak about performance, we are going to introduce the notion of speedup. Speedup is
the improvement in the speed of execution of a program run on two similar architectures with different
resources. By introducing more threads we make more resources available; hence an improvement
might take place.
Obviously, for the first example we have chosen a task that is easy and efficient to solve in parallel.
The speedup will not always be that substantial, if there is any at all; however, as we see, the overall code is compact
enough (it could be even shorter if we did not take extensibility into account and, for example, fixed the number of
threads instead of passing it as a parameter).
■ Question 359 Experiment with the number of threads and find the optimal one in your own environment.
■ Question 360 Read about functions: pthread_self and pthread_equal. Why can’t we compare threads
with a simple equality operator ==?
17.8.5 Mutexes
While thread joining is an accessible technique, it does not provide a means to control thread execution
"on the run." Sometimes we want to ensure that the actions performed in one thread are not
performed before some other actions in other threads. Otherwise, we get a system that does not
always work in a stable manner: its output becomes dependent on the actual
order in which the instructions from different threads are executed. This occurs when working with
mutable data shared between threads. Such situations are called data races, because the threads compete
for the resources, and any thread can win and get to them first.
To prevent such situations, there is a number of tools, and we will start with mutexes.
A mutex (shorthand for "mutual exclusion") is an object that can be in one of two states: locked and unlocked.
We work with it using two operations.
• Lock. Changes the state from unlocked to locked. If the mutex is already locked, the
attempting thread waits until the mutex is unlocked by another thread.
• Unlock. If the mutex is locked, it becomes unlocked.
Mutexes are often used to provide an exclusive access to a shared resource (like a shared piece of data).
The thread that wants to work with the resource locks the mutex, which is exclusively used to control an
access to a resource. After the work with the resource is finished, the thread unlocks the mutex.
Mutex locking and unlocking each act as both a compiler and a full hardware memory barrier, so no reads or
writes can be reordered before the lock or after the unlock.
Listing 17-12 shows an example program which needs a mutex.
uint64_t value = 0;
int main(void) {
pthread_create( &t1, NULL, impl1, NULL );
pthread_create( &t2, NULL, impl1, NULL );
This program has two threads, both implemented by the function impl1. Both threads repeatedly
increment the shared variable value, 10000000 times each.
The program should be compiled with optimizations disabled, to prevent the incrementing loop
from being transformed into a single value += 10000000 statement (alternatively, we can make value volatile).
The resulting output, however, is not 20000000, as we might have expected, and it differs each time we
launch the executable:
> ./a.out
11297520
> ./a.out
10649679
> ./a.out
13765500
The problem is that incrementing a variable is not an atomic operation from the C point of view. The
generated assembly code conforms to this description by using multiple instructions: one to read the value, one to add
one, and one to write it back. This allows the scheduler to give the CPU to another thread "in the middle" of a
running increment operation. The optimized code might or might not behave the same way.
To prevent this mess we are going to use a mutex to grant one thread at a time the exclusive right to work
with value. This way we enforce correct behavior. Listing 17-13 shows the modified program.
pthread_mutex_t m; //
uint64_t value = 0;
value += 1;
int main(void) {
pthread_mutex_init( &m, NULL ); //
pthread_mutex_destroy( &m ); //
return 0;
}
> ./a.out
20000000
The mutex m is associated by the programmer with the shared variable value. No modifications of value
should be performed outside the code section between the m lock and unlock. If this constraint is satisfied,
there is no way value can be changed by another thread once the lock is taken. The lock acts as a memory
barrier as well: value will be reread after the lock is taken and can then be safely cached in a register.
There is no need to make the variable value volatile: that would only suppress optimizations, and the
program is correct anyway.
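Assembling the fragments of Listings 17-12 and 17-13 gives a self-contained sketch. The thread body and variable names follow the listings; the iteration count is reduced here from the book's 10000000 so the demo runs quickly (compile with -pthread):

```c
#include <pthread.h>
#include <stdint.h>

/* Shared state, guarded by mutex m as in Listing 17-13. */
static pthread_mutex_t m;
static uint64_t value;

enum { ITERATIONS = 1000000 };

static void *impl1(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&m);   /* exclusive access to value */
        value += 1;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

/* Runs both threads to completion and returns the final counter. */
uint64_t run_counters(void) {
    pthread_t t1, t2;
    value = 0;
    pthread_mutex_init(&m, NULL);
    pthread_create(&t1, NULL, impl1, NULL);
    pthread_create(&t2, NULL, impl1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_mutex_destroy(&m);
    return value;
}
```

With the lock in place the result is deterministic: exactly twice the iteration count, no matter how the scheduler interleaves the threads.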
Before a mutex can be used, it should be initialized with pthread_mutex_init, as seen in the main
function. It accepts attributes, just like the pthread_create function, which can be used to create a recursive
mutex, create a deadlock-detecting mutex, control robustness (what happens if the mutex owner thread
dies?), and more.
To dispose of a mutex, call pthread_mutex_destroy.
■ Question 361 What is a recursive mutex? How is it different from an ordinary one?
17.8.6 Deadlocks
A sole mutex is rarely a cause of problems. However, when you lock multiple mutexes at a time, several kinds
of strange situations can happen. Take a look at the example shown in Listing 17-14.
thread1 () {
lock(A);
lock(B);
unlock(B);
unlock(A);
}
thread2() {
lock(B);
lock(A);
unlock(A);
unlock(B);
}
This pseudo code demonstrates a situation where both threads can hang forever. Imagine that the
following sequence of actions happened due to unlucky scheduling:
• Thread 1 locked A; control transferred to thread 2.
• Thread 2 locked B; control transferred to thread 1.
After that, the threads will try to do the following:
• Thread 1 will attempt to lock B, but B is already locked by thread 2.
• Thread 2 will attempt to lock A, but A is already locked by thread 1.
Both threads will be stuck in this state forever. When threads are stuck in a locked state waiting for each
other to perform an unlock, the situation is called a deadlock.
The cause of the deadlock is the different order in which the locks are taken by the two threads. This
leads us to a simple rule that will save us most of the time when we need to lock several mutexes at once.
■ Preventing deadlocks Order all mutexes in your program in an imaginary sequence. Only lock mutexes in
the same order they appear in this sequence.
For example, suppose we have mutexes A, B, C, and D. We impose a natural order on them: A < B < C < D. If
you need to lock both D and B, you should always lock them in the same order, thus B first, D second.
If this invariant is kept, no two threads will lock a pair of mutexes in a different order.
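One concrete way to enforce such an ordering in C is to order mutexes by their addresses. This is a common convention, not the book's own prescription; any fixed total order works:

```c
#include <pthread.h>
#include <stdint.h>

/* Lock two mutexes in a globally consistent order (by address),
   so no pair of threads can take them in opposite orders. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a; a = b; b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    /* Unlock order does not matter for deadlock freedom. */
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}

/* Single-threaded smoke test: both argument orders take the
   locks in the same real order, so neither call can deadlock. */
int lock_pair_demo(void) {
    pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t y = PTHREAD_MUTEX_INITIALIZER;
    lock_pair(&x, &y);
    unlock_pair(&x, &y);
    lock_pair(&y, &x);
    unlock_pair(&y, &x);
    return 0;
}
```

If thread1 calls lock_pair(&A, &B) and thread2 calls lock_pair(&B, &A), both end up locking the same mutex first, which rules out the circular wait from Listing 17-14.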
■ Question 362 What are Coffman’s conditions? How can they be used to diagnose deadlocks?
17.8.7 Livelocks
A livelock is a situation in which two threads are stuck, but not in a waiting-for-mutex-unlock state. Their
states keep changing, yet they make no real progress. For example, pthreads does not allow you to check
whether a mutex is locked. Providing information about the mutex state would be useless,
because by the time you obtain it, the state may already have been changed by another thread.
However, pthread_mutex_trylock is provided, which either locks a mutex or returns an error if it has
already been locked by someone else. Unlike pthread_mutex_lock, it does not block the current thread waiting
for the unlock. Using pthread_mutex_trylock can lead to livelock situations. Listing 17-15 shows a simple
example in pseudo code.
thread1() {
lock( m1 );
while ( mutex_trylock m2 indicates LOCKED ) {
unlock( m1 );
wait for some time;
lock( m1 );
}
// now we are good because both locks are taken
}
thread2() {
lock( m2 );
while ( mutex_trylock m1 indicates LOCKED ) {
unlock( m2 );
wait for some time;
lock( m2 );
}
// now we are good because both locks are taken
}
Both threads defy the principle "locks should always be taken in the same order." Each of
them wants to lock the two mutexes m1 and m2.
The first thread performs as follows:
• Locks the mutex m1.
• Tries to lock mutex m2. On failure, unlocks m1, waits, and locks m1 again.
This pause is meant to give the other thread time to lock m1 and m2 and perform whatever it wants to
do. However, we might get stuck in a loop when
1. Thread 1 locks m1, thread 2 locks m2.
2. Thread 1 sees that m2 is locked and unlocks m1 for a time.
3. Thread 2 sees that m1 is locked and unlocks m2 for a time.
4. Go back to step one.
This loop can take forever to complete or can produce significant delays; it is entirely up to the
operating system scheduler. So the problem with this code is that execution traces exist that forever
prevent the threads from progressing.
while (!sent)
pthread_cond_wait( &condvar, &m );
sent = true;
pthread_cond_signal( &condvar );
pthread_mutex_destroy( &m );
return 0;
}
./a.out
Thread2 before wait
Thread1 before signal
Thread1 after signal
Thread2 after wait
We could of course put the thread to sleep for a while, but then we would still wake up either too rarely to
react to the event in time or too often.
Condition variables let us wait exactly as long as needed and continue the thread's execution in the locked state.
An important point should be explained. Why did we introduce a shared variable sent? Why are we
using it together with the condition variable? Why are we waiting inside the while (!sent) loop?
The most important reason is that the implementation is permitted to issue spurious wake-ups to a
waiting thread. This means that the thread can wake up from waiting on a signal not only after receiving it but
at any time. Since the sent variable is only set before the signal is sent, on a spurious wake-up the thread
will check its value and, if it is still false, call pthread_cond_wait again.
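The fragments of the condition variable listing can be fleshed out into a runnable sketch. Thread names and the init/destroy calls are assumed here; static initializers are used for brevity (compile with -pthread):

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t condvar = PTHREAD_COND_INITIALIZER;
static bool sent = false;

static void *waiter(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    while (!sent)                        /* guards against spurious wake-ups */
        pthread_cond_wait(&condvar, &m); /* atomically unlocks m while waiting */
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *sender(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    sent = true;
    pthread_cond_signal(&condvar);
    pthread_mutex_unlock(&m);
    return NULL;
}

/* Returns true if the waiter observed sent == true. */
bool condvar_demo(void) {
    pthread_t tw, ts;
    sent = false;
    pthread_create(&tw, NULL, waiter, NULL);
    pthread_create(&ts, NULL, sender, NULL);
    pthread_join(tw, NULL);
    pthread_join(ts, NULL);
    return sent;
}
```

Note that the demo is correct in both orderings: if the sender runs first, the waiter finds sent already true and never blocks.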
17.8.9 Spinlocks
A mutex is a reliable way of doing synchronization. Trying to lock a mutex that is already taken by another
thread puts the current thread into a sleeping state. Putting the thread to sleep and waking it up has its costs,
notably for the context switch, but if the wait is long, these costs pay off. We spend a little time
going to sleep and waking up, but during the prolonged sleep the thread does not use the CPU.
What would be an alternative? Active idling, described by the following simple pseudo code:
The variable locked is a flag showing whether some thread has taken the lock. If another thread has taken
it, the current thread constantly polls its value until it changes back; otherwise it takes the lock
itself. This wastes CPU time (and increases power consumption), which is bad. However,
it can increase performance when the expected waiting time is very short. This mechanism is called a
spinlock.
Spinlocks only make sense on multicore and multiprocessor systems. Using a spinlock on a single core is
useless. Imagine a thread enters the loop inside the spinlock. It keeps waiting for another thread to change
the locked value, but no other thread is executing at this very moment, because there is only one core, switching
from thread to thread. Eventually the scheduler will put the current thread to sleep and let other threads
run, but that just means we have wasted CPU cycles executing an empty loop for no reason at all! In
this case, going to sleep right away is always better, and hence there is no use for a spinlock.
This scenario can of course occur on a multicore system as well, but there is also a (usually good)
chance that the other thread will unlock the spinlock before the time quantum given to the current thread
expires.
Overall, using spinlocks can be beneficial or not; it depends on the system configuration, program logic,
and workload. When in doubt, test and prefer using mutexes (which are often implemented by first taking a
spinlock for a number of iterations and then falling into the sleep state if no unlock occurred).
Implementing a fast and correct spinlock in practice is not that trivial. There are questions to be
answered, such as the following:
• Do we need a memory barrier on lock and/or unlock? If so, which one? Intel 64, for
example, has lfence, sfence, and mfence.
• How do we ensure that the flag modification is atomic? On Intel 64, for example, the
xchg instruction suffices (with the lock prefix in the case of multiple processors).
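Using C11 atomics (introduced later in section 17.12), a toy spinlock addressing these questions might be sketched as follows. The default sequentially consistent ordering of atomic_flag_test_and_set acts as a full barrier, answering the first question for this sketch; real implementations add backoff, pause hints, and fairness:

```c
#include <stdatomic.h>

typedef struct {
    atomic_flag locked;   /* clear = free, set = held */
} spinlock;

void spin_lock(spinlock *s) {
    /* test_and_set atomically sets the flag and returns its
       previous value; we spin until the previous value is clear. */
    while (atomic_flag_test_and_set(&s->locked))
        ;  /* busy-wait */
}

void spin_unlock(spinlock *s) {
    atomic_flag_clear(&s->locked);
}
```

A lock is created as `spinlock s = { ATOMIC_FLAG_INIT };` and then used with spin_lock/spin_unlock around the critical section.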
pthreads provides us with a carefully designed and portable spinlock mechanism. For more
information, refer to the man pages for the following functions:
• pthread_spin_init
• pthread_spin_destroy
• pthread_spin_lock
• pthread_spin_unlock
17.9 Semaphores
A semaphore is a shared integer variable on which three actions can be performed.
• Initialization with an argument N. Sets its value to N.
• Wait (enter). If the value is not zero, decrements it. Otherwise waits until someone
else increments it, and then proceeds with the decrement.
• Post (leave). Increments its value.
The value of this variable, which is not directly accessible, obviously cannot fall below 0.
Semaphores are not part of the pthreads specification; we are working with semaphores whose interface
is described in the POSIX standard. However, code that uses semaphores should still be compiled with the
-pthread flag.
Most Unix-like operating systems implement both standard pthreads features and semaphores.
Semaphores are commonly used to perform synchronization between threads.
Listing 17-17 shows an example of semaphore usage.
sem_t sem;
uint64_t counter1 = 0;
uint64_t counter2 = 0;
int main(void) {
sem_init( &sem, 0, 0 );
sleep( 1 );
pthread_create( &t1, NULL, t1_impl, NULL );
pthread_create( &t2, NULL, t2_impl, NULL );
sem_destroy( &sem );
pthread_exit( NULL );
return 0;
}
The sem_init function initializes the semaphore. Its second argument is a flag: 0 corresponds to a
process-local semaphore (which can be used by different threads), while a non-zero value sets up a semaphore
visible to multiple processes.3 The third argument sets the initial semaphore value. A semaphore is
deleted using the sem_destroy function. In the example, two counters and three threads are created.
Threads t1 and t2 increment the respective counters to 1000000 and 20000000 and then increment the
semaphore value sem by calling sem_post. t3 blocks by decrementing the semaphore value twice. Then,
once the semaphore has been incremented twice by the other threads, t3 prints the counters to stdout.
The pthread_exit call ensures that the main thread will not terminate prematurely, before all other
threads finish their work.
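A condensed, runnable sketch of this pattern (POSIX unnamed semaphores, assuming Linux; iteration counts reduced here, and the waiting done directly by the calling thread rather than a separate t3):

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>

static sem_t sem;
static uint64_t counter1, counter2;

static void *t1_impl(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) counter1++;
    sem_post(&sem);            /* announce completion */
    return NULL;
}

static void *t2_impl(void *arg) {
    (void)arg;
    for (int i = 0; i < 200000; i++) counter2++;
    sem_post(&sem);
    return NULL;
}

/* Waits for both workers, then returns the counters' sum. */
uint64_t sem_demo(void) {
    pthread_t t1, t2;
    counter1 = counter2 = 0;
    sem_init(&sem, 0, 0);      /* process-local, initial value 0 */
    pthread_create(&t1, NULL, t1_impl, NULL);
    pthread_create(&t2, NULL, t2_impl, NULL);
    sem_wait(&sem);            /* t3's role in the book: wait twice */
    sem_wait(&sem);
    uint64_t sum = counter1 + counter2;
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&sem);
    return sum;
}
```

After the two sem_wait calls return, both posts must have happened, so both loops are complete and the counters can be read safely.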
Semaphores come in handy for such tasks as
• Forbidding more than n processes to simultaneously execute a code section.
• Making one thread wait for another to complete a specific action, thus imposing an
order on their actions.
• Keeping no more than a fixed number of worker threads performing a certain task in
parallel. More threads than needed might decrease performance.
A semaphore with two states is not fully analogous to a mutex: unlike a mutex, which can
only be unlocked by the same thread that locked it, a semaphore can be changed freely by any thread.
We will see another example of semaphore usage in Listing 17-18, where semaphores make two threads
start each loop iteration simultaneously (after executing the loop body, each waits for the other to finish the
iteration).
Operations on semaphores act as both compiler and hardware memory barriers.
For more information on semaphores, refer to the man pages for the following functions:
• sem_close
• sem_destroy
• sem_getvalue
• sem_init
• sem_open
• sem_post
• sem_unlink
• sem_wait
■ Question 364 What is a named semaphore? Why must it be unlinked explicitly, even if the process is
terminated?
3
In this case, the semaphore itself will be placed in a shared page, which will not be physically duplicated after
the fork() system call is performed.
sem_wait( &sem_begin0 );
x = 1;
// This only disables compiler reorderings:
asm volatile("" ::: "memory");
sem_post( &sem_end );
}
return NULL;
};
sem_wait( &sem_begin1 );
y = 1;
// This only disables compiler reorderings:
asm volatile("" ::: "memory");
// The following line disables also hardware reorderings
// asm volatile("mfence" ::: "memory");
read1 = x;
sem_post( &sem_end );
}
return NULL;
};
sem_wait( &sem_end );
sem_wait( &sem_end );
It might seem magical, but what we observe here is a level even lower than assembly language, and it
introduces rarely observed (but persistent) bugs in software. Such bugs in multithreaded
software are very hard to catch. Imagine a bug appearing only after four months of uninterrupted execution,
which corrupts the heap and crashes the program 42 allocations after it triggers! Writing high-
performance multithreaded software in a lock-free manner requires tremendous expertise.
So, what we need to do is add an mfence instruction. Replacing the compiler barrier with a full memory
barrier asm volatile( "mfence":::"memory"); solves the problem: no reorderings are detected, no matter
how many iterations we try.
char buf[1024];
uint64_t* data = (uint64_t*)(buf + 1);
/* atomic write */
global_aligned_var = 0;
void f(void) {
/* atomic read */
int64_t local_variable = global_aligned_var;
}
These cases are architecture-specific. We also want to perform more complex operations atomically
(e.g., incrementing a counter). To perform them safely without using mutexes, engineers invented
interesting basic operations, such as compare-and-swap (CAS). Once this operation is implemented as a
machine instruction on a specific architecture, it can be used in combination with the more trivial non-atomic
reads and writes to implement many lock-free algorithms and data structures.
A CAS instruction acts as an atomic sequence of operations, described by the following equivalent C
function:
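A sketch of such an equivalent function (parameter names assumed). Keep in mind that the hardware instruction executes the whole body as one indivisible step; this plain C version only conveys the semantics and is not itself atomic:

```c
#include <stdbool.h>
#include <stdint.h>

/* If *location holds the expected value, replace it with desired
   and report success; otherwise leave it untouched and report
   failure. Hardware CAS performs all of this atomically. */
bool cas(uint64_t *location, uint64_t expected, uint64_t desired) {
    if (*location == expected) {
        *location = desired;
        return true;    /* the swap happened */
    }
    return false;       /* someone changed the value first */
}
```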
A shared counter which you read and then write back with a modified value is a typical case where we
need a CAS instruction, to perform an atomic increment or decrement. Listing 17-19 shows a function to
perform it.
This example shows a typical pattern seen in many CAS-based algorithms. They read a certain memory
location, compute a modified value, and repeatedly try to swap the new value back, which succeeds only if
the current memory value equals the old one. The swap fails if the memory location has been modified by
another thread; then the whole read-modify-write cycle is repeated.
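The pattern can be sketched with C11 atomics (a stand-in for the book's own CAS primitive; compilers lower this loop to lock cmpxchg on Intel 64):

```c
#include <stdatomic.h>
#include <stdint.h>

void atomic_increment(_Atomic uint64_t *counter) {
    uint64_t old = atomic_load(counter);
    /* On failure, atomic_compare_exchange_weak refreshes `old`
       with the current value, and the whole read-modify-write
       cycle is retried with a freshly computed old + 1. */
    while (!atomic_compare_exchange_weak(counter, &old, old + 1))
        ;
}
```

The weak variant is appropriate here because a spurious failure simply costs one extra trip around the retry loop.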
Intel 64 implements the CAS instructions cmpxchg, cmpxchg8b, and cmpxchg16b. In the case of multiple
processors, they also require the lock prefix.
The instruction cmpxchg is of particular interest. It accepts two operands: register or memory, and a
register. It compares rax4 with the first operand. If they are equal, the zf flag is set and the second operand's
value is loaded into the first. Otherwise, the actual value of the first operand is loaded into rax and zf is cleared.
These instructions can be used as part of the implementation of mutexes and semaphores.
As we will see in section 17.12.2, there is now a standard-compliant way of using compare-and-swap
operations (as well as manipulating atomic variables). We recommend sticking to it to avoid non-
portable code, and using atomics whenever you can. When you need complex operations to be performed
atomically, use mutexes, or stick with lock-free data structure implementations written by experts: writing
lock-free data structures has proven to be a challenge.
17.12.2 Atomics
An important C11 feature that can be used to write fast multithreaded programs is atomics (see section 7.17
of [7]). These are special variable types which can be modified atomically. To use them, include the header
stdatomic.h.
4
Or eax, ax, or al, depending on the operand size.
_Atomic(int) counter;
_Atomic transforms the name of a type into the name of an atomic type. Alternatively, you can use the
atomic types directly as follows:
atomic_int counter;
A full correspondence between _Atomic(T) and atomic_T direct type forms can be found in section
7.17.6 of [7].
Atomic local variables should not be initialized directly; instead, the macro ATOMIC_VAR_INIT should be
used. This is understandable, because on some architectures with fewer hardware capabilities each such variable
must be associated with a mutex, which has to be created and initialized as well. Global atomic variables are
guaranteed to be in a correct initial state. ATOMIC_VAR_INIT should be used when the variable declaration is
coupled with initialization; if you want to initialize the variable later, use the atomic_init macro.
void f(void) {
/* Initialization during declaration */
atomic_int x = ATOMIC_VAR_INIT( 42 );
atomic_int y;
/* initialization later */
atomic_init( &y, 42 );
}
It is your responsibility to guarantee that the atomic variable initialization ends before anything else is
done with it. In other words, concurrent access to the variable being initialized is a data race.
Atomic variables should only be manipulated through the interface defined in the language standard. It
consists of several operations, such as load, store, and exchange. Each of them exists in two versions.
• An explicit version, which accepts an extra argument describing the memory
ordering. Its name ends with _explicit. For example, the load operation is
17.12.4 Operations
The following operations can be performed on atomic variables (T denotes the non-atomic type; U refers to
the type of the other argument for arithmetic operations; for all types except pointers it is the same as T, and
for pointers it is ptrdiff_t).
bool atomic_compare_exchange_strong(
volatile _Atomic(T)* object, T * expected, T desired);
bool atomic_compare_exchange_weak(
volatile _Atomic(T)* object, T * expected, T desired);
All these operations can be used with an _explicit suffix to provide a memory ordering as an
additional argument.
Load and store functions need no further explanation; we will discuss the others briefly.
atomic_exchange is a combination of load and store: it replaces the value of an atomic variable with
desired and returns its old value.
The fetch_op family of operations is used to atomically change an atomic variable's value. Imagine you need
to increment an atomic counter. Without fetch_add this is impossible to do atomically, since in order to
increment the variable you need to add one to its old value, which you have to read first. The operation takes
three steps: reading, addition, writing. Other threads may interfere between these stages, which destroys atomicity.
atomic_compare_exchange_strong is preferred to its weak counterpart, since the weak version can fail
spuriously; on the other hand, the weak version performs better on some platforms.
The atomic_compare_exchange_strong function is roughly equivalent to the following pseudo code:
if ( *object == *expected )
*object = desired;
else
*expected = *object;
As we see, this is the typical CAS operation that was discussed in section 17.11.
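The behavior described by the pseudo code can be checked directly with a small sketch: on success the object changes; on failure expected is updated to the actual value.

```c
#include <stdatomic.h>
#include <stdbool.h>

bool cas_semantics_demo(void) {
    atomic_int object = ATOMIC_VAR_INIT(5);
    int expected = 5;

    /* Correct guess: the exchange succeeds and stores 7. */
    bool ok = atomic_compare_exchange_strong(&object, &expected, 7);
    if (!ok || atomic_load(&object) != 7) return false;

    /* Wrong guess on purpose: the exchange fails, the object is
       untouched, and expected now holds the real value (7). */
    expected = 0;
    ok = atomic_compare_exchange_strong(&object, &expected, 9);
    return !ok && expected == 7 && atomic_load(&object) == 7;
}
```

The failure path is what makes the retry loops of section 17.11 work: the refreshed expected value feeds directly into the next attempt.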
The atomic_is_lock_free macro is used to check whether a specific atomic variable uses locks or not.
Remember that without explicit memory ordering, all these operations are assumed to be
sequentially consistent, which on Intel 64 means mfence instructions all over the code. This can be a huge
performance killer.
A shared Boolean flag has a special type named atomic_flag. It has two states: set and clear. Operations
on it are guaranteed to be atomic without using locks.
The flag should be initialized with the ATOMIC_FLAG_INIT macro.
The relevant functions are atomic_flag_test_and_set and atomic_flag_clear, both of which have
_explicit counterparts accepting memory ordering descriptions.
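A minimal sketch of these calls (the function name is ours): test_and_set atomically sets the flag and returns its previous state, which is exactly the primitive a spinlock needs.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag flag = ATOMIC_FLAG_INIT;   /* starts clear */

bool flag_demo(void) {
    bool was_set_first  = atomic_flag_test_and_set(&flag);
    bool was_set_second = atomic_flag_test_and_set(&flag);
    atomic_flag_clear(&flag);
    /* The first call finds the flag clear, the second finds it set. */
    return !was_set_first && was_set_second;
}
```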
17.13 Summary
In this chapter we have studied the basics of multithreaded programming. We have seen different
memory models and the problems that emerge from the fact that compiler and hardware optimizations mess
with the instruction execution order. We have learned how to control them by placing different memory
barriers, and we have seen why volatile is not a solution to the problems that emerge from multithreading.
Then we introduced pthreads, the most common standard for writing multithreaded applications on Unix-like
systems. We have practiced thread management, used mutexes and condition variables, and learned why
spinlocks only make sense on multicore and multiprocessor systems. We have seen how memory
reorderings should be taken into account even when working on an unusually strong architecture such as Intel
64, and we have seen the limits of its strictness. Finally, we have studied atomic variables, a very useful
feature of C11 that allows us to get rid of explicit mutex usage and, in many cases, boost performance while
maintaining correctness. Mutexes are still important when we want to perform complex manipulations on
non-trivial data structures.
■ Question 377 What are the arguments against usage of volatile variables?
■ Question 378 What is a memory barrier?
■ Question 382 What is a data dependency? Can you write code where data dependency does not force an
order on operations?
■ Question 383 What is the difference between mfence, sfence, and lfence?
■ Question 391 Why should the -pthread flag be used when compiling with pthreads?
■ Question 394 Can one thread get access to the stack of the other thread?
■ Question 398 Should every shared mutable variable which is never changed be associated with a mutex?
■ Question 399 Should every shared mutable variable which is changed be associated with a mutex?
■ Question 400 Can we work with a shared variable without ever using a mutex?
■ Question 401 What is a deadlock?
■ Question 408 Which guarantees does Intel 64 provide for memory reorderings?
■ Question 409 Which important guarantees does Intel 64 not provide for memory reorderings?
■ Question 410 Correct the program shown in Listing 17-18 so that no memory reordering occurs.
■ Question 411 Correct the program shown in Listing 17-18 so that no memory reordering occurs by using
atomic variables.
■ Question 412 What is lock-free programming? Why is it harder than traditional multithreaded
programming with locks?
■ Question 413 What is a CAS operation? How can it be implemented in Intel 64?
■ Question 414 How strong is the C memory model?
■ Question 420 How are the atomic variables manipulation functions with _explicit suffix different from
their ordinary counterparts?
PART IV
Appendices
CHAPTER 18
Appendix A. Using gdb
The debugger is a very powerful instrument at your disposal. It allows executing programs step by step
and monitoring their state, including register values and memory contents. In this book we are using the
debugger called gdb. This appendix is an introduction aimed to ease your first steps with it.
Debugging is the process of finding bugs and studying program behavior. In order to do that, we usually
perform single steps, observing the part of the program's state that interests us. We can also run the
program until a certain condition is met or a position in the code is reached. Such a position in the code is
called a breakpoint.
Let us study the sample program shown in Listing 18-1. We have already seen it in Chapter 2. This code
prints the rax register contents to stdout.
section .text
global _start
_start:
mov rax, 0x1122334455667788
mov rdi, 1
mov rdx, 1
mov rcx, 64
.loop:
push rax
sub rcx, 4
sar rax, cl
and rax, 0xf
push rcx
syscall
pop rcx
pop rax
test rcx, rcx
jnz .loop
We are going to compile an executable file print_rax from it and launch gdb.
gdb has its own command system, and all interaction with it happens through these commands. So
whenever gdb is launched and you see its command prompt (gdb), you can type commands and it will
react accordingly.
You can load an executable file by issuing the file command and then typing the filename, or by passing
the filename as an argument when launching gdb.
The <tab> key in the gdb command prompt performs autocompletion. Many commands
also have shorthands.
The two most important commands are
• quit to quit gdb.
• help cmd to show help for the command cmd.
The ~/.gdbinit file stores commands that will be executed automatically when gdb starts. Such a file
can be created in the current directory as well, but for security reasons this feature is disabled by default.
■ Note To enable loading the .gdbinit file from any directory, add the following line to the ~/.gdbinit file in
your home directory:
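Assuming a stock gdb build, the directive in question is the auto-load safe path setting, for example:

```
set auto-load safe-path /
```

This whitelists every directory; a more cautious choice is to list only the project directories you trust.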
By default, gdb uses AT&T assembly syntax. In this book we stick to Intel syntax; to change gdb's default
preference regarding assembly syntax, add the following line to the ~/.gdbinit file:
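The standard directive for this is:

```
set disassembly-flavor intel
```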
CHAPTER 18 ■ APPENDIX A. USING GDB
We stopped at the breakpoint that we placed at the _start label. Let us switch into pseudo-graphical
mode using the following commands:
layout asm
layout regs
FMT (used by the print and x commands) is an encoded format description. It lets us explicitly choose
the data type used to interpret the memory contents.
FMT consists of a format letter and a size letter. The most useful format letters are
• x (hexadecimal)
• a (address)
• i (instruction, tries to perform disassembly)
• c (char)
• s (null-terminated string)
The most useful size letters are b (byte) and g (giant, 8 bytes).
To take the address of a variable, use the & symbol. The examples will show when it is handy.
Following are some examples based on the program shown in Listing 18-1:
• Displaying the rax contents:
(gdb) x /i &_start
0x4000b0 <_start>: movabs rax,0x1122334455667788
(gdb) x /i $rip
=> 0x4000e9 <_start.loop+32>: jne 0x4000c9 <_start.loop>
• Checking the contents of codes. The /FMT part of the x command can start with an
element count. In our case, /12cb stands for "12 characters, one byte each."
(gdb) x /x $rsp
0x7fffffffdf90: 0x01
To use gdb with C programs productively, remember to always use the -ggdb compiler option. It
generates additional information that gdb can use, such as the .line section or symbols for local variables.
An appropriate layout for working with C code is src; type layout src to switch to it. Figure 18-2 depicts
this layout.
Another useful skill is studying and navigating the call stack. Each time a function is called,
it uses a part of the stack to store its local variables. To demonstrate navigation we are going to use the simple
program shown in Listing 18-2.
Then we place a breakpoint at the function g and run the program as follows:
(gdb) break g
Breakpoint 1 at 0x400531: file call_stack.c, line 5.
(gdb) run
Starting program: .../call_stack
We want to see which functions are currently launched. The backtrace command is the way to do it.
(gdb) backtrace
#0 g (garg=44) at call_stack.c:4
#1 0x0000000000400561 in f (farg=42) at call_stack.c:10
#2 0x0000000000400572 in main () at call_stack.c:14
There are three stack frames that gdb is aware of, and we can switch between them using the frame
<idx> command.
Our current state is depicted in Figure 18-3. We know that the function f has called function g,
as the backtrace says, so that instance of f should have a local variable flocal. We want to know its value.
If we try to print it right away, gdb complains that no such variable exists. However, if we first select the
appropriate stack frame using the frame 1 command, we gain access to all its local variables.
Figure 18-4 depicts this change.
(gdb) frame 1
#1 0x0000000000400561 in f (farg=42) at call_stack.c:10
(gdb) print farg
$3 = 42
Beyond that, gdb supports evaluating expressions with common arithmetic operations, launching
functions, writing automation scripts in Python, and much more.
For further reading, consult [1].
CHAPTER 19
Appendix B. Using make
This appendix will introduce you to the most basic notions of writing Makefiles. For more information refer
to [2].
To build a program you might need to perform multiple actions: launch the compiler with the right flags
(probably for each source file) and use the linker. Sometimes you also have to launch scripts written to generate
source code files. At times the program consists of several parts written in different programming
languages!
Moreover, if you changed only a part of the program, you might not want to rebuild everything, but
only those parts that depend on the changed source file. Huge programs can take hours of CPU (central
processing unit) time to build!
In this book we are going to use GNU Make. It is a common tool used to control the generation of
artifacts such as executable files, dynamic libraries, resource files, etc.
<target> : <prerequisites>
[tab] <recipe>
A rule describes how to generate a specific file, which is the rule's <target>. The <prerequisites>
describe which other targets should be generated first.
A recipe consists of one or more actions to be carried out by make. Every recipe line must be
preceded by a [tab] character!
Let us say we have a simple program consisting of two assembly files: main.asm and lib.asm. We want
to produce an object file for each of them and then link these into an executable program.
Listing 19-1 shows an example of a simple Makefile.
program: main.o lib.o
	ld -o program main.o lib.o
lib.o: lib.asm
	nasm -f elf64 -o lib.o lib.asm
main.o: main.asm
	nasm -f elf64 -o main.o main.asm
clean:
	rm main.o lib.o program
When the Makefile with these contents is created, executing make in the same directory will launch
the recipe for the first target described. If a target named all is present, its recipe will be executed instead.
Otherwise, typing make targetname will execute the recipe for the target targetname.
The target program should produce the file program. To do it we should build files main.o and lib.o
first. If we change the file main.o and launch make again, only main.o will be rebuilt before refreshing
program, but not lib.o. The same mechanism forces rebuilding lib.o when lib.asm is changed.
So, a recipe is launched when there is no file corresponding to the target name, or when that file should be
regenerated (because one of its dependencies has been updated).
Traditionally, every Makefile has a target named clean to get rid of all produced files, leaving only the
sources. Targets such as clean are called phony targets, because they do not correspond to an actual
file. It is best to enumerate them as prerequisites of a special .PHONY target as follows:
.PHONY: clean help
clean:
	rm -f *.o
help:
	echo 'This is the help'
Variables are defined as follows:
variable = value
They are not the same thing as environment variables such as PWD. Their values are substituted using a
dollar sign and a pair of parentheses as follows:
$(variable)
CHAPTER 19 ■ APPENDIX B. USING MAKE
Now, we are going to use the variables in at least the following cases:
• To abstract the compiler (we will be able to easily switch between Clang, GCC, MSVC,
or whatever else compiler as long as they support the same set of flags).
• To abstract the compilation flags.
Traditionally, in the case of C, these variables are named:
• CC for “C compiler.”
• CFLAGS for “C compiler flags.”
• LD for “link editor” (linker).
•	AS for "assembly language compiler."
•	ASFLAGS for "assembly language compiler flags."
An additional benefit is that whenever we want to choose compilation flags we only need to do it in one
place. Listing 19-2 shows the modified Makefile.
AS = nasm
ASFLAGS = -f elf64
LD = ld

program: lib.o main.o
	$(LD) -o program lib.o main.o
lib.o: lib.asm
	$(AS) $(ASFLAGS) -o lib.o lib.asm
main.o: main.asm
	$(AS) $(ASFLAGS) -o main.o main.asm
clean:
	rm main.o lib.o program
.PHONY: clean
For example, variables can reference other variables or even be left empty:
EMPTYVAR =
INCLUDEDIR = include
CFLAGS = -c -std=c99 -I$(INCLUDEDIR) -ggdb -Wno-attributes
Target names support the wildcard symbol %. There should be at most one such wildcard in a target name.
The substring that % matches is called the stem. The occurrences of % in prerequisites are replaced with
exactly the stem. For example, this rule
%.o : %.c
	echo "Building an object file"
specifies how to build any object file from a .c file with the matching name. However, right now we do not
know how to use these rules, because once we try to write a command to compile the file we face a problem:
we do not know the exact names of the files involved, and the stem is inaccessible inside the recipe. The
automatic variables solve this problem.
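Inside a recipe, the automatic variable $@ expands to the target name, $< to the first prerequisite, and $^ to the entire prerequisite list. As a minimal illustrative sketch (the file names here are hypothetical):

```makefile
main: main.o lib.o
	gcc $^ -o $@          # $^ = main.o lib.o, $@ = main

%.o: %.c
	gcc -c $< -o $@       # $< = the .c file matching the stem
```

With these two rules, make main builds whatever object files main depends on, each from the .c file that shares its stem.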
all: main
clean:
	rm -f *.o main
.PHONY: clean
> tree
.
lib.c
lib.h
main.c
main.h
Makefile
0 directories, 8 files
> make
gcc -std=c11 -Wall -c main.c -o main.o
gcc -std=c11 -Wall -c lib.c -o lib.o
gcc main.o lib.o -o main
Refer to the well-written GNU Make Manual [2] for further instructions.
CHAPTER 20

Appendix C. System Calls

Throughout this book we have used several system calls; this appendix gathers information about them.
■ Note It is always a good idea to read the man pages first, for example, man -s 2 write.
The exact flag and parameter values vary from system to system and should never be written as
immediate values. If you write in C, use the relevant headers (shown in the man pages for the system
call of interest). If you write in assembly, you will have to use LXR or another online system with
annotated kernel code, or look through these C headers yourself and create your own corresponding %defines.
The values provided are valid for the following system:
> uname -a
Linux 3.16-2-amd64 #1 SMP Debian 3.16.3-2 (2014-09-20) x86_64 GNU/Linux
Issuing a system call in assembly is simple: just load the correct parameter values into the relevant
registers (in any order) and execute the syscall instruction. If you need flags, you should define them
on your own first; we have provided their exact values.
Remember that NASM can also compute constant expressions, such as O_TRUNC|O_RDWR.
Issuing a system call in C is usually done like calling a function whose declaration is provided in some
include files.
■ Note In C, never use flag values directly, for example, substituting 0x1000 for O_APPEND. Use the defines
provided in the header files: they are both more readable and more portable. Since we have no
corresponding assembly headers, we have to define the flags by hand in the assembly files.
20.1 read
ssize_t read(int fd, void *buf, size_t count);
20.1.1 Arguments
1. fd File descriptor to read from: 0 for stdin; use the open system call to open a
file by name.
2. buf The address of the first byte in a sequence of bytes. The received bytes will
be placed there.
3. count We will attempt to read that many bytes.
Returns rax = number of bytes successfully read, -1 on error.
Includes to use in C:
#include <unistd.h>
20.2 write
ssize_t write(int fd, const void *buf, size_t count);
20.2.1 Arguments
1. fd File descriptor to write to: 1 for stdout, 2 for stderr; use the open system
call to open a file by name.
2. buf The address of the first byte in a sequence of bytes to be written.
3. count We will attempt to write that many bytes.
Returns rax = number of bytes successfully written, -1 on error.
Includes to use in C:
#include <unistd.h>
20.3 open
int open(const char *pathname, int flags, mode_t mode);
CHAPTER 20 ■ APPENDIX C. SYSTEM CALLS
20.3.1 Arguments
1. pathname Name of the file to be opened (a null-terminated string).
2. flags Described below; flags can be combined using |, for example,
O_CREAT|O_WRONLY|O_TRUNC.
3. mode An integer encoding permissions for the user, the group, and all others,
similar to those used by the chmod command.
Returns rax = new file descriptor for the given file, -1 on error.
Includes to use in C:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
20.3.2 Flags
• O_APPEND = 0x1000
Append to a file on each write.
• O_CREAT = 0x40
Create a new file.
• O_TRUNC = 0x200
If the file already exists, is a regular file, and the access mode allows writing, it
will be truncated to length 0.
• O_RDWR = 2
Read and write.
• O_WRONLY = 1
Write only.
• O_RDONLY = 0
Read only.
20.4 close
int close(int fd);
20.4.1 Arguments
1. fd A valid file descriptor to be closed.
Returns rax = zero on success, -1 on error. The global variable errno holds the error code.
Includes to use in C:
#include <unistd.h>
20.5 mmap
void *mmap(
void *addr, size_t length,
int prot, int flags,
int fd, off_t offset);
Description Maps pages of the virtual address space to something: anything that lies behind a
"file" (devices, files on disk, etc.) or just physical memory. In the latter case, the pages are anonymous:
they correspond to nothing present in the file system. Such pages hold the heap and the stacks of a process.
20.5.1 Arguments
1. addr A hint for the starting virtual address of the freshly mapped region. We try
to map at this address, and if we cannot, we let the operating system (OS)
choose one. If 0, the address is always chosen by the OS.
2. length Length of the mapped region in bytes.
3. prot Protection flags (see below). They can be combined using |.
4. flags Behavior flags (see below). They can be combined using |.
5. fd A valid file descriptor for the file to be mapped; ignored if the MAP_ANONYMOUS
behavior flag is used.
6. offset Starting offset in the file fd. We skip all bytes prior to this offset and map the
file starting there. Ignored if the MAP_ANONYMOUS behavior flag is used.
Returns rax = pointer to the mapped area, -1 on error.
Includes to use in C:
#include <sys/mman.h>
■ Note To be able to use the MAP_ANONYMOUS flag you might need to define the _DEFAULT_SOURCE macro
immediately before including the relevant header file, as follows:
#define _DEFAULT_SOURCE
#include <sys/mman.h>
20.6 munmap
int munmap(void *addr, size_t length);
Description Unmaps a region of memory of a given length. You can map a huge region using mmap and
then unmap a fraction of it using munmap.
20.6.1 Arguments
1. addr Start of the region to unmap.
2. length Length of the region to unmap.
Returns rax = zero on success, -1 on error. The global variable errno holds the error code.
Includes to use in C:
#include <sys/mman.h>
20.7 exit
void _exit(int status);
20.7.1 Arguments
1. status Exit code. It becomes available in the shell variable $?.
Returns Nothing.
Includes to use in C:
#include <unistd.h>
CHAPTER 21

Appendix D. Performance Tests Information

> uname -a
Linux 3.16-2-amd64 #1 SMP Debian 3.16.3-2 (2014-09-20) x86_64 GNU/Linux
> cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 69
model name : Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
stepping : 1
microcode : 0x1d
cpu MHz : 2394.458
cache size : 3072 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic
sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse
sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon
pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf
pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe
popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat
epb pln pts dtherm fsgsbase smep
bogomips : 4788.91
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 69
model name : Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
stepping : 1
microcode : 0x1d
cpu MHz : 2394.458
cache size : 3072 KB
physical id : 2
siblings : 1
core id : 0
cpu cores : 1
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic
sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse
sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon
pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf
pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe
popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat
epb pln pts dtherm fsgsbase smep
bogomips : 4788.91
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
> cat /proc/meminfo
MemTotal: 1017348 kB
MemFree: 516672 kB
MemAvailable: 565600 kB
Buffers: 32756 kB
Cached: 114944 kB
SwapCached: 10044 kB
Active: 376288 kB
Inactive: 49624 kB
Active(anon): 266428 kB
Inactive(anon): 12440 kB
Active(file): 109860 kB
CHAPTER 21 ■ APPENDIX D. PERFORMANCE TESTS INFORMATION
Inactive(file): 37184 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 901116 kB
SwapFree: 868356 kB
Dirty: 44 kB
Writeback: 0 kB
AnonPages: 270964 kB
Mapped: 43852 kB
Shmem: 648 kB
Slab: 45980 kB
SReclaimable: 29016 kB
SUnreclaim: 16964 kB
KernelStack: 4192 kB
PageTables: 6100 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 1409788 kB
Committed_AS: 1212356 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 145144 kB
VmallocChunk: 34359590172 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 49024 kB
DirectMap2M: 999424 kB
DirectMap1G: 0 kB
CHAPTER 22
Bibliography
[15] Intel Corporation. Intel® 64 and IA-32 architectures software developer’s manual.
Available: www.intel.com/content/dam/www/public/us/en/documents/
manuals/64-ia-32-architectures-software-developer-manual-325462.pdf.
September 2014.
[16] Intel Corporation. Intel® 64 and IA-32 architectures optimization reference
manual. Available: www.intel.com/content/www/us/en/architecture-and-
technology/64-ia-32-architectures-optimization-manual.html. June 2016.
[17] David Kanter. Intel’s Haswell CPU microarchitecture. Available:
www.realworldtech.com/haswell-cpu/1.
[18] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language. Prentice
Hall Professional Technical Reference, 2nd edition, 1988.
[19] Petter Larsson and Eric Palmer. Image processing acceleration techniques using
Intel Streaming SIMD Extensions and Intel Advanced Vector Extensions.
January 2010.
[20] Doug Lea. A memory allocator. http://g.oswego.edu/dl/html/malloc.html.
2000.
[21] Michael E. Lee. Optimization of computer programs in C. Available:
http://leto.net/docs/C-optimization.php.
[22] Chris Lomont. Fast inverse square root. Technical report, February 2003.
[23] Robert Love. Linux Kernel Development. Novell Press, 2005.
[24] Michael Matz, Jan Hubicka, Andreas Jaeger, and Mark Mitchell. System V
Application Binary Interface. AMD64 Architecture Processor Supplement. Draft
version 0.99.6, 2013.
[25] Paul E. McKenney. Memory barriers: a hardware view for software hackers.
Linux Technology Center, IBM Beaverton, 2010.
[26] Pawel Moll. How do debuggers (really) work? In Embedded Linux Conference
Europe, October 2015. http://events.linuxfoundation.org/sites/events/
files/slides/slides_16.pdf.
[27] The netwide assembler: NASM manual. Available: www.nasm.us/doc/.
[28] N. N. Nepeyvoda and I. N. Skopin. Foundations of programming. RHD
Moscow-Izhevsk, 2003.
[29] Benjamin C. Pierce. Types and programming languages. Cambridge, MA: MIT
Press, 1st ed. 2002.
[30] Jeff Preshing. The purpose of memory order consume in c++11. http://preshing.
com/20140709/the-purpose-of-memory_order_consume-in-cpp11/. 2014.
[31] Jeff Preshing. Weak vs. strong memory models http://preshing.com/20120930/
weak-vs-strong-memory-models/. 2012.
[32] Brad Rodriguez. Moving forth: a series on writing Forth kernels. http://www.
bradrodriguez.com/papers/moving1.html. The Computer Journal #59
(January/February 1993).
[33] Uresh Vahalia. UNIX Internals: The New Frontiers. Dorling Kindersley Pvt.
Limited, 2008.
[34] Anthony Williams. C++ concurrency in action: practical multithreading.
Shelter Island, NY: Manning. 2012.
[35] Glynn Winskel. The formal semantics of programming languages: an
introduction. Cambridge, MA: MIT Press. 1993.
427