Notes On Operating Systems
Dror G. Feitelson
© 2011
Contents
I Background 1
1 Introduction 2
1.1 Operating System Functionality . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Abstraction and Virtualization . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Hardware Support for the Operating System . . . . . . . . . . . . . . . . 8
1.4 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Scope and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
II The Classics 23
2 Processes and Threads 24
2.1 What Are Processes and Threads? . . . . . . . . . . . . . . . . . . . . . . 24
2.1.1 Processes Provide Context . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.2 Process States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.3 Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.4 Operations on Processes and Threads . . . . . . . . . . . . . . . . 34
2.2 Multiprogramming: Having Multiple Processes in the System . . . . . . 35
2.2.1 Multiprogramming and Responsiveness . . . . . . . . . . . . . . . 35
2.2.2 Multiprogramming and Utilization . . . . . . . . . . . . . . . . . . 38
2.2.3 Multitasking for Concurrency . . . . . . . . . . . . . . . . . . . . . 40
2.2.4 The Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3 Scheduling Processes and Threads . . . . . . . . . . . . . . . . . . . . . . 41
2.3.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3.2 Handling a Given Set of Jobs . . . . . . . . . . . . . . . . . . . . . 43
2.3.3 Using Preemption . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.3.4 Priority Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.3.5 Starvation, Stability, and Allocations . . . . . . . . . . . . . . . . . 51
2.3.6 Fair Share Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3 Concurrency 64
3.1 Mutual Exclusion for Shared Data Structures . . . . . . . . . . . . . . . 65
3.1.1 Concurrency and the Synchronization Problem . . . . . . . . . . . 65
3.1.2 Mutual Exclusion Algorithms . . . . . . . . . . . . . . . . . . . . . 67
3.1.3 Semaphores and Monitors . . . . . . . . . . . . . . . . . . . . . . . 73
3.1.4 Locks and Disabling Interrupts . . . . . . . . . . . . . . . . . . . . 76
3.1.5 Multiprocessor Synchronization . . . . . . . . . . . . . . . . . . . . 79
3.2 Resource Contention and Deadlock . . . . . . . . . . . . . . . . . . . . . . 80
3.2.1 Deadlock and Livelock . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2.2 A Formal Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.2.3 Deadlock Prevention . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.2.4 Deadlock Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.2.5 Deadlock Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.2.6 Real Life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.3 Lock-Free Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4 Memory Management 95
4.1 Mapping Memory Addresses . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.2 Segmentation and Contiguous Allocation . . . . . . . . . . . . . . . . . . 97
4.2.1 Support for Segmentation . . . . . . . . . . . . . . . . . . . . . . . 98
4.2.2 Algorithms for Contiguous Allocation . . . . . . . . . . . . . . . . 100
4.3 Paging and Virtual Memory . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.3.1 The Concept of Paging . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.3.2 Benefits and Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.3.3 Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.3.4 Algorithms for Page Replacement . . . . . . . . . . . . . . . . . . . 114
4.3.5 Disk Space Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.4 Swapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Introduction
In the simplest scenario, the operating system is the first piece of software to
run on a computer when it is booted. Its job is to coordinate the execution of
all other software, mainly user applications. It also provides various common
services that are needed by users and applications.
[Figure: a simple layered view — the user on top of the application, the application on top of the operating system, and the operating system on top of the hardware.]
This is a misleading picture, because applications mostly execute machine
instructions that do not go through the operating system. A better picture is:
[Figure: many applications running on top of one operating system and one hardware base; the applications interact with the operating system through system calls, and with the hardware through non-privileged machine instructions.]
where we have used a 3-D perspective to show that there is one hardware
base, one operating system, but many applications. It also shows the
important interfaces: applications can execute only non-privileged machine
instructions, and they may also call upon the operating system to perform
some service for them. The operating system may use privileged instructions
that are not available to applications. And in addition, various hardware
devices may generate interrupts that lead to the execution of operating system
code.
6. The chosen application runs directly on the hardware; again, the operating
system is not involved. After some time, the application performs a system
call to read from a file.
7. The system call causes a trap into the operating system. The operating
system sets things up for the I/O operation (using some privileged
instructions). It then puts the calling application to sleep, to await the I/O
completion, and chooses another application to run in its place.
Exercise 1 How can the operating system guarantee that there will be a
system call or interrupt, so that it will regain control?
The operating system is a reactive program
Events can be classified into two types: interrupts and system calls. These are
described in more detail below. The goal of the operating system is to run as
little as possible, handle the events quickly, and let applications run most of
the time.
But there are also internal resources needed by the operating system:
• Disk space for paging memory.
• Entries in system tables, such as the process table and open files table.
All the applications want to run on the CPU, but only one can run at a time.
Therefore the operating system lets each one run for a short while, and then
preempts it and gives the CPU to another. This is called time slicing. The
decision about which application to run is called scheduling (discussed in Chapter 2).
As for memory, each application gets some memory frames to store its code
and data. If the sum of the requirements of all the applications is more than
the available physical memory, paging is used: memory pages that are not
currently used are temporarily stored on disk (we’ll get to this in Chapter 4).
With disk space (and possibly also with entries in system tables) there is
usually a hard limit. The system makes allocations as long as they are
possible. When the resource runs out, additional requests are denied. However, the requesting applications can try again later, when some resources have hopefully been released by their users.
Exercise 3 As system tables are part of the operating system, they can be
made as big as we want. Why is this a bad idea? What sizes should be
chosen?
The operating system provides services
Isolation means that many applications can co-exist at the same time, using the same hardware devices, without interfering with each other. For example, if several applications send and receive data over a network, the operating system keeps the data streams separated from each other. These two issues are discussed next.
As part of the abstract machine, the operating system also supports some
abstractions that do not exist at the hardware level. The chief one is files:
persistent repositories of data with names. The hardware (in this case, the
disks) only supports persistent storage of data blocks. The operating system
builds the file system above this support, and creates named sequences of
blocks (as explained in Chapter 5). Thus applications are spared the need to
interact directly with disks.
Exercise 4 What features exist in the hardware but are not available in the abstract machine presented to applications?
Exercise 5 Can the abstraction include new instructions too?
The abstract machines are isolated
The abstract machine presented by the operating system is “better” than the
hardware by virtue of supporting more convenient abstractions. Another
important improvement is that it is also not limited by the physical resource
limitations of the underlying hardware: it is a virtual machine. This means
that the application does not access the physical resources directly. Instead,
there is a level of indirection, managed by the operating system.
[Figure: each application sees its own virtual machine, with a CPU that executes non-privileged machine instructions, general-purpose registers, and a contiguous 4 GB address space; the operating system maps these virtual machines onto the one physical machine available in hardware, whose CPU executes both privileged and non-privileged instructions.]
The main reason for using virtualization is to make up for limited resources.
If the physical hardware machine at our disposal has only 1GB of memory,
and each abstract machine presents its application with a 4GB address space,
then obviously a direct mapping from these address spaces to the available
memory is impossible. The operating system solves this problem by coupling
its resource management functionality with the support for the abstract
machines. In effect, it juggles the available resources among the competing
virtual machines, trying to hide the deficiency. The specific case of virtual
memory is described in Chapter 4.
The idea of virtual machines is not new. It originated with MVS, the
operating system for the IBM mainframes. In this system, time slicing and
abstractions are completely decoupled. MVS actually only does the time
slicing, and creates multiple exact copies of the original physical machine.
Then, a single-user operating system called CMS is executed in each virtual
machine. CMS provides the abstractions of the user environment, such as a
file system.
As each virtual machine is an exact copy of the physical machine, it was also
possible to run MVS itself on such a virtual machine. This was useful to
debug new versions of the operating system on a running system. If the new
version is buggy, only its virtual machine will crash, but the parent MVS will
not. This practice continues today, and VMware has been used as a platform
for allowing students to experiment with operating systems. We will discuss
virtual machine support in Section 9.5.
To read more: History buffs can read more about MVS in the book by
Johnson [7].
Things can get complicated
On the other hand, virtualization can also be done at the application level. A
remarkable example is given by VMware. This is actually a user-level
application, that runs on top of a conventional operating system such as
Linux or Windows. It creates a set of virtual machines that mimic the
underlying hardware. Each of these virtual machines can boot an independent
operating system, and run different applications. Thus the issue of what
exactly constitutes the operating system can be murky. In particular, several
layers of virtualization and operating systems may be involved with the
execution of a single application.
In these notes we’ll ignore such complexities, at least initially. We’ll take the
(somewhat outdated) view that all the operating system is a monolithic piece
of code, which is called the kernel. But in later chapters we’ll consider some
deviations from this viewpoint.
CPUs typically have (at least) two execution modes: user mode and kernel
mode. User applications run in user mode. The heart of the operating system
is called the kernel. This is the collection of functions that perform the basic
services such as scheduling applications. The kernel runs in kernel mode.
Kernel mode is also called supervisor mode or privileged mode.
The execution mode is indicated by a bit in a special register called the
processor status word (PSW). Various CPU instructions are only available to
software running in kernel mode, i.e., when the bit is set. Hence these
privileged instructions can only be executed by the operating system, and not
by user applications. Examples include:
• Instructions to set the interrupt priority level (IPL). This can be used to block certain classes of interrupts from occurring, thus guaranteeing undisturbed execution.
• Instructions to set the hardware clock to generate an interrupt at a certain time in the future.
• Instructions to activate I/O devices. These are used to implement I/O operations on files.
• Instructions to load and store special CPU registers, such as those used to define the accessible memory addresses, and the mapping from each application's virtual addresses to the appropriate addresses in the physical memory.
• Instructions to load and store values from memory directly, without going through the usual mapping. This allows the operating system to access all the memory.
Exercise 7 Which of the following instructions should be privileged?
Level 0 is the most protected and intended for use by the kernel. Level 1 is intended for other, non-kernel parts of the operating system. Level 2 is offered for device drivers: they need protection from user applications, but are not trusted to the same degree as the kernel itself.
Despite this support, most operating systems (including Unix, Linux, and
Windows) only use two of the four levels, corresponding to kernel and user
modes.
2Indeed, device drivers are typically buggier than the rest of the kernel [5].
The trick is that when the CPU switches to kernel mode, it also changes the
program counter3 (PC) to point at operating system code. Thus user code will
never get to run in kernel mode.
Unix has a special privileged user called the “superuser”. The superuser can
override various protection mechanisms imposed by the operating system; for
example, he can access other users’ private files. However, this does not
imply running in kernel mode. The difference is between restrictions imposed
by the operating system software, as part of the operating system services,
and restrictions imposed by the hardware.
There are two ways to enter kernel mode: interrupts and system calls.
Interrupts cause a switch to kernel mode
Interrupts are special conditions that cause the CPU not to execute the next
instruction. Instead, it enters kernel mode and executes an operating system
interrupt handler.
But how does the CPU (hardware) know the address of the appropriate kernel
function? This depends on what operating system is running, and the
operating system might not have been written yet when the CPU was
manufactured! The answer to this problem is to use an agreement between the
hardware and the software. This agreement is asymmetric, as the hardware
was there first. Thus, part of the hardware architecture is the definition of
certain features and how the operating system is expected to use them. All
operating systems written for this architecture must comply with these
specifications.
Note that the hardware does this blindly, using the predefined address of the
interrupt vector as a base. It is up to the operating system to actually store the
correct addresses in the correct places. If it does not, this is a bug in the
operating system.
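To make this agreement concrete, the following sketch (in C, with purely illustrative names and array size, and with the "hardware" simulated by an ordinary function call) shows the idea of an interrupt vector: a table of handler addresses at a location known to the hardware, which the operating system fills in at boot time.

#include <stdio.h>

#define NUM_INTERRUPTS 256

typedef void (*handler_t)(void);

/* In real hardware this table lives at a predefined address known to the
   CPU; here it is just a global array. */
handler_t interrupt_vector[NUM_INTERRUPTS];

void clock_handler(void)        { printf("clock tick handled\n"); }
void divide_error_handler(void) { printf("killing the misbehaving application\n"); }

/* Done by the operating system during boot: store the correct addresses
   in the correct places. */
void init_interrupt_vector(void) {
    interrupt_vector[0] = clock_handler;         /* interrupt number 0: clock */
    interrupt_vector[1] = divide_error_handler;  /* interrupt number 1: error */
}

/* What the hardware does blindly when interrupt n occurs: jump to whatever
   address the operating system stored there. */
void raise_interrupt(int n) {
    interrupt_vector[n]();
}

int main(void) {
    init_interrupt_vector();
    raise_interrupt(0);    /* simulate a clock interrupt */
    raise_interrupt(1);    /* simulate an error condition */
    return 0;
}

The two handlers registered here correspond to the interrupt types described next.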
• Clock interrupt. This tells the operating system that a certain amount of time has passed. Its handler is the operating system function that keeps track of time. Sometimes, this function also calls the scheduler, which might preempt the current application and run another in its place. Without clock interrupts, the application might run forever and monopolize the computer.
• An error condition: this tells the operating system that the current application did something illegal (divide by zero, try to issue a privileged instruction, etc.). The handler is the operating system function that deals with misbehaved applications; usually, it kills them.
When the handler finishes its execution, the execution of the interrupted
application continues where it left off — except if the operating system killed
the application or decided to schedule another one.
As an operating system can have more than a hundred system calls, the
hardware cannot be expected to know about all of them (as opposed to
interrupts, which are a hardware thing to begin with). The sequence of events
leading to the execution of a system call is therefore slightly more involved:
1. The application calls a library function that serves as a wrapper for the
system call.
2. The library function (still running in user mode) stores the system call
identifier and the provided arguments in a designated place in memory.
3. It then issues the trap instruction.
4. The hardware switches to privileged mode and loads the PC with the address of the operating system function that serves as an entry point for system calls.
5. The entry point function starts running (in kernel mode). It looks in the designated place to find which system call is requested.
6. The system call identifier is used in a big switch statement to find and call the appropriate operating system function to actually perform the desired service. This function starts by retrieving its arguments from where they were stored by the wrapper library function.
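As an illustration of steps 5 and 6, here is a sketch of what the kernel entry point with its big switch statement might look like. The system call identifiers, the request structure, and the sys_* functions are assumptions made for this example rather than the interface of any real kernel, and the trap itself is simulated by a plain function call.

#include <stdio.h>

enum { SYS_READ = 0, SYS_WRITE = 1, SYS_GETTIME = 2 };

/* The "designated place in memory" where the wrapper stored its request. */
struct syscall_request {
    int  id;          /* which system call is requested */
    long args[3];     /* the provided arguments         */
    long retval;      /* the result returned to the app */
};

/* Stubs standing in for the functions that actually perform the service. */
static long sys_read(long fd, long buf, long count)  { return 0; }
static long sys_write(long fd, long buf, long count) { return count; }
static long sys_gettime(void)                        { return 1234567; }

/* The operating system function that serves as the entry point for system
   calls, reached (in a real system) via the trap instruction. */
void syscall_entry(struct syscall_request *req) {
    switch (req->id) {
    case SYS_READ:
        req->retval = sys_read(req->args[0], req->args[1], req->args[2]);
        break;
    case SYS_WRITE:
        req->retval = sys_write(req->args[0], req->args[1], req->args[2]);
        break;
    case SYS_GETTIME:
        req->retval = sys_gettime();
        break;
    default:
        req->retval = -1;    /* unknown system call */
    }
}

int main(void) {
    struct syscall_request req = { SYS_GETTIME, {0, 0, 0}, 0 };
    syscall_entry(&req);     /* stands in for the trap into the kernel */
    printf("time of day: %ld\n", req.retval);
    return 0;
}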
Exercise 12 Should the library of system-call wrappers be part of the
distribution of the compiler or of the operating system?
Typical system calls include:
• Open, close, read, or write to a file.
• Create a new process (that is, start running another application).
• Get some information from the system, e.g. the time of day.
• Request to change the status of the application, e.g. to reduce its priority or to allow it to use more memory.
When the system call finishes, it simply returns to its caller like any other
function. Of course, the CPU must return to normal execution mode.
The hardware has special features to help the operating system
In addition to kernel mode and the interrupt vector, computers have various features that are specifically designed to help the operating system.
The most common are features used to help with memory management.
Examples include:
• Hardware to translate each virtual memory address to a physical address. This allows the operating system to allocate various scattered memory pages to an application, rather than having to allocate one long contiguous stretch of memory.
• "Used" bits on memory pages, which are set automatically whenever any address in the page is accessed. This allows the operating system to see which pages were accessed (bit is 1) and which were not (bit is 0).
We’ll review specific hardware features used by the operating system as we
need them.
1.4 Roadmap
There are different views of operating systems
An operating system can be viewed in three ways:
• According to the services it provides to users, such as
– Time slicing.
– A file system.
• By its programming interface, i.e. its system calls.
• According to its internal structure, algorithms, and data structures.
To read more: To actually use the services provided by a system, you need to
read a book that describes that system’s system calls. Good books for Unix
programming are Rochkind [15] and Stevens [19]. A good book for Windows
programming is Richter [14]. Note that these books teach you about how the
operating system looks “from the outside”; in contrast, we will focus on how
it is built internally.
The above paragraphs relate to a single system with a single processor. The
first part of these notes is restricted to such systems. The second part of the
notes is about distributed systems, where multiple independent systems
interact.
All the things we mentioned so far relate to the operating system kernel. This
will indeed be our focus. But it should be noted that in general, when one
talks of a certain operating system, one is actually referring to a distribution.
For example, a typical Unix distribution contains the following elements:
• The Unix kernel itself. Strictly speaking, this is "the operating system".
• The libc library. This provides the runtime environment for programs written in C. For example, it contains printf, the function to format printed output, and strncpy, the function to copy strings4.
These notes should not be considered to be the full story. For example, most
operating system textbooks contain historical information on the
development of operating systems, which is an interesting story and is not
included here. They also contain more details and examples for many of the
topics that are covered here.
The main recommended textbooks are Stallings [18], Silberschatz et al. [17],
and Tanenbaum [21]. These are general books covering the principles of both
theoretical work and the practice in various systems. In general, Stallings is
more detailed, and gives extensive examples and descriptions of real systems;
Tanenbaum has a somewhat broader scope.
In addition, there are a number of books on specific (real) systems. The first
and most detailed description of Unix system V is by Bach [1]. A similar
description of 4.4BSD was written by McKusick and friends [12]. The most
recent is a book on Solaris [10]. Vahalia is another very good book, with
focus on advanced issues in different Unix versions [23]. Linux has been
described in detail by Card and friends [4], by Beck and other friends [2], and
by Bovet and Cesati [3]; of these, the first gives a very detailed low-level description, including all the fields in all major data structures. Alternatively, source code with extensive commentary is available for Unix version 6 (old but a classic) [9] and for Linux [11]. It is hard to find anything with technical details about Windows. The best available is Russinovich and Solomon [16].
4Always use strncpy, not strcpy!
While these notes attempt to represent the lectures, and therefore have
considerable overlap with textbooks (or, rather, are subsumed by the
textbooks), they do have some unique parts that are not commonly found in
textbooks. These include an emphasis on understanding system behavior and
dynamics. Specifically, we focus on the complementary roles of hardware
and software, and on the importance of knowing the expected workload in
order to be able to make design decisions and perform reliable evaluations.
Bibliography
[1] M. J. Bach, The Design of the UNIX Operating System. Prentice-Hall, 1986.
[2] M. Beck, H. Bohme, M. Dziadzka, U. Kunitz, R. Magnus, and D. Verworner, Linux Kernel Internals. Addison-Wesley, 2nd ed., 1998.
[3] D. P. Bovet and M. Cesati, Understanding the Linux Kernel. O'Reilly, 2001.
[4] R. Card, E. Dumas, and F. Mével, The Linux Kernel Book. Wiley, 1998.
[9] J. Lions, Lions' Commentary on UNIX 6th Edition, with Source Code. Annabooks, 1996.
[10] J. Mauro and R. McDougall, Solaris Internals. Prentice Hall, 2001.
[11] S. Maxwell, Linux Core Kernel Commentary. Coriolis Open Press, 1999.
Appendix A
Background on Computer Architecture
Operating systems are tightly coupled with the architecture of the computer
on which they are running. Some background on how the hardware works is
therefore required. This appendix summarizes the main points. Note,
however, that this is only a high-level simplified description, and does not
correspond directly to any specific real-life architecture.
After the subroutine runs, the ret instruction restores the previous state:
1. It restores the register values from the stack.
2. It loads the PC with the return address that was also stored on the stack.
3. It decrements the stack pointer to point to the previous stack frame.
The hardware also provides special support for the operating system. One
type of support is the mapping of memory. This means that at any given time,
the CPU cannot access all of the physical memory. Instead, there is a part of
memory that is accessible, and other parts that are not. This is useful to allow
the operating system to prevent one application from modifying the memory
of another, and also to protect the operating system itself. The simplest
implementation of this idea is to have a pair of special registers that bound
the accessible memory range. Real machines nowadays support more
sophisticated mapping, as described in Chapter 4.
[Figure: the CPU, with its ALU and the PSW, PC, SP, and memory-mapping registers, can access only a delimited region of memory; the memory of another application and the operating system's data lie outside the accessible range.]
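A sketch of this simple bounded mapping, with illustrative register names and values, and with the hardware check expressed as an ordinary C function:

#include <stdio.h>
#include <stdbool.h>

static unsigned long base_reg  = 0x4000;   /* start of the accessible range */
static unsigned long limit_reg = 0x1000;   /* size of the accessible range  */

/* Conceptually performed by the hardware on every memory access; the
   operating system sets the two registers (using privileged instructions)
   when it switches applications. */
bool translate(unsigned long virtual_addr, unsigned long *physical_addr) {
    if (virtual_addr >= limit_reg)
        return false;                      /* out of bounds: raise an exception */
    *physical_addr = base_reg + virtual_addr;
    return true;
}

int main(void) {
    unsigned long phys;
    if (translate(0x0200, &phys))
        printf("virtual 0x200 maps to physical 0x%lx\n", phys);
    if (!translate(0x2000, &phys))
        printf("virtual 0x2000 is an access violation\n");
    return 0;
}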
A special case of calling a subroutine is making a system call. In this case the caller is a user application, but the callee is the operating system. The problem is that the operating system should run in privileged mode, or kernel mode. Thus we cannot just use the call instruction. Instead, we need the trap instruction. This does all that call does, and in addition sets the mode bit in the processor status word (PSW) register. Importantly, when trap sets this bit, it loads the PC with the predefined address of the operating system entry point (as opposed to call, which loads it with the address of a user function). Thus after issuing a trap, the CPU will start executing operating system code in kernel mode. Returning from the system call resets the mode bit in the PSW, so that user code will not run in kernel mode.
There are other ways to enter the operating system in addition to system calls,
but technically they are all very similar. In all cases the effect is just like that
of a trap: to pass control to an operating system subroutine, and at the same
time change the CPU mode to kernel mode. The only difference is the trigger.
For system calls, the trigger is a trap instruction called explicitly by an
application. Another type of trigger is when the current instruction cannot be
completed (e.g. division by zero), a condition known as an exception. A third
is interrupts — a notification from an external device (such as a timer or disk)
that some event has happened and needs handling by the operating system.
The reason for having a kernel mode is also an example of hardware support
for the operating system. The point is that various control functions need to
be reserved to the operating system, while user applications are prevented
from performing them. For example, if any user application could set the
memory mapping registers, they would be able to allow themselves access to
the memory of other applications. Therefore the setting of these special
control registers is only allowed in kernel mode. If a user-mode application
tries to set these registers, it will suffer an illegal instruction exception.
Part II
The Classics
Operating systems are complex programs, with many interactions between
the different services they provide. The question is how to present these
complex interactions in a linear manner. We do so by first looking at each
subject in isolation, and then turning to cross-cutting issues.
Part III then discusses the cross-cutting issues, with chapters about topics that
are sometimes not covered. These include security, extending operating
system functionality to multiprocessor systems, various technical issues such
as booting the system, the structure of the operating system, and performance
evaluation.
Part IV extends the discussion to distributed systems. It starts with the issue of communication among independent computers, and then presents the composition of autonomous systems into the larger ensembles that this enables.
Chapter 2
Processes and Threads
• Processor Status Word (PSW): includes bits specifying things like the mode (privileged or normal), the outcome of the last arithmetic operation (zero, negative, overflow, or carry), and the interrupt level (which interrupts are allowed and which are blocked).
• General purpose registers, used to store addresses and data values as directed by the compiler. Using them effectively is an important topic in compilers, but does not involve the operating system.
The memory contains the results so far
Only a small part of an application's data can be stored in registers. The rest is in memory. This is typically divided into a few parts, sometimes called segments:
All the addressable memory together is called the process’s address space. In
modern systems this need not correspond directly to actual physical memory.
We’ll discuss this later.
The operating system keeps all the data it needs about a process in the
process control block (PCB) (thus another definition of a process is that it is
“the entity described by a PCB”). This includes many of the data items
described above, or at least pointers to where they can be found (e.g. for the
address space). In addition, data needed by the operating system is included,
for example
• Information about the user running the process, used to decide the process's access rights (e.g. a process can only access a file if the file's permissions allow this for the user running the process). In fact, the process may be said to represent the user to the system.
The PCB may also contain space to save CPU register contents when the
process is not running (some implementations specifically restrict the term
“PCB” to this storage space).
Exercise 17 We said that the stack is used to save register contents, and that
the PCB also has space to save register contents. When is each used?
Schematically, all the above may be summarized by the following picture,
which shows the relationship between the different pieces of data that
constitute a process:
[Figure: the process consists of its memory (a kernel part containing the PCB, and a user part containing the address space) together with the CPU state held in registers such as the PSW, PC, and SP.]
The PCB is more than just a data structure that contains information about the
process. It actually represents the process. Thus PCBs can be linked together
to represent processes that have something in common — typically processes
that are in the same state.
For example, when multiple processes are ready to run, this may be
represented as a linked list of their PCBs. When the scheduler needs to decide
which process to run next, it traverses this list, and checks the priority of the
different processes.
Processes that are waiting for different types of events can also be linked in
this way. For example, if several processes have issued I/O requests, and are
now waiting for these I/O operations to complete, their PCBs can be linked in
a list. When the disk completes an I/O operation and raises an interrupt, the
operating system will look at this list to find the relevant process and make it
ready for execution again.
An important point is that a process may change its state. It can be ready to
run at one instant, and blocked the next. This may be implemented by moving
the PCB from one linked list to another.
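A minimal sketch of this bookkeeping, with field and function names invented for illustration (a real kernel would also unlink a PCB from its old list when the process changes state):

#include <stdio.h>

typedef enum { READY, RUNNING, BLOCKED } proc_state;

typedef struct pcb {
    int         pid;
    proc_state  state;
    struct pcb *next;      /* link within the list for the current state */
} pcb;

pcb *ready_list   = NULL;  /* PCBs of processes waiting for the CPU  */
pcb *blocked_list = NULL;  /* PCBs of processes waiting for an event */

/* Making a process ready means putting its PCB on the ready list. */
void make_ready(pcb *p) {
    p->state = READY;
    p->next = ready_list;
    ready_list = p;
}

/* Blocking a process means putting its PCB on the blocked list. */
void block(pcb *p) {
    p->state = BLOCKED;
    p->next = blocked_list;
    blocked_list = p;
}

int main(void) {
    pcb a = {1, READY, NULL}, b = {2, READY, NULL}, c = {3, READY, NULL};
    make_ready(&a);
    make_ready(&b);
    block(&c);             /* e.g. c issued an I/O request and now waits */
    for (pcb *p = ready_list; p != NULL; p = p->next)
        printf("ready: pid %d\n", p->pid);
    return 0;
}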
Graphically, the lists (or states) that a process may be in can be represented as
different locations, and the processes may be represented by tokens that move
from one state to another according to the possible transitions. For example,
the basic states and transitions may look like this:
At each moment, at most one process is in the running state, and occupying
the CPU. Several processes may be ready to run (but can’t because we only
have one processor). Several others may be blocked waiting for different
types of events, such as a disk interrupt or a timer going off.
Processes are created in the ready state. A ready process may be scheduled to
run by the operating system. When running, it may be preempted and
returned to the ready state. A process may also block waiting for an event,
such as an I/O operation. When the event occurs, the process becomes ready
again. Such transitions continue until the process terminates.
2.1.3 Threads
The main exception in this picture is the stacks. A stack is actually a record of
the flow of the computation: it contains a frame for each function call,
including saved register values, return address, and local storage for this
function. Therefore each thread must have its own stack.
Exercise 25 Asynchronous I/O is obviously useful for writing data, which can
be done in the background. But can it also be used for reading?
[Figure: a multithreaded process; the kernel's PCB holds the state shared by all the threads (the address space with its heap and stacks, open files, and accounting), while the user-level thread package keeps a descriptor per thread with its state, priority, storage for CPU registers, and a pointer to its own stack.]
Exercise 26 If one thread allocates a data structure from the heap, can other
threads access it?
Note the replication of data structures and work. At the operating system
level, data about the process as a whole is maintained in the PCB and used
for scheduling. But when it runs, the thread package creates independent
threads, each with its own stack, and maintains data about them to perform its
own internal scheduling.
The problem with user-level threads is that the operating system does not
know about them. At the operating system level, a single process represents
all the threads. Thus if one thread performs an I/O operation, the whole
process is blocked waiting for the I/O to complete, implying that all threads
are blocked.
In Unix, jumping from one part of the program to another can be done using the setjmp and longjmp functions that encapsulate the required operations. setjmp essentially stores the CPU state into a buffer. longjmp restores the state from a buffer created with setjmp. The names derive from the following reasoning: setjmp sets things up to enable you to jump back to exactly this place in the program. longjmp performs a long jump to another location, and specifically, to one that was previously stored using setjmp.
To implement threads, assume each thread has its own buffer (in our discussion of threads above, this is the part of the thread descriptor set aside to store registers). Given many threads, there is an array of such buffers called buf. In addition, let current be the index of the currently running thread. Thus we want to store the state of the current thread in buf[current]. The code that implements a context switch is then simply
switch() {
    if (setjmp(buf[current]) == 0) {
        schedule();
    }
}
The setjmp function stores the state of the current thread in buf[current], and returns 0. Therefore we enter the if, and the function schedule is called. Note that this is the general context switch function, due to our use of current. Whenever a context switch is performed, the thread state is stored in the correct thread's buffer, as indexed by current.
The schedule function, which is called from the context switch function, does
the following:
schedule() {
    new = select-thread-to-run;
    current = new;
    longjmp(buf[new], 1);
}
new is the index of the thread we want to switch to. longjmp performs a switch to that thread by restoring the state that was previously stored in buf[new]. Note that this buffer indeed contains the state of that thread, that was stored in it by a previous call to setjmp. The result is that we are again inside the call to setjmp that originally stored the state in buf[new]. But this time, that instance of setjmp will return a value of 1, not 0 (this is specified by the second argument to longjmp). Thus, when the function returns, the if surrounding it will fail, and schedule will not be called again immediately. Instead, switch will return and execution will continue where it left off before calling the switching function.
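For reference, a minimal sketch of the declarations that these two snippets assume (the thread limit here is an arbitrary choice for illustration):

#include <setjmp.h>

#define MAX_THREADS 16

jmp_buf buf[MAX_THREADS];   /* one saved CPU state per thread        */
int     current;            /* index of the currently running thread */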
User-level thread packages, such as pthreads, are based on this type of code.
But they provide a more convenient interface for programmers, enabling
them to ignore the complexities of implementing the context switching and
scheduling.
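For example, with the pthreads interface a program creates a thread and waits for its termination with just two calls; the following minimal sketch (compiled with -pthread) shows the idea:

#include <pthread.h>
#include <stdio.h>

/* The function executed by the newly created thread. */
void *worker(void *arg) {
    printf("hello from thread %d\n", *(int *)arg);
    return NULL;
}

int main(void) {
    pthread_t tid;
    int id = 1;
    pthread_create(&tid, NULL, worker, &id);   /* create and start the thread */
    pthread_join(tid, NULL);                   /* wait for it to terminate    */
    return 0;
}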
The following table summarizes the properties of kernel threads and user threads, and contrasts them with processes:

processes:
• protected from each other, require the operating system to communicate
• high overhead: all operations require a kernel trap and significant work
• independent: if one blocks, this does not affect the others
• can run on different processors in a multiprocessor
• one size fits all

user-level threads:
• low overhead: everything is done at user level
• if a thread blocks, the whole process is blocked
• all share the same processor
• the same thread library may be available on several systems
• application-specific thread management is possible
Note that operating systems that support threads, such as Mach and Windows
NT, have distinct system calls for processes and threads. For example, the
“process create” call can be used to create a new process, and then “thread
create” can be used to add threads to this process. This is an important
distinction, as creating a new process is much heavier: you need to create a
complete context, including its memory space. Creating a thread is much
easier, as it simply hooks into an existing context.
Unix originally did not support threads (it was designed in the late 1960’s).
Therefore many Unix variants implement threads as “light-weight” processes,
reusing a relatively large part of the process abstraction.
Threads within the same process are less restricted, as it is assumed that if
one terminates another this is part of what the application as a whole is
supposed to do.
Suspend execution
A thread embodies the flow of a computation. So a desirable operation on it may be to stop this computation.
A thread may suspend itself by going to sleep. This means that it tells the
system that it has nothing to do now, and therefore should not run. A sleep is
associated with a time: when this future time arrives, the system will wake
the thread up.
Threads (in the same process) can also suspend each other. Suspend is
essentially another state in the thread state transition graph, which is similar
to the blocked state. The counterpart of suspend is to resume another thread.
A resumed thread is moved from the suspend state to the ready state.
When there is only one CPU, multitasking and multiprogramming are the
same thing. In a parallel system or cluster, you can have multiprogramming
without multitasking, by running jobs on different CPUs.
Job 2 is ahead of job 3 in the queue, so when job 1 terminates, job 2 runs.
However, job 2 is very long, so job 3 must wait a long time in the queue,
even though it itself is short.
Consider a counter example, in which all jobs have the same length. In this
case, a job that arrives first and starts running will also terminate before a job
that arrives later. Therefore preempting the running job in order to run the
new job delays it and degrades responsiveness.
Exercise 32 Consider applications you run daily. Do they all have similar
runtimes, or are some short and some long?
To read more: The benefit of using preemption when the CV of service times
is greater than 1 was established by Regis [13].
If more than one job is being serviced, then instead of waiting for an I/O
operation to complete, the CPU can switch to another job. This does not
guarantee that the CPU never has to wait, nor that the disk will always be
kept busy. However, it is in general possible to keep several system components busy serving different jobs.
[Figure: a timeline in which the CPU and the disk alternate between job1 and job2; while one job waits for its I/O operation to end, the CPU can run the other, though both the CPU and the disk still have some idle periods.]
Improved utilization can also lead to improved responsiveness
In addition, allowing one job to use resources left idle by another helps the
first job to make progress. With luck, it will also terminate sooner. This is
similar to the processor sharing idea described above, except that here we are
sharing all the system resources, not just the CPU.
Finally, by removing the constraint that jobs have to wait for each other, and
allowing resources to be utilized instead of being left idle, more jobs can be
serviced. This is expressed as a potential increase in throughput. Realization
of this potential depends on the arrival of more jobs.
Exercise 33 In the M/M/1 analysis from Chapter 10, we saw that the average
response time grows monotonically with utilization. Does this contradict the
claims made here?
If all the jobs are compute-bound, meaning they need a lot of CPU cycles and
do not perform much I/O, the CPU will be the bottleneck. If all the jobs are
I/O-bound, meaning that they only compute for a little while and then
perform I/O operations, the disk will become a bottleneck. In either case,
multiprogramming will not help much.
In order to use all the system components effectively, a suitable job mix is
required. For example, there could be one compute-bound application, and a few I/O-bound ones. Some of the applications may require a lot of memory space, while others require only a little memory.
The operating system can create a suitable job mix by judicious long-term
scheduling. Jobs that complement each other will be loaded into memory and
executed. Jobs that contend for the same resources as other jobs will be
swapped out and have to wait.
The question remains of how to classify the jobs: is a new job going to be compute-bound or I/O-bound? An estimate can be derived from the job's
history. If it has already performed multiple I/O operations, it will probably
continue to do so. If it has not performed any I/O, it probably will not do
much in the future, but rather continue to just use the CPU.
A third reason for supporting multiple processes at once is that this allows for
concurrent programming, in which the multiple processes interact to work on
the same problem. A typical example from Unix systems is connecting a set
of processes with pipes. The first process generates some data (or reads it
from a file), does some processing, and passes it on to the next process.
Partitioning the computational task into a sequence of processes is done for
the benefit of application structure and reduced need for buffering.
Exercise 35 Pipes only provide sequential access to the data being piped.
Why does this make sense?
The use of multitasking is now common even on personal systems. Examples
include:
• Degraded performance: even when the CPU is running application code, its performance may be reduced. For example, we can see
The decision about which process to schedule might depend on what you are
trying to achieve. There are several possible metrics that the system could try
to optimize, including
1Whenever we say process here, we typically also mean thread, if the system is thread based.
Response time or turnaround time — the average time from submitting a job
until it terminates. This is the sum of the time spent waiting in the queue, and
the time actually running:
Tresp = Twait + Trun
[Figure: a timeline of a single job: it arrives, waits in the queue for Twait, then runs for Trun until it terminates; the response time Tresp covers both.]
If jobs terminate faster, users will be happier. In interactive systems, this may
be more of a sharp threshold: if response time is less than about 0.2 seconds, it is OK. If it is above 2 or 3 seconds, it is bad.
Wait time — reducing the time a job waits until it runs also reduces its
response time. As the system has direct control over the waiting time, but
little control over the actual run time, it should focus on the wait time (Twait).
Response ratio or slowdown — the ratio of the response time to the actual run time: slowdown = Tresp / Trun
This normalizes all jobs to the same scale: long jobs can wait more, and don’t
count more than short ones.
In real systems the definitions are a bit more complicated. For example, when
time slicing is used, the runtime is the sum of all the times that the job runs,
and the waiting time is all the times it waits in the ready queue — but
probably not the times it is waiting for an I/O operation to complete.
Utilization — the average percentage of the hardware (or the CPU) that is
actually used. If the utilization is high, you are getting more value for the
money invested in buying the computer.
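As a simple illustration of these definitions, a job that waits 9 seconds in the queue and then runs for 1 second has a response time of 10 seconds and a slowdown of 10, whereas a job that waits the same 9 seconds but runs for 100 seconds has a response time of 109 seconds and a slowdown of only 1.09.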
Exercise 36 The response time can be any positive number. What are the
numerical ranges for the other metrics? When are high values better, and
when are low values better?
Not all metrics are equally applicable. For example, in many real-world
situations, users submit jobs according to their needs. The ratio of the
requirements of all the jobs to the available resources of the system then
determines the utilization, and does not directly reflect the scheduler’s
performance. There is only an indirect effect: a bad scheduler will discourage
users from submitting more jobs, so the utilization will drop.
1. All jobs are available at the outset and none arrive later2.
2. The job runtimes are also known in advance.
The assumption that all jobs are known in advance implies that this is the set
of jobs that the scheduling algorithm needs to handle. This is a reasonable
assumption in the context of an algorithm that is invoked repeatedly by the
operating system whenever it is needed (e.g. when the situation changes
because additional jobs arrive).
Given that off-line algorithms get all their required information at the outset,
and that all relevant jobs are available for scheduling, they can expect no
surprises during execution. In fact, they complete the schedule before
execution even begins.
The result is that off-line algorithms use “run to completion” (RTC): each job
is executed until it terminates, and the algorithm is only concerned with the
order in which jobs are started.
The simplest approach is to just run the jobs one after the other, each one to completion. For the off-line case, the jobs are run in the order that they appear in the input.
Running short jobs first improves the average response time
Reordering the jobs so as to run the shortest jobs first (SJF) improves the
average response time. Consider two adjacent jobs, one longer than the other.
Because the total time for both jobs is constant, the second job will terminate
at the same time regardless of their order. But if the shorter one is executed
first, its termination time will be shorter than if the long one is executed first.
As a result, the average is also reduced when the shorter job is executed first:
[Figure: two schedules of the same pair of jobs; the last job terminates at the same time in both, but when the short job is run first the average of the two termination times is earlier.]
By repeating this argument, we see that for every two adjacent jobs, we
should run the shorter one first in order to reduce the average response time.
Switching pairs of jobs like this is akin to bubble-sorting the jobs in order of
increasing runtime. The minimal average response time is achieved when the
jobs are sorted, and the shortest ones are first.
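For example, consider a job of length 1 and a job of length 10, both available at time 0. Running the short job first gives response times of 1 and 11, for an average of 6; running the long job first gives 10 and 11, for an average of 10.5. The second job terminates at time 11 either way, but the order determines how long the short job is delayed.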
Exercise 38 Is SJF also optimal for other metrics, such as minimizing the
average slowdown?
But real systems are on-line
In real systems you typically don't know much in advance. In particular, new
jobs may arrive unexpectedly at arbitrary times. Over the lifetime of a
system, the scheduler will be invoked a very large number of times, and each
time there will only be a small number of new jobs. Thus it seems ill-advised
to emphasize the behavior of the scheduler in a single invocation. Instead,
one should consider how it handles all the arrivals that occur over a long
stretch of time.
On-line algorithms get information about one job at a time, and need to
decide immediately what to do with it: either schedule it or put it in a queue
to be scheduled later. The decision may of course be influenced by the past
(e.g. by previously arrived jobs that are already in the queue), but not by the
future.
On-line algorithms do not know about their input in advance: they get it
piecemeal as time advances. Therefore they might make a scheduling
decision based on current data, and then regret it when an additional job
arrives. The solution is to use preemption in order to undo the previous
decision.
We start by reversing the first of the two assumptions made earlier. Thus we
now assume that
1. Jobs may arrive unpredictably and the scheduler must deal with those that
have already arrived without knowing about future arrivals
2. Nevertheless, when a job arrives its runtime is known in advance.
The version of SJF used in this context is called “shortest remaining time
first” (SRT)3. As each new job arrives, its runtime is compared with the
remaining runtime of the currently running job. If the new job is shorter, the
current job is preempted and the new job is run in its place. Otherwise the
current job is allowed to continue its execution, and the new job is placed in
the appropriate place in the sorted queue.
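The decision made upon each arrival can be sketched as follows; the job type, the field names, and the printed messages are illustrative assumptions, and the sorted queue of waiting jobs is not shown.

#include <stdio.h>

typedef struct {
    int    id;
    double remaining;   /* remaining runtime, assumed to be known */
} job;

/* Decide whether a newly arrived job preempts the running one; returns the
   job that should run next (in a full scheduler the loser would be placed
   in the sorted queue). */
job *on_arrival(job *running, job *arrived) {
    if (arrived->remaining < running->remaining) {
        printf("job %d preempts job %d\n", arrived->id, running->id);
        return arrived;
    }
    printf("job %d keeps running; job %d is queued\n", running->id, arrived->id);
    return running;
}

int main(void) {
    job current = {1, 8.0};    /* 8 time units left           */
    job newjob  = {2, 3.0};    /* shorter, so it should preempt */
    job *next = on_arrival(&current, &newjob);
    printf("now running: job %d\n", next->id);
    return 0;
}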
Exercise 39 Is SRT also optimal for other metrics, such as minimizing the
average slowdown?
A problem with actually using this algorithm is the assumption that the run
times of jobs are known. This may be allowed in the theoretical setting of off-
line algorithms, but is usually not the case in real systems.
An exception is a web server serving static pages, where the time to serve a request is roughly proportional to the requested page's size. Thus when a request for a certain page arrives, we can get a pretty accurate assessment of how long it will take to serve, based on the requested page's size. Scheduling web servers in this way turns out to improve performance significantly for small pages, without too much effect on large ones [6].
Preemption can also compensate for lack of knowledge
SRT only preempts the current job when a shorter one arrives, which relies
on the assumption that runtimes are known in advance. But a more realistic
set of assumptions is that runtimes are not known in advance.
Using more preemptions can compensate for this lack of knowledge. The
idea is to schedule each job for a short time quantum, and then preempt it and
schedule another job in its place. The jobs are scheduled in round robin
order: a cycle is formed, and each gets one time quantum. Thus the delay
until a new job gets to run is limited to one cycle, which is the product of the
number of jobs times the length of the time quantum. If a job is short, it will
terminate within its first quantum, and have a relatively short response time.
If it is long, it will have to wait for additional quanta. The system does not
have to know in advance which jobs are short and which are long.
Note that when each process just runs for a short time, we are actually time
slicing the CPU. This results in a viable approximation to processor sharing,
which was shown on page 36 to prevent situations in which a short job gets
stuck behind a long one. In fact, the time it takes to run each job is more-or-
less proportional to its own length, multiplied by the current load. This is
beneficial due to the high variability of process runtimes: most are very short,
but some are very long.
[Figure: round-robin time slicing of three jobs that arrive at different times; the CPU cycles among whichever jobs are currently present, and the jobs terminate in the order job 3, job 1, job 2.]
Exercise 41 Should the time slices be long or short? Why?
Using preemption regularly ensures that the system stays in control
But how are time quanta enforced? This is another example where the
operating system needs some help from the hardware. The trick is to have a
hardware clock that causes periodic interrupts (e.g. 100 times a second, or
every 10 ms). This interrupt, like all other interrupts, is handled by a special
operating system function. Specifically, this handler may call the scheduler
which may decide to preempt the currently running process.
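Schematically, the clock interrupt handler might behave as in the following sketch; the quantum length, the function names, and the simulated stream of ticks are illustrative assumptions only.

#include <stdio.h>

#define QUANTUM_TICKS 10   /* e.g. 10 ticks of 10 ms give a 100 ms quantum */

static int ticks_left = QUANTUM_TICKS;

/* Stand-in for the scheduler; in a real kernel it may preempt the
   currently running process. */
static void schedule(void) {
    printf("quantum expired: scheduler invoked\n");
}

/* Conceptually called by the hardware clock, via the interrupt vector,
   100 times a second. */
static void clock_interrupt_handler(void) {
    if (--ticks_left == 0) {
        ticks_left = QUANTUM_TICKS;
        schedule();
    }
}

int main(void) {
    for (int tick = 0; tick < 35; tick++)   /* simulate 35 clock interrupts */
        clock_interrupt_handler();
    return 0;
}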
Finally, we note that even if processes do not engage in infinite loops, some
are compute-bound while others are interactive (or I/O-bound). Without
preemption, CPU-bound processes may lock out the interactive processes for
excessive periods, which is undesirable. Preempting them regularly allows
interactive processes to get a chance to run, at least once in a while. But if
there are many compute-bound processes, this may not be enough. The
solution is then to give the interactive processes higher priorities.
However, the system can easily accumulate information about the processes,
and prioritize them. Interactive jobs, such as a text editor, typically interleave
short bursts of CPU activity with I/O operations to the terminal. In order to
improve responsiveness, the system should give high priority to such jobs.
This can be done by regarding the CPU burst as the unit of computing, and
scheduling processes with short bursts first (this is a variant of SJF).
The question remains of how to estimate the duration of the next burst. One
option is to assume it will be like the last one. Another is to use an average of
all previous bursts. Typically a weighted average is used, so recent activity
has a greater influence on the estimate. For example, we can define the (n+1)st estimate as a geometrically weighted sum of the measured bursts, En+1 = αTn + α(1−α)Tn−1 + α(1−α)^2 Tn−2 + · · ·, for some parameter 0 < α < 1, so that recent bursts receive larger weights than older ones.
Exercise 42 Can this be computed without storing information about all the
previous bursts?
Multi-level feedback queues learn from past behavior
New processes and processes that have completed I/O operations are placed
in the first queue, where they have a high priority and receive a short
quantum. If they do not block or terminate within this quantum, they move to
the next queue, where they have a lower priority (so they wait longer), but
then they get a longer quantum. On the other hand, if they do not complete
their allocated quantum, they either stay in the same queue or even move
back up one step.
In this way, a series of queues is created: each additional queue holds jobs
that have already run for a longer time, so they are expected to continue to
run for a long time. Their priority is reduced so that they will not interfere
with other jobs that are assumed to be shorter, but when they do run they are
given a longer quantum to reduce the overhead of context switching. The scheduler always serves the lowest-numbered non-empty queue.
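A sketch of the bookkeeping this involves might look as follows; the number of queues, the quantum lengths, and all names are illustrative assumptions rather than a description of any particular system.

#include <stdio.h>

#define NUM_QUEUES 4

/* The quantum (in clock ticks) grows with queue depth. */
static const int quantum[NUM_QUEUES] = { 1, 2, 4, 8 };

typedef struct {
    int pid;
    int queue;      /* current queue: 0 has the highest priority */
} job;

/* Called when a job used up its entire quantum: demote it. */
void quantum_expired(job *j) {
    if (j->queue < NUM_QUEUES - 1)
        j->queue++;
}

/* Called when a job blocked (e.g. for I/O) before its quantum ended:
   promote it one step. */
void blocked_early(job *j) {
    if (j->queue > 0)
        j->queue--;
}

int main(void) {
    job j = { 7, 0 };
    quantum_expired(&j);    /* ran a full quantum: drops to queue 1 */
    quantum_expired(&j);    /* and again: drops to queue 2          */
    blocked_early(&j);      /* did I/O: promoted back to queue 1    */
    printf("job %d is in queue %d with quantum %d\n",
           j.pid, j.queue, quantum[j.queue]);
    return 0;
}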
The Unix scheduler also prioritizes the ready processes based on CPU usage,
and schedules the one with the highest priority (which is the lowest numerical
value). The equation used to calculate user-level priorities is the following
(when running in kernel mode, the priority is fixed):
pri = cpu use + base + nice
cpu use is recent CPU usage. This value is incremented for the running
process on every clock interrupt (typically 100 times a second). Thus the
priority of a process goes down as it runs. In order to adjust to changing
conditions, and not to over-penalize long running jobs, the cpu use value is
divided in two for all processes once a second (this is called exponential
aging). Thus the priority of a process goes up as it waits in the ready queue.
[Figure: the priority of a process goes down (numerically up) while it runs, as cpu use accumulates, and recovers while it waits in the queue, due to exponential aging.]
base is the base priority for user processes, and distinguishes them from kernel priorities. A process that goes to sleep in a system call, e.g. when waiting for
a disk operation to complete, will have a higher priority. Thus when it wakes
up it will have preference over all user-level processes. This will allow it to
complete the system call and release kernel resources. When it returns to user
mode, its priority will be reduced again.
nice is an optional additional term that users may use to reduce the priority of
their processes, in order to be nice to their colleagues.
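Putting these pieces together, a toy sketch of the decay-usage mechanism might look as follows; the constants and the exact arithmetic are assumptions made for illustration, not the real Unix code.

#include <stdio.h>

#define BASE 50                /* base priority for user processes */

typedef struct {
    int cpu_use;               /* recent CPU usage, in clock ticks  */
    int nice;                  /* optional user-supplied reduction  */
    int priority;              /* lower value means higher priority */
} proc;

/* On every clock interrupt, charge the running process. */
void clock_tick(proc *running) {
    running->cpu_use++;
}

/* Once a second, for all processes: exponential aging and priority
   recomputation. */
void recompute(proc *p) {
    p->cpu_use /= 2;
    p->priority = p->cpu_use + BASE + p->nice;
}

int main(void) {
    proc p = { 0, 0, BASE };
    for (int t = 0; t < 100; t++)   /* p runs for a full second */
        clock_tick(&p);
    recompute(&p);
    printf("after running for a second: priority %d\n", p.priority);
    recompute(&p);                  /* p now waits in the ready queue */
    printf("after waiting for a second: priority %d\n", p.priority);
    return 0;
}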
Exercise 43 Does the Unix scheduler give preference to interactive jobs over CPU-bound jobs? If so, how?
Exercise 44 Can a Unix process suffer from starvation?
To read more: Unix scheduling is described in Bach [1, Sect. 8.1] (system V)
and in McKusick [11, Sect. 4.4] (4.4BSD). The BSD formula is slightly more
complicated.
Technically all this is done by keeping processes on one of two arrays: the active array and the expired array. Initially all processes are on the active array and eligible for scheduling. Each process that exhausts its allocation is moved to the expired array. When there are no more runnable processes on the active array, the epoch ends and the arrays are switched.
The Linux approach has a subtle effect on interactive processes. The crucial
point is that allocations and priorities are correlated (in fact, they are the same
number). Therefore if we have several high-priority interactive processes,
each may run for a relatively long time before giving up the processor. As a
result, such interactive processes may have to wait a long time to get their turn to run.
Fair share scheduling tries to give each job what it deserves to get. However,
“fair” does not necessarily imply “equal”. Deciding how to share the system
is an administrative policy issue, to be handled by the system administrators. The scheduler should only provide the tools to implement the chosen policy.
For example, a fair share of the machine resources might be defined to be
“proportional to how much you paid when the machine was acquired”.
Another possible criterion is your importance in the organization. Your fair
share can also change with time, reflecting the current importance of the
project you are working on. When a customer demo is being prepared, those
working on it get more resources than those working on the next generation
of the system.
Shares can be implemented by manipulating the scheduling priority
A simple method for fair share scheduling is to add a term that reflects
resource usage to the priority equation. For example, if we want to control the
share of the resources acquired by a group of users, we can add a term that
reflects cumulative resource usage by the group members to the priority
equation, so as to reduce their priority as they use more resources. This can
be “aged” with time to allow them access again after a period of low usage.
The problem with this approach is that it is hard to translate the priority
differences into actual shares [8, 3].
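As a sketch of this idea, the priority computation could charge each group for the CPU time its members consume, and age this charge periodically. The constants, array, and function names below are illustrative and not taken from any real system.

#define NGROUPS 16

int group_usage[NGROUPS];            /* accumulated CPU usage per group */

void on_clock_tick(int running_group)
{
    group_usage[running_group]++;    /* charge the running process's group */
}

void once_a_second(void)
{
    for (int g = 0; g < NGROUPS; g++)
        group_usage[g] /= 2;         /* exponential aging of group usage */
}

int priority(int cpu_use, int base, int nice, int group)
{
    /* higher numerical value means lower priority, as before */
    return cpu_use + base + nice + group_usage[group] / 4;
}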
To read more: Various other schemes for fair-share scheduling have been
proposed in the literature. Two very nice ones are lottery scheduling, which is
based on a probabilistic allocation of resources [16], and virtual time round
robin (VTRR), which uses a clever manipulation of the queue [12].
2.4 Summary
Abstractions
Implementation
Most modern operating systems support processes and threads directly. They
have one system call to create a process, and another to create an additional
thread within a process. Then there are many other system calls to manipulate
processes and threads. For example, this is the situation with Windows.
Linux is unique in this respect. It does not have a distinction between processes and
threads. In effect, it only has threads (but they are called “tasks”). New ones
are created by the clone system call, which is similar to fork, but provides
detailed control over exactly what is shared between the parent and child.
Resource management
Threads abstract the CPU, and this is the main resource that needs to be
managed. Scheduling — which is what “resource management” is in this
context — is a hard problem. It requires detailed knowledge, e.g. how long a
job will run, which is typically not available. And then it turns out to be NP-
complete.
However, this doesn’t mean that operating systems can’t do anything. The
main idea is to use preemption. This allows the operating system to learn
about the behavior of different jobs, and to reconsider its decisions
periodically. This is not as pompous as it sounds, and is usually embodied by
simple priority rules and simple data structures like multi-level feedback
queues.
Workload issues
Hardware support
There are two types of hardware support that are related to processes. One is
having a clock interrupt, and using this to regain control, to perform
preemptive scheduling, and to implement timers.
The other is support for isolation — preventing each process from seeing
stuff that belongs to another process. This is a main feature of memory
management mechanisms, so we will review it in Chapter 4.
Bibliography
[1] M. J. Bach, The Design of the UNIX Operating System. Prentice-Hall,
1986.
[8] G. J. Henry, “The fair share scheduler”. AT&T Bell Labs Tech. J. 63(8,
part 2), pp. 1845–1857, Oct 1984.
Appendix B
UNIX Processes
Unix started out as a very simple operating system, designed and
implemented by two programmers at AT&T. They subsequently received the
Turing award (the Nobel prize of computer science) for this work.
It took some time, but eventually AT&T turned Unix into a commercial
product. Some current versions of Unix, such as Solaris, have their roots in
that version. At the same time another variant of Unix was designed and
implemented at Berkeley, and was distributed under the name BSD (for
Berkeley Software Distribution). IBM wrote their own version, called AIX.
Linux is also an independent version, largely unrelated to the others. The
main unifying aspect of all these systems is that they support the same basic
system calls, although each has its own extensions.
One unique feature in Unix is how processes are created. This is somewhat
anachronistic today, and would probably be done differently if designed from
scratch. However, it is interesting enough for an appendix.
To read more: Unix was widely adopted in academia, and as a result there is a
lot of written material about it. Perhaps the best known classic is Bach’s book
on Unix System V [1]. Another well-known book is by McKusick et al., who
describe the BSD version [2].
The Unix equivalent of a PCB is the combination of two data structures. The
data items that the kernel may need at any time are contained in the process’s
entry in the process table (including priority information to decide when to
schedule the process). The data items that are only needed when the process
is currently running are contained in the process’s u-area (including the tables
of file descriptors and signal handlers). The kernel is designed so that at any
given moment the current process’s u-area is mapped to the same memory
addresses, and therefore the data there can be accessed uniformly without
process-related indirection.
Exercise 48 Should information about the user be in the process table or the
u-area? Hint: it’s in the process table. Why is this surprising? Can you
imagine why it is there anyway?
[Figure: Unix process states — running in user mode, running in kernel mode, ready (user and kernel), blocked, and zombie/terminated, with transitions for trap/return, interrupt, schedule, and preempt.]
Note that the running state has been divided into two: running in user mode
and in kernel mode. This is because Unix kernel routines typically run within
the context of the current user process, rather than having a separate
environment for the kernel. The ready state is also illustrated as two states:
one is for preempted processes that will continue to run in user mode when
scheduled, and the other is for processes that blocked in a system call and
need to complete the system call in kernel mode (the implementation actually
has only one joint ready queue, but processes in kernel mode have higher
priority and will run first). The zombie state is for processes that terminate,
but are still kept in the system. This is done in case another process will later
issue thewait system call and check for their termination.
When swapping is considered, even more states are added to the graph, as
shown here:
[Figure: the same state graph extended with swapping — ready and blocked processes may also be in "ready swapped" and "blocked swapped" states, returning to the in-memory states when swapped in or when the awaited event is done.]
Exercise 49 Why isn’t the blocked state divided into blocked in user mode
and blocked in kernel mode?
Exercise 50 The arrow from ready user to running user shown in this graph
does not really exist in practice. Why?
The fork system call duplicates a process
In Unix, new processes are not created from scratch. Rather, any process can
create a new process by duplicating itself. This is done by calling the fork
system call. The new process will be identical to its parent process: it has the
same data, executes the same program, and in fact is at exactly the same place
in the execution. The only differences are their process IDs and the return
value from the fork.
[Figure: the parent process and its PCB just before the fork. The code being executed is
13: pid = fork();
14: if (pid == 0) {
15:     /* child */
16: } else {
17:     /* parent */
18: }
The PCB records pid = 758, ppid = 699, uid = 31, PC = 13, and SP = 79, and points to the text, data, and stack segments; the data segment contains the variables x: 1, y: 3, and pid: undef.]
It has a process ID (pid) of 758, a user ID (uid) of 31, text, data, and stack
segments, and so on (the “pid” in the data segment is the name of the variable
that is assigned in instruction 13; the process ID is stored in the PCB). Its
program counter (PC) is on instruction 13, the call to fork.
Calling fork causes a trap to the operating system. From this point, the
process is not running any more. The operating system is running, in the fork
function. This function examines the process and duplicates it.
First, fork allocates all the resources needed by the new process. This
includes a new PCB and memory for a copy of the address space (including
the stack). Then the contents of the parent PCB are copied into the new PCB,
and the contents of the parent address space are copied into the new address
space. The text segment, with the program code, need not be copied. It is
shared by both processes.
[Figure: at the end of the fork there are two processes with separate PCBs. The data and stack segments have been copied (x: 1, y: 3, pid: undef in both), the text segment is shared, and both PCBs have PC = 52 and SP = 79.]
Note that as the system completes the fork, it is left with two ready processes:
the parent and the child. These will be scheduled at the discretion of the
scheduler. In principle, either may run before the other.
When the process is subsequently scheduled to run, it will start executing the
new program. Thus exec is also a very special system call: it is “a system call
that never returns”, because (if it succeeds) the context in which it was called
does not exist anymore.
Exercise 52 One of the few things that the new program should inherit from
the old one is the environment (the set of name, value pairs of environment
variables and their values). How can this be done if the whole address space
is re-initialized?
While exec replaces the program being run, it does not re-initialize the whole
environment. In particular, the new program inherits open files from its
predecessor. This is used when setting up pipes, and is the reason for
keeping fork and exec separate. It is described in more detail in Section 12.2.4.
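To tie fork and exec together, here is a minimal example of the common pattern, using the standard Unix system calls; error handling is abbreviated.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {
        /* child: replace the duplicated program with a new one.
           Open files (and the environment) are inherited across the exec. */
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("exec");           /* reached only if exec failed */
        exit(1);
    } else {
        /* parent: wait for the child, collecting its zombie */
        int status;
        waitpid(pid, &status, 0);
        printf("child %d terminated\n", (int)pid);
    }
    return 0;
}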
Bibliography
[1] M. J. Bach, The Design of the UNIX Operating System. Prentice-Hall, 1986.
[2] M. K. McKusick, K. Bostic, M. J. Karels, and J. S. Quarterman, The Design and Implementation of the 4.4BSD Operating System. Addison-Wesley, 1996.
Chapter 3
Concurrency
Given that the operating system supports multiple processes, there may be
various interactions among them. We will study three rather different types of
interactions:
• Access to shared operating system data structures.
This issue is concerned with internal operating system integrity. The problem
is that several processes may request the operating system to take related
actions, which require updates to internal operating system data structures.
Note that there is only one operating system, but many processes. Therefore
the operating system data structures are shared in some way by all the
processes. Updates to such shared data structures must be made with care, so
that data is not corrupted.
• Communication among processes.
This issue is concerned with the abstraction and services functionality of the
operating system. The point is that multiple processes may benefit from
interacting with each other, e.g. as part of a parallel or distributed application.
The operating system has to provide the mechanisms for processes to identify
each other and to move data from one to the other.
We’ll discuss the first two here, and the third in Chapter 12.
Consider adding a new element to a linked list. The code to insert the element
pointed to by new after the element pointed to by current is trivial and consists
of two statements:
new−>next = current−>next;
current->next = new;

[Figure: before and after — the new item first points at current's successor, and then current is made to point at the new item, linking it into the list.]
But what if the activity is interrupted between these two statements?
new->next has already been set to point to another element, but this may no longer
be valid when the activity resumes! For example, the intervening activity may
delete the pointed element from the list, and insert it into another list! Let’s
look at what happens in detail. Initially, there are two lists, and we are
inserting the new element into one of them:
[Figure: two lists, head1 and head2. After executing new->next = current->next, the new item points at the gray item that follows current in the first list.]
Now we are interrupted. The interrupting activity moves the gray item we are
pointing at into the other list, and updates current->next correctly. It doesn’t
know about our new item, because we have not made it part of the list yet:
[Figure: the gray item has been moved to the list headed by head2; our new item still points at it, but current->next no longer does.]
However, we don’t know about this intervention. So when we resume
execution we overwrite current->next and make it point to our new
item:
[Figure: executing current->next = new links the new item into the first list; since the new item points at the gray item, which is now in the second list, the two lists become merged.]
As a result, the two lists become merged from the gray element till their end,
and two items are completely lost: they no longer appear in any list! It should
be clear that this is very very bad.
Exercise 54 Can you think of a scenario in which only the new item is lost
(meaning that it is not linked to any list)?
Does this happen in real life? You bet it does. Probably the most infamous
example is from the software controlling the Therac-25, a radiation machine
used to treat cancer patients. A few patients died due to massive overdoses induced
by using the wrong data. The problem was traced to lack of synchronization
among competing threads [13].
The solution is that all the steps required to complete a multi-step action
must be done atomically, that is, as a single indivisible unit. The activity
performing them must not be interrupted. It will then leave the data structures
in a consistent state.
This idea is translated into program code by identifying critical sections that
will be executed atomically. Thus the code to insert an item into a linked list
will be
begin critical section;
new->next = current->next;
current->next = new;
end critical section;
begin critical section is a special piece of code with a strange behavior: if one
activity passes it, another activity that also tries to pass it will get stuck. As a
result only one activity is in the critical section at any time. This property is
called mutual exclusion. Each activity, by virtue of passing the begin critical
section code and being in the critical section, excludes all other activities
from also being in the critical section. When it finishes the critical section,
and executes the end critical section code, this frees another activity that was
stuck in the begin critical section. Now that other activity can enter the critical
section.
Exercise 55 Does the atomicity of critical sections imply that a process that is
in a critical section may not be preempted?
Mutual exclusion can be achieved by sophisticated algorithms
But how do you implement begin critical section and end critical section to
achieve the desired effect? In the 1960s the issue of how and whether
concurrent processes can coordinate their activities was a very challenging
question. The answer was that they can, using any of several subtle
algorithms. While these algorithms are not used in real systems, they have
become part of the core of operating system courses, and we therefore review
them here. More practical solutions (that are indeed used) are introduced
below.
The basic idea in all the algorithms is the same. They provide code that
implements begin critical section and end critical section by using an auxiliary
shared data structure. For example, we might use shared variables in which
each process indicates that it is going into the critical section, and then checks
that the others are not. With two processes, the code would be
process 1:
    going_in_1 = TRUE;
    while (going_in_2) /*empty*/;
    critical section
    going_in_1 = FALSE;

process 2:
    going_in_2 = TRUE;
    while (going_in_1) /*empty*/;
    critical section
    going_in_2 = FALSE;
Exercise 56 Except for the problem with deadlock, is this code at least
correct in the sense that if one process is in the critical section then it is
guaranteed that the other process will not enter the critical section?
process 1:
    going_in_1 = TRUE;
    turn = 2;
    while (going_in_2 && turn==2)
        /*empty*/;
    critical section
    going_in_1 = FALSE;

process 2:
    going_in_2 = TRUE;
    turn = 1;
    while (going_in_1 && turn==1)
        /*empty*/;
    critical section
    going_in_2 = FALSE;
This is based on using another shared variable that indicates whose turn it is
to get into the critical section. The interesting idea is that each process tries to
let the other process get in first. This solves the deadlock problem. Assume
both processes set their respective going_in variables to TRUE, as before. They
then both set turn to conflicting values. One of these assignments prevails, and
the process that made it then waits. The other process, the one whose
assignment to turn was overwritten, can then enter the critical section. When it
exits, it sets its going_in variable to FALSE, and then the waiting process can
get in.
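For concreteness, here is a sketch of the two-process algorithm in C11. Sequentially consistent atomics stand in for the idealized shared variables the algorithm assumes; with plain variables, a modern compiler and CPU could reorder the accesses and break the argument.

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool going_in[2] = {false, false};
atomic_int  turn        = 1;

void enter_critical(int me)          /* me is 0 or 1 */
{
    int other = 1 - me;
    atomic_store(&going_in[me], true);
    atomic_store(&turn, other);      /* try to let the other one go first */
    while (atomic_load(&going_in[other]) && atomic_load(&turn) == other)
        ;                            /* busy wait */
}

void leave_critical(int me)
{
    atomic_store(&going_in[me], false);
}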
However, note that several processes may be doing this at the same time, so
more than one process may end up with the same number. This is solved
when we compare our number with all the other processes, in the second for
loop. First, if we encounter a process that is in the middle of getting its ticket,
we wait for it to actually get the ticket. Then, if the ticket is valid and smaller
than ours, we wait for it to go through the critical section (when it gets out, it
sets its ticket to 0, which represents invalid). Ties are simply solved by
comparing the IDs of the two processes.
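This ticket-taking scheme is Lamport's bakery algorithm. A sketch for N processes is shown below; the shared arrays are assumed to behave as idealized shared memory (as with the previous algorithms, real hardware would also require memory barriers).

#define N 4

volatile int choosing[N];   /* process is in the middle of taking a ticket */
volatile int ticket[N];     /* 0 means no ticket / not interested */

static int max_ticket(void)
{
    int m = 0;
    for (int j = 0; j < N; j++)
        if (ticket[j] > m)
            m = ticket[j];
    return m;
}

void enter_critical(int i)
{
    choosing[i] = 1;
    ticket[i] = 1 + max_ticket();        /* not atomic, hence the checks below */
    choosing[i] = 0;

    for (int j = 0; j < N; j++) {
        while (choosing[j])
            ;                            /* wait until j actually has its ticket */
        /* wait while j holds a valid ticket that is smaller than ours,
           breaking ties by comparing process IDs */
        while (ticket[j] != 0 &&
               (ticket[j] < ticket[i] ||
                (ticket[j] == ticket[i] && j < i)))
            ;
    }
}

void leave_critical(int i)
{
    ticket[i] = 0;                       /* 0 represents an invalid ticket */
}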
Note, however, that incrementing the global variable is typically not atomic:
each process reads the value into a register, increments it, and writes it back.
Can this cause problems? Hint: if it had worked, we wouldn’t need the
loop.
There are four criteria for success
In summary, this algorithm has the following desirable properties:
1. Correctness: only one process is in the critical section at a time.
2. Progress: there is no deadlock, and if one or more processes are trying to get
into the critical section, some process will eventually get in.
After showing that the algorithms are convincing using rapid hand waving,
we now turn to a formal proof. The algorithm used is the n-process
generalization of Peterson’s algorithm, and the proof is due to Hofri [8].
The algorithm uses two global arrays, q[n] and turn[n − 1], both initialized to
all 0’s. Each process pi also has three local variables, j, k, and its index i. The
code is

1   for (j=1; j<n; j++) {
2       q[i] = j;
3       turn[j] = i;
4       while ((∃k≠i s.t. q[k]≥j) && (turn[j] == i))
5           /*empty*/;
6   }
7   critical section
8   q[i] = 0;
The generalization from the two-process case is that entering the critical
section becomes a multi-stage process. Thus the Boolean going_in variable is
replaced by the integer variable q[i], originally 0 to denote no interest in the
critical section, passing through the values 1 to n − 1 to reflect the stages of
the entry procedure, and finally hitting n in
the critical section itself. But how does this work to guarantee mutual
exclusion? Insight may be gained by the following lemmas.
Lemma 3.1 A process that is ahead of all others can advance by one stage.
Proof: The formal definition of process i being ahead is that ∀k≠i, q[k] <
q[i]. Thus the condition in line 4 is not satisfied, and j is incremented and
stored in q[i], thus advancing process i to the next stage.
Lemma 3.3 If there are at least two processes at stage j, there is at least one
process at each stage k ∈ {1, . . . , j − 1}.
Proof: The base step is for j = 2. Given a single process at stage 2, another
process can join it only by leaving behind a third process in stage 1 (by
Lemma 3.2). This third process stays stuck there as long as it is alone, again
by Lemma 3.2. For the induction step, assume the Lemma holds for stage j −
1. Given that there are two processes at stage j, consider the instance at which
the second one arrived at this stage. By Lemma 3.2, at that time there was at
least one other process at stage j − 1; and by the induction assumption, this
means that all the lower stages were also occupied. Moreover, none of these
stages could have been vacated since then, due to Lemma 3.2.
Using these, we can envision how the algorithm works. The stages of the
entry protocol are like the rungs of a ladder that has to be scaled in order to
enter the critical section. The algorithm works by allowing only the top
process to continue scaling the ladder. Others are restricted by the
requirement of having a continuous link of processes starting at the bottom
rung. As the number of rungs equals the number of processors minus 1, they
are prevented from entering the critical section. Formally, we can state the
following:
occupied, and the critical section is vacant. One of the two processes at stage
n − 1 can therefore enter the critical section. The other will then stay at stage n
− 1 because the condition in line 4 of the algorithm holds. Thus the integrity
of the critical section is maintained again, and correctness is proved.
Progress follows immediately from the fact that some process must be either
ahead of all others, or at the same stage as other processes and not the last to
have arrived. For this process the condition in line 4 does not hold, and it
advances to the next stage.
Fairness follows from the fact that the last process to reach a stage is the one
that cannot proceed, because the stage’s turn cell is set to its ID. Consider
process p, which is the last to arrive at the first stage. In the worst case, all
other processes can be ahead of it, and they will enter the critical section first.
But if they subsequently try to enter it again, they will be behind process p. If
it tries to advance, it will therefore manage to do so, and one of the other
processes will be left behind. Assuming all non-last processes are allowed to
advance to the next stage, process p will have to wait for no more than n − j
other processes at stage j, and other processes will overtake it no more than n
times each.

To read more: A full description of lots of wrong mutual exclusion algorithms is
given by Stallings [16, Sect. 5.2]. The issue of sophisticated algorithms for mutual
exclusion has been beaten to death by Lamport [10, 11, 12].
Because the hardware guarantees that the test and set is atomic, it guarantees
that only one process will see the value 0. When that process sees 0, it
atomically sets the value to 1, and all other processes will see 1. They will
then stay in the while loop until the first process exits the critical section, and
sets the bit to 0. Again, only one other process will see the 0 and get into the
critical section; the rest will continue to wait.
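In C11 this mechanism is available through the atomic_flag type, whose test-and-set operation is guaranteed to be atomic; a spin lock is then just a few lines. This is a sketch, with the usual caveat that busy waiting is appropriate only for very short critical sections.

#include <stdatomic.h>

atomic_flag lock_bit = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* test_and_set returns the previous value: only the caller that saw 0 enters */
    while (atomic_flag_test_and_set(&lock_bit))
        ;   /* busy wait until the holder clears the bit */
}

void spin_unlock(void)
{
    atomic_flag_clear(&lock_bit);
}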
An alternative atomic instruction is compare and swap, which may be defined as follows:

compare_and_swap(x, old, new) {
    if (*x == old) {
        *x = new;
        return SUCCESS;
    } else
        return FAIL;
}

where x is a pointer to a variable, old and new are values, and the whole thing
is done atomically by the hardware. How can you implement a critical
section using this instruction?
Algorithms for mutual exclusion are tricky, and it is difficult to verify their
exact properties. Using hardware primitives depends on their availability in
the architecture. A better solution from the perspective of operating system
design is to use some more abstract mechanism.
The mechanism that captures the abstraction of inter-process synchronization
is the semaphore, introduced by Dijkstra [1]. Semaphores are a new data
type, that provides only two operations:
• The P operation checks whether the semaphore is free. If it is, it occupies the
semaphore. But if the semaphore is already occupied, it waits till the semaphore
becomes free.
• The V operation releases an occupied semaphore, and frees one blocked
process (if there are any blocked processes waiting for this semaphore).
P(mutex);
critical section
V(mutex);
class semaphore {
    int value = 1;
    P() {
        if (--value < 0)
            block_this_proc();
    }
    V() {
        if (value++ < 0)
            resume_blocked_proc();
    }
}
The reason for calling the operations P and V is that they are abbreviations of
the Dutch words “proberen” (to try) and “verhogen” (to elevate). Speakers of
Hebrew have the advantage of regarding them as abbreviations of Hebrew words
as well. In English, the words wait and signal are sometimes used instead of P and V,
respectively.
The power of semaphores comes from the way in which they capture the
essence of synchronization, which is this: the process needs to wait until a
certain condition allows it to proceed. The important thing is that the
abstraction does not specify how to implement the waiting.
All the solutions we saw so far implemented the waiting by burning cycles.
Any process that had to wait simply sat in a loop, and continuously checked
whether the awaited condition was satisfied — which is called busy waiting.
In a uniprocessor system with multiple processes this is very very inefficient.
When one process is busy waiting for another, it is occupying the CPU all the
time. Therefore the awaited process cannot run, and cannot make progress
towards satisfying the desired condition.
Semaphores have been very successful, and since their introduction in the
context of operating system design they have been recognized as a generally
useful construct for concurrent programming. This is due to the combination
of two things: that they capture the abstraction of needing to wait for a
condition or event, and that they can be implemented efficiently as shown
above.
Moreover, the notion of having to wait for a condition is not unique to critical
sections (where the condition is “no other process is executing this code”).
Thus semaphores can be used for a host of other things.
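For example, a POSIX semaphore can be used to wait for an event rather than for mutual exclusion: in the following sketch the consumer blocks until the producer signals that an item is ready. A single producer and a single consumer are assumed, so the buffer indices themselves need no further protection.

#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

sem_t items;                 /* counts items that are ready */
int   buffer[100];
int   in = 0, out = 0;

void *producer(void *arg)
{
    for (int i = 0; i < 100; i++) {
        buffer[in++] = i;    /* produce an item */
        sem_post(&items);    /* V: wake a waiting consumer, if any */
    }
    return NULL;
}

void *consumer(void *arg)
{
    for (int i = 0; i < 100; i++) {
        sem_wait(&items);    /* P: block until an item is available */
        printf("%d\n", buffer[out++]);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    sem_init(&items, 0, 0);  /* initially no items */
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}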
Another abstraction that was introduced for operating system design is that of
monitors [7]. A monitor encapsulates some state, and provides methods that
operate on that state. In addition, it guarantees that these methods are
executed in a mutually exclusive manner. Thus if a process tries to invoke a
method from a specific monitor, and some method of that monitor is already
being executed by another process, the first process is blocked until the
monitor becomes available.
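C has no built-in monitors, but the idea can be approximated by bundling the state with a mutex and taking the mutex in every "method", as in this sketch (the counter example itself is made up for illustration):

#include <pthread.h>

struct counter_monitor {
    pthread_mutex_t lock;
    int value;
};

void counter_init(struct counter_monitor *m)
{
    pthread_mutex_init(&m->lock, NULL);
    m->value = 0;
}

void counter_inc(struct counter_monitor *m)
{
    pthread_mutex_lock(&m->lock);    /* blocks if another method is active */
    m->value++;
    pthread_mutex_unlock(&m->lock);
}

int counter_get(struct counter_monitor *m)
{
    pthread_mutex_lock(&m->lock);
    int v = m->value;
    pthread_mutex_unlock(&m->lock);
    return v;
}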
First, the mutual exclusion problem can be solved using a bare bones
approach. It is possible to devise algorithms that only read and write shared
variables to determine whether they can enter the critical section, and use
busy waiting to delay themselves if they cannot.
Disabling interrupts can be viewed as locking the CPU: the current process
has it, and no other process can gain access. Thus the practice of blocking all
interrupts whenever a process runs in kernel mode is akin to defining a single
lock, that protects the whole kernel. But this may be stronger than what is
needed: for example, if one process wants to open a file, this should not
interfere with another that wants to allocate new memory.
The solution is therefore to define multiple locks, that each protect a part of
the kernel. These can be defined at various granularities: a lock for the whole
file system (and by implication, for all the data structures involved in the file
system implementation), a lock for a single large data structure such as a
system table, or a lock for a single entry within such a table. By holding
only the necessary locks, the restrictions on other processes are reduced.
Note the shift in focus from the earlier parts of this chapter: we are no longer
talking about critical sections of code that should be executed in a mutually
exclusive manner. Instead, we are talking of data structures which need to be
used in a mutually exclusive manner. This shifts the focus from the artifact
(the code handling the data structure) to the substance (the data structure’s
function and why it is being accessed).
This shift in focus is extremely important. To drive the point home, consider
the following example. As we know, the operating system is a reactive
program, with many different functions that may be activated under diverse
conditions. In particular, we may have
• The scheduler, which is called by the clock interrupt handler. When called, it
scans the list of ready processes to find the one that has the highest priority and
schedules it to run.
• The disk interrupt handler, which wakes up the process that initiated the I/O
activity and places it on the ready queue.
• The function that implements the creation of new processes, which allocates a
PCB for the new process and links it to the ready queue so that it will run when
the scheduler selects it.
All these examples and more are code segments that manipulate the ready
queue and maybe some other data structures as well. To ensure that these
data structures remain consistent, they need to be locked. In particular, all
these functions need to lock the ready queue. Mutually exclusive execution of
the code segments won’t work, because they are distinct code fragments to
begin with.
The first problem is called priority inversion. This happens when a low
priority process holds a lock, and a higher priority process tries to acquire the
same lock. Given the semantics of locks, this leads to a situation where a
high-priority process waits for a low-priority one, and worse, also for
processes with intermediate priorities that may preempt the low-priority
process and thus prevent it from releasing the lock. One possible solution is
to temporarily elevate the priority of the process holding the lock to that of
the highest-priority process that also wants to obtain it.
The second problem is one of deadlocks, where a set of processes all wait for
each other and none can make progress. This will be discussed at length in
Section 3.2.
The problem with using these atomic instructions is that we are back to using
busy waiting. This is not so harmful for performance because it is only used
to protect very small critical sections, that are held for a very short time. A
common scenario is to use busy waiting with test and set on a variable that
protects the implementation of a semaphore. The protected code is very short:
just check the semaphore variable, and either lock it or give up because it has
already been locked by someone else. Blocking to wait for the semaphore to
become free is a local operation, and need not be protected by the test and set
variable.
3.2 Resource Contention and Deadlock
One of the main functions of an operating system is resource allocation. But
when multiple processes request and acquire multiple resources, deadlock
may ensue.
For example, consider a function that moves an item from one list to another.
To do so, it must lock both lists. Now assume that one process wants to move
an item from list A to list B, and at the same time another process tries to
move an item from list B to list A. Both processes will lock the source list
first, and then attempt to lock the destination list. The problem is that the first
activity may be interrupted after it locks list A. The second activity then runs
and locks list B. It cannot also lock list A, because list A is already locked, so
it has to wait. But the first activity, which is holding the lock on list A, is also
stuck, because it cannot obtain the required lock on list B. Thus the two
activities are deadlocked waiting for each other.
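A common way to avoid this particular deadlock is to impose a fixed order on lock acquisition, e.g. by address, so that both activities lock the two lists in the same order. The following is only a sketch; the list type and the helper functions are assumptions for illustration.

#include <pthread.h>

struct item;
struct list {
    pthread_mutex_t lock;
    struct item *head;
};

void list_remove(struct list *l, struct item *it);   /* hypothetical helpers */
void list_insert(struct list *l, struct item *it);

void move_item(struct list *from, struct list *to, struct item *it)
{
    struct list *first  = (from < to) ? from : to;
    struct list *second = (from < to) ? to   : from;

    pthread_mutex_lock(&first->lock);    /* all callers lock in the same order, */
    pthread_mutex_lock(&second->lock);   /* so they cannot wait for each other  */

    list_remove(from, it);
    list_insert(to, it);

    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}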
The problem is that the following scenario is possible. All the philosophers
become hungry at the same time, and troop into the dining room. They all sit
down in their places, and pick up the forks to their left. Then they all try to
pick up the forks to their right, only to find that those forks have already been
picked up (because they are also another philosopher’s left fork). The
philosophers then continue to sit there indefinitely, each holding onto one
fork, and glaring at his neighbor. They are deadlocked.
For the record, in operating system terms this problem represents a set of five
processes (the philosophers) that contend for the use of five resources (the
forks). It is highly structured in the sense that each resource is potentially
shared by only two specific processes, and together they form a cycle. The
spaghetti doesn’t represent anything.
Exercise 71 Before reading on, can you think of how to solve this?
Luckily, several solutions that do indeed work have been proposed. One is to
break the cycle by programming one of the philosophers differently. For
example, while all the philosophers pick up their left fork first, one might
pick up his right fork first. Another is to add a footman, who stands at the
dining room door and only allows four philosophers in at a time. A third
solution is to introduce randomization: each philosopher that finds his right
fork unavailable will relinquish his left fork, and try again only after a
random interval of time.
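The footman solution is easy to express with semaphores: a counting semaphore initialized to four limits how many philosophers may try to pick up forks at the same time, so a cycle of five philosophers each holding one fork cannot form. A sketch using POSIX semaphores:

#include <semaphore.h>

#define N 5

sem_t fork_sem[N];    /* one binary semaphore per fork */
sem_t footman;        /* counting semaphore initialized to N-1 */

void init(void)
{
    for (int i = 0; i < N; i++)
        sem_init(&fork_sem[i], 0, 1);
    sem_init(&footman, 0, N - 1);
}

void philosopher_eats(int i)
{
    sem_wait(&footman);               /* at most N-1 philosophers at the table */
    sem_wait(&fork_sem[i]);           /* left fork */
    sem_wait(&fork_sem[(i + 1) % N]); /* right fork */
    /* eat */
    sem_post(&fork_sem[(i + 1) % N]);
    sem_post(&fork_sem[i]);
    sem_post(&footman);
}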
Exercise 74 What are examples of resources (in a computer system) of which
there is a single instance? A few instances? Very many instances?
Exercise 75 A computer system has three printers attached to it. Should they
be modeled as three instances of a generic “printer” resource type, or as
separate resource types?
Assuming requests represent resources that are really needed by the process,
an unsatisfied request implies that the process is stuck. It is waiting for the
requested resource to become available. If the resource is held by another
process that is also stuck, then that process will not release the resource. As a
result the original process will stay stuck. When a group of processes are
stuck waiting for each other in this way, they will stay in this situation
forever. We say they are deadlocked.
The first two are part of the semantics of what resource allocation means. For
example, consider a situation in which you have been hired to run errands,
and given a bicycle to do the job. It would be unacceptable if the same
bicycle was given to someone else at the same time, or if it were taken away
from under you in mid trip.
The third condition is a rule of how the system operates. As such it is subject
to modifications, and indeed some of the solutions to the deadlock problem
are based on such modifications.
The fourth condition may or may not happen, depending on the timing of
various requests. If it does happen, then it is unresolvable because of the first
three.
[Figure: a resource allocation graph with resources R1, R2, and R3.]
Exercise 76 Under what conditions is having a cycle a sufficient condition?
Think of conditions relating to both processes and resources.
One option is to annul the “hold and wait” condition. This can be done in
several ways. Instead of acquiring resources one by one, processes may be
required to give the system a list of all the resources they will need at the
outset. The system can then either provide all the resources immediately, or
block the process until a later time when all the requested resources will be
available. However, this means that processes will hold on to their resources
for more time than they actually need them, which is wasteful.
In some cases we can nullify the second condition, and allow resources to be
preempted. For example, in Section 4.4 we will introduce swapping as a
means to deal with overcommitment of memory. The idea is that one process
is selected as a victim and swapped out. This means that its memory contents
are backed up on disk, freeing the physical memory it had used for use by
other processes.
Exercise 80 Consider an intersection of two roads, with cars that may come
from all 4 directions. Is there a simple uniform rule (such as “always give
right-of-way to someone that comes from your right”) that will prevent
deadlock? Does this change if a traffic circle is built?
In order to identify safe states, each process must declare in advance what its
maximal resource requirements may be. Thus the algorithm works with the
following data structures:
• The maximal requirements of each process Mp,
• The current allocation to each process Cp, and
• The currently available resources A.
These are all vectors representing all the resource types. For example, if there
are three resource types, and three units are available from the first, none
from the second, and one from the third, then A = (3, 0, 1).
Assume the system is in a safe state, and process p makes a request R for
more resources (this is also a vector). We tentatively update the system state
as if we performed the requested allocation, by doing
Cp = Cp + R
A = A − R

where operations on vectors are done on the respective elements (e.g. X + Y is
the vector (x1 + y1, x2 + y2, . . . , xk + yk)). We now check whether this new state
is also safe. If it is, we will really perform the allocation; if it is not, we will
block the requesting process and make it wait until the allocation can be
made safely.
To check whether the new state is safe, we need to find an ordering of the
processes such that the system has enough resources to satisfy the maximal
requirements of each one in its turn. As noted above, it is assumed that the
process will then complete and release all its resources. The system will then
have a larger pool to satisfy the maximal requirements of the next process,
and so on. This idea is embodied by the following pseudo-code, where P is
initialized to the set of all processes:
while (P ≠ ∅) {
    found = FALSE;
    foreach p ∈ P {
        if (Mp − Cp ≤ A) {
            /* p can obtain all it needs, terminate, */
            /* and release what it already has */
            A = A + Cp;
            P = P − {p};
            found = TRUE;
        }
    }
    if (! found) return FAIL;
}
return OK;
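The same check can be written in C; the following sketch uses fixed-size arrays for the vectors M, C, and A defined above (the dimensions and the function name are illustrative).

#include <stdbool.h>

#define NPROC 3
#define NRES  3

bool is_safe(int M[NPROC][NRES], int C[NPROC][NRES], int A[NRES])
{
    int  avail[NRES];
    bool done[NPROC] = {false};

    for (int r = 0; r < NRES; r++)
        avail[r] = A[r];

    for (int left = NPROC; left > 0; ) {
        bool found = false;
        for (int p = 0; p < NPROC; p++) {
            if (done[p])
                continue;
            bool can_finish = true;
            for (int r = 0; r < NRES; r++)
                if (M[p][r] - C[p][r] > avail[r])
                    can_finish = false;
            if (can_finish) {
                /* p can obtain all it needs, terminate,
                   and release what it already holds */
                for (int r = 0; r < NRES; r++)
                    avail[r] += C[p][r];
                done[p] = true;
                found = true;
                left--;
            }
        }
        if (!found)
            return false;   /* no ordering exists: the state is unsafe */
    }
    return true;            /* a safe ordering was found */
}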
[Figure: a resource allocation graph with processes P1, P2, and P3 and resources R1, R2, and R3, corresponding to the state described below.]
This state is described by the following vectors:

C1 = (0, 0, 1)    M1 = (3, 0, 1)
C2 = (1, 0, 1)    M2 = (2, 1, 1)
C3 = (0, 1, 0)    M3 = (0, 1, 1)
A  = (3, 0, 0)

In the first iteration of the algorithm we find that P1 can acquire all its potential
requests (M1 − C1 = (3, 0, 0) ≤ A), so it can run to completion and release what it
holds, leaving A = (3, 0, 1). This is not good enough to fill all the requests of P2,
but we can fulfill all of P3’s potential requests (there is only one: a request for an
instance of R3). So P3 will be able to run to completion, and will then release its
instance of R2. So in the second iteration of the external while loop we will find
that P2 too can acquire all the resources it needs and terminate.
As another example, consider the initial situation again, but this time consider
what will happen if P2 requests an instance of resource R1. This request
cannot be granted, as it will lead to an unsafe state, and therefore might lead
to deadlock. More formally, granting such a request by P2 will lead to the
state

C1 = (0, 0, 1)    M1 = (3, 0, 1)
C2 = (2, 0, 1)    M2 = (2, 1, 1)
C3 = (0, 1, 0)    M3 = (0, 1, 1)
A  = (2, 0, 0)
In this state the system does not have sufficient resources to grant the
maximal requirements of any of the processes: M1 − C1 = (3, 0, 0),
M2 − C2 = (0, 1, 0), and M3 − C3 = (0, 0, 1), none of which is ≤ A = (2, 0, 0).
Therefore the first iteration of the algorithm will fail to find a process that can
in principle obtain all its required resources first. Thus the request of process
P2 will not be granted, and the process will be blocked. This will allow P1
and P3 to make progress, and eventually, when either of them terminate, it
will be safe to make the allocation to P2.
Exercise 82 Show that in this last example it is indeed safe to grant the
request of P2 when either P1 or P3 terminate.
Exercise 83 What can you say about the following modifications to the above
graph: 1. P1 may request another instance of R3 (that is, add a dashed arrow
from P1 to R3)
While deadlock situations and how to handle them are interesting, most
operating systems actually don’t really employ any sophisticated
mechanisms. This is what Tanenbaum picturesquely calls the ostrich
algorithm: pretend the problem doesn’t exist [18]. However, in reality the
situation may actually not be quite so bad. Judicious application of simple
mechanisms can typically prevent the danger of deadlocks.
Computational resources: these are the resources needed in order to run the
program. They include the resources we usually talk about, e.g. the CPU,
memory, and disk space. They also include kernel data structures such as
page tables and entries in the open files table.
For example, in Unix an important resource a process may hold onto are
entries in system tables, such as the open files table. If too many files are
open already, and the table is full, requests to open additional files will fail.
The program can then either abort, or try again in a loop. If the program
continues to request that the file be opened in a loop, it might stay stuck
forever, because other programs might be doing the same thing. However it is
more likely that some other program will close its files and terminate,
releasing its entries in the open files table, and making them available for
others. Being stuck indefinitely can also be avoided by only trying a finite
number of times, and then giving up.
For example, kernel code that needs to lock two run queues (rq1 and rq2) may do so along these lines:

if (rq1 < rq2) {
    spinlock(&rq1->lock);
    spinlock(&rq2->lock);
} else {
    spinlock(&rq2->lock);
    spinlock(&rq1->lock);
}

What is this for?
3.3 Lock-Free Programming

An alternative to locking is lock-free programming, which relies on atomic instructions such as compare and swap:

compare_and_swap(x, old, new) {
    if (*x == old) {
        *x = new;
        return SUCCESS;
    } else
        return FAIL;
}

where x is a pointer to a variable, and old and new are values; what this does is
to verify that *x (the variable pointed to by x) has the expected old value, and
if it does, to swap this with a new value. Of course, the whole thing is done
atomically by the hardware, so the value cannot change between the check
and the swap.
To see why this is useful, consider the same example from the beginning of
the chapter: adding an element to the middle of a linked list. The regular code
to add an element new after the element current is
new->next = current->next;
current->next = new;
But as we saw, the list could become corrupted if some other process changes
it between these two instructions. The alternative, using compare and swap, is
as follows:

new->next = current->next;
compare_and_swap(&current->next, new->next, new);
This first creates a link from the new item to the item after where it is
supposed to be inserted. But then the insertion itself is done atomically by the
compare and swap; and moreover, this is conditioned on the fact that the
previous pointer, current->next, did not change in the interim.
Of course, using compare and swap does not guarantee that another process
will not barge in and alter the list. So the compare and swap may fail. It is
therefore imperative to check the return value from the compare and swap, in
order to know whether the desired operation had been performed or not. In
particular, the whole thing can be placed in a loop that retries the compare
and swap operation until it succeeds:
do {
    current = element after which to insert new;
    new->next = current->next;
} until (compare_and_swap(&current->next, new->next, new));
Exercise 89 Why did we also put the assignment to current and to new->next in
the loop?
The code shown here is lock-free: it achieves the correct results without using
locks and without causing one process to explicitly wait for another.
However, a process may have to retry its operation again and again an
unbounded number of times. It is also possible to devise algorithms and data
structures that are wait-free, which means that all participating processes will
complete their operations successfully within a bounded number of steps.
Exercise 90 Given that a process may need to retry the compare and swap
many times, is there any benefit relative to locking with busy waiting?
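As a concrete sketch, C11 expresses the retry loop with its compare-and-exchange primitive. The node type and the find_insertion_point helper are assumptions for illustration, and a full lock-free list (supporting deletion as well) is considerably more subtle.

#include <stdatomic.h>
#include <stdbool.h>

struct node {
    int value;
    _Atomic(struct node *) next;
};

struct node *find_insertion_point(struct node *head, int value);  /* hypothetical */

void insert(struct node *head, struct node *new)
{
    struct node *current, *succ;
    do {
        current = find_insertion_point(head, new->value);
        succ = atomic_load(&current->next);
        atomic_store(&new->next, succ);
        /* succeeds only if current->next still equals succ;
           otherwise succ is refreshed and we retry */
    } while (!atomic_compare_exchange_weak(&current->next, &succ, new));
}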
3.4 Summary
Abstractions
Resource management
Workload issues are not very prominent with regard to the topics discussed
here, but they do exist. For example, knowing the distribution of lock waiting
times can influence the choice of a locking mechanism. Knowing that
deadlocks are rare allows for the problem to be ignored.
Hardware support
Bibliography
[1] E. W. Dijkstra, “Co-operating sequential processes”. In Programming
Languages, F. Genuys (ed.), pp. 43–112, Academic Press, 1968.
Chapter 4

Memory Management
Primary memory is a prime resource in computer systems. Its management is
an important function of operating systems. However, in this area the
operating system depends heavily on hardware support, and the policies and
data structures that are used are dictated by the support that is available.
Actually, three players are involved in handling memory. The first is the
compiler, which structures the address space of an application. The second is
the operating system, which maps the compiler’s structures onto the
hardware. The third is the hardware itself, which performs the actual accesses
to memory locations.
Stack — this is the area in memory used to store the execution frames of
functions called by the program. Such frames contain stored register values
and space for local variables. Storing and loading registers is done by the
hardware as part of the instructions to call a function and return from it (see
Appendix A). Again, this region is enlarged at runtime as needed.
In some systems, there may be more than one instance of each of these
regions. For example, when a process includes multiple threads, each will
have a separate stack. Another common example is the use of dynamically
linked libraries; the code for each such library resides in a separate segment,
that — like the text segment — can be shared with other processes. In
addition, the data and heap may each be composed of several independent
segments that are acquired at different times along the execution of the
application. For example, in Unix it is possible to create a new segment using
the shmget system call, and then multiple processes may attach this segment
to their address spaces using the shmat system call.
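A minimal example of these calls (System V shared memory; error handling abbreviated):

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* create a new 4 KB segment, readable and writable by the owner */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0) {
        perror("shmget");
        return 1;
    }

    /* attach it to this process's address space; a child created by fork,
       or another process that knows the id, could attach the same segment */
    char *p = shmat(shmid, NULL, 0);
    if (p == (char *)-1) {
        perror("shmat");
        return 1;
    }

    strcpy(p, "hello from shared memory");
    printf("%s\n", p);

    shmdt(p);                          /* detach */
    shmctl(shmid, IPC_RMID, NULL);     /* mark the segment for removal */
    return 0;
}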
The compiler creates the text and data segments as relocatable segments, that
is with addresses from 0 to their respective sizes. Text segments of libraries
that are used by the program also need to be included. The heap and stack are
only created as part of creating a process to execute the program. All these
need to be mapped to the process address space.
The sum of the sizes of the memory regions used by a process is typically
much smaller than the total number of addresses it can generate. For example,
in a 32-bit architecture, a process can generate addresses ranging from 0 to
2^32 − 1 = 4,294,967,295, which is 4 GB, of which say 3 GB are available to
the user program (the remainder is left for the system, as described in
Section 11.7.1). The used regions must be
mapped into this large virtual address space. This mapping assigns a virtual
address to every instruction and data structure. Instruction addresses are used
as targets for branch instructions, and data addresses are used in pointers and
indexed access (e.g. the fifth element of the array is at the address obtained
by adding four element sizes to the array base address).
[Figure: the 3 GB user address space, with the text segment at address 0, the data and heap above it growing upward, the stack at the top growing downward, and an unused region in between.]

Mapping static regions, such as the text, imposes no problems. The problem is
with regions that may grow at runtime, such as the heap and the stack. These
regions should be mapped so as to leave them ample room to grow. One possible
solution is to map the heap and stack so that they grow towards each other. This
allows either of them to grow much more than the other, without having to know
which in advance. No such solution exists if there are multiple stacks.
The addresses generated by the compiler are relative and virtual. They are
relative because they assume the address space starts from address 0. They are
virtual because they are based on the assumption that the application has all
the address space from 0 to 3GB at its disposal, with a total disregard for
what is physically available. In practice, it is necessary to map these
compiler-generated addresses to physical addresses in primary memory,
subject to how much memory you bought and contention among different
processes.
In bygone days, dynamic memory allocation was not supported. The size of
the used address space was then fixed. The only support that was given was
to map a process’s address space into a contiguous range of physical
addresses. This was done by the loader, which simply set the base address
register for the process. Relative addresses were then interpreted by adding
the base address to each of them.
1 This means that this register is part of the definition of the architecture of the machine, i.e. how it
works and what services it provides to software running on it. It is not a general purpose register used
for holding values that are part of the computation.
Exercise 93 What are the consequences of not doing the bounds check?
Multiple segments can be supported by using a table
Using a special base register and a special length register is good for
supporting a single contiguous segment. But as noted above, we typically
want to use multiple segments for structuring the address space. To enable
this, the segment base and size values are extracted from a segment table
rather than from special registers. The problem is then which table entry to
use, or in other words, to identify the segment being accessed. To do so an
address has to have two parts: a segment identifier, and an offset into the
segment. The segment identifier is used to index the segment table and
retrieve the segment base address and size, and these can then be used to find
the physical address as described above.
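In pseudo-hardware terms, the translation can be sketched as follows; the 8-bit segment field and the struct layout are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

struct segment {
    uint32_t base;     /* physical base address */
    uint32_t size;     /* segment length */
    bool     valid;
};

#define SEG_BITS    8                      /* example split of a 32-bit address */
#define OFFSET_BITS (32 - SEG_BITS)

bool translate(struct segment table[], uint32_t vaddr, uint32_t *paddr)
{
    uint32_t seg    = vaddr >> OFFSET_BITS;
    uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);

    if (!table[seg].valid || offset >= table[seg].size)
        return false;                      /* illegal access: trap */
    *paddr = table[seg].base + offset;
    return true;
}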
Exercise 94 Can the operating system decide that it wants to use multiple
segments, or does it need hardware support?
[Figure: address translation using a segment table. A special register points to the current process's segment table; the segment part of the address selects an entry, whose base address is added to the offset to give the physical address. Memory belonging to other processes, or unused, is not mapped.]
Note that using such a register to point to the segment table also leads to the
very important property of memory protection: when a certain process runs,
only its segment table is available, so only its memory can be accessed. The
mappings of segments belonging to other processes are stored in other
segment tables, so they are not accessible and thus protected from any
interference by the running process.
The contents of the tables, i.e. the mapping of the segments into physical
memory, are set by the operating system using algorithms described next.
When searching for a free area of memory that is large enough for a new
segment, several algorithms can be used:
First-fit scans the list from its beginning and chooses the first free area that is
big enough.
Best-fit scans the whole list, and chooses the smallest free area that is big
enough. The intuition is that this will be more efficient and only waste a little
bit each time. However, this is wrong: first-fit is usually better, because best-
fit tends to create multiple small and unusable fragments [11, p. 437].
Worst-fit is the opposite of best-fit: it selects the biggest free area, with the
hope that the part that is left over will still be useful.
Next-fit is a variant of first-fit that starts each scan from the point where the
previous scan left off, rather than from the beginning of the list.
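A sketch of first-fit allocation from a linked list of free areas is given below (best-fit would instead scan the whole list and remember the smallest adequate area). The structure is illustrative; freeing and merging of areas are omitted.

#include <stddef.h>

struct free_area {
    size_t start;
    size_t size;
    struct free_area *next;
};

/* returns the start address of the allocated area, or (size_t)-1 on failure */
size_t alloc_first_fit(struct free_area **list, size_t size)
{
    for (struct free_area **pp = list; *pp != NULL; pp = &(*pp)->next) {
        struct free_area *a = *pp;
        if (a->size >= size) {
            size_t start = a->start;
            a->start += size;              /* carve the allocation off the front */
            a->size  -= size;
            if (a->size == 0)
                *pp = a->next;             /* area fully used: unlink it
                                              (freeing the node itself is omitted) */
            return start;
        }
    }
    return (size_t)-1;                     /* no free area is big enough */
}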
Exercise 96 Given the following memory map (where gray areas are
allocated), what was the last segment that was allocated assuming the first-fit
algorithm was used? and what if best-fit was used?
[Figure: a memory map with area boundaries at 0, 15, 30, 48, 73, 78, 80, 86, 96, 118, and 128; gray areas are allocated.]
Exercise 97 Given a certain situation (that is, memory areas that are free and
allocated), and a sequence of requests that can be satisfied by first-fit, is it
always true that these requests can also be satisfied by best-fit? How about
the other way around? Prove these claims or show counterexamples.
The complexity of first-fit and best-fit depends on the length of the list of free
areas, which becomes longer with time. An alternative with constant
complexity is to use a buddy system.
Fragmentation is a problem
• Internal fragmentation occurs when the system allocates more than the process
requested and can use. For example, this happens when the size of the address
space is increased to the next power of two in a buddy system.
• External fragmentation occurs when unallocated pieces of memory are left
between allocated areas, and these unallocated pieces are all too small to satisfy
any additional requests (even if their sum may be large enough).
Exercise 98 Does next-fit suffer from internal fragmentation, external
fragmentation, or both? And how about a buddy system?
Modern algorithms for memory allocation from the heap take usage into
account. In particular, they assume that if a memory block of a certain size
was requested and subsequently released, there is a good chance that the
same size will be requested again in the future (e.g. if the memory is used for
a new instance of a certain object). Thus freed blocks are kept in lists in
anticipation of future requests, rather than being merged together. Such reuse
reduces the creation of additional fragmentation, and also reduces overhead.
To read more: Jacob and Mudge provide a detailed survey of modern paging
schemes and their implementation [9]. This material does not appear in most
textbooks on computer architecture.
The basic idea in paging is that memory is mapped and allocated in small,
fixed size units (the pages). A typical page size is 4 KB. Addressing a byte of
memory can then be interpreted as being composed of two parts: selecting a
page, and specifying an offset into the page. For example, with 32-bit
addresses and 4KB pages, 20 bits indicate the page, and the remaining 12
identify the desired byte within the page. In mathematical notation, we have
page = ⌊ address / 4096 ⌋
[Figure: the virtual address of x is split into a page and an offset; the page table maps the page to one of the frames of the (say 1 GB) physical memory, where x is stored.]
Note that the hardware does not need to perform mathematical operations to
figure out which frame is needed and the offset into this frame. By virtue of
using bits to denote pages and offsets, it just needs to select a subset of the
bits from the virtual address and append them to the bits of the frame number.
This generates the required physical address that points directly at the desired
memory location.
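In code, this bit manipulation is just a shift and a mask (assuming 4 KB pages and 32-bit addresses):

#include <stdint.h>

#define PAGE_SHIFT 12                              /* 4096 = 2^12 */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

uint32_t page_of(uint32_t vaddr)   { return vaddr >> PAGE_SHIFT; }
uint32_t offset_of(uint32_t vaddr) { return vaddr & PAGE_MASK; }

uint32_t physical_address(uint32_t frame, uint32_t vaddr)
{
    return (frame << PAGE_SHIFT) | offset_of(vaddr);
}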
[Figure: the frame number obtained from the page table is appended to the offset from the virtual address to form the physical address of x.]
If a page is not currently mapped to a frame, the access fails and generates a
page fault. This is a special type of interrupt. The operating system handler
for this interrupt initiates a disk operation to read the desired page, and maps
it to a free frame. While this is being done, the process is blocked. When the
page becomes available, the process is unblocked, and placed on the ready
queue. When it is scheduled, it will resume its execution with the instruction
that caused the page fault — but this time it will work. This manner of
getting pages as they are needed is called demand paging.
Exercise 100 Does the page size have to match the disk block size? What are
the considerations?
When memory is partitioned into pages, these pages can be mapped into
unrelated physical memory frames. Pages that contain contiguous virtual
addresses need not be contiguous in physical memory. Moreover, all pages
and all frames are of the same size, so any page can be mapped to any frame.
As a result the allocation of memory is much more flexible.
An important observation is that not all the used parts of the address space
need to be mapped to physical memory simultaneously. Only pages that are
accessed in the current phase of the computation are mapped to physical
memory frames, as needed. Thus programs that use a very large address
space may still be executed on a machine with a much smaller physical
memory.
Exercise 101 What is the maximal number of pages that may be required to
be memory resident (that is, mapped to physical frames) in order to execute a
single instruction?
Paging provides reduced fragmentation
Exercise 102 Can the operating system set the page size at will?
It depends on locality
The success of paging systems depends on the fact that applications display
locality of reference. This means that they tend to stay in the same part of the
address space for some time, before moving to other remote addresses. With
locality, each page is used many times, which amortizes the cost of reading it
off the disk. Without locality, the system will thrash, and suffer from
multiple page faults.
The relationship between paging and caching can be summarized as follows:

                    fast storage       slow storage       transfer unit
  processor cache   cache              primary memory     cache line
  paging            primary memory     disk               page
However, one difference is that in the case of memory the location of pages is
fully associative: any page can be in any frame, as explained below.
Exercise 103 The cache is maintained by hardware. Why is paging delegated
to the operating system?
But how does this measure locality? If memory accesses are random, you
would expect to find items “in the middle” of the stack. Therefore the average
stack depth will be high, roughly of the same order of magnitude as the size
of all used memory. But if there is significant locality, you expect to find
items near the top of the stack. For example, if there is significant temporal
locality, it means that we tend to repeatedly access the same items. If such
repetitions come one after the other, the item will be found at the top of the
stack. Likewise, if we repeatedly access a set of items in a loop, this set will
remain at the top of the stack and only the order of its members will change
as they are accessed.
[Figure: stack distance distributions for the eon_rush, perl_diffmail, and gzip_program benchmarks; the x axis is the stack distance, on a logarithmic scale.]
Another measure of locality is the size of the working set, loosely defined as
the set of addresses that are needed at a particular time [7]. The smaller the
set (relative to the full set of all possible addresses), the stronger the locality.
This may change for each phase of the computation.
Exercise 104 So when memory becomes cheaper than disk, this is the end of
paging? Or are there reasons to continue to use paging anyway?
The cost is usually very acceptable
So far we have only listed the benefits of paging. But paging may cause
applications to run slower, due to the overheads involved in interrupt
processing. This includes both the direct overhead of actually running the
interrupt handler, and the indirect overhead of reduced cache performance
due to more context switching, as processes are forced to yield the CPU to
wait for a page to be loaded into memory. The total effect may be a
degradation in performance of 10–20% [10].
To improve performance, it is crucial to reduce the page fault rate. This may
be possible with good prefetching: for example, given a page fault we may
bring additional pages into memory, instead of only bringing the one that
caused the fault. If the program subsequently references these additional
pages, it will find them in memory and avoid the page fault. Of course there
is also the danger that the prefetching was wrong, and the program does not
need these pages. In that case it may be detrimental to bring them into
memory, because they may have replaced other pages that are actually more
useful to the program.
[Figure: virtual address translation — the page number indexes the page table
(via the page table address register); a valid entry yields the frame number,
while an invalid entry causes a page fault]
The page part of the virtual address is used as an index into the page table,
which contains the mapping of pages to frames (this is implemented by
adding the page number to a register containing the base address of the page
table). The frame number is extracted and used as part of the physical
address. The offset is simply appended. Given the physical address, an access
to memory is performed. This goes through the normal mechanism of
checking whether the requested address is already in the cache. If it is not, the
relevant cache line is loaded from memory.
• A modified or dirty bit, indicating whether the page has been modified. If so,
it has to be written to disk when it is evicted. This bit is cleared by the
operating system when the page is mapped. It is set automatically by the
hardware each time any word in the page is written.
• A used bit, indicating whether the page has been accessed recently. It is set
automatically by the hardware each time any word in the page is accessed.
“Recently” means since the last time that the bit was cleared by the operating
system.
One problem with this scheme is that the page table can be quite big. In our
running example, there are 20 bits that specify the page, for a total of 2^20 =
1048576 pages in the address space. This is also the number of entries in the
page table. Assuming each one requires 4 bytes, the page table itself uses
1024 pages of memory, or 4 MB. Obviously this cannot be stored in
hardware registers. The solution is to have a special cache that keeps
information about recently used mappings. This is again based on the
assumption of locality: if we have used the mapping of page X recently, there
is a high probability that we will use this mapping many more times (e.g. for
other addresses that fall in the same page). This cache is called the translation
lookaside buffer (TLB).
With the TLB, access to pages with a cached mapping can be done
immediately. Access to pages that do not have a cached mapping requires
two memory accesses: one to get the entry from the page table, and another to
the desired address.
Note that the TLB is separate from the general data cache and instruction
cache. Thus the translation mechanism only needs to search this special
cache, and there are no conflicts between it and the other caches. One reason
for this separation is that the TLB includes some special hardware features,
notably the used and modified bits described above. These bits are only
relevant for pages that are currently being used, and must be updated by the
hardware upon each access.
The TLB may make access faster, but it can’t reduce the size of the page
table. As the number of processes increases, the percentage of memory
devoted to page tables also increases. This problem can be solved by using an
inverted page table. Such a table only has entries for pages that are allocated
to frames. Inserting pages and searching for them is done using a hash
function. If a page is not found, it is inferred that it is not mapped to any
frame, and a page fault is generated.
Exercise 105 Another way to reduce the size of the page table is to enlarge
the size of each page: for example, we can decide that 10 bits identify the
page and 22 the offset, thus reducing the page table to 1024 entries. Is this a
good idea?
When a page fault occurs, the required page must be mapped to a free frame.
However, free frames are typically in short supply. Therefore a painful
decision has to be made, about what page to evict to make space for the new
page. The main part of the operating system’s memory management
component is involved with making such decisions.
Exercise 106 Will all erroneous accesses be caught? What happens with
those that are not caught?
• Implementing the copy-on-write optimization.
In Unix, forking a new process involves copying all its address space.
However, in many cases this is a wasted effort, because the new process
performs an exec system call and runs another application instead. A possible
optimization is then to use copy-on-write. Initially, the new process just uses
its parent’s address space, and all copying is avoided. A copy is made only
when either process tries to modify some data, and even then, only the
affected page is copied.
This idea is implemented by copying the parent’s page table to the child, and
marking the whole address space as read-only. As long as both processes just
read data, everything is OK. When either process tries to modify data, this
will cause a memory error exception (because the memory has been tagged
readonly). The operating system handler for this exception makes separate
copies of the offending page, and changes their mapping to read-write.
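The handler logic might look roughly like the following C sketch. The frame
and page-table structures, alloc_frame, and flush_tlb_entry are hypothetical
stand-ins for whatever structures and helpers the kernel actually uses.

    #include <string.h>

    #define PAGE_SIZE 4096

    /* Hypothetical, simplified kernel structures for illustration only. */
    struct frame { int ref_count; unsigned char data[PAGE_SIZE]; };
    struct pte   { struct frame *frame; int writable; };

    extern struct frame *alloc_frame(void);          /* assumed frame allocator */
    extern void flush_tlb_entry(struct pte *pte);    /* assumed TLB invalidation */

    /* Called when a process writes to a page that was marked read-only at fork time. */
    void cow_fault(struct pte *pte)
    {
        if (pte->frame->ref_count == 1) {
            /* Only this process still uses the frame: just make it writable again. */
            pte->writable = 1;
        } else {
            /* Shared frame: give the faulting process its own private copy. */
            struct frame *copy = alloc_frame();
            memcpy(copy->data, pte->frame->data, PAGE_SIZE);
            copy->ref_count = 1;
            pte->frame->ref_count--;
            pte->frame = copy;
            pte->writable = 1;
        }
        flush_tlb_entry(pte);   /* discard the stale read-only mapping */
    }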
• Swapping the Unix u-area.
In Unix, the u-area is a kernel data structure that
contains important information about the currently running process.
Naturally, the information has to be maintained somewhere also when the
process is not running, but it doesn’t have to be accessible. Thus, in order to
reduce the complexity of the kernel code, it is convenient if only one u-area is
visible at any given time.
This idea is implemented by compiling the kernel so that the u-area is page
aligned (that is, the data structure starts on an address at the beginning of a
page). For each process, a frame of physical memory is allocated to hold its
u-area. However, these frames are not mapped into the kernel’s address space.
Only the frame belonging to the current process is mapped (to the page where
the “global” u-area is located). As part of a context switch from one process
to another, the mapping of this page is changed from the frame with the u-
area of the previous process to the frame with the u-area of the new process.
If you understand this, you understand virtual memory with paging.
Paging can be combined with segmentation
It is worth noting that paging is often combined with segmentation. All the
above can be done on a per-segment basis, rather than for the whole address
space as a single unit. The benefit is that now segments do not need to be
mapped to contiguous ranges of physical memory, and in fact segments may
be larger than the available memory.
[Figure: address translation with segmentation and paging — the segment number
indexes the segment table (via the segment table address register), with a
bounds check that may raise a segment exception; the selected segment’s page
table then maps the page number to a frame, or raises a page fault]
1. The segment number, which is used as an index into the segment table and
finds the correct page table. Again, this is implemented by adding it to the
contents of a special register that points to the base address of the segment
table.
2. The page number, which is used as an index into the page table to find the
frame to which this page is mapped. This is mapped by adding it to the base
address of the segment’s page table, which is extracted from the entry in the
segment table.
To read more: Stallings [13, Sect. 7.3] gives detailed examples of the actual
structures used in a few real systems (this was deleted in the newer edition).
More detailed examples are given by Jacob and Mudge [9].
As noted above, the main operating system activity with regard to paging is
deciding what pages to evict in order to make space for new pages that are
read from disk.
FIFO is bad and leads to anomalous behavior
The simplest algorithm is FIFO: throw out the pages in the order in which
they were brought in. You can guess that this is a bad idea because it is
oblivious of the program’s behavior, and doesn’t care which pages are
accessed a lot and which are not. Nevertheless, it is used in Windows 2000.
In fact, it is even worse. It turns out that with FIFO replacement there are
cases when adding frames to a process causes more page faults rather than
fewer — an effect known as Belady’s anomaly. While this only happens in
pathological cases, it is unsettling. It is the result of the fact that the set
of pages maintained by the algorithm when equipped with n frames is not
necessarily a superset of the set of pages maintained when only n − 1 frames
are available.
In algorithms that rely on usage, such as those described below, the set of
pages kept in n frames is a superset of the set that would be kept with n − 1
frames, and therefore Belady’s anomaly does not happen.
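The anomaly can be demonstrated with a small simulation. The following
self-contained C program counts FIFO page faults on the classic reference
string 1,2,3,4,1,2,5,1,2,3,4,5: with 3 frames it incurs 9 faults, but with 4
frames it incurs 10.

    #include <stdio.h>

    /* Count page faults under FIFO replacement for a given number of frames. */
    static int fifo_faults(const int *refs, int n, int frames)
    {
        int mem[16] = {0}, used = 0, next = 0, faults = 0;
        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < used; j++)
                if (mem[j] == refs[i]) { hit = 1; break; }
            if (!hit) {
                faults++;
                if (used < frames)
                    mem[used++] = refs[i];        /* a free frame is still available */
                else {
                    mem[next] = refs[i];          /* evict the oldest page */
                    next = (next + 1) % frames;
                }
            }
        }
        return faults;
    }

    int main(void)
    {
        /* The classic reference string exhibiting Belady's anomaly. */
        int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
        int n = sizeof(refs) / sizeof(refs[0]);
        printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));  /* prints 9  */
        printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));  /* prints 10 */
        return 0;
    }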
At the other extreme, the best possible algorithm is one that knows the future.
This enables it to know which pages will not be used any more, and select
them for replacement. Alternatively, if no such pages exist, the know-all
algorithm will be able to select the page that will not be needed for the
longest time. This delays the unavoidable page fault as much as possible, and
thus reduces the total number of page faults.
The most influential concept with regard to paging is the working set. The
working set of a process is a dynamically changing set of pages. At any given
moment, it includes the pages accessed in the last ∆ instructions; ∆ is called
the window size and is a parameter of the definition.
The importance of the working set concept is that it captures the relationship
between paging and the principle of locality. If the window size is too small,
then the working set changes all the time. This is illustrated in this figure,
where each access is denoted by an arrow, ∆ = 2, and the pages in the
working set are shaded:
[Figure: an access sequence over time with ∆ = 2 — the working set changes with
almost every access]
But with the right window size, the set becomes static: every additional
instruction accesses a page that is already in the set. Thus the set comes to
represent that part of memory in which the accesses are localized. For the
sequence shown above, this requires ∆ = 6:
[Figure: the same access sequence with ∆ = 6 — the working set remains static]
Knowing the working set for each process leads to two useful items of
information:
• How many pages each process needs (the resident set size), and
• Which pages are not in the working set and can therefore be evicted.
However, keeping track of the working set is not realistically possible. And
to be effective it depends on setting the window size correctly.
Evicting the least-recently used page approximates the working set
The reason for having the window parameter in the definition of the working
set is that memory usage patterns change over time. We want the working set
to reflect the current usage, not the pages that were used a long time ago. This
insight leads to the LRU page replacement algorithm: when a page has to be
evicted, pick the one that was least recently used (or in other words, was used
farthest back in the past).
Note that LRU automatically also defines the resident set size for each
process. This is so because all the pages in the system are considered as
potential victims for eviction, not only pages belonging to the faulting
process. Processes that use few pages will be left with only those pages, and
lose pages that were used in the more distant past. Processes that use more
pages will have larger resident sets, because their pages will never be the
least recently used.
The way to use these bits is as follows. Initially, all bits are set to zero. As
pages are accessed, their bits are set to 1 by the hardware. When the operating
system needs a frame, it scans all the pages in sequence. If it finds a page
with its used bit still 0, it means that this page has not been accessed for some
time. Therefore this page is a good candidate for eviction. If it finds a page
with its used bit set to 1, it means that this page has been accessed recently.
The operating system then resets the bit to 0, but does not evict the page.
Instead, it gives the page a second chance. If the bit stays 0 until this page
comes up again for inspection, the page will be evicted then. But if the page
is accessed continuously, its bit will be set already when the operating system
looks at it again. Consequently pages that are in active use will not be
evicted.
It should be noted that the operating system scans the pages in a cyclic
manner, always starting from where it left off last time. This is why the
algorithm is called the “clock” algorithm: it can be envisioned as arranging
the pages in a circle, with a hand pointing at the current one. When a frame is
needed, the hand is moved one page at a time, setting the used bits to 0.
When it finds a page whose bit is already 0, it stops and the page is evicted.
[Figure: the clock hand scanning frames and their used bits — first try...
second try... third try... fourth try... success!]
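A minimal sketch of the scan itself follows, assuming for simplicity that the
used bits of the resident frames are gathered into an array the operating
system can read and clear. In a real system the bits live in the page table
entries; the cyclic scan and the second chance are the essential points.

    #define NFRAMES 64

    static int used[NFRAMES];   /* reference bits, set by the hardware on access */
    static int hand = 0;        /* current position of the clock hand */

    /* Return the index of the frame to evict. */
    int clock_evict(void)
    {
        for (;;) {
            if (used[hand] == 0) {            /* not referenced since the last pass: evict */
                int victim = hand;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            used[hand] = 0;                   /* give the page a second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }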
The clock algorithm uses only one bit of information: whether the page was
accessed since the last time the operating system looked at it. A possible
improvement is for the operating system to keep a record of how the value of
this bit changes over time. This allows it to differentiate pages that were not
accessed a long time from those that were not accessed just now, and
provides a better approximation of LRU.
The alternative is local paging. When a certain process suffers a page fault,
we only consider the frames allocated to that process, and choose one of the
pages belonging to that process for eviction. We don’t evict a page belonging
to one process and give the freed frame to another. The advantage of this
scheme is isolation: a process that needs a lot of memory is prevented from
monopolizing multiple frames at the expense of other processes.
The main consequence of local paging is that it divorces the replacement
algorithm from the issue of determining the resident set size. Each process
has a static set of frames at its disposal, and its paging activity is limited to
this set. The operating system then needs a separate mechanism to decide
upon the appropriate resident set size for each process; for example, if a
process does a lot of paging, its allocation of frames can be increased, but
only subject to verification that this does not adversely affect other processes.
Another problem with paging is that the evicted page may be dirty, meaning
that it was modified since being read off the disk. Dirty pages have to be
written to disk when they are evicted, leading to a large delay in the service
of page faults (first write one page, then read another). Service would be
faster if the evicted page was clean, meaning that it was not modified and
therefore can be evicted at once.
No page replacement algorithm can solve one basic problem, however. This
problem is that if the sum of the sizes of the working sets of all the processes
is larger than the size of the physical memory, all the pages in all the working
sets cannot be in memory at the same time. Therefore every page fault will
necessarily evict a page that is in some process’s working set. By definition,
this page will be accessed again soon, leading to another page fault. The
system will therefore enter a state in which page faults are frequent, and no
process manages to make any progress. This is called thrashing.
Using local paging reduces the effect of thrashing, because processes don’t
automatically steal pages from each other. But the only real solution to
thrashing is to reduce the memory pressure by reducing the
multiprogramming level. This means that we will have fewer processes in the
system, so each will be able to get more frames for its pages. It is
accomplished by swapping some processes out to disk, as described in
Section 4.4 below.
Exercise 111 Can the operating system identify the fact that thrashing is
occurring? How?
A simple solution, used in early Unix systems, is to set aside a partition of the
disk for this purpose. Thus disk space was divided into two main parts: one
used for paging, and the other for file systems. The problem with this
approach is that it is inflexible. For example, if the system runs out of space
for pages, but still has a lot of space for files, it cannot use this space.
4.4 Swapping
In order to guarantee the responsiveness of a computer system, processes
must be preemptable, so as to allow new processes to run. But what if there is
not enough memory to go around? This can also happen with paging, as
identified by excessive thrashing.
• Fairness: processes that have accumulated a lot of CPU usage may be swapped
out for a while to give other processes a chance.
• Creating a good job mix: jobs that compete with each other will be swapped
out more than jobs that complement each other. For example, if there are two
compute-bound jobs and one I/O-bound job, it makes sense to swap out the
compute-bound jobs alternately, and leave the I/O-bound job memory resident.
[Figure: the process state graph extended with swapped-ready and swapped-blocked
states, connected by swap-out (and swap-in) transitions and an “event done”
transition from swapped-blocked to swapped-ready]
Exercise 112 Are all the transitions in the graph equivalent, or are some of
them “heavier” (in the sense that they take considerably more time)?
And there are the usual sordid details
Paging and swapping introduce another resource that has to be managed: the
disk space used to back up memory pages. This has to be allocated to
processes, and a mapping of the pages into the allocated space has to be
maintained. In principle, the methods described above can be used: either use
a contiguous allocation with direct mapping, or divide the space into blocks
and maintain an explicit map.
One difference is the desire to use large contiguous blocks in order to achieve
higher disk efficiency (that is, less seek operations). An approach used in
some Unix systems is to allocate swap space in blocks that are some power-
of-two pages. Thus small processes get a block of say 16 pages, larger
processes get a block of 16 pages followed by a block of 32 pages, and so on
up to some maximal size.
4.5 Summary
Memory management provides some of the best examples of each of the following.
Abstractions
Resource management
Workload issues
Workloads typically display a high degree of locality. Without it, the use of
paging to disk in order to implement virtual memory would simply not work.
Locality allows the cost of expensive operations such as access to disk to be
amortized across multiple memory accesses, and it ensures that these costly
operations will be rare.
Hardware support
Bibliography
[1] O. Babaoglu and W. Joy, “Converting a swap-based system to do paging in an
architecture lacking page-referenced bits”. In 8th Symp. Operating Systems
Principles, pp. 78–86, Dec 1981.
Chapter 5
File Systems
Support for files is an abstraction provided by the operating system, and does
not entail any real resource management. If disk space is available, it is used,
else the service cannot be provided. There is no room for manipulation as in
scheduling (by timesharing) or in memory management (by paging and
swapping). However there is some scope for various ways to organize the
data when implementing the file abstraction.
The attributes of being named and being persistent go hand in hand. The idea
is that files can be used to store data for long periods of time, and
specifically, for longer than the runtimes of the processes that create them.
Thus one process can create a file, and another process may re-access the file
using its name.
The attribute of being sequential means that data within the file can be
identified by its offset from the beginning of the file.
As for structure, in Unix there simply is no structure. There is only one type
of file, which is a linear sequence of bytes. Any interpretation of these bytes
is left to the application that is using the file. Of course, some files are used
by the operating system itself. In such cases, the operating system is the
application, and it imposes some structure on the data it keeps in the file. A
prime example is directories, where the data being stored is file system data;
this is discussed in more detail below. Another example is executable files. In
this case the structure is created by a compiler, based on a standard format
that is also understood by the operating system.
Other systems may support more structure in their files. Examples include:
• IBM mainframes: the operating system supports lots of options, including
fixed or variable-size records, indexes, etc. Thus support for higher-level
operations is possible, including “read the next record” or “read the record
with key X”.
• Macintosh OS: executables have two “forks”: one containing code, the other
labels used in the user interface. Users can change labels appearing on buttons
in the application’s graphical user interface (e.g. to support different
languages) without access to the source code.
• Windows NTFS: files are considered as a set of attribute/value pairs. In
particular, each file has an “unnamed data attribute” that contains its
contents. But it can also have multiple additional named data attributes, e.g.
to store the file’s icon or some other file-specific information. Files also
have a host of system-defined attributes.
As this is data about the file maintained by the operating system, rather than
user data that is stored within the file, it is sometimes referred to as the file’s
metadata.
Exercise 113 What are these data items useful for?
Exercise 114 And how about the file’s name? Why is it not here?
Exercise 115 The Unix stat system call provides information about files (see
man stat). Why is the location of the data on the disk not provided?
The most important operations are reading and writing
Given that files are abstract objects, one can also ask what operations are
supported on these objects. In a nutshell, the main ones are opening a file,
reading and writing it, and closing it again.
5.2.1 Directories
Directories provide a hierarchical structure
The name space for files can be flat, meaning that all files are listed in one
long list. This has two disadvantages:
• There can be only one file with a given name in the system. For example, it
is not possible for different users to have distinct files named “ex1.c”.
• The list can be very long and unorganized.
With a hierarchical structure files are identified by the path from the root,
through several levels of directories, to the file. Each directory contains a list
of those files and subdirectories that are contained in it. Technically, the
directory maps the names of these files and subdirectories to the internal
entities known to the file systems — that is, to their inodes. In fact,
directories are just like files, except that the data stored in them is not user
data but rather file system data, namely this mapping. The hierarchy is
created by having names that refer to subdirectories.
The system identifies files by their full path, that is, the file’s name
concatenated to the list of directories traversed to get to it. This always starts
from a distinguished root directory.
Exercise 118 So how does the system find the root directory itself?
Assume a full path /a/b/c is given. To find this file, we need to perform the
following:
[Figure: traversing /a/b/c — the root directory’s blocks map “a” to inode 3,
the blocks of /a map “b” to inode 8, and the blocks of /a/b map “c” to inode 5]
1. Read the inode of the root / (assume it is the first one in the list of inodes),
and use it to find the disk blocks storing its contents, i.e. the list of
subdirectories and files in it.
2. Read these blocks and search for the entry a. This entry will map /a to its
inode. Assume this is inode number 3.
3. Read inode 3 (which represents /a), and use it to find its blocks.
4. Read the blocks and search for subdirectory b. Assume it is mapped to
inode 8.
5. Read inode 8, which we now know to represent /a/b.
6. Read the blocks of /a/b, and search for entry c. This will provide the inode
of /a/b/c that we are looking for, which contains the list of blocks that hold
this file’s contents.
7. To actually gain access to /a/b/c, we need to read its inode and verify that
the access permissions are appropriate. In fact, this should be done in each of
the steps involved with reading inodes, to verify that the user is allowed to
see this data. This is discussed further in Chapter 7.
Note that there are 2 disk accesses per element in the path: one for the inode,
and the other for the contents.
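The traversal can be sketched as follows. Here read_inode and lookup_in_block
are hypothetical helpers, each standing for one disk access (or a buffer cache
hit), and the root is assumed to be the first inode, as in the example above.

    #include <string.h>

    /* Hypothetical, simplified structures: an inode holds a block list, and a
       directory block maps names to inode numbers. */
    struct inode { int nblocks; int blocks[10]; /* metadata omitted */ };

    extern struct inode *read_inode(int inum);               /* one disk access */
    extern int lookup_in_block(int block, const char *name); /* one disk access;
                                                                 returns inode number or -1 */

    /* Resolve an absolute path like "/a/b/c" to an inode number, or -1 on failure. */
    int namei(const char *path)
    {
        int inum = 1;                /* assume the root is the first inode in the list */
        char buf[256], *name, *save;

        strncpy(buf, path, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        for (name = strtok_r(buf, "/", &save); name != NULL;
             name = strtok_r(NULL, "/", &save)) {
            struct inode *dir = read_inode(inum);            /* disk access #1 per component */
            int found = -1;
            for (int i = 0; i < dir->nblocks && found < 0; i++)
                found = lookup_in_block(dir->blocks[i], name); /* disk access #2 */
            if (found < 0)
                return -1;           /* no such name in this directory */
            inum = found;
        }
        return inum;
    }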
A possible shortcut is to start from the current directory (also called the
working directory) rather than from the root. This obviously requires the
inode of the current directory to be available. In Unix, the default current
directory when a user logs onto the system is that user’s home directory.
Exercise 119 Are there situations in which the number of disk accesses when
parsing a file name is different from two per path element?
Exercise 120 The contents of a directory is the mapping from names (strings
of characters) to inodes (e.g. integers interpreted as an index into a table).
How would you implement this? Recall that most file names are short, but
you need to handle arbitrarily long names efficiently. Also, you need to
handle dynamic insertions and deletions from the directory.
Note that the process described above may also fail. For example, the
requested name may not be listed in the directory. Alternatively, the name
may be there but the user may not have the required permission to access the
file. If something like this happens, the system call that is trying to gain
access to the file will fail. It is up to the application that called this system
call to print an error message or do something else.
5.2.2 Links
Files can have more than one name
Exercise 122 Why in the world would you want a file to have multiple names?
Exercise 123 Are there any disadvantages to allowing files to have multiple
names? Hint: think about “..”.
Special care must be taken when deleting (unlinking) a file. If it has multiple
links, and one is being removed, we should not delete the file’s contents — as
they are still accessible using the other links. The inode therefore has a
counter of how many links it has, and the data itself is only removed when
this count hits zero.
Exercise 124 An important operation on files is renaming them. Is this really
an operation on the file? How is it implemented?
Soft links are flexible but problematic
Or to just search
To read more: The opposition to using file names, and preferring search
procedures, is promoted by Raskin, the creator of the Mac interface [11].
As in other areas, there are many options and alternatives. But in the area of
file systems, there is more data about actual usage patterns than in other
areas. Such data is important for the design and evaluation of file systems,
and using it ensures that the selected policies do indeed provide good
performance.
If only the processor can access memory the only option is programmed I/O.
This means that the CPU (running operating system code) accepts each word
of data as it comes off the disk, and stores it at the required address. This has
the obvious disadvantage of keeping the CPU busy throughout the duration of
the I/O operation, so no overlap with other computation is possible.
1. The file’s name is parsed, and the directories along the path are traversed
to find the file’s entry in the list of inodes on the disk; assume it is
entry 13.
2. Entry 13 from that data structure is read from the disk and copied into the
kernel’s open files table. This includes various file attributes, such as the
file’s owner and its access permissions, and a list of the file blocks.
3. The user’s access rights are checked against the file’s permissions to
ascertain that reading (R) is allowed.
The reason for opening files is that the above operations may be quite time
consuming, as they may involve a number of disk accesses to map the file
name to its inode and to obtain the file’s attributes. Thus it is desirable to
perform them once at the outset, rather than doing them again and again for
each access. open returns a handle to the open file, in the form of the file
descriptor, which can then be used to access the file many times.
This is expressed by the system call read(fd,buf,100), which means that 100
bytes should be read from the file indicated by fd into the memory buffer buf.
[Figure: reading from a file — fd indexes the kernel’s open files table, whose
entry leads to the file’s attributes and its block list (blocks 5 and 8); disk
block 5 is read into the buffer cache, and the requested bytes are then copied
to the user’s buffer buf]
The argument fd identifies the open file by pointing into the kernel’s open
files table. Using it, the system gains access to the list of blocks that contain
the file’s data. In our example, it turns out that the first data block is disk
block number 5. The file system therefore reads disk block number 5 into its
buffer cache. This is a region of memory where disk blocks are cached. As
this takes a long time (on the order of milliseconds), the operating system
typically blocks the operation waiting for it to complete, and does something
else in the meanwhile, like scheduling another process to run.
When the disk interrupts the CPU to signal that the data transfer has
completed, handling of the read system call resumes. In particular, the desired
range of 100 bytes is copied into the user’s memory at the address indicated
by buf. If additional bytes from this block are requested later, the block
will be found in the buffer cache, saving the overhead of an additional disk
access.
Exercise 126 Is it possible to read the desired block directly into the user’s
buffer, and save the overhead of copying?
Writing may require new blocks to be allocated
Now suppose the process wants to write a few bytes. Let’s assume we want
to write 100 bytes, starting with byte 2000 in the file. This will be expressed
by the pair of system calls
seek(fd,2000)
write(fd,buf,100)
Let’s also assume that each disk block is 1024 bytes. Therefore the data we
want to write spans the end of the second block to the beginning of the third
block.
The problem is that disk accesses are done in fixed blocks, and we only want
to write part of such a block. Therefore the full block must first be read into
the buffer cache. Then the part being written is modified by overwriting it
with the new data. In our example, this is done with the second block of the
file, which happens to be block 8 on the disk.
[Figure: writing to a file — disk block 8 (the file’s second block) is read into
the buffer cache, and the part being written is overwritten with data copied
from buf]
The rest of the data should go into the third block, but the file currently only
has two blocks. Therefore a third block must be allocated from the pool of
free blocks. Let’s assume that block number 2 on the disk was free, and that
this block was allocated. As this is a new block, there is no need to read it
from the disk before modifying it — we just allocate a block in the buffer
cache, prescribe that it now represents block number 2, and copy the
requested data to it. Finally, the modified blocks are written back to the disk.
[Figure: a free disk block (block 2) is allocated as the file’s third block; the
rest of the data is copied from buf into it, the inode’s block list becomes
5, 8, 2, and the modified blocks are later written back to the disk]
Note that the copy of the file’s inode was also modified, to reflect the
allocation of a new block. Therefore this too must be copied back to the disk.
Likewise, the data structure used to keep track of free blocks needs to be
updated on the disk as well.
You might have noticed that the read system call provides a buffer address
for placing the data in the user’s memory, but does not indicate the offset in
the file from which the data should be taken. This reflects common usage
where files are accessed sequentially, and each access simply continues
where the previous one ended. The operating system maintains the current
offset into the file (sometimes called the file pointer), and updates it at the
end of each operation.
If random access is required, the process can set the file pointer to any
desired value by using the seek system call (random here means arbitrary, not
indeterminate!).
Exercise 127 What happens (or should happen) if you seek beyond the
current end of the file, and then write some data?
The first is the in-core inode table, which contains the inodes of open files
(recall that an inode is the internal structure used to represent files, and
contains the file’s metadata as outlined in Section 5.1). Each file may appear
at most once in this table. The data is essentially the same as in the on-disk
inode, i.e. general information about the file such as its owner, permissions,
modification time, and listing of disk blocks.
[Figure: a parent and a child process each have their own file descriptor
table; a file opened before the fork is inherited, so both descriptors point to
the same open files table entry (leading here to inode 13), while files opened
after the fork get separate entries]
The second table is the open files table. An entry in this table is allocated
every time a file is opened. These entries contain three main pieces of data:
• An indication of whether the file was opened for reading or for writing.
• An offset (sometimes called the “file pointer”) storing the current position
within the file.
• A pointer to the file’s inode.
The third table is the file descriptors table. There is a separate file descriptor
table for each process in the system. When a file is opened, the system finds
an unused slot in the opening process’s file descriptor table, and uses it to
store a pointer to the new entry it creates in the open files table. The index of
this slot is the return value of the open system call, and serves as a handle to
get to the file. The first three indices are by convention preallocated to
standard input, standard output, and standard error.
Exercise 128 Why do we need the file descriptors table? Couldn’t open just
return the index into the open files table?
Another important use for the file descriptors table is that it allows file
pointers to be shared. For example, this is useful in writing a log file that is
shared by several processes. If each process has its own file pointer, there is a
danger that one process will overwrite log entries written by another process.
But if the file pointer is shared, each new log entry will indeed be added to
the end of the file. To achieve sharing, file descriptors are inherited across
forks. Thus if a process opens a file and then forks, the child process will also
have a file descriptor pointing to the same entry in the open files table. The
two processes can then share the same file pointer by using these file
descriptors. If either process opens a file after the fork, the associated file
pointer is not shared.
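A small POSIX example of such sharing follows. Because the file is opened
before the fork, both writes advance the same file pointer, so neither write
overwrites the other.

    #include <fcntl.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Opened before the fork: parent and child share one open-file entry,
           and therefore one file pointer. */
        int fd = open("log", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fork() == 0) {
            write(fd, "child entry\n", 12);   /* advances the shared offset */
            _exit(0);
        }
        wait(NULL);                           /* let the child write first */
        write(fd, "parent entry\n", 13);      /* lands after the child's entry,
                                                 not on top of it */
        close(fd);
        return 0;
    }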
Exercise 129 Given that multiple file descriptors can point to the same open
file entry, and multiple open file entries can point to the same inode, how are
entries freed? Specifically, when a process closes a file descriptor, how can
the system know whether it should free the open file entry and/or the inode?
• If an application requests only a small part of a block, the whole block has
to be read anyway. By caching it (rather than throwing it away after satisfying
the request) the operating system can later serve additional requests from the
same block without any additional disk accesses. In this way small requests
are aggregated and the disk access overhead is amortized rather than being
duplicated.
• In some cases several processes may access the same disk block; examples
include loading an executable file or reading from a database. If the blocks
are cached when the first process reads them off the disk, subsequent accesses
will hit them in the cache and not need to re-access the disk.
• A lot of data that is written to files is actually transient, and need not be
kept for long periods of time. For example, an application may store some data
in a temporary file, and then read it back and delete the file. If the file’s
blocks are initially stored in the buffer cache, rather than being written to
the disk immediately, it is possible that the file will be deleted while they
are still there. In that case, there is no need to write them to the disk at
all. The same holds for data that is overwritten after a short time.
• Alternatively, data that is not erased can be written to disk later (delayed
write). The process need not be blocked to wait for the data to get to the
disk. Instead the only overhead is just to copy the data to the buffer cache.
Data about file system usage in working Unix 4.2 BSD systems was collected
and analyzed by Ousterhout and his students [9, 2]. They found that a
suitably-sized buffer cache can eliminate 65–90% of the disk accesses. This
is attributed to the reasons listed above. Specifically, the analysis showed that
20–30% of newly-written information is deleted or overwritten within 30
seconds, and 50% is deleted or overwritten within 5 minutes. In addition,
about 2/3 of all the data transferred was in sequential access of files. In files
that were opened for reading or writing, but not both, 91–98% of the accesses
were sequential. In files that were opened for both reading and writing, this
dropped to 19–35%.
The downside of caching disk blocks is that the system becomes more
vulnerable to data loss in case of a system crash. Therefore it is necessary to
periodically flush all modified disk blocks to the disk, thus reducing the risk.
This operation is called disk synchronization.
Exercise 130 It may happen that a block has to be read, but there is no space
for it in the buffer cache, and another block has to be evicted (as in paging
memory). Which block should be chosen? Hint: LRU is often used. Why is
this possible for files, but not for paging?
The data structure used by Unix to implement the buffer cache is somewhat
involved, because two separate access scenarios need to be supported. First,
data blocks need to be accessed according to their disk address. This
associative mode is implemented by hashing the disk address, and linking the
data blocks according to the hash key. Second, blocks need to be listed
according to the time of their last access in order to implement the LRU
replacement algorithm. This is implemented by keeping all blocks on a global
LRU list. Thus each block is actually part of two linked lists: one based on its
hashed address, and the other based on the global access order.
(Strictly speaking, this description relates to older versions of the system.
Modern Unix variants typically combine the buffer cache with the virtual
memory mechanisms.)
[Figure: the buffer cache structure — each cached block is linked both on a
hash list (indexed by its hashed disk address) and on the global LRU list]
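The lookup can be sketched as follows. The structures are simplified, and the
helpers evict_lru_and_reuse and read_block_from_disk are assumed rather than
shown.

    #include <stddef.h>

    #define NHASH 64

    /* Each cached block is linked on two lists: a hash chain selected by its
       disk address, and the global LRU list. */
    struct buf {
        int         blkno;                 /* disk block number */
        struct buf *hash_next;             /* chain for this hash bucket */
        struct buf *lru_prev, *lru_next;   /* global LRU list */
        char        data[1024];
    };

    static struct buf *hash_heads[NHASH];

    extern void        read_block_from_disk(int blkno, struct buf *b); /* assumed I/O helper */
    extern struct buf *evict_lru_and_reuse(void);  /* assumed: unlinks the least recently
                                                      used buffer from both of its lists */

    /* Look a block up by disk address; on a miss, recycle the LRU buffer.
       The caller is expected to move the result to the head of the LRU list. */
    struct buf *getblk(int blkno)
    {
        struct buf *b;
        for (b = hash_heads[blkno % NHASH]; b != NULL; b = b->hash_next)
            if (b->blkno == blkno)
                return b;                          /* hit: no disk access needed */

        b = evict_lru_and_reuse();                 /* miss: take over the LRU buffer */
        b->blkno = blkno;
        b->hash_next = hash_heads[blkno % NHASH];  /* link it on the new hash chain */
        hash_heads[blkno % NHASH] = b;
        read_block_from_disk(blkno, b);
        return b;
    }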
Prefetching can overlap I/O with computation
The fact that most of the data transferred is in sequential access implies
locality. This suggests that if the beginning of a block is accessed, the rest
will also be accessed, and therefore the block should be cached. But it also
suggests that the operating system can guess that the next block will also be
accessed. The operating system can therefore prepare the next block even
before it is requested. This is called prefetching.
Prefetching does not reduce the number of disk accesses. In fact, it runs the
risk of increasing the number of disk accesses, for two reasons: first, it may
happen that the guess was wrong and the process does not make any accesses
to the prefetched block, and second, reading the prefetched block into the
buffer cache may displace another block that will be accessed in the future.
However, it does have the potential to significantly reduce the time of I/O
operations as observed by the process. This is due to the fact that the I/O was
started ahead of time, and may even complete by the time it is requested.
Thus prefetching overlaps I/O with the computation of the same process. It is
similar to asynchronous I/O, but does not require any specific coding by the
programmer.
The idea is simple: when a file is opened, create a memory segment and
define the file as the disk backup for this segment. In principle, this involves
little more than the allocation of a page table, and initializing all the entries to
invalid (that is, the data is on disk and not in memory).
A read or write system call is then reduced to just copying the data from or to
this mapped memory segment. If the data is not already there, this will cause
a page fault. The page fault handler will use the inode to find the actual data,
and transfer it from the disk. The system call will then be able to proceed
with copying it.
For example, the following pseudo-code gives the flavor of mapping a file
and changing a small part of it:
ptr=mmap("myfile",RW)
strncpy(ptr+2048,"new text",8)
The first line is the system call mmap, which maps the named file into the
address space with read/write permissions. It returns a pointer to the
beginning of the newly created segment. The second line over-writes 8 bytes
at the beginning of the third block of the file (assuming 1024-byte blocks).
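For comparison, the real POSIX calls look roughly like this (error handling
omitted); the file myfile is assumed to already exist and be at least 2056
bytes long.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("myfile", O_RDWR);
        off_t len = lseek(fd, 0, SEEK_END);       /* size of the existing file */

        char *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        memcpy(ptr + 2048, "new text", 8);        /* over-write 8 bytes in the third block */

        munmap(ptr, len);                         /* changes are written back to the file */
        close(fd);
        return 0;
    }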
Moreover, as noted above copying the data may be avoided. This not only
avoids the overhead of the copying, but also completely eliminates the
overhead of trapping into the system to perform the read or write system call.
However, the overhead of disk access cannot be avoided — it is just done as
a page fault rather than as a system call.
Exercise 132 What happens if two distinct processes map the same file?
To read more: See the man page for mmap.
The index is arranged in a hierarchical manner. First, there are a few (e.g. 10)
direct pointers, which list the first blocks of the file. Thus for small files all
the necessary pointers are included in the inode, and once the inode is read
into memory, they can all be found. As small files are much more common
than large ones, this is efficient.
If the file is larger, so that the direct pointers are insufficient, the indirect
pointer is used. This points to a whole block of additional direct pointers,
which each point to a block of file data. The indirect block is only allocated if
it is needed, i.e. if the file is bigger than 10 blocks. As an example, assume
blocks are 1024 bytes (1 KB), and each pointer is 4 bytes. The 10 direct
pointers then provide access to a maximum of 10 KB. The indirect block
contains 256 additional pointers, for a total of 266 blocks (and 266 KB).
If the file is bigger than 266 blocks, the system resorts to using the double
indirect pointer, which points to a whole block of indirect pointers, each of
which point to an additional block of direct pointers. The double indirect
block has 256 pointers to indirect blocks, so it represents a total of 65536
blocks. Using it, file sizes can grow to a bit over 64 MB. If even this is not
enough, the triple indirect pointer is used. This points to a block of double
indirect pointers.
[Figure: the Unix inode — metadata fields (such as size and last modification
time) followed by the direct pointers (to blocks 1–10) and the indirect
pointer, whose block points to blocks 11–266]
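Using the example’s parameters (10 direct pointers, 256 pointers per indirect
block), the level of indirection needed to reach a given block of the file can
be computed as follows:

    #define NDIRECT   10      /* direct pointers in the inode     */
    #define NINDIRECT 256     /* pointers per 1 KB indirect block */

    /* Given a block index within a file, report how many levels of indirection
       are needed to reach it: 0 = direct, 1 = indirect, 2 = double, 3 = triple. */
    int levels_needed(long blk)
    {
        if (blk < NDIRECT)
            return 0;                               /* direct pointer          */
        blk -= NDIRECT;
        if (blk < NINDIRECT)
            return 1;                               /* via the indirect block  */
        blk -= NINDIRECT;
        if (blk < (long)NINDIRECT * NINDIRECT)
            return 2;                               /* via the double indirect */
        return 3;                                   /* via the triple indirect */
    }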
A nice property of this hierarchical structure is that the time needed to find a
block is logarithmic in the file size. And note that due to the skewed
structure, it is indeed logarithmic in the actual file size — not in the maximal
supported file size. The extra levels are only used in large files, but avoided
in small ones.
Exercise 133 What file sizes are possible with the triple indirect block? What
other constraints are there on file size?
The Distribution of File Sizes
[Figure: cumulative distributions of files and of bytes as a function of file
size (x-axis: file size in bytes, from 0 to 512M; y-axis: cumulative fraction,
from 0 to 1)]
The distribution of files is the top line (this is a CDF, i.e. for each size x it
shows the probability that a file will be no longer than x). For example, we
can see that about 20% of the files are up to 512 bytes long. The bottom line
is the distribution of bytes: for each file size x, it shows the probability that
an arbitrary byte belong to a file no longer than x. For example, we can see
that about 10% of the bytes belong to files that are up to 16 KB long.
The three vertical arrows allow us to characterize the distribution [4]. The
middle one shows that this distribution has a “joint ratio” of 11/89. This
means that the top 11% of the files are so big that together they account for
89% of the disk space. At the same time, the bottom 89% of the files are so
small that together they account for only 11% of the disk space. The leftmost
arrow shows that the bottom half of the files are so small that they only
account for 1.5% of the disk space. The rightmost arrow shows that the other
end of the distribution is even more extreme: half of the disk space is
accounted for by only 0.3% of the files, which are each very big.
A different structure is used by FAT, the original DOS file system (which has
the dubious distinction of having contributed to launching Bill Gates’s
career). FAT stands for “file allocation table”, the main data structure used to
allocate disk space. This table, which is kept at the beginning of the disk,
contains an entry for each disk block (or, in more recent versions, each
cluster of consecutive disk blocks that are allocated as a unit). Entries that
constitute a file are linked to each other in a chain, with each entry holding
the number of the next one. The last one is indicated by a special marking of
all 1’s (FF in the figure). Unused blocks are marked with 0, and bad blocks
that should not be used also have a special marking (-1 in the figure).
[Figure: a directory and the file allocation table — each directory entry holds
the file’s attributes and the index of its first block, and table entries chain
to the next block of the file, ending with an FF marker; free entries are
marked 0 and bad blocks −1]
Exercise 134 Repeat Ex. 127 for the Unix inode and FAT structures: what
happens if you seek beyond the current end of the file, and then write some
data?
Exercise 135 What are the pros and cons of the Unix inode structure vs. the
FAT structure? Hint: consider the distribution of file sizes shown above.
The problem with this structure is that each table entry represents an
allocation unit of disk space. For small disks it was possible to use an
allocation unit of 512 bytes. In fact, this is OK for disks of up to 512 × 64K =
32 MB. But when bigger disks became available, they had to be divided into
the same 64K allocation units. As a result the allocation units grew
considerably: for example, a 256MB disk was allocated in units of 4K. This
led to inefficient disk space usage, because even small files had to allocate at
least one unit. But the design was hard to change because so many systems
and so much software were dependent on it.
The problem with this layout is that it entails much seeking. Consider the
example of opening a file named /a/b/c. To do so, the file system must access
the root inode, to find the blocks used to implement it. It then reads these
blocks to find which inode has been allocated to directory a. It then has to
read the inode for a to get to its blocks, read the blocks to find b, and so
on. If
all the inodes are concentrated at one end of the disk, while the blocks are
dispersed throughout the disk, this means repeated seeking back and forth.
A possible solution is to try and put inodes and related blocks next to each
other, in the same set of cylinders, rather than concentrating all the inodes in
one place (a cylinder is a set of disk tracks with the same radius; see
Appendix C). This was done in the Unix fast file system. However, such
optimizations depend on the ability of the system to know the actual layout of
data on the disk, which tends to be hidden by modern disk controllers [1].
Modern systems are therefore limited to using logically contiguous disk
blocks, hoping that the disk controller indeed maps them to physically
proximate locations on the disk surface.
Exercise 136 The superblock contains the data about all the free blocks, so
every time a new block is allocated we need to access the superblock. Does
this entail a disk access and seek as well? How can this be avoided? What
are the consequences?
The use of a large buffer cache and aggressive prefetching can satisfy most
read requests from memory, saving the overhead of a disk access. The next
performance bottleneck is then the implementation of small writes, because
they require much seeking to get to the right block. This can be solved by not
writing the modified blocks in place, but rather writing a single continuous
log of all changes to all files and metadata. In addition to reducing seeking,
this also improves performance because data will tend to be written
sequentially, thus also making it easier to read at a high rate when needed.
(As an aside, the first block of the disk is usually a boot block, which is not
part of the file system; the file system’s own structures therefore actually
start at the second block.)
Of course, this complicates the system’s internal data structures. When a disk
block is modified and written in the log, the file’s inode needs to be modified
to reflect the new location of the block. So the inode also has to be written to
the log. But now the location of the inode has also changed, so this also has
to be updated and recorded. To reduce overhead, metadata is not written to
disk immediately every time it is modified, but only after some time or when
a number of changes have accumulated. Thus some data loss is possible if the
system crashes, which is the case anyway.
Another problem is that eventually the whole disk will be filled with the log,
and no more writing will be possible. The solution is to perform garbage
collection all the time: we write new log records at one end, and delete old
ones at the other. In many cases, the old log data can simply be discarded,
because it has since been overwritten and therefore exists somewhere else in
the log. Pieces of data that are still valid are simply re-written at the end of
the log.
The discussion so far has implicitly assumed that there is enough space on the
disk for the desired files, and even for the whole file system. With the
growing size of data sets used by modern applications, this can be a
problematic assumption. The solution is to use another layer of abstraction:
logical volumes.
The considerations cited above regarding block placement are based on the
need to reduce seeking, because seeking (mechanically moving the head of a
disk) takes a relatively long time. But with new solid state disks, based on
flash memory, there is no mechanical movement. Therefore such considerations
become void, and we have full flexibility in using memory blocks.
5.4.3 Reliability
By definition, the whole point of files is to store data permanently. This can
run into two types of problems. First, the system may crash leaving the data
structures on the disk in an inconsistent state. Second, the disks themselves
sometimes fail. Luckily, this can be overcome.
The implementation involves the logging of all operations. First, a log entry
describing the operation is written to disk. Then the actual modification is
done. If the system crashes before the log entry is completely written, then
the operation never happened. If it crashes during the modification itself, the
file system can be salvaged using the log that was written earlier.
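The essence of this ordering can be sketched as follows, with hypothetical
helper functions standing in for the actual log implementation:

    struct operation;                                  /* opaque description of a change */
    extern void write_log_record(struct operation *);  /* assumed helpers */
    extern void flush_log_to_disk(void);
    extern void apply_to_file_system(struct operation *);
    extern void mark_log_record_done(struct operation *);

    /* Write-ahead logging: the record describing an operation reaches the disk
       before the operation itself is applied. If a crash occurs before step 2
       completes, the operation simply never happened; if it occurs later, the
       log is replayed during recovery. */
    void journaled_update(struct operation *op)
    {
        write_log_record(op);        /* 1. describe the intended change        */
        flush_log_to_disk();         /* 2. force the record to the disk        */
        apply_to_file_system(op);    /* 3. only now modify the real structures */
        mark_log_record_done(op);    /* 4. the record may later be reclaimed   */
    }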
RAID 1: mirroring — each block is duplicated and stored on two distinct disks.
This allows for fast reading (you can access the less loaded copy), but wastes
disk space and delays writing.
RAID 3: parity disk — data blocks are distributed among the disks in
round-robin manner. For each set of blocks, a parity block is computed and
stored on a separate disk. Thus if one of the original data blocks is lost due
to a disk failure, its data can be reconstructed from the other blocks and the
parity. This is based on the self-identifying nature of disk failures: the
controller knows which disk has failed, so parity is good enough.
How are the above ideas used? One option is to implement the RAID as part
of the file system: whenever a disk block is modified, the corresponding
redundant block is updated as well. Another option is to buy a disk controller
that does this for you. The interface is the same as a single disk, but it is
faster (because several disks are actually used in parallel) and is much more
reliable. While such controllers are available even for PCs, they are still
rather expensive.
5.5 Summary
Abstractions
As files deal with permanent storage, there is not much scope for resource
management — either the required space is available for exclusive long-term
usage, or it is not. The only management in this respect is the enforcement of
disk quotas.
Files and directories are represented by a data structure with the relevant
metadata; in Unix this is called an inode. Directories are implemented as files
that contain the mapping from user-generated names to the respective inodes.
Access to disk is mediated by a buffer cache.
Workload issues
Hardware support
Hardware support exists at the I/O level, but not directly for files. One form
of support is DMA, which allows slow I/O operations to be overlapped with
other operations; without it, the whole idea of switching to another process
while waiting for the disk to complete the transfer would be void.
Bibliography
[1] D. Anderson, “You don’t know Jack about disks”. Queue 1(4), pp. 20–30,
Jun 2003.
A modern disk typically has multiple (1–12) platters which rotate together on
a common spindle (at 5400 or 7200 RPM). Data is stored on both surfaces of
each platter. Each surface has its own read/write head, and they are all
connected to the same arm and move in unison. However, typically only one
head is used at a time, because it is too difficult to align all of them at once.
The data is recorded in concentric tracks (about 1500–2000 of them). The set
of tracks on the different platters that have the same radius are called a
cylinder. This concept is important because accessing tracks in the same
cylinder just requires the heads to be re-aligned, rather than being moved.
Each track is divided into sectors, which define the minimal unit of access
(each is 256–1024 data bytes plus error correction codes and an inter-sector
gap; there are 100-200 per track). Note that tracks near the rim of the disk are
much longer than tracks near the center, and therefore can store much more
data. This is done by dividing the radius of the disk into (3–20) zones, with
the tracks in each zone divided into a different number of sectors. Thus tracks
near the rim have more sectors, and store more data.
[Figure: disk anatomy — spindle, platters, arm and heads, tracks, cylinders,
zones, and sectors]
In times gone by, addressing a block of data on the disk was accomplished by
specifying the surface, track, and sector. Contemporary disks, and in
particular those with SCSI controllers, present an interface in which all
blocks appear in one logical sequence. This allows the controller to hide bad
blocks, and is easier to handle. However, it prevents certain optimizations,
because the operating system does not know which blocks are close to each
other. For example, the operating system cannot specify that certain data
blocks should reside in the same cylinder [1].
The base algorithm is FIFO (first in first out), that just services the requests in
the order that they arrive. The most common improvement is to use the
SCAN algorithm, in which the head moves back and forth across the tracks
and services requests in the order that tracks are encountered. A variant of
this is C-SCAN (circular SCAN), in which requests are serviced only while
moving in one direction, and then the head returns as fast as possible to the
origin. This improves fairness by reducing the maximal time that a request
may have to wait.
[Figure: servicing queued disk requests — track position as a function of time
under FIFO and C-SCAN, starting from the same initial position; the slope
reflects the rate of head movement, and each request adds rotational delay and
transfer time]
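A sketch of how pending requests might be ordered under C-SCAN: tracks at or
beyond the current head position are served first, in increasing order, and the
rest are served after the head returns to the origin.

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    /* Fill order[] with the C-SCAN service order of the n pending track
       requests in tracks[] (which is sorted in place). */
    void cscan_order(int head, int tracks[], int n, int order[])
    {
        qsort(tracks, n, sizeof(int), cmp_int);
        int k = 0;
        for (int i = 0; i < n; i++)                       /* tracks ahead of the head */
            if (tracks[i] >= head)
                order[k++] = tracks[i];
        for (int i = 0; i < n && tracks[i] < head; i++)   /* after returning to the origin */
            order[k++] = tracks[i];
    }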
As with addressing, in the past it was the operating system that was
responsible for scheduling the disk operations, and the disk accepted such
operations one at a time. Contemporary disks with SCSI controllers are
willing to accept multiple outstanding requests, and do the scheduling
themselves.
While placing an inode and the blocks it points to together reduces seeking, it
may also cause problems. Specifically, a large file may monopolize all the
blocks in the set of cylinders, not leaving any for other inodes in the set.
Luckily, the list of file blocks is not all contained in the inode: for large files,
most of it is in indirect blocks. The fast file system therefore switches to a
new set of cylinders whenever a new indirect block is allocated, choosing a
set that is less loaded than the average. Thus large files are indeed spread
across the disk. The extra cost of the seek is relatively low in this case,
because it is amortized against the accesses to all the data blocks listed in the
indirect block.
While this solution improves the achievable bandwidth for intermediate size
files, it does not necessarily improve things for the whole workload. The
reason is that large files indeed tend to crowd out other files, so leaving their
blocks in the same set of cylinders causes other small files to suffer. More
than teaching us about disk block allocation, this then provides testimony to
the complexity of analyzing performance implications, and the need to take a
comprehensive approach.
Sequential layout is crucial for achieving top data transfer rates. Another
optimization is therefore to place consecutive logical blocks a certain distance
from each other along the track, called the track skew. The idea is that
sequential access is common, so it should be optimized. However, the
operating system and disk controller need some time to handle each request.
If we know how much time this is, and the speed that the disk is rotating, we
can calculate how many sectors to skip to account for this processing. Then
the request will be handled exactly when the requested block arrives under
the heads.
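The required skew is just the processing time divided by the time it takes one sector to pass under the head, rounded up. A small sketch of this calculation, with all parameter values chosen purely for illustration:

    /* Sketch: how many sectors to skip (track skew) so that, after the OS and
     * controller finish processing a request, the next logical block is just
     * arriving under the head.  All parameter values are illustrative. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double rpm = 7200.0;                  /* rotational speed               */
        int sectors_per_track = 150;
        double processing_us = 300.0;         /* OS + controller overhead       */

        double rotation_us = 60.0e6 / rpm;    /* one full rotation, microseconds */
        double sector_us = rotation_us / sectors_per_track;
        int skew = (int)ceil(processing_us / sector_us);

        printf("sector time = %.1f us, skew = %d sectors\n", sector_us, skew);
        return 0;
    }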
To read more: The Unix fast file system was originally described by
McKusick and friends [2].
Bibliography
[1] D. Anderson, “You don’t know Jack about disks”. Queue 1(4), pp. 20–30, Jun 2003.
[2] M. McKusick, W. Joy, S. Leffler, and R. Fabry, “A fast file system for UNIX”. ACM Trans. Comput. Syst. 2(3), pp. 181–197, Aug 1984.
[3] C. Ruemmler and J. Wilkes, “An introduction to disk drive modeling”. Computer 27(3), pp. 17–28, Mar 1994.
[4] E. Shriver, A. Merchant, and J. Wilkes, “An analytic behavior model for disk drives with readahead caches and request reordering”. In SIGMETRICS Conf. Measurement & Modeling of Comput. Syst., pp. 182–191, Jun 1998.
Now after we know about many specific examples of what operating systems
do and how they do it, we can distill some basic principles.
Chapter 6
6.1 Virtualization
Virtualization is probably the most important tool in the bag of tricks used for
system design. It means that the objects that the system manipulates and
presents to its users are decoupled from the underlying hardware. Examples
for this principle abound.
Virtualization typically involves some mapping from the virtual objects to the
physical ones. Thus each level of virtualization is equivalent to a level of
indirection. While this has its price in terms of overhead, the benefits of
virtualization outweigh the cost by far.
• Memory is allocated in pages that are mapped to the address space using a page table.
• Disk space is allocated in blocks that are mapped to files using the file’s index structure.
• Network bandwidth is allocated in packets that are numbered by the sender and reassembled at the destination.
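As an illustration of the first of these mappings, here is a minimal sketch of a single-level page-table lookup, assuming 4 KB pages; all structure and constant names are hypothetical, and real hardware uses multi-level tables and a TLB.

    /* Minimal single-level page-table translation sketch (4 KB pages). */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                      /* 4 KB pages */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define NPAGES     1024                    /* hypothetical table size */

    typedef struct {
        int      present;                      /* is this page mapped? */
        uint32_t frame;                        /* physical frame number */
    } pte_t;

    static pte_t page_table[NPAGES];

    /* Translate a virtual address, or return -1 to signal a page fault. */
    long translate(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;
        uint32_t offset = vaddr & (PAGE_SIZE - 1);
        if (vpn >= NPAGES || !page_table[vpn].present)
            return -1;                         /* would trap to the OS */
        return ((long)page_table[vpn].frame << PAGE_SHIFT) | offset;
    }

    int main(void)
    {
        page_table[5] = (pte_t){ .present = 1, .frame = 42 };
        uint32_t vaddr = 5 * PAGE_SIZE + 0x10;
        printf("virtual 0x%x -> physical 0x%lx\n",
               vaddr, (unsigned long)translate(vaddr));
        return 0;
    }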
Maintain flexibility
One of the reasons for using small fixed chunks is to increase flexibility. Thus to be flexible we should avoid making large fixed allocations.
A good example of this principle is the handling of both memory space and
disk space. It is possible to allocate a fixed amount of memory for the buffer
cache, to support file activity, but this reduces flexibility. It is better to use a
flexible scheme in which the memory manager handles file I/O based on
memory mapped files. This allows for balanced allocations at runtime
between the memory activity and file activity.
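In Unix this flexibility is also exposed to applications through mmap, which lets the memory manager page file data in and out on demand. A minimal sketch (the file name is hypothetical):

    /* Sketch: read a file through the memory manager with mmap. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/example.dat", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* The file's pages are brought in by the paging mechanism on first touch. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        long count = 0;
        for (off_t i = 0; i < st.st_size; i++)         /* touching pages may fault */
            if (data[i] == '\n')
                count++;
        printf("%ld lines\n", count);

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }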
In many cases the distribution of request sizes is highly skewed: there are
very many small requests, some medium-sized requests, and few very large
requests. Thus the operating system should optimize its performance for
small requests, but still be able to support large requests efficiently. The
solution is to use hierarchical structures that span several (binary) orders of
magnitude. Examples are
• The Unix inode structure used to keep maps of file blocks, with its indirect, double indirect, and triple indirect pointers.
• Buddy systems used to allocate memory, disk space, or processors.
• Multi-level feedback queues, where long jobs receive increasingly longer time quanta at lower priorities.
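As a concrete illustration of the inode example, the following sketch determines which level of the hierarchy covers a given file block. It assumes the classic-style numbers of 12 direct pointers and 1024 pointers per indirect block; real file systems use different constants.

    /* Sketch: which level of the inode block map covers logical block 'b'?
     * Assumes 12 direct pointers and 1024 pointers per indirect block
     * (classic-style numbers; real file systems differ). */
    #include <stdio.h>

    #define NDIRECT   12
    #define NINDIRECT 1024

    const char *block_level(long b)
    {
        if (b < NDIRECT)                                   return "direct";
        b -= NDIRECT;
        if (b < NINDIRECT)                                 return "single indirect";
        b -= NINDIRECT;
        if (b < (long)NINDIRECT * NINDIRECT)               return "double indirect";
        b -= (long)NINDIRECT * NINDIRECT;
        if (b < (long)NINDIRECT * NINDIRECT * NINDIRECT)   return "triple indirect";
        return "beyond maximal file size";
    }

    int main(void)
    {
        long blocks[] = { 0, 11, 12, 1035, 1036, 2000000 };
        for (int i = 0; i < 6; i++)
            printf("block %7ld: %s\n", blocks[i], block_level(blocks[i]));
        return 0;
    }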
Time is limitless
Time is the one resource that is effectively unlimited: it provides the possibility to serve more and more clients, by gracefully degrading the service each one receives. Examples are time slicing the CPU and sending messages in packets.
The problem is of course that sometimes the load on the system indeed needs
more resources than are available. The solution in this case is — if possible
— to forcefully reduce the load. For example, this is the case when swapping
is used to escape from thrashing, or when a system uses admission controls.
6.3 Reduction
Problems don’t always have to be solved from basic principles. On the
contrary, the best solution is usually to reduce the problem to a smaller
problem that can be solved easily.
Use Conventions
One example comes from the early support for paging and virtual memory on
the VAX running Unix. Initially, hardware support for mapping was
provided, but without use bits that can be used to implement replacement
algorithms that approximate LRU. The creative solution was to mark the
pages as absent. Accessing them would then cause a page fault and a trap to
the operating system. But the operating system would know that the page was
actually there, and would simply use this page fault to simulate its own use
bits. In later generations, the use bit migrated to hardware.
Finally, many aspects of I/O control, including caching and disk scheduling,
are migrating from the operating system to the disk controllers.
As a side note, migration from the operating system to user applications is
also important. Various abstractions invented within operating systems, such
as semaphores, are actually useful for any application concerned with
concurrent programming. These should be (and have been) exposed at the
system interface, and made available to all.
– Address translation that includes protection and page faults for unmapped
pages
– The ability to switch mappings easily using a register that points to the base
address of the page/segment table
– The TLB which serves as a cache for the page table
– Updating the used/modified bits of pages when they are accessed
Finally, there are lots of technical issues that simply don’t have much lustre.
However, you need to know about them to really understand how the system
works.
Chapter 7
The operating system can do anything it desires. This is due in part to the fact
that the operating system runs in kernel mode, so all the computer’s
instructions are available. For example, it can access all the physical memory,
bypassing the address mapping that prevents user processes from seeing the
memory of other processes. Likewise, the operating system can instruct a
disk controller to read any data block off the disk, regardless of who the data
belongs to.
The problem is therefore to prevent the system from performing such actions
on behalf of the wrong user. Each user, represented by his processes, should
be able to access only his private data (or data that is flagged as publicly
available). The operating system, when acting on behalf of this user (e.g.
during the execution of a system call), must restrain itself from using its
power to access the data of others.
Windows was originally designed as a single-user system, so the user had full
privileges. The distinction between system administrators and other users
was only introduced recently.
A basic tool used for access control is hiding. Whatever a process (or user)
can’t see, it can’t access. For example, kernel data structures are not visible in
the address space of any user process. In fact, neither is the data of other
processes. This is achieved using the hardware support for address mapping,
and only mapping the desired pages into the address space of the process.
A slightly less obvious example is that files can be hidden by not being listed
in any directory. For example, this may be useful for the file used to store
users’ passwords. It need not be listed in any directory, because it is not
desirable that processes will access it directly by name. All that is required is
that the system know where to find it. As the system created this file in the
first place, it can most probably find it again. In fact, its location can even be
hardcoded into the system.
Exercise 137 One problem with the original versions of Unix was that
passwords were stored together with other data about users in a file that was
readable to all (/etc/passwd). The password data was, however, encrypted.
Why was this still a security breach?
A simple example is the file descriptor used to access open files. The
operating system, as part of the open system call, creates a file descriptor and
returns it to the process. The process stores this descriptor, and uses it in
subsequent read and write calls to identify the file in question. The file
descriptor’s value makes sense to the system in the context of identifying
open files; in fact, it is an index into the process’s file descriptors table. It is
meaningless for the process itself. By knowing how the system uses it, a
process can actually forge a file descriptor (the indexes are small integers).
However, this is useless: the context in which file descriptors are used causes
such forged file descriptors to point to other files that the process has opened,
or to be illegal (if they point to an unused entry in the table).
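Conceptually, the kernel keeps a per-process array of pointers into the open files table, and the descriptor is just the index into that array. A simplified user-level sketch (structure and field names are hypothetical):

    /* Simplified sketch of the per-process file descriptor table. */
    #include <stddef.h>
    #include <stdio.h>

    #define MAX_FDS 20

    struct open_file {               /* entry in the system-wide open files table */
        long offset;
        int  inode_number;
    };

    struct proc {
        struct open_file *fd_table[MAX_FDS];   /* indexed by file descriptor */
    };

    /* What the kernel conceptually does with the fd a process passes in. */
    struct open_file *lookup_fd(struct proc *p, int fd)
    {
        if (fd < 0 || fd >= MAX_FDS || p->fd_table[fd] == NULL)
            return NULL;             /* forged or stale descriptor: EBADF */
        return p->fd_table[fd];
    }

    int main(void)
    {
        struct open_file f = { .offset = 0, .inode_number = 1234 };
        struct proc p = { 0 };
        p.fd_table[3] = &f;          /* as if open() had returned 3 */
        printf("fd 3 -> inode %d\n", lookup_fd(&p, 3)->inode_number);
        printf("fd 7 -> %s\n", lookup_fd(&p, 7) ? "valid" : "EBADF");
        return 0;
    }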
Exercise 138 Does using the system’s default random number stream for this
purpose compromise security?
Humans recognize each other by their features: their face, hair style, voice,
etc. While user recognition based on the measurement of such features is now
beginning to be used by computer systems, it is still not common.
1. The password should be physically secured by the user. If the user writes it
on his terminal, anyone with access to the terminal can fool the system.
To read more: Stallings [1, sect. 15.3] includes a nice description of attacks
against user passwords.
Processes may also transfer their identity
Unix has a mechanism for just such situations, called “set user ID on
execution”. This allows the teacher to write a program that reads the file and
extracts the desired data. The program belongs to the teacher, but is executed
by the students. Due to the set user ID feature, the process running the
program assumes the teacher’s user ID when it is executed, rather than
running under the student’s user ID. Thus it has access to the grades file.
However, this does not give the student unrestricted access to the file
(including the option of improving his grades), because the process is still
running the teacher’s program, not a program written by the student.
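The effect of set-user-ID execution shows up as a difference between the real and the effective user IDs of the running process: access checks use the effective ID, while the real ID still identifies the invoking user. A minimal sketch using the standard calls (to actually observe a difference, the binary's owner would need to set the setuid bit, e.g. with "chmod u+s"):

    /* Sketch: a set-user-ID program runs with the file owner's effective UID,
     * while the real UID still identifies the user who invoked it. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        printf("real UID:      %d\n", (int)getuid());
        printf("effective UID: %d\n", (int)geteuid());
        /* Access checks on files (e.g. the grades file) use the effective UID. */
        return 0;
    }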
Exercise 140 Does sending a Unix file descriptor from one process to
another provide the receiving process with access to the file?
Object     Operations
File       read, write, rename, delete
Memory     read, write, map, unmap
Process    kill, suspend, resume
In the following we use files for concreteness, but the same ideas can be
applied to all other objects as well.
Access to files is restricted in order to protect data
Files may contain sensitive information, such as your old love letters or new
patent ideas. Unless restricted, this information can be accessed by whoever
knows the name of the file; the name in turn can be found by reading the
directory. The operating system must therefore provide some form of
protection, where users control access to their data.
Most systems, however, do not store the access matrix in this form. Instead
they either use its rows or its columns.
You can focus on access rights to objects...
Using columns means that each object is tagged with an access control list
(ACL), which lists users and what they are allowed to do to this object.
Anything that is not specified as allowed is not allowed. Alternatively, it is
possible to also have a list cataloging users and specific operations that are to
be prevented.
Exercise 141 Create an ACL that allows read access to user yosi, while
denying write access to a group of which yosi is a member. And how about
an ACL that denies read access to all the group’s members except yosi?
Using rows means that each user (or rather, each process running on behalf of
a user) has a list of capabilities, which lists what he can do to different
objects. Operations that are not specifically listed are not allowed.
The problem with both ACLs and capabilities is that they may grow to be
very large, if there are many users and objects in the system. The solution is
to group users or objects together, and specify the permissions for all group
members at once. For example, in Unix each file is protected by a simplified
ACL, which considers the world as divided into three: the file’s owner, his
group, and all the rest. Moreover, only 3 operations are supported: read,
write, and execute. Thus only 9 bits are needed to specify the file access
permissions. The same permission bits are also used for directories, with
some creative interpretation: read means listing the directory contents, write
means adding or deleting files, and execute means being able to access files
and subdirectories.
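The nine bits are readily inspected with the standard stat call; a small sketch that prints a file's mode in the familiar rwxrwxrwx form (the file name used here is just an example):

    /* Sketch: decode the 9 Unix permission bits of a file. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;
        if (stat("/etc/passwd", &st) != 0) { perror("stat"); return 1; }

        const int bits[9] = { S_IRUSR, S_IWUSR, S_IXUSR,   /* owner  */
                              S_IRGRP, S_IWGRP, S_IXGRP,   /* group  */
                              S_IROTH, S_IWOTH, S_IXOTH }; /* others */
        const char sym[3] = { 'r', 'w', 'x' };

        for (int i = 0; i < 9; i++)
            putchar((st.st_mode & bits[i]) ? sym[i % 3] : '-');
        putchar('\n');
        return 0;
    }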
Exercise 142 What can you do with a directory that allows you read
permission but no execute permission? What about the other way round? Is
this useful? Hint: can you use this to set up a directory you share with your
partner, but that is effectively inaccessible to others?
7.4 Summary
Abstractions
Resource management
Workload issues
Hardware support
Bibliography
[1] W. Stallings, Operating Systems: Internals and Design Principles.
Prentice-Hall, 3rd ed., 1998.
Chapter 8
With SMP, things really happen at the same time, so disabling interrupts doesn’t help.
8.1.3 Conflicts
In recent years multiprocessor systems (i.e. machines with more than one
CPU) have become more common. How does this affect scheduling?
Queueing analysis can help
The question we are posing is the following: is it better to have one shared
ready queue, or should each processor have its own ready queue? A shared
queue is similar to common practice in banks and post offices, while separate
queues are similar to supermarkets. Who is right?
[Figure: two ways to organize scheduling on a 4-CPU system — a single shared queue from which all CPUs take arriving jobs, versus a separate queue per CPU, each with its own arriving and departing jobs.]
The queueing model for a shared queue is M/M/m, where m is the number of
processors (servers). The state space is similar to that of an M/M/1 queue,
except that the transitions from state i to i− 1 depend on how many servers are
active:
[Figure: state space of the M/M/m queue — states 0, 1, 2, . . . , m, m+1, . . . with arrival transitions at rate λ, and departure transitions at rate µ, 2µ, 3µ, . . . , mµ up to state m, and mµ thereafter.]
The mathematics for this case are a bit more complicated than for the M/M/1
queue, but follow the same principles. The result for the average response
time is
r̄ = (1/µ) · ( 1 + q / (m(1−ρ)) )
where q is the probability of having to wait in the queue because all m servers
are busy, and is given by
q = π0 · (mρ)^m / ( m! (1−ρ) )
The expression for π0 is also rather complicated...
The resulting plots of the response time, assuming µ = 1 and m = 4, are
[Figure: average response time as a function of utilization, comparing an M/M/4 queue with four independent M/M/1 queues (µ = 1, m = 4).]
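These formulas are easy to evaluate numerically. The following sketch computes the M/M/m average response time using the expressions above (the probability of queueing q is the Erlang-C formula), and compares a shared M/M/4 queue with four independent M/M/1 queues, each receiving a quarter of the arrivals; this reproduces the comparison shown in the figure.

    /* Sketch: average response time of an M/M/m queue (Erlang-C formula),
     * compared with m independent M/M/1 queues each getting 1/m of the load. */
    #include <math.h>
    #include <stdio.h>

    double mmm_response(double lambda, double mu, int m)
    {
        double rho = lambda / (m * mu);
        double sum = 0.0, term = 1.0;                 /* term = (m*rho)^k / k! */
        for (int k = 0; k < m; k++) {
            sum += term;
            term *= m * rho / (k + 1);
        }
        /* now term = (m*rho)^m / m! */
        double pi0 = 1.0 / (sum + term / (1.0 - rho));
        double q = term / (1.0 - rho) * pi0;          /* prob. of having to queue */
        return (1.0 / mu) * (1.0 + q / (m * (1.0 - rho)));
    }

    double mm1_response(double lambda, double mu)
    {
        return 1.0 / (mu - lambda);
    }

    int main(void)
    {
        int m = 4; double mu = 1.0;
        for (double util = 0.1; util < 0.95; util += 0.1) {
            double lambda = util * m * mu;            /* total arrival rate */
            printf("util %.1f: M/M/4 = %6.2f   4 x M/M/1 = %6.2f\n",
                   util, mmm_response(lambda, mu, m),
                   mm1_response(lambda / m, mu));
        }
        return 0;
    }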
Obviously, using a shared queue is substantially better. The banks and post
offices are right. And once you think about it, it also makes sense: with a
shared queue, there is never a situation where a client is queued and a server
is idle, and clients that require only short service do not get stuck behind
clients that require long service — they simply go to another server.
Exercise 143 Why is it “obvious” from the graphs that M/M/4 is better?
But what about preemption?
The queueing analysis is valid for the non-preemptive case: each client sits in
the queue until it is picked by a server, and then holds on to this server until it
terminates. With preemption, a shared queue is less important, because
clients that only require short service don’t get stuck behind clients that need
long service. However, this leaves the issue of load balancing.
If each processor has its own local queue, the system needs explicit load
balancing — migrating jobs from overloaded processors to underloaded ones.
This ensures fairness and overall low response times. However,
implementing process migration is not trivial, and creates overhead. It may
therefore be better to use a shared queue after all. This ensures that there is
never a case where some job is in the queue and some processor is idle. As
jobs are not assigned to processors in this case, this is called “load sharing”
rather than “load balancing”.
The drawback of using a shared queue is that each job will probably run on a
different processor each time it is scheduled. This leads to the corruption of
cache state. It is fixed by affinity scheduling, which takes the affinity of each
process to each processor into account, where affinity is taken to represent
the possibility that relevant state is still in the cache. It is similar to simply
using longer time quanta.
8.2 Supporting Multicore Environments
Chapter 9
For example, one can implement thread management as a low-level layer. Then the
memory manager, file system, and network protocols can use threads to
maintain the state of multiple active services.
The order of the layers is a design choice. One can implement a file system
above a memory manager, by declaring a certain file to be the backing store
of a memory segment, and then simply accessing the memory. If the data is
not there, the paging mechanism will take care of getting it. Alternatively,
one can implement memory management above a file system, by using a file
for the swap area.
[Figure: a layered structure — applications issue system calls to the operating system, which contains the file system, memory management, the dispatcher, and drivers, running on top of the hardware.]
This distinction is also related to the issue of where the operating system code
runs. A kernel can either run as a separate entity (that is, be distinct from all
processes), or be structured as a collection of routines that execute as needed
within the environment of user processes. External servers are usually
separate processes that run at user level, just like user applications. Routines
that handle interrupts or system calls run within the context of the current
process, but typically use a separate kernel stack.
Interrupts are a hardware event. When an interrupt occurs, the hardware must
know what operating system function to call. This is supported by the
interrupt vector. When the system is booted, the addresses of the relevant
functions are stored in the interrupt vector, and when an interrupt occurs, they
are called. The available interrupts are defined by the hardware, and so is the
interrupt vector; the operating system must comply with this definition in
order to run on this hardware.
While the entry point for handling the interrupt must be known by the
hardware, it is not necessary to perform all the handling in this one function.
In many cases, it is even unreasonable to do so. The reason is that interrupts
are asynchronous, and may occur at a very inopportune moment. Thus many
systems partition the handling of interrupts into two: the handler itself just
stores some information about the interrupt that has occurred, and the actual
handling is done later, by another function. This other function is typically
invoked at the next context switch. This is a good time for practically any
type of operation, as the system is in an intermediate state after having stopped
one process but before starting another.
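Schematically, this two-level structure looks as follows: the handler invoked by the hardware only records the event, and the deferred part runs later from the context-switch path. All names here are made up for illustration.

    /* Schematic sketch of two-level interrupt handling (names are made up). */
    #include <stdio.h>

    static volatile int disk_intr_pending = 0;   /* set by the handler, read later */
    static volatile int disk_intr_status  = 0;

    /* First level: invoked directly by the hardware.  Must be short, because it
     * may run at a very inopportune moment; it only records what happened. */
    void disk_interrupt_handler(int status)
    {
        disk_intr_status  = status;
        disk_intr_pending = 1;
    }

    /* Second level: the real processing, done later at a convenient time. */
    void disk_intr_deferred_work(void)
    {
        printf("completing disk I/O, status %d\n", disk_intr_status);
        /* ...wake up the waiting process, start the next queued request, etc. */
    }

    /* Called on the context-switch path, when the system is in an intermediate
     * state: one process stopped, the next not yet started. */
    void context_switch_hook(void)
    {
        if (disk_intr_pending) {
            disk_intr_pending = 0;
            disk_intr_deferred_work();
        }
    }

    int main(void)
    {
        disk_interrupt_handler(0);   /* as if the hardware had interrupted us */
        context_switch_hook();       /* deferred handling happens here */
        return 0;
    }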
Instead, the choice among system calls is made by means of a single entry point.
This is the function that is called when a trap instruction is issued. Inside this
function is a large switch instruction that branches according to the desired
system call. For each system call, the appropriate internal function is called.
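Schematically, the single entry point looks like this; the numbers and names are illustrative, not those of any particular kernel.

    /* Schematic sketch of a system call entry point (numbers are illustrative). */
    #include <stdio.h>

    #define SYS_READ  3
    #define SYS_WRITE 4

    long sys_read(long fd, long buf, long count)  { return printf("read(%ld)\n",  fd); }
    long sys_write(long fd, long buf, long count) { return printf("write(%ld)\n", fd); }

    /* Called when a trap instruction is issued; 'num' selects the system call. */
    long syscall_entry(int num, long a1, long a2, long a3)
    {
        switch (num) {
        case SYS_READ:  return sys_read(a1, a2, a3);
        case SYS_WRITE: return sys_write(a1, a2, a3);
        /* ...one case per system call... */
        default:        return -1;   /* no such system call */
        }
    }

    int main(void)
    {
        syscall_entry(SYS_READ, 0, 0, 100);
        syscall_entry(SYS_WRITE, 1, 0, 100);
        return 0;
    }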
The fact that different entry points handle different events is good — it shows
that the code is partitioned according to its function. However, this does not
guarantee a modular structure as a whole. When handling an event, the
operating system may need to access various data structures. These data
structures are typically global, and are accessed directly by different code
paths. In monolithic systems, the data structures are not encapsulated in
separate modules that are only used via predefined interfaces.
File related tables — the u-area of each process contains the file descriptors
table for that process. Entries in this table point to entries in the open files
table. These, in turn, point to entries in the inode table. The information in
each inode contains the device on which the file is stored.
The resulting problems are code instability and security risks due to the shared global address space: any driver can modify core kernel data, and all code paths need to perform locking correctly (see the famous Linus vs. Tanenbaum debate).
9.2.3 Preemption
Preempting the operating system itself improves responsiveness [2]. This fits with the partitioning of interrupt handling into two parts; the preemptible code must not be waiting for I/O.
9.3 Microkernels
Microkernels take the idea of doing things outside of the kernel to the limit.
Microkernels separate mechanism from policy
The reasons for using microkernels are modularity and reliability. The
microkernel provides the mechanisms to perform various actions, such as
dispatching a process or reading a disk block. However, that is all it does. It
does not decide which process to run — this is the work of an external
scheduling server. It does not create named files out of disk blocks — this is
the work of an external file system server. Thus it is possible to change
policies and services by just replacing the external servers, without changing
the microkernel. In addition, any problem with one of the servers will at
worst crash that server, and not the whole system. And then the server can
simply be restarted.
Microkernels are thus named because they are supposed to be much smaller
than a conventional kernel. This is a natural result of the fact that a lot of
functionality was moved out of the kernel. The main things that are left in the
kernel are basic mechanisms, such as dispatching and low-level communication, as illustrated in the following figure.
[Figure: a microkernel architecture — external management servers run above a microkernel that provides dispatching, communication mechanisms, and other low-level services on top of the hardware.]
The distinction between the microkernel and the servers extends to interrupt
handling. The interrupt handler invoked by the hardware is part of the
microkernel. But this function typically cannot actually handle the interrupt,
as this should be done subject to policy decisions of the external servers.
Thus a microkernel architecture naturally creates the partitioning of interrupt
handlers into two parts that was done artificially in monolithic systems.
Note that this also naturally inserts a potential preemption point in the
handling of interrupts, and in fact of any kernel service. This is important for
improved responsiveness.
The system’s policies and the services it provides are what differentiates it
from other systems. Once these things are handled by an external server, it is
possible to support multiple system personalities at once. For example,
Windows NT can support Windows applications with its Windows server,
OS/2 applications with its OS/2 server, and Unix applications with its Unix
server. Each of these servers supplies unique interfaces and services, which
are implemented on top of the NT microkernel.
It may be surprising at first, but the fact is that operating systems have to be
extended after they are deployed. There is no way to write a “complete”
operating system that can do all that it will ever be asked to do. One reason
for this is that new hardware devices may be introduced after the operating
system is done. As this cannot be anticipated in advance, the operating
system has to be extended.
Exercise 144 But how does code that was compiled before the new device
was added know how to call the routines that handle the new device?
It can also be used to add functionality
Several systems support loadable modules. This means that additional code
can be loaded into a running system, rather than having to recompile the
kernel and reboot the system.
The idea of virtual machines is not new. It originated with IBM’s VM
operating system for its mainframes. In this system, time slicing and
abstractions are completely decoupled: VM actually only does the time
slicing, and creates multiple exact copies of the original physical machine.
Then, a single-user operating system called CMS is executed in each virtual
machine. CMS provides the abstractions of the user environment, such as a
file system.
As each virtual machine is an exact copy of the physical machine, it was also
possible to run another operating system — such as MVS, or even VM itself —
on such a virtual machine. This was useful to
debug new versions of the operating system on a running system. If the new
version is buggy, only its virtual machine will crash, but the parent system will
not. This practice continues today, and VMware has been used as a platform
for allowing students to experiment with operating systems.
To read more: History buffs can read more about MVS in the book by
Johnson [3].
Hypervisors perform some operating system functions
Bibliography
[1] R. H. Campbell, N. Islam, D. Raila, and P. Madany, “Designing and
implementing Choices: An object-oriented system in C++”. Comm. ACM
36(9), pp. 117–126, Sep 1993.
Performance Evaluation
Operating system courses are typically concerned with how to do various
things, such as processor scheduling, memory allocation, etc. Naturally, this
should be coupled with a discussion of how well the different approaches
work.
The most common metrics are related to time. If something takes less time,
this is good for two reasons. From the user’s or client’s perspective, being
done sooner means that you don’t have to wait as long. From the system’s
perspective, being done sooner means that we are now free to deal with other
things.
While being fast is good, it still leaves the issue of units. Should the task of
listing a directory, which takes milliseconds, be on the same scale as the task
of factoring large numbers, which can take years? Thus it is sometimes better
to consider time relative to some yardstick, and use different yardsticks for
different jobs.
A system’s client is typically only concerned with his own work, and wants it
completed as fast as possible. But from the system’s perspective the sum of
all clients is typically more important than any particular one of them.
Making many clients moderately happy may therefore be better than making
one client very happy, at the expense of others.
The metric of “happy clients per second” is called throughput. Formally, this
is the number of jobs done in a unit of time. It should be noted that these can
be various types of jobs, e.g. applications being run, files being transferred,
bank accounts being updated, etc.
Exercise 146 Are response time and throughput simply two facets of the same
thing?
Being busy is good
Being up is good
The above metrics are concerned with the amount of useful work that gets
done. There is also a whole group of metrics related to getting work done at
all. These are metrics that measure system availability and reliability. For
example, we can talk about the mean time between failures (MTBF), or the
fraction of time that the system is down.
A special case in this class of metrics is the supportable load. Every system
becomes overloaded if it is subjected to extreme load conditions. The
question is, first, at what level of load this happens, and second, the degree to
which the system can function under these conditions.
A major goal of the operating system is to run as little as possible, and enable
user jobs to use the computer’s resources. Therefore, when the operating
system does run, it should do so as quickly and unobtrusively as possible. In
other words, the operating system’s overhead should be low. In addition, it
should not hang or cause other inconveniences to the users.
Exercise 147 Which of the following metrics is related to time, throughput,
utilization, or reliability?
• The probability that a workstation is idle and can therefore execute remote jobs
• The response time of a computer program
• The probability that a disk fails
• The bandwidth of a communications network
• The latency of loading a web page
• The number of transactions per second processed by a database system
Exercise 149 You are designing a kiosk that allows people to check their
email. You are told to expect some 300 users a day, each requiring about 2
minutes. Based on this you decide to deploy two terminals (there are 600
minutes in 10 hours, so two terminals provide about double the needed
capacity — a safe margin). Now you are told that the system will be deployed
in a school, for use during recess. Should you change your design?
For operating systems there are few such agreed benchmarks. In addition,
creating a representative workload is more difficult because it has a temporal
component. This means that in a typical system additional work arrives
unexpectedly at different times, and there are also times at which the system
is idle simply because it has nothing to do. Thus an important component of
modeling the workload is modeling the arrival process (as in the bursty
example in Exercise 149). By contradistinction, in computer architecture
studies all the work is available at the outset when the application is loaded,
and the CPU just has to execute all of it as fast as possible.
Consider the distribution of job runtimes, and ask the following question:
given that a job has already run for time t, how much longer do you expect it
to run?
The answer depends on the distribution of runtimes. A trivial case is when all
jobs run for exactly T seconds. In this case, a job that has already run for t
seconds has another T − t seconds to run. In other words, the longer you have
waited already, the less additional time you expect to have to wait for the job
to complete. This is true not only for this trivial case, but for all distributions
that do not have much of a tail.
The boundary of such distributions is the exponential distribution, defined by
the pdf f(x) = λ·e^(−λx). This distribution has the remarkable property of being
memoryless. Its mean is 1/λ, and this is always the value of the time you can
expect to wait. When the job just starts, you can expect to wait 1/λ seconds. 7
seconds later, you can expect to wait an additional 1/λ seconds. And the same
also holds half an hour later, or the next day. In other words, if you focus on
the tail of the distribution, its shape is identical to the shape of the distribution
as a whole.
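In fact, memorylessness can be verified directly from the definition, using the tail of the exponential distribution, Pr[X > x] = e^(−λx):

    Pr[X > s+t | X > s] = Pr[X > s+t] / Pr[X > s] = e^(−λ(s+t)) / e^(−λs) = e^(−λt) = Pr[X > t]

so the distribution of the remaining time, given that you have already waited s seconds, is exactly the original distribution.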
Exercise 150 You are designing a system that handles overload conditions by
actively killing one job, in the hope of reducing the load and thus being able
to better serve the remaining jobs. Which job should you select to kill,
assuming you know the type of distribution that best describes the job sizes?
Second, it is typically very hard to assess whether or not the tail really decays
according to a power law: there are simply not enough observations. And
from a practical point of view, it typically doesn’t really matter. Therefore we
prefer not to get into arguments about formal definitions.
Example distributions
The Pareto distribution is defined on the range x > 1, and has a simple power-
law CDF:
F(x) = 1 − x^(−a)
where a must be positive and is called the shape parameter — the lower a is,
the heavier the tail of the distribution. In fact, the distribution only has a
mean if a > 1 (otherwise the calculation of the mean does not converge), and
only has a variance if a > 2 (ditto). The pdf is
f(x) = a·x^(−(a+1))
This is the simplest example of the group of heavy-tail distributions.
Exercise 151 What is the mean of the Pareto distribution when it does exist?
A fat-tailed (but not heavy-tailed) alternative is the hyper-exponential distribution, a mixture of two exponentials: f(x) = p·λ1·e^(−λ1·x) + (1−p)·λ2·e^(−λ2·x) for some 0 < p < 1. By a judicious choice of λ1, λ2, and p, one can create a
distribution in which the standard deviation is as large as desired relative to
the mean, indicating a fat tail (as opposed to the exponential distribution, in
which the standard deviation is always equal to the mean). However, this is
not a heavy tail.
The following graphs compare the CDFs of the exponential, Pareto, and
hyper-exponential distributions. Of course, these are special cases, as the
exact shape depends on the parameters chosen for each one. The right graph
focuses on the tail.
[Figure: CDFs of the exponential, Pareto, and hyper-exponential distributions for particular parameter choices; the right panel focuses on the tail.]
• The distribution of job runtimes. In this case, most of the jobs are short, but most of the CPU time is spent running long jobs.
• The distribution of file sizes in a file system. Here most of the files are small, but most of the disk space is devoted to storing large files.
• The distribution of flow sizes on the Internet. As with files, most flows are short, but most bytes transferred belong to long flows.
We will consider the details of specific examples when we need them.
Exercise 152 The distribution of traffic flows in the Internet has been
characterized as being composed of “mice and elephants”, indicating many
small flows and few big ones. Is this a good characterization? Hint: think
about the distribution’s modes.
that is, the probability of using each item is inversely proportional to its rank:
if the most popular item is used k times, the second most popular is used k/2
times, the third k/3 times, and so on. Note that we can’t express this in the
conventional way using a probability density function, because ∫ (1/x) dx = ∞! The
only way out is to use a normalizing factor that is proportional to the number
of observations being made. Assuming N observations, the probability of
observing the item with rank x is then

f(x) = 1 / (x ln N)

and the CDF is

F(x) = ln x / ln N
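A small sketch that tabulates these Zipf probabilities for N items and checks that, thanks to the 1/ln N normalization, they sum to approximately 1:

    /* Sketch: Zipf popularity, Pr[rank x] proportional to 1/x, normalized by ln N. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int N = 10000;                       /* number of items / observations */
        double norm = log((double)N);

        double total = 0.0;
        for (int x = 1; x <= N; x++)
            total += 1.0 / (x * norm);

        printf("Pr[rank 1]  = %.4f\n", 1.0 / (1 * norm));
        printf("Pr[rank 2]  = %.4f\n", 1.0 / (2 * norm));
        printf("Pr[rank 10] = %.4f\n", 1.0 / (10 * norm));
        printf("sum over all ranks = %.4f (approximately 1)\n", total);
        return 0;
    }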
Exercise 153 Would you expect CPU usage and I/O activity to be correlated
with each other?
Exercise 154 What is the significance of the daily cycle in each of the
following cases?
1. Access to web pages over the Internet
2. Usage of a word processor on a privately owned personal computer
3. Execution of computationally heavy tasks on a shared system
The major benefit of using mathematical analysis is that it provides the best
insights: the result of analysis is a functional expression that shows how the
performance depends on system parameters. For example, you can use
analysis to answer the question of whether it is better to configure a system
with one fast disk or two slower disks. We will see an example of this below,
in Section 10.5.
Exercise 155 Consider performance metrics like job response time and
network bandwidth. What are system parameters that they may depend on?
The drawback of analysis is that it is hard to do, in the sense that it is not
always possible to arrive at a closed-form solution. Even when it is possible,
various approximations and simplifications may have to be made in the
interest of mathematical tractability. This reduces the confidence in the
relevance of the results, because the simplifications may create unrealistic
scenarios.
Simulation is flexible
The main advantage of simulation is its flexibility: you can make various
modifications to the simulation model, and check their effect easily. In
addition, you can make some parts of the model more detailed than others, if
they are thought to be important. In particular, you do not need to make
simplifying assumptions (although this is sometimes a good idea in order to
get faster results). For example, you can simulate the system with either one
or two disks, at many different levels of detail; this refers both to the disks
themselves (e.g. use an average value for the seek time, or a detailed model
that includes the acceleration and stabilization of the disk head) and to the
workload that they serve (e.g. a simple sequence of requests to access
consecutive blocks, or a sequence of requests recorded on a real system).
The perception of being “the real thing” may also be misguided. It is not easy
to measure the features of a complex system. In many cases, the performance
is a function of many different things, and it is impossible to uncover each
one’s contribution. Moreover, it often happens that the measurement is
affected by factors that were not anticipated in advance, leading to results that
don’t seem to make sense.
Exercise 156 How would you go about measuring something that is very
short, e.g. the overhead of trapping into the operating system?
In the end, simulation is often the only viable alternative
It would be wrong to read the preceding paragraphs as if all three alternatives
have equal standing. The bottom line is that simulation is often used because
it is the only viable alternative. Analysis may be too difficult, or may require
too many simplifying assumptions. Once all the assumptions are listed, the
confidence in the relevance of the results drops to the point that they are not
considered worth the effort. And difficulties arise not only from trying to
make realistic assumptions, but also from size. For example, it may be
possible to analyze a network of half a dozen nodes, but not one with
thousands of nodes.
One way to simplify the model is to use a static workload rather than a
dynamic one. For example, a scheduler can be evaluated by how well it
handles a given job mix, disregarding the changes that occur when additional
jobs arrive or existing ones terminate.
The problem with using static workloads is that this leads to lesser accuracy
and less confidence in the evaluation results. This happens because
incremental work, as in a dynamically evolving real system, modifies the
conditions under which the system operates. Creating static workloads may
miss such effects.
For example, the layout of files on a disk is different if they are stored in a
disk that was initially empty, or in a disk that has been heavily used for some
time. When storing data for the first time in a new file system, blocks will
tend to be allocated consecutively one after the other. Even if many different
mixes of files are considered, they will each lead to consecutive allocation,
because each time the evaluation starts from scratch. But in a real file system
after heavy usage — including many file deletions — the available disk space
will become fragmented. Thus in a live file system there is little chance that
blocks will be allocated consecutively, and evaluations based on allocations
that start from scratch will lead to overly optimistic performance results. The
solution in this case is to develop a model of the steady state load on a disk,
and use this to prime each evaluation rather than starting from scratch [15].
In real life, you often need to wait in queue for a service: there may be people
ahead of you in the post office, the supermarket, the traffic light, etc.
Computer systems are the same. A program might have to wait when another
is running, or using the disk. It also might have to wait for a user to type in
some input. Thus computer systems can be viewed as networks of queues and
servers. Here is an example:
[Figure: a queueing network model of a computer system — a CPU with its queue, two disks (A and B) each with its own queue, and terminals modeled as a delay center.]
In this figure, the CPU is one service station, and there is a queue of jobs
waiting for it. After running for some time, a job may either terminate,
require service from one of two disks (where it might have to wait in queue
again), or require input. Input terminals are modeled as a so called “delay
center”, where you don’t have to wait in a queue (each job has its own user),
but you do need to wait for the service (the user’s input).
Consider an example where each job takes exactly 100 ms (that is, one tenth
of a second). Obviously, if exactly one such job arrives every second then it
will be done in 100 ms, and the CPU will be idle 90% of the time. If jobs
arrive exactly half a second apart, they still will be serviced immediately, and
the CPU will be 80% idle. Even if these jobs arrive each 100 ms they can still
be serviced immediately, and we can achieve 100% utilization.
But if jobs take 100 ms on average, it means that some may be much longer.
And if 5 such jobs arrive each second on average, it means that there will be
seconds when many more than 5 arrive together. If either of these things
happens, jobs will have to await their turn, and this may take a long time. It is
not that the CPU cannot handle the load on average — in fact, it is 50% idle!
The problem is that it cannot handle multiple jobs at once when a burst of
activity happens at random.
Exercise 158 Is it possible that a system will not be able to process all its
workload, even if it is idle much of the time?
We would therefore expect the average response time to rise when the load
increases. But how much?
To read more: The following is only a very brief introduction to the ideas of
queueing theory. A short exposition on queueing analysis was given in early
editions of Stallings [16, appendix A]. A more detailed discussion is given by
Jain [7, Part VI]. Krakowiak [10, Chap. 8] bases the general discussion of
resource allocation on queueing theory. Then there are whole books devoted
to queueing theory and its use in computer performance evaluation, such as
Lazowska et al. [12] and Kleinrock [8, 9].
The main parameters of the model are the arrival rate and the service rate.
The arrival rate, denoted by λ, is the average number of clients (jobs)
arriving per unit of time. For example, λ = 2 means that on average two jobs
arrive every second. It also means that the average interarrival time is half a
second.
The service rate, denoted by µ, is the average number of clients (jobs) that
the server can service per unit of time. For example, µ = 3 means that on
average the server can service 3 jobs per second. It also means that the
average service time is one third of a second.
The number of clients that the server actually serves depends on how many
arrive. If the arrival rate is higher than the service rate, the queue will grow
without a bound, and the system is said to be saturated. A stable system, that
does not saturate, requires λ < µ. The load or utilization of the system is ρ =
λ/µ.
Note that the process by which jobs arrive at the queue and are serviced is a
random process. The above quantities are therefore random variables. What
we want to find out is usually the average values of metrics such as the
response time and number of jobs in the system. We shall denote averages by
a bar above the variable, as in ¯n.
This relationship is very useful, because if we know λ, and can find n̄ from
our analysis, then we can compute r̄ = n̄/λ, the average response time, which is the
metric for performance.
Exercise 159 Can you derive a more formal argument for Little’s Law? Hint:
look at the cumulative time spent in the system by all the jobs that arrive and
are serviced during a long interval T.
The simplest example is the so-called M/M/1 queue. This is a special case of
the arrive-queue-server-done system pictured above. The first M means that
interarrival times are exponentially distributed (the “M” stands for
“memoryless”). The second M means that service times are also
exponentially distributed. The 1 means that there is only one server.
The way to analyze such a queue (and indeed, more complicated queueing
systems as well) is to examine its state space. For an M/M/1 queue, the state
space is very simple. The states are labeled by integers 0, 1, 2, and so on, and
denote the number of jobs currently in the system. An arrival causes a
transition from state i to state i + 1. The average rate at which such transitions
occur is simply λ, the arrival rate of new jobs. A departure (after a job is
serviced) causes a transition from state i to state i− 1. This happens at an
average rate of µ, the server’s service rate.
[Figure: state space of the M/M/1 queue — states 0, 1, 2, 3, . . . with transitions to the right at rate λ (arrivals) and to the left at rate µ (departures).]
The nice thing about this is that it is a Markov chain. The probability of
moving from state i to state i + 1 or i − 1 does not depend on the history of
how you got to state i (in general, it may depend on which state you are in,
namely on i. For the simple case of an M/M/1 queue, it does not even depend
on i).
Given the fact that such long-term probabilities exist, we realize that the flow
between neighboring states must be balanced. In other words, for every
transition from state i to state i + 1, there must be a corresponding transition
back from state i + 1 to state i. But transitions occur at a known rate, that only
depends on the fact that we are in the given state to begin with. This allows
us to write a set of equations that express the balanced flow between
neighboring states. For example, the flow from state 0 to state 1 is
λ·π0

because when we are in state 0 the flow occurs at a rate of λ. Likewise, the
flow back from state 1 to state 0 is

µ·π1

The balanced flow implies that λ·π0 = µ·π1, or that π1 = (λ/µ)·π0.
Now let’s proceed to the next two states. Again, balanced flow implies that
λ·π1 = µ·π2

which allows us to express π2 as

π2 = (λ/µ)·π1
Substituting the expression for π1 we derived above then leads to
1 Actually a Markov chain must satisfy several conditions for such limiting probabilities to exist; for
example, there must be a path from every state to every other state, and there should not be any periodic
cycles. For more on this see any book on probabilistic models, e.g. Ross [14]
π2 = (λ/µ)²·π0

and it is not hard to see that we can go on in this way and derive the general
expression

πi = ρ^i·π0

where ρ = λ/µ.
Exercise 161 What is the meaning of ρ? Hint: it is actually the utilization of
the system. Why is this so?
Given that the probabilities for being in all the states must sum to 1, we have
the additional condition that

Σ_{i=0}^{∞} π0·ρ^i = 1

Taking π0 out of the sum and using the well-known formula for a geometric
sum, Σ_{i=0}^{∞} ρ^i = 1/(1−ρ), this leads to

π0 = 1 − ρ
This even makes sense: the probability of being in state 0, where there are no
jobs in the system, is 1 minus the utilization.
We’re actually nearly done. Given the above, we can find the expected
number of jobs in the system: it is
n̄ = Σ_i i·πi = Σ_i i·(1−ρ)·ρ^i = ρ/(1−ρ)
Finally, we use Little’s Law to find the expected response time. It is
r̄ = n̄/λ = ρ/(λ(1−ρ)) = 1/(µ−λ)
The end result of all this analysis looks like this (by setting µ = 1 and letting λ
range from 0 to 1):
[Figure: average response time of the M/M/1 queue as a function of utilization (µ = 1, λ ranging from 0 to 1).]
For low loads, the response time is good. Above 80% utilization, it becomes
very bad.
Exercise 162 What precisely is the average response time for very low loads?
Does this make sense?
[Figure: a simple closed system — a fixed population of jobs cycles between the queue and the server.]
A more realistic model is an interactive system, where the return path goes
through a user-interaction component; in this case the number of jobs actually
in the system may fluctuate, as different numbers may be waiting for user
interaction at any given instance.
[Figure: a closed interactive system — jobs leaving the queue-and-server subsystem go to a user-interaction (think time) component before returning to the queue.]
Exercise 164 Consider the following scenarios. Which is better modeled as
an open system, and which as a closed system?
As we saw, the performance of an open system like the M/M/1 queue can be
quantified by the functional relationship of the response time on the load. For
closed systems, this is irrelevant. A simple closed system like the one
pictured above operates at full capacity all the time (that is, at the coveted
100% utilization), because whenever a job terminates a new one arrives to
take its place. At the same time, it does not suffer from infinite response
times, because the population of jobs is bounded.
The correct metrics for closed systems are therefore throughput metrics, and
not response time metrics. The relevant question is how many jobs were
completed in a unit of time, or in other words, how many cycles were
completed.
10.6 Simulation Methodology
Analytical models enable the evaluation of simplified mathematical models
of computer systems, typically in steady state, and using restricted workload
models (e.g. exponential distributions). Simulations are not thus restricted —
they can include whatever the modeler wishes to include, at the desired level
of detail. Of course this is also a drawback, as they do not include whatever
the modeler did not consider important.
To read more: Again, we only touch upon this subject here. Standard
simulation methodology is covered in many textbooks, e.g. Jain [7, Part V],
as well as textbooks dedicated to the topic such as Law and Kelton [11]. The
issues discussed here are surveyed in [13].
In order to simulate the system in a steady state, you must first ensure that it
is in a steady state. This is not trivial, as simulations often start from an
empty system. Thus the first part of the simulation is often discarded, so as to
start the actual evaluation only when the system finally reaches the desired
state.
For example, when trying to evaluate the response time of jobs in an open
system, the first jobs find an empty system (no need to wait in queue), and
fully available resources. Later jobs typically need to wait in the queue for
some time, and the resources may be fragmented.
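As a concrete example, here is a minimal discrete-event simulation of an M/M/1 queue that discards the first jobs as warmup and averages the response time of the rest; the result can be checked against the analytical 1/(µ−λ) derived earlier. Parameter values are illustrative.

    /* Minimal M/M/1 simulation sketch: exponential interarrival and service times,
     * FIFO service, warmup jobs discarded.  Compare with the analytic 1/(mu-lambda). */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    double exp_rand(double rate)                 /* exponential random variate */
    {
        return -log(1.0 - drand48()) / rate;
    }

    int main(void)
    {
        double lambda = 0.8, mu = 1.0;           /* utilization 80% */
        long   njobs = 1000000, warmup = 10000;

        srand48(12345);
        double arrival = 0.0, server_free = 0.0, sum_resp = 0.0;

        for (long i = 0; i < njobs; i++) {
            arrival += exp_rand(lambda);                       /* arrival time */
            double start  = arrival > server_free ? arrival : server_free;
            double finish = start + exp_rand(mu);              /* service time */
            server_free = finish;
            if (i >= warmup)                                   /* skip the warmup */
                sum_resp += finish - arrival;
        }

        printf("simulated mean response time: %.3f\n", sum_resp / (njobs - warmup));
        printf("analytic  1/(mu-lambda):      %.3f\n", 1.0 / (mu - lambda));
        return 0;
    }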
The decision regarding when steady state is achieved typically also depends
on the convergence of the measured metrics.
The simulation length is determined by the desired accuracy
Once the steady state is reached, the simulation is continued until the desired
accuracy is reached. As the simulation unfolds, we sample more and more
instances of the performance metrics in which we are interested. For
example, as we simulate more and more jobs, we collect samples of job
response times. The average value of the samples is considered to be an
estimator for the true average value of the metric, as it would be in a real
system. The longer the simulation, the more samples we have, and the more
confidence we have in the accuracy of the results. This derives from the fact
that the size of the confidence interval (the range of values in which we think
the true value resides with a certain degree of confidence) is inversely
proportional to the square root of the number of samples.
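A sketch of the usual calculation, using the normal approximation with a 95% confidence level; the sample values are made up.

    /* Sketch: 95% confidence interval of the mean, which shrinks as 1/sqrt(n). */
    #include <math.h>
    #include <stdio.h>

    void confidence_interval(const double *x, int n, double *mean, double *half_width)
    {
        double sum = 0.0, sumsq = 0.0;
        for (int i = 0; i < n; i++) {
            sum   += x[i];
            sumsq += x[i] * x[i];
        }
        *mean = sum / n;
        double var = (sumsq - n * (*mean) * (*mean)) / (n - 1);   /* sample variance */
        *half_width = 1.96 * sqrt(var / n);                       /* normal approx.  */
    }

    int main(void)
    {
        double samples[] = { 4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.2 };
        double m, hw;
        confidence_interval(samples, 8, &m, &hw);
        printf("mean = %.2f +/- %.2f (95%% confidence)\n", m, hw);
        return 0;
    }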
Exercise 165 What sort of confidence levels and accuracies are desirable in
systems evaluation?
Naturally, one of the most important metrics for a system is the point at
which it makes the transition from functioning to overloaded. This is also
related to the question of the distribution of loads: is this a bimodal
distribution, with most workloads falling in the well-behaved normal mode,
and just some extreme discrete cases creating overload, or is there a
continuum? The answer seems to be a combination of both. The load
experienced by most systems is not steady, but exhibits fluctuations at many
different time scales. The larger fluctuations, corresponding to the more
extreme loads, occur much more rarely than the smaller fluctuations. The
distinction into two modes is not part of the workload, but a feature of the
system: above some threshold it ceases to handle the load, and the dynamics
change. But as far as the workload itself is concerned, the above-threshold
loads are just a more extreme case, which seems to be discrete only because it
represents the tail of the distribution and occurs relatively rarely.
But what if the rare events are actually what we are interested in? For
example, when evaluating communication networks, we are not interested
only in the average latency of messages under normal conditions. We also
want to know what is the probability of a buffer overflow that causes data to
be lost. In a reasonable network this should be a rather rare event, and we
would like to estimate just how rare.
Exercise 166 What are other examples of such rare events that are important
for the evaluation of an operating system?
If, for example, the events in which we are interested occur with a probability
of one in a million (10−6), we will need to simulate billions of events to get a
decent measurement of this effect. And 99.9999% of this work will be wasted
because it is just filler between the events of interest. Clearly this is not an
efficient way to perform such evaluations.
10.7 Summary
Performance is an important consideration for operating systems. It is true
that the main consideration is functionality — that is, that the system will
actually work. But it is also important that it will work efficiently and
quickly.
• In many senses, operating systems are queueing systems, and handle jobs that wait for services. In open systems, where the job population is very big and largely independent of system performance, this means that
1. Randomness in arrival and service times causes the average response time
to tend to infinity as the load approaches each sub-system’s capacity.
2. This is a very general result and can be derived from basic principles.
3. You cannot achieve 100% utilization, and in fact might be limited to much
less.
In closed systems, on the other hand, there is a strong feedback from the
system’s performance to the creation of additional work. Therefore the
average response time only grows linearly with the population size.
Disclaimer
Bibliography
[1] R. J. Adler, R. E. Feldman, and M. S. Taqqu (eds.), A Practical Guide to
Heavy Tails. Birkhäuser, 1998.
Appendix D
Self-Similar Workloads
Traditionally workload models have used exponential and/or normal
distributions, mainly because of their convenient mathematical properties.
Recently there is mounting evidence that fractal models displaying self-
similarity provide more faithful representations of reality.
D.1 Fractals
Fractals are geometric objects which have the following two (related)
properties: they are self-similar, and they do not have a characteristic scale.
The fractal dimension is based on self-similarity
Being self similar means that parts of the object are similar (or, for pure
mathematical objects, identical) to the whole. If we enlarge the whole object
we end up with several copies of the original. That is also why there is no
characteristic scale — it looks the same at every scale.
The reason they are called fractals is that they can be defined to have a
fractional dimension. This means that these objects fill space in a way that is
different from what we are used to. To explain this, we first need to define
what we mean by “dimension”.
Consider a straight line segment: if we double it, we get two copies of the original. Doubling the sides of a square yields four copies of the original square, and doubling the sides of a cube yields eight copies of the cube.
Let’s denote the factor by which we increase the size of the line segments by
f, and the number of copies of the original that we get by n. The above three
examples motivate us to define dimensionality as
d = log_f n
With this definition, the line segment is one-dimensional, the square is 2D,
and the cube 3D.
Now apply the same definition to the endlessly recursive Sierpinski triangle.
Doubling each line segment creates a larger triangle which contains 3
copies of the original. Using our new definition, its dimension is therefore
log_2 3 ≈ 1.585. It is a fractal.
More formally, consider a time series x1, x2, x3, . . . , xn, where xi can be the number
of packets transferred on a network in a second, the number of files opened in
an hour, etc. Now create a series of disjoint sums. For example, you can sum
every 3 consecutive elements from the original series: x^(3)_1 = x1 + x2 + x3,
x^(3)_2 = x4 + x5 + x6, and so on. This can
be done several times, each time summing consecutive elements from the
previous series. For example, the third series will start with
x^(9)_1 = x^(3)_1 + x^(3)_2 + x^(3)_3 = x1 + · · · + x9. If
all these series look the same, we say that the original series is self similar.
Let’s assume the steps are independent of each other, and denote the drunk’s
location after i steps by xi. The relationship between xi+1 and xi is
x_{i+1} = { x_i + 1   with probability 0.5
          { x_i − 1   with probability 0.5

so on average the two options cancel out. But if we look at x_i², we get

x_{i+1}² = { (x_i + 1)² = x_i² + 2·x_i + 1   with probability 0.5
           { (x_i − 1)² = x_i² − 2·x_i + 1   with probability 0.5
Now consider a drunk with inertia. Such a drunk tends to lurch several steps
in the same direction before switching to the other direction. The steps are no
longer independent — in fact, each step has an effect on following steps,
which tend to be in the same direction. Overall, the probabilities of taking
steps in the two directions are still the same, but these steps come in bunches.
x_n = c·n^H
where H, the Hurst parameter, satisfies 0.5 < H < 1.
In anti-persistent processes the drunk backtracks all the time, and makes less
progress than expected. In such cases the Hurst parameter satisfies 0 < H <
0.5.
And it is easy to measure experimentally
A nice thing about this formulation is that it is easy to verify. Taking the log
of both sides of the equation x_n = c·n^H leads to

log x_n = log c + H·log n
We can take our data and find the average range it covers as a function of n.
Then we plot it in log-log scales. If we get a straight line, our data is subject
to the Hurst effect, and the slope of the line gives us the Hurst parameter H.
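The fit itself is an ordinary least-squares regression in log-log scale. A small sketch, with made-up data points for the range covered as a function of n:

    /* Sketch: estimate the Hurst parameter as the slope of a least-squares fit
     * of log(range) against log(n).  The data points here are made up. */
    #include <math.h>
    #include <stdio.h>

    double fit_slope(const double *x, const double *y, int k)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < k; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        return (k * sxy - sx * sy) / (k * sxx - sx * sx);
    }

    int main(void)
    {
        /* n and the measured range covered after n steps (illustrative values) */
        double n[]     = { 10,  100,   1000,  10000 };
        double range[] = { 5.0, 31.0, 200.0, 1250.0 };

        double lx[4], ly[4];
        for (int i = 0; i < 4; i++) {
            lx[i] = log(n[i]);
            ly[i] = log(range[i]);
        }
        printf("estimated H = %.2f\n", fit_slope(lx, ly, 4));
        return 0;
    }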
As it turns out, the Hurst effect is very common. Hurst himself showed this
for various natural phenomena, including annual river flows, tree ring widths,
and sunspot counts. It has also been shown for various aspects of computer
workloads, ranging from network traffic to file server activity. Common
values are in the range 0.7 < H < 0.9, indicating persistent processes with
high burstiness and self similarity.
Bibliography
[1] S. D. Gribble, G. S. Manku, D. Roselli, E. A. Brewer, T. J. Gibson, and E. L. Miller, “Self-similarity in file systems”. In SIGMETRICS Conf. Measurement & Modeling of Comput. Syst., pp. 141–150, Jun 1998.
[3] E. E. Peters, Fractal Market Analysis. John Wiley & Sons, 1994.
[4] D. Thiébaut, “On the fractal dimension of computer programs and its application to the prediction of the cache miss ratio”. IEEE Trans. Comput. 38(7), pp. 1012–1026, Jul 1989.
Chapter 11
Technicalities
The previous chapters covered the classical operating system curriculum,
which emphasizes abstractions, algorithms, and resource management.
However, this is not enough to create a working system. This chapter
covers some more technical issues and explains some additional aspects of
how a computer system works. Most of it is Unix-specific.
11.1 Booting the System
The basic idea is to use amplification: write some very short program that can
only do one thing: read a larger program off the disk and start it. This can be
very specific in the sense that it expects this program to be located at a
predefined location on the disk, typically the first sector, and if it is not there
it will fail. By running this initial short program, you end up with a running
longer program.
In the 1960s the mechanism for getting the original short program into memory was to key it in manually using switches on the front panel of the computer: you literally set the bits of the opcode and arguments for successive instructions (on the PDP-8, for example, these were a row of switches at the bottom of the front panel).
Contemporary computers have a non-volatile read-only memory (ROM) that
contains the required initial short program. This is activated automatically
upon power up.
The initial short program typically reads the boot sector from the disk. The
boot sector contains a larger program that loads and starts the operating
system. To do so, it has to have some minimal understanding of the file
system structure.
Exercise 168 How is address translation set up for the booting programs?
Exercise 169 How can one set up a computer such that during the boot
process it will give you a choice of operating systems you may boot?
One of the first things the operating system does when it starts running is to load the interrupt vector. This is a predefined location in memory where all the entry points to the operating system are stored. It is indexed by interrupt number: the address of the handler for interrupt i is stored in entry i of the vector. When a certain interrupt occurs, the hardware uses the interrupt number as an index into the interrupt vector, and loads the value found there into the PC, thus jumping to the appropriate handler (of course, it also sets the execution mode to kernel mode).
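Conceptually, the interrupt vector is simply an array of function pointers indexed by the interrupt number. The following C sketch uses invented handler names and entry numbers; the actual vector format and the dispatching are defined by the hardware:

    #define NUM_INTERRUPTS 256

    typedef void (*handler_t)(void);

    /* hypothetical handlers -- names are for illustration only */
    void clock_handler(void)      { /* bookkeeping, scheduling */ }
    void disk_handler(void)       { /* wake up the waiting process */ }
    void page_fault_handler(void) { /* bring in the page, or kill the process */ }

    handler_t interrupt_vector[NUM_INTERRUPTS];

    void setup_interrupt_vector(void)
    {
        /* entry i holds the address of the handler for interrupt i */
        interrupt_vector[0]  = clock_handler;
        interrupt_vector[1]  = disk_handler;
        interrupt_vector[14] = page_fault_handler;
    }

    /* what the hardware effectively does when interrupt i occurs
       (after switching to kernel mode) */
    void dispatch(int i)
    {
        if (interrupt_vector[i] != 0)
            interrupt_vector[i]();
    }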
[Table: common interrupts and traps (e.g. page fault, clock interrupt, disk interrupt, terminal interrupt, software trap) and the typical actions taken by their handlers, ranging from servicing the event through killing the offending process to an emergency shutdown.]
After setting up the interrupt vector, the system creates the first process
environment. This process then forks additional system processes and the init
process. The init process forks login processes for the various terminals
connected to the system (or the virtual terminals, if connections are through a
network).
11.2 Timers
Timers are needed to provide time-related services. For example, video and audio playback needs to occur at a given rate in real time; old games, which relied on processor speed rather than on real time, would run faster when the PC was upgraded. The basic interface is to ask for a signal after a certain delay. This is a best-effort service, and the requested time may be missed.
Using a periodic clock sets the resolution of such timers. This can be improved by soft timers [1] or one-shot timers [5]; the dependence on the clock resolution is studied in [3]. Vertigo [4] reduces the hardware clock rate when it is not needed, in order to save power.
A related question is that of periodic timers and the need to keep control. If only one application is running, it can simply be allowed to run: control will return to the system if something happens (e.g. a network or terminal interrupt).
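At the user level, Unix provides this interface through the real alarm and signal calls; a minimal example that requests a signal after a 5-second delay:

    #include <signal.h>
    #include <unistd.h>

    static void on_alarm(int sig)
    {
        /* called (asynchronously) when the requested delay has passed */
        (void)sig;
        write(1, "timer expired\n", 14);
    }

    int main(void)
    {
        signal(SIGALRM, on_alarm);
        alarm(5);      /* ask for a SIGALRM in about 5 seconds */
        pause();       /* wait until a signal arrives */
        return 0;
    }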
When you sit at a terminal connected to a Unix system, the first program you
encounter is the login program. This program runs under the root user ID (the
superuser), as it needs to access system information regarding login
passwords. And in any case, it doesn’t know yet which user is going to log in.
The login program prompts you for your login and password, and verifies
that it is correct. After verifying your identity, the login program changes the
user ID associated with the process to your user ID. It then execs a shell.
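A highly simplified sketch of this sequence in C (the function name is invented, password verification and most error handling are omitted, and the real login program does considerably more):

    #include <unistd.h>
    #include <pwd.h>
    #include <stdlib.h>

    /* sketch of what login does after the password has been verified */
    void start_session(const char *username)
    {
        struct passwd *pw = getpwnam(username);   /* look up the user */
        if (pw == NULL)
            exit(1);

        /* drop root privileges: from now on the process runs as the user */
        setgid(pw->pw_gid);
        setuid(pw->pw_uid);

        chdir(pw->pw_dir);                        /* start in the home directory */

        /* replace the login program by the user's shell, in the same process */
        execl(pw->pw_shell, pw->pw_shell, (char *)NULL);
        exit(1);                                  /* only reached if exec failed */
    }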
A trick
The login program need not necessarily exec the shell. For example, new
student registration may be achieved by creating a login called “register”, that
does not require a password, and execs a registration program rather than a
shell. New students can then initially log in as “register”, and provide their
details to the system. When the registration program terminates they are
automatically logged off.
A shell is the Unix name for a command interpreter. Note that the shell runs
in the context of the same process as the one that previously ran the login
program, and therefore has the same user ID (and resulting privileges). The
shell accepts your commands and executes them. Some are builtin commands
of the shell, and others are separate programs. Separate programs are
executed by the fork/exec sequence described below.
The shell can also string a number of processes together using pipes, or run
processes in the background. This simply means that the shell does not wait
for their termination, but rather immediately provides a new prompt.
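The core of a shell is a loop around the fork/exec sequence; the following minimal sketch supports only argument-less commands, with no pipes and no built-in commands:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        char line[256];
        char *argv[2];

        for (;;) {
            printf("$ ");
            fflush(stdout);
            if (fgets(line, sizeof(line), stdin) == NULL)
                break;                         /* end of input: exit the shell */
            line[strcspn(line, "\n")] = '\0';  /* strip the newline */
            if (line[0] == '\0')
                continue;

            argv[0] = line;                    /* command name only, no arguments */
            argv[1] = NULL;

            pid_t pid = fork();
            if (pid == 0) {                    /* child: become the command */
                execvp(argv[0], argv);
                perror("exec");
                _exit(1);
            }
            if (pid > 0)
                waitpid(pid, NULL, 0);         /* parent: wait, then prompt again */
        }
        return 0;
    }

A real shell would also parse arguments, string commands together with pipes, and skip the waitpid for commands run in the background.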
Exercise 171 Script files often start with a line of the form #!/bin/interpreter. What is the magic number in this case? Is it relevant?
exec creates a text segment and initializes it using the text section of the
executable. It creates a data segment and initializes it using the data section.
However, the created segment in this case is typically larger than the
corresponding section, to account for uninitialized data. Stack and heap
segments are also created, which have no corresponding section in the
executable.
A special case occurs when shared libraries are used (shared libraries are
called dynamic link libraries (DLLs) in Windows). Such libraries are not
incorporated into the text segment by the compiler (or rather, the linker).
Instead, an indication that they are needed is recorded. When the text segment
is constructed, the library is included on the fly. This enables a single copy of
the library code to serve multiple applications, and also reduces executable
file sizes.
11.6 Context Switching
To switch from one process to another, the operating system has to store the
hardware context of the current process, and restore that of the new process.
Recall that system calls (and, for that matter, interrupt handlers) are
individual entry points to the kernel. But in what context do these routines
run?
The kernel can access everything
To answer this, we must first consider what data structures are required.
These can be divided into three groups:
• Global kernel data structures, such as the process table, tables needed for memory allocation, tables describing the file system, and so forth.
• A stack for use when kernel routines call each other. Note that the kernel as a whole is re-entrant: a process may issue a system call and block in the middle of it (e.g. waiting for the completion of a disk operation), and then another process may also issue a system call, even the same one. Therefore a separate stack is needed for kernel activity on behalf of each process.
[Figure: the address space seen by a process. The kernel text, global kernel data, per-process kernel stack, and u-area are accessible only by the kernel; the user text, data, heap, and stack are accessible by the user program.]
But different mappings are used
To enable the above distinction, separate address mapping data can be used.
In particular, the kernel address space is mapped using distinct page tables. In
total, the following tables are needed.
• Tables for the kernel text and global data. These are never changed, as there is a single operating system kernel. They are marked as being usable only in kernel mode, to prevent user code from accessing this data.
• Tables for per-process kernel data. While these are also flagged as being accessible only in kernel mode, they are switched on every context switch. Thus the currently installed tables at each instant reflect the data of the current process. Data for other processes can be found by following pointers in kernel data structures, but it cannot be accessed directly using the data structures that identify the “current process”.
• Tables for the user address space. These are the segment and page tables discussed in Chapter 4.
Access to data in user space requires special handling. Consider a system call
like read, that specifies a user-space buffer in which data should be deposited.
This is a pointer that contains a virtual address. The problem is that the same
memory may be mapped to a different virtual address when accessed from
the kernel. Thus accessing user memory cannot be done directly, but must
first undergo an address adjustment.
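Schematically, the kernel therefore goes through a helper routine that translates and validates the user address before copying. The names here (user_to_kernel, copy_to_user_buf) are invented for illustration and do not correspond to any particular kernel's API:

    #include <stddef.h>
    #include <string.h>

    /* stub for illustration: a real kernel would walk the current process's
       page tables here, validate the range, and return a kernel-usable address */
    static void *user_to_kernel(const void *user_addr, size_t len)
    {
        (void)len;
        return (void *)user_addr;
    }

    /* deposit 'len' bytes of data into the user-supplied buffer */
    int copy_to_user_buf(void *user_buf, const void *kernel_data, size_t len)
    {
        void *dst = user_to_kernel(user_buf, len);
        if (dst == NULL)
            return -1;            /* bad user pointer: fail the system call */
        memcpy(dst, kernel_data, len);
        return 0;
    }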
Exercise 172 Does the Unix scheme impose any restrictions on the interrupt
handlers?
Recall that a process actually has two stacks: one is the regular stack that
resides in the stack segment and is used by the user-level program, and the
other is the kernel stack that is part of the per-process kernel data, together
with the u-area. Making a system call creates frames on both stacks, and uses
the u-area as a staging area to pass data from user mode to kernel mode and
back.
The information regarding what system call is being requested, and what
arguments are provided, is available to the user-level library function. The
question is how to pass this information to the kernel-mode system call
function. The problem is that the user-level function cannot access any of the
kernel memory that will be used by the kernel-mode function. The simplest
solution is therefore to use an agreed register to store the numerical code of
the requested system call. After storing this code, the library function issues
the trap instruction, inducing a transition to kernel mode.
The trap instruction creates a call frame on the kernel stack, just like the function-call instruction which creates a call frame on the user stack. It also loads the PC with the address of the system’s entry point for system calls. When this function starts to run, it retrieves the system call code from the agreed register. Based on this, it knows how many arguments to expect. These arguments are then copied from the last call frame on the user stack (which can be identified based on the saved SP register value) to a designated place in the u-area. Once the arguments are in place, the actual function that implements the system call is called. All these functions retrieve their arguments from the u-area in the same way.
Exercise 173 Why do the extra copy to the u-area? Each system call function
can get the arguments directly from the user stack!
When the system call function completes its task, it propagates its return
value in the same way, by placing it in a designated place in the u-area. It
then returns to its caller, which is the entry point of all system calls. This
function copies the return value to an agreed register, where it will be found
by the user-level library function.
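Schematically, the kernel's common entry point might behave like the following sketch; all the names and data structures here are invented for illustration, and real kernels differ in many details:

    #define NR_SYSCALLS 128
    #define MAX_ARGS      6

    /* each system call implementation takes its arguments from the u-area */
    typedef long (*syscall_fn)(void);

    struct uarea {
        long args[MAX_ARGS];      /* staging area for arguments */
        long retval;              /* staging area for the return value */
    };

    static struct uarea the_uarea;                    /* of the current process */
    static struct uarea *current_uarea = &the_uarea;
    static syscall_fn syscall_table[NR_SYSCALLS];     /* filled in during boot */
    static int syscall_nargs[NR_SYSCALLS];            /* arguments per call */

    /* called from the trap handler with the code found in the agreed register
       and a pointer to the user stack frame holding the arguments */
    long syscall_entry(int code, const long *user_args)
    {
        int i;
        if (code < 0 || code >= NR_SYSCALLS || syscall_table[code] == NULL)
            return -1;
        for (i = 0; i < syscall_nargs[code]; i++)
            current_uarea->args[i] = user_args[i];    /* copy args to the u-area */
        current_uarea->retval = syscall_table[code]();
        return current_uarea->retval;                 /* later copied to a register */
    }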
11.8 Error Handling
The previous sections of this chapter were all about how basic operations get
done. This one is about what to do when things go wrong.
It is desirable to give the application a chance to recover
The sad fact is that programs often ask the system to do something it cannot
do. Sometimes it is the program’s fault. For example, a program should
verify that a value is not zero before dividing by this value. But what should
the system do when the program does not check, and the hardware flags an
exception? Other times it is not the program’s fault. For example, when one
process tries to communicate with another, it cannot be held responsible for
misbehavior of the other process. After all, if it knew everything about the
other process, it wouldn’t have to communicate in the first place.
In either case, the simple way out is to kill the process that cannot proceed.
However, this is a rather extreme measure. A better solution would be to punt
the problem back to the application, in case it has the capability to recover.
Exercise 174 What are possible reasons for failure of the fork system call? How about write? And close?
When a system call cannot perform the requested action, it is said to fail. This
is typically indicated to the calling program by means of the system call’s
return value. For example, in Unix most system calls return -1 upon failure,
and 0 or some nonnegative number upon successful completion. It is up to
the program to check the return value and act accordingly. If it does not, its
future actions will probably run into trouble, because they are based on the
unfounded assumption that the system call did what it was asked to do.
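For example, a C program on Unix should check each return value, and can use perror (which is based on errno) to report what went wrong:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.txt", O_RDONLY);   /* the file name is just an example */
        if (fd == -1) {
            perror("open");      /* prints a message based on errno */
            return 1;            /* recover or give up -- but do not ignore it */
        }
        /* ... use the file ... */
        close(fd);
        return 0;
    }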
The more difficult case is problems that occur when the application code is
running, e.g. division by zero or issuing of an illegal instruction. In this case
the operating system is notified of the problem, but there is no obvious
mechanism to convey the information about the error condition to the
application.
Exercise 175 Where should the information regarding the delivery of signals
to an application be stored?
Once the signalling mechanism exists, it can also be used for other asynchronous events, not only for hardware exceptions — for example, the expiration of a timer or the user hitting the interrupt key at the terminal.
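For illustration, here is how a process can install a handler for SIGFPE, the signal delivered upon arithmetic exceptions, using the standard sigaction interface (a minimal sketch):

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void on_fpe(int sig)
    {
        /* give the application a chance to react instead of just dying */
        const char msg[] = "arithmetic exception caught\n";
        (void)sig;
        write(2, msg, sizeof(msg) - 1);
        _exit(1);
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_fpe;
        sigaction(SIGFPE, &sa, NULL);   /* install the handler */

        /* ... application code; a division by zero would now invoke on_fpe ... */
        return 0;
    }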
Exercise 176 How would you implement a program that has several different instances of catching the same type of exception?
Bibliography
[1] M. Aron and P. Druschel, “Soft timers: efficient microsecond software timer support for network processing”. ACM Trans. Comput. Syst. 18(3), pp. 197–228, Aug 2000.
• Send and receive e-mail.
• Finger a user on a remote machine.
• “Talk” to a user on another machine interactively.
(Incidentally, these services are handled by daemons.)
This part of the notes explains how communication is performed, and how distributed systems are constructed.
Chapter 12
Interprocess Communication
Recall that an important function of operating systems is to provide
abstractions and services to applications. One such service is to support
communication among processes, in order to enable the construction of
concurrent or distributed applications. A special case is client-server
applications, which allow client applications to interact with server
applications using well-defined interfaces.
12.1 Naming
In order to communicate, processes need to know about each other
We get names when we are born, and exchange them when we meet other
people. What about processes?
In computer systems such as Unix, if one process forks another, the child
process may inherit various stuff from its parent. For example, various
communication mechanisms may be established by the parent process before
the fork, and are thereafter accessible also by the child process. One such
mechanism is pipes, as described below on page 230 and in Appendix E.
Exercise 177 Can a process obtain the identity (that is, process ID) of its
family members? Does it help for communication?
Predefined names can be adopted
Another simple approach is to use predefined and agreed names for certain
services. In this case the name is known in advance, and represents a service,
not a specific process. For example, when you call emergency services by
phone you don’t care who specifically will answer, as long as he can help
handle the emergency. Likewise, the process that should implement a service
adopts the agreed name as part of its initialization.
For example, this approach is the basis for the world-wide web. The service
provided by web servers is actually a service of sending the pages requested
by clients (the web browsers). This service is identified by the port number
80 — in essence, the port number serves as a name. This means that a browser that wants a web page from the server www.abc.com just sends a
request to port 80 at that address. The process that implements the service on
that host listens for incoming requests on that port, and serves them when
they arrive. This is described in more detail below.
Exercise 179 And how do we create the initial contact with the name service?
A sticky situation develops if more than one set of processes tries to use the
same string to identify themselves to each other. It is easy for the first process
to figure out that the desired string is already in use by someone else, and
therefore another string should be chosen in its place. Of course, it then has to
tell its colleagues about the new string, so that they know what to request
from the name server. But it can’t contact its colleagues, because the whole
point of having a string was to establish the initial contact...
Recall that a major part of the state of a process is its memory. If this is
shared among a number of processes, they actually operate on the same data,
and thereby communicate with each other.
Rather than sharing the whole memory, it is possible to only share selected
regions. For example, the Unix shared memory system calls include
provisions for
• Registering a name for a shared memory region of a certain size.
• Mapping a named region of shared memory into the address space.
The system call that maps the region returns a pointer to it, that can then be
used to access it. Note that it may be mapped at different addresses in
different processes. To read more: See the man pages for shmget and shmat.
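A minimal sketch of using these calls (the key 1234 and the size of 4096 bytes are arbitrary examples, and error handling is omitted):

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <string.h>

    int main(void)
    {
        /* register (or find) a shared memory region of 4096 bytes,
           identified by the agreed key 1234 */
        int id = shmget(1234, 4096, IPC_CREAT | 0600);

        /* map it into this process's address space; another process doing
           the same shmget/shmat may get a different address for it */
        char *mem = shmat(id, NULL, 0);

        strcpy(mem, "hello");    /* now visible to the other process */

        shmdt(mem);              /* unmap when done */
        return 0;
    }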
Exercise 180 How would you implement such areas of shared memory? Hint:
think about integration with the structures used to implement virtual memory.
In some systems, it may also be possible to inherit memory across a fork.
Such memory regions become shared between the parent and child processes.
Distributed shared memory may span multiple machines
While the basic idea behind the implementation of DSM is simple, getting it
to perform well is not. If only one copy of each page is kept in the system, it
will have to move back and forth between the machines running processes
that access it. This will happen even if they actually access different data
structures that happen to be mapped to the same page, a situation known as
false sharing. Various mechanisms have been devised to reduce such harmful
effects, including
• Allowing multiple copies of pages that are only being read.
• Basing the system on objects or data structures rather than on pages.
• Partitioning pages into sub-pages and moving them independently of each other.
Exercise 181 Is there a way to support multiple copies of pages that are also
written, that will work correctly when the only problem is actually false
sharing?
To read more: A thorough discussion of DSM is provided by Tanenbaum [8],
chapter 6. An interesting advanced system which solves the granularity
problem is MilliPage [3].
Sharing memory leads to concurrent programming
The problem with shared memory is that its use requires synchronization
mechanisms, just like the shared kernel data structures we discussed in
Section 3.1. However, this time it is the user’s problem. The operating system
only has to provide the means by which the user will be able to coordinate the
different processes. Many systems therefore provide semaphores as an added
service to user processes.
Exercise 183 Is it possible to use files for shared memory without suffering a disk-related performance penalty?
12.2.2 Remote Procedure Call
At the opposite extreme from shared memory is remote procedure call. This
is the most structured approach to communications.
A natural extension to procedure calls is to call remote procedures
[Figure: remote procedure call. The caller invokes a local stub that has the same interface as the remote procedure; the stub ships the arguments to a matching stub on the remote machine, which calls the procedure locally and ships the return value back.]
The other stub mimics the caller, and calls the desired procedure locally with
the specified arguments. When the procedure returns, the return values are
shipped back and handed over to the calling process.
RPC is a natural extension of the procedure call interface, and has the
advantage of allowing the programmer to concentrate on the logical structure
of the program, while disregarding communication issues. The stub functions
are typically provided by a library, which hides the actual implementation.
For example, consider an ATM used to dispense cash at a mall. When a user
requests cash, the business logic implies calling a function that verifies that
the account balance is sufficient and then updates it according to the
withdrawal amount. But such a function can only run on the bank’s central
computer, which has direct access to the database that contains the relevant
account data. Using RPC, the ATM software can be written as if it also ran
directly on the bank’s central computer. The technical issues involved in
actually doing the required communication are encapsulated in the stub
functions.
Messages are chunks of data
On the other hand, message passing retains the partitioning of the data into “chunks” — the messages. There are two main operations on messages: sending a message and receiving a message.
One of the reasons for the popularity of streams is that their use is so similar to the use of sequential files:
• You can write to them, and the data gets accumulated in the order you wrote it.
• You can read from them, and always get the next data element after where you dropped off last time.
It is therefore possible to use the same interfaces; in Unix this is file descriptors and the read and write system calls.
Exercise 187 Make a list of all the attributes of files. Which would you expect to describe streams as well?
Pipes are FIFO files with special semantics
Once the objective is defined as inter-process communication, special types
of files can be created. A good example is Unix pipes.
A pipe is a special file with FIFO semantics: it can be written sequentially (by
one process) and read sequentially (by another). The data is buffered, but is
not stored on disk. Also, there is no option to seek to the middle, as is
possible with regular files. In addition, the operating system provides some
special related services, such as
• If the process writing to the pipe is much faster than the process reading from it, data will accumulate unnecessarily. If this happens, the operating system blocks the writing process to slow it down.
• If a process tries to read from an empty pipe, it is blocked rather than getting an EOF.
• If a process tries to write to a pipe that no process can read (because the read end was closed), the process gets a signal. On the other hand, when a process tries to read from a pipe that no process can write, it gets an EOF.
Exercise 188 How can these features be implemented efficiently?
The problem with pipes is that they are unnamed: they are created by the pipe
system call, and can then be shared with other processes created by a fork.
They cannot be shared by unrelated processes. This gap is filled by FIFOs,
which are essentially named pipes. This is a special type of file, and appears
in the file name space.
To read more: See the man page for pipe. The mechanics of stringing processes together are described in detail in Appendix E. Named pipes are created by the mknod system call, or by the mkfifo shell utility.
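As a minimal illustration, the following program creates a pipe, forks, and has the parent write a message that the child reads (error handling omitted):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fds[2];
        pipe(fds);                /* fds[0] is the read end, fds[1] the write end */

        if (fork() == 0) {        /* child: reader */
            char buf[64];
            close(fds[1]);        /* not writing */
            int n = read(fds[0], buf, sizeof(buf) - 1);
            buf[n > 0 ? n : 0] = '\0';
            printf("child got: %s\n", buf);
            _exit(0);
        }

        close(fds[0]);            /* parent: writer; not reading */
        write(fds[1], "hello through the pipe", 22);
        close(fds[1]);            /* reader will see EOF after this */
        wait(NULL);
        return 0;
    }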
In the original Unix system, pipes were implemented using the file system
infrastructure, and in particular, inodes. In modern systems they are
implemented as a pair of sockets, which are obtained using the socketpair
system call.
Sockets are the most widely used mechanism for communication over the
Internet, and are covered in the next section.
Concurrency and asynchrony make things hard to anticipate
The select system call is designed to help with such situations. This system
call receives a set of file descriptors as an argument. It then blocks the calling
process until any of the sockets represented by these file descriptors has data
that can be read. Alternatively, a timeout may be set; if no data arrives by the
timeout, the select will return with a failed status.
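A sketch of its use, assuming two already-open file descriptors fd1 and fd2 (e.g. sockets) and an arbitrary 5-second timeout:

    #include <sys/select.h>
    #include <sys/time.h>

    /* returns 1 if fd1 is readable, 2 if fd2 is readable, 0 on timeout */
    int wait_for_data(int fd1, int fd2)
    {
        fd_set readable;
        struct timeval timeout;

        FD_ZERO(&readable);
        FD_SET(fd1, &readable);
        FD_SET(fd2, &readable);

        timeout.tv_sec = 5;            /* give up after 5 seconds */
        timeout.tv_usec = 0;

        int maxfd = (fd1 > fd2 ? fd1 : fd2);
        if (select(maxfd + 1, &readable, NULL, NULL, &timeout) <= 0)
            return 0;                  /* timeout (or error) */

        if (FD_ISSET(fd1, &readable))
            return 1;
        return 2;
    }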
Peer-to-peer systems, such as those used for file sharing, are symmetrical in a
different sense: in such systems, all nodes act both as clients and as servers.
Therefore some of the differences between clients and servers described
below do not hold.
The implications of this situation are twofold. First, the server cannot
anticipate which clients will contact it and when. As a consequence, it is
futile for the server to try and establish contact with clients; rather, it is up to
the clients to contact the server. The server just has to provide a means for
clients to find it, be it by registering in a name service or by listening on a
socket bound to a well-known port.
Unix daemons are server processes that operate in the background. They are
used to provide various system services that do not need to be in the kernel,
e.g. support for email, file spooling, performing commands at pre-defined
times, etc. In particular, daemons are used for various services that allow
systems to inter-operate across a network. In order to work, the systems have
to be related (e.g. they can be different versions of Unix). The daemons only
provide a weak coupling between the systems.
fd=socket() First, it creates a socket. This means that the operating system
allocates a data structure to keep all the information about this
communication channel, and gives the process a file descriptor to serve as a
handle to it.
bind(fd, port) The server then binds this socket (as identified by the file
descriptor) to a port number. In effect, this gives the socket a name that can
be used by clients: the machine’s IP (internet) address together with the port
are used to identify this socket (you can think of the IP address as a street
address, and the port number as a door or suite number at that address).
Common services have predefined port numbers that are well-known to all
(to be described in Section 12.3.1). For other distributed applications, the port
number is typically selected by the programmer.
listen(fd) To complete the setup, the server then listens to this socket. This
notifies the system that communication requests are expected.
The other process, namely the client, does the following.
fd=socket() First, it also creates a socket.
connect(fd, addr, port) It then connects this socket to the server’s socket, by
giving the server’s IP address and port. This means that the server’s address
and port are listed in the local socket data structure, and that a message
regarding the communication request is sent to the server. The system on the
server side finds the server’s socket by searching according to the port
number.
[Figure: the client and server processes, each holding a socket (fd=3). The server's socket is bound to a well-known port; the client's socket is assigned an arbitrary port automatically.]
To actually establish a connection, the server has to take an additional step:
newfd=accept(fd) The server accepts an incoming connection request, obtaining a new socket (with a new file descriptor) dedicated to this connection.
The reason for using accept is that the server must be ready to accept additional requests from clients at any moment. Before calling accept, the incoming request from the client ties up the socket bound to the well-known port, which is the server’s advertised entry point. By calling accept, the server re-routes the incoming connection to a new socket represented by another file descriptor. This leaves the original socket free to receive additional clients, while the current one is being handled. Moreover, if multiple clients arrive, each will have a separate socket, allowing for unambiguous communication with each one of them. The server will often also create a new thread to handle the interaction with each client, to further encapsulate it.
[Figure: a server with clients on hosts A and B. Each accepted connection gets its own socket (fd=4, fd=5) and possibly its own thread on the server, while the original socket (fd=3) stays bound to the well-known port; connections are distinguished by the clients' addresses and source ports.]
Note that the distinction among connections is done by the IP addresses and port numbers of the two endpoints. All the different sockets created by accept share the same port on the server side. But they have different clients, and this is indicated in each incoming communication. Communications coming from an unknown source are routed to the original socket.
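Putting these calls together, a skeleton TCP server in C might look as follows (IPv4, an arbitrary port 8080, no error handling); a client would correspondingly create a socket and connect to this address and port:

    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);      /* create the socket */

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);                    /* bind it to port 8080 */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        listen(fd, 5);                                  /* expect connection requests */

        for (;;) {
            /* re-route each incoming connection to a new socket, leaving
               fd free to receive additional clients */
            int conn = accept(fd, NULL, NULL);

            char buf[128];
            int n = read(conn, buf, sizeof(buf));       /* talk to this client */
            if (n > 0)
                write(conn, buf, n);                    /* echo it back */
            close(conn);
        }
    }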
Exercise 189 What can go wrong in this process? What happens if it does?
port usage
21 ftp
23 telnet
25 smtp (email)
42 name server
70 gopher
79 finger
80 http (web)
Exercise 190 What happens if the target system is completely different, and
does not adhere to the port-usage conventions?
Exercise 191 Is it possible to run two different web servers on the same
machine?
Communication is done using predefined protocols
To contact a server, the client sends the request to a predefined port on faith.
In addition, the data itself must be presented in a predefined format. For
example, when accessing the finger daemon, the data sent is the login of the
user in which we are interested. The server reads whatever comes over the
connection, assumes this is a login name, and tries to find information about
it. The set of message formats that may be used and their semantics are called
a communication protocol.
12.4 Middleware
Unix daemons are an example of a convention that enables different versions
of the same system to interact. To some degree other systems can too, by
having programs listen to the correct ports and follow the protocols. But there
is a need to generalize this sort of interoperability. This is done by
middleware.
The hard way to solve the problem is to deal with it directly in the
application. Thus the desktop application will need to acknowledge the fact
that the database is different, and perform the necessary translations in order
to access it correctly. This creates considerable additional work for the
developer of the application, and is specific for the systems in question.
A much better solution is to use a standard software layer that handles the
translation of data formats and service requests. This is what middleware is
all about.
CORBA provides middleware for objects
The heart of the system is the object request broker (ORB). This is a sort of
naming service where objects register their methods, and clients search for
them. The ORB makes the match and facilitates the connection, including the
necessary translations to compensate for the heterogeneity. Multiple ORBs
can also communicate to create one large object space, where all methods are
available for everyone.
12.5 Summary
Abstractions
Resource management
Hardware support
As we didn’t discuss implementations, hardware support is irrelevant at this
level.
Bibliography
[1] D. E. Comer, Computer Networks and Internets. Prentice Hall, 2nd ed., 1999.
[2] D. E. Comer and D. L. Stevens, Internetworking with TCP/IP, Vol. III: Client-Server Programming and Applications. Prentice Hall, 2nd ed., 1996.
Chapter 13
(Inter)networking
Interprocess communication within the confines of a single machine is
handled by that machine’s local operating system; for example, a Unix
system can support two processes that communicate using a pipe. But when
processes on different machines need to communicate, communication
networks are needed. This chapter describes the protocols used to establish
communications and pass data between remote systems.
When you compose an email message, the email application saves what you
write in a (temporary) file. At the top level, sending the email message is just
copying this file from the sender to the receiver’s incoming mail directory.
This is conceptually very simple. The message text begins with a line saying
“To: yossi”, so the receiving email program knows to whom the message is
addressed.
Long messages may need to be broken up into shorter ones
Two things are noteworthy. First, the packetization regards the data it
receives as a single entity that has to be packetized; it does not interpret it,
and does not care if it is actually composed of some higher-level headers
attached to the “real” data. Second, the resulting packets need not be the same
size: there is a maximal packet size, but if the remaining data is less than this,
a smaller packet can be sent.
The above description is based on the assumption that all packets arrive in the
correct order. But what if some packet is lost on the way, or if one packet
overtakes another and arrives out of sequence? Thus another responsibility of
the packetization layer is to handle such potential problems.
The notion that packets may be lost or reordered reflects the common case
where there is no direct link between the two computers. The message (that
is, all its packets) must then be routed through one or more intermediate
computers.
The above scenario is optimistic in the sense that it assumes that data arrives
intact. In fact, data is sometimes corrupted during transmission due to noise
on the communication lines. Luckily it is possible to devise means to
recognize such situations. For example, we can calculate the parity of each
packet and its headers, and add a parity bit at the end. If the parity does not
match at the receiving side, an error is flagged and the sender is asked to re-
send the packet. Otherwise, an acknowledgment is sent back to the sender to
notify it that the data was received correctly.
Parity is a simple but primitive form of catching errors, and its detection capability is limited. Therefore real systems use more sophisticated schemes such as CRC. However, the principle is the same. And naturally, we want a separate program to handle the calculation of the error detection code and the necessary re-sends.
[Figure: the layers the message passes through: mail, packetization, routing, and error handling on each end host, with intermediate routers implementing only the routing and error-handling layers.]
At the bottom level the bits need to be transmitted somehow. For example,
bits may be represented by voltage levels on a wire, or by pulses of light in a
waveguide. This is called the physical layer and does not concern us here —
we are more interested in the levels that are implemented by the operating
system.
The set of programs that the message goes through is called a protocol stack.
Logically, each layer talks directly with its counterpart on the other machine,
using some particular protocol. Each protocol specifies the format and
meaning of the messages that can be passed. Usually this is a protocol-
specific header and then a data field, that is not interpreted. For example, the
packetization layer adds a header that contains the packet number in the
sequence.
Exercise 192 What is the protocol of the email application? Look at your email and make an educated guess.
Standardization is important
IP, the Internet Protocol, is essentially concerned with routing data across
multiple networks, thus logically uniting these networks into one larger
network. This is called the “network” layer, and IP is a network protocol.
However, the networks used may have their own routing facility, so IP may
be said to correspond to the “top part” of the network layer. The lower layers
(“link” layer and “physical” layer) are not part of the TCP/IP suite — instead,
whatever is available on each network is used.
TCP and UDP provide end-to-end services (based on IP), making them
“transport” protocols. In addition, they are responsible for delivering the data
to the correct application. This is done by associating each application with a
port number. These are the same port numbers used in the bind and connect system calls, as described above on page 233.
The top layer is the “application” layer. Some basic applications (such as ftp
and telnet) are an integral part of the standard TCP/IP suite, and thus make
these services available around the world.
IP performs the routing of messages across the different networks, until they
reach their final destination. For example, the message may go over an
Ethernet from its origin to a router, and then over a token ring from the router
to its destination. Of course, it can also go through a larger number of such
hops. All higher levels, including the application, need not be concerned with
these details.
[Figure: hosts A and B each run the application, TCP, IP, and network-access layers. A message from host A crosses an Ethernet to a router (gateway), which runs IP and the access layers of both networks, and then crosses a token ring to host B.]
Exercise 193 Does IP itself have to know about all the different networks a
message will traverse in advance?
The worldwide amalgamation of thousands of public networks connecting
millions of computers is called the Internet, with a capital I.
The IPv4 header contains the following fields:
• Protocol version (4 bits). Most of the Internet uses version 4 (IPv4), but some uses the newer version 6. The header described here is for version 4.
• Header length, in words (4 bits).
• Indication of quality of service desired, which may or may not be supported by routers (8 bits).
• Packet length in bytes (16 bits). This limits a packet to a maximum of 64 KiB.
• Three fields that deal with fragmentation, and the place of this fragment in the sequence of fragments that make up the packet (32 bits).
• Time to live — how many additional hops this packet may propagate (8 bits). This is decremented on each hop, and when it hits 0 the packet is discarded.
• Identifier of higher-level protocol that sent this packet, e.g. TCP or UDP (8 bits).
• Header checksum, used to verify that the header has not been corrupted (16 bits).
• Sender’s IP address (32 bits).
• Destination IP address (32 bits).
Exercise 194 Is UDP useful? What can you do with a service that may silently lose datagrams?
TCP provides a reliable stream service
The TCP header contains the following fields:
• Source port number (16 bits). Port numbers are thus limited to the range up to 65,535. Of this the first 1024 are reserved for well-known services.
• Destination port number (16 bits).
• Sequence number of the first byte of data being sent in this packet (32 bits).
• Acknowledgment, expressed as the sequence number of the next byte of data that the sender expects to receive from the recipient (32 bits).
• Header size, so we will know where the data starts (4 bits). This is followed by 4 reserved bits, whose use has not been specified.
• Eight single-bit flags, used for control (e.g. in setting up a new connection).
• The available space that the sender has to receive more data from the recipient (16 bits).
• Checksum on the whole packet (16 bits).
• Indication of urgent data (16 bits).
And there are some useful applications too
Several useful applications have also become part of the TCP/IP protocol
suite. These include
• SMTP (Simple Mail Transfer Protocol), with features like mailing lists and mail forwarding. It just provides the transfer service to a local mail service that takes care of things like editing the messages.
• FTP (File Transfer Protocol), used to transfer files across machines. This uses one TCP connection for control messages (such as user information and which files are desired), and another for the actual transfer of each file.
• Telnet, which provides a remote login facility for simple terminals.
To read more: The TCP/IP suite is described in Stallings [7] section 13.2, and
very briefly also in Silberschatz and Galvin [5] section 15.6. It is naturally
described in more detail in textbooks on computer communications, such as
Tanenbaum [9] and Stallings [6]. Finally, extensive coverage is provided in
books specific to these protocols, such as Comer [2] or Stevens [8].
But how does the destination know if the data is valid? After all, it is just a
sequence of 0’s and 1’s. The answer is that the data must be encoded in such
a way that allows corrupted values to be identified. A simple scheme is to add
a parity bit at the end. The parity bit is the binary sum of all the other bits.
Thus if the message includes an odd number of 1’s, the parity bit will be 1,
and if it includes an even number of 1’s, it will be 0. After adding the parity
bit, it is guaranteed that the total number of 1’s is even. A receiver that
receives a message with an odd number of 1’s can therefore be sure that it
was corrupted.
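Computing the parity bit amounts to XORing all the bits of the message together, as in this small example:

    /* returns the parity bit for the given buffer: 1 if the number of 1 bits
       is odd, 0 if it is even (so appending it makes the total even) */
    int parity(const unsigned char *data, int len)
    {
        int p = 0;
        for (int i = 0; i < len; i++)
            for (int b = 0; b < 8; b++)
                p ^= (data[i] >> b) & 1;
        return p;
    }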
Exercise 196 What if two bits were corrupted? or three bits? In other words,
when does parity identify a problem and when does it miss?
Or we can send redundant data to begin with
An alternative approach is to encode the data in such a way that we can not
only detect that an error has occurred, but we can also correct the error.
Therefore data does not have to be resent. This scheme is called forward error
correction (FEC).
[Figure: bit positions 1–15, with parity bits at positions 1, 2, 4, and 8; each parity bit covers the positions whose binary representation includes it.]
If any bit is corrupted, this will show up in a subset of the parity bits. The
positions of the affected parity bits then provide the binary representation of
the location of the corrupted data bit.
Exercise 197 What happens if one of the parity bits is the one that is corrupted?
Exercise 198 Another way to provide error correction is to arrange n^2 data bits in a square, and compute the parity of each row and column. A corrupted bit then causes two parity bits to be wrong, and their intersection identifies the corrupt bit. How does this compare with the above scheme?
The main reason for using FEC is in situations where ARQ is unwieldy. For
example, FEC is better suited for broadcasts and multicasts, because it avoids
the need to collect acknowledgments from all the recipients in order to verify
that the data has arrived safely to all of them.
The examples above are simple, but can only handle one corrupted bit. For
example, if two data bits are corrupted, this may cancel out in one parity
calculation but not in another, leading to a pattern of wrong parity bits that
misidentifies a corrupted data bit.
[Figure: an example in which two corrupted data bits produce a pattern of wrong parity bits that points at the wrong position.]
The most commonly used error detection code is the cyclic redundancy check
(CRC). This can be explained as follows. Consider the data bits that encode
your message as a number. Now tack on a few more bits, so that if you divide
the resulting number by a predefined divisor, there will be no remainder. The
receiver does just that; if there is no remainder, it is assumed that the message
is valid, otherwise it was corrupted.
To read more: There are many books on coding theory and the properties of
the resulting codes, e.g. Arazi [1]. CRC is described in Stallings [6].
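The division is done modulo 2, i.e. with XORs and no carries, which makes it cheap to compute. The following sketch computes an 8-bit CRC bit by bit; the polynomial 0x07 is just an example, and real protocols use standardized (and usually longer) polynomials:

    /* bitwise CRC-8 over the buffer, using the example polynomial
       x^8 + x^2 + x + 1 (0x07); the receiver recomputes this and compares */
    unsigned char crc8(const unsigned char *data, int len)
    {
        unsigned char crc = 0;
        for (int i = 0; i < len; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++) {
                if (crc & 0x80)
                    crc = (unsigned char)((crc << 1) ^ 0x07);
                else
                    crc <<= 1;
            }
        }
        return crc;
    }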
Timeouts are needed to cope with lost data
The solution to this problem is to use timeouts. The sender waits for an ack
for a limited time, and if the ack does not arrive, it assumes the packet was
lost and retransmits it. But if the packet was only delayed, two copies may
ultimately arrive! The transport protocol must deal with such situations by
numbering the packets, and discarding duplicate ones.
The recipient of a message must have a buffer large enough to hold at least a
single packet — otherwise when a packet comes in, there will be nowhere to
store it. In fact, the need to bound the size of this buffer is the reason for
specifying the maximal transfer unit allowed by a network.
Even when the available buffer space is more than a single packet, it is still
bounded. If more packets arrive than there is space in the buffer, the buffer
will overflow. The extra packets have nowhere to go, so they are dropped,
meaning that for all intents and purposes they are lost. The situation in which
the network is overloaded and drops packets is called congestion.
In order to avoid congestion, flow control is needed. Each sender has to know
how much free space is available in the recipient’s buffers. It will only send
data that can fit into this buffer space. Then it will stop transmitting, until the
recipient indicates that some buffer space has been freed. Only resends due to
transmission errors are not subject to this restriction, because they replace a
previous transmission that was already taken into account.
An example of how such control data is incorporated in the header of the
TCP protocol is given above on page 245.
Large buffers improve utilization of network resources
If the buffer can only contain a single packet, the flow control implies that
packets be sent one by one. When each packet is extracted by the application
to which it is destined, the recipient computer will notify the sender, and
another packet will be sent. Even if the application extracts the packets as
soon as they arrive, this will lead to large gaps between packets, due to the
time needed for the notification (known as the round-trip time (RTT)).
[Figure: timing of sending packets one at a time: the time to send each packet is followed by the time to traverse the network and the wait for the recipient's notification.]
Exercise 201 Denote the network bandwidth by B, and the latency to traverse
the network by tℓ. Assuming data is extracted as soon as it arrives, what size
buffer is needed to keep the network 100% utilized?
The considerations for setting the window size are varied. On one hand is the
desire to achieve the maximal possible bandwidth. As the bandwidth is
limited by BW ≤ window/RTT, situations in which the RTT is large imply
the use of a large
window and long timeouts. For example, this is the case when satellite links
are employed. Such links have a propagation delay of more than 0.2 seconds,
so the round-trip time is nearly half a second. The delay on terrestrial links,
by contrast, is measured in milliseconds.
Exercise 202 Does this mean that high-bandwidth satellite links are useless?
On the other hand, a high RTT can be the result of congestion. In this case it
is better to reduce the window size, in order to reduce the overall load on the
network. Such a scheme is used in the flow control of TCP, and is described
below.
An efficient implementation employs linked lists and I/O vectors
The management of buffer space for packets is difficult for two reasons:
• Packets may come in different sizes, although there is an upper limit imposed by the network.
• Headers are added and removed by different protocol layers, so the size of the message and where it starts may change as it is being processed. It is desirable to handle this without copying it to a new memory location each time.
The solution is to store the data in a linked list of small units, sometimes
called mbufs, rather than in one contiguous sequence. Adding a header is then
done by writing it in a separate mbuf, and prepending it to the list.
The problem with a linked list structure is that now it is impossible to define
the message by its starting address and size, because it is not contiguous in
memory. Instead, it is defined by an I/O vector: a list of contiguous chunks,
each of which is defined by its starting point and size.
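Unix exposes the same idea at the system call interface: the real struct iovec and writev allow a header and data stored in separate buffers to be sent as one message, without first copying them into a contiguous buffer:

    #include <sys/uio.h>
    #include <unistd.h>

    /* send a header and a payload that live in different buffers */
    void send_message(int fd, const void *hdr, int hdrlen,
                      const void *data, int datalen)
    {
        struct iovec iov[2];
        iov[0].iov_base = (void *)hdr;     /* first chunk: the header */
        iov[0].iov_len  = hdrlen;
        iov[1].iov_base = (void *)data;    /* second chunk: the data */
        iov[1].iov_len  = datalen;

        writev(fd, iov, 2);                /* the kernel gathers both chunks */
    }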
Congestion hurts
The reason that congestion needs to be controlled (or rather, avoided) is that
it has observable consequences. If congestion occurs, packets are dropped by
the network (actually, by the routers that implement the network). These
packets have to be re-sent, incurring additional overhead and delays. As a
result, communication as a whole slows down.
Even worse, this degradation in performance is not graceful. It is not the case
that when you have a little congestion you get a little degradation. Instead,
you get a positive feedback effect that makes things progressively worse. The
reason is that when a packet is dropped, the sending node immediately sends
it again. Thus it increases the load on a network that is overloaded to begin
with. As a result even more packets are dropped. Once this process starts, it
drives the network into a state where most packets are dropped all the time
and nearly no traffic gets through. The first time this happened in the Internet
(October 1986), the effective bandwidth between two sites in Berkeley
dropped from 32 Kb/s to 40 b/s — a drop of nearly 3 orders of magnitude [3].
The only thing that a transmitter knows reliably is when packets arrive to the
receiver — because the receiver sends an ack. Of course, it takes the ack
some time to arrive, but once it arrives the sender knows the packet went
through safely. This implies that the network is functioning well.
On the other hand, the sender doesn’t get any explicit notification of failures.
The fact that a packet is dropped is not announced with a nack (only packets
that arrive but have errors in them may be nack’d). Therefore the sender must
infer that a packet was dropped by the absence of the corresponding ack. This
is done with a timeout mechanism: if the ack did not arrive within the
prescribed time, we assume that the packet was dropped.
The problem with timeouts is how to set the time threshold. If we wait and an
ack fails to arrive, this could indicate that the packet was dropped. But it
could just be the result of a slow link, that caused the packet (or the ack) to be
delayed. The question is then how to distinguish between these two cases.
The answer is that the threshold should be tied to the round-trip time (RTT):
the higher the RTT, the higher the threshold should be.
The reason for using an average rather than just one measurement is that
Internet RTTs have a natural variability. It is therefore instructive to re-write
the above expression as [3]
r_new = r + α(m − r)
With this form, it is natural to regard r as a predictor of the next measurement, and m − r as an error in the prediction. But the error has two possible origins: a real change in the average RTT, and the random fluctuation of this particular measurement.
The factor α multiplies the total error. This means that we need to
compromise. We want a large α to get the most out of the new measurement,
and take a big step towards the true average value. But this risks amplifying
the random error too much. It is therefore recommended to use relatively
small values, such as 0.1 ≤ α ≤ 0.2. This will make the fluctuations of r be
much smaller than the fluctuations in m, at the price of taking longer to
converge.
In the context of setting the threshold for timeouts, it is important to note that
we don’t really want an estimate of the average RTT: we want the maximum.
Therefore we also need to estimate the variability of m, and use a threshold
that takes this variability into account.
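A sketch of such an estimator: keep a smoothed RTT and a smoothed measure of its variability, and set the timeout threshold above both. The constants here are only illustrative, though TCP implementations use similar formulas following Jacobson [3]:

    /* exponentially weighted estimates of the RTT and its variability */
    static double rtt    = 0.5;     /* initial guess, in seconds */
    static double rttvar = 0.25;

    void update_rtt(double m)       /* m: a new RTT measurement */
    {
        const double alpha = 0.125; /* small, to smooth out random errors */
        const double beta  = 0.25;

        double err = m - rtt;                        /* error in the prediction */
        rtt    += alpha * err;                       /* r_new = r + alpha(m - r) */
        rttvar += beta * ((err < 0 ? -err : err) - rttvar);
    }

    double timeout_threshold(void)
    {
        /* we want (roughly) the maximum, not the average, so add a few
           times the estimated variability */
        return rtt + 4 * rttvar;
    }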
The basic idea of TCP congestion control is to throttle the transmission rate:
do not send more packets than what the network can handle. To find out how
much the network can handle, each sender starts out by sending only one
packet and waiting for an ack. If the ack arrives safely, two packets are sent,
then four, etc.
The “slow start” algorithm used in TCP is extremely simple. The initial
window size is 1. As each ack arrives, the window size is increased by 1.
That’s it.
Exercise 203 Adding 1 to the window size seems to lead to linear growth of
the transmission rate. But this algorithm actually leads to exponential
growth. How come?
Hold back if congestion occurs
The problem with slow start is that it isn’t so slow, and is bound to quickly
reach a window size that is larger than what the network can handle. As a
result, packets will be dropped and acks will be missing. When this happens,
the sender enters “congestion avoidance” mode. In this mode, it tries to
converge on the optimal window size. This is done as follows:
1. Set a threshold window size to half the current window size. This is the
previous window size that was used, and worked OK.
2. Set the window size to 1 and restart the slow-start algorithm. This allows
the network time to recover from the congestion.
3. When the window size reaches the threshold set in step 1, stop increasing it by 1 on each ack. Instead, increase it by 1/w on each ack, where w is the window size.
This will cause the window size to grow much more slowly — in fact, now it
will be linear, growing by 1 packet on each RTT. The reason to continue
growing is twofold: first, the threshold may simply be too small. Second,
conditions may change, e.g. a competing communication may terminate
leaving more bandwidth available.
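The window-adjustment logic described above fits in a few lines; the following is only a schematic sketch, ignoring the many details of real TCP implementations:

    static double window    = 1.0;    /* congestion window, in packets */
    static double threshold = 64.0;   /* initial threshold, arbitrary */

    void on_ack(void)
    {
        if (window < threshold)
            window += 1.0;            /* slow start: exponential growth */
        else
            window += 1.0 / window;   /* congestion avoidance: ~1 packet per RTT */
    }

    void on_loss(void)                /* i.e. a missing ack / timeout */
    {
        threshold = window / 2;       /* the last window size that worked OK */
        window = 1.0;                 /* restart slow start */
    }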
To read more: The classic on TCP congestion avoidance and control is the
paper by Jacobson [3]. An example of recent developments in this area is the
paper by Paganini et al. [4].
13.2.4 Routing
The solution, like in many other cases, is to do this in several steps. The first
is to translate the human-readable name to a numerical IP address. The
second is to route the message from one router to another along the way,
according to the IP address. This is done based on routing tables, so another
question is how to create and maintain the routing tables.
For example, consider looking up the host cs.huji.ac.il. The last component,
il, is the top-level domain name, and includes all hosts in the domain “Israel”.
There are not that many top-level domains, and they hardly ever change, so it
is possible to maintain a set of root DNS servers that know about all the top-
level DNSs. Every computer in the world must be able to find its local root
DNS, and can then query it for the address of the il DNS.
Given the address of the il DNS, it can be queried for the address of ac.il, the server of academic hosts in Israel (actually, it will be queried for the whole address, and will return the most specific address it knows; as a minimum, it should be able to return the address of ac.il). This server, in turn, will be asked for the address of huji.ac.il, the Hebrew University domain server. Finally, the
address of the Computer Science host will be retrieved.
Exercise 204 Assume the .il DNS is down. Does that mean that all of Israel is
unreachable?
Somewhat surprisingly, one of the least used is the code for the United States
(.us). This is because most US-based entities have preferred to use functional
domains such as
.com  companies
.edu  educational institutions
.gov  government facilities
.org  non-profit organizations
.pro  professionals (doctors, lawyers)
As a sidenote, DNS is one of the major applications that use UDP rather than
TCP.
IP addresses identify a network and a host
IP version 4 addresses are 32 bits long, that is four bytes. The most common
way to write them is in dotted decimal notation, where the byte values are
written separated by dots, as in 123.45.67.89. The first part of the address
specifies a network, and the rest is a host on that network. In the past the
division was done on a byte boundary. The most common were class C
addresses, in which the first three bytes were allocated to the network part.
This allows 2^21 = 2097152 networks to be identified (3 bits are used to
identify this as a class C address), with a limit of 254 hosts per network (the
values 0 and 255 have special meanings). Today classless addresses are used,
in which the division can be done at arbitrary bit positions. This is more
flexible and allows for more networks.
The Internet protocols were designed to withstand a nuclear attack and not be
susceptible to failures or malicious behavior. Routes are therefore created
locally with no coordinated control.
When a router boots, it primes its routing table with data about immediate
neighbors. It then engages in a protocol of periodically sending all the routes
it knows about to its neighbors, with an indication of the distance to those
destinations (in hops). When such a message arrives from one of the
neighbors, the local routing table is updated with new routes or routes that are
shorter than what the router knew about before.
The same table is used to reach the final destination of a message, in the last
network in the set.
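The update rule itself is simple. The following sketch (with invented data structures, assuming each neighbor is one hop away, and without the refinements used by real routing protocols) shows how a neighbor's advertisement is processed:

    #define MAX_DEST 1024
    #define INFINITE 9999

    /* routing table: for each destination, the distance (in hops) and the
       neighbor to use as the next hop */
    static int distance[MAX_DEST];
    static int next_hop[MAX_DEST];

    void init_routing_table(void)
    {
        for (int i = 0; i < MAX_DEST; i++) {
            distance[i] = INFINITE;
            next_hop[i] = -1;
        }
        /* ...then prime the table with the immediate neighbors... */
    }

    /* a neighbor 'nbr' advertises that it can reach 'dest' in 'hops' hops */
    void handle_advertisement(int nbr, int dest, int hops)
    {
        int d = hops + 1;             /* one extra hop to reach the neighbor */
        if (d < distance[dest]) {     /* new or shorter route: update the table */
            distance[dest] = d;
            next_hop[dest] = nbr;
        }
    }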
13.3 Summary
Abstractions
Resource management
Workload issues
Hardware support
Bibliography
[1] B. Arazi, A Commonsense Approach to the Theory of Error Correcting
Codes. MIT Press, 1988.
[2] D. E. Comer, Internetworking with TCP/IP, Vol. I: Principles, Protocols,
and Architecture. Prentice-Hall, 3rd ed., 1995.
[3] V. Jacobson, “Congestion avoidance and control”. In ACM SIGCOMM
Conf., pp. 314–329, Aug 1988.
14.1.1 Authentication
In distributed systems, services are rendered in response to incoming
messages. For example, a file server may be requested to disclose the
contents of a file, or to delete a file. Therefore it is important that the server
know for sure who the client is. Authentication deals with such verification of
identity.
The simple solution is to send the user name and password with every
request. The server can then verify that this is the correct password for this
user, and if so, it will respect the request. The problem is that an
eavesdropper can obtain the user’s password by monitoring the traffic on the
network. Encrypting the password doesn’t help at all: the eavesdropper can
simply copy the encrypted password, without even knowing what the original
password was!
Background: encryption
Encryption deals with hiding the content of messages, so that only the
intended recipient can read them. The idea is to apply some transformation on
the text of the message. This transformation is guided by a secret key. The
recipient also knows the key, and can therefore apply the inverse
transformation. Eavesdroppers can obtain the transformed message, but don’t
know how to invert the transformation.
(a) It looks up the user’s password p, and uses a one-way function to create an encryption key Kp from it. One-way functions are functions that are hard to reverse, meaning that it is easy to compute Kp from p, but virtually impossible to compute p from Kp.
(e) It bundles the session key with the created unforgeable ticket, creating
{Ks,{U, Ks}Kk}.
(f) Finally, the whole thing is encrypted using the user-key that was generated
from the user’s password, leading to {Ks,{U, Ks}Kk}Kp. This is sent back to the
client.
(b) Using Kp, the client decrypts the message it got from the server, and
obtains Ks and {U, Ks}Kk.
(c) It erases the user key Kp.
The session key is used to get other keys
Now, the client can send authenticated requests to the server. Each request is composed of two parts: the request itself, R, encrypted using Ks, and the unforgeable ticket. Thus the message sent is {{R}Ks, {U, Ks}Kk}. When the
server receives such a request, it decrypts the ticket using its secret key Kk,
and finds U and Ks. If this works, the server knows that the request indeed
came from user U, because only user U’s password could be used to decrypt
the previous message and get the ticket. Then the server uses the session key
Ks to decrypt the request itself. Thus even if someone spies on the network
and manages to copy the ticket, they will not be able to use it because they
cannot obtain the session key necessary to encrypt the actual requests.
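To make the shape of such a request concrete, here is a schematic sketch. The toy XOR "cipher", the struct layouts, and the function names are all invented for illustration; they stand in for a real cipher and a real wire format.

#include <stddef.h>
#include <stdio.h>

/* Schematic sketch only: a toy XOR "cipher" stands in for real encryption,
   and the struct layouts are invented for illustration. */
#define KEYLEN 16

static void xor_crypt(unsigned char *buf, size_t len, const unsigned char *key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key[i % KEYLEN];       /* the same call encrypts and decrypts */
}

struct ticket { char user[32]; unsigned char Ks[KEYLEN]; };   /* {U, Ks} */

struct request {
    struct ticket  enc_ticket;           /* {U, Ks}Kk: opaque to the client */
    unsigned char  enc_req[128];         /* {R}Ks */
};

/* Server side: recover Ks from the ticket using the server's key Kk, then
   decrypt the request itself with the session key. */
void handle_request(struct request *m, const unsigned char *Kk)
{
    struct ticket t = m->enc_ticket;
    xor_crypt((unsigned char *)&t, sizeof t, Kk);       /* reveals U and Ks */
    xor_crypt(m->enc_req, sizeof m->enc_req, t.Ks);     /* reveals R */
    printf("request \"%s\" authenticated as user %s\n",
           (char *)m->enc_req, t.user);
}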
14.1.2 Security
Kerberos can be used within the confines of a single organization, as it is
based on a trusted third party: the authentication server. But on the Internet in
general nobody trusts anyone. Therefore we need mechanisms to prevent
malicious actions by intruders.
The question, of course, is how to identify “bad things”. Simple firewalls are
just packet filters: they filter out certain packets based on some simple
criterion. Criteria are usually expressed as rules that make a decision based
on three inputs: the source IP address, the destination IP address, and the
service being requested. For example, there can be a rule saying that
datagrams from any address to the mail server requesting mail service are
allowed, but requests for any other service should be dropped. As another
example, if the organization has experienced break-in attempts coming from
a certain IP address, the firewall can be programmed to discard any future
packets coming from that address.
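A minimal sketch of this kind of rule matching appears below. The rule fields, the first-match-wins evaluation, and the default-drop policy are assumptions made for the example, not a description of any particular firewall product.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative packet-filter sketch: rules match on source address,
   destination address, and requested service (port); the first matching
   rule decides, and anything unmatched is dropped. */
#define ANY 0

struct rule {
    uint32_t src, dst;        /* IP addresses; ANY matches everything */
    uint16_t service;         /* destination port; ANY matches everything */
    bool     allow;
};

static bool match(uint32_t field, uint32_t rule_field)
{
    return rule_field == ANY || rule_field == field;
}

bool filter(const struct rule *rules, int nrules,
            uint32_t src, uint32_t dst, uint16_t service)
{
    for (int i = 0; i < nrules; i++)
        if (match(src, rules[i].src) && match(dst, rules[i].dst) &&
            match(service, rules[i].service))
            return rules[i].allow;
    return false;             /* default: drop */
}

With this sketch, a table containing the single rule { ANY, mail_server_ip, 25, true } would implement the mail-only policy described above.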
As can be expected, such solutions are rather limited: they may often block
packets that are perfectly benign, and on the other hand they may miss
packets that are part of a complex attack. A more advanced technology for
filtering is to use stateful inspection. In this case, the firewall actually follows
the state of the various connections, based on knowledge of the communication
protocols being used. This makes it easier to specify rules, and also supports
rules that are more precise in nature. For example, the firewall can be
programmed to keep track of TCP connections. If an internal host creates a
TCP connection to an external host, the data about the external host is
retained. Thus incoming TCP datagrams belonging to a connection that was
initiated by an internal host can be identified and forwarded to the host,
whereas other TCP packets will be dropped.
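The connection-tracking idea can be sketched as follows; the table layout and function names are again invented for illustration, and a real firewall would also follow the TCP state machine and expire old entries.

#include <stdint.h>
#include <stdbool.h>

#define MAX_CONN 1024         /* illustrative table size */

struct conn { uint32_t ext_ip; uint16_t ext_port, int_port; bool used; };
static struct conn conns[MAX_CONN];

/* Called when an internal host opens a TCP connection to an external one. */
void note_outgoing(uint32_t ext_ip, uint16_t ext_port, uint16_t int_port)
{
    for (int i = 0; i < MAX_CONN; i++)
        if (!conns[i].used) {
            conns[i] = (struct conn){ ext_ip, ext_port, int_port, true };
            return;
        }
}

/* Incoming TCP packets are forwarded only if they belong to a connection
   that was initiated from the inside. */
bool allow_incoming(uint32_t src_ip, uint16_t src_port, uint16_t dst_port)
{
    for (int i = 0; i < MAX_CONN; i++)
        if (conns[i].used && conns[i].ext_ip == src_ip &&
            conns[i].ext_port == src_port && conns[i].int_port == dst_port)
            return true;
    return false;
}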
The major feature provided by NFS is support for the creation of a uniform
file-system name space. You can log on to any workstation, and always see
your files in the same way. But your files are not really there — they actually
reside on a file server in the machine room.
The way NFS supports the illusion of your files being available wherever you
are is by mounting the file server’s file system onto the local file system of
the workstation you are using. One only needs to specify the remoteness
when mounting; thereafter it is handled transparently during traversal of the
directory tree. In the case of a file server the same file system is mounted on
all other machines, but more diverse patterns are possible.
Exercise 205 The uniformity you see is an illusion because actually the file
systems on the workstations are not identical — only the part under the
mount point. Can you think of an example of a part of the file system where
differences will show up?
Mounting a file system means that the root directory of the mounted file
system is identified with a leaf directory in the file system on which it is
mounted. For example, a directory called /mnt/fs on the local machine can be
used to host the root of the file server's file system. When trying to access a
file called /mnt/fs/a/b/c, the local file system will traverse the local root
and mnt directories normally (as described in Section 5.2.1). Upon reaching
the directory /mnt/fs, it will find that this is a mount point for a remote file
system. It will then parse the rest of the path by sending requests to
access /a, /a/b, and /a/b/c to the remote file server. Each such request is called a
lookup.
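A sketch of the client-side loop is given below. The nfs_lookup function is a stand-in for the actual lookup RPC (here it is just a placeholder), and a real client would also handle caching, symbolic links, and further mount points along the way.

#include <string.h>

typedef struct { unsigned char data[32]; } fhandle_t;   /* opaque file handle */

/* Placeholder for the LOOKUP RPC a real client would send to the server. */
int nfs_lookup(const fhandle_t *dir, const char *name, fhandle_t *out)
{
    (void)dir; (void)name;
    memset(out, 0, sizeof *out);     /* pretend the server returned a handle */
    return 0;
}

/* Resolve a path relative to the mount point one component at a time,
   issuing one lookup per component (e.g. "a", then "b", then "c"). */
int resolve(const fhandle_t *root, const char *path, fhandle_t *result)
{
    fhandle_t cur = *root, next;
    char buf[1024], *name, *save;

    strncpy(buf, path, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    for (name = strtok_r(buf, "/", &save); name != NULL;
         name = strtok_r(NULL, "/", &save)) {
        if (nfs_lookup(&cur, name, &next) != 0)
            return -1;               /* component not found */
        cur = next;
    }
    *result = cur;
    return 0;
}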
Exercise 206 What happens if the local directory serving as a mount point
(/mnt/fs in the above example) is not empty?
Exercise 207 Can the client send the whole path (/a/b/c in the above example)
at once, instead of one component at a time? What are the implications?
NFS servers are stateless
We normally consider the file system to maintain the state of each file:
whether it is open, where we are in the file, etc. When implemented on a
remote server, such an approach implies that the server is stateful. Thus the
server is cognizant of sessions, and can use this knowledge to perform
various optimizations. However, such a design is also vulnerable to failures:
if a client fails, the server may be stuck with open files that are never cleaned up; if
a server fails and comes up again, clients will lose their connections and
their open files. In addition, a stateful server is less scalable because it needs
to maintain state for every client.
The NFS design therefore uses stateless servers. The remote file server does
not maintain any data about which client has opened what file, and where it is
in the file.
To interact with stateless servers, each operation must be self-contained. In
particular,
• There is no need for open and close operations at the server level. However,
there is an open operation on the client, which parses the file's path name and
retrieves a handle to it, as described below.
Handling of local and remote accesses is mediated by the vfs (virtual file
system). Each file or directory is represented by a vnode in the vfs; this is like
a virtual inode. Mapping a path to a vnode is done by lookup on each
component of the path, which may traverse multiple servers due to cascading
mounts. The final vnode contains an indication of the server where the file
resides, which may be local. If it is local, it is accessed using the normal
(Unix) file system. If it is remote, the NFS client is invoked to do an RPC to
the appropriate NFS server. That server injects a request to its vnode layer,
where it is served locally, and then returns the result.
[Figure: the NFS architecture. On both the client and the server, local processes issue file system calls that pass through the vnode operations layer (VFS); the VFS directs each operation either to the local file system or to the NFS client, which contacts the NFS server across the network, where the request re-enters the vnode layer and is served by the local file system.]
State is confined to the NFS client
For remote access, the NFS client contains a handle that allows direct access
to the remote object. This handle is an opaque object created by the server:
the client receives it when parsing a file or directory name, stores it, and
sends it back to the server to identify the file or directory in subsequent
operations. The content of the handle is only meaningful to the server, and is
used to encode the object’s ID. Note that using such handles does not violate
the principle of stateless servers. The server generates handles, but does not
keep track of them, and may revoke them based on a timestamp contained in
the handle itself. If this happens a new handle has to be obtained. This is done
transparently to the application.
The NFS client also keeps pertinent state information, such as the position in
the file. In each access, the client sends the current position to the server, and
receives the updated position together with the response.
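A minimal sketch of what this means for a read is shown below; the type and field names are assumptions made for the illustration.

#include <stdint.h>

/* With a stateless server, every read names the file (by its opaque handle)
   and gives the offset explicitly; the position is kept by the client. */
typedef struct { unsigned char data[32]; } nfs_handle_t;

struct read_request {
    nfs_handle_t handle;     /* which file, as returned earlier by lookup */
    uint64_t     offset;     /* where in the file */
    uint32_t     count;      /* how many bytes */
};

struct client_file {         /* state kept only by the NFS client */
    nfs_handle_t handle;
    uint64_t     pos;
};

/* Build the next read request from the client-side state. */
void build_read(const struct client_file *f, uint32_t count,
                struct read_request *req)
{
    req->handle = f->handle;
    req->offset = f->pos;    /* the server is told the position explicitly */
    req->count  = count;
}

/* After the reply arrives, the client updates its own notion of the
   position, by the number of bytes the server reported as read. */
void note_reply(struct client_file *f, uint32_t bytes_read)
{
    f->pos += bytes_read;
}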
Both file attributes (inodes) and file data blocks are cached on client nodes to
reduce communication and improve performance. The question is what
happens if multiple clients access the same file at the same time. One
desirable option is session semantics: in this approach each user gets a separate copy of the file
data, which is only written back to the server when the file is closed. This approach is used in the Andrew file
system, and provides support for disconnected operation.
The semantics of NFS are somewhat less well-defined, being the result of
implementation considerations. In particular, shared write access can lead to
data loss. However, this is a problematic access pattern in any case. NFS does
not provide any means to lock files and guarantee exclusive access.
Just as file servers provide a service of storing data, computational
servers provide a service of hosting computations. They provide the CPU
cycles and physical memory needed to perform the computation. The
program code is provided by the client, and executed on the server.
Exercise 209 Are there any restrictions on where processes can be migrated?
Migration should be amortized by lots of computation
There are two major questions involved in migration for load balancing:
which process to migrate, and where to migrate it.
The considerations for choosing a process for migration involve its size. This
has two dimensions. In the space dimension, we would like to migrate small
processes: the smaller the used part of the process’s address space, the less
data that has to be copied to the new location. In the time dimension, we
would like to migrate long processes, that will continue to compute for a long
time after they are migrated. Processes that terminate soon after being
migrated waste the resources invested in moving them, and would have done
better staying where they were.
Luckily, the common distributions of job runtimes are especially well suited
for this. As noted in Section 2.3, job runtimes are well modeled by a heavy-
tailed distribution. This means that a small number of processes dominate the
CPU usage. Moving only one such process can make all the difference.
Moreover, it is relatively easy to identify these processes: they are the oldest
ones available. This is because a process that has already run for a long time
can be assigned to the tail of the distribution. The distribution of process
lifetimes has the property that if a process is in its tail, it is expected to run for
even longer, more so than processes that have only run for a short time, and
(as far as we know) are not from the tail of the distribution. (This was
explained in more detail on page 186.)
Details: estimating remaining runtime
A seemingly good model for process runtimes, at least for those over a
second long, is that they follow a Pareto distribution with parameter near −1:

Pr(r > t) = 1/t

The conditional distribution describing the runtime, given that we already
know that it is more than a certain number T, is then

Pr(r > t | r > T) = T/t

Thus, if the current age of a process is T, the probability that it will run for
more than 2T time is about 1/2. In other words, the median of the expected
remaining running time grows linearly with the current running time.
To read more: The discussion of process lifetimes and their interaction with
migration is based on Harchol-Balter and Downey [7].
Choosing a destination can be based on very little information
The Mosix system is a modified Unix, with support for process migration. At
the heart of the system lies a randomized information dispersal algorithm.
Each node in the system maintains a short load vector, with information
regarding the load on a small set of nodes [5]. The first entry in the vector is
the node’s own load. Other entries contain data that it has received about the
load on other nodes.
At certain intervals, say once a minute, each node sends its load vector to
some randomly chosen other node. A node that received such a message
merges it into its own load vector. This is done by interleaving the top halves
of the two load vectors, and deleting the bottom halves. Thus the retained
data is the most up-to-date available.
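A sketch of this merge is shown below, assuming a vector of four entries (the typical size mentioned below) with entry 0 holding the node's own load; the exact ordering convention is one plausible reading of the description, not the actual Mosix code.

#define L 4                      /* typical load vector length */

struct load_info { int node; double load; };

/* Merge a load vector received from another node into our own: keep the
   top halves of both vectors, interleaved, and drop the bottom halves.
   Entry 0 remains this node's own load. */
void merge_load_vectors(struct load_info own[L], const struct load_info recv[L])
{
    struct load_info merged[L];
    for (int i = 0; i < L / 2; i++) {
        merged[2 * i]     = own[i];      /* our i-th most recent entry */
        merged[2 * i + 1] = recv[i];     /* the sender's i-th entry */
    }
    for (int i = 0; i < L; i++)
        own[i] = merged[i];
}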
Exercise 210 Isn’t there a danger that all these messages will overload the
system?
The information in the load vector is used to obtain a notion of the general
load on the system, and to find migration partners. When a node finds that its
load differs significantly from the perceived load on the system, it either tries
to offload processes to some underloaded node, or solicits additional
processes from some overloaded node. Thus load is moved among the nodes
leading to good load balancing within a short time.
The surprising thing about the Mosix algorithm is that it works so well
despite using very little information (the typical size of the load vector is only
four entries). This turns out to have a solid mathematical backing. An abstract
model for random assignment of processes is the throwing of balls into a set
of urns. If n balls are thrown at random into n urns, one should not expect
them to spread out evenly. In fact, the probability that any given urn stays
empty is (1 − 1/n)^n, which for large n tends to 1/e — i.e. more than a third of
the urns are expected to stay empty! At the other extreme, the urn that gets the
most balls is expected to have about log n/ log log n balls in it. But if we do not
choose the urns at random one at
a time, but rather select two each time and place the ball in the emptier of the
two, the expected number of balls in the fullest urn drops to log log n [3].
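The effect is easy to see in a small simulation, such as the sketch below, which throws n balls into n urns once with a single random choice per ball and once with the better of two random choices, and reports the fullest urn in each case.

#include <stdio.h>
#include <stdlib.h>

/* Small simulation of the balls-and-urns model: compare the maximum load
   when each ball picks one urn at random vs. when it picks the emptier of
   two random urns. */
int main(void)
{
    enum { N = 100000 };
    static int one[N], two[N];
    int max1 = 0, max2 = 0;

    srand(42);
    for (int b = 0; b < N; b++) {
        one[rand() % N]++;                   /* single random choice */

        int u = rand() % N, v = rand() % N;  /* two random candidates */
        if (two[v] < two[u]) u = v;          /* pick the emptier one */
        two[u]++;
    }
    for (int i = 0; i < N; i++) {
        if (one[i] > max1) max1 = one[i];
        if (two[i] > max2) max2 = two[i];
    }
    printf("max load, one choice: %d   two choices: %d\n", max1, max2);
    return 0;
}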
This is analogous to the Mosix algorithm, which migrates processes to the less
loaded of a small subset of randomly chosen nodes. In addition to its efficient
assignment algorithm, Mosix employs a couple of other heuristic
optimizations. One is the preferential migration of “chronic forgers” —
processes that fork many other processes. Migrating such processes
effectively migrates their future offspring as well. Another optimization is to
amortize the cost of migration by insisting that a process complete a certain
amount of work before being eligible for migration. This prevents situations
in which processes spend much of their time migrating, and do not make any
progress.
Negotiating the migration of a process is only part of the problem. Once the
decision to migrate has been made, the actual migration should be done. Of
course, the process itself should continue its execution as if nothing had
happened.
To achieve this magic, the process is first blocked. The operating systems on
the source and destination nodes then cooperate to create a perfect copy of
the original process on the destination node. This includes not only its
address space, but also its description in various system tables. Once
everything is in place, the process is restarted on the new node.
Exercise 211 Can a process nevertheless discover that it had been migrated?
Exercise 212 Does all the process’s data have to be copied before the process
is restarted?
Special care is needed to support location-sensitive services
But some features may not be movable. For example, the process may have
opened some files that are available locally on the original node, but are not
available on the new node. It might perform I/O to the console of the original
node, and moving this to the console of the new node would be inappropriate.
Worst of all, it might have network connections with remote processes
somewhere on the Internet, and it is unreasonable to update all of them about
the migration.
To read more: Full details about the Mosix system are available in Barak,
Guday, and Wheeler [4]. More recent publications include [2, 8, 1].
Bibliography
[1] L. Amar, A. Barak, and A. Shiloh, “The MOSIX direct file system access
method for supporting scalable cluster file systems”. Cluster Computing 7(2),
pp. 141–150, Apr 2004.
[8] R. Lavi and A. Barak, “The home model and competitive algorithms for
load balancing in a computing cluster”. In 21st Intl. Conf. Distributed Comput.
Syst., pp. 127–134, Apr 2001.
Appendix E
Using Unix Pipes
One of the main uses for Unix pipes is to string processes together, with the
stdout (standard output) of one being piped directly to the stdin (standard
input) of the next. This appendix explains the mechanics of doing so.
By default, a Unix process has two open files predefined: standard input
(stdin) and standard output (stdout). These may be accessed via file
descriptors 0 and 1, respectively.
[Figure: a process with file descriptor 0 (stdin) connected to the keyboard and file descriptor 1 (stdout) connected to the screen.]
Again by default, both are connected to the user’s terminal. The process can
read what the user types by reading from stdin, that is, from file descriptor 0
— just like reading from any other open file. It can display stuff on the screen
by writing to stdout, i.e. to file descriptor 1.
Using pipes, it is possible to string processes to each other, with the standard
output of one connected directly to the standard input of the next. The idea is
that the program does not have to know if it is running alone or as part of a
pipe. It reads input from stdin, processes it, and writes output to stdout. The
connections are handled before the program starts, that is, before the exec
system call.
To connect two processes by a pipe, the first process calls the pipe system
call. This creates a channel with an input side (from which the process can
read) and an output side (to which it can write). They are available to the
calling process as file descriptors 3 and 4 respectively (file descriptor 2 is the
predefined standard error, stderr).
[Figure: after the pipe system call, the process has file descriptors 0 (stdin, keyboard), 1 (stdout, screen), 3 (the read side of the pipe), and 4 (the write side of the pipe).]
The process then forks, resulting in two processes that share their stdin, stdout,
and the pipe.
[Figure: after the fork, the parent and child share stdin (0), stdout (1), and both sides of the pipe (3 and 4).]
This is because open files are inherited across a fork, as explained on page
135.
Exercise 213 What happens if the two processes actually read from their
shared stdin, and write to their shared stdout? Hint: recall the three tables
used to access open files in Unix.
To create the desired connection, the first (parent) process now closes its
original stdout, i.e. its connection to the screen. Using the dup system call, it
then duplicates the output side of the pipe from its original file descriptor (4)
to the stdout file descriptor (1), and closes the original (4). It also closes the
input side of the pipe (3).
[Figure: after the dup, the parent's stdout (file descriptor 1) refers to the write side of the pipe.]
The second (child) process closes its original stdin (0), and replaces it by the
input side of the pipe (3). It also closes the output side of the pipe (4). As a
result the first process has the keyboard as stdin and the pipe as stdout, and the
second has the pipe as stdin and the screen as stdout. This completes the
desired connection. If either process now does an exec, the program will not
know the difference.
[Figure: the final configuration. The parent reads from the keyboard and writes to the pipe; the child reads from the pipe and writes to the screen.]
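Putting all these steps together, a minimal C sketch of such a pipeline could look as follows; the two programs used (who and wc -l) are just an example, and error handling is kept to a minimum.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch of the sequence described above: the parent's stdout is connected
   to the write side of a pipe, and the child's stdin to the read side,
   before each execs its program (here the equivalent of "who | wc -l"). */
int main(void)
{
    int pfd[2];                    /* pfd[0]: read side, pfd[1]: write side */

    if (pipe(pfd) < 0) { perror("pipe"); exit(1); }

    switch (fork()) {
    case -1:
        perror("fork"); exit(1);

    case 0:                        /* child: will read from the pipe */
        close(0);                  /* free stdin ...                  */
        dup(pfd[0]);               /* ... and put the read side there */
        close(pfd[0]);
        close(pfd[1]);             /* important: see Exercise 214     */
        execlp("wc", "wc", "-l", (char *)NULL);
        perror("execlp wc"); exit(1);

    default:                       /* parent: will write to the pipe   */
        close(1);                  /* free stdout ...                  */
        dup(pfd[1]);               /* ... and put the write side there */
        close(pfd[1]);
        close(pfd[0]);
        execlp("who", "who", (char *)NULL);
        perror("execlp who"); exit(1);
    }
}

Note that each process closes every pipe descriptor it does not use; Exercise 214 below looks at what happens otherwise.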
Exercise 214 What is the consequence if the child process does not close its
output side of the pipe (file descriptor 4)?
Appendix F
5. Session: control over the dialog between end stations (often unused).
6. Presentation: handling the representation of data, e.g. compression and
encryption.
7. Application: an interface for applications providing generally useful
services, e.g. distributed database support, file transfer, and remote login.
In the context of operating systems, the most important are the routing and
transport functions. In practice, the TCP/IP protocol suite became a de-facto
standard before the OSI model was defined. It actually handles the central
part of the OSI stack.
[Figure: the TCP/IP protocol suite relative to the OSI layers: the application protocols cover the OSI application, presentation, and session layers; TCP/UDP corresponds to the transport layer; IP to the network layer; and network access and physical lie below IP.]
To read more: The OSI model is described briefly in Silberschatz and Galvin
[1] section 15.6. Much more detailed book-length descriptions were written
by Tanenbaum [3] and Stallings [2].
Bibliography
[1] A. Silberschatz and P. B. Galvin, Operating System Concepts. Addison-Wesley, 5th ed., 1998.
[2] W. Stallings, Data and Computer Communications. Macmillan, 4th ed., 1994.
[3] A. S. Tanenbaum, Computer Networks. Prentice-Hall, 3rd ed., 1996.
Answers to Exercises
Exercise 1: It can’t. It needs hardware support in the form of a clock interrupt
which happens regularly. On the other hand, it can be argued that the
operating system does not really need to regain control. This is discussed in
Section 11.2.
Exercise 3: System tables should be kept small because their storage comes at
the expense of memory for users, and processing large tables generates
higher overheads. The relative sizes of different tables should reflect their use
in a “normal” workload; if one table is always the bottleneck and becomes
full first, space in the other tables is wasted.
Exercise 4: First and foremost, privileged instructions that are used by the
operating system to perform its magic. And maybe also low-level features
that are hidden by higher-level abstractions, such as block-level disk access
or networking at the packet level.
Exercise 5: Typically no. Instructions are executed by the hardware, and the
operating system is not involved. However, it may be possible to emulate
new instructions as part of handling an illegal-instruction exception.
Exercise 7:
Exercise 10: Some, that are managed by the operating system, are indeed
limited to this resolution. For example, this applies to sleeping for a certain
amount of time. But on some architectures (including the Pentium) it is
possible to access a hardware cycle counter directly, and achieve much better
resolution for application-level measurements.
Exercise 11: Assuming the interrupt handlers are not buggy, only an
asynchronous interrupt can occur. If it is of a higher level, and not blocked, it
will be handled immediately, causing a nesting of interrupt handlers.
Otherwise the hardware should buffer it until the current handler terminates.
Exercise 12: The operating system, because they are defined by the services
provided by the operating system, and not by the language being used to
write the application.
Exercise 13: In principle, this is what happens in a branch: the "if" loads one
next instruction, and the "else" loads a different next instruction.
Exercise 14: If the value is not the address of an instruction, we’ll get an
exception. Specifically, if the value specifies an illegal memory address, e.g.
one that does not exist, this leads to a memory fault exception. If the value is
an address within the area of memory devoted to the program, but happens to
point into the middle of an instruction, this will most probably lead to an
illegal instruction exception.
Exercise 15: Obviously each instruction in the text segment contains the
addresses of its operands, but only those in the data segment can be given
explicitly, because the others are not known at compile time (instead,
indirection must be used). Pointer variables can hold the addresses of other
variables, allowing data, heap, and stack to point to each other. There are also
self-references: for example, each branch instruction in the text segment
contains the address to which to go if the branch condition is satisfied.
The text segment cannot normally be modified. The data, heap, and stack are
modified by instructions that operate on memory contents. The stack is also
modified as a side effect of function call and return.
The contents of system tables is only changed by the operating system. The
process can cause such modifications by appropriate system calls. For
example, system calls to open or close a file modify the open files table.
Exercise 17: The stack is used when this process calls a function; this is an
internal thing that is part of the process’s execution. The PCB is used by the
operating system when it performs a context switch. This is handy because it
allows the operating system to restore register contents without having to
delve into the process’s stack.
Exercise 22: Add a “suspended” state, with transitions from ready and
blocked to suspended, and vice versa (alternatively, add two states
representing suspended-ready and suspended-blocked). There are no
transitions from the running state unless a process can suspend itself; there is
no transition to the running state because resumption is necessarily mediated
by another process.
Exercise 23: Registers are part of the CPU. Global variables should be
shared, so should not be stored in registers, as they might not be updated in
memory when a thread switch occurs. The stack is private to each thread, so
local variables are OK.
Exercise 24: Directly no, because each thread has its own stack pointer. But
in principle it is possible to store the address of a local variable in a global
pointer, and then all threads can access it. However, if the original thread
returns from the function the pointer will be left pointing to an unused part of
the stack; worse, if that thread subsequently calls another function, the
pointer may accidentally point to something completely different.
Exercise 25: Yes, to some extent. You can set up a thread to prefetch data
before you really need it.
Exercise 26: It depends on where the pointer to the new data structure is
stored. If it is stored in a global variable (in the data segment) all the other
threads can access it. But if it is stored in a private variable on the allocating
thread’s stack, the others don’t have any access.
Exercise 27: Yes — they are not independent, and one may block the others.
For example, this means that they cannot be used for asynchronous I/O.
Exercise 33: The system models are quite different. The M/M/1 analysis is
essentially a single server with FCFS scheduling. Here we are talking of
multiple servers — e.g. the CPU and the disk, and allowing each to have a
separate queue.
Exercise 34: If there is only one CPU and multiple I/O devices, including
disks, terminals, and network connections. It can then be hoped that the
different I/O-bound jobs are actually not identical, and will use different I/O
devices. In a multiprocessor, multiple compute-bound jobs are OK.
Exercise 36:
metric           range     preference
response time    > 0       low is better
wait time        > 0       low is better
slowdown         > 1       low is better
throughput       > 0       high is better
utilization      [0, 1]    high is better
Exercise 37: For single-CPU jobs, the same argument holds. For multiple-
CPU jobs, there are cases where an off-line algorithm will prefer using
preemption over run-to-completion. Consider an example of 4 CPUs and 5
jobs: 3 jobs requiring one CPU for one unit of time, a job requiring one CPU
for 4 units, and a job requiring all 4 CPUs for two units of time. With run to
completion, the three short jobs and the 4-unit job start immediately, and the
4-CPU job can only run from time 4 to 6; the average response time is
(1 + 1 + 1 + 4 + 6)/5 = 13/5 = 2.6. If instead the 4-unit job is preempted at
time 1 so that the 4-CPU job can run from time 1 to 3, the response times are
1, 1, 1, 6, and 3, for a lower average of 12/5 = 2.4.
Exercise 40: Just before, so that they run immediately (that is, at the end of
the current quantum). Then only jobs that are longer than one quantum have
to wait for a whole cycle.
Exercise 41: The shorter they are, the better the approximation of processor
sharing. But context switching also has an overhead, and the length of the
time slice should be substantially larger than the overhead to keep the relative
overhead (percentage of time spent on overhead rather than on computation)
acceptably low. In addition to the direct overhead (time spent to actually
perform the context switch) there is an effect on the cache efficiency of the
newly run process: when a process is scheduled to run it will need to reload
its cache state, and suffer from many cache misses. The quantum should be
long enough to amortize this. Typical values are between 0.01 and 0.1 of a
second.
Exercise 42:
En+1 = αTn + (1− α)En
Exercise 43: Yes. When waiting for user input, their priority goes up due to
the exponential aging (recall that lower values reflect higher priority).
Exercise 45: If the process runs continuously, it will gain 100 points a
second, but they will be halved on each subsequent second. The grand total
will therefore approach 200.
Exercise 47: To a point. You can use multi-level feedback among the jobs in
each group, with the allocation among groups being done by fair shares. You
can also redistribute the tickets within a group to reflect changing priorities.
Exercise 49: A process only gets blocked when waiting for some service,
which is requested using a system call, and therefore only happens when
running in kernel mode.
Exercise 51: Ready to run in kernel mode, as it is still in the fork system call.
Exercise 52: It is copied to kernel space and then reconstructed.
Exercise 54:
1. Another process runs first and starts to add another new item after current.
Call this item new2. The other process performs new2->next = current->next
and is interrupted.
2. Now "our" process links new after current without any problems. So current
now points to new.
3. The other process resumes and overwrites current->next, making it point to new2.
new now points to whatever was after current originally, but nothing points to
it.
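For reference, the insertion being discussed boils down to the following two unsynchronized steps (shown here as a minimal C sketch):

struct item { struct item *next; /* ... data ... */ };

/* The unsynchronized insertion discussed above: two separate steps, so two
   processes inserting after the same item can interleave as in the scenario,
   and one of the new items ends up unreachable. */
void insert_after(struct item *current, struct item *new_item)
{
    new_item->next = current->next;   /* step 1 (the other process may run here) */
    current->next  = new_item;        /* step 2: may overwrite the other insert  */
}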
Exercise 55: No. It may be preempted, and other processes may run, as long
as they do not enter the critical section. Preventing preemptions is a sufficient
but not necessary condition for mutual exclusion. Note that in some cases
keeping hold of the processor is actually not desirable, as the process may be
waiting for I/O as part of its activity in the critical section.
Exercise 56: Yes. As the case where both processes operate in lockstep leads
to deadlock, we are interested in the case when one starts before the other.
Assume without loss of generality that process 1 is first, and gets to its while
loop before process 2 sets the value of goingin 2 to TRUE. In this situation,
process 1 had already set goingin 1 to TRUE. Therefore, when process 2 gets
to its while loop, it will wait.
Exercise 58: The problem is that the value of currentticket can go down. Here
is a scenario for three processes, courtesy of Grisha Chockler:
Now all three processes enter their for loops. At this point everything is still
OK: They all see each other's tickets: P1's ticket is (1,1), P2's ticket is (1,2),
and P3’s ticket is (2,3). So P1 enters the critical section first, then P2 and
finally P3.
Now assume that P3 is still in the critical section, whereas P1 becomes
hungry again and arrives at the bakery doorway (the entry section). P1 sees
current ticket=2 so it chooses (2,1) as its ticket and enters the for loop. P3 is
still in the critical section, but its ticket (2,3) is higher than (2,1)! So P1 goes
ahead and we have two processes simultaneously in the critical section.
Exercise 59: Once a certain process chooses its number, every other process
may enter the critical section before it at most once.
Exercise 60:
1. At any given moment, it is possible to identify the next process that will
get into the critical section: it is the process with the lowest ID of those that
have the lowest ticket number.
Note the negation in the while condition: when the compare and swap
succeeds, we want to exit the loop and proceed to the critical section.
Exercise 63: It is apparently wasteful because only one of these processes
will succeed in obtaining the semaphore, and all the others will fail and block
again. The reason to do it is that it allows the scheduler to prioritize them, so
that the highest priority process will be the one that succeeds. If we just wake
the first waiting process, this could happen to be a low-priority process,
which will cause additional delays to a high-priority process which is further
down in the queue.
Exercise 64: It is wrong to create a semaphore for each pair, because even
when the A/B lists semaphore is locked, another process may nevertheless
cause inconsistencies by locking the A/C semaphore and manipulating list A.
The correct way to go is to have a semaphore per data structure, and lock all
those you need: to move an item between lists A and B, lock both the A and
B semaphores. However, this may cause deadlock problems, so it should be
done with care.
Exercise 65: The solution uses 3 semaphores: space (initialized to the number
of available buffers) will be used to block the producer if all buffers are
full, data (initially 0) will be used to block the consumer if there is no data
available, and mutex (initially 1) will be used to protect the data. The
pseudocode for the producer is
while (1) {
    /* produce new item */
    P(space)
    P(mutex)
    /* place item in next free buffer */
    V(mutex)
    V(data)
}
The pseudocode for the consumer is
while (1) {
    P(data)
    P(mutex)
    /* remove item from the first occupied buffer */
    V(mutex)
    V(space)
    /* consume the item */
}
Note, however, that this version prevents the producer and consumer from
accessing the buffer at the same time even if they are accessing different
cells. Can you see why? Can you fix it?
Exercise 66: The previous solution doesn’t work because if we have multiple
producers, we can have a situation where producer1 is about to place data in
buffer[1], and producer2 is about to place data in buffer[2], but producer2 is
faster and finishes first. producer2 then performs V(data) to signal that data is
ready, but a consumer will try to get this data from the first occupied buffer,
which as far as it knows is buffer[1], which is actually not ready yet. The
solution is to add a semaphore that regulates the flow into and out of each
buffer.
Exercise 68: This is exactly the case when a kernel function terminates and
the process is returning to user mode. However, instead of running
immediately, it has to wait in the ready queue because some other process has
a higher priority.
Exercise 75: If they are located at the same place and have the same
capabilities, they may be used interchangeably, and are instances of the same
type. Usually, however, printers will be located in different places or have
different capabilities. For example, “B/W printer on first floor” is a different
resource type than “color printer on third floor”.
Exercise 76: The system is deadlocked if the cycle involves all the instances
of each resource participating in it, or — as a special case — if there is only
one instance of each. This condition is also implied if the cycle contains all
the processes in the system, because otherwise some request by some process
could be fulfilled, and the cycle would be broken.
Exercise 80: A uniform rule leads to deadlock. But with a traffic circle,
giving right-of-way to cars already in the circle prevents deadlock, because it
de-couples the cars coming from the different directions.
Exercise 84: Each resource (lock) has a single instance, and all processes may
want to acquire all the resources. Thus once any one acquires a lock, we must
let it run to completion before giving any other lock to any other process. In
effect, everything is serialized.
Exercise 85: In principle yes — that the rules are being followed. For
example, if the rule is that resources be requested in a certain order, the
system should check that the order is not being violated. But if you are sure
there are no bugs, and are therefore sure that the rules are being followed,
you can leave it to faith.
Exercise 87: It’s more like livelock, as the processes are active and not
blocked. In principle, the operating system can detect it by keeping track of
previous system calls, but this entails a lot of bookkeeping. Alternatively, it is
possible to claim that as far as the operating system is concerned things are
OK, as the processes are active and may eventually actually make progress.
And if quotas are in place, a program may be killed when it exceeds its
runtime quota, thus releasing its resources for others.
Exercise 88: Prevent deadlock by making sure runqueues are always locked
in the same order, by using their addresses as unique identifiers.
Exercise 90: Yes. Locking takes the pessimistic view that problems will
occur and takes extreme precautions to prevent them. Using compare and
swap takes the optimistic approach. If the actual conflict occurs in a small
part of the code, there is a good chance that things will actually work out
without any undue serialization.
Exercise 91: It is only possible if the application does not store virtual
addresses and use them directly, because then the operating system cannot
update them when the mapping is changed.
Exercise 93: If the bounds are not checked, the program may access storage
that is beyond the end of the segment. Such storage may be unallocated, or
may be part of another segment. The less-serious consequence is that the
program may either produce wrong results or fail, because the accessed
storage may be modified nondeterministically via access to the other segment.
The more serious consequence is that this may violate inter-process
protection if the other segment belongs to another process.
Exercise 95: Did you handle the following special cases correctly?
Really good code would also catch bugs such as deallocation of a range that
was actually free.
Exercise 97: Both claims are not true. Assume two free regions of sizes 110
and 100, and a sequence of requests for sizes 50, 60, and 100. First-fit will
pack them perfectly, but best-fit will put the first request for 50 in the second
free area (which leaves less free space), the second request in the first free
area (only option left), and then fail to allocate the third. Conversely, if the
requests are for 100 and 110, best-fit will pack them perfectly, but first-fit
will allocate 100 of 110 for the first request and fail to allocate the second.
Exercise 98: Next fit and its relatives suffer from external fragmentation.
Buddy may suffer from both.
Exercise 99: External only. Internal fragmentation can only be reduced by
using smaller chunks in the allocation.
Exercise 100: No. If the page is larger than a block used for disk access,
paging will be implemented by several disk accesses (which can be organized
in consecutive locations). But a page should not be smaller than a disk access,
as that would cause unwanted data to be read, and necessitate a memory copy
as well.
Exercise 101: This depends on the instruction set, and specifically, on the
addressing modes. For example, if an instruction can operate on two operands
from memory and store the result in a register, then normally three pages are
required: one containing the instruction itself, and two with the operands.
There may be special cases if the instruction spans a page boundary.
Exercise 104: There are other considerations too. One is that paging allows
for more flexible allocation, and does not suffer from fragmentation as much
as contiguous allocation. Therefore memory mapping using pages is
beneficial even if we do not do paging to disk. Moreover, using a file to back
a paging system (rather than a dedicated part of the disk) is very flexible, and
allows space to be used for files or paging as needed. With only primary
memory, such dynamic reallocation of resources is impossible.
Exercise 105: No. With large pages (4MB in this example) you lose
flexibility and suffer from much more fragmentation.
Exercise 106: No: some may land in other segments, and wreak havoc.
Exercise 107: There will be no external fragmentation, but instead there will
be internal fragmentation because allocations are made in multiples of the
page size. However, this is typically much smaller.
Exercise 110: Yes (see [1]). The idea is to keep a shadow vector of used bits,
that is maintained by the operating system. As the pages are scanned, they are
marked in the page table as not present (even though actually they are
mapped to frames), and the shadow bits are set to 0. When a page is accessed
for the first time, the hardware will generate a page fault, because it thinks the
page is not present. The operating system will then set the shadow bit to 1,
and simply mark the page as present. Thus the overhead is just a trap to the
operating system for each page accessed, once in each cycle of the clock
algorithm.
Exercise 113:
Owner: apply special access restrictions.
Permissions: check restrictions before allowing access.
Exercise 115: Remember that files are abstract. The location on the disk is an
internal implementation detail that should be hidden.
Exercise 117: It’s bad because then a program that traverses the file system
(e.g. ls -R) will get into an infinite loop. It can be prevented by only providing a
mechanism to create new subdirectories, not to insert existing ones. In Unix it
is prevented by not allowing links to directories, only to files.
Exercise 118: Some convention is needed. For example, in Unix files and
directories are represented internally by data structures called inodes. The
root directory is always represented by entry number 2 in the table of inodes
(inode 0 is not used, and inode 1 is used for a special file that holds all the
bad blocks in the disk, to prevent them from being allocated to a real file). In
FAT, the root directory resides at the beginning of the disk, right after the file
allocation table itself, and has a fixed size. In other words, you can get
directly to its contents, and do not have to go through any mapping.
Exercise 119: If the directory is big its contents may span more than one
block.
Exercise 121: The advantage is short search time, especially for very
large directories. A possible disadvantage is wasted space, because hash
tables need to be relatively sparse in order to work efficiently.
Exercise 124: Renaming actually means changing the name as it appears in
the directory — it doesn’t have to touch the file itself.
Exercise 130: It is best to choose blocks that will not be used in the future. As
we don’t know the future, we can base the decisions on the past using LRU.
LRU depends on knowing the order of all accesses. For memory, accesses are
done by the hardware with no operating system intervention, and it is too
difficult to keep track of all of them. File access is done using operating
system services, so the operating system can keep track.
Exercise 131: The problem is how to notify the process when the I/O is done.
The common solution is that the conventional read and write functions return
a handle, which the process can then use to query about the status of the
operation. This query can either be non-blocking (poll), returning “false” if
the operation is not finished yet, or blocking, returning only when it is
finished.
Exercise 132: It just becomes a shared memory segment. If they write to this
file, they should use some synchronization mechanism such as a semaphore.
Exercise 133: Sorry — I get confused by large numbers. But the more serious
constraint on file sizes is due to the size of the variables used to store the file
size. An unsigned 32-bit number can only express file sizes up to 4GB. As
64-bit architectures become commonplace this restriction will disappear.
Exercise 136: The superblock is so useful that it is simply kept in memory all
the time. However, in case of a system crash, the copy of the superblock on
the disk may not reflect some of the most recent changes. This can result in
the loss of some files or data.
Exercise 137: Because the file could be copied and cracked off-line. This was
typically done by trying to encrypt lots of dictionary words, and checking
whether the results matched any of the encrypted passwords.
Exercise 138: Yes. A user can create some legitimate handles to see what
random numbers are being used, and try to forge handles by continuing the
pseudo-random sequence from there.
Exercise 139: No. Programs may have bugs that cause security leaks.
Exercise 140: No. A file descriptor is only valid within the context of the
process that opened the file; it is not a global handle to the file.
Exercise 141: The only way to deny access to all the group except one
member is to list all the members individually.
Exercise 146: In general, no. They would be if jobs were run to completion
one after the other. But jobs may overlap (one uses the disk while the other
runs on the CPU), leading to a more complex interaction. As a simple
example, consider two scenarios: in one, jobs run for one minute each, and
arrive at 2-minute intervals; the throughput is then 30 jobs/hour, and the
response time is 1 minute each. But if the 30 jobs arrive all together at the
beginning of each hour, the throughput would still be 30 jobs/hour, but their
average response time would be 15 minutes.
Exercise 148: The access pattern: which addresses are accessed one after the
other.
Exercise 149: Yes, definitely! Those 300 users will not be spread
evenly through the day, but will come in big groups at the end of each study
period.
Exercise 150: If it is a light-tailed distribution, kill a new job. If it is
memoryless, choose one at random. If it is fat-tailed, select the oldest one.
Exercise 151: The integral is

∫_1^∞ x f(x) dx = ∫_1^∞ a x^(−a) dx

For a ≤ 1 this does not converge. For a > 1, the result is

a ∫_1^∞ x^(−a) dx = a/(a − 1)
Exercise 154:
1. The peak load (typically during the day or evening) is much higher than
the average (which also includes low-load periods).
2. No significance.
3. Computational tasks can be delayed and executed at night so as to allow
for efficient interactive work during the day.
Exercise 155: Response time obviously depends on the CPU speed. But it is
also affected by cache size, memory speed, and contention due to other jobs
running at the same time. Network bandwidth is likewise limited by the raw
capabilities of the networking hardware, but also depends on the CPU speed,
memory, and communication protocol being used.
Exercise 156: Measure the time to call a very simple system call many times
in a loop, and divide by the number of times. Improved accuracy is obtained
by unrolling the loop, and subtracting the time for the loop overhead
(measured by an empty loop). The number of iterations should be big enough
to be measurable with reasonable accuracy, but not so big that context
switching effects start having an impact.
Exercise 158: Yes! One simple case
is when too much work arrives at once, and the queue overflows (i.e. there is
insufficient space to store all the waiting jobs). A more complex scenario is
when the system is composed of multiple devices, each of which is not highly
utilized, but their use cannot be overlapped.
Exercise 159: The gist of the argument is as follows. Consider a long interval
T. During this time, N jobs arrive and are processed. Focusing on job i for the
moment, this job spends r_i time in the system. The cumulative time spent by
all the jobs can be denoted by X = Σ_i r_i. Using these notations, we can
approximate the arrival rate as λ = N/T, the average number of jobs in the
system as n̄ = X/T, and the average response time as r̄ = X/N. Therefore

n̄ = X/T = (N/T) · (X/N) = λ · r̄
Exercise 160: This is a 2-D state space, where state (x, y) means that there are
x jobs waiting and being serviced by the CPU, and y jobs waiting and being
serviced by the disk. For each y, the transition (x, y) → (x + 1, y) means a new
job has arrived, and depends on λ. Denote by p the probability that a job
needs service by the disk after being served by the CPU. The alternative is
that it terminates, with probability 1 − p. Transitions (x, y) → (x − 1, y) therefore
occur at a rate proportional to (1 − p)µCPU, and transitions (x, y) → (x − 1, y + 1)
at a rate proportional to pµCPU. Finally, transitions (x, y) → (x + 1, y − 1),
which mean that a disk service has completed, occur at a rate proportional to
µdisk.
[Figure: the resulting 2-D state space, with states (x, y) for x, y = 0, 1, 2, ...; arrivals at rate λ, departures at rate (1 − p)µCPU, moves to the disk queue at rate pµCPU, and disk completions at rate µdisk.]
Exercise 161: If jobs arrive at a rate of λ, the average interarrival time is 1/λ.
Likewise, the service rate is µ, so the average service time is 1/µ. The
utilization is the fraction of time that the system is busy, or in other words,
the fraction of time from one arrival to the next that the system is busy
serving the first arrival:
U = (1/µ) / (1/λ) = λ/µ
which is ρ.
Exercise 162: It is 1/µ, the expected service time, because under low load
there is no waiting.
Exercise 164:
Exercise 169: The program in the boot sector doesn’t load the operating
system directly. Instead, it checks all the disks (and disk partitions) to see
which contain boot blocks for different operating systems, and creates a
menu based on its findings. Choosing an entry causes the relevant boot block
to be loaded, and everything continues normally from there.
Exercise 171: It is the first 16 bits, which are the ASCII codes for ‘#’ and ‘!’.
This allows exec to identify the file as a script, and invoke the appropriate
interpreter rather than trying to treat it as a binary file.
Exercise 172: Yes. For example, they may not block, because this will cause a
random unrelated process to block. Moreover, they should not tamper with
any of the “current process” data.
Exercise 173: Yes, but this may cause portability problems. For example, in
some architectures the stack grows upwards, and in others downwards. The
design described above requires only one function to know about such
details. By copying the arguments to the u-area, they are made available to
the rest of the kernel in a uniform and consistent manner.
Exercise 175: A bit-vector with a bit for each signal type exists in the process
table entry. This is used to note the types of signals that have been sent to the
process since it last ran; it has to be in the process table, because the signals
have to be registered when the process is not running. The handler functions
are listed in an array in the u-area, because they need to run in the context of
the process, and are therefore invoked only when the process is scheduled
and the u-area is available.
Exercise 176: The problem is that you can only register one handler. The
simplistic approach is to write a specific handler for each instance, and
register the correct one at the beginning of the try code. However, this suffers
from overhead to change the handlers all the time when the application is
actually running normally. A better approach is to define a global variable
that will contain an ID of the current try, and write just one handler that starts
by performing a switch on this ID, and then only runs the desired code
segment.
Exercise 177: The parent gets the identity of its child as the return value of
the fork. The child process can obtain the identity of its parent from the
operating system, using the getppid system call (this is not a typo: it stands
for “get parent process id”). This is useful only for very limited
communication by sending signals.
Exercise 180: The area of shared memory will be a separate segment, with its
own page table. This segment will appear in the segment tables of both
processes. Thus they will both share the use of the page table, and through it,
the use of the same pages of physical memory.
Exercise 181: You can save a shadow page with the original contents, and
use it to produce a diff indicating exactly what was changed locally and needs
to be updated on the master copy of the page.
Exercise 182: The first process creates a pipe and writes one byte to it. It then
forks the other processes. The P operation is implemented by reading a byte
from the pipe; if it is empty it will block. The V operation is implemented by
writing a byte to the pipe, thus releasing one blocked process.
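A minimal C sketch of this trick (with error handling omitted) could be:

#include <unistd.h>

/* A counting semaphore built from a pipe, as in the answer above:
   initialization writes one byte per unit of the initial value, P blocks on
   reading a byte when none are available, and V returns a byte. */
int sem_fd[2];

void sem_init_pipe(int value)
{
    pipe(sem_fd);
    for (int i = 0; i < value; i++)
        write(sem_fd[1], "x", 1);
}

void P(void) { char c; read(sem_fd[0], &c, 1); }   /* blocks if pipe is empty */
void V(void) { write(sem_fd[1], "x", 1); }

As noted above, the pipe must be created and primed before forking the processes that will share it.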
Exercise 183: Yes, by mapping the same file into the address spaces of
multiple processes.
Exercise 184: The function can not have side effects in the caller’s context —
it has to be a pure function. But it can actually affect the state at the callee,
for example affecting the bank’s database in the ATM scenario.
Exercise 185: Only msg is typically passed by reference, simply to avoid
copying the data.
Exercise 186: There are two ways. One is to break up long messages into
segments with a predefined length, and reassemble them upon receipt. The
other is to use a protocol whereby messages are always sent in pairs: the first
is a 4-byte message that contains the required buffer size, and the second is
the real message. This allows the recipient to prepare a buffer of the required
size before posting the receive.
Exercise 188: In the original Unix systems, the contents of a pipe were
maintained using only the immediate pointers in the inode, in a circular
fashion. If a process tried to write additional data when all the blocks were
full, the process was blocked until some data was consumed. In modern Unix
systems, pipes are implemented using socket pairs, which also have a
bounded buffer. The other features are simple using counters of processes
that have the pipe open.
Exercise 190: Either there will be no program listening to the port, and the
original sender will be notified of failure, or there will be some unrelated
program there, leading to a big mess.
Exercise 191: Yes, but not on the same port. This is why URLs sometimes
include a port, as in http://www.name.dom:8080, which means send to port
8080 (rather than the default port 80) on the named host.
Exercise 192: It specifies the use of a text file, in which the first few lines
constitute the header, and are of the format “ keyword : value ”; the most
important one is the one that starts with “To:”, which specifies the intended
recipient.
Exercise 193: No. IP on any router only has to deal with the networks to
which that router is attached.
Exercise 194: Oh yes. For example it can be used for video transmission,
where timeliness is the most important thing. In such an application UDP is
preferable because it has less overhead, and waiting for retransmissions to get
everything may degrade the viewing rather than improve it.
Exercise 198: It is also good for only one corrupt bit, but has higher
overhead: for N data bits we need 2√N parity bits, rather than just lg N parity
bits.
Exercise 199: Somewhat more than the round-trip time. We might even want
to adjust it based on load conditions.
Exercise 201: The time for the notification to arrive is the round trip time,
which is approximately 2tℓ. In order to keep the network fully utilized we
need to transmit continuously during this time. The number of bytes sent will
therefore be 2Btℓ, or twice the bandwidth-delay product.
Exercise 202: No — while each communication might be limited, the link can
be used to transfer many multiplexed communications.
Exercise 203: When each ack arrives, two packets are sent: one because the
ack indicates that space became available at the receiver, and the other
because the window was enlarged. Thus in each round-trip cycle the number
of packets sent is doubled.
[Figure: timeline of packets sent from host A to host B; each returning ack releases two new packets.]
Exercise 205: /tmp is usually local, and stores temporary files used or created
by users who logged on to each specific workstation.
Exercise 206: Its contents would become inaccessible, so this is not allowed.
Exercise 207: In principle this optimization is possible, but only among
similar systems. NFS does it one component at a time so that only the client
has to parse the local file name. For example, on a Windows system the local
file name may be \a\b, which the server may not understand if it uses ‘/’ as a
delimiter rather than ‘\’.
Exercise 209: Sure. For example, both origin and destination machines
should belong to the same architecture, and probably run the same operating
system. And there may also be various administrative or security restrictions.
Exercise 211: Maybe it can query the local host name, or check its process
ID. But such queries can also be caught, and the original host name or ID
provided instead of the current one.
Exercise 212: No — demand paging (from the source node) can be used to
retrieve only those pages that are really needed. This has the additional
benefit of reducing the latency of the migration.
Exercise 214: When the parent process terminates, the child process will not
receive an EOF on the pipe as it should, because there is still a process that
can write to the pipe: itself!
Index
abstract machine, 6
abstraction, 74
accept system call, 234
access matrix, 166
access rights, 10, 26
accounting, 26
ACE, 167
ack, 246, 252
acknowledgment, 241
ACL, 167
address space, 25, 29, 95
address space size, 106
address translation
IP, 255
virtual memory, 109
administrative policy, 53
affinity scheduling, 172
aging, 50
ALU, 19
analysis, 191
anti-persistent process, 210
application layer, 274
ARQ, 247
arrival process, 186
arrival rate, 196
associativity, 137
asynchronous I/O, 29, 138, 232
atomic instructions, 73
atomicity, 67, 80
authentication, 260
avoidance of deadlock, 85
backbone, 257
background, 215, 233
bakery algorithm, 69
banker’s algorithm, 85
base address register, 97
Belady’s anomaly, 115
benchmarks, 185
best effort, 157, 245
best-fit, 100
bimodal distribution, 189
bind system call, 233
biometrics, 165
blocked process, 28, 29, 75, 232
booting, 212
bounded buffer problem, 76
brk system call, 96
broadcast, 230, 245, 248
buddy system, 101
buffer, 249
buffer cache, 133, 136
burstiness, 190, 209
bus, 80
busy waiting, 75
C-SCAN, 151
cache, 107
interference, 41, 172
related, 75
customization, 179
CV, 37
avoidance, 85
detection and recovery, 89
ignoring, 89
necessary conditions for, 83
prevention, 84, 89
bimodal, 189
exponential, 187
fat tail, 186–189
fault containment, 180
FCFS, 36, 37, 44
FEC, 247
FIFO, 115, 151, 230
file
hacking, 163
hard link, 129
hardware support
disabling, 77
interrupt priority level, see IPL
interrupt vector, 10, 175, 213
inverted page table, 110
IP, 243
Kerberos, 261
kernel, 8, 217
address mapping, 217
Lamport, 69, 73
layers, 173
light-weight process, 34
link, 129
Linux
scheduling, 52
listen system call, 234
Little’s Law, 197
livelock, 81
load, 196, 203
load balancing, 172, 267
load sharing, 172
loadable modules, 179
local paging, 118
locality, 106, 115
lock, 78
lock-free, 92
locks, 89
log-structured file system, 145
logical volume, 145
login, 215, 261
long-term scheduling, 40
longjmp, 32
lookup, 264
LRU, 116, 137
LWP, 30
multiprocessor
scheduling, 170
synchronization, 79
multiprogramming, 4, 35, 119
multitasking, 35, 40, 177
multithreading, 29
mutex, 74
mutual exclusion, 68
MVS, 7, 180
nack, 246
name service, 225, 233
named pipe, 231
naming, 224, 235
negative feedback, 52
network layer, 274
next-fit, 101
NFS, 263
nice parameter, 50
open system
in queueing, 201
with known interface, 242
ORB, 237
OSI, 274
ostrich algorithm, 89
overflow, 249
overhead, 31, 41, 184
overlapping I/O with computation, 29
overloaded system, 51, 196, 203
P, 74, 179
packet filter, 263
packetization, 240
page fault, 105, 106, 139, 227
page table, 104
paged segments, 113
paging, 103
queueing analysis, 171, 195
queueing network, 194
cycle in, 83
resources, 5, 89
allocation in predefined order, 85, 90
exclusive use, 83
preemption, 83
response ratio, 42
response time, 42, 183, 196, 202
responsiveness, 36, 77
rings, 163
RMI, 228
ROM, 213
root, 163, 215
root directory, 127, 264
root DNS, 255
round robin, 47
round trip time, see RTT
router, 262
routing, 241
routing table, 257
RPC, 228, 265
RTC, 44
RTT, 250, 253
running process, 28
safe state, 85
satellite link, 251
saturation, 51, 196, 203
SCAN, 151
SCSI, 151
second chance, 117
security rings, 163
seek system call, 134, 230
segment table, 99, 113
segments, 98
paged, 113
select system call, 231
self-similarity, 190, 208
semaphore, 74, 179, 227
send, 229
sequential access, 138
server, 29, 177, 232, 233
service rate, 196
service station, 194
session layer, 274
setjmp, 32
seven layers model, 274
shared file pointer, 135
shared libraries, 216
shared memory segment, 95, 98, 112, 226
shared queue, 171
shell, 215
shmat system call, 96
shmget system call, 96
short-term scheduling, 41
Sierpinski triangle, 209
signal (on semaphore), 74
signals, 221
simulation, 192
SJF, 45, 48
sliding window, 251
slowdown, 42, 183
SMTP, 246
socket, 231
socket system call, 233
socketpair system call, 231
soft link, 129
SP, 20, 25
SPEC, 185
SRPT, 46
SRT, 46
stability, 51, 196, 203
stack, 29, 96, 219
starvation, 51, 70
stat system call, 125
state of process, 27, 120
stateful inspection, 263
stateful server, 265
stateless server, 265
static workload, 193
stderr, 272
stdin, 271
stdout, 271
steady state, 193, 202, 203
stub, 228
subroutine call, 20
superuser, 10, 163, 215
swapping, 120
synchronization, 79
system call, 12, 13, 21, 219
failure, 90, 220
Unix
daemons, 233
data structures, 176
devices, 289
fast file system, 144, 152
file access permissions, 168
file system, 144
inode, 140
kernel mode, 59
non-preemptive kernel, 77
process states, 59
process table, 58
scheduling, 49, 52
signals, 221
system calls
accept, 234
bind, 233
brk, 96
connect, 234
dup, 273
exec, 62, 112, 271
fork, 60, 112, 271
kill, 221
listen, 234
open, 131, 164
pipe, 231, 272
read, 132
seek, 134, 230
select, 231
shmat, 96
shmget, 96
socket, 233
socketpair, 231
stat, 125
wait, 59
write, 133
tables for files, 134
u-area, 58, 112, 176
used bit, 110, 117
user ID, 215
user mode, 8
user-level threads, 31
utilization, 38, 43, 184
V, 74
valid bit, 109
virtual address, 96
virtual file system, 265
virtual machine, 180
virtual machine monitor, 181
virtual memory, 103
virtualization, 6, 154, 180
VMware, 8, 181
vnode, 265
wait (on semaphore), 74
wait for condition, 75
wait system call, 59
wait time, 42
wait-free, 92
weighted average, 48, 253
well-known port, 233, 235
window size
flow control, 251, 254
working set definition, 115
Windows, 167
Windows NT, 34, 178
working directory, 128
working set, 115
workload, 185–191
dynamic vs. static, 193
self-similar, 211
worst-fit, 100
write system call, 133