Advanced OS Lecture Notes
Without software, a computer is effectively useless. Computer software controls the use of the hardware (CPU,
memory, disks etc.), and makes the computer into a useful tool for its users.
In most computers, the software can be regarded as a set of layers, as shown in the following diagram. Each layer
hides much of the complexity of the layer below, and provides a set of abstract services and concepts to the layer
above.
More abstract    Users
                 Application Programs
                 System Software
Less abstract    Computer Hardware
For example, the computer’s hard disk allows data to be stored on it in a set of fixed-sized blocks. The system
software hides this complexity, and provides the concept of files to the application software. In turn, an application
program such as a word processor hides the idea of a file, and allows the user to work with documents instead.
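As a rough illustration of this layering, the sketch below builds a toy 'file' abstraction on top of a disk that only understands fixed-size blocks. The names `BlockDisk` and `TinyFileSystem`, and the 16-byte block size, are assumptions made for this example; they are not taken from any real operating system.

```python
BLOCK_SIZE = 16  # assumed block size, chosen small for illustration


class BlockDisk:
    """Hypothetical disk: it can only read or write whole fixed-size blocks."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]

    def write_block(self, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("a disk block must be exactly BLOCK_SIZE bytes")
        self.blocks[index] = data

    def read_block(self, index):
        return self.blocks[index]


class TinyFileSystem:
    """Hypothetical system-software layer: hides blocks behind named files."""

    def __init__(self, disk):
        self.disk = disk
        self.next_free = 0
        self.table = {}  # file name -> (list of block numbers, length in bytes)

    def write_file(self, name, data):
        used = []
        for start in range(0, len(data), BLOCK_SIZE):
            # Pad the last chunk so that every write is a whole block.
            chunk = data[start:start + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
            self.disk.write_block(self.next_free, chunk)
            used.append(self.next_free)
            self.next_free += 1
        self.table[name] = (used, len(data))

    def read_file(self, name):
        used, length = self.table[name]
        raw = b"".join(self.disk.read_block(i) for i in used)
        return raw[:length]  # strip the padding added in write_file


fs = TinyFileSystem(BlockDisk(num_blocks=8))
fs.write_file("notes.txt", b"Files hide the disk blocks underneath.")
```

The caller of `write_file` and `read_file` never sees a block number, which is the point of the layer.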
Computer software can thus be divided into two categories: system software and application programs.
The most fundamental of all system software is the operating system. It has three main tasks to perform.
• The operating system must shield the details of the hardware from the application programs, and thus from
the user.
• The operating system has to provide a set of abstract services to the application programs, instead. When
applications use these abstract services, the operations must be translated into real hardware operations.
• Finally, the resources in a computer (CPU, memory, disk space) are limited. The operating system must act
as a resource manager, optimising the use of the resources, and protecting them against misuse and abuse.
When a system provides multiuser or multitasking capabilities, resources must be allocated fairly and
equitably amongst a number of competing requests.
operating system: (Often abbreviated ‘OS’) The foundation software of a machine, of course; that which
schedules tasks, allocates storage, and presents a default interface to the user between applications.
The facilities an operating system provides and its general design philosophy exert an extremely strong
influence on programming style and on the technical cultures that grow up around its host machines.
1.2 Kernel Mode and User Mode
Textbook reference: Tanenbaum & Woodhull pg 3
Because an operating system must hide the computer’s hardware, and manage the hardware resources, it needs
to prevent the application software from accessing the hardware directly. Without this sort of protection, the
operating system would not be able to do its job.
The computer’s CPU provides two modes of operation which enforce this protection. The operating system runs
in kernel mode, also known as supervisor mode or privileged mode. In kernel mode, the software has complete
access to all of the computer’s hardware, and can control the switching between the CPU modes.
The rest of the software runs in user mode. In this mode, direct access to the hardware is prohibited, and so is any
arbitrary switching to kernel mode. Any attempts to violate these restrictions are reported to the kernel mode
software: in other words, to the operating system itself.
By having two modes of operation which are enforced by the computer’s own hardware, the operating system can
force application programs to use the operating system’s abstract services, instead of circumventing any resource
allocations by direct hardware access.
Before we go on with our introduction to operating systems, we should look at what other system software there
is.
At the top of the operating system are the system calls. These are the set of abstract operations that the operating
system provides to the applications programs, and thus are also known as the application program interface, or
API. This interface is generally constant: users cannot change what is in the operating system.
Above the system calls are a set of library routines which come with the operating system. These are functions and
subroutines which are useful for many programs.
The programs do the work for the user. Systems programs do operating system-related things: copy or move files,
delete files, make directories, etc.
Other, non-system software are the application programs installed to make the computer useful. Applications like
Netscape Navigator, Microsoft Word or Excel are examples of non-system software. These are usually purchased
separately from the operating system.
Of course, in many cases software must be written for a special application, by the users themselves or by
programmers in an organisation.
Regardless of type, all programs can use the library routines and the system calls that come with an operating
system.
• A simple monitor provides few services to the user, and leaves much of the control of the hardware to the
user’s own programs. A good example here is MS-DOS.
• A batch system takes users’ jobs, and segregates them into batches with similar requirements. Each batch
is given to the computer to run. When jobs with similar system requirements are batched together, this
helps to streamline their processing. User interaction is generally lacking in batch systems: jobs are entered,
are processed, and the output from each job comes out at a later time. The emphasis here is on the
computer’s utilisation. An example batch system is IBM’s VM.
• An embedded system usually has the operating system built into the computer, and is used to control
external hardware. There is little or no application software in an embedded system. Examples here are the
Palm Pilot, the electronic diaries that everybody seems to have, and of course the computers built into VCRs,
microwaves, and into most cars.
• A real-time system is designed to respond to input within certain time constraints. This input usually comes
from external sensors, and not from humans. Thus, there is usually little or no user interaction. Many
embedded systems are also real-time systems. An example real-time system is the QNX operating system.
• Finally, a multiprogramming system appears to be running many jobs at the same time, each with user
interaction. The emphasis here is on user response time as well as computer utilisation. Multiprogramming
systems are usually more general-purpose than the other types of operating systems. Example
multiprogramming systems are Unix and Windows NT.
In this course, we will concentrate on multiprogramming systems: these are much more sophisticated and complex
than the other operating system types, and will give us a lot more to look at. We will also concentrate on multi-user
systems: these are systems which support multiple users at the same time.
The services provided by an operating system depend on the concepts around which the operating system was
created; this gives each operating system a certain ‘feel’ to the programmers who write programs for it.
We are talking here not about the ‘look & feel’ of the user interface, but the ‘look & feel’ of the programmer’s
interface, i.e. the services provided by the API.
Although each operating system provides its own unique set of services, most operating systems share a few
common concepts. Let’s briefly take a look at each now. We will examine most of these concepts in detail in later
topics.
• A program is a collection of computer instructions plus some data that resides on a storage medium, waiting
to be called into action.
• A process is a program during execution. It has been loaded into the computer’s main memory, and is taking
input, manipulating the input, and producing output.
Specifically, a process is an environment for a program to run in. This environment protects the running program
against other processes, and also provides the running program with access to the operating system’s services via
the system calls.
2.2 Memory
Part of every computer’s hardware is its main memory. This is a set of temporary storage locations which can hold
machine code instructions and data. Memory is volatile: when the power is turned off, the contents of main
memory are lost.
In current computers, there are usually several megabytes of memory (i.e. millions of 8-bit storage areas). Memory
contents can be accessed by reading or writing a memory location, which has an integer address, just like the
numbers on the letter boxes in a street.
Memory locations often have hardware protection, allowing or preventing reads and writes. Usually, a process
can only read or write to a specific set of locations that have been given to it by the operating system.
The operating system allocates memory to processes as they are created, and reclaims the memory once they
finish. As well, processes can usually request more memory, and also relinquish this extra memory if they no longer
require it.
2.3 Files
Files are storage areas for programs, source code, data, documents etc. They can be accessed by processes, but
don’t disappear when processes die, or when the machine is turned off. They are thus persistent objects.
Operating systems provide mechanisms for file manipulation, such as open, close, create, read and write.
As part of the job of hiding the hardware and providing abstract services, the operating system must map files onto
areas on disks and tapes. The operating system must also deal with files that grow or shrink in size.
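On a POSIX-style system, these operations are visible almost directly through Python's `os` module, whose `os.open`, `os.write`, `os.lseek`, `os.read` and `os.close` functions are thin wrappers around the corresponding system calls. The sketch below is only an illustration; the file name `demo.dat` and the use of a temporary directory are assumptions of the example.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo.dat")  # hypothetical file name

# create + open: the OS hands back a small integer, the file descriptor
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)

os.write(fd, b"hello, file")      # write: the file grows as needed
os.lseek(fd, 0, os.SEEK_SET)      # move back to the start of the file
contents = os.read(fd, 100)       # read: up to 100 bytes
os.close(fd)                      # close: release the descriptor

# The file is persistent: it outlives the descriptor that created it.
still_there = os.path.exists(path)

# Tidy up the example's temporary file and directory.
os.remove(path)
os.rmdir(workdir)
```

Nothing in this code mentions disk blocks, sectors or tracks: the mapping from the file to areas on the disk is left entirely to the operating system.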
Some operating systems don’t enforce any structure on files, or enforce particular file types. Others
distinguish between file types and structures, e.g. Pascal source files, text documents, machine code files, data
files etc.
Most operating systems allow files to have permissions, allowing certain types of file access to authorised users
only.
Directories may exist to allow related files to be collected. The main reason for the existence of directories is to
make file organisation easier and more flexible for the user.
2.4 Windows
Nearly all operating systems these days provide some form of graphical user interface, although in many cases a
command-line interface is also available.
In these operating systems, there are services available to allow processes to do graphical work. Although there
are primitive services such as line and rectangle drawing, most GUIs provide an abstract concept known
as the window.
The window is a logical, rectangular, drawing area. Processes can create one or more windows, of any size. The
operating system may decorate each window with borders, and these may include icons which allow the window
to be destroyed, resized, or hidden.
The operating system must map these logical windows onto the physical display area provided by the video card
and computer monitor. As well, the operating system must direct the input from the user (in the form of keyboard
input, and mouse operations) to the appropriate window: this is known as changing the input focus.
From a programmer’s point of view, an operating system is defined mainly by the Application Program Interface
that it provides, and to a lesser extent what library routines are available.
It follows, therefore, that a number of different operating system products may provide exactly the same
Application Program Interface, and thus appear to be the same operating system to the programmer. The most
obvious example of this is Unix.
Unix is really not a single operating system, but rather a collection of operating systems that share a common
Application Program Interface. This API has now been standardised, and is known as the POSIX standard. Solaris,
HP-UX, SCO Unix, Digital Unix, System V, Linux, Minix and FreeBSD are all examples of Unix operating systems.
What this means is that a program written to run on one Unix platform can be recompiled and will run on another
Unix system. As long as the set of system calls is the same on both systems, the program will run on both
systems.
Another group of operating systems which share a common API are the Windows systems from Microsoft:
Windows CE, Windows 95 or 98 and Windows NT. Although structurally different, a program can be written to run
on all three.
The implementation of an operating system is completely up to its designers, and throughout the course we will
look at some of the design decisions that must be made when creating an operating system.
In general, none of the implementation details of an operating system are visible to the programmer or user: these
details are hidden behind the operating system’s Application Program Interface. The API fixes the “look” of the
operating system, as seen by the programmer.
This API, however, can be implemented by very different operating system designs. For example, Solaris, Linux and
Minix all provide a POSIX API: all three systems have a very different operating system architecture.
We will examine the two most common operating system designs, the monolithic model and the client-server
model.
[Figure: programs in user mode reach the kernel through the high-level language system call interface; the kernel runs in kernel mode. In the client-server model, a client obtains service by sending messages to server processes.]
The messages sent between the clients and servers are well-defined ‘lumps’ of data. These must be copied between
the client and the server. This copying can slow the overall system down, when compared to a monolithic system
where no such copying is required. The servers themselves also may need to intercommunicate.
There must be a layer in the operating system that does message passing. This model can be implemented on top
of a single machine, where messages are copied from a client’s memory area into the server’s memory area. The
client-server model can also be adapted to work over a network or distributed system where the processes run on
several machines.
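A minimal single-machine sketch of this idea is given below, using Python threads and queues to stand in for processes and for the message-passing layer. The message format (a dictionary with `op`, `name` and `reply_to` fields) and the `file_server` function are assumptions made for this illustration, not the design of any particular operating system.

```python
import queue
import threading

# Hypothetical in-memory "files" owned by the server.
FILES = {"motd": bytearray(b"welcome")}


def file_server(inbox):
    """Server loop: receive a request message, send back a reply message."""
    while True:
        msg = inbox.get()
        if msg["op"] == "shutdown":
            break
        if msg["op"] == "read":
            data = FILES.get(msg["name"])
            # bytes(bytearray) builds a new object, mirroring the copy
            # from the server's memory area into the client's.
            reply = {"ok": data is not None,
                     "data": bytes(data) if data is not None else b""}
            msg["reply_to"].put(reply)


def client_read(server_inbox, name):
    """Client side: send a request and block until the reply arrives."""
    reply_box = queue.Queue()
    server_inbox.put({"op": "read", "name": name, "reply_to": reply_box})
    return reply_box.get(timeout=5)


server_inbox = queue.Queue()
server = threading.Thread(target=file_server, args=(server_inbox,))
server.start()

found = client_read(server_inbox, "motd")
missing = client_read(server_inbox, "no-such-file")

server_inbox.put({"op": "shutdown"})
server.join()
```

The client never touches `FILES` directly; everything it learns arrives in a reply message. Replacing the queues with network connections is the step that turns this into the distributed arrangement.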
Machine 1    Machine 2      Machine 3         Machine 4
Client       File server    Process server    Terminal server
Kernel       Kernel         Kernel            Kernel
------------------------- Network -------------------------
(Message from client to server)
Windows NT uses the client-server model, as shown in the diagram below. Most of the subsystems are privileged
processes. The NT executive, which provides the message-passing capability, is known as a monolithic microkernel.
Other client-server based operating systems are Minix and Plan 9.
3 The OS/Machine Interface
Textbook reference: Stallings ppg 9 – 38
The operating system must hide the physical resources of the computer from the user’s programs, and fairly
allocate these resources. In order to explore how this is achieved, we must first consider how the main components
of a computer work.
There are several viewpoints on how a computer works: how its electronic gates operate, how it executes machine
code etc. We will examine the main functional components of a computer and their abstract interconnection. We
will ignore complications such as caches, pipelines etc.
In order to execute an instruction, the CPU must fetch a word (usually) from main memory and decode the
instruction: then the instruction can be performed.
The Program Counter, or PC, in the CPU indicates from which memory location the next instruction will be fetched.
The PC is an example of a register.
Some instructions may cause the CPU to read more data from main memory, or to write data back to main memory.
This occurs when the data needed to perform the operation must be loaded from the main memory into the CPU’s
registers. Of course, if the CPU already has the correct data in its registers, then no main memory access is required,
except to fetch the instruction.
As the number of internal registers is limited, data currently in a register often has to be discarded so that it can
be replaced by new data that is required to perform an instruction. Such a discard is known as a register spill.
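The fetch, decode and execute cycle can be sketched as a toy simulator. The three-instruction machine below (`LOAD`, `ADD`, `HALT`), its tuple-based instruction format and its two general registers are inventions for this example; real instruction sets are far richer.

```python
def run(memory):
    """Run a toy program held in `memory`; return registers and read count."""
    registers = {"PC": 0, "R0": 0, "R1": 0}
    memory_reads = 0

    while True:
        # Fetch: the PC says which memory location holds the next instruction.
        instruction = memory[registers["PC"]]
        memory_reads += 1
        registers["PC"] += 1

        # Decode: split the instruction into an operation and its operands.
        op, *args = instruction

        # Execute.
        if op == "LOAD":        # LOAD reg, addr: needs an extra memory read
            reg, addr = args
            registers[reg] = memory[addr]
            memory_reads += 1
        elif op == "ADD":       # ADD dst, src: operands already in registers
            dst, src = args
            registers[dst] += registers[src]
        elif op == "HALT":
            return registers, memory_reads
        else:
            raise ValueError(f"unknown instruction {instruction!r}")


program = [
    ("LOAD", "R0", 5),     # address 0
    ("LOAD", "R1", 6),     # address 1
    ("ADD", "R0", "R1"),   # address 2
    ("HALT",),             # address 3
    None,                  # address 4 (unused)
    40,                    # address 5: data
    2,                     # address 6: data
]
final_registers, reads = run(program)
```

Note that `ADD` causes no memory access beyond its own fetch, because both operands are already in registers, whereas each `LOAD` costs one fetch plus one data read.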
Main memory is typically composed of Random Access Memory (RAM), which means that the CPU can read from
a memory location, or the CPU can overwrite the contents of a memory location with a new value. When registers
are spilled, the CPU often saves the old register value into a location in the main memory. Main memory is also
used to hold buffers of data which will be written out to permanent storage on the disk.
Parts of main memory may be occupied by Read Only Memory (ROM). Writing to ROM is electrically
impossible, so a write operation has no effect on its contents.
3.3 Buses
The CPU and main memory are connected via three buses:
• The address bus, which carries the address in main memory which the CPU is accessing. Its size indicates
how many memory locations the CPU can access. A 32-bit address bus allows 2^32 address locations, giving 4
Gigabytes of addressable memory.
• The data bus, which carries the data being read from or written to that address. Its size indicates the natural data size
for the machine. A 32-bit machine means that its data bus is 32 bits wide.
• The status bus, which carries information about the memory access and other special system events to all
devices connected to the three buses.
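The relationship between address bus width and addressable memory is just a power of two. As a small worked example, assuming one byte per address (as the 4 Gigabyte figure above implies):

```python
def addressable_bytes(address_bus_bits):
    """Number of distinct addresses, assuming one byte per address."""
    return 2 ** address_bus_bits


MIB = 2 ** 20  # bytes in one megabyte (binary sense)
GIB = 2 ** 30  # bytes in one gigabyte (binary sense)

# Address space for a few bus widths.
sizes = {bits: addressable_bytes(bits) for bits in (20, 24, 32)}
```

Here `sizes[20]` is 1 megabyte, `sizes[24]` is 16 megabytes and `sizes[32]` is 4 gigabytes.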
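[Figure: the CPU, memory, disk drive controller, tape drive controller and UART, each device with its own decoder, all connected to the address, data and status buses. Labels on the status bus include address valid, the read/write bit, CPU halted, reset, assert halt CPU, and several levels of interrupt and DMA lines.]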
• The CPU places the value of the program counter on the address bus.
• It asserts a ‘read’ signal on the read/write line (part of the status bus).
• Main memory receives both the address request and the type of request (read).
• Main memory retrieves the value required from its hardware, and places the value on the data bus. It then
asserts the ‘valid address’ line on the status bus.
• Meanwhile, the CPU waits a period of time watching the ‘valid address’ line.
• If a ‘valid address’ signal is returned, the value (i.e. the next instruction) is loaded off the data bus and into
the CPU.
• If no ‘valid address’ signal is returned, there is an error. The CPU will perform some exceptional operation instead.
Read accesses for data, and the various write requests, are performed in a similar fashion. Note that main memory
needs an address decoder to work out which addresses it should respond to, and which it should ignore. Most
computers don’t have their entire address space full of main memory. This implies that reads or writes to certain
memory locations will always fail.
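The read sequence and the role of the address decoder can be modelled in a few lines. This is a sketch only: the `Memory` and `Bus` classes, the address range, and the use of an exception to stand for the CPU's 'exceptional operation' are assumptions made for illustration.

```python
class BusError(Exception):
    """Raised when no device asserts 'valid address' for an access."""


class Memory:
    """Main memory with a decoder for the address range [base, base + size)."""

    def __init__(self, base, size):
        self.base = base
        self.cells = [0] * size

    def decodes(self, address):
        # The decoder: respond only to addresses inside our own range.
        return self.base <= address < self.base + len(self.cells)

    def read(self, address):
        return self.cells[address - self.base]

    def write(self, address, value):
        self.cells[address - self.base] = value


class Bus:
    """Connects the CPU to every device; each device decodes the address."""

    def __init__(self, devices):
        self.devices = devices

    def read(self, address):
        for device in self.devices:
            if device.decodes(address):       # device asserts 'valid address'
                return device.read(address)   # value placed on the data bus
        raise BusError(f"no device responded to address {address}")


ram = Memory(base=0, size=1024)
ram.write(100, 0x2A)
bus = Bus([ram])

instruction = bus.read(100)   # inside RAM: succeeds
try:
    bus.read(5000)            # a hole in the address space
    hole_failed = False
except BusError:
    hole_failed = True
```

The failed read at address 5000 corresponds to the case where no 'valid address' signal comes back.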
Here are some example computers and their address & data bus sizes:
Computer       Address Bus   Data Bus
IBM XT         20-bit        8-bit
IBM AT         24-bit        16-bit
486/Pentium    32-bit        32-bit
68030/040      32-bit        32-bit
Sparc ELC      32-bit        32-bit
DEC Alpha      64-bit        64-bit
3.4 Peripheral Devices
Textbook reference: Tanenbaum & Woodhull ppg 154 – 157
The computer must be able to do I/O, so as to store data on long-term storage devices, and to communicate with
the outside world. However, we don’t want the CPU to be burdened with the whole task of doing I/O, i.e. controlling
every electrical & mechanical aspect of every peripheral. Therefore, most devices have a device controller which
takes device commands from the CPU, performs them on the actual device, and reports the results (and any data)
back to the CPU.
The CPU communicates with the device controllers via the three buses. Therefore, the controllers usually appear
to be memory locations from the CPU’s point of view. Each device controller has a decoder which tells the device
if the asserted address belongs to that device. If so, parts of the address and the data are transferred to or from the device.
Usually, this means that the device controller is mapped into the computer’s address space. And because main
memory has its own decoder, we can say that the locations in main memory are also mapped into the computer’s
address space:
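[Figure: map of the computer’s address space from 0 to 1,000,000. RAM starts at address 0; the video, UART and disk controllers and the ROM occupy regions higher up. The I/O decoding addresses marked on the figure are 730,000, 800,000, 800,032, 800,100, 800,200 and 900,000.]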
In the diagram above, the UART (serial I/O) controller decodes addresses 800,000 to 800,031, which is 32 addresses.
It ignores addresses outside this region, and the decoder passes values 0 to 31 to the controller when the address
is inside the region.
Assume this UART uses the following addresses:
Decoded    Real       Use of this location      Format of this location
Location   Location
0          800,000    Output format             Speed (4), Parity (2), Stop bits (2)
1          800,001    Output status register
3          800,003    Input format              Speed (4), Parity (2), Stop bits (2)
4          800,004    Input status register
These special addresses are known as device registers, and are similar to the registers inside a CPU. To output a
character, first the operating system must set up the output characteristics:
• It asserts ‘write’ on the r/w line.
• It waits a period of time.
• If no ‘valid address’ signal is returned, there is an error.
Then, to output a character, the character is sent to 800,002 as above. The UART latches the character, and it is
transmitted over the serial line at the bit rate set in the output format.
Input from a device is more complicated. There are three types: polling, interrupts, and direct memory access
(DMA). We will leave DMA until later.
With polling, the UART leaves the input character at the address 800,005 and an indicator that a character has
arrived at the address 800,004. The CPU must periodically scan (i.e. read) this address to determine if a character
has arrived. Because of this periodic checking, polling makes multitasking difficult if not impossible: the frequent
reading cannot be performed by the operating system if a running program has sole use of the CPU.
poll: v.,n. 1. [techspeak] The action of checking the status of an input line, sensor, or memory location
to see if a particular external event has been registered. 2. To repeatedly call or check with someone: “I
keep polling him, but he’s not answering his phone; he must be swapped out.”
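A polling loop for the UART described above might look like the following sketch. The `FakeUart` class stands in for the real hardware. Two details are assumptions that go beyond what these notes specify: that a non-zero value at 800,004 means 'a character is waiting', and that reading 800,005 clears that indicator.

```python
UART_BASE = 800_000
INPUT_STATUS = UART_BASE + 4   # indicator that a character has arrived
INPUT_DATA = UART_BASE + 5     # the character itself


class FakeUart:
    """Stand-in for the device: characters 'arrive' from a prepared list."""

    def __init__(self, incoming):
        self.incoming = list(incoming)

    def read(self, address):
        if address == INPUT_STATUS:
            return 1 if self.incoming else 0
        if address == INPUT_DATA:
            return self.incoming.pop(0)   # assumed: reading clears the status
        raise ValueError("address not decoded by this UART")


def poll_for_input(uart, max_polls):
    """Repeatedly check the status register; collect characters when present."""
    received = []
    polls = 0
    while polls < max_polls:
        polls += 1
        if uart.read(INPUT_STATUS):
            received.append(uart.read(INPUT_DATA))
    return "".join(received), polls


text, polls_used = poll_for_input(FakeUart("ok"), max_polls=10)
```

In this run only the first two polls find a character; the other eight find nothing, which is the kind of wasted CPU time that makes polling unattractive.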
3.5 Interrupts
An alternative way for the operating system to find out when input has arrived, or when output has been
completed, is to use interrupts. If a computer uses interrupts for I/O operations, a device will assert an interrupt
line on the status bus when an I/O operation has been completed. Each device has its own interrupt line.
For example, when a character arrives, the UART described above asserts its interrupt line. This sends a signal in
to the CPU along the status bus. If the interrupt has priority greater than any other asserted interrupt line, the CPU
will stop what it is doing, and jump to an interrupt handler for that line. This interrupt handler is a section of
machine code placed at a fixed location in main memory.
Here, the interrupt handler will collect the character, do something with it and then return the CPU to what it was
doing before the handler started, i.e. the program running before the interrupt came in. Generally speaking,
interrupt handlers are a part of the operating system.
Interrupts are prioritised. The CPU is either running the current program, or dealing with the highest interrupt sent
in from devices along the status bus. If an interrupt’s priority is too low, then the interrupt will remain asserted
until the other interrupts finish, and the CPU can handle it. Alternatively, if a new interrupt has a priority higher
than the one currently being handled by the CPU, then the CPU diverts to the new interrupt’s handler, just as it did
when it left the running program.
The CPU has an internal status register which holds the value of the current interrupt being handled. Normal
programs run at a level below the lowest interrupt priority.
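The priority rule can be sketched as a small function that decides whether the CPU should divert to a newly asserted interrupt. The numeric priority levels and the device names below are assumptions for illustration; level 0 stands for a normal program, which runs below every interrupt.

```python
def next_level(current_level, asserted):
    """Return the priority level the CPU should run at next.

    current_level: priority held in the CPU's status register
                   (0 = normal program, below every interrupt).
    asserted:      dict mapping device name -> priority of its asserted line.
    """
    if not asserted:
        return current_level
    highest = max(asserted.values())
    # Divert only if the new interrupt outranks what the CPU is doing now;
    # otherwise the line simply stays asserted until the CPU is free.
    return highest if highest > current_level else current_level


running_program = 0
after_uart = next_level(running_program, {"uart": 3})        # divert to level 3
after_disk = next_level(after_uart, {"uart": 3, "disk": 5})  # disk outranks uart
after_mouse = next_level(after_disk, {"mouse": 2})           # mouse must wait
```

The sketch leaves out the return path: when a handler finishes, the CPU drops back to whatever it was doing before, and any lower-priority line that is still asserted is serviced then.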
0    Reset Handler       1,000,870
1    IRQ 1 - Keyboard    1,217,306
2    IRQ 2 - Mouse       1,564,988
If the operating system is protected, how does a program ask for services from the OS? User programs can’t call
functions within the operating system’s memory, because they can’t see those areas of memory.
A special user-mode machine instruction, known as a TRAP instruction, causes an exception, switches the CPU
mode to kernel mode, and starts the handler for the TRAP instruction.
To ask for a particular service from the operating system, the user program puts values in machine registers to
indicate what service it requires. Then it executes the TRAP instruction, which changes the CPU mode to privileged
mode, and moves execution to the TRAP handler in the operating system’s memory.
The operating system checks the request, and performs it, using a dispatch table to pass control to one of a set of
operating system service routines.
[Figure: user programs 1 and 2 run in user mode; a kernel call enters the operating system, where the dispatch table leads to a service procedure.]
Figure 1-16. How a system call can be made: (1) User program traps
to the kernel. (2) Operating system determines service number
required. (3) Operating system calls service procedure. (4) Control is
returned to user program.
When the service has been performed, the operating system returns control to the program, lowering the
privileges back to user-mode. Thus, the job only has access to the privileged operating system via a single, well-
protected entry point.
This mechanism for obtaining operating system services is known as a system call. The set of available system calls
is known as that operating system’s Application Program Interface or API.
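The trap-and-dispatch mechanism can be mimicked in ordinary code. In the sketch below the service numbers, the `registers` dictionary and the two service routines are all hypothetical. A real TRAP is a hardware instruction; it is only imitated here by a function call that flips a mode flag.

```python
cpu = {"mode": "user"}


def sys_getpid(registers):
    return 1234                       # made-up process id


def sys_add(registers):
    return registers["arg0"] + registers["arg1"]


# Dispatch table: service number -> operating system service routine.
DISPATCH_TABLE = {0: sys_getpid, 1: sys_add}


def trap(registers):
    """Imitates TRAP: enter kernel mode, dispatch, return to user mode."""
    cpu["mode"] = "kernel"            # (1) user program traps to the kernel
    try:
        number = registers["service"]          # (2) which service is wanted?
        routine = DISPATCH_TABLE.get(number)
        if routine is None:
            return -1                 # the OS checks the request and rejects it
        return routine(registers)     # (3) call the service procedure
    finally:
        cpu["mode"] = "user"          # (4) control returns in user mode


result = trap({"service": 1, "arg0": 40, "arg1": 2})
bad = trap({"service": 99})
```

The user program can only name a service number; it never chooses which address inside the operating system gets executed. That is what makes the single entry point easy to protect.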
trap: 1. n. A program interrupt, usually an interrupt caused by some exceptional situation in the user
program. In most cases, the OS performs some action, then returns control to the program. 2. vi. To
cause a trap. “These instructions trap to the monitor.” Also used transitively to indicate the cause of the
trap. “The monitor traps all input/output instructions.”
The history and development of operating systems is described in some detail in the textbook. We will only cover
the highlights here.
Most early programs were numerical calculations, and very CPU-intensive. The early 1950s saw the introduction of
punched cards to speed programming.
bare metal: n. 1. New computer hardware, unadorned with such snares and delusions as an operating
system, a high-level language, or even an assembler. Commonly used in the phrase ‘programming on
the bare metal’, which refers to the arduous work needed to create these basic tools for a new machine.
Real bare-metal programming involves things like building boot proms and BIOS chips, implementing
basic monitors used to test device drivers, and writing the assemblers that will be used to write the
compiler back ends that will give the new machine a real development environment.
Stone Age: n., adj. 1. In computer folklore, an ill-defined period from ENIAC (ca. 1943) to the mid-1950s;
the great age of electromechanical dinosaurs, characterised by hardware such as mercury delay lines
and/or relays.
[Figure, panels (a) to (f): tape drives with the input tape, system tape and output tape.]
The CPU was thus less idle, as the tape could be read from and written to faster than the punched cards. However, the CPU was
still mostly idle when reading from or writing to the tape.
Why is this? Reading a piece of data from main memory is very quick, because it is completely electronic. Reading
the same piece of data from tape is much slower, as the tape is a mechanical device. Punched cards are even
slower.
The first basic operating system performed batch operations: for each job on the input tape, it would load the job, run it,
send any output to a second tape, and move onto the next job. Because the operating system must keep its
instructions in main memory to work, it had to be protected to prevent itself from being destroyed by the jobs
that it was loading and running. In this generation, the jobs were mainly scientific and engineering calculations.
Bronze Age: n. 1. Era of transistor-logic, pre-ferrite-core machines with drum or CRT mass storage.
4.3 3rd Generation: 1965 – 1980
In the 3rd Generation, integrated circuits (ICs) made machines smaller and more reliable, although they were
initially more expensive. Companies found they outgrew their machines, but each model had different batch
systems. This meant that each change of computer system involved a recoding of jobs and retraining of operators.
To alleviate these problems, IBM decided to create a whole family of machines, each with a similar hardware
architecture, with a single operating system that ran on every family member. This was the System/360 family, and
OS/360. OS/360 ran to millions of lines of code, with a constant number of bugs, even in new system releases.
Computer usage moved away from purely scientific work to business work, e.g. inventories. These types of jobs were
more I/O intensive (lots of reading/writing on tape). The CPU became idle waiting for the tape while processing
these I/O intensive jobs, and so CPU utilisation dropped again. The solution to the problem of CPU utilisation on
I/O jobs was multiprogramming:
This could give over 90% CPU utilisation, but with some overhead caused by the switching between jobs. To
improve performance further, disks were used to cache/spool jobs (i.e. both the programs to execute and their
associated data). Disks were faster to access than tape, especially for random access where the data is accessed in
no particular order from the disk.
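To get a feel for the utilisation figure, one common textbook approximation (not derived in these notes) assumes that each job spends a fraction p of its time waiting for I/O, independently of the others. With n jobs in memory, the CPU is idle only when all n are waiting, so utilisation is roughly 1 - p^n. The value p = 0.8 below is an arbitrary illustration, not a measurement.

```python
def cpu_utilisation(io_wait_fraction, jobs_in_memory):
    """Approximate utilisation: 1 - p**n (assumes independent jobs)."""
    return 1.0 - io_wait_fraction ** jobs_in_memory


def jobs_needed(io_wait_fraction, target):
    """Smallest number of jobs in memory that reaches the target utilisation."""
    if not (0.0 <= io_wait_fraction < 1.0 and 0.0 <= target < 1.0):
        raise ValueError("fractions must lie in the range [0, 1)")
    jobs = 1
    while cpu_utilisation(io_wait_fraction, jobs) < target:
        jobs += 1
    return jobs


single = cpu_utilisation(0.8, 1)   # one I/O-heavy job alone
needed = jobs_needed(0.8, 0.9)     # jobs required for 90% utilisation
```

Under these assumptions a single job that waits for I/O 80% of the time keeps the CPU only 20% busy, and eleven such jobs must be held in memory before utilisation passes 90%. The model ignores the switching overhead mentioned above.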
These systems still suffered from slow job turnaround: the users had to wait for a job to run to termination (or
crash) before they could do any reprogramming.
Iron Age: n. In the history of computing, 1961–1971 — the formative era of commercial mainframe
technology, when big iron dinosaurs ruled the earth. These began with the delivery of the first PDP-1,
coincided with the dominance of ferrite core, and ended with the introduction of the first commercial
microprocessor (the Intel 4004) in 1971.
4.4 Timesharing
A method of overcoming the slow job turnaround was introduced at this point: timesharing. On a timesharing
system, the operating system swapped very quickly between jobs (even if the current job was still using the CPU),
allowing input/output to come from users on terminals instead of tape or punched cards.
This switching away from a job using the CPU is known as pre-emption, and is the hallmark of an interactive
multiprogramming operating system. The Multics operating system was designed at this time, to support
hundreds of users simultaneously. Its design was good, and introduced many new ideas, but was very expensive
hardware-wise, and fizzled out with the introduction of minicomputers.
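Pre-emption can be sketched as a round-robin loop in which every job is taken off the CPU after a fixed time slice, whether or not it has finished. The slice length and job lengths below are arbitrary values chosen for illustration, and round-robin is just one simple scheduling policy assumed for this example.

```python
from collections import deque


def round_robin(jobs, time_slice):
    """jobs: dict of name -> CPU time still needed. Returns the run order."""
    ready = deque(jobs.items())
    trace = []
    while ready:
        name, remaining = ready.popleft()
        ran = min(time_slice, remaining)
        trace.append((name, ran))
        if remaining > ran:
            # Pre-empted: the job still wants the CPU but must wait its turn.
            ready.append((name, remaining - ran))
    return trace


trace = round_robin({"A": 5, "B": 2, "C": 3}, time_slice=2)
```

Job A needs the most CPU time, yet B and C both get to run before A finishes. This is why every user at a terminal sees a prompt response.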
Multics: n. [from “MULTiplexed Information and Computing Service”] A late 1960s timesharing operating
system co-designed by a consortium including MIT, GE, and Bell Laboratories. Very innovative for its
time — among other things, it introduced the idea of treating all devices uniformly as special files. All the
members but GE eventually pulled out after determining that second-system effect had bloated Multics
to the point of practical unusability. One of the developers left in the lurch by the project’s breakup was
Ken Thompson, a circumstance which led directly to the birth of UNIX.
4.5 3rd Generation – Part 2
Minicomputers arrived, introduced with the PDP-1 in 1961. These machines were only 5% the cost of mainframes,
but gave about 10% – 20% of their performance. This made minicomputers affordable to individual departments,
not just to large companies.
Although Multics died, many of its ideas were passed on to Unix. Unix was mostly written in a high-level language
called ‘C’, thus aiding ports to new hardware. In fact, it was one of the first portable operating systems.
Both minicomputers and mainframes got faster/cheaper and minis picked up more mainframe operating system
ideas as time went on.
Microcomputers brought computers to individuals. They began by using the 1st generation operating system ideas,
but have been catching up ever since. Networking was introduced, allowing machines to be connected over
small/large distances. Operating systems had functionality added to allow the files (or other services) of machines
to be accessible by other machines over the network. Such systems are known as network operating systems.
In another approach to using the connectivity provided by a network, distributed operating systems were created.
These make all the machines on a network appear to be part of one single machine.
Because of the immense power of the new machines, the emphasis on software design shifted away from system
performance and efficiency to user interface and applications.
5 Processes
Textbook reference: Stallings ppg 107 – 147; Tanenbaum & Woodhull ppg 47 – 52
5.1 What is a Process?
The primary function of an operating system is to provide an environment where user programs can run. The
operating system must provide a framework for program execution, a set of services (file management etc.) and
an interface to these services. On a multiprogramming system, the operating system must also ensure that the
many programs loaded into memory do not interfere with each other.
This restricted form of program execution, with access to the services of the operating system, is known as a
process. Specifically, a process is a sequence of computer instructions executing in an address space, with access
to the operating system services. An address space is an area of main memory to which the process has exclusive
access.
The process consists of: the machine code instructions in main memory, a data area and a stack in main memory,
and the CPU’s registers which the process uses while it is using the CPU.
The data area holds the process’ global data, e.g the internal memory representation of a word processor
document. The stack usually holds the process’ local data e.g local variables in functions and subroutines, plus
program counters for subroutine returns.
[Figure: memory map of a process. From high to low addresses: stack (S), data (D), machine code (C).]
Memory map is another term for ‘address space’. The memory map allows the stack to grow down and the data
area to grow up: in other words, the process is able to use more main memory than the operating system initially
allocates. Areas above or below the process’ memory are invalid because of protections set by the operating
system; they usually contain other processes. On some machine architectures, such as segmented architectures,
the three sections are separated.
[Figure: process states on a simple monitor. States and events shown: process created, Running, Polling, process dies.]
The word ‘idle’ is a misnomer; because there is only one process loaded on a simple monitor, either the operating
system or the process itself is repeatedly querying a device to see if the I/O has completed. This technique is
called polling. As you would imagine, polling is very wasteful of the CPU resources.
A batch system may have many processes (jobs) in memory simultaneously, but when one is started, it runs to
completion before the next can begin running. Polling is still used by the operating system to detect the completion
of I/O.
[Figure: process states on a batch system. States and events shown: process batched, Ready, selected to run, Running, waiting for I/O (Polling), I/O completed, process dies.]
To reduce the waste of the CPU resource by polling, on some batch systems the processes run to block. Instead of
waiting for I/O, the operating system blocks the process (puts it to sleep), and selects another ready-to-run process
to run.
[Figure: process states on a run-to-block system. States and events shown: process loaded in memory, Ready, selected to run, Running, Blocked, process dies.]
When the I/O is finished, the operating system finds which sleeping process requested the I/O, delivers any
incoming data (if required), and moves the process from the blocked state to the ready state, as it can now execute
more instructions.
As we saw in a previous lecture, in this situation an interrupt from the hardware would have told the operating
system that the I/O operation had finished.
There can be more than one process in the ready state, and more than one process in the blocked state, but as
there is only one CPU, there can only be one process in the running state. On multiprocessor systems, there can
be up to one running process per CPU.
Question: Why is this type of system unsuitable for interactive use? Think about this for a minute or two before
going on to the next paragraph.
It is unsuitable for interactive use for the following reason. Imagine a word processor process which was blocked
waiting for some keyboard input from the user. Once the user types a key, an interrupt informs the operating
system that some input is available. The operating system delivers this to the word processor, and moves it to the
ready state. At this point, there is no guarantee that this process will ever run. If a CPU-bound process (i.e one that
never does any I/O) is currently running, then it will never relinquish the CPU and allow the word processor to run.
The solution here is simple. On an interactive multiprogramming system, the system may pre-empt a running
process to allow fast and fair use of the computer. Periodically, the operating system takes the running process off
the CPU, puts it back on the ready queue, and selects another ready process to move to the running state. Thus,
although the CPU is running only one process at any time, the quick switching between processes makes it appear
as if many are running simultaneously.
[Figure: process states on a pre-emptive system. States and events shown: process batched, Ready, selected to run, Running, pre-empted, blocks for I/O, Blocked, I/O occurs, process dies, process killed (shown twice).]
The characteristic of an interactive operating system is the transition from ‘Running’ to ‘Ready’.
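The state diagram can be written down as a small transition table. The following is only a sketch: the state and event names are taken from the diagram labels, and attaching the two 'killed' arrows to the Ready and Blocked states is an assumption.

```python
from enum import Enum

class State(Enum):
    READY = "ready"
    RUNNING = "running"
    BLOCKED = "blocked"
    DEAD = "dead"

# Legal transitions on a pre-emptive system. Event names are illustrative.
TRANSITIONS = {
    (State.READY, "selected"): State.RUNNING,
    (State.RUNNING, "pre-empted"): State.READY,      # the interactive-only arrow
    (State.RUNNING, "blocks for I/O"): State.BLOCKED,
    (State.BLOCKED, "I/O occurs"): State.READY,
    (State.RUNNING, "dies"): State.DEAD,
    (State.READY, "killed"): State.DEAD,             # assumed source state
    (State.BLOCKED, "killed"): State.DEAD,           # assumed source state
}

def next_state(state, event):
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event!r}") from None
```

Removing the `(State.RUNNING, "pre-empted")` entry turns the table back into the run-to-block system described earlier: a blocked process can never go straight to Running, and a running process only leaves the CPU by blocking or dying.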
5.7 Process Control Blocks
Textbook reference: Tanenbaum & Woodhull ppg 52 – 53
When the operating system moves a process from the running state, it must ensure that the process’ bits and
pieces are kept so it can be restarted. This means that the operating system must:
All of this information is stored in the Process Control Block in the operating system. There is one Process Control
Block per process, and it usually contains several sections:
– Register copies
– Information about the process’ memory areas
• Statistics section:
The above is an example only of the possible types of information that must be recorded. Each operating system
stores different things in its PCBs. The statistics section is used to aid the operating system in deciding how/when
to allocate resources to the process, as will be seen in future lectures. An example PCB (from Minix) is:
Time of next alarm Real uid
The operation of changing running processes is known as a context switch, and often has a significant overhead
(typically hundreds of machine instructions). This goes against the philosophy of operating system design which
tries to minimise the number of instructions not used by user processes.
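To make the idea concrete, here is a toy model of a PCB and a context switch. It is a sketch under stated assumptions, not the layout of any real operating system: the field names are hypothetical, and the CPU is modelled as a plain dictionary of register values.

```python
from dataclasses import dataclass, field

@dataclass
class PCB:
    pid: int
    registers: dict = field(default_factory=dict)     # saved register copies
    memory_areas: dict = field(default_factory=dict)  # where code, data and stack live
    cpu_ticks_used: int = 0                           # part of the statistics section

def context_switch(cpu, old, new):
    """Save the CPU registers into old's PCB, then load new's saved copies."""
    old.registers = dict(cpu)   # snapshot, so later CPU changes do not alter it
    cpu.clear()
    cpu.update(new.registers)
```

Even in this toy version every register is copied twice per switch, which hints at why a real context switch costs hundreds of instructions.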
6 Process Scheduling
Textbook reference: Stallings ppg 391 – 425; Tanenbaum & Woodhull ppg 82 – 93
6.1 Introduction
On a batch or multiprogramming system, there is usually a queue of processes blocked waiting for I/O, and a queue
of processes waiting to run. At any moment either zero or one process is actually running. Context switching between
processes in the ready queue and the running position takes the operating system some time, thus wasting
resources that could be used by user processes. Operating systems schedule processes in order to try and achieve
certain goals:
However, not all of these can be satisfied at the same time, because the CPU is a finite resource. For example, to
satisfy goal d), you would process only batch jobs and never do any pre-emptions, to minimise switching time; this
unfortunately would violate goal c).
Some schedulers use run to block/completion scheduling: wait until a process blocks or dies before rescheduling.
This is only useful on batch systems, as the running process may not block for days if it is CPU-bound.
Other schedulers use pre-emptive scheduling, and suspend the running process after a period of time known as
the process’ quantum or timeslice. This allows other processes to run, even if the original process hadn’t blocked.
Pre-emption is needed for interactive systems.
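A minimal simulation of pre-emptive scheduling with a fixed quantum and a single ready queue is sketched below. It assumes that processes never block and ignores context-switch overhead; the function name and the trace format are illustrative only.

```python
from collections import deque

def round_robin(bursts, quantum):
    """bursts maps a process name to the CPU ticks it still needs.

    Returns the schedule as a list of (name, ticks_run) slices.
    """
    ready = deque(bursts.items())
    trace = []
    while ready:
        name, remaining = ready.popleft()
        run = min(quantum, remaining)
        trace.append((name, run))
        if remaining > run:
            # Quantum expired: pre-empt and rejoin the back of the ready queue.
            ready.append((name, remaining - run))
    return trace
```

With `round_robin({"A": 5, "B": 2}, 2)`, process B finishes after A's first timeslice instead of waiting for all five ticks of A, which is the behaviour an interactive system needs.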
[Figure: scheduling queues: Ready Queue, Running Process, Blocked Pool.]
This scheduling algorithm is easy to implement, but has one big disadvantage: one CPU-bound process will hog the
CPU, thus reducing the overall throughput.
How can this happen? It should be obvious that all processes will get equal opportunity to enter the running state.
However, once there, CPU-bound processes will use up all of their timeslice, whereas processes that do lots of I/O
will block, giving up the CPU to another process. They must then cycle through the blocked state before they can
get back to the ready state. Small I/O bound processes therefore tend to wait in the ready state for the CPU-bound
process to give up the CPU.
[Figure: three ready queues, labelled Priority 3, Priority 2 and Priority 1.]
When choosing the next process to move to the running state, the operating system schedules the process from
the highest priority non-empty queue. This can lead to starvation of lower priorities if one queue has many CPU
bound processes, as the lower queues only get CPU time when the upper queues are empty (no processes or
processes blocked).
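The selection rule is short enough to sketch directly. The representation used here, a list of queues ordered from highest to lowest priority, is an assumption made for the example.

```python
from collections import deque

def pick_next(queues):
    """Return the next process from the highest-priority non-empty queue.

    `queues` is a list of deques, highest priority first. Returns None
    when every queue is empty.
    """
    for queue in queues:
        if queue:
            return queue.popleft()
    return None
```

Starvation is easy to see in this form: as long as the first queue keeps being refilled, the loop never reaches the later ones.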
Every 10 milliseconds (i.e every timeslice), the running process’ Recent CPU Usage is incremented to a maximum
value of 127. Every second, all processes’ CPU Usage values are halved.
Thus, a process which uses little CPU has its CPU Usage repeatedly halved, and the Priority value tends towards
0. A process which uses a lot of CPU keeps accumulating Recent CPU Usage and hence maintains a Priority value
well above 0.
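The behaviour of the usage counter can be checked with a few lines of simulation. This sketch models only the counter itself (incremented per tick up to 127, halved once a second); it does not attempt the priority calculation, and rounding the halving down to an integer is an assumption.

```python
MAX_USAGE = 127   # ceiling on Recent CPU Usage given in the text

def usage_after(ticks_on_cpu_each_second):
    """Return Recent CPU Usage after the given history.

    Each list element is the number of 10 ms ticks the process spent
    on the CPU during one second (0 to 100).
    """
    usage = 0
    for ticks in ticks_on_cpu_each_second:
        usage = min(MAX_USAGE, usage + ticks)  # one increment per tick, capped
        usage //= 2                            # once-a-second halving
    return usage
```

A process that runs flat out (`[100] * 5`) settles at 63, while one that ran for a single second and then slept (`[100, 0, 0, 0]`) has already decayed to 6.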
Here is a snapshot of a set of processes running on a Unix machine:
load averages: 0.29, 0.11, 0.11
31 processes: 2 running, 29 sleeping
Cpu states: 97.7% user, 0.0% nice, 1.9% system, 0.4% interrupt, 0.0% idle
  PID USER PRI  SIZE   RES STATE  TIME   WCPU    CPU COMMAND
 3071 wkt   57  608K  228K run    0:03 88.08% 15.98% gunzip
    1 wkt   10  364K    0K sleep  0:00  0.00%  0.00% <init>
The gunzip process is running and has a low-ish priority of 57. Processes like tcsh, sendmail and cron
have run recently (which is why they have a priority of 18), but are currently blocked on I/O. Processes like rlogin
and getty have not run recently (which is why they have a priority of 3), and are also currently blocked on I/O.
The load averages give the number of jobs in the Ready queues averaged over 1, 5 and 15 minutes.
7 Introduction to Input/Output
Textbook reference: Stallings ppg 471 – 511; Tanenbaum & Woodhull ppg 153 – 165
between the devices and the processes that is simple and easy to use by programmers. If possible, the operating
system should also make this interface device-independent.
The reasons for all this: the user doesn’t want to worry about physical attributes of devices, their programming
quirks, and how to handle errors. They just want to deal with abstract I/O such as mouse events, files and
documents, windows on the screen, and data sent across the network. Leaving the management of I/O to the
operating system makes the programmers’ and users’ lives easier, and allows scheduling algorithms to be created
for the devices.
interrupt: 1. [techspeak] n. On a computer, an event that interrupts normal processing and temporarily
diverts flow-of-control through an “interrupt handler” routine. See also trap. 2. interj. A request for
attention from a hacker. Often explicitly spoken. “Interrupt — have you seen Joe recently?”
interrupts locked out: When someone is ignoring you. In a restaurant, after several fruitless attempts to
get the waitress’s attention, a hacker might well observe “She must have interrupts locked out”. The
synonym ‘interrupts disabled’ is also common.
poll: v.,n. 1. [techspeak] The action of checking the status of an input line, sensor, or memory location
to see if a particular external event has been registered. 2. To repeatedly call or check with someone: “I
keep polling him, but he’s not answering his phone; he must be swapped out.”
Using polling or interrupts to transfer data can be quite CPU intensive. Consider a UART that interrupts each time
a character arrives from a serial port. Now imagine that the serial port is connected to a 56Kbps modem
downloading a large JPEG image from the Internet. 56Kbps is roughly 5,600 characters per second. Therefore, while
the image is being downloaded, the UART is interrupting the current running process 5,600 times a second.
On each interrupt, the CPU is diverted to the UART interrupt handler. The handler must read the incoming
character, find the process waiting for it, copy the character to the process, and change its state to ready before
returning to the running process. You can imagine that this takes a lot of the CPU away from the running process.
A better method is to use direct memory access (DMA). Here, the device controller delivers the data to the
appropriate process location in main memory, and sends an interrupt when the transfer is complete. And if the
device can buffer and deliver a bundle of data (e.g 16 characters with the 16550AN UART), then this can again
reduce the interrupt handling load on the CPU.
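The interrupt rates quoted above follow from simple arithmetic, sketched here. The figure of 10 bits on the wire per character (data bits plus start and stop bits) is an assumption chosen to match the 'roughly 5,600 characters per second' estimate.

```python
def interrupts_per_second(bits_per_second, bits_per_char=10, chars_per_interrupt=1):
    """Estimate how often a serial device interrupts the CPU."""
    chars_per_second = bits_per_second // bits_per_char
    return chars_per_second // chars_per_interrupt
```

For the 56Kbps modem this gives 5,600 interrupts per second with one interrupt per character, and 350 per second if the UART hands over 16 buffered characters at a time.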
DMA requires some intelligence to be built into the device controller. For example, consider a disk drive controller
which has the following registers:
Location Meaning of Register
10,000 Disk address: cylinder, head, sector
10,004 Operation to perform: read or write
10,008 Start of DMA buffer in main memory
Imagine the operating system wants to write one disk block (say 1,000 bytes) from address 34,500 to cylinder 159,
head 7, sector 5. The operating system also knows that the disk device can perform DMA. To make the write to
disk occur, the operating system writes the value 34,500 into address 10,008, the disk address ‘cyl 159, hd 7, sc 5’
into address 10,000, and finally the command ‘write’ into address 10,004.
The disk device controller reads these values, and understands that it must write the buffer starting at location
34,500 to disk location ‘cyl 159, hd 7, sc 5’. It then uses the address/data/status buses to read the 1,000 bytes from
main memory into a buffer on the controller. The device asserts each address in turn (just like the CPU does), and
can thus temporarily stop the CPU (or any other device) from accessing the bus.
Once the data is copied into the disk controller, it can command the disk hardware to write the data at the
requested disk location. Once the operation is complete, the disk controller sends an interrupt into the CPU to
indicate completion. Only now is the operating system’s disk interrupt handler brought into action, and it has much
less work to do now.
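The write sequence above can be mimicked with a toy controller. This is a simulation under loud assumptions: main memory is a `bytearray`, the three register addresses are those from the table, writing the operation register is taken to be what starts the transfer, and the class and attribute names are hypothetical.

```python
DISK_ADDR, OPERATION, DMA_START = 10_000, 10_004, 10_008   # register addresses
BLOCK_SIZE = 1_000                                          # bytes per disk block

class DiskController:
    def __init__(self, memory):
        self.memory = memory             # main memory, reached over the bus
        self.registers = {}
        self.disk = {}                   # (cyl, head, sector) -> block contents
        self.interrupt_pending = False

    def write_register(self, address, value):
        self.registers[address] = value
        if address == OPERATION:         # the command is written last
            self._transfer(value)

    def _transfer(self, operation):
        start = self.registers[DMA_START]
        where = self.registers[DISK_ADDR]
        if operation == "write":
            # DMA: copy the buffer out of main memory without CPU help.
            self.disk[where] = bytes(self.memory[start:start + BLOCK_SIZE])
        elif operation == "read":
            self.memory[start:start + BLOCK_SIZE] = self.disk[where]
        self.interrupt_pending = True    # tell the CPU the transfer is done
```

The operating system's part is just three register writes (34,500 to `DMA_START`, the disk address to `DISK_ADDR`, then `"write"` to `OPERATION`); everything else happens inside the controller until the interrupt arrives.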
[Figure: the CPU, memory and the disk controller (which contains a buffer and is attached to the drive) are all connected to the system bus.]
Usually DMA is slower than the CPU, as the I/O devices don’t operate as fast as the CPU. Thus, DMA may prevent
the CPU from using the bus on average about 1 access out of every N machine cycles; this is known as “stealing a
cycle” from the CPU.
If the CPU needs to read/write data from its internal registers and main memory, this denial of memory access can
slow it down. But with modern computers, this is often a rare situation. Modern CPUs have large numbers of
registers, and a lot of memory cache. CPUs can continue to run the current program by using the data in registers
and cache, and minimise the use of the main memory. The CPU can usually continue to work temporarily without
requiring main memory access.
As with the DMA write operation described above, reads from the disk can be performed without supervision from
the operating system/CPU. Many peripheral device controllers (disks, UARTs, network cards, printer ports etc.)
support DMA data transfers. Good operating systems make use of this support to improve their performance.
8 Principles of Input/Output
Textbook reference: Stallings ppg 471 – 511; Tanenbaum & Woodhull ppg 159 – 165
8.1 Introduction
As we have noted previously, devices are usually memory-mapped: the operating system sees the device controller
and its registers but not the actual device itself. Devices usually interrupt when input arrives, or when an error has
occurred. The operating system must have an interrupt handler to perform the tasks when interrupts occur. On
most machines, interrupts are prioritised. There are usually two types of devices:
• Character-based: keyboards, screens, serial connections, printers, mice, network cards. These devices do I/O
operations one byte at a time, or sometimes a variable number of bytes at a time.
• Block-based: disks, tapes, scanners. These devices do I/O operations one block at a time. The size of the blocks
depends on the device, and can range from 128 bytes to over 4,096 bytes.
Some devices in a computer, like clocks (which only send interrupts at fixed intervals), or ROM/RAM, don’t fit
either category.
• User-level software.
• Device-independent software.
• Device drivers.
• Interrupt handlers.
I/O Layer OS Level Examples
User-level software User Programs and
Libraries
As many devices offer DMA operations, which helps to take the I/O load off the CPU, it is pertinent to ask: why
can’t a DMA I/O operation deliver data directly to the memory of a process?
There are several reasons why this can’t be achieved. First, the device might have less (or more) data available to
deliver than was requested by the process. This is especially true with block devices, which can only do I/O in block-
sized units. The process may have crashed or terminated since the request was made. Another reason is that the
operating system may prefer to cache the data from the I/O operation. More on that later.
Interrupt handlers cannot be put to sleep, as they are in the lower-half of the kernel. However, a low-priority
interrupt handler will be interrupted by a high-priority interrupt.
Interrupt handlers should be as fast as possible because, while one is running, the CPU is taken away from the task of running
users’ processes. Thus, large or complicated interrupt handlers can degrade the processing performance of a
computer. Similarly, a large high-priority handler slows down the system’s response to low-priority interrupts.
Similarly, the device driver takes abstract requests for I/O from the device-independent layer and converts them
to requests that the device controller can perform.
For example, the device-independent layer may know that a particular disk has 1,700,000 512-byte blocks. A read
request from the device-independent layer for disk block X must be converted to the correct cylinder/head/sector
number. The device driver can then load this information into the controller, ask it to read the block, and place the
incoming data into a certain location using DMA. On some drives, the controller must be asked to seek to the track
first before data can be read.
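The conversion from a block number to a cylinder/head/sector address is plain integer arithmetic. The sketch below assumes that consecutive blocks fill a track, then the next head in the same cylinder, then the next cylinder, and that sectors are numbered from 0; the geometry used in the example (16 heads, 63 sectors per track) is hypothetical.

```python
def block_to_chs(block, heads, sectors_per_track):
    """Convert a linear block number into (cylinder, head, sector)."""
    cylinder, rest = divmod(block, heads * sectors_per_track)
    head, sector = divmod(rest, sectors_per_track)
    return cylinder, head, sector
```

With that geometry, block 63 is the first sector under the second head, `(0, 1, 0)`, and block 1,008 is the first sector of the next cylinder, `(1, 0, 0)`.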
driver: n. 1. The main loop of an event-processing program; the code that gets commands and
dispatches them for execution. 2. [techspeak] In ‘device driver’, code designed to handle a particular
peripheral device such as a magnetic disk or tape unit.
The basic function of this layer is to perform I/O functions that are common to all devices in a particular grouping,
and to provide a uniform interface to the user-level software.
For example, a terminal-layer performs terminal operations, even when the devices used are keyboards & screens,
remote serial terminals or network-linked terminals. Similarly, a filesystem provides files, directories, file
permissions and read/write operations, even when the devices used are floppies, hard disks, CD-ROMs, network
drives and RAM disks.
A uniform interface makes writing user software easier. Similarly, a uniform naming method makes it easier to use
devices or the abstract objects available on those devices (e.g files). Protection of devices or their services is also
important, and it is best to do it here, so that all devices get protection in the same manner, and the user sees
consistency in the protection.
Different block devices have different block sizes. This layer must provide a standard block size. To do so it may
need to buffer incoming blocks, or to read/write multiple real blocks. Buffering must be done as a user may only
want to read/write half a block, or to read one character from a device where 20 are available to be read. Allocating
device blocks to store data is also done at a device-independent level in a device-independent fashion. We will
cover this in the filesystem lectures.
The operating system must ensure that only one user is accessing particular devices, e.g a printer. Thus, new opens
on opened devices must fail. User access may lead to starvation and the operating system should attempt to
prevent starvation. To avoid this, the operating system may use spooling, and operating system services to regulate
device access, for example:
Error reporting must be done at the device-independent level, and is done only if the lower levels cannot rectify
the error.
buffer overflow: n. What happens when you try to stuff more data into a buffer (holding area) than it can
handle. This may be due to a mismatch in the processing rates of the producing and consuming
processes, or because the buffer is simply too small to hold all the data that must accumulate before a
piece of it can be processed. The term is used of and by humans in a metaphorical sense. “What time
did I agree to meet you? My buffer must have overflowed.”
spool: [from early IBM ‘Simultaneous Peripheral Operation Off-Line’, but this acronym is widely thought
to have been contrived for effect] vt. To send files to some device or program (a ‘spooler’) that queues
them up and does something useful with them later. The spooler usually understood is the ‘print spooler’
controlling output of jobs to a printer, but the term has been used in connection with other peripherals
(especially plotters and graphics devices).
As mentioned above, clocks don’t really ‘do’ I/O. As they are hardware devices, they are included here. Clocks
usually do two things:
• Send interrupts to the CPU at regular intervals (clock ticks). These can be used to prompt process
rescheduling and allow the operating system to calculate time-specific statistics.
• Send an interrupt after a requested time (alarm clock).
• A few clock devices provide the time of day values in registers. The time of day is often battery-backed.
Most clocks have settable clock tick periods. The usual speeds are 50Hz, 60Hz or 100Hz. Alarm clocks are achieved
by writing a value into a clock register. This is decremented each clock tick, and an interrupt is sent when the value
reaches zero.
8.7 Clocks – Software
The clock interrupt handler receives the clock tick interrupt. The software must:
• Maintain the time of day clock when the clock hardware doesn’t have one.
• Pre-empt processes when their timeslice has been exhausted.
• Handle any ‘alarm calls’ that have been requested, e.g by processes like cron.
• Provide ‘alarm calls’ for the operating system itself, e.g for network retransmissions.
• Perform time-based statistics gathering for the operating system.
Maintaining the time of day is hard because, at 60Hz, a 32-bit tick counter overflows in just over 2 years. It is most often achieved
by keeping two counters, a tick counter, and a seconds counter. Under Unix, the seconds counter has its epoch at
1st January 1970. For MS-DOS and Windows, the epoch is 1980.
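Both the overflow estimate and the two-counter scheme can be sketched briefly. The class and function names are illustrative, and a 365-day year is assumed in the estimate.

```python
HZ = 60   # clock ticks per second

class TimeOfDay:
    """Two counters: whole seconds since the epoch, and ticks within the second."""

    def __init__(self):
        self.seconds = 0
        self.ticks = 0      # always less than HZ

    def on_tick(self):
        self.ticks += 1
        if self.ticks == HZ:
            self.ticks = 0
            self.seconds += 1

def years_until_overflow(bits, ticks_per_second):
    """How long a single counter of the given width lasts at this tick rate."""
    return 2 ** bits / ticks_per_second / (365 * 24 * 60 * 60)
```

`years_until_overflow(32, 60)` comes to roughly 2.3 years, whereas a 32-bit counter of whole seconds (`years_until_overflow(32, 1)`) lasts about 136 years, which is why the seconds are kept separately from the ticks.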
Each process has a timeslice. When scheduled, this is copied into an operating system variable as a number of clock
ticks, and is decremented on each tick. At value zero, the process can be pre-empted by the operating system.
Some operating systems allow processes to set up ‘alarm calls’. When the alarm goes off, exceptional things happen
to the process e.g under Unix, a signal can be sent to the process. The clock driver simulates alarm calls by keeping
a linked list of calls and their differences. The head node’s difference is decremented until zero, at which time the
alarm ‘goes off’. The node is removed, and the next node becomes the head.
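A sketch of such a difference list follows. It uses a Python list in place of a linked list, and the method names are hypothetical; each node stores the number of ticks between it and the node before it, so only the head needs to be decremented on a clock tick.

```python
class AlarmList:
    def __init__(self):
        self.nodes = []   # [ticks after the previous node, name], soonest first

    def add(self, ticks_from_now, name):
        i = 0
        # Walk past every alarm due no later than the new one,
        # converting the absolute delay into a difference as we go.
        while i < len(self.nodes) and ticks_from_now >= self.nodes[i][0]:
            ticks_from_now -= self.nodes[i][0]
            i += 1
        if i < len(self.nodes):
            # The following alarm is now measured from the new node.
            self.nodes[i][0] -= ticks_from_now
        self.nodes.insert(i, [ticks_from_now, name])

    def tick(self):
        """Called once per clock tick; returns the alarms that go off."""
        fired = []
        if self.nodes:
            self.nodes[0][0] -= 1
            while self.nodes and self.nodes[0][0] <= 0:
                fired.append(self.nodes.pop(0)[1])
        return fired
```

The design choice is that the per-tick work is constant no matter how many alarms are pending; the cost is moved to `add`, which has to walk the list.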
Because of the overhead of context switching, there is no guarantee of accuracy for the alarm call. Generally
processes are only guaranteed that the alarm will not go off earlier than requested. The operating system uses
alarm calls to time out on I/O operations, e.g a disk read which never occurs, a network transmission.
Most operating systems gather time-based statistics to aid adaptive scheduling algorithms. The statistics are used
by high-level process schedulers, for user information, and for system administration.
jiffy: n. 1. The duration of one tick of the system clock on the computer. Often one AC cycle time (1/60
second in the U.S. and Canada, 1/50 most other places), but more recently 1/100 sec has become
common. “The swapper runs every 6 jiffies” means that the virtual memory management routine is
executed once for every 6 ticks of the clock, or about ten times a second. 2. Indeterminate time from a
few seconds to forever. “I’ll do it in a jiffy” means certainly not now and possibly never. This is a bit
contrary to the more widespread use of the word.
disksort(dp, bp);
if (dp->b_active == 0) wdustart(du); /* start drive */
Device drivers provide a set of operations to the device-independent I/O layer. Here are some of the device driver
operations provided by Unix device drivers. In Unix, a ‘character’ device is a ‘non-block’ device.
Function    Device Type          Description
d_open      Block and character  Initialise device when first used.
d_close     Block and character  Used when device is released. May shut down device or take it off-line.
d_strategy  Block                Read/write interface, allows event re-ordering.
d_read      Character            Reads data from device.
d_write     Character            Writes data to device.
d_ioctl     Block and character  Generic control operations on device.
Writing and debugging device drivers and interrupt handlers is a real pain. There is no protected process
environment, and many device drivers or interrupt handlers cannot be single-stepped or stopped at a breakpoint.
We will look at two example devices: disks and terminals/keyboards.
Hard disks consist of a number of platters, each of which is flat and circular. Each platter has 2 surfaces,
and both are covered with magnetic material which is used to record information. Disks spin at high speeds, often
3600 rpm or sometimes 7200 rpm.
For each available platter surface, there is a read/write head, which records a track on the surface. There are
usually hundreds or thousands of tracks on each platter. The head is mounted on an arm, which moves or seeks
from track to track. The tracks form concentric circles on the platter’s surface, and are invisible to the naked eye.
Each track can hold a lot of information (10 to 100K), so tracks are usually broken into sectors, each holding a
portion of the data. Sectors can store a fixed amount of data, generally 512 bytes or sometimes 1,024 bytes. A
vertical group of tracks is known as a cylinder. Cylinders are important, because all heads move at the same time.
Once the heads arrive at a particular track position, all the sectors on the tracks that form a cylinder can be read
without further arm motion.
To access a track, the arm must seek to it. The average seek time on drives is 10-50 milliseconds. Then, the disk
must rotate to bring the data to the head: the latency time. Finally, the data is read: the transfer time. Generally,
a disk’s seek time > latency time > transfer time.
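The three components of a disk access can be estimated from the rotation speed. This is a rough sketch: the average latency is taken as half a revolution, the transfer time as the time for one sector to pass under the head, and the 32 sectors per track in the example are an assumption.

```python
def access_time_ms(seek_ms, rpm, sectors_per_track):
    """Return (seek, latency, transfer) in milliseconds for one sector."""
    revolution_ms = 60_000 / rpm
    latency_ms = revolution_ms / 2                    # on average, half a turn
    transfer_ms = revolution_ms / sectors_per_track   # one sector passes the head
    return seek_ms, latency_ms, transfer_ms
```

At 3600 rpm one revolution takes about 16.7 ms, so `access_time_ms(10, 3600, 32)` gives a seek of 10 ms, a latency of about 8.3 ms and a transfer of about 0.5 ms.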
washing machine: n. Old-style 14-inch hard disks in floor-standing cabinets. So called because of the size
of the cabinet and the ‘top-loading’ access to the media packs — and, of course, they were always set
on ‘spin cycle’.
By reordering the queue in this way, the movement of the arm (and hence the seek time) is minimised. Thus, on
average, a pending disk request is serviced faster than with FCFS. However, this reordering of requests can lead to
starvation on big seek requests, especially if new requests are continually arriving. This most affects disk requests
for tracks on the extreme edge of the disk, whereas the middle tracks are preferentially selected by the algorithm.
In other words, the algorithm is unfair.
Because requests are reordered so that they are performed in ascending or descending order, this again helps to
minimise seek time, and thus improve overall disk I/O performance. One nice property of SCAN is that, given any
collection of requests, the upper bound on the motion of the arm is fixed at twice the number of tracks.
Note the following: if the queue is usually of size 1 or less, then no request reordering scheme is going to be useful,
and you end up with FCFS by default. But on most multiprogramming systems, disk request reordering is useful,
and generally C-SCAN is the preferred algorithm.
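The effect of reordering can be compared with a small simulation. It is a sketch under several assumptions: all requests are queued before the arm starts moving, `closest_first` always serves the nearest pending request (often called shortest seek time first), and `scan` sweeps upwards first and reverses at the last request rather than at the edge of the disk. The track numbers in the example are illustrative.

```python
def total_movement(start, order):
    """Total number of tracks the arm crosses serving `order` from `start`."""
    position, moved = start, 0
    for track in order:
        moved += abs(track - position)
        position = track
    return moved

def fcfs(start, requests):
    return list(requests)

def closest_first(start, requests):
    pending, position, order = list(requests), start, []
    while pending:
        nearest = min(pending, key=lambda track: abs(track - position))
        pending.remove(nearest)
        order.append(nearest)
        position = nearest
    return order

def scan(start, requests):
    upwards = sorted(t for t in requests if t >= start)
    downwards = sorted((t for t in requests if t < start), reverse=True)
    return upwards + downwards
```

For an arm at track 53 and the queue `[98, 183, 37, 122, 14, 124, 65, 67]`, FCFS moves the arm 640 tracks, closest-first 236 and this SCAN variant 299. Closest-first wins on this queue, but it is also the scheme that can leave a far-away request waiting indefinitely.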
walking drives: n. An occasional failure mode of magnetic-disk drives back in the days when they were
huge, clunky washing machines. Those old dinosaur parts carried terrific angular momentum; the
combination of a misaligned spindle or worn bearings and stick-slip interactions with the floor could
cause them to ‘walk’ across a room, lurching alternate corners forward a couple of millimeters at a time.
There is a legend about a drive that walked over to the only door to the computer room and jammed it
shut; the staff had to cut a hole in the wall in order to get at it! Walking could also be induced by certain
patterns of drive access (a fast seek across the whole width of the disk, followed by a slow seek in the
other direction). Some bands of old-time hackers figured out how to induce disk-accessing patterns that
would do this to particular drive models and held disk-drive races.
10.8 Interleaving
Textbook reference: Tanenbaum & Woodhull ppg 158 – 159
In many cases where the operating system is slow, it must spend time processing a block read in from the disk
before it can read in another one. For example, the number of DMA buffers may be limited, and the operating
system must move out a block to free up a DMA buffer for another transfer. Thus, if the operating system wants
to read sectors 0, 1 and 2 from a particular track, it may miss sector 1 after reading 0, and have to wait an entire
revolution before it can read sector 1.
If the sectors are interleaved, then there will be a gap to give the operating system time to process before its next
read.
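Figure 3-4. (a) No interleaving. (b) Single interleaving. (c) Double interleaving.
[Diagram of three eight-sector tracks. Reading around each track, the sector order is (a) 0, 1, 2, 3, 4, 5, 6, 7; (b) 0, 4, 1, 5, 2, 6, 3, 7; (c) 0, 3, 6, 1, 4, 7, 2, 5.]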
Note that interleaving can either be done in hardware (i.e within the disk controller logic) or in the software: for
example, on a disk with eight sectors per track, the operating system can treat physical sectors 0,1,2,3,4,5,6,7 as
0,2,4,6,1,3,5,7; in other words, the operating system can itself perform a logical to physical mapping.
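The logical to physical mapping can be generated for any interleave factor. In this sketch, `interleave` is the number of physical sectors skipped between consecutive logical sectors (0 for none, 1 for single, 2 for double); the function names are hypothetical.

```python
def interleaved_layout(sectors, interleave):
    """Return a list where layout[physical position] = logical sector."""
    layout = [None] * sectors
    position = 0
    for logical in range(sectors):
        while layout[position] is not None:   # slot taken: slide to the next free one
            position = (position + 1) % sectors
        layout[position] = logical
        position = (position + interleave + 1) % sectors
    return layout

def logical_to_physical(layout):
    """Invert a layout: table[logical sector] = physical position."""
    table = [0] * len(layout)
    for physical, logical in enumerate(layout):
        table[logical] = physical
    return table
```

For eight sectors with single interleaving, `logical_to_physical(interleaved_layout(8, 1))` yields `[0, 2, 4, 6, 1, 3, 5, 7]`, the mapping quoted above.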
a) programming error: e.g request for non-existent sector. Hopefully the operating system is written to ensure
this does not happen. If it does, halt the system?
b) transient error: e.g dust on the head. The best option is to retry the operation; if errors persist, tell the upper
layers the sector is bad.
c) permanent error: e.g a physically bad sector. This is a problem as some application programs read the entire
disk (e.g backup programs). Some intelligent drives keep a spare cylinder, and when permanent errors occur,
internally map the bad sector to one on the spare cylinder. This can erode the arm scheduling algorithms used.
d) seek error: e.g arm went to track 7, not 6. Some drives fix these errors automatically. Others just inform the
operating system. Here the operating system must recalibrate the head by bringing it back to cylinder 0 and
retrying the seek.
e) controller error: e.g it refuses to accept commands. The operating system can attempt to reset the controller.
If the problem persists, give up.
disk crash: n. A sudden, usually drastic failure that involves the read/write heads dropping onto the
surface of the disks and scraping off the oxide; may also be referred to as a ‘head crash’.
farming: [Adelaide University, Australia] n. What the heads of a disk drive are said to do when they plow
little furrows in the magnetic media. Typically used as follows: “Oh no, the machine has just crashed; I
hope the hard drive hasn’t gone farming again.”
11 Terminals
Textbook reference: Tanenbaum & Woodhull ppg 235 – 249
Every computer has one or more terminals used to communicate with it. By terminal, I mean an I/O device
consisting of a keyboard and a screen. Terminals have a large number of different forms, which we will soon see.
The terminal device drivers must hide these differences from the device-independent software, so that I/O on
terminals can be done by user processes without any knowledge of the actual hardware involved.
[Figure: the CPU and an RS-232 interface card sit on the computer’s bus. The UART on the card is linked to the
UART in the terminal by a transmit line and a receive line.]
Figure 3-30. An RS-232 terminal communicates with a computer over a communication line, one bit at a time. The
computer and the terminal are completely independent.
Characters to/from the terminal are sent in a serial fashion, e.g 7 bits per character, one start bit and one or more
stop bits. The start/stop bits are used to delimit the characters.
Characters are transmitted asynchronously over the serial line: that is, the computer has no idea when the next
character will arrive. Common data speeds are (in bits per second): 300, 1200, 2400, 9600 and 19200. The terminal
and the computer both use chips called UARTs to do the character-to-serial and serial-to-character conversion.
At 9600 bps, with 7 data bits, one start bit and two stop bits (i.e 10 bits overall), we get 960 characters per second, or
around 1 ms per character. This is a long time for an operating system. Usually the operating system asks the UART
to raise an interrupt after sending/receiving a character. Some UARTs have small buffers (2, 4 or 16 characters), and
are able to send fewer interrupts to the operating system.
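The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the default framing (7 data bits, 1 start bit, 2 stop bits) is simply the one used in the example.

```python
def chars_per_second(bps, data_bits=7, start_bits=1, stop_bits=2):
    """Characters per second on an asynchronous serial line."""
    bits_per_char = start_bits + data_bits + stop_bits
    return bps / bits_per_char


def ms_per_char(bps, data_bits=7, start_bits=1, stop_bits=2):
    """Time taken to send one framed character, in milliseconds."""
    return 1000.0 / chars_per_second(bps, data_bits, start_bits, stop_bits)


if __name__ == "__main__":
    # The common line speeds listed in the notes.
    for bps in (300, 1200, 2400, 9600, 19200):
        print(bps, chars_per_second(bps), round(ms_per_char(bps), 2))
```

With one interrupt per character, the characters-per-second figure is also the interrupt rate the operating system has to handle for each busy line.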
tty: /T-T-Y/ n. A terminal of the teletype variety, characterized by a noisy mechanical printer, a very
limited character set, and poor print quality. Usage: antiquated (like the TTYs themselves).
glass tty: /glass T-T-Y/ n. A terminal that has a display screen but which, because of hardware or software
limitations, behaves like a teletype or some other printing terminal, thereby combining the disadvantages
of both: like a printing terminal, it can’t do fancy display hacks, and like a display terminal, it doesn’t
produce hard copy. An example is the early ‘dumb’ version of Lear-Siegler ADM 3 (without cursor
control).
smart terminal: n. A terminal that has enough computing capability to render graphics or to offload some
kind of front-end processing from the computer it talks to. The development of workstations and personal
computers has made this term and the product it describes semi-obsolescent, but one may still hear
variants of the phrase ‘act like a smart terminal’ used to describe the behavior of workstations or PCs
with respect to programs that execute almost entirely out of a remote server’s storage, using said devices
as displays.
[Figure: a video controller on the bus produces an analog video signal (e.g., 16 MHz); the keyboard is attached
through a parallel port.]
With a character-mapped display, writing a character in a memory location causes the character to be displayed.
With a bit-mapped display, every bit in the video memory controls one pixel on the screen. The operating system
must ‘paint’ the characters on the screen. In both cases, scrolling involves copying every byte in video memory
from one address to another.
The keyboard is completely decoupled from the display. It is usually connected through a parallel port mapped to one
memory address, which is where the operating system receives the characters. With many keyboards, the actual
character is not transmitted. Instead a key code is transmitted, indicating which key was pressed. For example,
different key codes will be generated for ‘left shift’, ‘right shift’, ‘caps lock’, ‘control’, ‘A’ etc. The operating
system must convert these key codes into the appropriate characters.
In cooked mode, the user needs several characters in order to perform the line editing. These can usually be chosen
by the user, but the standard Unix ones are:
^H    Erase last character
^C    Interrupt process/kill line
\     Escape next character
tab   Expand to spaces on output device
^S    Stop output
^Q    Start output
^D    End of terminal input
CR    End line
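To make the line-editing idea concrete, here is a minimal Python sketch of how cooked-mode input might be processed. It handles only four of the characters above (^H, backslash, CR and ^D); the function name `cook` and the decision to discard an unfinished line at ^D are assumptions made for the example, not a description of any real terminal driver.

```python
ERASE = "\x08"     # ^H: erase last character
ESCAPE = "\\"      # backslash: take the next character literally
EOF = "\x04"       # ^D: end of terminal input
END_LINE = "\r"    # CR: end line


def cook(raw):
    """Turn a stream of raw keystrokes into a list of completed lines."""
    lines = []
    current = []
    escaped = False
    for ch in raw:
        if escaped:
            current.append(ch)       # escaped character is kept as data
            escaped = False
        elif ch == ESCAPE:
            escaped = True
        elif ch == ERASE:
            if current:
                current.pop()        # drop the last character, if there is one
        elif ch == END_LINE:
            lines.append("".join(current))
            current = []
        elif ch == EOF:
            break                    # simplification: an unfinished line is dropped
        else:
            current.append(ch)
    return lines
```

For instance, `cook("helo\x08lo\r")` yields the single line `hello`: the user typed `helo`, erased the `o`, and then typed `lo`.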
Most users expect to see the characters they type on the screen. However, in some situations (e.g changing
passwords), this needs to be disabled. The terminal-independent software must therefore also perform echoing, and
must provide an interface so that programs can turn it on/off. Echoing presents problems:
• How to erase the last character when the user types backspace?
• What about lines longer than the screen width?
• Does the terminal understand tabs on output?
• How to convert the operating system-specific end of line to that used by the terminal? Unix uses LF, MS-DOS uses
CR-LF, and the classic Mac OS uses CR.
11.5 Output
Terminal output is simpler than input. However, serial terminals present problems that memory-mapped ones do
not. With serial output taking around 1 ms per character, processes can output data faster than it can be transmitted
down the wire. The operating system must buffer output or it will be lost. It will also need to block the transmitting
process if the buffer threatens to overflow.
Memory-mapped terminals on the other hand are as fast as memory, with 1 to 100 microseconds per character.
They have their own problems: how to output the BELL (^G) character on memory-mapped terminals? The
operating system should toggle the speaker to simulate the bell. The output driver may need to keep track of
where the cursor is on memory-mapped displays. It also must perform scrolling.
To take advantage of the capabilities of smart serial terminals, the software needs to know the special command
sequences to use them. These capabilities also need to be simulated on memory-mapped displays by the output
software. The sorts of capabilities are:
A similar situation occurs in GUI environments which have to provide drivers for different mice, video cards and
keyboards, while still presenting the same API to the programmer.
12.1 What is Memory & Why Manage It?
Every process needs some memory to store its variables and code. But if there is more than one process in memory
at any one time, then the operating system must manage memory (as a resource) to prevent processes from
reading/damaging each other’s memory, and to ensure that each process has enough memory (not too much, not
too little). The latter is most difficult, as a process’ memory requirements may vary with time, and users schedule
processes unpredictably.
One thing to remember is that memory is device-like, and has an address decoder.
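[Figure: process memory layouts for a non-segmented architecture and a segmented architecture. Each shows the
machine code (C), data (D) and stack (S) between low and high addresses, with the stack at the high end.]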
The code section (also called the text) holds the program’s instructions. This usually only needs to be read-only. The
data section holds the process’ global data and variables. The stack holds the local variables and the arguments to
each routine.
The grey areas are invalid, as the process initially doesn’t use these locations. But the stack may grow down as the
process calls its routines. On most operating systems the process can also ask for an increased data space. Invalid
areas outside of the process remain invalid: the process cannot read from or write to these locations.
Historically, memory management has evolved from none at all to being very sophisticated. Let’s quickly cover the
history of memory management.
12.3 Bare Machine
Here, there is no memory management. We give the program all of the memory, with no limitation. This provides
maximum flexibility to the user, and minimum hardware cost. There is no special memory hardware, and no need
for any operating system software. However, there are no operating system services; the user must provide these.
ROM
One way to protect the operating system itself is to put it into ROM, so that it is hardware-protected.
Unfortunately, the operating system still needs some RAM for its own operations,
and this is unprotected from access or modification by any running program. Device addresses are not protected
either.
In this environment, it is impossible to protect processes from other processes. This sort of environment was used
in the early minicomputer and microcomputer systems, e.g CP/M and the Apple ][. We still have a hangover from
these days with the BIOS in PC clones.
A variation is to load an extended operating system into RAM. In this situation the operating system is totally
unprotected. I have one word for this situation: MS-DOS.
12.5 Partitions
Textbook reference: Tanenbaum & Woodhull ppg 311 – 313
One of the first mechanisms used to protect the operating system, and to protect processes from each other was
partitions. Here, we add two hardware registers to the memory address decoder: the base and limit registers.
When a process reads from or writes to address ‘X’, the memory decoder adds on the value of the base register,
so the actual operation becomes a read or write to address ‘base + X’.
[Figure: a logical address is translated into a physical address. Physical memory runs from 0 to the top of memory,
and the user program starts at base.]
If the input address is lower than ‘0’ or higher than ‘limit’, the memory hardware considers this an error, and
informs the operating system of a memory access error (usually via an interrupt). Thus, processes can only access
memory within these limits.
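The check-then-add behaviour of the memory hardware can be sketched in Python as follows. The names `translate_partition` and `MemoryAccessError` are invented for this illustration; the exception stands in for the interrupt that real hardware would raise.

```python
class MemoryAccessError(Exception):
    """Stands in for the interrupt the memory hardware would raise."""


def translate_partition(logical, base, limit):
    """Map a logical address to a physical one using base/limit registers."""
    # Follow the rule in the text: below 0 or above 'limit' is an error.
    if logical < 0 or logical > limit:
        raise MemoryAccessError(f"address {logical} is outside 0..{limit}")
    return base + logical


if __name__ == "__main__":
    # A process whose partition starts at physical address 5000.
    print(translate_partition(100, base=5000, limit=999))
```

A context switch in this model amounts to nothing more than calling the function with a different `base` and `limit` pair.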
This addition of hardware allows multiple processes to be loaded into memory and to run without interference
from each other. When the operating system performs a context switch, it must remember to switch the memory
maps of two processes. This is done by changing the values of the two registers. Each process has its own individual
pair of base/limit registers, and the operating system chooses these pairs so that processes’ memory maps never
overlap, and each process has enough memory.
On some systems, there is only a base register, and the limit is a constant. Other systems have two register
pairs, one for data and one for machine code. N.B The latter allows two memory address 0s! This is ok, because
most processes never read their machine code themselves; the CPU reads and executes the code itself.
One problem with partitions is how much memory to allocate initially. If too little, the process may run out of
memory when data and stack collide. If too much, then memory is unused and wasted. It is usually impossible to
change the base/limit registers once set, as other processes are usually immediately above/below.
At this point, let us define two new terms:
Logical Memory is the memory and its location as seen by the process. In the partition system, all processes see
memory as starting at location 0, and going up to the location defined by the limit register. But because the
process (running in user mode) cannot modify the limit register, the process sees its logical memory as fixed
in size.
Physical Memory is the actual main memory and its location as seen by the operating system. This depends on the
physical system, but in general main memory starts at location 0, and goes up to a top location set by the
amount of RAM in the computer.
Note well: it’s unlikely that any process has its base register set to the value 0. Therefore, physical location
0 is unlikely to ever be logical location 0 for any process.
[Figure: physical memory from 0 to the top of memory, holding partitions for processes P1, P2, P3, P4 and P5.]
As new processes are chosen from the pool of programs waiting to be started, the operating
system has to choose a partition size and physical location for this partition. We need a
partition allocation algorithm. Here are some possible algorithms:
First fit: Find the first hole where the waiting program will fit. This is a fast algorithm, but usually leaves a smaller
hole, except where the fit is exact.
Best fit: Find the hole that best fits the job, i.e with the least left over. Surprisingly this is not a good algorithm, as
it leaves a lot of tiny, useless holes all over the physical memory.
Worst fit: Find the biggest hole, leaving the biggest remainder. This is usually the partition allocation algorithm
chosen by system designers.
When a process exits, merging can be done if there is an existing hole above and/or below it.
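The three algorithms differ only in which hole they pick, which a short Python sketch makes plain. Representing the holes as a list of (start, size) pairs ordered by address is an assumption made for this example.

```python
def _fitting(holes, size):
    """Indexes of the holes large enough for a request of 'size'."""
    return [i for i, (_, length) in enumerate(holes) if length >= size]


def first_fit(holes, size):
    """Index of the first hole that fits, or None."""
    candidates = _fitting(holes, size)
    return candidates[0] if candidates else None


def best_fit(holes, size):
    """Index of the smallest hole that fits, or None."""
    return min(_fitting(holes, size), key=lambda i: holes[i][1], default=None)


def worst_fit(holes, size):
    """Index of the largest hole that fits, or None."""
    return max(_fitting(holes, size), key=lambda i: holes[i][1], default=None)


if __name__ == "__main__":
    holes = [(0, 300), (500, 120), (900, 700)]   # (start, size) pairs
    print(first_fit(holes, 100), best_fit(holes, 100), worst_fit(holes, 100))
```

With the holes shown, a request for 100 units leaves remainders of 200, 20 and 600 units respectively, which illustrates why best fit tends to litter memory with tiny holes.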
Fragmentation describes when we have a lot of useless little holes. The system may get to a point when the
available memory is enough to start a process, but it is in the form of holes too small to load the process. As the
useless holes are outside of any process’ memory, this situation is known as external fragmentation. This can be
solved by compaction, by moving existing processes (and their partition registers) to consolidate the holes. The
operating system should use an algorithm to minimise the amount of copying to make the compaction as fast as
possible.
13 Pages
Textbook reference: Stallings ppg 299 – 323; Tanenbaum & Woodhull ppg 319 – 343
13.2 Pages
The concept of pages attempts to do just this. However, pages introduce their own problems. But first let’s examine
how the page mechanism performs memory mapping from logical to physical memory. Here is how page mapping
works:
• Memory is broken into lots of pages, which are of fixed size. The page size is set by the hardware design, but
is generally around 1K to 8K in size.
• We use the terminology of a page as seen by the process, and a page frame as a page as seen by the
operating system in physical memory.
• A Memory Decoder (or MMU) maps the memory addresses and pages requested by a process onto a set of page
frames in physical memory. The decoder can also set protections on each of the process’ pages; for example,
a process’ page may be marked as read-only, read-write, invalid, or privileged-mode access only.
• When a process accesses address LX, the MMU divides the address by the size of the system’s pages. The
resulting quotient is the logical page number LP, and the remainder becomes the offset into that page O (i.e
LX = LP × page size + O).
• The MMU then maps the logical page to a physical page frame, by using a lookup table. It then adds on O to
get PX, the final physical address of the location in the main memory.
[Figure: LX is split into LP and O. The page map, which also holds the protections, maps LP to PP; O is then
added (+) to give PX.]
As an example, consider a system where pages are 2,000 bytes in size. A process tries to read from logical
location 37,450. The MMU receives this location number from the address bus. It divides 37,450 by 2,000,
obtaining logical page number 18 and offset 1,450.
The MMU consults the current page map. Each process has its own page map, just as with partitions each
process had its own base/limit registers. In this page map, logical page 18 maps to physical page frame 115.
The MMU multiplies page frame 115 by 2,000 to get the bottom address of the page frame, 230,000. It then
adds back on the offset, 1,450, to get 231,450. Finally, the MMU performs the requested operation on
physical location 231,450.
From the process’ point of view, it has accessed location 37,450, which is on page 18. But the MMU has mapped
this to physical location 231,450 on page frame 115.
• If the original page has a suitable protection, the memory access is permitted. Otherwise, the MMU sends an
interrupt to the CPU to inform an appropriate interrupt handler of the protection error.
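The mapping steps above can be written out as a short Python sketch that reproduces the worked example. The page map is modelled as a dictionary from logical page number to a (page frame, writable) pair; that layout, and the names `translate_page` and `ProtectionError`, are assumptions for illustration only.

```python
PAGE_SIZE = 2000   # bytes, as in the worked example


class ProtectionError(Exception):
    """Stands in for the interrupt the MMU sends to the CPU."""


def translate_page(lx, page_map, page_size=PAGE_SIZE, write=False):
    """Map logical address lx to a physical address through page_map."""
    lp, offset = divmod(lx, page_size)     # logical page number and offset
    entry = page_map.get(lp)
    if entry is None:
        raise ProtectionError(f"page {lp} is invalid")
    frame, writable = entry
    if write and not writable:
        raise ProtectionError(f"page {lp} is read-only")
    return frame * page_size + offset      # bottom of the frame plus offset


if __name__ == "__main__":
    page_map = {18: (115, True)}           # logical page 18 -> page frame 115
    print(translate_page(37450, page_map))
```

Real MMUs use page sizes that are powers of two, so the division and remainder become a simple split of the address bits; the decimal page size here just keeps the example easy to follow by hand.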
The page map here is the crucial element that makes the system work. Although processes see their logical memory
as being contiguous, their page map can spread their logical memory around as a number of separate page frames
in physical memory. For example, a process’ memory may actually be placed in physical memory like this:
[Figure: a process’ logical memory (code, data and stack) mapped through the page map onto page frames in
physical memory.]
The ‘holes’ (i.e the available physical memory areas) are now always the same size, that is, the size of the system’s
pages. So as long as there are enough free page frames, these can be allocated to new processes as they arrive, or
they can be added to the page map of an existing process. The latter allows a process’ allocated logical memory to
grow at any time.
In the partition scheme, each process’ size must be less than the actual physical amount of memory. With pages,
we can make a process be the size of the memory: we just don’t allocate all the pages to the process. In other words,
the process will occupy the entire address space, but most of that space will be invalid. In this situation, the data
section and the stack are as far away from each other as they can possibly get, and they can grow much more than
with the partition system.
There is one drawback of the page system. When we do a context switch between
two processes, we must save the page map for the old process out of the MMU, and
load the page map for the new process into the MMU. These operations can be
expensive if the page map is big.
Here is an example page entry (a single page to page frame mapping) from the Intel i386 and up. On this system,
pages are 4K in size, and there are 1,048,576 page entries in the page map, numbered from logical page #0 up to
logical page #1,048,575.
On the left of the page entry is the matching page frame number. So, if this was page entry #23, and the page
frame address was 117, then logical page 23 would be mapped by the MMU to page frame 117.
On the right are the page permissions. If the Valid bit is 0, then no page frame has been mapped to this page, and
any access will cause an error. Next is the Read/Write bit: if it is 0, only reads are allowed, if 1, writes are also
allowed.
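When the User/Kernel bit marks a page as kernel-only, access to the page can only be done when the CPU is in kernel
mode. (On the i386 itself this bit is named U/S, and a value of 0 is what restricts the page to supervisor mode.)
Generally, the kernel sets its own pages to have this protection.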
Finally, the Access and Modified bits are used by page replacement algorithms, which we will see in the next
lecture.
13.4 Pages vs. Paging
Memory management can be a confusing topic, and the terminology used doesn’t make it easier. For the rest of
these notes, I am using a terminology which is different from the textbook, but I believe will help you grasp the
concepts of memory management. Here I will briefly explain my terminology, and also the way in which it differs
from the textbook.
A page system is a way of mapping from a process’ logical memory (i.e its address space, or its memory map) to
the physical memory on the computer. The process’ pages are mapped to the system’s page frames.
Paging is a system which copies the contents of unused page frames out to disk, thus making the page frames free
for use by other processes. The storage of page frame contents on disk is known as virtual memory.
Note the following:
Physical memory: the system’s memory space as seen by the operating system.
Page system: a form of logical to physical memory mapping which uses a page map.
Paging system: a mechanism which copies the contents of unused page frames out to disk.
The textbook uses the term ‘virtual addresses’ where I use ‘logical addresses’, and it uses the term ‘paging’ to mean
both a ‘page system’ and a ‘paging system’.
[Figure: address space layout showing code (C), data (D) and stack (S).]
This mapping allowed the data and stack of the process to be placed at opposite ends of this huge address space,
thus virtually ensuring that they would never collide. They are then able to grow; this allows the process to obtain
memory as required, rather than requesting it all when it starts execution.
When context switching between processes, the operating system must ‘unmap’ all of the pages of the old process,
and map in the pages of the next process. If 20 pages have to be unmapped and 30 pages have to be mapped in,
the operating system must send 50 commands to the MMU. This can be very slow with processes that have a large
number of page map entries.
A page system also suffers from internal fragmentation. If a process uses N pages for its code, then N −1 will be
full, and 1 will be partially full. The same sort of internal fragmentation applies for the data and stack regions in
each process.
The number of mappings the MMU must be able to do can be huge. For example, if the page size is 1K, and the
address space is 4G, the MMU’s page map must hold 4 million page mappings. Thus the MMU’s hardware must be
very big.
The page size also affects all of the above problems. If it is small, fragmentation is lower, but the number of
mappings is larger and context switching slower.
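The trade-off can be put into numbers with a small Python sketch. The function names are made up for this example, and the 10,000-byte region is an arbitrary figure chosen only to show the effect.

```python
K = 1024
G = 1024 ** 3


def page_map_entries(address_space, page_size):
    """Number of page mappings needed to cover the whole address space."""
    return address_space // page_size


def internal_waste(region_bytes, page_size):
    """Bytes left unused in the last, partially full page of a region."""
    remainder = region_bytes % page_size
    return 0 if remainder == 0 else page_size - remainder


if __name__ == "__main__":
    for page_size in (1 * K, 4 * K, 8 * K):
        print(page_size,
              page_map_entries(4 * G, page_size),
              internal_waste(10000, page_size))
```

For a 4G address space, 1K pages need 4,194,304 mappings (the "4 million" quoted above), while 4K pages need 1,048,576; in exchange, the larger page leaves more unused space at the end of each region.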
13.8 Copy-on-Write
Copy-on-write is a method which can be implemented on a paged system to avoid doing any copying on a fork()
[see ahead to a multi-threaded server].
After a fork, we have two nearly identical processes, the parent and the child. Note that the contents of their pages
are the same, and they will stay the same until one or the other changes a memory value in a page.
Instead of making copies of the pages in the fork(), the operating system lets both processes share all the pages,
but marks every page as read-only.
If one process tries to alter a page, a memory error occurs. The operating system receives an interrupt, sees that
the page is shared, makes a copy of the page by physically copying the contents to another page frame and
mapping the 2nd page frame to the current process, and gets the process to retry the memory access.
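[Figure: parent, real memory and child, shown twice. In the second view a write to page x has left the original
page frame with one process and a copy with the other.]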
Not performing the memory copies unless necessary is a big saving under Unix, where a fork() is nearly always
followed by an exec(). This makes the fork/exec much faster.
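The copy-on-write mechanism can be simulated in a few lines of Python. This is a toy model, not how any kernel is written: `Frame`, `Process`, `fork` and `write` are names invented for the example, a reference count on each frame is used to tell whether a page is still shared, and every page is assumed to have been writable before the fork.

```python
class Frame:
    """A page frame: its contents plus a count of page maps that refer to it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class Process:
    def __init__(self, pages):
        self.pages = pages    # page number -> [frame, writable]


def fork(parent):
    """Share every frame with the child and mark both mappings read-only."""
    child_pages = {}
    for number, entry in parent.pages.items():
        entry[1] = False                     # parent loses write permission
        entry[0].refs += 1
        child_pages[number] = [entry[0], False]
    return Process(child_pages)


def write(proc, number, offset, value):
    """Write one byte, copying the page first if it is still shared."""
    entry = proc.pages[number]
    frame, writable = entry
    if not writable:                         # the "memory error" path
        if frame.refs > 1:
            frame.refs -= 1
            frame = Frame(frame.data)        # physical copy into a new frame
            entry[0] = frame
        entry[1] = True                      # sole owner now: allow writes
    frame.data[offset] = value               # the retried memory access
```

After a fork, both processes point at the same `Frame` objects; only the first write to a page by either process triggers a copy, and a process that is the last remaining user of a frame simply regains write permission without copying.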
13.9 Operating System Use of Page Entry Protections
We have seen that in the page entries in the MMU, the last three bits set the available protections on each page.
Although the bits provide basic protection, the operating system may wish to set similar protections for very
different reasons.
The operating system can set the basic protections on each process’ pages, but it also has to remember
why each page has this protection. Below is an example table of possible memory areas available to a process; it
includes the hardware protections and the operating system’s reason for setting the permissions that way.
Memory Area      Hardware Protection   OS Protection    Reason
Page 0           Invalid               Invalid          Catches NULL pointer use
Code/Constants   Read-only             Read-only        Unchanging memory, may be shared
Global Data      Read-write            Read-write       Normal variable use
                 Read-only             Copy-on-write    Allows untouched pages to be shared
Observe how part of the process’ logical memory map is occupied by the operating system’s pages, but these are
marked as only accessible in kernel mode. When a process performs a TRAP instruction to do a system call, the
CPU switches over to kernel mode, and so the operating system’s pages immediately become visible. The TRAP
instruction then starts executing the trap handler code at a known location in the memory. The same thing happens
when interrupts or exceptions arrive. This is how the operating system makes itself appear when required.
14 Virtual Memory
Textbook reference: Stallings ppg 333 – 382; Tanenbaum & Woodhull ppg 331 – 343
Virtual memory describes methods that give processes more memory than is physically available, or that make
the computer appear to have more memory than is physically available. There are several VM techniques; paging
is the one most commonly used these days.
14.1 Why Use Virtual Memory?
The memory management methods shown in the last two lectures arise because of one basic requirement: the
entire logical address space of a process must be in physical memory before the process can execute and while it
is executing. In many cases, the entire process need not be in memory:
If we could get the operating system to have in memory only the needed bits of a process:
• Users could write programs bigger than the size of available memory.
• As each process would use less physical memory while running, the operating system could fit more
processes into physical memory.
• Less I/O would be needed to load a process into memory.
The sections of memory currently in use by a process are known as its working set, or as its locality of reference.
The whole point here is to keep the working set in memory (which is fast), and to leave the rest of a process’s
memory out on disk (which is slow).
14.2 Paging
A paged architecture has a memory granularity less than the process size. Thus, if we can determine what pages
are not needed immediately, we can write them out to disk, and release their page frames for other processes to use.
In other words, we copy to disk the contents of pages which are not part of any process’ working set.
After copying out a page to free it, the operating system scrubs the page for security reasons – if it didn’t do this,
a process requesting memory would receive pages with information from another process.
A paging optimisation: if the operating system can tell that the page is clean, i.e unchanged since it was last read
from disk, there is no need to copy the page out – it is still there.
page out: vi. 1. To become unaware of one’s surroundings temporarily, due to daydreaming or
preoccupation. “Can you repeat that? I paged out for a minute.”
14.6 Not Recently Used Algorithm
We use the A and M bits in the following manner. When a process starts up, all of its pages have A =0, M =0. Every
Nth clock tick, the A bits are set back to 0, to distinguish those pages not recently used. The M bits are left
untouched to indicate the dirty pages. The hardware sets the A and M bits as described before.
When a page is needed, the operating system categorises the pages into four categories:
Class 2 indicates that a page was changed a while ago, but not accessed recently (for N clock ticks).
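As a rough illustration of how the A and M bits drive the choice, here is a Python sketch of NRU selection. It deliberately avoids class numbers and relies only on an assumed ordering: pages that have not been accessed recently are preferred as victims, and among those, clean pages are preferred over modified ones. The function names and the dictionary layout are invented for the example.

```python
import random


def clear_accessed(pages):
    """The periodic clock-tick job: reset every A bit, leave the M bits alone."""
    for number in list(pages):
        _, m = pages[number]
        pages[number] = (0, m)


def nru_victim(pages, rng=random):
    """Pick a page to evict. pages maps page number -> (A, M)."""
    if not pages:
        return None
    lowest = min(pages.values())    # tuples compare on A first, then on M
    candidates = [n for n, bits in pages.items() if bits == lowest]
    return rng.choice(candidates)   # any page in the lowest group will do


if __name__ == "__main__":
    pages = {1: (1, 1), 2: (1, 0), 3: (0, 1)}
    print(nru_victim(pages))        # page 3: modified, but not recently accessed
    clear_accessed(pages)
    print(nru_victim(pages))        # page 2: now the only clean, unaccessed page
```

The random choice within a group reflects the fact that the bits give no further information for telling pages in the same group apart.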
NRU picks a page from the lowest non-empty class, i.e. Class 0, or Class 1 if 0 is empty etc. The justification here is
that it is better to page out a modified page that hasn’t been accessed recently than a clean but heavily used
page. The main attraction of NRU is:
Although LRU is more expensive to implement than other page choice algorithms, it performs better, and thus is
the algorithm most used.
14.10 VM Problems
Obviously, disk I/O is costly. If we page in/out too often, we waste time waiting for the disk.
Optimisation: when a running process needs a page in, block it and select another process. Then when the page
arrives, make the process ready to run.
Virtual memory makes the machine appear to have more memory than it actually has. However, it’s better to have
more real memory, thus eliminating the need for paging.
Virtual memory increases the CPU utilisation, especially when most processes are I/O bound, because there can
be more I/O bound processes in memory, increasing the use of the CPU. But, if all pages are in use, and paging
becomes continuous, the machine spends all of its time paging, and no work is done. This is known as thrashing.
[Figure: graph plotted against the number of processes, with the region where thrashing sets in marked.]
With an interactive operating system, there is no easy way to limit the number of processes, so you have to live
with the possibility of thrashing.
Sharing code pages and other pages copy-on-write can help minimise paging.
Instead of waiting for no pages to be free, most systems start paging out pages when a low-water mark is reached,
e.g 5% of total memory. Paging cuts in earlier, but can cope with peaks in memory demands better than
‘desperation’ paging.
thrash: vi. To move wildly or violently, without accomplishing anything useful. Paging or swapping
systems that are overloaded waste most of their time moving data into and out of core (rather than
performing useful computation) and are therefore said to thrash. Someone who keeps changing his mind
(esp. about what to work on next) is said to be thrashing. A person frantically trying to execute too many
tasks at once (and not spending enough time on any single task) may also be described as thrashing.
Lazy Page Allocation: Mark every page as invalid, including all the code pages. Every time a page is needed, one is
allocated and possibly paged in from disk. This maximises free memory, but can make a process slow to start up.
This is also known as Demand Paging.
Prepaging: Allocate every code and global data page. Only allocate new pages when the global data or the stack
grows. This minimises paging but wastes memory.
15.1 Introduction
For most users, the file system is the most visible aspect of an operating system, apart from the user interface.
Files store programs and data.
The operating system implements the abstract ‘file’ concept by managing I/O devices, for example, hard disks. Files
are usually organised into directories to make them easier to use. Files also usually have some form of protections.
The file system provides:
Figure 5-2. Three kinds of files. (a) Byte sequence. (b) Record sequence. (c) Tree.
Examples are byte-structured, record-structured, tree-structured (ISAM). Most systems also give files attributes.
Example attributes are:
• File’s name
• Identification of owner
• Size
• Creation time
• Last access time
• Last modification time
• File type
In some systems, the name is just a convention, and you can change the name of a file. In others, the conventions
are enforced. You are not allowed to rename files, as the type would then change. This can be a hassle, for example,
running file.pas through a Pascal source prettifier would produce file.dat, which could not be ‘converted’
back to file.pas.
Two-level directory (usually one directory per person).
Tree-structured (hierarchical).
The last directory structure can cause problems due to the file duplication. For example, a backup program may
back up the same file twice, wasting off-line storage.
We will consider the file system metadata in the next few sections.
[Figure: a directory entry holds a file’s name and its file information, including a list of disk blocks that
points to the disk blocks.]
The first method allows the acyclic and general directory structure, where a file may have multiple names and
multiple attributes.
Non-contiguous
A contiguous file layout is fast (less seek time), but it has the same problem as memory partitions: there is often
not enough room to grow.
A non-contiguous file layout is slower (we need to find the portion of the file required), and involves keeping a
table of the file’s portions, but does not suffer from the constraints that contiguous file layout does.
The next problem is to choose a fixed size for the portions of a file: the size of the blocks. If the size is too small,
we must do more I/O to read the same information, which can slow down the system. If the size is too large, then
we get internal fragmentation in each block, thus wasting disk space. The average size of the system’s files also
affects this problem.
[Figure: plot against file size; x-axis: File Size (bytes), from 0 to 10000.]
[Figure 5-15 plot: data rate and disk space efficiency against block size, from 128 bytes to 8K.]
Figure 5-15. The solid curve (left-hand scale) gives the data rate of a disk. The dashed curve (right-hand scale)
gives the disk space efficiency. All files are 1K.
Because of these factors, a block size of 512 bytes, 1K or 2K is usually chosen. The
operating system must know:
[Figure: two ways of recording free blocks. (a) A list of free disk block numbers: a 1K disk block can hold 256
32-bit disk block numbers. (b) A bit map, e.g 0111011101110111 1101111101110111.]
Usually the free list resides in physical memory, so that the operating system can quickly find free blocks to allocate.
If the free list becomes too big, the operating system may keep a portion of the list in memory, and read in/out
the other portions as needed. In any case, a copy of the free list must be stored on disk so that the list can be
recovered if the machine is shut down.
Thus, as well as the blocks holding file data, the operating system maintains special blocks reserved for holding free
lists, directory structures etc. For example, a disk under Minix looks like:
[Figure: Minix disk layout, showing the i-node bit map, the zone bit map, the i-nodes and the data area.]
Figure 5-28. Disk layout for the simplest disk: a 360K floppy disk, with 128 i-nodes and a 1K block size (i.e., two
consecutive 512-byte sectors are treated as a single block).
Ignore the i-nodes for now. The boot block holds the machine code to load the operating system from the disk
when the machine is turned on.
The super block describes the disk’s geometry, and such things as:
The i-nodes are used to store the directory structure and the attributes of each file. More on these in future
lectures.
Note that disks sometimes suffer from physical defects, causing bad blocks. The free list can be used to prevent
these bad blocks from being allocated. Mark a bad block as being used; this will prevent it from being allocated in
the future.
An alternative is to use a special value to indicate that the block is bad; MS-DOS uses this technique, for example.
A free bitmap cannot record ‘bad’ as a separate state, because it has only two values per block: free and in-use.
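As a minimal sketch of the bit map approach, assume one bit per block with 1 meaning "in use". The bit sense, the disk size NBLOCKS and all function names here are illustrative, not taken from any particular system. The code shows first-free allocation, and how marking a bad block as used keeps the allocator away from it.

```c
#include <assert.h>
#include <limits.h>

#define NBLOCKS 64                               /* hypothetical disk size in blocks */

unsigned char bitmap[NBLOCKS / CHAR_BIT];        /* one bit per block; 1 = in use */

int block_in_use(int b)
{
    return (bitmap[b / CHAR_BIT] >> (b % CHAR_BIT)) & 1;
}

void mark_used(int b)
{
    bitmap[b / CHAR_BIT] |= (unsigned char)(1u << (b % CHAR_BIT));
}

void mark_free(int b)
{
    bitmap[b / CHAR_BIT] &= (unsigned char)~(1u << (b % CHAR_BIT));
}

/* A bad block is simply marked used and never attached to any file. */
void mark_bad(int b)
{
    assert(b >= 0 && b < NBLOCKS);
    mark_used(b);
}

/* Return the first free block and mark it used, or -1 if the disk is full. */
int alloc_block(void)
{
    for (int b = 0; b < NBLOCKS; b++) {
        if (!block_in_use(b)) {
            mark_used(b);
            return b;
        }
    }
    return -1;
}
```

Because a bad block looks exactly like an allocated one, nothing in this structure can tell the two apart later; that is the limitation noted above.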
16.1 Introduction
The data and attributes of files must be stored on disk to ensure their long-term storage. Different file systems use
different layouts of file data and attributes on disk. We will examine the filesystems for MS-DOS and Unix.
16.2 The MS-DOS Filesystem
MS-DOS breaks disks into up to four sections, known as partitions. The first block holds the primary boot sector,
which describes the types and sizes of each partition.
A partition may or may not have an MS-DOS filesystem in it. Within an MS-DOS filesystem is a secondary boot
block, a file allocation table or FAT, a duplicate FAT, a root directory and a number of blocks used for file storage.
Each file has 32 bytes of attributes which are stored in each directory as a directory entry. The entry describes the
first block used for data storage.
The list of blocks used by the file is kept in the FAT. The FAT has an entry for every disk block, which records whether
the block is free or bad, or which block comes next in the file.
MS-DOS keeps some of the FAT in memory to speed lookups through the table. For large disks, the FAT may be
large, and must be stored in several disk blocks. The size (in bits) of a FAT entry limits the size of the filesystem. For
example, MS-DOS originally used a 320K floppy with 1K blocks numbered 0 – 319, using 12 bits to number each
block. Thus, the largest FAT (320 entries of 12 bits each) requires 480 bytes, which fits into one disk block.
When hard disks with more than 4096 blocks arrived, the 12-bit block numbers were too small. So MS-DOS had to
move to 16 bits, causing the whole directory structure to change.
The biggest problem with the FAT scheme is that if the disk is big, the FAT is big. For example, a 64M hard disk has
about 64,000 1K blocks; at 2 bytes per block number, each FAT takes about 128K. Another drawback is that finding
a file’s list of blocks means following its chain through the FAT, one entry at a time. This is fine for sequential access,
but penalises any random access in the file.
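The cost of random access can be seen in a small sketch. It assumes the FAT is an in-memory array of 16-bit entries, each holding the number of the next block in the file; the marker values and the name fat_nth_block are chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define FAT_FREE 0x0000u   /* illustrative marker: block is free */
#define FAT_EOF  0xFFFFu   /* illustrative marker: last block of a file */

/*
 * Return the disk block holding the n-th block (0-based) of a file whose
 * first block is `first`, or FAT_EOF if the file is shorter than that.
 * Reaching block n costs n table lookups.
 */
uint16_t fat_nth_block(const uint16_t *fat, uint16_t first, unsigned n)
{
    uint16_t b = first;

    assert(fat);
    while (n > 0 && b != FAT_EOF) {
        b = fat[b];        /* follow the chain one link */
        n--;
    }
    return b;
}
```

Reading a file from start to finish follows each link once, but seeking straight to block n still has to walk all n links from the first block.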
MS-DOS performs first free block allocation. This may lay a file’s data blocks out poorly across the disk.
Defragmentation can be performed to make files contiguous.
Loss of the FAT makes a filesystem unusable. This is why MS-DOS keeps a duplicate FAT. Viruses, however, usually
destroy both FATs.
After the superblock are the i-nodes. These contain the attributes of files, but not the filenames. Note that the
attributes are quite different to MS-DOS. As Unix is a multiuser system, files have ownership and a three-level set
of protections (read, write, execute for the user, a group of users, and all other users). An i-node holds:
File mode
Link count
Owner’s id
Group id
File size
Last access time
Last mod time
Last inode access time
Addresses of
first 10 blocks
Single indirect ptr
Double indirect ptr
Triple indirect ptr
Instead of a linear structure for keeping the list of blocks, Unix uses a tree structure which is faster to traverse.
The i-node holds the first 10 block numbers used by the file. If the file grows larger, a disk block is used to store
further block numbers; this is a single indirect block. Assume the single indirect block can hold 256 block numbers:
this allows the file to grow to 10+256=266 blocks.
If the single indirect block becomes full, another two blocks are used. One becomes the next single indirect block,
and the second points to the new single indirect block; this is a double indirect block, which can point at 256 single
indirect blocks, allowing the file to grow to 10+256+256∗256=65,802 blocks.
Sometimes, a file will exceed 65,802 blocks. In this situation, a triple indirect block is allocated which points at up
to 256 double indirect blocks. There can only be one triple indirect block, but when used, a file can grow to be
10 + 256 + 256^2 + 256^3 = 16,843,018 blocks, or around 16 gigabytes in size with 1K blocks. One strength of the
i-node system is that the indirect blocks are used only as required, and for files of 10 blocks or fewer, none are
required. The main advantage is that a tree search can be used to find the block number for any block, and as the
tree is never more than three levels deep, it never takes more than three disk accesses to find a block. This also
speeds random file accesses.
A System V Unix directory entry looks like:
[Figure: a 16-byte directory entry: a 2-byte i-node number followed by a 14-byte file name.]
Figure 5-13. A UNIX directory entry.
Note here that Unix only stores the file name and i-node number in each directory entry. The rest of the information
about each file is kept in the i-node. This allows files to be linked, allowing acyclic and general graph directory
structures. This cannot be done with MS-DOS.
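Returning to the block tree held in the i-node: the following sketch works out how many indirect blocks must be read to locate a given block of a file. It assumes the figures used above (10 direct block numbers, 256 block numbers per indirect block); the function name is hypothetical.

```c
#include <assert.h>

#define NDIRECT 10L      /* direct block numbers held in the i-node */
#define NPTRS   256L     /* block numbers per indirect block, as assumed above */

/*
 * Return how many indirect blocks must be read to find the disk address
 * of logical block n of a file: 0 (direct) up to 3 (triple indirect),
 * or -1 if n is beyond the largest possible file.
 */
int indirection_level(long n)
{
    assert(n >= 0);
    if (n < NDIRECT)
        return 0;
    n -= NDIRECT;
    if (n < NPTRS)
        return 1;
    n -= NPTRS;
    if (n < NPTRS * NPTRS)
        return 2;
    n -= NPTRS * NPTRS;
    if (n < NPTRS * NPTRS * NPTRS)
        return 3;
    return -1;
}
```

The boundaries fall at blocks 10, 266 and 65,802, matching the file sizes worked out above, and the cost is bounded by three lookups however large the file is.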
[Figure: directory entries in /usr/fred and in /tmp that refer to the same i-node, number 59, whose link count is 2.]
One problem: if shared.c is removed from one directory, does the file still exist under the other name? To handle
this, Unix keeps a link count in every i-node, which is usually 1. When a file is removed, its directory entry is removed,
and the i-node link count is decremented; only if the link count becomes 0 are the i-node and the data blocks freed.
Linked files have been found to be useful, but there are problems with links. Programs that traverse the directory
structure (e.g. backup programs) will back up the file multiple times. Even worse, if a directory is a link back to a
higher directory (thus making a loop), files in between may be backed up an infinite number of times. Thus, tools
must be made clever enough to recognise and keep count of linked files to ensure that the same file isn’t
processed twice.
The System V filesystem has several problems. The performance ones are discussed below. The list of i-nodes is
fixed, and thus the number of files in each filesystem is fixed, even if there is ample disk space. Unix uses a bitmap
to hold the list of free disk blocks, and so bad blocks are not easily catered for.
orphaned i-node: n. 1. A file that retains storage but no longer appears in the directories of a filesystem.
2. By extension, a pejorative for any person serving no useful function.
link farm: n. A directory tree that contains many links to files in a master directory tree of files. Link farms
save space when (for example) one is maintaining several nearly identical copies of the same source
tree, e.g., when the only difference is architecture-dependent object files. “Let’s freeze the source and
then rebuild the FROBOZZ-3 and FROBOZZ-4 link farms.”
FFS breaks the filesystem up into several cylinder groups. Cylinder groups are usually a megabyte or so in size. Each
cylinder group has its own free block list and list of i-nodes.
[Figure: two disk layouts, (a) and (b).]
Figure 5-19. (a) I-nodes placed at the start of the disk. (b) Disk
divided into cylinder groups, each with its own blocks and i-nodes.
The superblock is replicated in each cylinder group, to minimise the problem of a corrupt filesystem due to loss of
the superblock.
To improve I/O speeds, blocks come in two sizes, with the smaller size called a fragment. Typical block/fragment sizes are
8K/1K. An FFS file is composed entirely of blocks, except for the last block which may contain one or more consecutive
fragments.
Unused fragments may be used by other files, and occasional recopying must be done as files grow in size, to either
merge fragments into single blocks, or to keep fragments consecutive within a block.
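As a rough sketch of the block/fragment arithmetic, the code below works out how a file of a given size would be laid out, assuming the 8K/1K sizes quoted above. The structure and function names are hypothetical, and the sketch ignores any further restrictions a real FFS places on which files may use fragments.

```c
#include <assert.h>

#define BLOCK_SIZE 8192L   /* 8K block */
#define FRAG_SIZE  1024L   /* 1K fragment */

struct ffs_layout {
    long full_blocks;      /* whole blocks used by the file */
    long tail_frags;       /* consecutive fragments holding the remainder */
};

struct ffs_layout ffs_layout_for(long file_size)
{
    struct ffs_layout l;
    long tail;

    assert(file_size >= 0);
    l.full_blocks = file_size / BLOCK_SIZE;
    tail = file_size % BLOCK_SIZE;
    l.tail_frags = (tail + FRAG_SIZE - 1) / FRAG_SIZE;   /* round up */

    /* A tail that needs every fragment of a block is just a full block. */
    if (l.tail_frags == BLOCK_SIZE / FRAG_SIZE) {
        l.full_blocks++;
        l.tail_frags = 0;
    }
    return l;
}
```

For example, a 20,000-byte file takes two full blocks plus four fragments, wasting under 1K, where whole 8K blocks alone would waste over 4K.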
The Standard I/O library uses the block size to perform file I/O, which helps to maintain I/O performance.
FFS has several allocation policies to improve performance:
• I-nodes of files within the same directory are placed in the same cylinder group.
• New directories are in different cylinder groups than their parents, and in the one with the highest free
i-node count.
• Data blocks are placed in the same cylinder group as the file’s i-node.
• Avoid filling a cylinder group with large files by changing to a new cylinder group at the first indirect block,
and at one megabyte thereafter.
• Allocate sequential blocks at rotationally optimal positions.
FFS also increased the filename to 255 characters maximum. FFS provides a significant increase in I/O performance
against the System V filesystem (typically several hundred percent improvement).
17.2 Bad Blocks
Disks come with bad blocks, and more appear as the disk is used. Each time a bad block appears, mark that block as
used in the free block table, but ensure that no file uses the block. This leaves the problem of fixing any file that
was using the bad block.
17.3 Backups
Generally, the best method of ensuring minimal effect after a catastrophic loss is to have a copy of the data on
another medium (disk/tapes); this is a backup. It is impossible to have a backup completely up to date, therefore
even with a backup you may still lose some data.
Backups should be done at regular intervals, usually weekly or monthly, with the entire contents of the file system
transferred to disk/tape; this is a full backup.
If the file system is large (e.g. greater than 200M), backups are very big and time consuming. Between full backups,
do incremental backups, i.e. copy only those files that have changed since the last full/incremental backup.
It is also a good idea to keep 3 full dumps, and rotate them when doing a full backup. Thus, even when you are
doing a full backup, you still have the last two full backups intact.
[Figure: four panels, (a) to (d), each showing per-block counters for block numbers 0 to 15, including a row labelled "Blocks in use".]
Figure 5-18. File system states. (a) Consistent. (b) Missing block.
(c) Duplicate block in free list. (d) Duplicate data block.
• b) Missing block # 2. This could be a bad block, or a free block that hasn’t been marked.
• c) Duplicate free block #4. This can be ignored.
• d) Duplicate used block #5, i.e. two or more files use this block. This should never happen! If one file is freed,
the block is on both the used and the free list! One solution is to make ‘n’ copies of the block and give each
file its own block. Also signal an error to the owners of each file.
• e) Block on both free and used lists. We must fix this immediately, as the free block may be allocated and
overwritten. Remove the block from the free list.
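Cases b) to e) can be expressed as a small classification function. The sketch assumes a checker has already counted, for each block, how many times it appears in files and how many times it appears in the free list; the enum and function names are hypothetical.

```c
#include <assert.h>

enum block_state {
    CONSISTENT,       /* in exactly one file, or free exactly once */
    MISSING,          /* case b: neither used nor free */
    DUP_FREE,         /* case c: in the free list more than once */
    DUP_USED,         /* case d: used by two or more files */
    FREE_AND_USED     /* case e: on both the free and used lists */
};

/* used = times the block appears in files; free_cnt = times in the free list. */
enum block_state classify_block(int used, int free_cnt)
{
    assert(used >= 0 && free_cnt >= 0);
    if (used > 0 && free_cnt > 0)
        return FREE_AND_USED;     /* most urgent, so tested first */
    if (used > 1)
        return DUP_USED;
    if (free_cnt > 1)
        return DUP_FREE;
    if (used == 0 && free_cnt == 0)
        return MISSING;
    return CONSISTENT;
}
```

Testing case e) first reflects the advice above: a block that is both free and in use may be handed out and overwritten at any moment.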
Some operating systems can protect against user errors. For example, under MS-DOS, you can usually undelete a
file that was accidentally deleted, because the directory entry is marked unused, but the remaining information is
still intact (i.e. attributes, FAT entries). This is technically possible under Unix, but usually not practical because Unix
allows multiple processes and users, thus the freed blocks may be reused at any time.
scribble: n. To modify a data structure in a random and unintentionally destructive way. “Somebody’s
disk-compactor program went berserk and scribbled on the i-node table.” Synonymous with trash;
compare mung, which conveys a bit more intention, and mangle, which is more violent and final.
One problem with holding writes in the cache is that, if the machine crashes, dirty blocks still in the cache never
reach the disk, and so the most recent writes are lost.
The solution under Unix is to flush all dirty blocks to disk every 30 seconds. Caches that write directly to disk
(write-through caches) do not suffer from this problem, but are usually slower because of the I/O delay in writing.
The larger the cache, the more blocks in memory, and thus the better the hit-rate and the performance. Disk caches
usually range from 512K to 8M of memory.
Finally, it’s a bit quirky that disks are used to hold pages from virtual memory, and memory is used to hold recently
used disk blocks. It makes sense when you realise that VM is providing more memory with a performance penalty,
whereas disk caching is providing some disk with a performance increase.
In fact, newer operating systems (Solaris, FreeBSD) have merged the disk cache into the virtual memory subsystem.
Disk blocks and memory pages are kept on the same LRU list, and so the whole of a computer’s memory can be
used for both working set storage and disk caching.
sync: /sink/ (var. ‘synch’) n., vi. 1. To synchronize, to bring into synchronization. 2. To force all pending
I/O to the disk. 3. More generally, to force a number of competing processes or agents to a state that
would be ‘safe’ if the system were to crash; thus, to checkpoint (in the database-theory sense).
17.6 File Block Allocation
Remember that disk arm motion is slow compared with rotational delay. Therefore the operating system should
attempt to allocate blocks to a file contiguously (or with minimal arm movement) where possible. This will lower
requested arm motion, and improve disk access.
Some operating systems (notably those for micros) have utilities that take a file system, and rearrange the used
blocks so that each file has its blocks arranged in contiguous order. This is really only useful for those files which
will not grow. Files that do grow will have some blocks in one area of the disk and others in another area.
Other data/indirect blocks are allocated only as needed. Reads on unallocated blocks will return ‘empty’ blocks i.e.
all zeroes.
Question: what problems do files have as an IPC mechanism for concurrent processes?
deadlock: n. 1. [techspeak] A situation wherein two or more processes are unable to proceed because
each is waiting for one of the others to do something. A common example is a program communicating
to a server, which may find itself waiting for output from the server before sending anything more to it,
while the server is similarly waiting for more input from the controlling program before outputting
anything. (It is reported that this particular flavor of deadlock is sometimes called a ‘starvation deadlock’,
though the term ‘starvation’ is more properly used for situations where a program can never run simply
because it never gets high enough priority. Another common flavor is ‘constipation’, where each process
is trying to send stuff to the other but all buffers are full because nobody is reading anything.) See deadly
embrace. 2. Also used of deadlock-like interactions between humans, as when two people meet in a
narrow corridor, and each tries to be polite by moving aside to let the other pass, but they end up swaying
from side to side without making any progress because they always both move the same way at the
same time.
deadly embrace: n. Same as deadlock, though usually used only when exactly 2 processes are involved.
This is the more popular term in Europe, while deadlock predominates in the United States.
18.5 Messages
An alternative IPC is to use messages. Each process can send/receive short lumps of data called messages. Each
message is independent of all others, and thus is useful for separate requests. A process sends the message to
some named destination, either a process or an endpoint. These may be on the same machine or on remote
machines.
Each message also holds the addresses for the source and destination process, in a manner similar to e-mail.
Rendezvous Method: The two processes must rendezvous. The sender blocks until the receiver is receiving, and
vice versa. Once both are ready, the message is exchanged, and both unblock. This is like passing a letter by
hand. Delivery is guaranteed.
Queue Method: The receiver has a mailbox (sometimes known as a port or a socket) where sent messages are
delivered. A FIFO queue of messages builds up there. Each receive() receives a message. The process
blocks if there are no messages to receive. The sender never blocks. This is like sending a letter through
Australia Post. There are no guarantees of delivery and, even worse, the receiver’s buffer may overflow.
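The queue method can be sketched as a fixed-size FIFO mailbox. This is a single-process model only: the capacity, message size and function names are assumptions made for illustration, and where a real receive() would block, this version just returns -1.

```c
#include <assert.h>
#include <string.h>

#define MAILBOX_SLOTS 4     /* hypothetical mailbox capacity */
#define MSG_SIZE      32    /* hypothetical fixed message size */

struct mailbox {
    char slots[MAILBOX_SLOTS][MSG_SIZE];
    int head;               /* index of the oldest message */
    int count;              /* messages currently queued */
};

/* Never blocks: returns 0 on success, -1 if the mailbox is full
   and the message is lost. */
int mbox_send(struct mailbox *mb, const char *msg)
{
    int tail;

    assert(mb && msg);
    if (mb->count == MAILBOX_SLOTS)
        return -1;
    tail = (mb->head + mb->count) % MAILBOX_SLOTS;
    strncpy(mb->slots[tail], msg, MSG_SIZE - 1);
    mb->slots[tail][MSG_SIZE - 1] = '\0';
    mb->count++;
    return 0;
}

/* Copies out the oldest message and returns 0, or returns -1 if the
   mailbox is empty (a real receive() would block here instead). */
int mbox_receive(struct mailbox *mb, char *out)
{
    assert(mb && out);
    if (mb->count == 0)
        return -1;
    strcpy(out, mb->slots[mb->head]);
    mb->head = (mb->head + 1) % MAILBOX_SLOTS;
    mb->count--;
    return 0;
}
```

The -1 from mbox_send() is the overflow case mentioned above: the sender carries on regardless, so a lost message is only noticed if the sender checks the result.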
    receive(&our_name, &dummy_buffer, &reply);
    if (error) return(error);
    else return(ok);
}
This can make message passing look like a regular program procedure call, and it is usually known as a remote
procedure call. For example, a request to read a disk block usually goes directly to the operating system:
The operating system performs the operation, or returns an error if there is a failure.
The call operation could be used to send the read request to a remote file server and get the buffer back from the
server. For this to work, we must create the right messages to send the request and receive the reply.
Here’s an example subprogram to replace blockread with a call to a remote server:
struct send_message {
    operation o;
    block b;
    file f;
} out;

struct recv_message {
    error e;
    block b;
    file f;
    char buf[1024];
} in;

out.o = READ;
out.b = b;
out.f = f;
A programmer can use the subprogram given above, and s/he will not be able to tell if the blockread is going
to the local operating system or to a server.
Actually this is sometimes untrue: the operating system will often timeout on a network send or read, and inform
the process of the problem.
Many operating systems provide already-written RPCs to perform operations on remote servers. Note that with
uni/bidirectional streams, messages and RPCs, the two processes involved do not need to be on the same machine,
as the flow of data can travel across network wires.
18.7 Shared Memory
The fastest IPC of the lot. Let the operating system map some pages into the same logical locations in two or more
processes.
[Figure: the address spaces of Process 1 and Process 2, each with S, D and C regions.]
This is easy to do with a paged architecture. The operating system already keeps a list of pages that a process owns;
just add a few more to the list. However, the operating system needs to keep a link count (like for shared files), so
that the page can be freed when the link count becomes zero.
It is not easy to share memory between two or more computers. Several groups are studying how to do this,
however.
hot spot: n. 1. It is received wisdom that in most programs, less than 10% of the code eats 90% of the
execution time; if one were to graph instruction visits versus code addresses, one would typically see a
few huge spikes amidst a lot of low-level noise. Such spikes are called ‘hot spots’ and are good
candidates for heavy optimization or hand-hacking. 2. In a massively parallel computer with shared
memory, the one location that all 10,000 processors are trying to read or write at once.
19 Synchronisation
Textbook reference: Stallings ppg 197 – 293; Tanenbaum & Woodhull ppg 57 – 82
Imagine the balance is 800, process A wants to withdraw 500 and process B wants to withdraw 400. If, just after A
performs the check B − W ≥ 0 (balance minus withdrawal) and gets 300, it is pre-empted, and B is scheduled, B will
do the check and withdraw 400, leaving 400. When A is rescheduled, it will already have done the check, and will
perform the withdrawal, leaving a balance of -100, which is wrong.
This is a race condition, where two or more processes are reading or writing some shared data, and the final result
depends on who runs precisely when.
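The unlucky interleaving can be replayed deterministically in a single thread. This is only a model of the schedule described above, not real concurrent code; the function names are made up for the example.

```c
#include <assert.h>

/* Step 1 of a withdrawal: the check. Returns 1 if balance - w >= 0. */
int can_withdraw(long balance, long w)
{
    return balance - w >= 0;
}

/*
 * Replay the interleaving from the text: A checks and is pre-empted,
 * B checks and withdraws, then A withdraws on the strength of its
 * now stale check. Returns the final balance.
 */
long replay_race(long balance, long a_wants, long b_wants)
{
    int a_ok;

    assert(a_wants >= 0 && b_wants >= 0);
    a_ok = can_withdraw(balance, a_wants);   /* A: check passes */
    /* A is pre-empted here; B runs */
    if (can_withdraw(balance, b_wants))
        balance -= b_wants;                  /* B: check and withdraw */
    /* A is rescheduled */
    if (a_ok)
        balance -= a_wants;                  /* A: withdraw without re-checking */
    return balance;
}
```

With a balance of 800 and withdrawals of 500 and 400, replay_race() returns -100. Had A's check and withdrawal run as one uninterrupted step, B's request would have been refused and the balance would be 300.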
The section of code where the race condition occurs is a critical section; here it is
A critical section happens because it is composed of several steps, and a process can be scheduled out after any of
the steps. If there was only one atomic step, there would be no critical section, as a process could only be scheduled
out after the entire operation.
An operation is atomic if it is guaranteed to be performed all at once, with no interruptions.
1. No two processes can be inside the critical section simultaneously. As we will see, this isn’t enough, so we add
three more conditions.
2. No assumptions are made about the relative speeds of each process or the number of CPUs. This disallows
solutions based on exact timing.
3. No process stopped outside its critical section should block another process.
4. No process should wait arbitrarily long to enter the critical section. In other words, starvation is not allowed.
Consider a general situation where there is a shared object, e.g. the amount of stock left for a particular item in a
warehouse, with multiple requests for that stock coming in from various parts of the country. The timing of the
requests is unpredictable, and so is the duration of the operations to ship items and update the warehouse
records.
Let us look at some possible solutions for avoiding race conditions in critical sections.
while (1)
{
    while (turn != 0)      /* != 1, != 2, != 3 etc */
        ;                  /* Wait */
    critical_section();
    turn = 1;              /* =2, =3, =0 etc */
    other_stuff();
}
Thus, the access to the critical section rotates through each process.
This is a bad solution, as each process loops continuously waiting for the value of turn to change and thus wasting
the CPU. This is known as busy-waiting.
Also, the solution relies on rotation through the set of processes. Even if a process doesn’t want the critical section,
it is given it, and others must wait for it to go through the section, thus violating condition 3. Finally, a slow process
slows down the rotation, thus violating condition 2.
busy-wait: vi. Used of human behavior, conveys that the subject is busy waiting for someone or
something, intends to move instantly as soon as it shows up, and thus cannot do anything else at the
moment. “Can’t talk now, I’m busy-waiting till Bill gets off the phone.”
Technically, ‘busy-wait’ means to wait on an event by spinning through a tight or timed-delay loop that
polls for the event on each pass, as opposed to setting up an interrupt handler and continuing execution
on another part of the task. This is a wasteful technique, best avoided on time-sharing systems where a
busy-waiting program may hog the processor.
19.5 Test and Set Lock Instruction
If the CPU provides an atomic way of reading and overwriting an address with a ‘1’ (a test and set lock instruction),
we can create the following:
enter_region:
    tsl register, flag     /* Copy the flag to the register, and set flag to 1 */
    cmp register, 0        /* Was flag zero? */
    jnz enter_region       /* No, loop until it is */
    ret                    /* Yes, we now hold the flag: return */
leave_region:
    mov flag, 0            /* Set the flag to zero */
    ret                    /* and return */

enter_region();
critical_section();
leave_region();
A process can only enter the critical section if the flag is zero, and in checking the flag, it is set to one, thus
preventing anybody else from entering the critical section. This satisfies condition 1.
This also avoids infinite timeslices and the rotation problem. However, we still have busy-waiting, as we loop until
we have the flag, which wastes the CPU. At the same time, all four conditions are satisfied.
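The same idea can be sketched in C, using the C11 atomic_flag type from <stdatomic.h> as a stand-in for the TSL instruction: atomic_flag_test_and_set() sets the flag and returns its previous value in one atomic step. The helper try_enter_region() is an addition for illustration; it makes a single attempt so that the behaviour can be seen without threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

atomic_flag flag = ATOMIC_FLAG_INIT;   /* clear means the critical section is free */

/* One test-and-set attempt: returns true if we obtained the flag. */
bool try_enter_region(void)
{
    /* Atomically set the flag and look at the value it had before. */
    return !atomic_flag_test_and_set(&flag);
}

void enter_region(void)
{
    while (!try_enter_region())
        ;                              /* busy-wait until the flag was clear */
}

void leave_region(void)
{
    atomic_flag_clear(&flag);          /* set the flag back to zero */
}
```

A second attempt made while the flag is held fails, and succeeds again once leave_region() has been called. The loop in enter_region() is still a busy-wait, so the CPU is wasted exactly as described above.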
To prevent busywaiting, we need a way of blocking/unblocking a process if it wants to obtain access to a critical
section, but can’t get it yet. Hint: What normally performs process blocking?
block: 1. vi. To delay or sit idle while waiting for something. “We’re blocking until everyone gets here.”
Compare busy-wait. 2. ‘block on’ vt. To block, waiting for (something). “Lunch is blocked on Phil’s arrival.”
19.6 Semaphores
Semaphores were invented by Edsger Dijkstra in 1965. The operating system provides new objects called
semaphores, each with integer values. The operating system also provides two system call operations on
semaphores:
acquire(sem)
{
    if (sem.count == 0) block process;
    sem.count = 0;         /* When reawakened, lower count back to 0 */
}
release(sem)
{
    sem.count = 1;         /* Raise semaphore’s value to 1 */
    if (any process blocked waiting for sem) unblock one;
}
The operating system performs both operations in kernel mode, and guarantees that both operations are atomic.
Now, with the semaphore count initialised by the operating system to 1, we can do:
acquire(sem);
critical_section();
release(sem);
The first process to acquire the semaphore sets the count to zero. All other processes that try to acquire the
semaphore are blocked. When the process with the semaphore releases it, it sets the count to one, and another
process is unblocked. As soon as it is unblocked, it lowers the count to zero, and thus acquires the semaphore.
This solution satisfies all four conditions, and also avoids busy waiting. Usually, the operating system keeps a queue
for the set of processes blocked on a semaphore, but this is not strictly necessary.
19.7 Monitors
Of course, if a process goes ahead and enters a critical section without using any synchronisation method, problems
will occur. We have to trust that all programmers will do the right thing, and also that the code they produce is
correct.
Hoare (1974) and Brinch Hansen (1975) suggested building synchronisation into the language so that it is invoked without
conscious work by the programmers. The monitor is a collection of code and condition variables which describe
the critical section. The compiler wraps the synchronisation code around the critical section transparently. The
synchronisation code can be semaphores, or whatever.
Not many languages provide monitors. The only one that I can think of off-hand is Java.
void producer(void)
{
int item;
message m; /* message buffer */
void consumer(void)
{
int item, i;
message m;
}
This only works when there are two processes; if there were more, to which process would the tokenholder send
the token?
20 Threads
Textbook reference: Stallings ppg 153 – 192; Tanenbaum & Woodhull ppg 53 – 56
20.1 Introduction
A process is a sequence of instructions executing in an address space, with access to the operating system’s
services. The process consists of: machine instructions, a data area, a stack, and the machine’s registers it is
using.
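[Figure: process memory layout. Non-segmented architecture: machine code at low addresses, then data, with the stack at high addresses. Segmented architecture: separate S, D and C segments.]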
Switching between processes (a context switch) is expensive, as the operating system must save/reload each
process’s registers and change memory protections.
In many instances, a process would like to be able to perform several independent tasks that can be performed
concurrently. Examples of this are database and other servers, and network protocol implementations.
A specific example, a Web server:
If the transmission takes a very long time, the server is not able to answer other clients’ requests. The server could
create a clone server to handle the transmission:
However, if the new server is another process, there is extra context switching overhead. Also, the original server
might cache pages in memory, so it would be useful for the two processes to share memory.
In other situations, some of the tasks to be performed can be done concurrently, but for one stage there is only
one task to do, e.g an image manipulation program:
A thread is a computational unit within a process. Each thread is relatively independent, but may sometimes need
to synchronise with other threads. All threads share a common logical address space. A process, therefore, consists
of one or more threads.
Advantages of threads: context switching between threads has less overhead because memory maps do not need
to be changed. Switching between threads in two different processes has normal overhead. Threads share
memory, and so can share information to perform their tasks. When running on a multiprocessor machine, each
thread can be scheduled to run on a separate CPU, thus increasing performance.
Disadvantages of threads: threads need to have synchronisation primitives available to prevent deadlocks and
other problems. Threads have read/write access to other threads’ memory, so a badly-behaved thread can damage
other threads.
There are several ways of implementing threads. Here are some that are currently in use.
[Figure: a kernel thread's stack, with the kernel's data and machine code, shown above the running process's stack, data and machine code.]
A kernel thread, therefore, doesn’t really have its own memory address space; it borrows the
running process’ and the kernel’s. It does have registers and a stack as its context. Context
switching of kernel threads is fast.
Disadvantages: only the operating system can use kernel threads, as each runs in kernel-mode and the address
space is not tied to a particular process.
20.3 Lightweight Processes
A lightweight process is a kernel-supported user-mode thread. Each LWP has its own context; it does not borrow
address spaces. Threads within a process have the same address space, and so context switching between them is
fast.
Each thread needs its own stack for local variables, but the entire address space is shared between the threads,
giving fast context switches but no inter-stack protection.
[Figure: memory layouts built from stack, data and machine code regions.]
The advantage is inter-stack protection, but a MWP context switch must perform more memory readdressing, so it
is slower. Traditional processes are often known as heavyweight processes.
Extensibility: The system must be able to grow and change as market requirements change.
Portability: The system must be easy to move between different hardware platforms.
Reliability: The system should be robust, and protect itself from internal errors and external tampering.
Compatibility: The system should be able to run applications from previous Microsoft operating systems.
Performance: The system should be as fast and responsive as possible on all hardware platforms.
21.1 Overall Design
NT is a system which uses a micro-kernel approach and a client-server approach in its design. NT is broken into
several sections. The subsystems run as user-mode processes and provide other processes with certain services.
The subsystems are servers, and have the same protection as other processes. Clients access their services via
message passing.
The executive runs in kernel-mode and performs system calls as required by the processes; it also performs
message-passing. Below the executive is the kernel, which handles thread scheduling, multiprocessor
synchronisation, interrupt handling and dispatch, and system recovery on power failure. The hardware abstraction
layer hides the hardware’s complexity from the rest of the executive. The kernel and HAL effectively form the
micro-kernel of the system.
NT also uses the ‘object’ concept to provide access to things in the system: files, memory regions, etc.
Processes get ‘handles’ to objects, and objects can have access control lists to give/deny access rights.
21.2 Environment Subsystems
Instead of performing true system calls, applications under NT obtain services from the user-mode environment
subsystems.
This allows NT to emulate any number of operating system environments. A subsystem, such as the POSIX
subsystem, receives the original request from a POSIX application, translates the request into an NT system call,
and passes the request to the NT executive for servicing. The reply comes back to the POSIX subsystem, and is
translated into a reply fit for the POSIX application.
In this way, the original application is unaware that it is running in an emulated environment. Currently, NT
supports native NT applications, and has environments for POSIX, Win16, Win32, DOS and OS/2 applications.
Processes are protected and have their own logical address space. In a 32-bit address space, a process has 2G of
the address space, and the remaining 2G is reserved for the protected-mode kernel. When a TRAP is done, the
upper 2G becomes valid and the CPU jumps to the kernel’s code.
NT uses pre-emptive thread scheduling, with a priority scheme having 32 priority levels. The upper 16 levels are
reserved for real-time threads, which have fixed priorities. The lower levels are for normal threads, which start
with a particular priority, but move up/down in priority according to their amount of CPU usage.
Threads can be in the following states:
As well as syscalls to create and destroy threads, NT has operations to synchronise threads; these can be used to
protect critical sections, or for threads to wait until other threads have performed certain tasks.
Because context switching is expensive, using user-mode processes to emulate environments causes a
performance penalty. The implementation of NT has minimised the cost of context switches, improved the speed
of message passing, and made other optimisations to help performance.
Interestingly enough, some of the executive’s pages can be paged in/out of main memory. This helps to keep the
working set of the operating system small as well.
21.5 Input/Output
As with most systems, NT uses a layered approach to I/O. For each device, there is a device driver; some drivers
handle a number of devices. Above each driver, there is often some device-independent code, such as a filesystem
or a network stack.
I/O requests are sent via I/O request packets to the I/O manager, which passes them on to the appropriate part of
the executive/kernel. I/O can be done directly to a device driver, bypassing the device-independent code.
Unlike most other systems, NT allows an application to perform asynchronous I/O. Instead of blocking on I/O, the
process continues execution and, at a later time, the completed I/O causes an exceptional event which is handled
by the application.