Data Structures
These lecture notes are designed for on-line reference and review. Please do not print
them on university computing facilities!!
In my opinion, there are only three important ideas which must be mastered to write
interesting programs.
At this point, I expect that you have mastered about 1.5 of these 3.
A data type is a well-defined collection of data with a well-defined set of operations on it.
Example: The abstract data type Set has the operations EmptySet(S), Insert(x,S),
Delete(x,S), Intersection(S1,S2), Union(S1,S2), MemberQ(x,S), EqualQ(S1,S2),
SubsetQ(S1,S2).
This semester, we will learn to implement such abstract data types by building data
structures from arrays, linked lists, etc.
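To make this concrete, here is a rough sketch (mine, not from the notes) of how such a Set
abstraction might be written down as a Modula-3 interface; the name IntSet and the choice of
integer elements are just for illustration:

INTERFACE IntSet; (* hypothetical interface for the Set abstract data type *)
TYPE T <: REFANY; (* the representation stays hidden *)
PROCEDURE EmptySet(): T;
PROCEDURE Insert(x: INTEGER; VAR s: T);
PROCEDURE Delete(x: INTEGER; VAR s: T);
PROCEDURE Intersection(s1, s2: T): T;
PROCEDURE Union(s1, s2: T): T;
PROCEDURE MemberQ(x: INTEGER; s: T): BOOLEAN;
PROCEDURE EqualQ(s1, s2: T): BOOLEAN;
PROCEDURE SubsetQ(s1, s2: T): BOOLEAN;
END IntSet.

Any module implementing this interface is free to choose arrays, linked lists, or anything
else for the representation.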
Modula-3 Programming
Iteration Constructs: REPEAT-UNTIL (at least once), WHILE-DO (at least 0), FOR-DO
(exactly n times).
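As a small illustration (mine, not from the notes), here are the three constructs side by
side; note where each one tests its condition:

MODULE Loops EXPORTS Main;
IMPORT SIO;
VAR i: INTEGER;
BEGIN
FOR j := 1 TO 3 DO SIO.PutInt(j) END; (* runs exactly 3 times *)
i := 4;
WHILE i <= 3 DO SIO.PutInt(i); INC(i) END; (* runs 0 times here: test comes first *)
i := 4;
REPEAT SIO.PutInt(i); INC(i) UNTIL i > 3; (* runs once even though i > 3: test comes last *)
SIO.Nl();
END Loops.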
Arrays - These let you access many similar data items by index. (good)
However, you cannot make arrays bigger if your program decides it needs more space. (bad)
Records - These let you organize non-homogeneous data into logical packages to keep
everything together. (good)
These packages do not include operations, just data fields (bad, which is why we need
objects)
Records do not help you process distinct items in loops (bad, which is why arrays of
records are used)
Sets - These let you represent subsets of a set with such operations as intersection, union,
and equivalence. (good)
Built-in sets are limited to a certain small size. (bad, but we can build our own set data
type out of arrays to solve this problem if necessary)
Subroutines
Subprograms allow us to break programs into units of reasonable size and complexity,
allowing us to organize and manage even very long programs.
This semester, you will first encounter programs big enough that modularization will be
necessary for survival.
Subroutines which call themselves are recursive. Recursion provides a very powerful
way to solve problems which takes some getting used to.
Such standard data structures as linked lists and trees are inherently recursive data
structures.
Parameter Passing
There are two mechanisms for passing data to a subprogram, depending upon whether the
subprogram has the power to alter the data it is given.
In pass by value, a copy of the data is passed to the subroutine, so that no matter what
happens to the copy the original is unaffected.
In pass by reference, the variable argument is renamed, not copied. Thus any changes
within the subroutine affect the original data. These are the VAR parameters.
Example: suppose the subroutine is declared Push(VAR s:stack, e:integer) and called
with Push(t,x). Any changes within Push to e have no effect on x, but changes to s affect t.
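A minimal sketch of the difference (my own example, not from the notes):

MODULE Params EXPORTS Main;
IMPORT SIO;

PROCEDURE Bump(VAR a: INTEGER; b: INTEGER) =
(* a is passed by reference, b by value *)
BEGIN
a := a + 1; (* changes the caller's variable *)
b := b + 1; (* changes only the local copy *)
END Bump;

VAR x, y: INTEGER;
BEGIN
x := 5; y := 5;
Bump(x, y);
SIO.PutInt(x); (* prints 6: x was bound to the VAR parameter *)
SIO.PutInt(y); (* prints 5: y was copied *)
SIO.Nl();
END Params.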
MODULE Prim EXPORTS Main; (* prime number test *)
IMPORT SIO;
VAR candidate, i: INTEGER;
<*FATAL SIO.Error*>
BEGIN
SIO.PutText("Prime number test\n");
REPEAT
SIO.PutText("Please enter a positive number; enter 0 to quit. ");
candidate:= SIO.GetInt();
IF candidate > 2 THEN
i:= 1;
REPEAT
i:= i + 1
UNTIL ((candidate MOD i) = 0) OR (i * i > candidate);
IF (candidate MOD i) = 0 THEN
SIO.PutText("Not a prime number\n")
ELSE
SIO.PutText("Prime number\n")
END; (*IF (candidate MOD i) = 0 ...*)
ELSIF candidate > 0 THEN
SIO.PutText("Prime number\n") (*1 and 2 are prime*)
END; (*IF candidate > 2*)
UNTIL candidate <= 0;
END Prim.
MODULE Euclid EXPORTS Main; (* greatest common divisor by repeated subtraction *)
IMPORT SIO;
VAR
a, b: INTEGER; (* input values *)
x, y: CARDINAL; (* working variables *)
<*FATAL SIO.Error*>
BEGIN (*statement part*)
SIO.PutText("Euclidean algorithm\nEnter 2 positive numbers: ");
a:= SIO.GetInt();
WHILE a <= 0 DO
SIO.PutText("Please enter a positive number: ");
a:= SIO.GetInt();
END; (*WHILE a <= 0*)
b:= SIO.GetInt();
WHILE b <= 0 DO
SIO.PutText("Please enter a positive number: ");
b:= SIO.GetInt();
END; (*WHILE b <= 0*)
x := a; y := b; (* remainder of the algorithm, reconstructed *)
WHILE x # y DO (* subtract the smaller from the larger *)
IF x > y THEN x := x - y ELSE y := y - x END;
END; (*WHILE x # y*)
SIO.PutText("Greatest common divisor: ");
SIO.PutInt(x); SIO.Nl();
END Euclid.
Programming Proverbs
KISS - ``Keep it simple, stupid.'' - Don't use fancy features when simple ones suffice.
RTFM - ``Read the fascinating manual.'' - Most complaints from the compiler can be
solved by reading the book. Logical errors are something else.
Make your documentation short but sweet. - Always document your variable
declarations, and tell what each subprogram does.
Every subprogram should do something and hide something - If you cannot concisely
explain what your subprogram does, it shouldn't exist. This is why I write the header
comments before I write the subroutine.
Program defensively - Add the debugging statements and routines at the beginning, because
you know you are going to need them later.
A good program is a pretty program. - Remember that you will spend more time reading
your programs than we will.
Perfect Shuffles
The table below follows each position in a 52-card deck through eight successive perfect
shuffles: the first column is the position, and column k+1 shows which card sits there after
k shuffles. After eight shuffles, every card is back where it started:
1 1 1 1 1 1 1 1 1
2 27 14 33 17 9 5 3 2
3 2 27 14 33 17 9 5 3
4 28 40 46 49 25 13 7 4
5 3 2 27 14 33 17 9 5
6 29 15 8 30 41 21 11 6
7 4 28 40 46 49 25 13 7
8 30 41 21 11 6 29 15 8
9 5 3 2 27 14 33 17 9
10 31 16 34 43 22 37 19 10
11 6 29 15 8 30 41 21 11
12 32 42 47 24 38 45 23 12
13 7 4 28 40 46 49 25 13
14 33 17 9 5 3 2 27 14
15 8 30 41 21 11 6 29 15
16 34 43 22 37 19 10 31 16
17 9 5 3 2 27 14 33 17
18 35 18 35 18 35 18 35 18
19 10 31 16 34 43 22 37 19
20 36 44 48 50 51 26 39 20
21 11 6 29 15 8 30 41 21
22 37 19 10 31 16 34 43 22
23 12 32 42 47 24 38 45 23
24 38 45 23 12 32 42 47 24
25 13 7 4 28 40 46 49 25
26 39 20 36 44 48 50 51 26
27 14 33 17 9 5 3 2 27
28 40 46 49 25 13 7 4 28
29 15 8 30 41 21 11 6 29
30 41 21 11 6 29 15 8 30
31 16 34 43 22 37 19 10 31
32 42 47 24 38 45 23 12 32
33 17 9 5 3 2 27 14 33
34 43 22 37 19 10 31 16 34
35 18 35 18 35 18 35 18 35
36 44 48 50 51 26 39 20 36
37 19 10 31 16 34 43 22 37
38 45 23 12 32 42 47 24 38
39 20 36 44 48 50 51 26 39
40 46 49 25 13 7 4 28 40
41 21 11 6 29 15 8 30 41
42 47 24 38 45 23 12 32 42
43 22 37 19 10 31 16 34 43
44 48 50 51 26 39 20 36 44
45 23 12 32 42 47 24 38 45
46 49 25 13 7 4 28 40 46
47 24 38 45 23 12 32 42 47
48 50 51 26 39 20 36 44 48
49 25 13 7 4 28 40 46 49
50 51 26 39 20 36 44 48 50
51 26 39 20 36 44 48 50 51
52 52 52 52 52 52 52 52 52
Think about the Patriot missiles which tried to shoot down SCUD missiles in the Persian
Gulf war and think about how difficult it is to produce working software!
Even today, there is great controversy about how well the missiles actually did in the war.
How do you know that your program works? Not by testing it!
``Testing reveals the presence, but not the absence of bugs.'' - Dijkstra
Still, it is important to design test cases which exercise the boundary conditions of the
program.
In the Microsoft Excel group, there is one tester for each programmer! Types of test cases
include:
Boundary cases - Make sure that each line of code and branch of IF is executed at least
once.
Random data - Automatically generated test data can be useful to test user patterns you
might otherwise not consider, but you must verify that the results are correct!
Other users - People who were not involved in writing the program will have vastly
different ideas of how to use it.
Adversaries - People who want to attack your program and have access to the source code
often find bugs by reading it.
Verification
But how can we know that our program works? The ideal way is to mathematically prove
it.
For each subprogram, there is a precise set of preconditions which we assume is satisfied
by the input parameters, and a precise set of post-conditions which are satisfied by the
output parameters.
If we can show that any input satisfying the preconditions is always transformed to output
satisfying the post conditions, we have proven the subprogram correct.
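As a small illustration (mine), pre- and postconditions are usually recorded as comments on
the subprogram header, and the proof obligation is to show the body always carries one to
the other:

PROCEDURE IntSqrt(n: CARDINAL): CARDINAL =
(* Precondition: n >= 0, guaranteed here by the type CARDINAL (and n modest enough
   that (r+1)*(r+1) does not overflow).
   Postcondition: the result r satisfies r*r <= n < (r+1)*(r+1). *)
VAR r: CARDINAL := 0;
BEGIN
WHILE (r + 1) * (r + 1) <= n DO INC(r) END;
RETURN r; (* the loop exit condition is exactly the postcondition *)
END IntSqrt;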
Top-Down Refinement
To correctly build a complicated system requires first setting broad goals and refining
them over time. Advantages include:
A hierarchy hides information - This permits you to focus attention on only a manageable
amount of detail.
With the interfaces defined by the hierarchy, changes can be made without affecting the
rest of the structure - thus systems can be maintained without grinding to a halt.
The best way to build complicated programs is to construct the hierarchy one level at a
time, finally writing the actual functions when the task is small enough to be easily done.
Most of software engineering is just common sense, but it is very easy to ignore common
sense.
Module Build-Military: The first decision is how to organize it, not what type of tank to
buy.
Several different organizations are possible, and in planning we should investigate each
one:
• Offense, Defense
• Army, Navy, Air Force, Marines, Coast Guard ...
To make this more concrete, let's outline how a non-trivial program should be structured.
Suppose that you wanted to write a program to enable a person to play the game
Battleship against a computer.
What is Battleship?
Each side places 5 ships on a grid, and then takes turns guessing grid points
until one side has sunk all of the other side's ships:
For each query, the answer ``hit'', ``miss'', or ``you sunk my battleship'' must be given.
There are two distinct views of the world, one reflecting the truth about the board, the
other reflecting what your opponent knows.
Program: Battleship
Interesting subproblems are: display board, generate query, respond to query, generate
initial configuration, move-loop (main routine).
• Stacks are data structures which maintain the order of last-in, first-out
• Queues are data structures which maintain the order of first-in, first-out
Queues might seem fairer, which is why lines at stores are organized as queues instead of
stacks, but both have important applications in programs as a data structure.
Operations on Stacks
The terminology associated with stacks comes from the spring loaded plate containers
common in dining halls.
A stack is an appropriate data structure for this task since the plates don't care about when
they are used!
Stacks are used to maintain the return points when Modula-3 procedures call other
procedures which call other procedures ...
In the biblical story, Jacob and Esau were twin brothers where Esau was born first and
thus inherited Isaac's birthright. However, Jacob got Esau to give it away for a bowl of
soup, and so Jacob went on to become a patriarch of Israel.
Rashi, a famous 11th century Jewish commentator, explained the problem by saying
Jacob was conceived first, then Esau second, and Jacob could not get around the narrow
tube to assume his rightful place first in line!
• Push(x,s) and Pop(x,s) - Stack s, item x. Note that there is no search operation.
• Initialize(s), Full(s), Empty(s) - The latter two are Boolean queries.
Defining these abstract operations lets us build a stack module to use and reuse without
knowing the details of the implementation.
The easiest implementation uses an array with an index variable to represent the top of
the stack.
An alternative implementation, using linked lists, is sometimes better, for it can never
overflow. Note that we can change the implementation without the rest of the program
knowing!
INTERFACE Stack; (* interface reconstructed from the operations used below *)
TYPE ET = INTEGER; (* type of the elements *)
PROCEDURE Push(elem: ET); (* adds element to top of stack *)
PROCEDURE Pop(): ET; (* removes and returns top element *)
PROCEDURE Full(): BOOLEAN; (* is the stack full? *)
PROCEDURE Empty(): BOOLEAN; (* is the stack empty? *)
END Stack.
Stack Implementation
MODULE Stack;
CONST
Max = 8; (*maximum number of elements on stack*)
TYPE
S = RECORD
info: ARRAY [1 .. Max] OF ET;
top: CARDINAL := 0; (*initialize stack to empty*)
END; (*S*)
VAR stack: S; (* the one stack managed by this module *)
PROCEDURE Push(elem:ET) =
(*adds element to top of stack*)
BEGIN
INC(stack.top); stack.info[stack.top]:= elem
END Push;
PROCEDURE Pop(): ET =
(*removes and returns top element*)
BEGIN
DEC(stack.top); RETURN stack.info[stack.top + 1]
END Pop;
BEGIN
END Stack.
MODULE StackUser EXPORTS Main;
IMPORT SIO;
FROM Stack IMPORT Push, Pop, Full, Empty;
FROM SIO IMPORT PutText, PutInt, GetInt, Nl;
<*FATAL SIO.Error*>
BEGIN
PutText("Stack User. Please enter numbers:\n");
WHILE NOT Full() DO
Push(GetInt()) (*add entered number to stack*)
END;
WHILE NOT Empty() DO
PutInt(Pop()) (*remove number from stack and return it*)
END;
Nl();
END StackUser.
FIFO Queues
Queues are more difficult to implement than stacks, because action happens at both ends.
The easiest implementation uses an array, adds elements at one end, and moves all
elements when something is taken off the queue.
It is very wasteful moving all the elements on each DEQUEUE. Can we do better?
Suppose that we maintain pointers to the first (head) and last (tail) elements in the
array/queue?
Circular Queues
Note that the pointer to the front of the list is now behind the back pointer!
When the queue is full, the two pointers point to neighboring elements.
There are lots of possible ways to adjust the pointers for circular queues. All are tricky!
How do you distinguish full from empty queues, since their pointer positions might be
identical? The easiest way to distinguish full from empty is with a counter of how many
elements are in the queue.
INTERFACE Fifo; (* interface reconstructed from the operations used below *)
TYPE ET = TEXT; (* type of the elements *)
PROCEDURE Enqueue(elem: ET); (* adds element to end *)
PROCEDURE Dequeue(): ET; (* removes and returns first element *)
PROCEDURE Full(): BOOLEAN;
PROCEDURE Empty(): BOOLEAN;
END Fifo.
MODULE Fifo;
CONST
Max = 8; (*maximum number of elements in FIFO queue*)
TYPE
Fifo = RECORD
info: ARRAY [0 .. Max - 1] OF ET;
in, out, n: CARDINAL := 0;
END; (*Fifo*)
VAR w: Fifo; (* the one queue managed by this module *)
PROCEDURE Enqueue(elem:ET) =
(*adds element to end*)
BEGIN
w.info[w.in]:= elem; (*stores new element*)
w.in:= (w.in + 1) MOD Max; (*increments in-pointer in ring*)
INC(w.n); (*increments number of stored elements*)
END Enqueue;
PROCEDURE Dequeue(): ET =
(*removes and returns first element*)
VAR e: ET;
BEGIN
e:= w.info[w.out]; (*removes oldest element*)
w.out:= (w.out + 1) MOD Max; (*increments out-pointer in ring*)
DEC(w.n); (*decrements number of stored elements*)
RETURN e; (*returns the read element*)
END Dequeue;
Utility Routines
BEGIN
END Fifo.
User Module
MODULE FifoUser EXPORTS Main;
IMPORT SIO;
FROM Fifo IMPORT Enqueue, Dequeue, Full, Empty;
FROM SIO IMPORT PutText, GetText, Nl;
<*FATAL SIO.Error*>
BEGIN
PutText("FIFO User. Please enter texts:\n");
WHILE NOT Full() DO
Enqueue(GetText())
END;
WHILE NOT Empty() DO
PutText(Dequeue() & " ")
END;
Nl();
END FifoUser.
Other Queues
Double-ended queues (deques) - These are data structures which support insertion and
deletion at both ends, subsuming both the push/pop and enqueue/dequeue operations.
Although arrays are good things, we cannot adjust the size of them in the middle of the
program.
If our array is too small - our program will fail for large data.
If our array is too big - we waste a lot of space, again restricting what we can do.
The right solution is to build the data structure from small pieces, and add a new piece
whenever we need to make it larger.
Pointers are like telephone numbers:
• To contact someone, you do not have to carry them with you at all times. All you
need is their number.
• Many different people can all have your number simultaneously. All you need do
is copy the pointer.
• More complicated structures can be built by combining pointers. For example,
phone trees or directory information.
Addresses are a more physically correct analogy for pointers, since they really are
memory addresses.
Linked Data Structures
All the dynamic data structures we will build have certain shared properties.
• We need a pointer to the entire object so we can find it. Note that this is a pointer,
not a cell.
• Each cell contains one or more data fields, which is what we want to store.
• Each cell contains a pointer field to at least one ``next'' cell. Thus much of the
space used in linked data structures is not data!
• We must be able to detect the end of the data structure. This is why we need the
NIL pointer.
Pointers in Modula-3
TYPE
pointer = REF node;
node = RECORD
info : item;
next : pointer;
END;
VAR
p,q,r : pointer; (* pointers *)
x,y,z : node; (* records *)
Note circular definition. Modula-3 lets you get away with this because it is a reference
type. Pointers are the same size regardless of what they point to!
We want dynamic data structures, where we make nodes as we need them. Thus
declaring nodes as variables is not the way to go!
Dynamic Allocation
p := NEW(ptype);
NEW(ptype) allocates enough space to store exactly one object of the type ptype. Further,
it returns a pointer to this empty cell.
Before a new or otherwise explicit initialization, a pointer variable has an arbitrary value
which points to trouble!
Warning - initialize all pointers before use. Since you cannot initialize them to explicit
constants, your only choices are
• NIL - meaning explicitly nothing.
• NEW(ptype) - a fresh chunk of memory.
• assignment to some previously initialized pointer of the same type.
Pointer Examples
p^.info := "music";
q^.next := NIL;
The pointer value itself may be copied, which does not change any of the other fields.
Note this difference between assigning pointers and what they point to.
p := q;
We get a real mess. We have completely lost access to music and can't get it back!
Pointers are unidirectional.
Alternatively, we could copy the object being pointed to instead of the pointer itself.
p^ := q^;
Can we really get as much memory as we want without limit just by using New?
No, because there are the physical limits imposed by the size of the memory of the
computer we are using. Usually Modula-3 systems let the dynamic memory come from
the ``other side'' of the ``activation record stack'' used to maintain procedure calls:
Just as the stack reuses memory when a procedure exits, dynamic storage must be
recycled when we don't need it anymore.
Garbage Collection
The Modula-3 system is constantly keeping watch on the dynamic memory which it has
allocated, making sure that something is still pointing to it. If not, there is no way for you
to get access to it, so the space might as well be recycled.
The garbage collector automatically frees up the memory which has nothing pointing to
it.
It frees you from having to worry about explicitly freeing memory, at the cost of leaving
certain structures which it can't figure out are really garbage, such as a circular list.
Explicit Deallocation
Although certain languages like Modula-3 and Java support garbage collection, others
like C++ require you to explicitly deallocate memory when you don't need it.
Dispose(p) is the opposite of New - it takes the object which is pointed to by p and makes
it available for reuse.
Note that each dispose takes care of only one cell in a list. To dispose of an entire linked
structure we must do it one cell at a time.
Of course, it is too late to dispose of music, so it will endure forever without garbage
collection.
Suppose we dispose(p), and later allocate more dynamic memory with NEW. The cell
we disposed of might be reused. Now what does q point to?
Answer - the same location, but it means something else! So called dangling references
are a horrible error, and are the main reason why Modula-3 supports garbage collection.
A dangling reference is like a friend left with your old phone number after you move.
Reach out and touch someone - eliminate dangling references!
Security in Java
Java does not allow one to do such operations on pointers at all. The reason is security.
Pointers allow you access to raw memory locations. In the hands of skilled but evil
people, unchecked access to pointers permits you to modify the operating system's or
other people's memory contents.
Java is a language whose programs are supposed to be transferred across the Internet to
run on your computer. Would you allow a stranger's program to run on your machine if
they could ruin your files?
Linked Stacks and Queues
Lecture 5
Steven S. Skiena
VAR p, q : pointer; (* i.e. p and q are of type REF node *)
dispose(p) returns to the system the memory used by the node pointed to by p. It is not
needed in Modula-3 because of garbage collection.
NIL is the only value a pointer can have which is not an address.
Linked Stacks
The problem with array-based stacks is that the size must be determined at compile
time. Instead, let's use a linked list, with the stack pointer pointing to the top element.
p^.next := top;
top := p;
Note this works even for the first push if top is initialized to NIL!
To pop an item from a linked stack, we just have to reverse the operation.
p := top;
top := top^.next;
p^.next := NIL; (*avoid dangling reference*)
Note again that this works in the boundary case of one item on the stack.
Note that to check we don't pop from an empty stack, we must test whether top = NIL
before using top as a pointer. Otherwise the program crashes with a segmentation fault.
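Putting the fragments together, the complete push and pop might look like this sketch,
assuming the node/pointer declarations from the previous lecture:

VAR top: pointer := NIL; (* the stack is empty exactly when top = NIL *)

PROCEDURE Push(e: item) =
VAR p: pointer := NEW(pointer);
BEGIN
p^.info := e;
p^.next := top; (* works even on the first push *)
top := p;
END Push;

PROCEDURE Pop(): item =
VAR p: pointer; e: item;
BEGIN
<* ASSERT top # NIL *> (* never pop an empty stack! *)
p := top;
top := top^.next;
e := p^.info;
p^.next := NIL; (* avoid dangling reference *)
RETURN e;
END Pop;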
INTERFACE Stacks;
TYPE
T <: REFANY; (*type of stack*)
ET = REFANY; (*type of elements*)
(* procedure headings reconstructed from their use in the client below *)
PROCEDURE Create(): T;
PROCEDURE Push(VAR stack: T; elem: ET);
PROCEDURE Pop(VAR stack: T): ET;
PROCEDURE Empty(stack: T): BOOLEAN;
END Stacks.
MODULE StacksClient EXPORTS Main;
IMPORT Stacks;
IMPORT FractionType;
FROM Stacks IMPORT Push, Pop, Empty;
FROM SIO IMPORT PutInt, PutText, Nl, PutReal, PutChar;
TYPE
Complex = REF RECORD r, i: REAL END;
VAR
stackFraction: Stacks.T:= Stacks.Create();
stackComplex : Stacks.T:= Stacks.Create();
c: Complex;
f: FractionType.T;
BEGIN (*StacksClient*)
PutText("Stacks Client\n");
FOR i:= 1 TO 4 DO
Push(stackFraction, FractionType.Create(1, i)); (*stores numbers 1/i*)
END;
FOR i:= 1 TO 4 DO
Push(stackComplex, NEW(Complex, r:= FLOAT(i), i:= 1.5 * FLOAT(i)));
END;
END StacksClient.
Linked Queues
Queues in arrays were ugly because we need wrap around for circular queues. Linked
lists make it easier.
We need two pointers to represent our queue - one to the rear for enqueue operations,
and one to the front for dequeue operations.
Note that because both operations move forward through the list, no back pointers are
necessary!
To enqueue an item :
p^.next := NIL;
IF back = NIL THEN (* empty queue *)
front := p; back := p;
ELSE (* non-empty queue *)
back^.next := p;
back := p;
END;
To dequeue an item:
p := front;
front := front^.next;
p^.next := NIL;
IF front = NIL THEN back := NIL END; (* now-empty queue *)
Our calculator will do the same, using postfix (reverse Polish) notation. Why? Because it
is the easiest notation to implement!
The rule for evaluation is to read the expression from left to right. When we see a
number, push it on the stack. When we see an operation, pop the top two numbers off
the stack, do the operation, and push the result back on the stack.
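A rough sketch of that rule (mine; it assumes an integer stack with Push/Pop, IMPORT Text,
and a hypothetical StringToInt number parser):

PROCEDURE EvalPostfix(READONLY tokens: ARRAY OF TEXT): INTEGER =
VAR a, b: INTEGER;
BEGIN
FOR i := FIRST(tokens) TO LAST(tokens) DO
IF Text.Equal(tokens[i], "+") THEN
b := Pop(); a := Pop(); Push(a + b);
ELSIF Text.Equal(tokens[i], "-") THEN
b := Pop(); a := Pop(); Push(a - b);
ELSIF Text.Equal(tokens[i], "*") THEN
b := Pop(); a := Pop(); Push(a * b);
ELSE
Push(StringToInt(tokens[i])); (* hypothetical: convert a numeric token *)
END;
END;
RETURN Pop(); (* the result is the only thing left on the stack *)
END EvalPostfix;

For example, the token sequence 3 4 + 2 * evaluates to 14.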
To implement addition, we add digits from right to left, carrying one place whenever the
sum is 10 or more.
Note that the last carry can go beyond one or both numbers, so you must handle this
special case.
A borrow from the leftmost digit is complicated, since that gives a negative number.
This is why I suggest completing addition first before worrying about subtraction.
I recommend testing which number has the larger absolute value, subtracting from that, and
then adjusting the sign accordingly.
There are several possible ways to handle the problem of reading in the input line and
parsing it, i.e. breaking it into its elementary components of numbers and operators.
The way that seems best to me is to read the entire line as one character string in a
variable of type TEXT.
As detailed in your book, you can use the function Text.Length(S) to get the length of this
string, and the function Text.GetChar(S,i) to retrieve any given character.
Useful functions on characters include the function ORD(c), which returns the integer
character code of c. Thus ORD(c) - ORD('0') returns the numerical value of a digit
character.
Standard I/O
The easiest way to read and write from the files is to use I/O redirection from UNIX.
Suppose calc is your binary program, and it expects input from the keyboard and output
to the screen. By running calc < filein at the command prompt, it will take its input from
the file filein instead of the keyboard.
Thus by writing your program to read from regular I/O, you can debug it interactively
and also run my test files.
Programming Hints
We will see a wide variety of different implementations of these operations over the
course of the semester.
Most of these operations should be pretty simple now that you understand pointers!
Search performs better when the item is near the front of the list than the back.
The easiest way to insert a new node p into a linked list is to insert it at the front of the
list:
p^.next := front;
front := p;
To maintain lists in sorted order, however, we will want to insert a node between the two
appropriate nodes. This means that as we traverse the list we must keep pointers to both
the current node and the previous node.
PROCEDURE Create(): T =
(* returns a new, empty list *)
BEGIN
RETURN NIL; (*creation is trivial; empty list is NIL*)
END Create;
Make sure you understand where these cases come from and can verify why all of them
work correctly.
Deletion of a Node
To delete a node from a singly linked-list, we must have pointers to both the node-to-die
and the previous node, so we can reconnect the rest of the list.
Note the passing of a procedure as a parameter - it is legal, and useful to make more
general functions, for example a sort routine for both increasing and decreasing order, or
any order.
BEGIN (* Intlist *)
END Intlist.
Pointers provide, for better or (usually) worse, an alternate way to modify parameters.
Let us look at two different ways to swap the ``values'' of two pointers.
This is perhaps the simplest and best way - we just exchange the values of the pointers...
Alternatively, we could swap the values of what is pointed to, and leave the pointers
unchanged.
In swap2, since we do not change the values of p and q, they do not need to be VAR
parameters!
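Concretely, the two versions might look like this sketch, using the node/pointer
declarations from before:

PROCEDURE Swap1(VAR p, q: pointer) =
(* exchange the pointers themselves; the records do not move *)
VAR r: pointer;
BEGIN
r := p; p := q; q := r;
END Swap1;

PROCEDURE Swap2(p, q: pointer) =
(* exchange the records pointed to; p and q themselves never change *)
VAR t: node;
BEGIN
t := p^; p^ := q^; q^ := t;
END Swap2;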
However, copying the values did not do the same thing as copying the pointers, because
in the first case the physical location of the data changed, while in the second the data
stayed put.
If data which is pointed to moves, the value of what is pointed to can change!
Moral: you must be careful about the side effects of pointer operations!!!
The C language does not have VAR parameters. All side effects are done by passing pointers.
Additional pointer operations in the C language help make this practical.
Programming Style
Although programming style (like writing style) is a somewhat subjective thing, there is a
big difference between good and bad.
The good programmer doesn't just strive for something that works, but something that
works elegantly and efficiently; something that can be maintained and understood by
others.
Just like a good writer rereads and rewrites their prose, a good programmer rewrites their
program to make it cleaner and better.
To get a better sense of programming style, let's critique some representative solutions to
the card-shuffling assignment to see what's good and what can be done better.
times:=times+1;
END card.
There are no variable or block comments. This program would be hard to understand.
This is an ugly looking program - the structure of the program is not reflected by the
white space.
See how the number of cards is hard-coded several times within the body of
the program, instead of being defined just once in a CONST.
VAR
i : INTEGER; (* index variable *)
count : INTEGER; (* COUNT variable *)
BEGIN
REPEAT
count := count + 1;
FOR i := 1 TO 200 DO (* copy shuffled -> tempshuf *)
tempshuf[i] := shuffled[i];
END;
END shufs;
Every subroutine should ``do'' something that is easily described. What does shufs do?
The solution to such problems is to write the block comment describing what the subroutine
does before writing the subroutine itself.
If you can't easily explain what it does, you don't understand it.
TYPE
Array= ARRAY [1..200] OF INTEGER; (*Create an integer array from *)
(*1 to 200 and called Array *)
VAR
original, temp1, temp2: Array; (*Declare original,temp1 and *)
(*temp2 to be Array *)
counter: INTEGER; (*Declare counter to be integer*)
(********************************************************************)
(* This is a procedure called shuffle used to return a number of *)
(* perfect shuffle. It input a number from the main and run the *)
(* program with it and then return the final number of perfect shuffle
*)
(********************************************************************)
BEGIN
...
END Shuffles. (* end the main program called Shuffles *)
This program has many comments which should be obvious to anyone who can read
Modula-3.
More useful would be enhanced block comments telling you what the program does
and how it works.
The ``is it completely reshuffled yet?'' test is done cleanly, although all of the 200 cards
are tested regardless of deck size.
TYPE
nArray = ARRAY[1..n] OF INTEGER; (*n sized deck type*)
twonArray = ARRAY[1..2*n] OF INTEGER; (*2n sized deck type*)
VAR
merged : twonArray; (*merged deck*)
count : INTEGER;
Checkperfect is much more complicated than it need be; just check whether merged[i] =
i, returning as soon as a mismatch is found. No BOOLEAN variable is necessary.
The shuffle is slightly wasteful of space - two extra full arrays instead of two extra half
arrays.
BEGIN
SIO.PutLine("Welcome to Paul's card shuffling program!");
SIO.PutLine(" DECK SIZE NUMBER OF SHUFFLES ");
SIO.PutLine(" _________________________________ ");
num_cards := 2;
REPEAT
counter := 0;
FOR i := 1 TO (num_cards) DO
deck[i] :=i;
END; (*initializes deck*)
REPEAT
deck := Shuffle(deck,num_cards);
INC(counter);
UNTIL deck[2] = 2;
SIO.PutInt(num_cards,16); SIO.PutInt(counter,19);
SIO.PutText("\n");
INC(num_cards,2);
(*increments the number of cards in deck by 2.*)
UNTIL ( num_cards = ((2*n)+2));
END ShuffleCards.
Why do we know that this stopping condition suffices to get all the cards back to their
original positions? This should be proven before being relied upon.
Program Defensively
I am starting to see the wreckage of several programs because students are not building
their programs to be debugged.
• Add useful debug print statements! Have your program describe what it is doing!
• Document what you think your program does! Otherwise, how do you know when
it is doing it?
• Build your program in stages! Thus you localize your bugs, and make sure you
understand simple things before going on to complicated things.
• Use spacing to show the structure of your program. A good program is a pretty
program!
The basic insertion and deletion routines for linked lists are more elegantly written using
recursion.
Often it is necessary to move both forward and backwards along a linked list. Thus we
need another pointer from each node, to make it doubly linked.
Extra pointers allow the flexibility to have both forward and backwards linked lists:
TYPE
pointer = REF node;
node = RECORD
info : item;
front : pointer;
back : pointer;
END;
Insertion
p^.front := r;
p^.back := q;
r^.back := p;
q^.front := p;
It is not absolutely necessary to have pointer r, since r = q^.front, but it makes it cleaner.
The boundary conditions are inserting before the first and after the last element.
How do we insert before the first element in a doubly linked list (head)?
p^.back := NIL;
p^.front := head;
head^.back := p;
head := p; (* must point to entire structure *)
Inserting at the end is similar, except head doesn't change, and a back pointer is set to
NIL.
Recursion
Elegant recursive procedures seem to work by magic, but the magic is the same reason
mathematical induction works!
Example: Prove that 1 + 2 + ... + n = n(n+1)/2 by induction.
Example: All horses are the same color! (be careful of your basis cases!)
The Tower of Hanoi
Combinatorial Objects
Many mathematical objects have simple recursive definitions which can be exploited
algorithmically.
Example: How can we build all subsets of n items? Build all subsets of n-1 items, copy
the subsets, and add item n to each of the subsets in one copy but not the other.
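A sketch of this construction (mine; it assumes IMPORT SIO and Fmt), printing each subset
of {1..n} once without item n and once with it:

PROCEDURE Subsets(n: CARDINAL; chosen: TEXT := "") =
BEGIN
IF n = 0 THEN
SIO.PutText("{" & chosen & "} "); (* base case: print one finished subset *)
ELSE
Subsets(n - 1, chosen); (* the copy without item n *)
Subsets(n - 1, chosen & " " & Fmt.Int(n)); (* the copy with item n *)
END;
END Subsets;

Calling Subsets(3) prints all 2^3 = 8 subsets of {1, 2, 3}.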
Once you start thinking recursively, many things have simpler formulations, such as
traversing a linked list or binary search.
Gray codes
We saw how to generate subsets recursively. Now let us generate them in an interesting
order.
Obviously, all subsets must differ in at least one element, or else they would be identical.
An order where consecutive subsets differ by exactly one element is called a Gray code.
For example, for n = 2: {}, {1}, {1,2}, {2}.
Think about the base cases, the small cases where the problem is simple enough to solve.
Think about the general case, which you can solve if you can solve the smaller cases.
Unfortunately, many of the simple examples of recursion are equally well done by
iteration, making students suspicious.
Further, many of these classic problems have hidden costs which make recursion seem
expensive, but don't be fooled!
Factorials
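The classic first example; a minimal Modula-3 version (mine):

PROCEDURE Factorial(n: CARDINAL): CARDINAL =
BEGIN
IF n = 0 THEN
RETURN 1 (* base case: 0! = 1 *)
ELSE
RETURN n * Factorial(n - 1) (* recursive case: a strictly smaller problem *)
END;
END Factorial;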
Implementing Recursion
Part of the mystery of recursion is the question of how the machine keeps everything
straight.
The answer is that whenever a procedure or function is called, the local variables are
pushed on a stack, so the new recursive call is free to use them.
When a procedure ends, the variables are popped off the stack to restore them to where
they were before the call.
Thus the space used is equal to the depth of the recursion, since stack space is reused.
Tail Recursion
Tail recursion costs space, but not time. It can be removed mechanically, and some
compilers do so.
The overhead of recursion vs. maintaining your own stack is too small to worry about.
By being clever, you can sometimes save stack space. Consider the following variation of
Quicksort:
Qsort(l, h)
    p = partition(l, h)
    if (p - l < h - p) then
        Qsort(l, p-1); Qsort(p+1, h) (* smaller half first *)
    else
        Qsort(p+1, h); Qsort(l, p-1)
By doing the smaller half first (the larger half becomes a removable tail call), the
maximum stack depth is O(log n) in the worst case.
Applications of Recursion
You may say, ``I just want to get a job and make lots of money. What can recursion do
for me?''
• Backtracking
• Game Tree Search
• Recursive Descent Compilation
For example, how can we put n queens on an n x n board so that no two queens attack
each other?
Tree Pruning
Backtracking really pays off when we can prune a node early in the search tree,
eliminating all of its descendants from consideration.
There are C(64, 8) = 4,426,165,368 total sets of eight squares, but no two queens can be
in the same row.
There are 8^8 = 16,777,216 ways to place eight queens in different rows. However, since
no two queens can be in the same column, there are only 8! permutations of columns, or
only 40,320 possibilities.
We must also be clever to test as quickly as possible that the new queen does not violate a
diagonal constraint.
Applications of Recursion
Lecture 11
Steven S. Skiena
Game Trees
Chess playing programs work by constructing a tree of all possible moves from a given
position, so as to select the best possible path.
The player alternates at each level of the tree, but at each node the player whose move it
is picks the path that is best for them.
A player has a forced loss if they are led down a path where the other player wins
whenever they play correctly.
This is a recursive problem, since we can always maximize by just changing perspective
at each level of the tree (the minimax principle).
In a game like chess, we will never reach the bottom of the tree, so we must stop at a
particular depth.
Alpha-beta Pruning
Sometimes we don't have to look at the entire game tree to get the right answer:
No matter what the red score is, it cannot help max and thus need not be looked at.
To do either, we need a precise description of the language, a BNF grammar which gives
the syntax. A grammar for Modula-3 is given throughout your text.
Our compiler will follow the grammar to break the program into smaller and smaller
pieces.
When the pieces get small enough, we can spit out the appropriate chunk of assembly
code.
To avoid getting into infinite loops, we place our trust in the fellow who wrote the
grammar. Proper design can ensure that there are no such troubles.
Example: Infinite Precision Integers. Data: Linked list of digits with sign bit. Operations:
Print number, Read Number, Add, Subtract, Multiply, Divide, Exponent, Modulo,
Compare.
Abstract data types add clarity by separating the definitions from the implementations.
Information Hiding - We should be able to define a type or procedure in one module and
forbid using it in another! Thus we can clearly separate the definition of an abstract data
type from its implementation!
Modula-3 supports all of these goals by separating interfaces (.i3 files) from
implementations (.m3 files).
(* You can insert money with "Deposit". The only other permissible
operation is smashing the piggy bank to get the ``money back''
The procedure "Smash" returns the sum of all deposited amounts
and makes the piggy bank unusable.
*)
END PiggyBank.
Note that this interface does not reveal where or how the total value is stored, nor how to
initialize it.
These are issues to be dealt with within the implementation of the module.
BEGIN
contents := 0 (* initialization of state variables in body *)
END PiggyBank.
MODULE Saving EXPORTS Main;
FROM PiggyBank IMPORT Deposit, Smash;
FROM SIO IMPORT PutText, PutInt, GetInt, Nl, Error;
VAR cash: INTEGER;
<*FATAL Error*>
BEGIN (* Saving *)
PutText("Amount of deposit (negative smashes the piggy bank): \n");
REPEAT
cash := GetInt();
IF cash >= 0 THEN
Deposit(cash)
ELSE
PutText("The smashed piggy bank contained $");
PutInt(Smash());
Nl()
END;
UNTIL cash < 0
END Saving.
Exports describes what we are willing to make public, ultimately including the ``MAIN''
program.
By naming files with the same .m3 and .i3 names, the ``ezbuild'' make command can start
from the file with the main program, and find all other relevant files.
Ideally, the interface file should hide as much detail about the internal implementation of
a module from its users as possible. This is not easy without sophisticated language
features.
TYPE T = RECORD
num : INTEGER;
den : INTEGER;
END;
END Fraction.
Note that there is a dilemma here. We must make type T public so these procedures can
use it, but we would like to prevent users from accessing (or even knowing about) the
fields num and den directly.
Modula-3 permits one to declare subtypes of types, A <: B, which means that anything of
type A is of type B, but something of type B is not necessarily of type A.
REFANY is a pointer type which is a supertype of any other pointer. Thus a variable of
type REFANY can store a copy of any other pointer.
This enables us to define public interface files without actually revealing the guts of the
fraction type implementation.
END FractionType.
Somewhere within a module we must reveal the implementation of type T. This is done
with a REVEAL statement:
...
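The revelation might look like the following sketch (an assumption consistent with the
num/den fields above; Modula-3 requires the concrete type to be BRANDED):

REVEAL
T = BRANDED REF RECORD
num: INTEGER; (* numerator *)
den: INTEGER; (* denominator *)
END;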
With generic pointers, it becomes necessary for type checking to be done at run-time,
instead of at compile-time as done to date.
This gives more flexibility, but much more room for you to hang yourself. For example:
TYPE
Student = REF RECORD lastname,firstname:TEXT END;
Address = REF RECORD street:TEXT; number:CARDINAL END;
VAR
r1 : Student;
r2 := NEW(Student, firstname:="Julie", lastname:="Tall");
adr := NEW(Address, street:="Washington", number:=21);
any : REFANY;
BEGIN
any := r2; (* always a safe assignment *)
r1 := any; (* legal because any is of type student *)
adr := any; (* produces a run-time error, not compile-time
*)
You should worry about the ideas behind generic implementations (why does Modula-3
do it this way?) more than the syntactic details (how does Modula-3 let you do this?). It is
very easy to get overwhelmed by the detail.
Generic Types
When we think about the abstract data type ``Stack'' or ``Queue'', the implementation of
the data structure is pretty much the same whether we have a stack of integers or reals.
Without generic types, we are forced to declare the type of everything at compile time.
Thus we need two distinct sets of functions, like PushInteger and PushReal, for each
operation, which is wasteful.
Object-Oriented Programming
Lecture 13
Steven S. Skiena
This provides only an alternate notation for dealing with things, but different notations
can sometimes make it easier to understand things - the history of Calculus is an example.
Objects do a great job of encapsulating the data items within, because the only access to
them is through the methods, or associated procedures.
Stack Object
MODULE StackObj EXPORTS Main;
IMPORT SIO;
(* the procedures Push, Pop, and Empty that implement the methods are omitted in this excerpt *)
TYPE
ET = INTEGER; (*Type of elements*)
Stack = OBJECT
top: Node := NIL; (*points to stack*)
METHODS
push(elem:ET):= Push; (*Push implements push*)
pop() :ET:= Pop; (*Pop implements pop*)
empty(): BOOLEAN:= Empty; (*Empty implements empty*)
END; (*Stack*)
Node = REF RECORD
info: ET; (*stands for any information*)
next: Node (*points to the next node in the stack*)
END; (*Node*)
VAR
stack1, stack2: Stack := NEW(Stack); (*2 stack objects created*)
i1, i2: INTEGER;
BEGIN
stack1.push(2); (*2 pushed onto stack1*)
stack2.push(6); (*6 pushed onto stack2*)
i1:= stack1.pop(); (*pop element from stack1*)
i2:= stack2.pop(); (*pop element from stack2*)
SIO.PutInt(i1);
SIO.PutInt(i2);
SIO.Nl();
END StackObj.
Object-Oriented Programming
Inheritance
When we define an object type (class), we can specify that it be derived from (subtype to)
another class. For example, we can specialize the Stack object into a GarbageCan:
TYPE
GarbageCan = Stack OBJECT
METHODS
dump():= RemoveAll; (* new method: discard everything from can *)
OVERRIDES
pop:= Yech; (* remove something from can?? *)
END; (*GarbageCan*)
The GarbageCan type is a form of stack (you can still push into it the same way), but we
have overridden the pop method and added a new dump method.
How might object-oriented programming ideas have helped in writing the calculator
program?
Many of you noticed that the linked stack type was similar to the long integer type, and
wanted to reuse the code from one in another.
The following type hierarchy shows one way we could have exploited this, by creating
special stack methods push and pop, and overriding the add and subtract methods for
general long-integers.
However, you should see why inheritance can be a big win in organizing larger programs.
Simulations
Lecture 14
Steven S. Skiena
Simulations
Often, a system we are interested in may be too complicated to readily understand, and
too expensive or big to experiment with.
• What direction will an oil spill move in the Persian Gulf, given certain weather
conditions?
• How much will increases in the price of oil change the American unemployment
rate?
• How much traffic can an airport accept before long delays become common?
We can often get good insights into hard problems by performing mathematical
simulations.
Scoring in Jai-alai
Jai-alai is a Basque variation of handball, which is important because you can bet on it in
Connecticut. What is the best way to bet?
The scoring system in use in Connecticut is very interesting. Eight players or teams
appear in each match, numbered 1 to 8. The players are arranged in a queue, and the top
two players in the queue play each other. The winner gets a point and keeps playing, the
loser goes to the end of the queue. Winner is the first one to get to 7 points.
This scoring obviously favors the low numbered players. For fairness, after the first trip
through the queue, each point counts two.
Simulating Jai-Alai
1 PLAYS 2
1 WINS THE POINT, GIVING HIM 1
1 PLAYS 3
3 WINS THE POINT, GIVING HIM 1
4 PLAYS 3
3 WINS THE POINT, GIVING HIM 2
5 PLAYS 3
3 WINS THE POINT, GIVING HIM 3
6 PLAYS 3
3 WINS THE POINT, GIVING HIM 4
7 PLAYS 3
3 WINS THE POINT, GIVING HIM 5
8 PLAYS 3
8 WINS THE POINT, GIVING HIM 1
8 PLAYS 2
2 WINS THE POINT, GIVING HIM 2
1 PLAYS 2
2 WINS THE POINT, GIVING HIM 4
4 PLAYS 2
2 WINS THE POINT, GIVING HIM 6
5 PLAYS 2
5 WINS THE POINT, GIVING HIM 2
5 PLAYS 6
5 WINS THE POINT, GIVING HIM 4
5 PLAYS 7
7 WINS THE POINT, GIVING HIM 2
3 PLAYS 7
7 WINS THE POINT, GIVING HIM 4
8 PLAYS 7
7 WINS THE POINT, GIVING HIM 6
1 PLAYS 7
7 WINS THE POINT, GIVING HIM 8
WIN-PLACE-SHOW IS 7 2 3
We can simulate a lot of games and see how often each player wins the game!
But when player A plays a point against player B, how do we decide who wins? If the
players are all equally matched, we can flip a coin to decide. We can use a random
number generator to flip the coin for us!
Simulation Results
Compare these to the actual win results from Berenson's Jai-alai 1983-1986:
Is our model any good? Yes, but not good enough to bet with! The matchmakers put the
best players in the middle, so as to even out the results. A more complicated model will
be necessary for better results.
Limitations of Simulations
Although simulations are good things, there are several reasons to be skeptical of any
results we get.
After all, we wrote the simulation because we do not know the answers! How do you
debug a simulation of two galaxies colliding or the effect of oil price increases on the
economy?
We have shown that random numbers are useful for simulations, but how do we get
them?
First we must realize that there is a philosophical problem with generating random
numbers on a deterministic machine.
This is quite difficult - people are lousy at picking random numbers. Note that the
following sequence produces 0's and 1's with equal frequency but does not look like a fair
coin: 0 1 0 1 0 1 0 1 0 1 ...
Von Neumann suggested generating random numbers by taking a big integer, squaring it,
and using the middle digits as the seed/random number.
It looks random to me... But what happens when the middle digits just happen to be
0000000000? From then on, all digits will be zeros!
Linear Congruential Generators
The most popular random number generators, because of simplicity, quality, and small
state requirements are linear congruential generators.
The quality of the numbers generated depends upon careful selection of the seed x(0) and
the constants a, c, and m in the rule x(n+1) = (a * x(n) + c) MOD m.
Why does it work? Clearly, the numbers are between 0 and m-1. Taking the remainder
mod m is like seeing where a roulette ball drops in a wheel with m slots.
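A minimal sketch of such a generator (mine; the constants a = 25173, c = 13849,
m = 65536 are one classic published choice, not necessarily the ones from lecture):

MODULE Rand EXPORTS Main;
IMPORT SIO;
CONST
A = 25173; (* multiplier a *)
C = 13849; (* increment c *)
M = 65536; (* modulus m = 2^16 *)
VAR seed: INTEGER := 12345; (* any starting value in [0, M-1] *)

PROCEDURE Next(): INTEGER =
(* returns the next pseudorandom number in [0, M-1] *)
BEGIN
seed := (A * seed + C) MOD M;
RETURN seed;
END Next;

BEGIN
FOR i := 1 TO 5 DO SIO.PutInt(Next()); SIO.Nl() END;
END Rand.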
Asymptotics
Lecture 15
Steven S. Skiena
Analyzing Algorithms
There are often several different algorithms which correctly solve the same problem.
How can we choose among them? There can be several different criteria:
• Ease of implementation
• Ease of understanding
• Efficiency in time and space
The first two are somewhat subjective. However, efficiency is something we can study
with mathematical analysis, and gain insight as to which is the fastest algorithm for a
given problem.
What would we like as the result of the analysis of an algorithm? We might hope for a
formula describing exactly how long a program implementing it will run.
Example: Binary search will take c * log2(n) milliseconds, for some constant c, on an
array of n elements.
This would be great, for we could predict exactly how long our program will take. But it
is not realistic for several reasons:
1. Dependence on the machine - The same program will run faster on a CRAY than a PC.
Maybe binary search will now take a different number of ms?
2. Dependence on language/compiler - Should our time analysis change when
someone uses an optimizing compiler?
3. Dependence of the programmer - Two different people implementing the same
algorithm will result in two different programs, each taking slightly differed
amounts of time.
4. Should your time analysis be average or worst case? - Many algorithms return
answers faster in some cases than others. How did you factor this in? Exactly
what do you mean by average case?
5. How big is your problem? - Sometimes small cases must be treated different from
big cases, so the same formula won't work.
For all of these reasons, we cannot hope to analyze the performance of programs
precisely. We can analyze the underlying algorithm, but at a less precise level.
Example: Binary search will use about log2(n) iterations, where each iteration takes time
independent of n, to search an array of n elements in the worst case.
Note that this description is true for all binary search programs regardless of language,
machine, and programmer.
By describing the worst case instead of the average case, we saved ourselves some nasty
analysis. What is the average case?
Everyone knows two different algorithms for multiplication: repeated addition and digit-
by-digit multiplication.
How many additions can we do in the worst case? The biggest n-digit number is all nines,
which is 10^n - 1.
The total time complexity is the cost per addition times the number of additions, so the
total complexity is on the order of n * 10^n.
Digit-by-Digit Multiplication
Since multiplying one digit by one other digit can be done by looking up in a
multiplication table (2D array), each step requires a constant amount of work.
Thus to multiply an n-digit number by one digit requires ``about'' n steps. With m ``extra''
zeros (in the worst case), ``about'' n + m steps certainly suffice.
We must do m such multiplications and add them up - each add costs as much as the
multiplication.
Which is faster?
Clearly the repeated addition method is much slower by our analysis, and the difference
is going to increase rapidly with n...
Further, it explains the decline and fall of Roman empire - you cannot do digit-by-digit
multiplication with Roman numbers!
We need a way of exactly comparing approximately defined functions. This is the big Oh
Notation:
If f(n) and g(n) are functions defined for positive integers, then f(n) = O(g(n)) means that
there exists a constant c such that f(n) <= c * g(n) for all sufficiently large positive
integers n.
The idea is that if f(n)=O(g(n)), then f(n) grows no faster (and possibly slower) than g(n).
Note this definition says nothing about algorithms - it is just a way to compare numerical
functions!
Examples
In the big Oh Notation, multiplicative constants and lower order terms are unimportant.
Exponents are important.
The following functions are different according to the big Oh notation, and are ranked in
increasing order: log n, n, n log n, n^2, n^3, 2^n.
Quadratic growth
Exponential growth
Suppose I find two algorithms, one of which does twice as many operations in solving the
same problem. I could get the same job done as fast with the slower algorithm if I buy a
machine which is twice as fast.
But if my algorithm is faster by a big Oh factor, then no matter how much faster you make
the machine running the slow algorithm, the fast-algorithm, slow-machine combination
will eventually beat the slow-algorithm, fast-machine combination.
I can search faster than a supercomputer for a large enough dictionary, If I use binary
search and it uses sequential search!
An Application: The Complexity of Songs
Suppose we want to sing a song which lasts for n units of time. Since n can be large, we
want to memorize songs which require only a small amount of brain space, i.e. memory.
Let S(n) be the space complexity of a song which lasts for n units of time.
The amount of space we need to store a song can be measured in either the words or
characters needed to memorize it.
What bounds can we establish on S(n)? S(n) = O(n), since in the worst case we must
explicitly memorize every word we sing - ``The Star-Spangled Banner''
The Refrain
Most popular songs have a refrain, which is a block of text which gets repeated after each
stanza in the song:
Refrains made a song easier to remember, since you memorize it once yet sing it O(n)
times. But do they reduce the space complexity?
Then the space complexity is still O(n) since it is only halved (if the verse-size = refrain-
size):
The k Days of Christmas
On the First Day of Christmas, my true love gave to me, a partridge in a pear tree
All you must remember in this song is this template, the list of gifts, and the current value
of n. The storage size for n depends on its value, but O(log n) bits suffice.
Uh-huh, uh-huh
Reference: D. Knuth, `The Complexity of Songs', Comm. ACM, April 1984, pp.18-24
Introduction to Sorting
Lecture 16
Steven S. Skiena
Sorting
Issues in Sorting
Increasing or Decreasing Order? - The same algorithm can be used for both; all we need
do is change '<=' to '>=' in the comparison function as we desire.
What about equal keys? - Does the order matter or not? Maybe we need to sort on
secondary keys, or leave in the same order as the original permutations.
What about non-numerical data? - Alphabetizing is sorting text strings, and libraries
have very complicated rules concerning punctuation, etc. Is Brown-Williams before or
after Brown, America? Before or after Brown, John?
We can ignore all three of these issues by assuming a comparison function which
depends on the application. Compare (a,b) should return ``<'', ``>'', or ''=''.
Applications of Sorting
One reason why sorting is so important is that once a set of items is sorted, many other
problems become easy.
Searching - Binary search lets you test whether an item is in a dictionary in O(log n) time.
Closest pair - Given n numbers, find the pair which are closest to each other.
Once the numbers are sorted, the closest pair will be next to each other in sorted order, so
an O(n) linear scan completes the job.
Element uniqueness - Given a set of n items, are they all unique or are there any
duplicates? Sort them; any duplicates will end up adjacent in sorted order.
Frequency distribution (mode) - Given a set of n items, which element occurs the largest
number of times?
Sort them and do a linear scan to measure the length of all adjacent runs.
Selection - Once the keys are placed in sorted order in an array, the kth largest can be
found in constant time by simply looking in the kth position of the array.
Selection Sort
In my opinion, the most natural and easiest sorting algorithm is selection sort, where we
repeatedly find the smallest element, move it to the front, then repeat...
* 5 7 3 2 8
2 * 7 3 5 8
2 3 * 7 5 8
2 3 5 * 7 8
2 3 5 7 * 8
If elements are in an array, swap the first with the smallest element- thus only one array is
necessary.
If elements are in a linked list, we must keep two lists, one sorted and one unsorted, and
always add the new element to the back of the sorted list.
TYPE
Array = ARRAY [1..N] OF TEXT;
VAR
a: Array; (*the array in which to search*)
x: TEXT; (*auxiliary variable*)
last, (*last valid index *)
min: INTEGER; (* current minimum*)
BEGIN
...
...
END SimpleSort.
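The elided loop body might look like the following sketch (mine; it assumes IMPORT Text
for comparing the TEXT keys, with last holding the index of the final element):

FOR i := 1 TO last - 1 DO
min := i; (* find the smallest remaining element *)
FOR j := i + 1 TO last DO
IF Text.Compare(a[j], a[min]) < 0 THEN min := j END;
END;
x := a[i]; a[i] := a[min]; a[min] := x; (* swap it to the front *)
END;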
One interesting observation is that selection sort always takes the same time no matter
what the data we give it is! Thus the best case, worst case, and average cases are all the
same!
To find the largest takes (n-1) steps, to find the second largest takes (n-2) steps, to find
the third largest takes (n-3) steps, ... to find the last largest takes 0 steps, for a total of
(n-1) + (n-2) + ... + 1 + 0 = n(n-1)/2 comparisons.
An advantage of the big Oh notation is the fact that the O(n^2) worst case time is
obvious - we have n loops of at most n steps each.
If instead of time we count the number of data movements, there are n-1, since there is
exactly one swap per iteration.
Insertion Sort
In insertion sort, we repeatedly add elements to a sorted subset of our data, inserting the
next element in order:
* 5 7 3 2 8
5 * 7 3 2 8
3 5 * 7 2 8
2 3 5 * 7 8
2 3 5 7 * 8
InsertionSort(A)
    for i = 1 to n-1 do
        j = i
        while (j > 0) and (A[j] < A[j-1]) do
            swap(A[j], A[j-1])
            j = j - 1
In inserting the element in the sorted section, we might have to move many elements to
make room for it.
If the elements are in an array, we scan from bottom to top until we find the position j
where the new element belongs, then move everything from j+1 to the end down one to
make room.
If the elements are in a linked list, we do the sequential search until we find where the
element goes, then insert the element there. No other elements need move!
Since we do not necessarily have to scan the entire sorted section of the array, the best,
worst, and average cases for insertion sort all differ!
Best case: the element always gets inserted at the end, so we don't have to move
anything, and only compare against the last sorted element. We have (n-1) insertions,
each with exactly one comparison and no data moves per insertion!
What is this best case permutation? It is when the array or list is already sorted! Thus
insertion sort is a great algorithm when the data has previously been ordered, but slightly
messed up.
Worst case: the element always gets inserted at the front, so all the sorted elements must
be moved at each insertion. The ith insertion requires (i-1) comparisons and moves, so
the total is 0 + 1 + ... + (n-1) = n(n-1)/2.
What is the worst case permutation? When the array is sorted in reverse order.
This is the same number of comparisons as with selection sort, but uses more movements.
The number of movements might get important if we were sorting large records.
Average Case: If we were given a random permutation, the ith insertion on average scans
about half of the sorted section, giving roughly half the worst-case total - still O(n^2).
Can we find a sorting algorithm which does significantly better than comparing each pair
of elements? If not, we are doomed to quadratic time complexity....
Logarithms
It is important to understand deep in your bones what logarithms are and where they
come from.
Saying b^x = y is equivalent to saying that x = log_b(y).
Exponential functions, like the amount owed on an n-year mortgage at an interest rate of
c% per year, are functions which grow distressingly fast. Thus inverse exponential
functions, i.e. logarithms, grow refreshingly slowly.
If you have an algorithm which runs in O(log n) time, take it, because this is blindingly
fast even on very large instances.
Properties of Logarithms
The base of the logarithm makes only a constant-factor difference, since
log_a(n) = log_b(n) / log_b(a), and 1 / log_b(a) is just a constant.
Logarithms turn multiplication into addition, since log(x*y) = log(x) + log(y), and
exponentiation into multiplication, since log(x^k) = k * log(x).
Any exponential dominates every polynomial. This is why we will seek to avoid
exponential time algorithms.
2F1.1. Fraud and Deceit; Forgery; Offenses Involving Altered or Counterfeit Instruments
other than Counterfeit Bearer Obligations of the United States.
(1) If the loss exceeded $2,000, increase the offense level as follows:
The federal sentencing guidelines are designed to help judges be consistent in assigning
punishment. The time-to-serve is a roughly linear function of the total level.
However, notice that the increase in level as a function of the amount of money you steal
grows logarithmically in the amount of money stolen.
This very slow growth means it pays to commit one crime stealing a lot of money, rather
than many small crimes adding up to the same amount of money, because the time to
serve if you get caught is much less.
The Moral: ``if you are gonna do the crime, make it worth the time!''
Mergesort
Given two sorted lists with a total of n elements, at most n-1 comparisons are required to
merge them into one sorted list. Repeatedly compare the top elements on each list.
Example: merging the lists (3, 10, 23) and (6, 18, 42) takes at most 3 + 3 - 1 = 5 comparisons.
Fine, but how do we get the smaller sorted lists to start with? We do merges of even
smaller lists!
Working backwards, we eventually get to lists of one element, which are by definition
sorted!
Mergesort Example
Note that on each iteration, the size of the sorted lists doubles, from 1 to 2 to 4 to 8 to 16
... to n.
How many doublings (or iterations) does it take before the entire array of size n is sorted?
Answer: about log2(n).
This is always less than n comparisons per stage!!! If we make at most n comparisons in
each of the log2(n) stages, the total work is O(n log n).
How much extra space (over the space used to represent the input elements) do we need
to do mergesort?
It is easy to merge two sorted linked lists without using any extra space.
However, to merge two sorted arrays (or portions of an array), we must use a third array
to store the result of the merge. This avoids stepping on elements we still need:
QuickSort
Although Mergesort is O(n log n), it is somewhat inconvenient to implement
using arrays, since we need extra space to merge.
In practice, the fastest sorting algorithm is Quicksort, which uses partitioning as its main
idea.
17 12 6 19 23 8 5 10 - before
6 8 5 10 23 19 12 17 - after
Partitioning places all the elements less than the pivot in the left part of the array, and all
elements greater than the pivot in the right part of the array. The pivot fits in the slot
between them.
Note that the pivot element ends up in the correct place in the total order!
Once we have selected a pivot element, we can partition the array in one linear scan, by
maintaining three sections of the array: < pivot, > pivot, and unexplored.
| 17 12 6 19 23 8 5 | 10
| 5 12 6 19 23 8 | 17
5 | 12 6 19 23 8 | 17
5 | 8 6 19 23 | 12 17
5 8 | 6 19 23 | 12 17
5 8 6 | 19 23 | 12 17
5 8 6 | 23 | 19 12 17
5 8 6 ||23 19 12 17
5 8 6 10 19 12 17 23
As we scan from left to right, we move the left bound to the right when the element is
less than the pivot, otherwise we swap it with the rightmost unexplored element and
move the right bound one step closer to the left.
Since the partitioning step consists of at most n swaps, it takes time linear in the number
of keys. But what does it buy us?
1. The pivot element ends up in the position it retains in the final sorted order.
2. After a partitioning, no element flops to the other side of the pivot in the final
sorted order.
Thus we can sort the elements to the left of the pivot and the right of the pivot
independently!
This gives us a recursive sorting algorithm, since we can use the partitioning approach to
sort each subproblem.
Quicksort Implementation
(* See Chapter 14 for the explanation of the file handling and Chapter
   15 for exception handling, which is used in this example. *)
MODULE Quicksort EXPORTS Main;

IMPORT SIO, SF;

VAR
out: SIO.Writer;

TYPE
ElemType = INTEGER;

VAR
array: ARRAY [1 .. 10] OF ElemType;

PROCEDURE InArray(VAR a: ARRAY OF ElemType) RAISES {SIO.Error} =
(*reconstructed helper: reads NUMBER(a) integers; the original body
  is not shown in these notes*)
BEGIN
FOR i := FIRST(a) TO LAST(a) DO a[i] := SIO.GetInt() END;
END InArray;

PROCEDURE OutArray(READONLY a: ARRAY OF ElemType) =
(*reconstructed helper: writes to the console; the book's version
  directs output to the writer out*)
BEGIN
FOR i := FIRST(a) TO LAST(a) DO SIO.PutInt(a[i]) END;
SIO.Nl();
END OutArray;

PROCEDURE Quicksort(VAR a: ARRAY OF ElemType; left, right: INTEGER) =
VAR i, j: INTEGER; x, w: ElemType;
BEGIN
(*Partitioning:*)
i:= left;                              (*i iterates upwards from left*)
j:= right;                             (*j iterates down from right*)
x:= a[(left + right) DIV 2];           (*x is the middle element*)
REPEAT
WHILE a[i] < x DO INC(i) END;          (*skip elements < x in left part*)
WHILE a[j] > x DO DEC(j) END;          (*skip elements > x in right part*)
IF i <= j THEN
w:= a[i]; a[i]:= a[j]; a[j]:= w;       (*swap a[i] and a[j]*)
INC(i);
DEC(j);
END; (*IF i <= j*)
UNTIL i > j;
(*Recursion: sort the two parts independently*)
IF left < j THEN Quicksort(a, left, j) END;
IF i < right THEN Quicksort(a, i, right) END;
END Quicksort;
BEGIN
TRY                                     (*catches bad file format*)
InArray(array); (*read an array in*)
out:= SF.OpenWrite(); (*create output file*)
OutArray(array); (*output the array*)
Quicksort(array, 0, NUMBER(array) - 1); (*sort the array*)
OutArray(array); (*display the array*)
SF.CloseWrite(out); (*close output file to
make it permanent*)
EXCEPT
SIO.Error => SIO.PutLine("bad file format");
END; (*TRY*)
END Quicksort.
Since each element ultimately ends up in the correct position, the algorithm correctly
sorts. But how long does it take?
The best case for divide-and-conquer algorithms comes when we split the input as evenly
as possible. Thus in the best case, each subproblem is of size n/2.
The partition step on each subproblem is linear in its size, so the total effort across all
subproblems on one level is O(n).
The recursion tree for the best case looks like this:
The total partitioning on each level is O(n), and it takes log_2 n levels of perfect
partitions to get to single element subproblems. When we are down to single elements,
the problems are sorted, so the best case time is O(n log n).
Suppose instead our pivot element splits the array as unequally as possible. Thus instead
of n/2 elements in the smaller half, we get zero, meaning that the pivot element is the
biggest or smallest element in the array.
Now we have n-1 levels, instead of log_2 n, for a worst case time of O(n^2), since the
first n/2 levels each have at least n/2 elements to partition.
Thus the worst case time for Quicksort is worse than Heapsort or Mergesort.
To justify its name, Quicksort had better be good in the average case. Showing this
requires some fairly intricate analysis.
The divide and conquer principle applies to real life: when you break a job into pieces, it
is best to make the pieces of equal size!
The book contains a rigorous proof that quicksort is O(n log n) in the average case. I
will instead give an intuitive, less formal explanation of why this is so.
Half the time, the pivot element will be from the center half of the sorted array.
Whenever the pivot element is from positions n/4 to 3n/4, the larger remaining subarray
contains at most 3n/4 elements.
If we assume that the pivot element is always in this range, what is the maximum number
of partitions we need to get from n elements down to 1 element?
Since each good partition shrinks the larger subproblem to at most 3/4 of its size,
(3/4)^k * n = 1 when k = log_{4/3} n, so about log_{4/3} n good partitions suffice.
But how often when we pick an arbitrary element as pivot will it generate a decent
partition?
Since any number ranked between n/4 and 3n/4 would make a decent pivot, we get one
half the time on average.
If we need log_{4/3} n levels of decent partitions to finish the job, and half of our random
partitions are decent, then on average the recursion tree to quicksort the array has about
2 log_{4/3} n = O(log n) levels.
Since O(n) work is done partitioning on each level, the average time is O(n log n).
The worst case for Quicksort depends upon how we select our partition or pivot element.
If we always select either the first or last element of the subarray, the worst-case occurs
when the input is already sorted!
A B D F H J K
B D F H J K
D F H J K
F H J K
H J K
J K
K
Having the worst case occur when they are sorted or almost sorted is very bad, since that
is likely to be the case in certain applications.
But how can we compare two algorithms to see which is faster? Using the
RAM model and the big Oh notation, we can't!
When Quicksort is implemented well, it is typically 2-3 times faster than mergesort or
heapsort. The primary reason is that the operations in the innermost loop are simpler. The
best way to see this is to implement both and experiment with different inputs.
Since the difference between the two programs will be limited to a multiplicative
constant factor, the details of how you program each algorithm will make a big
difference.
If you don't want to believe me when I say Quicksort is faster, I won't argue with you. It
is a question whose solution lies outside the tools we are using. The best way to tell is to
implement them and experiment.
When we compare the expected number of comparisons for Quicksort + Insertion sort, a
funny thing happens for small n:
Why not take advantage of this, and switch over to insertion sort when the size of the
subarray falls below a certain threshold?
Why not indeed? But how do we find the right switch point to optimize performance?
Experiments are more useful than analysis here.
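A sketch of one way to wire this up in Modula-3, reusing the partition logic shown
earlier (the cutoff value and the names are illustrative, not from the notes):

PROCEDURE QuickThenInsert(VAR a: ARRAY OF INTEGER; left, right: INTEGER) =
  (*Sketch: quicksort, but leave subarrays at or below Cutoff unsorted;
    a final pass of insertion sort over the whole nearly-sorted array
    then finishes the job quickly.*)
  CONST Cutoff = 20;                   (*illustrative; tune by experiment*)
  VAR i, j, x, w: INTEGER;
  BEGIN
    IF right - left > Cutoff THEN
      i := left; j := right;
      x := a[(left + right) DIV 2];
      REPEAT
        WHILE a[i] < x DO INC(i) END;
        WHILE a[j] > x DO DEC(j) END;
        IF i <= j THEN
          w := a[i]; a[i] := a[j]; a[j] := w;
          INC(i); DEC(j);
        END;
      UNTIL i > j;
      QuickThenInsert(a, left, j);
      QuickThenInsert(a, i, right);
    END;
  END QuickThenInsert;

Calling QuickThenInsert(a, 0, LAST(a)) and then the InsertionSort sketched earlier sorts
the array; insertion sort is fast here because every element is within Cutoff of its final
position.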
Randomization
Suppose you are writing a sorting program, to run on data given to you by your worst
enemy. Quicksort is good on average, but bad on certain worst-case instances.
If you used Quicksort, what kind of data would your enemy give you to run it on?
Exactly the worst-case instance, to make you look bad.
But instead of picking the median of three or the first element as pivot, suppose you
picked the pivot element at random.
Now your enemy cannot design a worst-case instance to give to you, because no matter
which data they give you, you would have the same probability of picking a good pivot!
Randomization is a very important and useful idea. By either picking a random pivot or
scrambling the permutation before sorting it, we can say:
``If you give me random input data, quicksort runs in O(n log n) expected time.''
Since the time bound does not depend upon your input distribution, this means that
unless we are extremely unlucky (as opposed to ill prepared or unpopular) we will
certainly get good performance.
Randomization is a general tool to improve algorithms with bad worst-case but good
average-case complexity.
The worst-case is still there, but we almost certainly won't see it.
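A minimal Modula-3 sketch of random pivot selection, using the standard library's
Random interface (I am assuming its usual init/integer methods; the procedure name is
mine, and the swap targets the middle slot that the partition code above pivots on):

IMPORT Random;

VAR rand := NEW(Random.Default).init();

PROCEDURE ChooseRandomPivot(VAR a: ARRAY OF INTEGER; left, right: INTEGER) =
  (*Sketch: swap a uniformly random element into the middle slot, so a
    partition that pivots on the middle element sees a random pivot.*)
  VAR p, m, w: INTEGER;
  BEGIN
    p := rand.integer(left, right);    (*random index in [left .. right]*)
    m := (left + right) DIV 2;
    w := a[p]; a[p] := a[m]; a[m] := w;
  END ChooseRandomPivot;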
Who's Number 2?
Each game can be thought of as a comparison. Given n keys, we would like to determine
the k largest values. Can we do better than just sorting all of them?
In the tournament example, each team represents a leaf of the tree and each game is an
internal node of the tree. Thus there are n-1 games/comparisons for n teams/leaves.
Note that the champion is identified even though no team plays more than log_2 n
games!
Lewis Carroll, author of ``Alice in Wonderland'', studied this problem in the 19th century
in order to design better tennis tournaments!
We will seek a data structure which will enable us to repeatedly identify the largest key,
and then delete it to retrieve the largest remaining key.
Binary Heaps
A binary heap is defined to be a binary tree with a key in each node such that:
1. All leaves are on, at most, two adjacent levels.
2. All leaves on the lowest level occur to the left, and all levels except the lowest one
are completely filled.
3. The key in each node dominates (say, is less than or equal to) the keys of its children,
and the left and right subtrees are again binary heaps.
Conditions 1 and 2 specify the shape of the tree, while condition 3 describes the labeling
of the nodes tree.
Unlike the tournament example, each label only appears on one node.
Note that heaps are not binary search trees, but they are binary trees.
Heap Test
Can we search for a key efficiently in a heap?
Answer - No! A heap is not a binary search tree, and cannot be effectively used for
searching.
If we did not enforce the left constraint, we might have holes in the implicit
representation, and could need room for 2^n - 1 array elements to store just n things.
This implicit representation of trees saves memory but is less flexible than using pointers.
For this reason, we will not be able to use them when we discuss binary search trees.
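With the root stored at array position 1, the implicit pointers are just index arithmetic; a
minimal Modula-3 sketch (names mine):

PROCEDURE Parent(i: CARDINAL): CARDINAL =
  BEGIN RETURN i DIV 2 END Parent;     (*never asked of the root*)

PROCEDURE LeftChild(i: CARDINAL): CARDINAL =
  BEGIN RETURN 2 * i END LeftChild;

PROCEDURE RightChild(i: CARDINAL): CARDINAL =
  BEGIN RETURN 2 * i + 1 END RightChild;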
Constructing Heaps
Heaps can be constructed incrementally, by inserting new elements into the left-most
open spot in the array.
If the new element is greater than its parent, swap their positions and recur.
Since at each step, we replace the root of a subtree by a larger one, we preserve the heap
order.
Since all but the last level is always filled, the height h of an n element heap is bounded
because the first h levels alone hold 1 + 2 + 4 + ... + 2^{h-1} = 2^h - 1 nodes,
so n >= 2^h and h <= log_2 n, i.e. h = floor(log_2 n).
Thus each insertion percolates up a path of length at most h, and takes O(log n)
time.
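A minimal Modula-3 sketch of this incremental insertion, following the swap-with-parent
rule above for a heap with the largest key on top (the capacity and names are mine):

CONST MaxN = 100;                      (*illustrative capacity*)

VAR
  heap: ARRAY [1 .. MaxN] OF INTEGER;
  n: CARDINAL := 0;                    (*current number of elements*)

PROCEDURE HeapInsert(key: INTEGER) =
  (*Sketch: place the key in the leftmost open spot, then swap upwards
    while it is greater than its parent.*)
  VAR i: CARDINAL; w: INTEGER;
  BEGIN
    INC(n);
    heap[n] := key;
    i := n;
    WHILE (i > 1) AND (heap[i] > heap[i DIV 2]) DO
      w := heap[i]; heap[i] := heap[i DIV 2]; heap[i DIV 2] := w;
      i := i DIV 2;                    (*continue from the parent*)
    END;
  END HeapInsert;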
The smallest (or largest) element in the heap sits at the root.
Deleting the root can be done by replacing the root by the nth key (which must be a leaf)
and letting it percolate down to its proper position!
The smallest element of (1) the root, (2) its left child, and (3) its right child is moved to
the root. This leaves at most one of the two subtrees which is not in heap order, so we
continue one level down.
After at most floor(log_2 n) steps of O(1) time each, we reach a leaf, so the deletion is
completed in O(log n) time.
This percolate-down operation is often called Heapify, for it merges two heaps with a
new root.
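A minimal Modula-3 sketch of Heapify for a heap with the smallest key on top, matching
the delete-root description above (names mine; MaxN is the capacity from the earlier
sketch and n is the current heap size):

PROCEDURE Heapify(VAR a: ARRAY [1 .. MaxN] OF INTEGER; i, n: CARDINAL) =
  (*Sketch: move the smallest of a[i], its left child a[2i], and its
    right child a[2i+1] to position i, then continue one level down.*)
  VAR l, r, smallest: CARDINAL; w: INTEGER;
  BEGIN
    l := 2 * i; r := 2 * i + 1;
    smallest := i;
    IF (l <= n) AND (a[l] < a[smallest]) THEN smallest := l END;
    IF (r <= n) AND (a[r] < a[smallest]) THEN smallest := r END;
    IF smallest # i THEN
      w := a[i]; a[i] := a[smallest]; a[smallest] := w;
      Heapify(a, smallest, n);         (*at most one subtree is out of order*)
    END;
  END Heapify;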
Heapsort
Inserting the n elements one at a time into an initially empty heap builds a heap in
O(n log n) time:
Build-heap(A)
for i = 2 to n do
HeapInsert(A[i], A)
Exchanging the maximum element with the last element and calling heapify repeatedly
gives an O(n log n) sorting algorithm, Heapsort:
Heapsort(A)
Build-heap(A)
for i = n to 1 do
swap(A[1],A[i])
n = n - 1
Heapify(A,1)
Selection sort scans through the entire array, repeatedly finding the smallest remaining
element.
For i = 1 to n
A: Find the smallest of the remaining unsorted elements.
B: Pull it out and add it to the end of the sorted part.
Using arrays or unsorted linked lists as the data structure, operation A takes O(n) time and
operation B takes O(1).
Using heaps, both of these operations can be done within O(log n) time, balancing the
work and achieving a better tradeoff.
Priority Queues
A priority queue is a data structure on sets of keys supporting the operations: Insert(S, x)
- insert x into set S, Maximum(S) - return the largest key in S, and ExtractMax(S) - return
and remove the largest key in S
• In a stack, push inserts a new item and pop removes the most recently pushed
item.
• In a queue, enqueue inserts a new item and dequeue removes the least recently
enqueued item.
Both stacks and queues can be simulated by using a heap, when we add a new time field
to each item and order the heap according to this time field.
• To simulate the stack, increment the time with each insertion and put the
maximum on top of the heap.
• To simulate the queue, decrement the time with each insertion and put the
maximum on top of the heap (or increment times and keep the minimum on top)
This simulation is not as efficient as a normal stack/queue implementation, but it is a cute
demonstration of the flexibility of a priority queue.
In simulations of airports, parking lots, and jai-alai - priority queues can be used to
maintain who goes next.
The stack and queue orders are just special cases of orderings. In real life, certain people
cut in line.
In the priority queue, we will store the points we have not yet encountered, ordered by x
coordinate, and push the line forward one stop at a time.
Greedy Algorithms
In greedy algorithms, we always pick the next thing which locally maximizes our score.
By placing all the things in a priority queue and pulling them off in order, we can
improve performance over linear search or sorting, particularly if the weights change.
Sequential Search
The simplest algorithm to search a dictionary for a given key is to test successively
against each element.
This works correctly regardless of the order of the elements in the list. However, in the
worst case all elements might have to be tested.
A sentinel is a value placed at the end of an array to ensure that the normal case of
searching returns something even if the item is not found. It is a way to simplify coding
by eliminating the special case.
...
i:= FIRST(a);
WHILE (i <= last) AND NOT Text.Equal(a[i], x) DO INC(i) END;
...
(* Do search *)
a[LAST(a)]:= x; (*sentinel at position N+1*)
i:= FIRST(a);
WHILE x # a[i] DO INC(i) END;
(* Output result *)
IF i = LAST(a) THEN
SIO.PutText("NOT found");
ELSE
SIO.PutText("Found at position: "); SIO.PutInt(i)
END;
SIO.Nl();
END SentinelSearch.
Sometimes sequential search is not a bad algorithm, especially when the list isn't long.
After all, sequential search is easier to implement than binary search, and does not require
the list to be sorted.
However, if we are going to do a sequential search, what order do we want the elements?
Sorted order in a linked list doesn't really help, except maybe to help us stop early if the
item isn't in the list.
Suppose you were organizing your personal phone book for sequential search. You
would want your most frequently called friends to be at the front: In sequential search,
you want the keys ordered by frequency of use!
Why? If p_i is the probability of searching for the ith key, which is a distance i from the
front, the expected search time is the sum over all keys of p_i * i.
For the list (Cheryl,0.4), (Lisa,0.25), (Lori,0.2), (Lauren,0.15), the expected search time
is: 0.4*1 + 0.25*2 + 0.2*3 + 0.15*4 = 2.1.
If access probability had been uniform, the expected search time would have been
(1 + 2 + 3 + 4)/4 = 2.5.
So I win using this order, and win even more if the access probabilities are further
skewed.
Self-Organizing Lists
Since it is often impractical to compute usage frequencies, and because usage frequencies
often change in the middle of a program (locality), we would like our data structure to
automatically adjust to the distribution.
The idea is to use a heuristic to move an element forward in the list whenever it is
accessed. There are two possibilities:
1. Move-to-front: upon each access, move the element to the front of the list.
2. Move-forward-one: upon each access, swap the element with its immediate
predecessor.
For list (1,2,3,4,5,6,7,8), the queries Find(8), Find(7), Find(8), Find(7), ... will search the
entire list every time. With move-to-front, it averages only two comparisons per query!
In fact, it can be shown that the total search time with move-to-front is never more than
twice the time if you knew the actual probabilities in advance!!
We will see self-organization again later in the semester when we talk about splay trees.
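A minimal Modula-3 sketch of sequential search with move-to-front on a singly linked
list (the type and names are mine):

TYPE
  List = REF RECORD
    info: INTEGER;
    next: List := NIL;
  END;

PROCEDURE FindMTF(VAR head: List; key: INTEGER): List =
  (*Sketch: search for key; if found anywhere but the front, unlink
    the node and splice it in as the new head. Returns NIL if absent.*)
  VAR prev, cur: List;
  BEGIN
    IF (head = NIL) OR (head.info = key) THEN RETURN head END;
    prev := head;
    cur := head.next;
    WHILE (cur # NIL) AND (cur.info # key) DO
      prev := cur;
      cur := cur.next;
    END;
    IF cur # NIL THEN
      prev.next := cur.next;           (*unlink the found node*)
      cur.next := head;                (*move it to the front*)
      head := cur;
    END;
    RETURN cur;
  END FindMTF;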
Binary Search
The basic algorithm is to find the middle element of the list, compare it against the key,
decide which half of the list must contain the key, and repeat with that half.
With one question, I can distinguish between two words: A and B; ``Is the key >= B?''
With two questions, I can distinguish between four words: A,B,C,D; ``Is the key >= C?''
Each question I ask doubles the number of words I can search in my dictionary.
Thus the number of questions we must ask is the base two logarithm of the size of the
dictionary.
The difficulty is maintaining the following two invariants with each iteration:
• The key must always remain between the low and high indices.
• The low or high index must advance with each iteration.
The boundary cases are very tricky: zero elements left, one element left, two elements
left, and an even or odd number of elements!
There are at least two different versions of binary search, depending upon whether we
want to test for equality at each query or only at the end.
Alternately, we can test for equality at each comparison. Suppose we search for ``c'':
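A minimal Modula-3 sketch of the equality-at-each-comparison variant over a sorted
array of integers (names mine):

PROCEDURE BinarySearch(READONLY a: ARRAY OF INTEGER; key: INTEGER): INTEGER =
  (*Sketch: returns the index of key in the sorted array a, or -1 if
    it is absent. Each probe halves the remaining range.*)
  VAR low, high, mid: INTEGER;
  BEGIN
    low := FIRST(a);
    high := LAST(a);
    WHILE low <= high DO
      mid := (low + high) DIV 2;
      IF a[mid] = key THEN
        RETURN mid                     (*test for equality at each query*)
      ELSIF a[mid] < key THEN
        low := mid + 1                 (*key can only be in the upper half*)
      ELSE
        high := mid - 1                (*key can only be in the lower half*)
      END;
    END;
    RETURN -1;                         (*low passed high: key not present*)
  END BinarySearch;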
One-dimensional Arrays
The easiest way to view a one-dimensional array is as a contiguous block of memory
locations of length (# of array elements) * (size of each element).
Because the size (in bytes) of each element is the same, the compiler can translate
A[500] into an address: the start of A plus 500 times the size of each element.
Two-Dimensional Arrays
How does the compiler know where to store element A[i,j] of a two-dimensional array?
By chopping the matrix into rows, it can be stored like a one-dimensional array: element
A[i,j] of an n x m array lives at offset (i * m + j) * (size of each element) from the start
of A, assuming indices begin at 0.
Is this access formula for row-major or column-major order, assuming the first index
gives the row?
For three dimensions, cut the matrix into two dimensional slabs, and use the previous
formula. For k-dimensional arrays, we can find a similar formula by induction.
Thus we can access any element in a k-dimensional array in O(k) time, which is constant
for any reasonable dimension.
Fortran stores its arrays in column-major order, while most other languages use row-
major order. But why might we really need to know what is going on under the hood?
In the C language, pointers are usually used to cruise through arrays. Cruising through a
2D array meaningfully requires knowing the order of the elements.
Also, in a computer with virtual memory or a cache, it is often faster to access elements if
they are close to the last one we have read. Knowing the access function lets us choose
the right way to order nested loops.
(*row-major: the inner loop walks along a row*)
FOR i := 1 TO n DO
  FOR j := 1 TO n DO A[i,j] := 0 END;
END;

(*column-major: the inner loop walks down a column*)
FOR j := 1 TO n DO
  FOR i := 1 TO n DO A[i,j] := 0 END;
END;
Triangular Tables
By playing with our own access functions we can build efficient arrays of whatever shape
we want, including triangular and banded arrays.
Triangular tables prove useful for representing any symmetric function, such as the
distance from A to B, where D[a,b] = D[b,a]. Thus we can save almost half the memory
of a rectangular array by storing it as a triangle.
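A minimal Modula-3 sketch of such an access function for the lower triangle, with
0-based indices (names mine; row i is preceded by 1 + 2 + ... + i = i*(i+1)/2 elements):

PROCEDURE TriIndex(i, j: CARDINAL): CARDINAL =
  (*Map (i, j) with j <= i into a linear array holding only the lower
    triangle of the table.*)
  BEGIN
    RETURN i * (i + 1) DIV 2 + j;
  END TriIndex;

PROCEDURE Dist(READONLY d: ARRAY OF REAL; a, b: CARDINAL): REAL =
  (*Symmetric lookup: D[a,b] = D[b,a], so index with the larger
    coordinate first.*)
  BEGIN
    IF a >= b THEN RETURN d[TriIndex(a, b)] ELSE RETURN d[TriIndex(b, a)] END;
  END Dist;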
Binary search takes O(log n) time to find a particular key in a sorted array. It can be
shown that, in the worst case, no faster comparison-based algorithm exists. So how
might we do faster?
Interpolation Search
Binary search is only optimal when you know nothing about your data except that it is
sorted!
When you look up AAA in the telephone book, you don't start in the middle. We use our
understanding of how things are named in the real world to choose where to probe next.
Such an algorithm is called an interpolation search, since we are interpolating (guessing)
where the key should be.
Interpolation search is only as good as our guesses. If we do not understand the data as
well as we think, interpolation search can be very slow - recall the Shifflett's of
Charlottesville!
With interpolation search, the cost of making a good guess might overwhelm the
reduction in the number of guesses, so watch out!
An array reference A[i] lets us quickly calculate exactly where the ith element of A is in
memory, knowing only i, the starting location of A, and the size of each array item.
Any time we can compute the exact position for an item in memory by a simple access
formula, we can find it as quickly as we can compute the formula!
So why not use such an array for every search problem? One reason is that many of the
fields we wish to search on are not integers, for example, names in a telephone book.
What address in the machine is defined by ``Skiena''?
Hashing
Lecture 21
Steven S. Skiena
Hashing
One way to convert from names to integers is to use the letters to form a base ``alphabet-
size'' number system:
To convert ``STEVE'' to a number, observe that e is the 5th letter of the alphabet, s is the
19th letter, t is the 20th letter, and v is the 22nd letter.
Thus ``Steve'' corresponds to (19 * 26^4) + (20 * 26^3) + (5 * 26^2) + (22 * 26) + 5 =
9,038,021.
Thus one way we could represent a table of names would be to set aside an array big
enough to contain one element for each possible string of letters, then store data in the
elements corresponding to real people. Computing this function immediately tells us
where the person's phone number is!!
Because we must leave room for every possible string, this method will use an incredible
amount of memory. We need a data structure to represent a sparse table, one where
almost all entries will be empty.
We can reduce the number of boxes we need if we are willing to put more than one thing
in the same box!
Example: suppose we use the base alphabet number system, then take the remainder
modulo the table size.
Now the table is much smaller, but we need a way to deal with the fact that more than
one (but hopefully very few) keys can get mapped to the same array element.
The basic idea of hashing is to apply a function to the search key so we can determine
where the item is without looking at the other items. To make the table of reasonable
size, we must allow for collisions, two distinct keys mapped to the same location.
There are several clever techniques we will see to develop good hash functions and deal
with the problems of duplicates.
Hash Functions
The verb ``hash'' means ``to mix up'', and so we seek a function to mix up keys as well as
possible.
The best possible hash function would hash m keys into n ``buckets'' with no more than
ceiling(m/n) keys per bucket.
Let us consider hashing character strings to integers. The ORD function returns the
character code associated with a given character. By using the ``base character size''
number system, we can map each string to an integer.
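A minimal Modula-3 sketch of such a string hash, reducing modulo the table size at each
step so the intermediate value stays small (names are mine; 256 stands in for the size of
the character set):

IMPORT Text;

PROCEDURE Hash(key: TEXT; tablesize: CARDINAL): CARDINAL =
  (*Sketch: treat the string as a base-256 number built from character
    codes, reduced MOD tablesize as we go.*)
  VAR h: CARDINAL := 0;
  BEGIN
    FOR i := 0 TO Text.Length(key) - 1 DO
      h := (h * 256 + ORD(Text.GetChar(key, i))) MOD tablesize;
    END;
    RETURN h;
  END Hash;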
1. A hash function which maps an arbitrary key to an integer turns searching into
array access, hence O(1).
2. To use a finite sized array means two different keys will be mapped to the same
place. Thus we must have some way to handle collisions.
3. A good hash function must spread the keys uniformly, or else we have a linear
search.
• Truncation - When grades are posted, the last four digits of your SSN are used,
because they distribute students more uniformly than the first four digits.
• Folding - We should get a better spread by factoring in the entire key. Maybe
subtract the last four digits from the first five digits of the SSN, and take the
absolute value?
• Modular Arithmetic - When constructing pseudorandom numbers, a good trick for
uniform distribution was to take a big number mod the size of our range. Because
of our roulette wheel analogy, the numbers tend to get spread well if the tablesize
is selected carefully.
Suppose we wanted to hash check totals by the dollar value in pennies mod 1000. What
happens?
Since so many checks are written for whole dollar amounts, values like $10.00, $11.00,
and $12.00 (1000, 1100, and 1200 pennies) all land on multiples of 100, clustering on a
handful of slots.
If we instead use a prime numbered modulus like 1007, these clusters will get broken:
1000, 1100, and 1200 now map to 1000, 93, and 193.
In general, it is a good idea to use a prime modulus for the hash table size, since it is less
likely that the data will share common factors with the table size - for example, all
multiples of 4 get mapped to even numbers in an even sized hash table!
No matter how good our hash function is, we had better be prepared for collisions,
because of the birthday paradox.
Assuming 365 days a year, what is the probability that two people share a
birthday? Once the first person has fixed their birthday, the second person has 364
possible days to be born to avoid a collision, or a 364/365 chance.
The moral is that collisions are common, even with good hash functions.
No matter how good our hash functions are, we must deal with collisions. What do we do
when the spot in the table we need is occupied?
The easiest approach is to let each element in the hash table be a pointer to a list of keys;
this is chaining.
Insertion, deletion, and query reduce to the corresponding problem in linked lists. If the
m keys are distributed uniformly in a table of size n, each operation takes O(m/n) time.
Chaining is easy, but devotes a considerable amount of memory to pointers, which could
be used to make the table larger. Still, it is my preferred method.
Open Addressing
We can dispense with all these pointers by using an implicit reference derived from a
simple function: if slot h(k) is taken, a fixed probe sequence tells us where to look next.
If the space we want to use is filled, we can examine the remaining locations:
1. Sequentially: h, h+1, h+2, ...
2. Quadratically: h, h+1^2, h+2^2, ...
3. Linearly: h, h+k, h+2k, ... for some step size k
The reason for using a more complicated scheme is to avoid long runs from similarly
hashed keys.
Deletion in an open addressing scheme is ugly, since removing one element can break a
chain of insertions, making some elements inaccessible.
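A minimal Modula-3 sketch of insertion under the sequential probing scheme (the table
size, sentinel value, and names are mine):

CONST
  M = 1007;                            (*illustrative prime table size*)
  Empty = -1;                          (*illustrative mark for unused slots*)

VAR table := ARRAY [0 .. M - 1] OF INTEGER {Empty, ..};

PROCEDURE InsertProbe(key: CARDINAL) =
  (*Sketch of open addressing: start at key MOD M and step forward
    until a free slot is found. Assumes the table is not full.*)
  VAR i: CARDINAL;
  BEGIN
    i := key MOD M;
    WHILE table[i] # Empty DO
      i := (i + 1) MOD M;              (*examine the next location*)
    END;
    table[i] := key;
  END InsertProbe;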
Pragmatically, a hash table is often the best data structure to maintain a dictionary.
However, the worst-case running time is unpredictable.
The best worst-case bounds on a dictionary come from balanced binary trees, such as red-
black trees.
Tree Structures
Lecture 21
Steven S. Skiena
Trees
We have seen many data structures which allow fast search, but not fast, flexible update.
Hash Tables - The number of insertions is essentially bounded by the table size, which
must be specified in advance. Worst case O(n) search.
Binary trees will enable us to search, insert, and delete fast, without predefining the size
of our data structure!
To get O(log n) search time, we used binary search, meaning we always had a choice
of two next elements to look at.
To combine these ideas, we want a ``linked list'' with two pointers per node! This is the
basic idea behind search trees!
Rooted Trees
A rooted tree is either (1) empty, or (2) consists of a node called the root, together with
two rooted trees called the left subtree and right subtree of the root.
A binary tree is a rooted tree where each node has at most two descendants, the left child
and the right child.
A binary tree can be implemented where each node has left and right pointer fields, an
(optional) parent pointer, and a data field.
Rooted trees can be used to model corporate hierarchies and family trees.
Note the inherently recursive structure of rooted trees. Deleting the root gives rise to a
certain number of smaller subtrees.
In a rooted tree, the order among ``brother'' nodes matters. Thus left is different from
right. There are five distinct binary trees with three nodes:
A binary search tree is a binary tree where each node contains a key such that:
• All keys in the left subtree precede the key in the root.
• All keys in the right subtree succeed the key in the root.
• The left and right subtrees of the root are again binary search trees.
Left: A binary search tree. Right: A heap but not a binary search tree.
For any binary tree on n nodes, and any set of n keys, there is exactly one labeling to
make it a binary search tree!!
Binary Tree Search
Searching a binary tree is almost like binary search! The difference is that instead of
searching an array and defining the middle element ourselves, we just follow the
appropriate pointer!
The type declaration is simply a linked list node with another pointer. Left and right
pointers are identical types.
TYPE
T = BRANDED REF RECORD
key: ElemT;
left, right: T := NIL;
END; (*T*)
Dictionary search operations are easy in binary trees. The algorithm works because both
the left and right subtrees of a binary search tree are binary search trees - recursive
structure, recursive algorithm.
Search Implementation
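The routine itself does not appear at this point in the notes; a minimal Modula-3 sketch
against the type T declared above (with ElemT assumed to be an ordered type such as
INTEGER) might be:

PROCEDURE Search(tree: T; e: ElemT): T =
  (*Sketch: follow one pointer per level, just as binary search halves
    the array; returns the node containing e, or NIL.*)
  BEGIN
    IF tree = NIL THEN
      RETURN NIL                       (*not found*)
    ELSIF e < tree.key THEN
      RETURN Search(tree.left, e)
    ELSIF e > tree.key THEN
      RETURN Search(tree.right, e)
    ELSE
      RETURN tree                      (*found*)
    END;
  END Search;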
This takes time proportional to the height of the tree, O(h). Good, balanced trees have
height about log_2 n, while bad, unbalanced trees can have height close to n.
To insert a new node into an existing tree, we search for where it should be, then replace
that NIL pointer with a pointer to the new node.
The pointer in the parent node must be modified to remember where we put the new
node.
Insertion Routine
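The routine is not reproduced at this point; a minimal Modula-3 sketch, using the same
type T with ElemT ordered, could be:

PROCEDURE Insert(VAR tree: T; e: ElemT) =
  (*Sketch: walk down to the NIL where e belongs; because tree is a
    VAR parameter, assigning to it updates the parent's pointer.*)
  BEGIN
    IF tree = NIL THEN
      tree := NEW(T, key := e);        (*left and right default to NIL*)
    ELSIF e < tree.key THEN
      Insert(tree.left, e)
    ELSE
      Insert(tree.right, e)
    END;
  END Insert;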
How many pointers are in the tree? There are n nodes in tree, each of which has 2
pointers, for a total of 2n pointers regardless of shape.
How many pointers are NIL, i.e. ``wasted''? Except for the root, each node in the tree is
pointed to by exactly one tree pointer. Thus the number of NILs is
2n - (n-1) = n+1, for n >= 1.
The order in which we explore each node and its children matters for many applications.
There are six permutations of {left, right, node} which define traversals. The most
interesting traversals are inorder {left, node, right}, preorder {node, left, right}, and
postorder {left, right, node}.
Why do we care about different traversals? Depending on what the tree represents,
different traversals have different interpretations.
BEGIN (*Traverse*)
IF direction = Direction.Left THEN
CASE order OF
| Order.Pre => PreL(tree, 0);
| Order.In => InL(tree, 0);
| Order.Post => PostL(tree, 0);
END (*CASE order*)
ELSE (* direction = Direction.Right*)
CASE order OF
| Order.Pre => PreR(tree, 0);
| Order.In => InR(tree, 0);
| Order.Post => PostR(tree, 0);
END (*CASE order*)
END (*IF direction*)
END Traverse;
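The helper procedures named in the CASE above are not shown; a minimal sketch of the
inorder one, InL, might look like this (Visit stands in for whatever per-node processing
is wanted and is my own name):

PROCEDURE InL(tree: T; depth: INTEGER) =
  (*Sketch: traverse the left subtree, visit the node, then traverse
    the right subtree.*)
  BEGIN
    IF tree # NIL THEN
      InL(tree.left, depth + 1);
      Visit(tree, depth);
      InL(tree.right, depth + 1);
    END;
  END InL;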
Insertion was easy because the new node goes in as a leaf and only its parent is affected.
Deletion of a leaf is just as easy - set the parent pointer to NIL. But what if the node to be
deleted is an interior node? We have two pointers to connect to only one parent!!
Deletion is somewhat more tricky than insertion, because the node to die may not be a
leaf, and thus can affect other nodes.
Case (a), where the node is a leaf, is simple - just NIL out the parent's child pointer.
Case (b), where the node has one child, the doomed node can just be cut out.
Case (c), relabel the node as its predecessor (which has at most one child when z has two
children!) and delete the predecessor!
PROCEDURE LeftLargest(VAR x: T) =
VAR y: T;
BEGIN
IF x.right = NIL THEN (*x points to largest element left*)
y:= tree; (*y now points to target node*)
tree:= x; (*tree assumes the largest node to
the left*)
x:= x.left; (*Largest node left replaced by its
left subtree*)
tree.left:= y.left; (*tree assumes subtrees ...*)
tree.right:= y.right; (*... of deleted node*)
ELSE (*Largest element left not found*)
LeftLargest(x.right) (*Continue search to the right*)
END;
END LeftLargest;
BEGIN
IF tree = NIL THEN RETURN FALSE
ELSIF e < tree.key THEN RETURN Delete(tree.left, e)
ELSIF e > tree.key THEN RETURN Delete(tree.right, e)
ELSE (*found*)
IF tree.left = NIL THEN
tree:= tree.right;
ELSIF tree.right = NIL THEN
tree:= tree.left;
ELSE (*Target node has two nonempty
subtrees*)
LeftLargest(tree.left) (*Search in left subtree*)
END; (*IF tree.left...*)
RETURN TRUE
END; (*IF tree...*)
END Delete;
We have seen that binary trees can have heights ranging from log_2 n to n. How tall are
they on average?
By using an intuitive argument, like I did with quicksort, I will convince you that a
random tree is usually quite close to balanced. The text contains a more rigorous proof,
which you should look at.
Consider the first insertion into an empty tree. This node becomes the root and never
changes. Since in a binary search tree all keys less than the root go in the left subtree, the
root acts as a partition or pivot element!
Let's say a key is a 'good' pivot element if it is in the center half of the sorted space of
keys. Half of the time, our root will be a 'good' pivot element.
The next insertion will form the root of a subtree, and will be drawn at random from the
items either > root or < root. Again, half the time each insertion will be a 'good' partition
of the appropriate subset of keys.
The bigger half of a good partition contains at most 3n/4 items. Thus the maximum depth
of good splits k satisfies (3/4)^k * n = 1, so k = log_{4/3} n = O(log n).
Doubling the depth to account for bad splits still keeps the height O(log n) on average!
On average, random search trees are very good - a more careful analysis shows the
average height after n insertions is about 2 ln n. Since 2 ln n is about 1.386 log_2 n, this
is only 39% more than a perfectly balanced tree.
Of course, if we get unlucky and insert keys in sorted order, we are doomed to the worst
case performance.
insert(a)
insert(b)
insert(c)
insert(d)
What we want is an insertion/deletion procedure which adjusts the tree a little after each
insertion, keeping it close enough to balanced so the maximum height is logarithmic, but
flexible enough so we can still update fast - in O(log n) time.
Therefore, when we talk about "balanced" trees, we mean trees whose height is O(log n).
Red-Black trees are binary search trees where each node is assigned a color, with the
coloring constrained so that no two red nodes appear in a row and every root-to-leaf path
contains the same number of black nodes.
AVL Trees
Lecture 24
Steven S. Skiena
AVL Trees
An AVL tree is a binary search tree in which the heights of the left and right subtrees of
the root differ by at most 1, and the left and right subtrees are again AVL trees.
Therefore, we can label each node of an AVL tree with a balance factor as well as a key:
AVL trees are named after their inventors, the Russians G.M. Adelson-Velskii and E.M.
Landis, in 1962.
These are the most unbalanced possible AVL trees with a skew always to the right.
By maintaining the balance of each node (i.e. the subtree below it) when we insert a new
node, we can easily see whether or not to take action!
The balance is more useful than maintaining the height of each node because it is a
relative, not absolute measure. Thus we can move subtrees around without affecting their
balance, even if they end up at different heights.
How good are AVL trees?
To find out how bad they can be, we want to find the minimum number of nodes a
tree of height h can have. If T_h is a minimum node AVL tree, its left and right subtrees
must themselves be minimum node AVL trees of smaller size. Further, they should differ
in height by 1 to take advantage of AVL freedom, giving the recurrence
N(h) = N(h-1) + N(h-2) + 1.
Thus the worst case AVL tree is almost as good as a random tree - on average it is very
close to an optimal tree.
Since we are adding the last two numbers together, we are more than doubling the next-
to-last and somewhat less than doubling the last number. Thus the minimum number of
nodes grows exponentially with the height, like the Fibonacci numbers, so the height of
an n-node AVL tree is O(log n).
INTERFACE AVLTree;
IMPORT BinaryTree;
TYPE T <: BinaryTree.T; (*T is a subtype of BinaryTree.T *)
END AVLTree.
REVEAL
T = BinaryTree.T BRANDED OBJECT
OVERRIDES
delete:= Delete;
insert:= Insert;
END;
ELSE
InsertBal(root.right, new, bal);
IF NOT bal THEN (* bal is set to stop the recurs.
adjustm. of balance *)
WITH done=NARROW(root, NodeT).balance DO
CASE done OF
|-1=> done:= 0; bal:= TRUE; (*insertion ok *)
| 0=> done:= +1; (*still balanced, but
continue*)
|+1=>
IF NARROW(root.right, NodeT).balance = +1
THEN RL(root)
ELSE RrL(root)
END;
NARROW(root, NodeT).balance:= 0;
bal:= TRUE; (*after rotation tree ok*)
END; (*CASE*)
END (*WITH*)
END (*IF*)
END;
END InsertBal;
ELSE
deleted:= root.info;
IF root.left = NIL
THEN
root:= root.right;
ELSIF root.right = NIL THEN
root:= root.left;
ELSE
root.info:= DeleteSmallest(root.right, bal);
IF NOT bal THEN BalanceRight(root, bal) END;
END;
RETURN deleted;
END;
END Delete;
BEGIN
END AVLTree.
We have seen that AVL trees are O(log n) for insertion and query.
What about deletion? Don't ask! Actually, you can rebalance an AVL tree after a
deletion in O(log n) time, but it is more complicated than insertion.
We will later study B-trees, where deletion is simpler, so don't worry about the details of
deletions from AVL trees.
Red-Black Trees
Lecture 25
Steven S. Skiena
No! Because now all nodes may not have the same black height.
What does a red-black tree with two real nodes look like?
Not (1): consecutive reds.
Not (2), (4): non-uniform black height.
Proof: Our strategy: first we bound the number of nodes in any subtree, then we bound
the height of any subtree.
We claim that any subtree rooted at x has at least 2^bh(x) - 1 internal nodes, where bh(x)
is the black height of node x.
Proof, by induction: if bh(x) = 0, then x must be a leaf (NIL), which indeed contains at
least 2^0 - 1 = 0 internal nodes.
Now assume it is true for all trees with black height < bh(x).
If x is black, both subtrees have black height bh(x)-1. If x is red, the subtrees have black
height bh(x).
Thus the subtree rooted at x contains at least 2 * (2^{bh(x)-1} - 1) + 1 = 2^bh(x) - 1
internal nodes. Since no two reds occur in a row, at least half the nodes on any root-to-
leaf path are black, so bh(root) >= h/2 and n >= 2^{h/2} - 1, giving h <= 2 log_2 (n+1).
Therefore red-black trees have height at most twice optimal. We have a balanced search
tree if we can maintain the red-black tree structure under insertion and deletion.
Rotations
The basic restructuring steps for binary search trees are left and right rotations:
Rotation Implementation
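The code is not shown at this point; a minimal Modula-3 sketch of a left rotation on the
binary tree type T used earlier (name mine) could be:

PROCEDURE RotateLeft(VAR root: T) =
  (*Sketch: the right child becomes the new subtree root; its left
    subtree is handed over to the old root. The inorder sequence of
    keys, and hence the search-tree property, is preserved.*)
  VAR r: T;
  BEGIN
    r := root.right;
    root.right := r.left;
    r.left := root;
    root := r;
  END RotateLeft;

A right rotation is the mirror image, with the roles of left and right exchanged.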
Red-Black Insertion
Since red-black trees have O(log n) height, if we can preserve all properties of such
trees under insertion/deletion, we have a balanced tree!
Suppose we just did a regular insertion. Under what conditions does it stay a red-black
tree?
Since every insertion takes place at a leaf, we will change a black NIL pointer to a node
with two black NIL pointers.
To preserve the black height of the tree, the new node must be red. If its new parent is
black, we can stop, otherwise we must restructure! How can we fix two reds in a row?
If our uncle is red, reversing our relatives' color either solves the problem or pushes it
higher!
If we get all the way to the root, recall we can always color a red-black tree's root black.
We always will, so initially it was black, and so this process terminates.
If our uncle was black, observe that all the nodes around us have to be black:
Since the root of the subtree is now black with the same black-height as before, we have
restored the colors and can stop!
Double Rotations
A double rotation can be required to set things up depending upon the left-right turn
sequence, but the principle is the same.
Case (c): relabel the node as its successor and delete the successor.
Suppose the node we remove was red, do we still have a red-black tree?
Yes! No two reds will be together, and the black height for each leaf stays the same.
However, if the dead node y was black, we must give each of its descendants another
black ancestor. If an appropriate node is red, we can simply color it black; otherwise we
must restructure.
Case (b) red becomes black and black becomes ``double black'';
Case (c) red becomes black and black becomes ``double black''.
Our goal will be to recolor and restructure the tree so as to get rid of the ``double black''
node.
Case 3: x has a black brother, and its left nephew is red and its right nephew is black.
Case 4: x has a black brother, and its right nephew is red (left nephew can be any color).
Conclusion
Splay Trees
Lecture 26
Steven S. Skiena
In real life, it is difficult to obtain the actual probabilities, and they keep changing. What
can we do?
We can apply our self-organizing heuristics to search trees, as we did with linked lists.
Whenever we access a node, we can either:
Moving a made to the front of a search tree means making it the root!
Splay Trees
To search or insert into a splay tree, we first perform the operation as if it was a random
tree. After it is found or inserted, perform a splay operation to move the given key to the
root.
A splay operation consists of a sequence of double rotations until the node is within one
level of the root, where at most one single rotation suffices to finish the job.
The choice of which double rotation to do depends upon our relationship to our
grandparent - a single rotation is performed only when we have no grandparent!
Example: Splay(a)
Note that the tree would not have become more balanced had we just used single
rotations to promote a to the root, instead of double rotations.
Sleator and Tarjan showed that if the keys are accessed with a uniform distribution, the
amortized cost is O(log n) per operation!
Further, if the distribution is non-uniform, we get amortized costs within a constant factor
of the best possible tree!
All of this is done without keeping any balance or color information - amazing!
Graphs
Lecture 27
Steven S. Skiena
Graphs
A graph G consists of a set of vertices V together with a set E of vertex pairs or edges.
Graphs are important because any binary relation is a graph, so graphs can be used to
represent essentially any relationship.
Example: A network of roads, with cities as vertices and roads between cities as edges.
Consider a graph where the vertices are people, and there is an edge between two people
if and only if they are friends.
This graph is well-defined on any set of people: SUNY SB, New York, or the world.
A graph is undirected if (x,y) implies (y,x). Otherwise the graph is directed. The
``heard-of'' graph is directed since countless famous people have never heard of
me! The ``had-sex-with'' graph is presumably undirected, since it requires a
partner.
• Am I my own friend?
An edge of the form (x,x) is said to be a loop. If x is y's friend several times over,
that could be modeled using multiedges, multiple edges between the same pair of
vertices. A graph is said to be simple if it contains no loops and no multiple edges.
If I were trying to impress you with how tight I am with Mel Brooks, I would be
much better off saying that Uncle Lenny knows him than to go into the details of
how connected I am to Uncle Lenny. Thus we are often interested in the shortest
path between two nodes.
A graph is connected if there is a path between any two vertices. A directed graph
is strongly connected if there is a directed path between any two vertices.
A social clique is a group of mutual friends who all hang around together. A
graph theoretic clique is a complete subgraph, where each vertex pair has an edge
between them. Cliques are the densest possible subgraphs. Within the friendship
graph, we would expect that large cliques correspond to workplaces,
neighborhoods, religious organizations, schools, and the like.
A cycle is a path where the last vertex is adjacent to the first. A cycle in which no
vertex repeats (such as 1-2-3-1 versus 1-2-3-2-1) is said to be simple. The shortest
cycle in the graph defines its girth, while a simple cycle which passes through
each vertex is said to be a Hamiltonian cycle.
Can we save space if (1) the graph is undirected? (2) if the graph is sparse?
Adjacency Lists
An adjacency list consists of an array of pointers, where the ith element points to a
linked list of the edges incident on vertex i.
To test if edge (i,j) is in the graph, we search the ith list for j, which takes O(d_i) time,
where d_i is the degree of the ith vertex.
Note that d_i can be much less than n when the graph is sparse. If necessary, the two
copies of each edge can be linked by a pointer to facilitate deletions.
Both representations are very useful and have different properties, although adjacency
lists are probably better for most problems.
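A minimal Modula-3 sketch of the adjacency list representation (the constants and names
are mine):

CONST MaxV = 100;                      (*illustrative bound on vertices*)

TYPE
  EdgeNode = REF RECORD
    y: CARDINAL;                       (*the other endpoint of this edge*)
    next: EdgeNode := NIL;
  END;
  Graph = RECORD
    edges: ARRAY [0 .. MaxV - 1] OF EdgeNode;  (*one list per vertex,
                                                 each assumed to start NIL*)
    nvertices: CARDINAL := 0;
  END;

PROCEDURE InsertEdge(VAR g: Graph; x, v: CARDINAL) =
  (*Sketch: prepend v to x's list; for an undirected graph, call again
    with the arguments swapped to store the second copy of the edge.*)
  BEGIN
    g.edges[x] := NEW(EdgeNode, y := v, next := g.edges[x]);
  END InsertEdge;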
Traversing a Graph
One of the most fundamental graph problems is to traverse every edge and vertex in a
graph. Applications include:
For efficiency, we must make sure we visit each edge at most twice.
For correctness, we must do the traversal in a systematic way so that we don't miss
anything.
Since a maze is just a graph, such an algorithm must be powerful enough to enable us to
get out of an arbitrary maze.
Marking Vertices
The idea in graph traversal is that we must mark each vertex when we first visit it, and
keep track of what we have not yet completely explored.
For each vertex, we can maintain two flags: discovered (have we ever encountered this
vertex?) and completely-explored (have we looked at every edge out of it?).
We must also maintain a structure containing all the vertices we have discovered but not
yet completely explored.
To completely explore a vertex, we look at each edge going out of it. For each edge
which goes to an undiscovered vertex, we mark it discovered and add it to the list of work
to do.
Note that regardless of what order we fetch the next vertex to explore, each edge is
considered exactly twice, once when each of its endpoints is explored.
Why does this visit every vertex we can reach? Suppose not, i.e. there exists a vertex
which was unvisited whose neighbor was visited. This neighbor will eventually be
explored, so we would visit it:
Traversal Orders
The order we explore the vertices depends upon what kind of data structure is used:
• Queue - by storing the vertices in a first-in, first out (FIFO) queue, we explore the
oldest unexplored vertices first. Thus our explorations radiate out slowly from the
starting vertex, defining a so-called breadth-first search.
• Stack - by storing the vertices in a last-in, first-out (LIFO) stack, we explore the
vertices by lurching along a path, constantly visiting a new neighbor if one is
available, and backing up only if we are surrounded by previously discovered
vertices. Thus our explorations quickly wander away from our starting point,
defining a so-called depth-first search.
The three possible colors of each node reflect if it is unvisited (white), visited but
unexplored (grey) or completely explored (black).
Breadth-First Search
BFS(G,s)
    for each vertex u in V - {s} do
        color[u] = white
        d[u] = infinity
        p[u] = NIL
    color[s] = grey
    d[s] = 0
    p[s] = NIL
    Q = {s}
    while Q is nonempty do
        u = head[Q]
        for each v in Adj[u] do
            if color[v] = white then
                color[v] = grey
                d[v] = d[u] + 1
                p[v] = u
                enqueue[Q,v]
        dequeue[Q]
        color[u] = black
Depth-First Search
DFS has a neat recursive implementation which eliminates the need to explicitly use a
stack.
DFS(G)
    for each vertex u in V do
        color[u] = white
        parent[u] = NIL
    time = 0
    for each vertex u in V do
        if color[u] = white then DFS-VISIT(u)
Initialize each vertex in the main routine, then do a search from each connected
component. BFS must also start from a vertex in each component to completely visit the
graph.
DFS-VISIT(u)
    color[u] = grey
    discover[u] = time
    time = time + 1
    for each v in Adj[u] do
        if color[v] = white then
            parent[v] = u
            DFS-VISIT(v)
    color[u] = black
    finish[u] = time
    time = time + 1
Midterm Exam
Name: Signature:
ID #: Section #:
INSTRUCTIONS:
1) (25 points) Assume that you have the linked structure on the left, where each node
contains a .next field consisting of a pointer, and the pointer p points to the structure as
shown. Describe the sequence of Modula-3 pointer manipulations necessary to convert it
to the linked structure on the right. You may not change any of the .info fields, but you
may use temporary pointers tmp1, tmp2, and tmp3 if you wish.
Many different solutions are possible, but recursive solutions are particularly clean and
elegant.
PROCEDURE compress(VAR head : pointer) =
VAR
second : pointer; (* pointer to next element *)
BEGIN
IF (head # NIL) THEN
second := head^.next;
IF (second # NIL) THEN
IF (head^.info = second^.info) THEN
head^.next := second^.next;
compress(head);
ELSE
compress(head^.next);
END;
END;
END;
END compress;
IMPORT SIO;
TYPE
ptr_to_integer = REF INTEGER;
VAR
a, b : ptr_to_integer;
BEGIN
a := NEW(ptr_to_integer); b := NEW(ptr_to_integer);
a^ := 1;
b^ := 2;
SIO.PutInt(a^);
SIO.PutInt(b^); SIO.Nl();
modify(a,b);
SIO.PutInt(a^);
SIO.PutInt(b^); SIO.Nl();
END.
Answers:
1 2
3 3
4 4
3 4
4) (25 points)
Write brief essays answering the following questions. Your answer must fit completely
in the space allowed.
(a) Explain the difference between objects and modules.
ANSWER: Several answers are possible, but the basic differences are (1) the notation
used to access them, and (2) that objects encapsulate both procedures and data, where
modules are procedure oriented.
(b) What is garbage collection?
ANSWER: The automatic reuse of dynamic memory which, because of pointer
dereferencing, is no longer accessible.
(c) What might be an advantage of a doubly-linked list over a singly-linked list for
certain applications?
ANSWER: Additional flexibility in moving both forward and in reverse on a linked list.
Specific advantages include being able to delete a node from a list given just a pointer to
the node, and efficiently implementing double-ended queues (supporting push, pop,
enqueue, and dequeue).
Midterm Exam
Name: Signature:
ID #: Section #:
INSTRUCTIONS:
1) (20 points) Show the state of the array after each pass by the following sorting
routines. You do not have to show the array after every move or comparison, but only
after each execution of the main sorting loop or recursive call. Sort in increasing order.
-------------------------------------------------------------
| 34 | 125 | 5 | 19 | 87 | 243 | 19 | -3 | 117 | 36 |
-------------------------------------------------------------
10 points
-------------------------------------------------------------
| 34 | 125 | 5 | 19 | 87 | 243 | 19 | -3 | 117 | 36 |
| 34 \ 125 | 5 | 19 | 87 | 243 | 19 | -3 | 117 | 36 |
| 34 | 125 \ 5 | 19 | 87 | 243 | 19 | -3 | 117 | 36 |
| 5 | 34 | 125 \ 19 | 87 | 243 | 19 | -3 | 117 | 36 |
| 5 | 19 | 34 | 125 \ 87 | 243 | 19 | -3 | 117 | 36 |
| 5 | 19 | 34 | 87 | 125\ 243 | 19 | -3 | 117 | 36 |
| 5 | 19 | 34 | 87 | 125| 243 \ 19 | -3 | 117 | 36 |
| 5 | 19 | 19 | 34 | 87 | 125| 243 \ -3 | 117 | 36 |
| -3 | 5 | 19 | 19 | 34 | 87 | 125| 243 \ 117 | 36 |
| -3 | 5 | 19 | 19 | 34 | 87 | 117| 125| 243 \ 36 |
| -3 | 5 | 19 | 19 | 34 | 36 | 87 | 117| 125| 243 |
-------------------------------------------------------------
(a) Write a function to implement sequential search in a linked list, with the move-to-
front heuristic. You may assume that there are at least two elements in the list and that the
item is always found.
var q, head:pointer;
head := p;
q := p;
(b) Which of these two heuristics is better suited for implementation with arrays? Why?
5 points
move-forward-one is better for arrays since it can be done via one swap.
3) (15 points) Assume you have an array with 11 elements that is to be used to store data
as a hash table. The hash function computes the number mod 11. Given the following
list of insertions to the table:
2 4 13 18 22 31 33 34 42 43 49
Show the resulting table after the insertions for each of the following hashing collision
handling methods.
a) Show the resulting table after the insertions for chaining. (array of linked lists)
10 points
0 - 22, 33
1 - 34
2 - 2, 13
3 -
4 - 4
5 - 49
6 -
7 - 18
8 -
9 - 31, 42
10 - 43
5 points.
Disadvantages - the links use up memory which could otherwise go to a bigger hash
table.
4) (20 points) Write brief essays answering the following questions. Your answer must
fit completely in the space allowed
(b) Consider the following variant of insertion sort. Instead of using sequential search to
find the position of the next element we insert into the sorted array, we use a binary
search. We then move the appropriate elements over to create room for the new insertion.
What is the worst case number of element comparisons performed using this version of
insertion sort on n items (big Oh)? points
(c) What is the worst case number of element movements performed using the above
version of insertion sort on n items (big Oh)?
6 points
5) (20 points) The integer square root of integer n (SQRT(n)) is the largest integer x such
that x * x <= n. For example, SQRT(8) = 2, while SQRT(9) = 3.
Write a Modula-3 function to compute SQRT(n). For full credit, your algorithm should
run in O(log n) time. Partial credit will be given for a slower algorithm.
PROCEDURE SQRT(n: INTEGER): INTEGER =
VAR
low, high, mid : INTEGER;
BEGIN
low := 1;
high := n;
WHILE low <= high DO (*binary search on the answer*)
mid := (low + high) DIV 2;
IF mid * mid <= n THEN low := mid + 1 ELSE high := mid - 1 END;
END;
RETURN high; (*largest value whose square is <= n*)
END SQRT;