MSc. In Software

Data Structures

1

INTRODUCTION TO DATA STRUCTURES

MAIN POINTS COVERED
! INTRODUCTION
! PITFALLS
! DATA STRUCTURES
! SUMMARY
This subject describes programming methods and tools that will prove effective
for projects.
Food for thought:
How do you rewrite the following function so that it accomplishes the same result
in a less tricky way?
Doessomething(int *first,int *second)
{
*first = *second - *first;
*second = *second - *first;
*first = * second + *first;
}
A probable solution is like this. Tracing the three assignments shows that the
function simply swaps the values of *first and *second, so the less tricky
version uses a temporary variable:
void Doessomething(int *first, int *second)
{
    int temp = *first;
    *first = *second;
    *second = temp;
}
By now we know that the computer understands logical commands. The
instructions given to the computer must be in a very structured form. This
structured form is called Algorithm in computer jargon. The algorithm is a
representation of any event in a stepwise manner. In some cases these sequence
of activities are quite simple and the algorithm can be easily constructed. But
when the problem at hand is quite complex, and a lot of different activities have
to be considered within a single problem, keeping track of all these events and the
variables they involve becomes a very tedious task. To manage and handle these
events and variables in a more structured and orderly manner we take the aid of
data structures.
Food for thought:
Rewrite the following function with meaningful variables, with better format and
without unnecessary variables.
#define MAXINT 100
int calculate(int apple, int orange)
{ int peach, lemon;
peach = 0; lemon = 0; if (apple < orange) {
peach = orange;} else if (orange <= apple) {
peach = apple;} else { peach = MAXINT; lemon = MAXINT;
} if (lemon != MAXINT) {return (peach);}}
A probable solution:
Do it yourself. If you cannot, email to us.
An exercise like this teaches you how to write a function with meaningful
variable names, without extra variables that contribute nothing to the
understanding, with a better layout and without redundant and useless
statements.
PITFALLS
1. Be sure you understand your problem before you decide how to solve it.
2. Be sure you understand the algorithmic method before you start to
program.
3. In case of difficulty, divide the problem into pieces and think of each part
separately.
Attribute   Value
Name        XYZ
Age         34
Sex         M
Emp. No.    134
Entity Set : Entities with similar attributes (e.g., all the employees in an
organization) form an entity set.
Range of Values : Each attribute of an entity set has a range of values, the set
of all possible values that could be assigned to the particular attribute.
Information : The term information is sometimes used for data with given
attributes, or, in other words, meaningful or processed data.
The way data is organized into the hierarchy of fields, records and files reflects
the relationship between attributes, entities and entity sets.
Field - Single elementary unit of information representing an attribute of an
entity
Record - Collection of field values of a given entity
File - Collection of records of the entities in a given entity set
Primary key : Each record in a file may contain many field items, but the value
in a certain field may uniquely determine the record in the file. Such a field K is
called a primary key, and the values K1, K2, ... in such a field are called keys or
key values.
Consider these cases to understand Primary key better:
(1) Suppose an automobile dealership maintains a file where each record contains
the following data:
Serial Number
Type
Year
Price
Accessories
The Serial Number field can serve as a primary key for the file, since each
automobile has a unique serial number.
(2) Suppose an organization maintains a membership file where each record
contains the following data:
Name
Address
Telephone Number
Dues Owed
Although there are four data items, Name cannot be the primary key, since
more than one person can have the same name. Name and Address may be
group items and together can serve as a primary key. Note also that the
Address and Telephone Number fields may not serve as primary keys, since
some members may belong to the same family and have the same address
and telephone number. Dues Owed is out of the question because many
people can have the same value.
The above examples should have clarified the idea of keys, which we are going
to use throughout.
Records may also be classified according to length. A file can have fixed-length
records or variable-length records.
Fixed-length records - All the records contain the same data items with the
same amount of space assigned to each data item.
Variable-length records - The records in the file may have different lengths. For
example, student records usually have variable lengths, since different students
take a varying number of courses. Usually, variable-length records have a
minimum and a maximum length.
The study of such data structures includes the following three steps:
(1) Logical or mathematical description of the structure
(2) Implementation of the structure on a computer
(3) Quantitative analysis of the structure, which includes determining the
amount of memory needed to store the structure and the time required to
process it
DATA STRUCTURES
The logical or mathematical model of a particular organization of data is called a
data structure. Here we will discuss three types of data structures in detail. They
are: arrays, linked lists, and trees.
Arrays
The simplest type of data structure is a linear (or one-dimensional) array. By a
linear array, we mean a list of a finite number n of similar data elements
referenced respectively by a set of n consecutive numbers, usually 1, 2, 3, ..., n.
If we choose the name A for the array, then the elements of A are denoted by the
subscript notation
    A1, A2, A3, ..., An
or by the parenthesis notation
    A(1), A(2), A(3), ..., A(N)
Name1
Name2
Name3
Name4
Name5
Name6
Fig 1-1
Arrays may also be multidimensional. Consider, for example, a two-dimensional
array whose size is denoted by 3 X 4 (read 3 by 4), since it contains 3 rows
(the horizontal lines of numbers) and 4 columns (the vertical lines of numbers).
If we denote the 1st array member as array[0][0], the array members are laid
out like this:

array[0][0]  array[0][1]  array[0][2]  array[0][3]
array[1][0]  array[1][1]  array[1][2]  array[1][3]
array[2][0]  array[2][1]  array[2][2]  array[2][3]

Fig 1-2
The position of the highlighted cell in the block in the array notation is
array[2][3].
Linked Lists
A linked list is the most important and difficult part of data structures. So if you
want to be an expert in data structures, put extra emphasis on linked lists:
understand the basic idea and try to solve as many examples as you can.
If we only give the theory of linked lists it will be difficult for you to understand.
Therefore, we will introduce it with an example. Consider a file where each record
contains a customer's name and his or her salesperson, and suppose the file
contains the data as appearing in the figure 1-3. Clearly the file could be stored in
the computer by such a table, i.e. by two columns of five names. However, this
may not be the most useful way to store the data.
Salesperson file         Customer file
1  Salesperson1          Customer1   Salesperson1
2  Salesperson2          Customer2   Salesperson2, Salesperson3
3  Salesperson3          Customer3   Salesperson3, Salesperson8
4  Salesperson4          Customer4   Salesperson4
5  Salesperson5          Customer5   Salesperson5, Salesperson1
6  Salesperson6
7  Salesperson7
8  Salesperson8

Fig. 1-3
Another way of storing data in the figure 1-3 is to have a separate array for the
salespeople and an entry (called a pointer) in the customer file, which gives the
location of each customer's salesperson. This is done in the figure 1-4, where
against every customer name we have written the number (pointer) of the
corresponding salesperson.
Practically speaking, in figure 1-3, in front of each customer we have specified his
salesperson's name. In the above case the number of customers is very small,
so we can afford to do this. But imagine a case where there are hundreds
of customers. In such a case, repeating the name of the salesperson would
consume a lot of space. Instead, we give numbers to the salespersons and
mention that number in front of the customer's name. Thus we can save a lot of
space.
   Customer file   Pointer
a  Customer1       1
b  Customer2       2, 3
c  Customer3       3, 8
d  Customer4       4
e  Customer5       5, 1

Fig. 1-4
Suppose the firm wants the list of customers for a given salesperson. Using the
data representation in this figure 1-4, the firm would have to search through the
entire customer file. One way to simplify such a search is to have a table
containing customer name and a number (pointer) corresponding to each
customer. Each salesperson would now have a set of numbers (pointer) giving the
position of his or her customers, as in this figure 1-5.
Salesperson1   a, e
Salesperson2   b
Salesperson3   b, c
Salesperson4   d
Salesperson5   e
Salesperson6
Salesperson7
Salesperson8   c

Fig. 1-5
Disadvantage:
The main disadvantage of this representation is that each salesperson may have
many pointers and the set of pointers will change as customers are added and
deleted.
Salesperson1 --1--> Customer1 --a--> Customer2 --b--> Customer3 --c--> Customer4 --0

Fig. 1-6
Here 1 is the pointer of Salesperson1. This pointer points to Customer1. a is
the pointer of Customer1, which in turn points to Customer2. Similarly, b,
which is the pointer of Customer2 points to Customer3 and so on. Since
Customer4 is the last customer i.e. this customer is not further connected to any
other customer, its pointer has been assigned to 0. (In this picture we have
considered that Salesperson1 has got Customer1, Customer2, Customer3,
Customer4).
Trees
Data frequently contains a hierarchical relationship between various elements.
The data structure, which reflects a hierarchical relationship between various
elements, is called a rooted tree graph or, simply, a tree.
Trees will be defined and discussed in detail in later modules but here we indicate
some of their basic properties by means of two examples:
(a) An employee's personnel record:
This may contain the following data items
i) Social Security Number
ii) Name
iii) Address
iv) Age
v) Salary
vi) Dependents
However, Name may be a group item with the subitems Last, First and MI (middle
initial). Also, address may be a group item with the subitems Street address and
Area address, where Area itself may be a group item having subitems City, State
and ZIP code. This hierarchical structure is explained in figure 1-7 (a).
Fig 1-7(a)
Another way of picturing such a tree structure is in terms of levels, as shown in
figure 1-7 (b).
01 Employee
    02 Social Security Number
    02 Name
        03 Last
        03 First
        03 Middle Initial
    02 Address
        03 Street
        03 Area
            04 City
            04 State
            04 ZIP
    02 Age
    02 Salary
    02 Dependents

Fig. 1-7(b)
(b) An algebraic expression in tree structure format:
Let the expression be
    (2x + y)(a - 7b)^3
Now we want to represent the expression by a tree, so let's use an upward arrow
(^) for exponentiation and an asterisk (*) for multiplication. We can then show
the expression as the tree diagram in figure 1-8. Observe that the order in which
the operations will be performed is reflected in the diagram: the exponentiation
must take place after the subtraction, and the multiplication at the top of the
tree must be executed last.
Fig. 1-8
Fig. 1-9 (a tree with nodes City2, City3, City4 and City5)
A real example will make these concepts clear:
An organization contains a membership file in which each record contains the
following data for a given member.
Name
Address
Telephone Number
Age
Sex
(a) Suppose the organization wants to announce a meeting through a mailing
system. Then one would traverse the file to obtain Name and Address for each
member.
(b) Suppose one wants to find the names of all members living in a certain area.
Again one would traverse the file to obtain the data.
(c) Suppose one wants to obtain address for a given Name. Then one would
search the file for the record containing Name.
(d) Suppose a new person joins the organization. Then one would insert his or her
record into the file.
(e) Suppose a member dies. Then one would delete his or her record from the file.
(f) Suppose a member has moved and has a new address and telephone number.
Given the name of the member, one would first need to search for the record
in the file. Then one would perform the "update"--i.e., change items in the
record with the new data.
(g) Suppose one wants to find the number of members 65 or older. Again one
would traverse the file, counting such members.
Searching Algorithms
Consider a membership file in which each record contains, among other data, the
name and telephone number of its member. Suppose we are given the name of a
member and we want to find his or her telephone number. One way to do this is
to linearly search through the file, i.e., to apply the following algorithm:
Linear Search
Search each record of the file, one at a time, until the given Name and the
corresponding telephone number is found.
First, consider that the time required to execute the algorithm is proportional to
the number of comparisons.
Second, assuming that each name in the file is equally likely to be picked, it is
intuitively clear that the average number of comparisons for a file with n records
is equal to n/2; that is, the complexity of the linear search algorithm is given by
C(n) = n/2.
Binary Search
Compare the given Name with the name in the middle of the list. This indicates
which half of the list contains Name. Then compare Name with the name in the
middle of the correct half to determine which quarter of the list contains Name.
Continue the process until Name is found in the list.
One can show that the complexity of the binary search algorithm is given by
C(n) = log2 n.
Thus, for example, one will not require more than 6 comparisons to find a given
Name in a list containing 64 (= 2^6) names.
Drawback
Although the binary search algorithm is a very efficient algorithm, it has some
major drawbacks. Specifically, the algorithm assumes that one has direct access
to the middle name in the list or a sublist. This means that the list must be stored
in some type of array. Unfortunately, inserting an element in an array requires
elements to be moved down the list, and deleting an element from an array
requires elements to be moved up the list.
An Example of Time-Space Tradeoff
Suppose a file of records contains names, employee numbers and much additional
information among its fields. For finding the record for a given name, sorting the
file alphabetically and using a binary search is a very efficient way. On the other
hand, suppose we are given only the employee number of the person. Then we
would have to do a linear search for the record, which is extremely time-consuming for a very large number of records. How can we solve such a problem?
One way is to have another file, which is sorted numerically according to the
employee number. This, however, would double the space required for storing the
data. Another way, pictured in figure 1-10, is to have the main file sorted
numerically by employee number and to have an auxiliary array with only two
columns, the first column containing an alphabetized list of the names and the
second column containing pointers, which give the locations of the corresponding
records in the main file. This is one way of solving the problem that is done
frequently, since the additional space, containing only two columns, is minimal for
the amount of extra information it provides.
Main file (sorted numerically by employee number):

Employee No.   Name     Extra Data
1-abc          Name1    XXXX
2-xyz          Name2    XXXX
3-pqr          Name3    XXXX
4-mnp          Name4    XXXX
5-lmn          Name5    XXXX

Auxiliary array (alphabetized by name):

Name     Pointer
Name1    1
Name2    2
Name3    3
Name4    4
Name5    5

Fig. 1-10
Quote of the chapter:
Act in haste and repent in leisure.
Program in haste and debug forever.
Summary
# An entity is something that has certain attributes or properties that may
be assigned values. These values may be numeric or nonnumeric.
# The logical or mathematical model of a particular organization of data is
called a data structure. The three most popular data structures are arrays,
linked lists and trees.
# Some other data structures are stacks, queues and graphs.
# We have discussed two searching algorithms: (i) linear search and
(ii) binary search.
# Sometimes two or more data organizations may be used together in a given
situation to get optimum speed.
Zee Interactive Learning Systems
2

COMPLEXITY, RATE OF GROWTH,
BIG O NOTATION
MAIN POINTS COVERED
! INTRODUCTION
a) Floor and Ceiling Functions
b) Remainder function: Modular Arithmetic
c) Integer and Absolute Value Functions
d) Summation Symbol: Sums
e) Factorial Function
f) Permutations
g) Exponents and Logarithms
! ALGORITHMIC NOTATION
! CONTROL STRUCTURES
! COMPLEXITY OF ALGORITHMS
! SUMMARY
INTRODUCTION
This section gives various mathematical functions, which appear very often in
the analysis of algorithms and in computer science.
a) Floor and Ceiling Functions
Let x be any real number. Then x lies between two integers called the floor
and the ceiling of x. Specifically,
⌊x⌋, called the floor of x, denotes the greatest integer that does not exceed x.
⌈x⌉, called the ceiling of x, denotes the least integer that is not less than x.
If x is itself an integer, then ⌊x⌋ = ⌈x⌉; otherwise ⌊x⌋ + 1 = ⌈x⌉.
For example:
    ⌊3.14⌋ = 3,   ⌊√5⌋ = 2,   ⌊-8.5⌋ = -9,   ⌊7⌋ = 7
    ⌈3.14⌉ = 4,   ⌈√5⌉ = 3,   ⌈-8.5⌉ = -8,   ⌈7⌉ = 7
b) Remainder Function: Modular Arithmetic

Let k be any integer and let M be a positive integer. Then
    k (mod M)
denotes the integer remainder when k is divided by M. More exactly, k (mod M)
is the unique integer r such that
    k = Mq + r    where 0 ≤ r < M    [q is the quotient, r the remainder]

The term "mod" is also used for the mathematical congruence relation, which
is denoted and defined as:
    a ≡ b (mod M)    if and only if    M divides (b - a)

c) Integer and Absolute Value Functions

Let x be any real number. The integer value of x, written INT(x), converts x
into an integer by deleting (truncating) the fractional part of the number, so
    INT(3.14) = 3,   INT(√5) = 2,   INT(-8.5) = -8,   INT(7) = 7
Thus INT(x) = ⌊x⌋ or INT(x) = ⌈x⌉ according to whether x is positive or
negative.
The absolute value of the real number x, written ABS(x) or |x|, is defined as
the greater of x or -x. Hence ABS(0) = 0, and, for x ≠ 0, ABS(x) = x if x is
positive and ABS(x) = -x if x is negative. Thus
    |-15| = 15,   |7| = 7,   |-3.33| = 3.33,   |4.44| = 4.44,   |-0.075| = 0.075
We note that |x| = |-x| and, for x ≠ 0, |x| is positive.
d) Summation Symbol: Sums

Consider a sequence a1, a2, a3, ... The sum
    am + am+1 + ... + an
is denoted by the summation symbol
    Σ aj    (j = m to n)
e) Factorial Function

The product of the positive integers from 1 to n, inclusive, is denoted by n!
(read "n factorial"). By definition, 0! = 1. Thus
    1! = 1;   2! = 1·2 = 2;   3! = 1·2·3 = 6;   4! = 1·2·3·4 = 24;
    5! = 5·4! = 5·24 = 120;   6! = 6·5! = 6·120 = 720
f) Permutations
A permutation of a set of n elements is an arrangement of the elements in a
given order. For example the permutations of the set consisting of the
elements a , b , c are:
abc , acb, bac, bca, cab, cba
One can prove: There are n! permutations of a set of n elements.
Accordingly there are 4! = 24 permutations of a set of 4 elements, 5! = 120
permutations of a set with 5 elements, and so on.
g) Exponents and Logarithms

We consider first integer exponents (where m is a positive integer):
    a^m = a · a · ... · a  (m times),    a^0 = 1,    a^-m = 1/a^m

Exponents are extended to include all rational numbers by defining, for any
rational number m/n,
    a^(m/n) = (a^m)^(1/n) = (a^(1/n))^m
For example,
    2^4 = 16,    2^-4 = 1/2^4 = 1/16,    125^(2/3) = 5^2 = 25

Logarithms are related to exponents. The logarithm of a positive number x to
the base b, written log_b x, is the exponent y such that
    b^y = x
For example, log_2 8 = 3 since 2^3 = 8.
Further examples:
    log_10 100 = 2       since 10^2 = 100
    log_2 64 = 6         since 2^6 = 64
    log_10 0.001 = -3    since 10^-3 = 0.001

For any base b,
    log_b 1 = 0    and    log_b b = 1

The logarithm of a negative number and the logarithm of 0 are not defined.

The exponential function f(x) = b^x and the logarithmic function
g(x) = log_b x are inverses of each other.

Logarithms to the base e (where e = 2.718281...) are called natural
logarithms, logarithms to the base 10 are called common logarithms, and
logarithms to the base 2 are called binary logarithms. For example,
log_e 40 = 3.6889.
ALGORITHMIC NOTATION
An algorithm is a finite step-by-step list of well-defined instructions for
solving a particular problem. This section describes the format that is used to
present algorithms throughout the text. This algorithmic notation is best
described by means of examples.
An array DATA of numerical values is in memory. We want to find the
location LOC and the value MAX of the largest element of DATA. Given no
other information about DATA, one way to solve the problem is:
i) Initially we begin with LOC = 1 and MAX = DATA [1].
ii) Then compare MAX with each successive element DATA [K] of DATA.
iii) If DATA [K] exceeds MAX, then update LOC and MAX
so that LOC = K and MAX = DATA [K].
The final values appearing in LOC and MAX give the location and value of the
largest element of DATA.
Algorithm (Largest Element in Array) A nonempty array DATA with N
numerical values is given. The algorithm finds the location LOC and the value
MAX of the largest element of DATA.
The variable K is used as a counter.
Step 1. [Initialize.] Set K := 1, LOC := 1 and MAX := DATA[1].
Step 2. [Increment counter.] Set K := K + 1.
Step 3. [Test counter.] If K > N, then write LOC, MAX and Exit.
Step 4. [Compare and update.] If MAX < DATA[K], then set LOC := K and
        MAX := DATA[K].
Step 5. [Repeat loop.] Go to Step 2.
The format for the formal presentation of an algorithm consists of two parts.
The first part identifies the variables, which occur in the algorithm and lists
the input data. The second part of the algorithm consists of the list of steps
that is to be executed.
CONTROL STRUCTURES
Algorithms and their equivalent computer programs are more easily
understood if they use self-contained modules and three types of logic or
flow of control.
(1) Sequence logic, or sequential flow
(2) Selection logic, or conditional flow
(3) Iteration logic, or repetitive flow
These three types of logic are discussed below and in each case we show the
equivalent flowchart.
Sequence Logic (Sequential Flow)
In this case the modules are executed in the obvious sequence. The
sequence may be presented explicitly, by means of numbered steps, or
implicitly, by the order in which the modules are written.
Algorithm                 Flowchart

  .                       [ Module A ]
  Module A                     |
  Module B                [ Module B ]
  Module C                     |
  .                       [ Module C ]

Sequence logic.
Selection Logic (Conditional Flow)

Selection logic employs one or more conditions that select which module is
executed. The double-alternative structure has the form

if condition, then
    [ Module A ]
else
    [ Module B ]
[ End of If structure. ]

Multiple alternatives:
The structure has the form

if condition(1), then
    [ Module A1 ]
else if condition(2), then
    [ Module A2 ]
    :
    :
else if condition(M), then
    [ Module AM ]
else
    [ Module B ]
[ End of If structure ]
The logic of the structure allows only one of the modules to be executed.
Example
The solution of the quadratic equation
    ax^2 + bx + c = 0
where a ≠ 0, is given by the quadratic formula
    x = (-b ± √(b^2 - 4ac)) / 2a

Step 1. Read: A, B, C.
Step 2. Set D := B^2 - 4AC.
Step 3. if D > 0, then:
            Set X1 := (-B + √D)/2A and X2 := (-B - √D)/2A.
            Write: X1, X2.
        else if D = 0, then:
            Set X := -B/2A.
            Write: UNIQUE SOLUTION, X.
        else:
            Write: NO REAL SOLUTION.
        [ End of If structure. ]
Step 4. Exit.
Iteration Logic (Repetitive Flow)

The third kind of logic refers to either of two types of structures involving
loops. Each type begins with a Repeat statement and is followed by a
module, called the body of the loop.
There are two types of such loops:
(1) repeat-for loop
Repeat for K = R to S step T
[ Module ]
[ End of Loop ]
(2) repeat-while loop
Repeat while condition
[ Module ]
[ End of loop ]
We have discussed an algorithm for finding the maximum element in an
array. Now we are going to discuss the same problem using a repeat-while
loop
Algorithm (Largest Element in Array) Given a nonempty array DATA with N
numerical values, this algorithm finds the location LOC and the value MAX of
the largest element of DATA.
1. [Initialize.] Set K := 1, LOC := 1 and MAX := DATA[1].
2. Repeat Steps 3 and 4 while K ≤ N:
3.     if MAX < DATA[K], then: Set LOC := K and MAX := DATA[K].
       [End of If structure.]
4.     Set K := K + 1.
   [End of loop.]
5. Write: LOC, MAX.
6. Exit.
COMPLEXITY OF ALGORITHMS
In designing algorithms we need methods to separate bad algorithms from
good ones. This will enable us to choose the right one in a given situation.
The analysis of algorithms and comparisons of alternative methods constitute
an important part of software engineering. In order to compare algorithms,
we have to find out the efficiency of our algorithms. In this section we will
discuss how to find efficiency. Let's see this with an example.
Example 1
Food for thought:
What do you think are the possible criteria that measure the efficiency of an
algorithm?
(a) Time taken
(b) Length of the algorithm
(c) Memory space used
(d) Time required in writing the algorithm
(a) and (c) are the correct answers.
Suppose you are given an algorithm M, and the size of the input data is n.
Then the efficiency of the algorithm M depends on two main measures: (1) the
time taken by the algorithm and (2) the memory space it uses.
c) Never occur
d) Not quite sure
Just for thought, any answer can be true.
The word "zee" is not a very common word, so W may not appear at all; the
algorithm must then examine the whole text, and the complexity f(n) of the
algorithm will be large.
The above discussion leads us to the question of finding the complexity
function f(n) for certain cases. The two cases one usually investigates in the
complexity theory are as follows:
(1) Worst case: the maximum value of f(n) for any possible input
(2) Average case: the expected value of f(n)
(3) Best case: sometimes we also consider the minimum possible value of f(n),
    called the best case
Food for thought:
What is the best case while searching in an array for a specific element?
(a) Element occurs at the end of the array
(b) Element occurs at the beginning of the array
(c) Element occurs at the middle most position of the array
(d) Best case does not exist
(b)
That is, we compare ITEM with DATA[1], then DATA[2], and so on, until we find
LOC such that
ITEM = DATA[LOC].
A formal representation of the algorithm is as follows:
Algorithm (Linear Search) A linear array DATA with N elements and a
specific ITEM of information are given. The algorithm finds the location LOC
of ITEM in the array DATA or sets LOC = 0.
1. [Initialize.] Set K = 1 and LOC = 0.
2. Repeat Steps 3 and 4 while LOC = 0 and K ≤ N.
3.     if ITEM = DATA[K], then Set LOC = K.
4.     Set K = K + 1. [Increments counter.]
   [End of Step 2 loop.]
5. [Successful?]
   if LOC = 0, then
       Write: ITEM is not in the array DATA.
   else
       Write: LOC is the location of ITEM.
   [End of If structure.]
6. Exit.
This agrees with our intuitive feeling that the average number of
comparisons needed to find the location of ITEM is approximately equal to
half the number of elements in the DATA set.
Fig 2-1 [a table of the common rates of growth: log n, n, n log n, n^2, n^3, 2^n]

The above table is arranged such that the rate of growth of the function
increases from left to right, with log n having the lowest rate of growth and
2^n having the highest rate of growth.
To indicate the convenience of this notation, we give the complexity of
certain well known searching and sorting algorithms:
(a) Linear search  :  O(n)
(b) Binary search  :  O(log n)
(c) Bubble sort    :  O(n^2)
(d) Merge-sort     :  O(n log n)
Summary
# In order to compare algorithms, we have to find out their efficiency. This
will help us to employ the right one in order to solve problems
effectively.
# Suppose M is an algorithm, and n is the size of the input data for that
algorithm. Clearly the complexity f(n) of M increases as n increases.
# Big O notation states that f(n) = O(g(n)) if there exist a positive
integer n0 and a positive number M such that, for all n > n0, we have
| f(n) | < M | g(n) |.
3
ARRAYS
MAIN POINTS COVERED
! INTRODUCTION
! LINEAR ARRAY
! MULTIDIMENSIONAL ARRAYS
Introduction

Data structures are classified into two broad categories - linear and non-linear.
The most elementary data structure that we will introduce is the array.
Advantages:
Arrays have a linear structure
They are easy to traverse, search and sort
They are easy to implement
Disadvantages:
The length of the array cannot be changed once it is specified
Food for thought:
What do you think is the reason for the above disadvantage?
(a) Array size has to be fixed at the beginning
(b) Problem with memory allocation occurs if size is altered
(c) Once a fixed block of memory has been reserved for a particular
array it cannot be altered
(d) Variable length arrays are not required in real life
(a)
LINEAR ARRAY
A linear array is a list of a finite number n of homogeneous data elements
(i.e., data elements of the same type) such that:
(a) The elements of the array are referenced respectively by an index
set consisting of n consecutive numbers
(b) The elements of the array are stored respectively in successive
memory locations
Length or size of an array = The number of elements in the array
Length = UB - LB + 1
where UB is the largest index, called the upper bound,
and LB is the smallest index, called the lower bound.
Notation For Representing Arrays
Food for thought:
You can represent the elements of an array A in memory by which arrangement?
(a) Linear
(b) Circular
(c) Random
(d) We cannot know about the memory locations
(a) is the correct choice, as seen from the diagram below.
Food for thought:
Why is it enough for the computer to keep track of only the address of the
first element of an array?
(a) Array elements except the first one are not required
(b) The first element contains information of all the other elements
(c) Knowing the first element and the position of the required element
we can traverse the array to reach that element
(c) is the correct choice, as explained below.
The computer keeps track of Base(LA), the address of the first element of LA.
Using this base address, the computer calculates the address of any element of
LA by the following formula:
LOC(LA[K]) = Base(LA) + w(K - lower bound)
Where w is the number of words per memory cell for the array LA. Observe
that the time to calculate LOC(LA[K]) is essentially the same for any value of
K. Furthermore, given any subscript K, one can locate and access the content
of LA[K] without scanning any other element of LA.
Fig 3-1 (successive memory locations 100, 101, 102, 103, 104)
We can easily insert an element at the "end" of a linear array provided the
memory space allocated for the array is large enough to accommodate the
additional element. On the other hand, if we need to insert an element in the
middle of the array, then, on the average, half of the elements must be
moved downward to new locations to accommodate the new element and
keep the order of the other elements.
Similarly, deleting an element at the "end" of an array presents no
difficulties, but deleting an element somewhere in the middle of the array
would require each subsequent element to be moved one location upward in
order to "fill up" the array.
Example
Suppose TEST has been declared to be a 5-element array but data have been
recorded only for TEST[1], TEST[2] and TEST[3]. If X is the value of the next
test, then one simply assigns
TEST [4] = X
to add X to the list. Similarly, if Y is the value of the subsequent test, then
we simply assign
TEST[5] = Y
to add Y to the list. Now, however, we cannot add any new test scores to the
list.
SORTING; BUBBLE SORT
Let A be a list of n numbers. Sorting A refers to the operation of
rearranging the elements of A so they are in increasing order, i.e., so that
    A[1] < A[2] < A[3] < ... < A[N]
For example, suppose A originally is the list
    23, 4, 5, 13, 6, 19
After sorting, A is the list
    4, 5, 6, 13, 19, 23
Bubble Sort
Suppose the list of numbers A[1], A[2], ..., A[N] is in memory. The
bubble sort algorithm works as follows:
Step 1. First we have to compare A[1], A[2] and arrange them in the desired
order, so that A[1] < A[2].
Similarly we can compare A[2] and A[3] and arrange them so that A[2] <
A[3]. Then compare A[3] and A[4] and arrange them so that A[3] < A[4].
We have to continue this process of comparison until we compare A[N - 1]
with A[N] and arrange them so that A[N-1] < A[N].
You can note that Step 1 involves N - 1 comparisons. (During Step 1, the
largest element is "bubbled up" to the nth position or "sinks" to the nth
position.) When Step 1 is completed, A[N] will contain the largest element.
Step 2. Repeat Step 1 with one less comparison; that is, now we stop after
we compare and possibly rearrange A[N-2] and A[N-1]. (Step 2 involves
N - 2 comparisons and, when Step 2 is completed, the second largest element
will occupy A[N-1].)
Step3. Repeat Step 1 with two fewer comparisons; that is, we stop after
comparing and rearranging A[N-3] and A[N-2].
.......................................................
.......................................................
.......................................................
Step N-1. Compare A[1] with A[2] and arrange them so that A[1] < A[2].
After n - 1 steps, the list will be stored in increasing order.
Linear Search
Suppose DATA is a linear array with n elements. We have not been given any
other information about DATA. The most intuitive way to search for a given
ITEM in DATA is to compare ITEM with each element of DATA one by one.
That is, first we test whether DATA[1] = ITEM, and then we test whether
DATA[2] = ITEM, and so on. This method, which traverses DATA sequentially
to locate ITEM, is called linear search or sequential search.
To simplify this, we first assign ITEM to DATA[N + 1], the position following
the last element of DATA. Then the outcome
LOC = N + 1
where LOC denotes the location where ITEM first occurs in DATA, signifies
that the search is unsuccessful. The purpose of this initial assignment is to
avoid repeatedly testing whether we have reached the end of the array
DATA. This way, the search must eventually "succeed".
We have shown an algorithm for linear search.
Observe that Step 1 guarantees that the loop in Step 3 must terminate.
Without Step 1, the Repeat statement in Step 3 must be replaced by the
following statement, which involves two comparisons, not one:
Repeat while LOC ≤ N and DATA[LOC] ≠ ITEM
On the other hand, in order to use Step 1, one must guarantee that there is
an unused memory location.
Algorithm : (Linear Search) LINEAR (DATA, N, ITEM,LOC) Here DATA is
a linear array with N elements and ITEM is a given item of
information. This algorithm finds the location LOC of ITEM in
DATA, or sets LOC = 0 if the search is unsuccessful.
1 [Insert ITEM at the end of DATA.] Set DATA[N + 1] = ITEM.
2 [Initialize counter.] Set LOC = 1.
3 [Search for ITEM.]
Repeat while DATA[LOC] ≠ ITEM
Set LOC = LOC + 1
[End of loop.]
4 [Successful?] if LOC = N + 1, then Set LOC = 0.
Exit.
MULTIDIMENSIONAL ARRAYS
The linear arrays we have discussed so far are also called one-dimensional
arrays, since each element in the array is referenced by a single subscript.
Most programming languages allow two-dimensional and three-dimensional
arrays, i.e., arrays where elements are referenced, respectively, by two and
three subscripts. In fact, some programming languages allow the number of
dimensions for an array to be as high as 7.
Food for thought:
Which of the following events have to be represented on the computer by
multidimensional arrays?
Chess board
Sales figures as per year of a certain firm
Co-ordinates in mathematics
Answers: chess board, yes; sales figures as per year, no; co-ordinates in mathematics, yes.
Two-Dimensional Arrays
A two-dimensional m × n array A is a collection of m × n data elements
such that each element is specified by a pair of integers (such as j, k), called
subscripts, with the property that
1 ≤ j ≤ m and 1 ≤ k ≤ n
The element of A with first subscript j and second subscript k will be denoted
by
Aj, k
or A [j, k]
The subscripts of a 3 × 4 two-dimensional array A:

          Column 1   Column 2   Column 3   Column 4
Row 1     (1,1)      (1,2)      (1,3)      (1,4)
Row 2     (2,1)      (2,2)      (2,3)      (2,4)
Row 3     (3,1)      (3,2)      (3,3)      (3,4)
Recall that, for a linear array LA, the computer does not keep track of the
address LOC(LA[K]) of every element LA[K] of LA, but it does keep track
of Base (LA), the address of the first element of LA. The computer uses the
formula
LOC(LA[K]) = Base (LA) + w(K - 1)
BK1, K2, ..., Kn
or
B[K1, K2, ..., Kn]
The elements are listed such that the first subscript varies first (most
rapidly), the second subscript second (less rapidly), and so on.
Summary
In this module we studied,
# A linear array is a list of a finite number n of homogeneous data
elements, that is, data elements of the same type.
# Let A be a collection of data elements stored in the memory of
the computer. Suppose we want to print the contents of each
element of A or suppose we want to count the number of
elements of A with a given property. We can accomplish this by
traversing A, that is, by accessing and processing (frequently
called visiting) each element of A exactly once.
# Let A be a collection of data elements in the memory of the
computer. "Inserting" refers to the operation of adding another
element to the collection A, and "deleting" refers to the operation
of removing one of the elements of A.
# Sorting means arranging numerical data in increasing (or decreasing)
order or arranging non-numerical data alphabetically.
# Searching refers to the operation of finding the location LOC of
ITEM in DATA, or printing some message that ITEM does not
appear there.
Linear arrays are called one-dimensional arrays, since each element in
the array is referenced by a single subscript. Most programming languages
allow two-dimensional and three-dimensional arrays, i.e., arrays where
elements are referenced, respectively, by two and three subscripts. In fact,
some programming languages allow the number of dimensions for an array
to be as high as 7.
Zee Interactive Learning Systems
4
LINKED LISTS
MAIN POINTS COVERED
! Introduction
! Traversing a Linked List
! Searching a Linked List
! Memory allocation; garbage collection
! Insertion into a linked list
! Deletion from a linked list
! Header linked lists
! Two-way lists
! Binary trees
! Representing binary trees in memory
! Traversing binary trees
! Traversal Algorithms using Stacks
! Binary search trees
! Searching and inserting in binary search trees
! Deleting in binary search trees
! Heap; Heapsort
! Summary
Introduction
A linked list is a linear collection of data elements, called nodes, where the linear
order is maintained by pointers. We divide each node into two parts.
In figure 4-1, each node has two parts. The left part represents the information part of
the node, which may contain an entire record of data items (e.g., NAME, ADDRESS). The
right part represents the next pointer field of the node, and there is an arrow drawn from
it to the next node in the list. The pointer of the last node contains a special value, called
the null pointer, which is any invalid address.
Fig 4-1
We denote the null pointer by X in the diagram, which signals the end of the list. The linked
list contains a list of pointer variables. One of them is START, which contains the address
of the first node in the list. We need only this address in START to trace through the list.
If the list contains no nodes it is called null list or empty list and is denoted by the null
pointer in the variable START.
The variable PTR points to the node currently being processed. Thus the assignment PTR = PTR->LINK moves the pointer to the next node
in the list, as pictured in Figure 4-2.
Fig 4-2
Here we have initialized PTR to START. Then we processed INFO[PTR], the information at
the first node. We updated PTR by the assignment PTR = PTR->LINK, so that PTR points to
the second node. Then we processed INFO[PTR], the information at the second node. Again
we updated PTR by the assignment PTR = PTR->LINK, and then processed INFO[PTR],
the information at the third node. And so on. We continued until we reached PTR =
NULL, which signals the end of the list.
A formal presentation of the algorithm is as follows:
Algorithm: (Traversing a Linked List) Let LIST be a linked list in the memory. This
algorithm traverses LIST, applying an operation PROCESS to each element of
LIST. The variable PTR points to the node currently being processed.
1. Set PTR = START. [Initializes pointer PTR.]
2. Repeat Steps 3 and 4 while PTR ≠ NULL.
3. Apply PROCESS to INFO[PTR].
4. Set PTR = PTR->LINK. [PTR now points to the next node.]
[End of Step 2 loop.]
5. Exit.
Example 1
The following procedure prints the information at each node of a linked list. Since the
procedure must traverse the list, it will be very similar to the Algorithm above.
Procedure: PRINT (INFO, LINK, START)
This procedure prints the information at each node of the list.
1.
Set PTR = START
2.
Repeat Steps 3 and 4 while PTR ≠ NULL
3.
Write INFO [PTR]
4.
Set PTR = PTR->LINK [Updates pointer]
We can observe that the procedure traverses the linked list in order to count the number
of elements. Hence the procedure is very similar to the above traversing algorithm. Here,
however, we require an initialization step for the variable NUM before traversing the list.
Procedure: COUNT(INFO, LINK, START, NUM)
This procedure finds the number NUM of elements of a linked list.
1. Set NUM = 0. [Initializes NUM.]
2. Set PTR = START. [Initializes pointer PTR.]
3. Repeat Steps 4 and 5 while PTR ≠ NULL.
4. Set NUM = NUM + 1. [Increases NUM by 1.]
5. Set PTR = PTR->LINK. [Updates pointer.]
[End of Step 3 loop.]
6. Return.
Free pool
Together with the linked lists in memory, a special list is maintained which consists of
unused memory cells. This list, which has its own pointer, is called the list of available
space or the free-storage list or the free pool.
Suppose we implement linked lists by parallel arrays and insertions and deletions are to
be performed on two linked lists. Then the unused memory cells in the arrays will also be
linked together to form a linked list using AVAIL as its list pointer variable. Hence this
free-storage list will also be called the AVAIL list. Such a data structure will frequently be
denoted by writing
LIST (INFO, LINK, START, AVAIL)
In the pointer-type representation of linked lists, the language provides facilities for
returning memory that is no longer in use. In C the function free returns memory that
has been obtained by a call to malloc or calloc.
Note: More examples on memory allocation are given on the web. Please refer to the site
zeelearn.com
Syntax of malloc:
void *malloc(size_t size)
malloc returns a pointer to space for an object of size size, or NULL if the request cannot
be satisfied. The space is not initialized.
Syntax of calloc:
void *calloc(size_t nobj, size_t size)
calloc returns a pointer to space for an array of nobj objects, each of size size, or NULL
if the request cannot be satisfied. The space is initialized to zero bytes.
Syntax of free:
void free(void *p)
free de-allocates the space pointed to by p. It does nothing if p is NULL. Otherwise p must
be a pointer to space previously allocated by calloc or malloc.
Garbage Collection
The operating system of a computer may periodically collect all the deleted space onto the
free-storage list. Any technique which does this collection is called garbage collection.
Garbage collection usually takes place in two steps. First the computer runs through all
lists, tagging those cells which are currently in use, and then runs through the memory,
collecting all untagged space onto the free-storage list. The garbage collection may take
place when there is only some minimum amount of space or no space at all left in the
free-storage list, or when the CPU is idle and has time to do the collection. Generally
speaking, the garbage collection is invisible to the programmer.
Fig. 4-3
We have shown insertion in Fig. 4-3(b). That is, node A now points to the new node N, and
node N points to node B, to which A previously pointed.
Suppose our linked list is maintained in the memory in the form
LIST (INFO, LINK, START, AVAIL)
In the above discussion we did not consider the AVAIL list for the new node N. Let us
consider that the first node in the AVAIL list will be used for the new node N. Thus the
above figure looks like Figure 4-4. Observe that three pointer fields are changed as
follows:
(1) The next-pointer field of node A now points to the new node N, to which AVAIL
previously pointed.
(2) AVAIL now points to the second node in the free pool, to which node N previously
pointed.
(3) The next-pointer field of node N now points to node B, to which node A previously
pointed.
Fig. 4-4
There are two special cases: If the new node N is the first node in the list, then START will
point to N; and if the new node N is the last node in the list, then N will contain the null
pointer.
Insertion Algorithms
Algorithms which insert nodes into linked lists come up in various situations. We will
discuss three of them here.
1) Inserting a node at the beginning of the list
2) Inserting a node after the node with a given location
3) Inserting a node into a sorted list
All our algorithms assume that the linked list is in the memory in the form LIST(INFO,
LINK, START,AVAIL) and the variable ITEM contains new information to be added to the
list.
Since our insertion algorithms will use a node in the AVAIL list, all the algorithms will
include the following steps:
(a) Checking to see if space is available in the AVAIL list. If AVAIL = NULL, then the
algorithm will print the message OVERFLOW.
(b) Removing the first node from the AVAIL list. Using the variable NEW to keep track of
the location of the new node, this step can be implemented by the pair of
assignments (in this order)
NEW = AVAIL, AVAIL = AVAIL->LINK
The schematic diagram of the latter two steps is given in Fig. 4-5.
Fig. 4-5
Fig. 4-6
Algorithm: INSFIRST(INFO, LINK, START, AVAIL, ITEM)
This algorithm inserts ITEM as the first node in the list.
1. [OVERFLOW?] if AVAIL = NULL, then Write OVERFLOW, and Exit.
2. [Remove first node from AVAIL list.] Set NEW = AVAIL and AVAIL = AVAIL->LINK.
3. Set NEW->INFO = ITEM. [Copies new data into new node.]
4. Set NEW->LINK = START and START = NEW. [New node now points to the
original first node, and START points to the new node.]
5. Exit.
The traversing continues as long as ITEM > PTR->INFO; in other words, the traversing
stops as soon as ITEM ≤ PTR->INFO. Then PTR points to node B, so SAVE will contain the
location of the node A.
The formal statement of our procedure is as follows. The cases where the list is empty or
where ITEM < START->INFO, so LOC = NULL, are treated separately, since they do not
involve the variable SAVE.
Procedure: FINDA(INFO, LINK, START, ITEM, LOC)
This procedure finds the location LOC of the last node in the sorted list
whose value is less than ITEM, or sets LOC = NULL.
1. [List empty?] if START = NULL, then Set LOC = NULL, and Return.
2. [Special case?] if ITEM < START->INFO, then Set LOC = NULL, and Return.
3. Set SAVE = START and PTR = START->LINK. [Initializes pointers.]
4. Repeat Steps 5 and 6 while PTR ≠ NULL.
5. if ITEM < PTR->INFO, then Set LOC = SAVE, and Return.
6. Set SAVE = PTR and PTR = PTR->LINK. [Updates pointers.]
[End of Step 4 loop.]
7. Set LOC = SAVE.
8. Return.
Fig. 4-7
Now we have all the components to present an algorithm, which inserts ITEM into a linked
list. The simplicity of the algorithm comes from using the previous two procedures.
Algorithm: INSSRT(INFO, LINK, START, AVAIL, ITEM)
This algorithm inserts ITEM into a sorted linked list.
1. [Use the above procedure to find the location of the node preceding ITEM.]
Call FINDA(INFO, LINK, START, ITEM, LOC).
2. [Use the insertion algorithm for a given location to insert ITEM after LOC.]
Call INSLOC(INFO, LINK, START, AVAIL, LOC, ITEM).
3. Exit.
Fig. 4-8
The above figure does not take into account the fact that, when a node N is deleted from
our list, we will immediately return its memory space to the AVAIL list. Specifically, for
easier processing, it will be returned to the beginning of the AVAIL list. Thus a more exact
schematic diagram of such a deletion is the one in Fig. 4-9.
Fig. 4-9
Observe that three pointer fields are changed as follows:
(1) The next-pointer field of node A now points to node B, where node N previously
pointed.
(2) The next-pointer field of N now points to the original first node in the free pool,
where AVAIL previously pointed.
(3) AVAIL now points to the deleted node N.
There are two special cases: If the deleted node N is the first node in the list, then START
will point to node B; and if the deleted node N is the last node in the list, then node A will
contain the NULL pointer.
START = START->LINK is the statement, which effectively deletes the first node from
the list. This covers the case when N is the first node.
Figure 4-10 is the schematic diagram of the assignment START = START->LINK
Fig. 4-10
Figure 4-11 is the schematic diagram of the assignment LOCP->LINK = LOC->LINK
which effectively deletes node N when N is not the first node.
The simplicity of the algorithm comes from the fact that we are already given the location
LOCP of the node, which precedes node N. In many applications, we must first find LOCP.
Fig. 4-11
SAVE = PTR and PTR = PTR->LINK
We will continue with the traversing as long as PTR->INFO ≠ ITEM; in other words, the
traversing stops as soon as ITEM = PTR->INFO. Then
PTR contains the location LOC of node N and
SAVE contains the location LOCP of the node preceding N
The formal statement of our procedure is as follows: The cases where the list is empty or
where START->INFO = ITEM (i.e., where node N is the first node) are treated separately,
since they do not involve the variable SAVE.
Procedure : FINDB(INFO, LINK, START, ITEM, LOC, LOCP)
This procedure finds the location LOC of the first node N which
contains ITEM and the location LOCP of the node preceding N.
If ITEM does not appear in the list, then the procedure
sets LOC = NULL; and if ITEM appears in the first
node, then it sets LOCP = NULL.
1. [List empty?] if START = NULL, then
Set LOC = NULL and LOCP = NULL, and Return
[End of if Structure.]
2. [ITEM in first node] if START->INFO = ITEM, then
Set LOC = START and LOCP = NULL, and Return.
[End of if Structure,]
3. Set SAVE = START and PTR = START->LINK. [Initializes pointers.]
4. Repeat Steps 5 and 6 while PTR ≠ NULL.
5. if PTR->INFO = ITEM, then
Set LOC = PTR and LOCP = SAVE, and Return.
[End of If structure.]
6. Set SAVE = PTR and PTR = PTR->LINK. [Updates pointers.]
[End of Step 4 loop]
7. Set LOC = NULL. [Search unsuccessful.]
8. Return.
Now we can easily present an algorithm to delete the first node N from a linked list, which
contains a given ITEM of information. The simplicity of the algorithm comes from the fact
that the task of finding the location of N and the location of its preceding node has already
been done in the above procedure.
Algorithm: DELETE(INFO, LINK, START, AVAIL, ITEM)
This algorithm deletes from a linked list the first node N which contains
the given item of information.
1. [Use Procedure above to find the location of N and its preceding
node.]
Call FINDB(INFO, LINK, START, ITEM, LOC, LOCP).
2. if LOC = NULL, then Write: ITEM not in list, and Exit.
3. [Delete node.]
if LOCP = NULL, then
Set START= START->LINK. [Deletes first node.]
else
Set LOCP->LINK = LOC->LINK.
[End of If structure.][Return deleted node to the AVAIL list.]
Set LOC->LINK = AVAIL and AVAIL = LOC.
4. Exit.
(1) A grounded header list is a header list where the last node contains the null pointer.
(The term "grounded" comes from the fact that in many cases the electrical ground
symbol is used to indicate the null pointer.)
(2)
A circular header list is a header list where the last node points back to the header
node.
Figure 4-12 contains schematic diagrams of these header lists. Unless otherwise
stated or implied, our header lists will always be circular. Accordingly, in such a case, the
header node also acts as a sentinel indicating the end of the list.
We can observe that the list pointer START always points to the header node. Accordingly,
LINK [START] = NULL indicates that a grounded header list is empty, and
LINK[START] = START indicates that a circular header list is empty.
Fig. 4-12
Although header lists in the memory may maintain our data, the AVAIL list will always be
maintained as an ordinary linked list.
We frequently use circular header lists instead of ordinary linked lists because many
operations are much easier to state and implement using header lists. This comes from
the following two properties of circular header lists:
(1) The null pointer is not used, and hence all pointers contain valid addresses.
(2) Every (ordinary) node has a predecessor, so the first node may not require a
special case.
We have already seen an algorithm that finds the location LOC of the first node containing
ITEM when LIST is an ordinary linked list. The following is the corresponding algorithm
when LIST is a circular header list.
Algorithm : SRCHHL (INFO, LINK, START, ITEM, LOC)
LIST is a circular header list in memory. This algorithm finds the location
LOC of the node where ITEM first appears in LIST or sets LOC = NULL.
1. Set PTR = START->LINK.
2. Repeat while PTR->INFO ≠ ITEM and PTR ≠ START
Set PTR = PTR->LINK. [PTR now points to the next node.]
[End of loop.]
3. if PTR->INFO = ITEM, then
Set LOC = PTR.
else
Set LOC = NULL.
[End of If structure.]
4. Exit.
The two tests which control the searching loop (Step 2) were not performed at the same
time in the algorithm for an ordinary linked list because, for an ordinary linked list,
PTR->INFO is not defined when PTR = NULL.
Enough with linked lists. Take a break and then continue again.
TWO-WAY LISTS
Each list we have discussed above is called a one-way list, since there is only one way we
can traverse the list.
We now introduce a new list structure, called a two-way list, which can be traversed in
two directions: in the usual forward direction from the beginning of the list to the end, or
in the backward direction from the end of the list to the beginning. Furthermore, given the
location LOC of a node N in the list, you now have immediate access to both the next
node and the preceding node in the list. This means, in particular, that you are able to
delete N from the list without traversing any part of the list.
A two-way list is a linear collection of data elements, called nodes, where each node N is
divided into three parts.
(1) An information field INFO which contains the data of N
(2) A pointer field FORW that contains the location of the next node in the list
(3) A pointer field BACK, which contains the location of the preceding node in the list
The list also requires two more pointer variables: FIRST, which points to the first node in
the list, and LAST, which points to the last node in the list. Figure 4-13 shows such a list.
Observe that the null pointer appears in the FORW field of the last node in the list and
also in the BACK field of the first node in the list.
Fig. 4-13
We can observe that, using the variable FIRST and the pointer field FORW, we can
traverse a two-way list in the forward direction as before. On the other hand, using the
variable LAST and the pointer field BACK, we can also traverse the list in the backward
direction.
Suppose LOCA and LOCB are the locations of nodes A and B, respectively, in a two-way
list. Then the way the pointers FORW and BACK are defined gives us the following:
Pointer property: FORW [LOCA] = LOCB if and only if
BACK [LOCB] = LOCA
In other words, the statement that node B follows node A is equivalent to the statement
that node A precedes node B.
We can maintain two-way lists in memory by means of linear arrays in the same way as
one-way lists except that now we require two pointer arrays, FORW and BACK, instead of
one pointer array LINK. We also require two list pointer variables, FIRST and LAST,
instead of one list pointer variable START. On the other hand, the list AVAIL of available
space in the arrays will still be maintained as a one-way list--using FORW as the pointer
field--since we delete and insert nodes only at the beginning of the AVAIL list.
Fig. 4-14
TREES
So far, we have been concentrating on linear types of data structures: strings, arrays, lists
and queues. Here we define a nonlinear data structure called a tree. This structure is mainly
used to represent data containing a hierarchical relationship between elements, e.g.,
records, family trees and tables of contents.
First we investigate a special kind of tree, called a binary tree, which can be easily
maintained in the computer. Although such a tree may seem to be very restrictive, we will
see later in the chapter that more general trees may be viewed as binary trees.
BINARY TREES
A binary tree T is defined as a finite set of elements, called nodes, such that:
(a) T is empty (called the null tree or empty tree), or
(b) T contains a distinguished node R, called the root of T, and the remaining nodes
of T form an ordered pair of disjoint binary trees T1 and T2.
If T does contain a root R, then the two trees T1 and T2 are called the left and right
subtrees of R respectively. If T1 is nonempty then its root is called the left successor of R.
Similarly, if T2 is nonempty, then its root is called the right successor of R. We frequently
represent a binary tree T by means of a diagram. Specifically, the diagram in figure 4-15
represents a binary tree T as follows
Fig. 4-15
(i) T consists of 11 nodes, represented by the letters A through L, excluding I.
(ii) The root of T is the node A at the top of the diagram.
(iii) A left-downward slanted line from a node N indicates a left successor of N, and a
right-downward slanted line from N indicates a right successor of N.
Observe that
(a) B is a left successor and C is a right successor of the node A.
(b) The left subtree of the root A consists of the nodes B, D, E and F, and the right
subtree of A consists of the nodes C, G, H, J, K and L.
Any node N in a binary tree T has either 0, 1 or 2 successors. The nodes A, B, C and H
have two successors. The nodes E and J have only one successor, and the nodes D, F, G, L
and K have no successors. The nodes with no successors are called terminal nodes.
The above definition of the binary tree T is recursive since T is defined in terms of binary
subtrees T1 and T2. This means, in particular, that every node N of T contains a left and a
right subtree. Moreover, if N is a terminal node then both its left and right subtrees are
empty.
Binary trees T and T' are said to be similar if they have the same structure or, in other
words, if they have the same shape. The trees are said to be copies if they are similar and
if they have the same contents at corresponding nodes.
Food for thought:
Consider these four binary trees. Which is the right option for similar trees?
(a)
(b)
(c)
(d)
Terminology
We frequently use terminology to describe family relationships between the nodes of a
tree T. Specifically, suppose N is a node in T with left successor S1 and right successor S2,
then N is called the parent (or father) of S1 and S2. Analogously, S1 is called the left child
(or son) of N, and S2 is called the right child (or son) of N. Furthermore, S1 and S2 are said
to be siblings (or brothers). Every node N in a binary tree T, except the root, has a
unique parent, called the predecessor of N.
The terms descendant and ancestor have their usual meaning. That is, a node L is called a
descendant of a node N (and N is called an ancestor of L) if there is a succession of
children from N to L. In particular, L is called a left or right descendant of N according to
whether L belongs to the left or right subtree of N.
Terminology from graph theory and horticulture is also used with a binary tree T.
Specifically, the line drawn from a node N of T to a successor is called an edge, and a
sequence of consecutive edges is called a path. A terminal node is called a leaf, and a
path ending in a leaf is called a branch.
Each node in a binary tree has a level number. First, we assign the root R of
the tree T the level number 0; then every other node is assigned a level
number which is 1 more than the level number of its parent. Furthermore, those nodes
with the same level number are said to belong to the same generation.
The depth (or height) of a tree T is the maximum number of nodes in a branch of T. This
turns out to be 1 more than the largest level number of T.
Fig. 4-16
We get the term extended binary tree from the following operation. Consider any binary
tree T, such as the tree in figure 4-16. Then we may convert T into a 2-tree by replacing
each empty subtree by a new node, as pictured in the figure 4-16(b). Observe that the
tree is, indeed, a 2-tree. Furthermore, the nodes in the original tree T are now the
internal nodes in the extended tree, and the new nodes are the external nodes in the
extended tree.
In C, the linked representation of a binary tree declares a node with an information field
and two pointer fields, together with a pointer variable for the root:
struct tree_node
{
int info ;
struct tree_node *left ;
struct tree_node *right ;
} ;
struct tree_node *root ;
Suppose T is a binary tree that is complete or nearly complete. Then there is an efficient
way of maintaining T in memory called the sequential representation of T. This
representation uses only a single linear array TREE as follows:
(a) The root R of T is stored in TREE[1].
(b) If a node N occupies TREE[K], then its left child is stored in TREE[2*K] and its
right child is stored in TREE[2*K+1]
Again, NULL is used to indicate an empty subtree. In particular,
TREE[1] = NULL indicates that the tree is empty.
Figure 4-16(b) is the sequential representation of the binary tree T shown in figure
4-16(a). Observe that we require 14 locations in the array TREE even though T has only 9
nodes. In fact, if we include null entries for the successors of the terminal nodes, then we
would actually require TREE[29] for the right successor of TREE[14]. Generally speaking
the sequential representation of a tree with depth d will require an array with
approximately 2^(d+1) elements. Accordingly, this sequential representation is usually
inefficient unless, as stated above, the binary tree T is complete or nearly complete. For
example, the tree T in Fig. 4-15 has 11 nodes and depth 5, which means it would require
an array with approximately 2^6 = 64 elements.
Fig. 4-16
Preorder: (1) Process the root R.
(2) Traverse the left subtree of R in preorder.
(3) Traverse the right subtree of R in preorder.
Inorder: (1) Traverse the left subtree of R in inorder.
(2) Process the root R.
(3) Traverse the right subtree of R in inorder.
Postorder: (1) Traverse the left subtree of R in postorder.
(2) Traverse the right subtree of R in postorder.
(3) Process the root R.
We can observe that each algorithm contains the same three steps, and that the left
subtree of R is always traversed before the right subtree. The difference between the
algorithms is the time at which the root R is processed. Specifically, in the "pre"
algorithm, the root R is processed before the subtrees are traversed; in the "in" algorithm,
the root R is processed between the traversals of the subtrees; and in the "post"
algorithm, the root R is processed after the subtrees are traversed.
The three algorithms are sometimes called, respectively, the node-left-right (NLR)
traversal, the left-node-right (LNR) traversal and the left-right-node (LRN) traversal.
Observe that each of the above traversal algorithms is recursively defined, since the
algorithm involves traversing subtrees in the given order. Accordingly, we will expect that
a stack be used when the algorithms are implemented on the computer.
Note: More examples on binary tree are given on the web. Please refer to the site
zeelearn.com
Preorder Traversal
The preorder traversal algorithm uses a variable PTR (pointer), which will contain the
location of the node N currently being scanned. This is pictured in this figure 4-17, where
L(N) denotes the left child of node N and R(N) denotes the right child. The algorithm also
uses an array STACK, which will hold the addresses of nodes for future processing.
Fig. 4-17
Algorithm: Initially push NULL onto STACK and then set PTR = ROOT. Then repeat
the following steps until PTR = NULL or, equivalently, while PTR ≠ NULL.
(a) Proceed down the left-most path rooted at PTR, processing each node N on the
path and pushing each right child R(N), if any, onto STACK. The traversing ends
after a node N with no left child L(N) is processed. (Thus PTR is updated using
the assignment PTR = LEFT[PTR], and the traversing stops when LEFT[PTR]
= NULL.)
(b) [Backtracking.] Pop and assign to PTR the top element on STACK.
If PTR ≠ NULL, then return to Step (a); otherwise Exit.
Inorder Traversal
The inorder traversal algorithm also uses a variable pointer PTR, which will contain the
location of the node N currently being scanned, and an array STACK, which will hold the
addresses of nodes for future processing. In fact, with this algorithm, a node is processed
only when it is popped from STACK.
Algorithm: Initially push NULL onto STACK (for a sentinel) and then set PTR = ROOT.
Then repeat the following steps until NULL is popped from STACK.
(a) Proceed down the left-most path rooted at PTR, pushing each node N
onto STACK and stopping when a node N with no left child is pushed onto
STACK.
(b) [Backtracking.] Pop and process the nodes on STACK. If NULL is popped,
then Exit. If a node N with a right child R(N) is processed, set PTR = R(N)
(by assigning PTR = RIGHT[PTR]) and return to Step (a).
We emphasize that a node N is processed only when it is popped from STACK.
Note: Examples on Traversal algorithm are given on the web. Please refer to the site
zeelearn.com
Postorder Traversal
The postorder traversal algorithm is more complicated than the proceeding two
algorithms, because here we may have to save a node N in two different situations. We
distinguish between the two cases by pushing either N or its negative, -N, onto STACK.
(In actual practice, the location of N is pushed onto STACK, so -N has the obvious
meaning.) Again, a variable PTR (pointer) is used which contains the location of the node
N that is currently being scanned, as shown in this figure 4-17.
Algorithm: Initially push NULL into STACK (as a sentinel) and then set PTR = ROOT.
Then repeat the following steps until NULL is popped from STACK.
(a) Proceed down the left-most path rooted at PTR. At each node N of the path,
push N onto STACK and, if N has a right child R (N), push -R(N) onto STACK.
(b) [Backtracking.] Pop and process positive nodes on STACK. If NULL is popped,
then Exit. If a negative node is popped, that is, if PTR = -N for some node N, set
PTR = N (by assigning PTR = -PTR) and return to Step (a).
We emphasize that a node N is processed only when it is popped from STACK and it is
positive.
(a) Sorted linear array. Here you can search for and find an element with a
running time f(n) = O(log2 n), but it is expensive to insert and delete
elements.
(b) Linked list. Here you can easily insert and delete elements, but it is
expensive to search for and find an element, since you must use a linear
search with running time f(n) = O(n).
Although each node in a binary search tree may contain an entire record of data, the
definition of the binary search tree depends on a given field whose values are distinct
and may be ordered.
Suppose T is a binary tree. Then we call T a binary search tree (or binary sorted tree) if
each node N of T has the following property: The value at N is greater than every value in
the left subtree of N and is less than every value in the right subtree of N. (It is not
difficult to see that this property guarantees that the inorder traversal of T will yield a
sorted listing of the elements of T.)
Note: Examples on binary search tree are given on the web. Please refer to the site
zeelearn.com
This figure 4-18 shows the six stages of the tree. We emphasize that if the six numbers
were given in a different order, then the tree might be different and we might have a
different depth.
Fig. 4-18
The formal presentation of our search and insertion algorithm will use the following
procedure, which finds the locations of a given ITEM and its parent. The procedure
traverses down the tree using the pointer PTR and the pointer SAVE for the parent node.
This procedure will also be used in the next section, on deletion.
Observe that, in Step 4, there are three possibilities: (1) the tree is empty, (2) ITEM is
added as a left child and (3) ITEM is added as a right child.
Procedure: FIND(INFO, LEFT, RIGHT, ROOT, ITEM, LOC, PAR)
A binary search tree T is in memory and an ITEM of information is given. This
procedure finds the location LOC of ITEM in T and also the location PAR of the
parent of ITEM.
1. [Tree empty?]
   if ROOT = NULL, then Set LOC = NULL and PAR = NULL, and Return.
2. [ITEM at root?]
   if ITEM = ROOT->INFO, then Set LOC = ROOT and PAR = NULL, and Return.
3. [Initialize pointers PTR and SAVE.]
   if ITEM < ROOT->INFO, then
       Set PTR = ROOT->LEFT and SAVE = ROOT.
   else
       Set PTR = ROOT->RIGHT and SAVE = ROOT.
   [End of If structure.]
4. Repeat Steps 5 and 6 while PTR ≠ NULL:
5. [ITEM found?]
   if ITEM = PTR->INFO, then Set LOC = PTR and PAR = SAVE, and Return.
6. if ITEM < PTR->INFO, then
       Set SAVE = PTR and PTR = PTR->LEFT.
   else
       Set SAVE = PTR and PTR = PTR->RIGHT.
   [End of If structure.]
   [End of Step 4 loop.]
7. [Search unsuccessful.] Set LOC = NULL and PAR = SAVE.
8. Exit.
Notice that, in step 6, we move to the left child or the right child according to whether
ITEM < PTR->INFO or ITEM > PTR->INFO.
Algorithm: INSBST(INFO, LEFT, RIGHT, ROOT, AVAIL, ITEM, LOC)
A binary search tree T is in memory and an ITEM of information is given. This
algorithm finds the location LOC of ITEM in T or adds ITEM as a new node in
T at location LOC.
1. Call FIND(INFO, LEFT, RIGHT, ROOT, ITEM, LOC, PAR).
2. if LOC ≠ NULL, then Exit.
3. [Copy ITEM into new node in AVAIL list.]
   (a) if AVAIL = NULL, then Write OVERFLOW, and Exit.
   (b) Set NEW = AVAIL, AVAIL = AVAIL->LEFT and NEW->INFO = ITEM.
   (c) Set LOC = NEW, NEW->LEFT = NULL and NEW->RIGHT = NULL.
4. [Add ITEM to tree.]
   if PAR = NULL, then
       Set ROOT = NEW
   else if ITEM < PAR->INFO, then
       Set PAR->LEFT = NEW
   else
       Set PAR->RIGHT = NEW
   [End of If structure.]
5. Exit.
After all the elements have been scanned, there will be no duplicates.
Example
Suppose Algorithm A is applied to the following list of 15 numbers:

14, 10, 17, 12, 10, 11, 20, 12, 18, 25, 20, 8, 22, 11, 23

Observe that the first four numbers (14, 10, 17 and 12) are not deleted. However,

A5 = 10 is deleted, since A5 = A2
A8 = 12 is deleted, since A8 = A4
A11 = 20 is deleted, since A11 = A7
A14 = 11 is deleted, since A14 = A6
Example
Consider the complete tree H in this figure 4-19. Observe that H is a heap. This means, in
particular, that the largest element in H appears at the "top" of the heap, that is, at the
root of the tree. This figure 4-19(b) shows the sequential representation of H by the array
TREE. That is, TREE [1] is the root of the tree H, and the left and right children of node
TREE [K] are, respectively, TREE[2K] and TREE [2K + 1]. This means, in particular, that
the parent of any nonroot node TREE[J] is the node TREE [J / 2] (where J / 2 means
integer division). Observe that the nodes of H on the same level appear one after the
other in the array TREE.
Index:   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
TREE:  101   92   97   67   57   97   50   67   37   50   57   66   67   26   39   19   41   31   27   25
(b) Sequential representation
Fig. 4-19
Inserting into a Heap
Suppose H is a heap with N elements, and suppose an ITEM of information is given. We
insert ITEM into the heap H as follows:
(1) First adjoin ITEM at the end of H so that H is still a complete tree, but not necessarily
a heap.
(2) Then let ITEM rise to its "appropriate place" in H so that H is finally a heap.
We will illustrate the way this procedure works before stating the procedure formally.
Example
Consider the heap H in figure 4-20. Suppose we want to add ITEM = 71 to H. First we
adjoin 71 as the next element in the complete tree; that is, we set TREE[21] = 71. Then
71 is the right child of TREE [10] = 50. The path from 71 to the root of H is pictured in
this figure 4-20(a). We now find the appropriate place of 71 in the heap as follows:
(a) Compare 71 with its parent, 50. Since 71 is greater than 50, interchange 71 and
    50; the path will now look like figure 4-20(b).
(b) Compare 71 with its new parent, 57. Since 71 is greater than 57, interchange 71
    and 57; the path will now look like figure 4-20(c).
(c) Compare 71 with its new parent, 92. Since 71 does not exceed 92, ITEM = 71 has
    risen to its appropriate place in H.
Figure 4-20(d) shows the final tree. A dotted line indicates that an exchange has
taken place.
Fig. 4-20
Note: More examples on inserting into heaps are given on the web. Please refer to the
site zeelearn.com
Procedure:
Consider the heap H in figure 4-24(a), where R = 95 is the root and L = 22 is the last
node of the tree. Step 1 of the above procedure replaces R = 95 by L = 22. This gives the
complete tree in figure 4-24(b), which is not a heap. Observe, however, that both the
right and left subtrees of 22 are still heaps. Applying Step 3, we find the appropriate place
of 22 in the heap as follows:
(a) Compare 22 with its two children, 85 and 70. Since 22 is less than the larger
    child, 85, interchange 22 and 85 so the tree now looks like figure 4-24(c).
(b) Compare 22 with its two new children, 55 and 33. Since 22 is less than the
    larger child, 55, interchange 22 and 55 so the tree now looks like figure 4-24(d).
(c) Compare 22 with its new children, 15 and 20. Since 22 is greater than both
    children, node 22 has dropped to its appropriate place in H.
Thus Fig. 4-24(d) is the required heap H without its original root R.
Fig. 4-24
Remark: As with inserting an element into a heap, we must verify that the above
procedure does always yield a heap as a final tree. Again we leave this verification to the
reader. We also note that Step 3 of the procedure may not end until the node L reaches
the bottom of the tree, i.e., until L has no children.
The formal statement of our procedure is as follows.
The Step 4 loop repeats as long as LAST has a right child. Step 8 takes care of the special
case in which LAST does not have a right child but does have a left child (which has to be
the last node in H). The reason for the two if statements in Step 8 is that TREE[LEFT]
may not be defined when LEFT > N.
Application to Sorting
Suppose an array A with N elements is given. The heapsort algorithm to sort A consists of
the two following phases:
Phase A: Build a heap H out of the elements of A.
Phase B: Repeatedly delete the root element of H.
Since the root of H always contains the largest node in H, Phase B deletes the elements of
A in decreasing order. A formal statement of the algorithm is as follows.
Algorithm: HEAPSORT(A, N)
An array A with N elements is given. This algorithm sorts the elements of A.
1. [Build a heap H, using Procedure 7.9.]
Repeat for j=1 to N-1
Complexity of Heapsort
Suppose the heapsort algorithm is applied to an array A with n elements. The algorithm
has two phases, and we analyze the complexity of each phase separately.
Phase A. Suppose H is a heap. Observe that the number of comparisons to find the
appropriate place of a new element ITEM in H cannot exceed the depth of H. Since H is a
complete tree, its depth is bounded by log2 m where m is the number of elements in H.
Accordingly the total number g(n) of comparisons to insert the n elements of A into H is
bounded as follows:
g(n) ≤ n log2 n
Consequently the running time of Phase A of heapsort is proportional to n log2 n.
Phase B. Suppose H is a complete tree with m elements, and suppose the left and right
subtrees of H are heaps and L is the root of H. Observe that reheaping uses 4
comparisons to move the node L one step down the tree H. Since the depth of H does not
exceed log2 m, reheaping uses at most 4 log2 m comparisons to find the appropriate place
of L in the tree. This means that the total number h(n) of comparisons to delete the
elements of A from H, which requires reheaping n times is bounded as follows:
h(n) ≤ 4n log2 n
Accordingly, the running time of Phase B of heapsort is also proportional to n log2n.
Since each phase requires time proportional to n log2 n, the running time to sort the n-element array A using heapsort is proportional to n log2 n, that is, f(n) = O(n log2 n).
Observe that this gives a worst-case complexity of the heapsort algorithm.
Summary
# A linked list is a linear collection of data elements, called nodes, where the linear
  order (or linearity) is given by means of pointers. We divide each node into two
  parts. The first part contains the information of the element, and the second part,
  called the link field or next pointer field, contains the address of the next node in
  the list.
# Languages like C that support dynamic memory allocation with structures and
  pointers use the following technique. The linked list node is defined as a structure.
# If ITEM is actually a key value and we are searching through a file for the record
  containing ITEM, then ITEM can appear only once in LIST.
# The list, which has its own pointer, is called the list of available space, the
  free-storage list or the free pool.
# Sometimes we insert new data into a data structure but there is no available space,
  i.e., the free-storage list is empty. This situation is usually called overflow. The
  term underflow refers to the situation where we want to delete data from a data
  structure that is empty.
# Suppose we have not sorted our linked list and there is no reason to insert a new
  node in any special place in the list. Then the easiest place to insert the node is at
  the beginning of the list.
# A header linked list is a linked list which always contains a special node, called the
  header node, at the beginning of the list.
# We frequently use circular header lists instead of ordinary linked lists because
  many operations are much easier to state and implement using header lists.
# We have introduced a new list structure, called a two-way list, which can be
  traversed in two directions: in the usual forward direction from the beginning of the
  list to the end, or in the backward direction from the end of the list to the
  beginning.
# A binary tree is a nonlinear data structure. This structure is mainly used
  to represent data containing a hierarchical relationship between elements, e.g.
  records, family trees and tables of contents.
# We frequently use terminology describing family relationships to describe
  relationships between the nodes of a tree T. Specifically, suppose N is a node in T
  with left successor S1 and right successor S2. Then N is called the parent (or father)
  of S1 and S2. Analogously, S1 is called the left child (or son) of N, and S2 is called
  the right child (or son) of N. Furthermore, S1 and S2 are said to be siblings (or
  brothers). Every node N in a binary tree T, except the root, has a unique parent,
  called the predecessor of N.
5
STACKS, QUEUES, RECURSION
Main Points Covered
! Introduction
! Stacks
! Array Representation Of Stacks
! Arithmetic Expressions; Polish Notation
! Consider this C-Program on Stack
! Quicksort
! Recursion
! Factorial Function
! Fibonacci Sequence
! Divide-and-Conquer Algorithms
! Towers Of Hanoi
! Queues
! Representation of Queues
Introduction
We have already learned about linear lists and linear arrays, which allow us to
insert and delete elements at any place in the list, i.e. at the beginning, at
the end, or in the middle. There are certain frequent situations in computer
science when we want to restrict insertions and deletions so that they can take
place only at the beginning or the end of the list, not in the middle. Two of the data
structures that are useful in such situations are stacks and queues.
Stack: A stack is a linear structure in which items may be added or removed only
at one end. An example of such a structure is a stack of dishes. Observe that an
item may be added or removed only from the top of any of the stacks. This means,
in particular, that the last item to be added to a stack is the first item to be
removed. Accordingly, stacks are also called last-in first-out (LIFO) lists. Other
names used for stacks are "piles" and "push-down lists." Although the stack may
seem to be a very restricted type of data structure, it has many important
applications in computer science.
Queue: A queue is a linear list in which items may be added only at one end and
items may be removed only at the other end. The name "queue" likely comes from
the everyday use of the term. Observe a queue at the bus stop. Each new person
who comes takes his or her place at the end of the line, and when the bus comes,
the people at the front of the line board first. That is, the person who comes first,
boards first and who comes last, boards last. Thus queues are also called first-in
first-out (FIFO) lists. Another example of a queue is a batch of jobs waiting to be
processed, assuming no job has higher priority than the others.
STACKS : Special terminology is used for two basic operations associated with
stacks.
(a) "Push" is the term used to insert an element into a stack.
(b) "Pop" is the term used to delete an element from a stack.
Remember that these terms are used only with stacks, not with other data
structures.
Suppose the following 6 elements are pushed, in order, onto an empty stack:
A, B, C, D, E, F
Figure 5-1 shows three ways of picturing such a stack. For notational
convenience, we will frequently designate the stack by writing
STACK: A , B, C, D, E, F
The implication is that the right-most element is the top element. We emphasize
that, regardless of the way a stack is described, its underlying property is that
insertions and deletions can occur only at the top of the stack. This means E cannot
be deleted before F is deleted, D cannot be deleted before E and F are deleted, and
so on. Consequently, the elements may be popped from the stack only in the
reverse order of that in which they were pushed onto the stack.
Fig 5-1 (The stack contains A, B, C, D, E, F from bottom to top, with F at the top.)
Postponed Decisions
We use stacks frequently to indicate the order of the processing of data when
certain steps of the processing must be postponed until other conditions are
fulfilled. We have illustrated it below.
Suppose we are processing some project A, and desire to move on to project B,
then we need to complete project B before we return to project A. We place the
folder containing the data of A onto a stack, as pictured in the figure 5-2(a) and
begin to process B. However, suppose that while processing B we are led to project
C, for the same reason. Then we place B on the stack above A, as pictured in figure
5-2(b) and begin to process C. Furthermore, suppose that while processing C we
are likewise led to project D. Then we place C on the stack above B, as pictured in
the figure 5-2(c) and begin to process D.
Fig. 5-2
On the other hand, suppose we are able to complete the processing of project D.
Then the only project we may continue to process is project C, which is on top of
the stack. Hence we remove folder C from the stack, leaving the stack as pictured
in figure 5-2(d) and continue to process C. Similarly, after completing the
processing of C, we remove folder B from the stack, leaving the stack as pictured in
figure 5-2(e) and continue to process B. Finally, after completing the processing of
B, we remove the last folder A, from the stack, leaving the empty stack pictured in
figure 5-2(f) and continue the processing of our original project A.
Notice that, at each stage of the above processing, the stack automatically
maintains the order that is required to complete the processing.
Fig 5-3
Overflow:
The procedure for adding (pushing) an element is called PUSH and removing (pop)
an item is called POP. In executing the procedure PUSH, we have to test whether
there is room in the stack for the new item. If not, then we have the condition
known as overflow.
Underflow:
The case is the same in executing the procedure POP. We must first test whether there
is an element in the stack to be deleted. If not, then we have the condition known
as underflow.
Procedure: PUSH(STACK, TOP, MAXSTK, ITEM)
This procedure pushes ITEM onto a stack.
1. [Stack already filled?] if TOP = MAXSTK, then Write OVERFLOW, and Return.
2. Set TOP = TOP + 1. [Increases TOP by 1.]
3. Set STACK[TOP] = ITEM. [Inserts ITEM in new TOP position.]
4. Return.

Procedure: POP(STACK, TOP, ITEM)
This procedure deletes the top element of STACK and assigns it to the variable ITEM.
1. [Stack has an item to be removed?] if TOP = 0, then Write UNDERFLOW, and Return.
2. Set ITEM = STACK[TOP]. [Assigns top element to ITEM.]
3. Set TOP = TOP - 1. [Decreases TOP by 1.]
4. Return.

It is observed that frequently TOP and MAXSTK are global variables; hence the
procedures may be called using only
PUSH (STACK, ITEM)
and
POP (STACK, ITEM)
respectively. We note that the value of TOP is changed before the insertion in
PUSH, but the value of TOP is changed after the deletion in POP.
Let's see a few examples of the above procedures:
(a) Consider the stack in figure 5-3. We simulate the operation PUSH(STACK, WWW):
    1. Since TOP = 3, control is transferred to Step 2.
    2. TOP = 3 + 1 = 4.
    3. STACK[TOP] = STACK[4] = WWW.
    4. Return.
(b) Consider again the stack in figure 5-3. This time we simulate the operation
    POP(STACK, ITEM):
    1. Since TOP = 3, control is transferred to Step 2.
    2. ITEM = STACK[3] = ZZZ.
    3. TOP = 3 - 1 = 2.
    4. Return.
Observe that STACK [TOP] = STACK [2] = YYY is now the top element in the stack.
Minimizing Overflow
There is an essential difference between underflow and overflow in dealing with
stacks. Underflow depends exclusively on the given algorithm and the given input
data, and hence there is no direct control by the programmer. Overflow, on the
other hand, depends on the arbitrary choice of the programmer for the amount of
memory space reserved for each stack, and this choice does influence the number
of times overflow may occur.
Generally, the number of elements in a stack fluctuates as elements are added to
or removed from a stack. Accordingly, the particular choice of the amount of
memory for a given stack involves a time-space tradeoff. Initially reserving a great
deal of space for each stack will decrease the number of times overflow may occur.
However, this may be an expensive use of space if most of the space is seldom
used. On the other hand, reserving a small amount of space for each stack may
increase the number of times overflow occurs. The time required for resolving an
overflow, such as by adding space to the stack, may be more expensive than the
space saved.
Various techniques have been developed which modify the array representation of
stacks so that the amount of space reserved for more than one stack may be more
efficiently used. Most of these techniques lie beyond the scope of this text. One of
these techniques is shown in fig 5.4.
Suppose we have been given an algorithm, which requires two stacks, A and B. We
can define an array STACKA with n1 elements for stack A and an array STACKB with
n2 elements for stack B. Overflow will occur when either STACKA contains more
than n1 elements or STACKB contains more than n2 elements.
Suppose we define a single array STACK with n = n1 + n2 elements for stacks A and
B together. As pictured in the figure below, we define STACK [1] as the bottom of
stack A and let A grow to the right, and we define STACK [n] as the bottom of
stack B and let B 'grow' to the left. In this case, overflow will occur only when A and
B together have more than n = n1 + n2 elements. This technique will usually
decrease the number of times overflow occurs even though we have not increased
the total amount of space reserved for the two stacks. In using this data structure,
the operations of PUSH and POP need to be modified.
Fig. 5-4
C - D,    E * F,    G / H

This is called infix notation. With this notation, we must distinguish between

(A + B) * C    and    A + (B * C)

In Polish notation, the operator symbol is placed before its two operands; the
expressions above become

-CD,    *EF,    /GH
We translate, step by step, the following infix expressions into Polish notation using
brackets [ ] to indicate a partial translation:

(A + B) * C       = [+AB] * C       = *+ABC
A + (B * C)       = A + [*BC]       = +A*BC
(A + B) / (C - D) = [+AB] / [-CD]   = /+AB-CD
The fundamental property of Polish notation is that the order in which the
operations are to be performed is completely determined by the positions of the
operators and operands in the expression. Accordingly, one never needs
parentheses when writing expressions in Polish notation.
Reverse Polish notation (postfix notation) refers to the analogous notation in
which the operator symbol is placed after its two operands:
AB+
CD-
EF*
GH/
Again, we do not need parentheses to determine the order of the operations in any
arithmetic expression written in reverse Polish notation.
We note that, when Step 5 is executed, there should be only one number on
STACK.

Consider the arithmetic expression: 5 * ( 6 + 2 ) - 12 / 4.
In postfix notation it becomes: 5, 6, 2, +, *, 12, 4, /, -
(Commas are used to separate the elements of P so that 5, 6, 2 is not interpreted
as the number 562.)
      Symbol Scanned    STACK
(1)   5                 5
(2)   6                 5, 6
(3)   2                 5, 6, 2
(4)   +                 5, 8
(5)   *                 40
(6)   12                40, 12
(7)   4                 40, 12, 4
(8)   /                 40, 3
(9)   -                 37
(10)  )                 37
POLISH(Q, P)
Suppose Q is an arithmetic expression written in infix notation.
This algorithm finds the equivalent postfix expression P.
1. Push "(" onto STACK, and add ")" to the end of Q.
2. Scan Q from left to right and repeat Steps 3 to 6 for each
   element of Q until the STACK is empty.
3. if an operand is encountered, add it to P.
4. if a left parenthesis is encountered, push it onto STACK.
5. if an operator (x) is encountered, then
initialize(stack_ptr);
while (!full(stack_ptr) && (item = getchar()) != '\n')
    push(item, stack_ptr);
while (!empty(stack_ptr))
{
    pop(item_ptr, stack_ptr);
    putchar(*item_ptr);
}
putchar('\n');
Our rule is one step at a time. Let us now discuss the prototypes of all these
functions. Let MAXSTACK be a symbolic constant giving the maximum size allowed
for stacks and item_type be a type describing the data that will be put into the
stack.
So, first few lines of your program will be like this.
#define MAXSTACK 10
typedef char item_type;
typedef struct struct_tag {
int top;
item_type entry[MAXSTACK];
} stack_type;
Boolean_type empty(stack_type *);
Boolean_type full(stack_type *);
void initialize(stack_type *);
void push(item_type, stack_type *);
void pop(item_type *, stack_type *);
/* Push: push an item onto the stack. */
void push(item_type item, stack_type *stack_ptr)
{
    if (stack_ptr->top >= MAXSTACK)
        Error("Stack is full");
    else
        stack_ptr->entry[stack_ptr->top++] = item;
}

/* Pop: pop an item from the stack. */
void pop(item_type *item_ptr, stack_type *stack_ptr)
{
    if (stack_ptr->top <= 0)
        Error("Stack is empty");
    else
        *item_ptr = stack_ptr->entry[--stack_ptr->top];
}
Error is a function that receives a pointer to a character string, prints the string,
and then invokes the standard function exit to terminate the execution of the
program.
/* Error : print error message and terminate the program.*/
void Error (char *s)
{
fprintf (stderr, "%s\n", s);
exit (1);
}
Suppose A is the following list of 12 numbers:

(49), 37, 13, 57, 81, 93, 43, 64, 102, 25, 91, (69)
The reduction step of the quicksort algorithm finds the final position of one of the
numbers i.e. 49. Then we consider the last number, 69, scanning the list from right
to left, comparing each number with 49 and stopping at the first number less than
49. The number is 25. Interchange 49 and 25 to obtain the list.
(25), 37, 13, 57, 81, 93, 43, 64, 102, (49), 91, 69
(Observe that the numbers 91 and 69 to the right of 49 are each greater than 49.)
Beginning with 25, next scan the list in the opposite direction, from left to right,
comparing each number with 49 and stopping at the first number greater than 49.
The number is 57. Interchange 49 and 57 to obtain the list.
25, 37, 13, (49), 81, 93, 43, 64, 102, (57), 91, 69
(Observe that the numbers 25, 37 and 13 to the left of 49 are each less than 49).
Beginning this time with 57, now scan the list in the original direction, from right to
left, until you meet the first number less than 49. It is 43. Interchange 49 and 43
to obtain the list.
25, 37, 13, (43), 81, 93, 49, 64, 102, 57, 91, 69
(Again, the numbers to the right of 49 are each greater than 49.) Beginning with
43, scan the list from left to right. The first number greater than 49 is 81.
Interchange 49 and 81 to obtain the list.
25, 37, 13, 43, (49), 93, (81), 64, 102, 57, 91, 69
(Again, the numbers to the left of 49 are each less than 49.) Beginning with 81,
scan the list from right to left seeking a number less than 49. We do not meet such
a number before meeting 49. This means all numbers have been scanned and
compared with 49. Furthermore, all numbers less than 49 now form the sub-list of
numbers to the left of 49, and all numbers greater than 49 now form the sub-list of
numbers to the right of 49, as shown below:
25, 37, 13, 43,    (49),    93, (81), 64, 102, 57, 91, 69
First sub-list              Second sub-list
Thus 49 is correctly placed in its final position, and the task of sorting the original
list A has now been reduced to the task of sorting each of the above sub-lists.
We have to repeat the above reduction step with each sub-list containing 2 or more
elements. Since we can process only one sub-list at a time, we must be able to
keep track of some sub-lists for future processing. We can accomplish this by using
two stacks, called LOWER and UPPER, to temporarily "hold" such sub-lists. That is,
the addresses of the first and last elements of each sub-list, called its boundary
values, are pushed onto the stacks LOWER and UPPER, respectively. The reduction
step is applied to a sub-list only after its boundary values are removed from the
stacks. The following example illustrates the way the stacks LOWER and UPPER are
used.
Procedure:
swap(low,pivotloc,lp);
return pivotloc;
}
Read this function very carefully. We are leaving it now. We will discuss more about
it in our program section.
Complexity of the Quicksort Algorithm
We can measure the running time of a sorting algorithm by the number f(n) of
comparisons required to sort n elements. Generally speaking, the algorithm has a
worst-case running time of order n²/2, but an average-case running time of order
n log n. The reason for this is indicated below.
The worst case occurs when the list is already sorted. Then the first element requires n
comparisons to recognize that it remains in the first position. Furthermore, the
first sublist will be empty, but the second sublist will have n - 1 elements. Accordingly,
the second element requires n - 1 comparisons to recognize that it remains in the second
position. And so on. Consequently, there will be a total of

f(n) = n + (n - 1) + ... + 2 + 1 = n(n + 1)/2 = n²/2 + O(n) = O(n²)

comparisons. Observe that this is equal to the complexity of the bubble sort
algorithm.
The complexity f(n) = O(n log n) of the average case comes from the fact that, on
the average, each reduction step of the algorithm produces two sublists.
(1) Reducing the initial list places one element and produces two sublists.
(2) Reducing the two sublists places two elements and produces four sublists.
(3) Reducing the four sublists places four elements and produces eight sublists.
(4) Reducing the eight sublists places eight elements and produces sixteen sublists.
And so on. Observe that the reduction step in the kth level finds the location of 2^(k-1)
elements; hence there will be approximately log2 n levels of reduction steps.
Furthermore, each level uses at most n comparisons, so f(n) = O(n log n). In fact,
mathematical analysis and empirical evidence have both shown that

f(n) ≈ 1.4 n log n

is the expected number of comparisons for the quicksort algorithm.
RECURSION
Recursion is an important concept in computer science. There are many algorithms
that are best described in terms of recursion. In this section we introduce this
powerful tool.
A recursive function must satisfy the following two properties:
(1) There must be certain arguments, called base values, for which the function
    does not refer to itself.
(2) Each time the function does refer to itself, the argument of the function must
    be closer to a base value.
A recursive function with these two properties is also said to be well-defined.
The following examples should help clarify these ideas.
Factorial Function
We can explain the recursion with the example of factorial function. The product of
the positive integers from 1 to n, inclusive, is called "n factorial" and is usually
denoted by n!:
n! = 1 · 2 · 3 ··· (n - 2)(n - 1)n
It is also convenient to define 0! = 1, so that the function is defined for all
nonnegative integers.
Thus we have

0! = 1              1! = 1                  2! = 1 · 2 = 2
3! = 1 · 2 · 3 = 6  4! = 1 · 2 · 3 · 4 = 24  5! = 1 · 2 · 3 · 4 · 5 = 120
Definition: (Factorial Function)
(a) If n = 0, then n! = 1.
(b) If n > 0, then n! = n · (n - 1)!
Observe that this definition of n! is recursive, since it refers to itself when it uses
(n - 1)!. However, (a) the value of n! is explicitly given when n = 0 (thus 0 is the base
value); and (b) the value of n! for arbitrary n is defined in terms of a smaller value
of n which is closer to the base value 0. Accordingly, the definition is not circular, or
in other words, the procedure is well-defined.
The following are two procedures that calculate n factorial.
Procedure A:
FACTORIAL (FACT, N)
This procedure calculates N! and returns the value in the
variable FACT.
1. if N = 0, then Set FACT = 1, and return.
2. Set Fact = 1, [Initializes FACT for loop.]
3. Repeat for K = 1 to N.
Set FACT = K * FACT.
[End of loop.]
4. Return.
Procedure B:
FACTORIAL (FACT, N)
This procedure calculates N! and returns the value in the variable
FACT.
1. if N = 0, then, Set FACT = 1, and Return.
2. Call FACTORIAL (FACT, N - 1).
3. Set FACT = N * FACT.
4. Return.
We can observe that the first procedure evaluates N!, using an iterative loop
process. The second procedure, on the other hand, is a recursive procedure, since it
contains a call to itself.
Suppose P is a recursive procedure. During the running of an algorithm or a
program, which contains P, we associate a level number with each given execution
of procedure P as follows. The original execution of procedure P is assigned level 1,
and each time procedure P is executed because of a recursive call, its level is 1
more than the level of the execution that has made the recursive call.
The depth of recursion of a recursive procedure P with a given set of arguments
refers to the maximum level number of P during its execution.
Fibonacci Sequence
The celebrated Fibonacci sequence (usually denoted by F0, F1, F2, ...) is as follows:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...

That is, F0 = 0 and F1 = 1, and each succeeding term is the sum of the two
preceding terms. For example, the next two terms of the sequence are
34 + 55 = 89
and 55 + 89 = 144
Divide-and-Conquer Algorithms
We consider a problem P associated with a set S. Suppose A is an algorithm which
partitions S into smaller sets such that the solution of the problem P for S is
reduced to the solution of P for one or more of the smaller sets. Then A is called a
divide-and-conquer algorithm.
We can use the quicksort algorithm to find the location of a single element and to
reduce the problem of sorting the entire set to the problem of sorting smaller sets.
We can use the binary search algorithm to divide the given sorted set into two
halves so that the problem of searching for an item in the entire set is reduced to
the problem of searching for the item in one of the two halves.
We can view a divide-and-conquer algorithm as a recursive procedure. The reason
for this is that the algorithm A may be viewed as calling itself when it is applied to
the smaller sets. The base criteria for these algorithms are usually the one-element
sets. For example, with a sorting algorithm, a one-element set is automatically
sorted; and with a searching algorithm, a one-element set requires only a single
comparison.
Ackermann Function
The Ackermann function is a function with two arguments each of which can be
assigned any nonnegative integer: 0, 1, 2, This function is defined as follows.
Definition: (Ackermann Function)
(a) If m = 0, then A(m, n) = n + 1.
(b) If m ≠ 0 but n = 0, then A(m, n) = A(m − 1, 1).
(c) If m ≠ 0 and n ≠ 0, then A(m, n) = A(m − 1, A(m, n − 1)).
Once more, we have a recursive definition, since the definition refers to itself in
parts (b) and (c). Observe that A (m, n) is explicitly given only when m = 0. The
base criteria are the pairs
(0, 0), (0, 1), (0, 2), (0, 3), …, (0, n), …
Although it is not obvious from the definition, the value of any A (m, n) may
eventually be expressed in terms of the value of the function on one or more of the
base pairs.
The Ackermann function is too complex to evaluate on all but trivial examples. Its importance
comes from its use in mathematical logic. The function is stated here in order to
give another example of a classical recursive function and to show that the
recursion part of a definition may be complicated.
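Small values of the function can nevertheless be computed mechanically; here is a direct transcription of the three-part definition into C (the function name is our own):

```c
#include <assert.h>

/* Parts (a), (b) and (c) of the definition, in order.  The values grow
   explosively with m, so only very small arguments are practical. */
unsigned long ack(unsigned long m, unsigned long n)
{
    if (m == 0)
        return n + 1;                   /* (a) base criteria (0, n) */
    if (n == 0)
        return ack(m - 1, 1);           /* (b) */
    return ack(m - 1, ack(m, n - 1));   /* (c) */
}
```

For instance, A(1, 2) = 4 and A(2, 3) = 9; already A(4, 2) has 19,729 decimal digits.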
TOWERS OF HANOI
In the preceding section we have given some examples of recursive definition and
procedures. In this section we show how recursion may be used as a tool in
developing an algorithm to solve a particular problem. The problem we pick is
known as the Towers of Hanoi problem.
Suppose we have been given three pegs, labeled A, B and C, and suppose on peg A
there are placed a finite number n of disks with decreasing size. This is pictured in
figure 5-5 for the case n = 6. The object of the game is to move the disks from peg
A to peg C using peg B as an auxiliary. The rules of the game are as follows:
(a) We can move only one disk at a time. Specifically, only the top disk on any
peg may be moved to any other peg.
(b) At no time can a larger disk be placed on a smaller disk.
Fig. 5-5
Sometimes we will write X → Y to denote the instruction "Move top disk from peg X
to peg Y," where X and Y may be any of the three pegs.
The solution to the Towers of Hanoi problem for n = 3 appears in fig. 5-6. Observe
that it consists of the following seven moves:
n = 3:
Move top disk from peg A to peg C.
Move top disk from peg A to peg B.
Move top disk from peg C to peg B.
Move top disk from peg A to peg C.
Move top disk from peg B to peg A.
Move top disk from peg B to peg C.
Move top disk from peg A to peg C.
Fig 5-6
In other words, the solution consists of the moves A → C, A → B, C → B, A → C, B → A, B → C, A → C.
For completeness, we also give the solution to the Towers of Hanoi problem for n =
1 and n = 2
n = 1: A → C
n = 2: A → B, A → C, B → C
Note that n = 1 uses only one move and that n = 2 uses three moves.
Rather than finding a separate solution for each n, we use the technique of
recursion to develop a general solution. First we observe that the solution to the
Towers of Hanoi problem for n > 1 disks may be reduced to the following subproblems.
(1) Move the top n - 1 disks from peg A to peg B.
(2) Move the top disk from peg A to peg C: A → C
(3) Move the top n - 1 disks from peg B to peg C.
The reduction is illustrated in figure 5-7 for n = 6. That is, first we move the top
five disks from peg A to peg B, then we move the large disk from peg A to peg C,
and then we move the top five disks from peg B to peg C.
Fig. 5-7
Let us now introduce the general notation
TOWER(N, BEG, AUX, END)
to denote the procedure which moves the top n disks from the initial peg BEG to the
final peg END, using the peg AUX as an auxiliary. When n = 1, we have the following
obvious solution:
TOWER(1, BEG, AUX, END) consists of the single instruction BEG → END
Furthermore, as discussed above, when n > 1, the solution may be reduced to the
solution of the following three sub-problems.
(1) TOWER(N − 1, BEG, END, AUX)
(2) TOWER(1, BEG, AUX, END), i.e., BEG → END
(3) TOWER(N − 1, AUX, BEG, END)
Each of these three sub-problems can be solved either directly or by the same
method using fewer disks. Accordingly, this reduction process does yield a
recursive solution to the Towers of Hanoi problem.
Observe that the recursive solution for n = 4 disks consists of the following 15
moves:
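The three sub-problems above give a complete recursive procedure; a minimal sketch in C (lower-case counterparts of TOWER, BEG, AUX and END; the move counter is our own addition):

```c
#include <stdio.h>
#include <assert.h>

/* tower(n, beg, aux, end): move the top n disks from peg beg to peg end,
   using peg aux as an auxiliary, printing each move.  n disks always
   take 2^n - 1 moves; move_count tallies them. */
static int move_count = 0;

void tower(int n, char beg, char aux, char end)
{
    if (n == 1) {
        printf("%c -> %c\n", beg, end);  /* base case: one move BEG -> END */
        move_count++;
        return;
    }
    tower(n - 1, beg, end, aux);     /* (1) move top n-1 disks BEG -> AUX */
    printf("%c -> %c\n", beg, end);  /* (2) move largest disk  BEG -> END */
    move_count++;
    tower(n - 1, aux, beg, end);     /* (3) move n-1 disks     AUX -> END */
}
```

Calling tower(3, 'A', 'B', 'C') prints exactly the seven moves listed earlier, and tower(4, 'A', 'B', 'C') prints the 15 moves mentioned in the text.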
QUEUES
A queue is a linear list of elements in which deletion can take place only at one end,
called the front, and insertions can take place only at the other end, called the rear.
We use the terms "front" and "rear" to describe a linear list only when it is
implemented as a queue.
Queues are also called first-in first-out (FIFO) lists, since the first element in a
queue will be the first element out of the queue. In other words, the order in which
elements enter a queue is the order in which they leave. This contrasts with stacks,
which are last-in first-out (LIFO) lists.
Queues are important in everyday life. People waiting in line at a bank form a
queue, where the first person in line is the first person to be waited on. The
automobiles waiting to pass through an intersection form a queue, in which the first
car in line is the first car through. An important example of a queue in computer
science occurs in a timesharing system, in which programs with the same priority
form a queue while waiting to be executed.
Example
Figure 5-8(a) shows a schematic diagram of a queue with 4 elements,
where AAA is the front element and DDD is the rear element. You can observe that
the front and rear elements of the queue are also, respectively, the first and last
elements of the list. Suppose we delete an element from the queue. Then it must
be AAA. This yields the queue in figure (b), where we get BBB as the front element.
Next, suppose EEE is added to the queue and then FFF is added to the queue. Then
they must be added at the rear of the queue, as pictured in figure (c). Note that
FFF is now the rear element. Now suppose we delete another element from the
queue; then it must be BBB, to yield the queue in figure (d). And so on. Observe
that in such a data structure, EEE will be deleted before FFF because it has been
placed in the queue before FFF. However, EEE will have to wait until CCC and DDD
are deleted.
Fig. 5-8
Representation of Queues
We can represent queues in the computer in various ways, usually by means of
one-way lists or linear arrays. Unless otherwise stated or implied, each of our
queues will be maintained by a linear array QUEUE and two pointer variables
FRONT, containing the location of the front element of the queue, and REAR,
containing the location of the rear element of the queue. The condition FRONT =
NULL will indicate that the queue is empty.
Figure 5-9 shows the way the queue will be maintained in memory by an array
QUEUE with N elements. This figure also indicates the way elements will be deleted
from the queue and the way new elements will be added to the queue. Observe
that whenever an element is deleted from the queue, the value of FRONT is
increased by 1; this can be implemented by the assignment
FRONT = FRONT + 1
Similarly, whenever an element is added to the queue, the value of REAR is
increased by 1; this can be implemented by the assignment
REAR = REAR + 1
This means that after N insertions, the rear element of the queue will occupy
QUEUE [N] or, in other words, eventually the queue will occupy the last part of the
array. This occurs even though the queue itself may not contain many elements.
Suppose we want to insert an element ITEM into a queue at the time the queue
does occupy the last part of the array, i.e. when REAR = N. One way to do this is to
simply move the entire queue to the beginning of the array, changing FRONT and
REAR accordingly, and then inserting ITEM as above. This procedure may be very
expensive. The procedure we adopt is to assume that the array
Fig. 5-9
QUEUE is circular, that is, QUEUE[1] comes after QUEUE[N] in the array. With this
assumption, we insert ITEM into the queue by assigning ITEM to QUEUE [1].
Specifically, instead of increasing REAR to N + 1, we reset REAR = 1 and then assign
QUEUE [REAR] = ITEM
Similarly, if FRONT = N and an element of QUEUE is deleted, we reset FRONT = 1
instead of increasing FRONT to N + 1.
Suppose that our queue contains only one element, i.e.,
FRONT = REAR ≠ NULL
And suppose that the element is deleted. Then we assign
FRONT = NULL and REAR = NULL
to indicate that the queue is empty.
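The circular insertion and deletion rules just described can be sketched in C (a sketch, not the text's formal procedures; indices run 0..N−1 here, and −1 plays the role of NULL):

```c
#include <assert.h>

#define N 5                    /* capacity of the circular array QUEUE */

int queue[N];
int front = -1, rear = -1;     /* -1 plays the role of FRONT = NULL */

/* Insert item at the rear, treating the array as circular. */
int qinsert(int item)
{
    if (front == (rear + 1) % N && front != -1)
        return 0;              /* overflow: queue is full */
    if (front == -1)
        front = 0;             /* queue was empty */
    rear = (rear + 1) % N;     /* REAR wraps to 0 after slot N-1 */
    queue[rear] = item;
    return 1;
}

/* Delete the front item, returning it through *item. */
int qdelete(int *item)
{
    if (front == -1)
        return 0;              /* underflow: queue is empty */
    *item = queue[front];
    if (front == rear)
        front = rear = -1;     /* the single element was removed */
    else
        front = (front + 1) % N;  /* FRONT wraps to 0 after slot N-1 */
    return 1;
}
```

Note that one array slot is never reached when FRONT = REAR + 1 circularly; distinguishing "full" from "empty" is exactly why the FRONT = NULL convention is needed.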
Summary
# "Pop" is the term used to delete an element from a stack. "Push" is the
term used to insert an element into a stack.
# A queue is a linear list of elements in which deletion can take place only at
one end, called the front, and insertions can take place only at the other end,
called the rear.
6
SORTING AND SEARCHING
TECHNIQUES
MAIN POINTS COVERED
!
Introduction
Sorting
Insertion Sort
Selection Sort
Merging
Merge-sort
Summary
INTRODUCTION
We will discuss the complexity of each algorithm, that is, its running time f(n),
as well as its space requirements.
Sorting and searching apply to a file of records, and here is some standard
terminology of that field. Each record in a file F can contain many fields, but
there may be one particular field whose values uniquely determine the records in
the file. Such a field K is called a primary key, and the values k1, k2 .... in such a
field are called keys or key values. Sorting the file F usually refers to sorting F
with respect to a particular primary key, and searching in F refers to searching for
the record with a given key value.
SORTING
Let A be a list of n elements A1, A2, …, An in memory. Sorting A refers to the
operation of rearranging the contents of A so that they are in increasing order
(numerically or lexicographically), that is, so that
A1 ≤ A2 ≤ A3 ≤ … ≤ An.
Since A has n elements, there are n! ways that the contents can appear in A. Each
sorting algorithm must take care of these n! possibilities.
Example
Suppose an array named DATA contains 8 elements, the numbers 2, 20, …, 100.
Since DATA consists of 8 elements, there are 8! = 40320 ways that these
numbers can appear in DATA.
Complexity of Sorting Algorithms
We measure the complexity of a sorting algorithm by its running time as a
function of the number n of items to be sorted. We note that each sorting
algorithm S will be made up of the following operations, where A1, A2, …, An contain
the items to be sorted and B is an auxiliary location (used for temporary storage):
(a) Comparisons, which test whether Ai < Aj or whether Ai < B
(b) Interchanges, which switch the contents of Ai and Aj or of Ai and B
(c) Assignments, which set B = Ai and then set Aj = B or Aj = Ai
The complexity of three well-known sorting algorithms is as follows:

Algorithm     Worst Case               Average Case
Bubble Sort   n(n−1)/2 = O(n²)         n(n−1)/2 = O(n²)
Quicksort     n(n+3)/2 = O(n²)         (1.4)n log n = O(n log n)
Heapsort      3n log n = O(n log n)    3n log n = O(n log n)

Remark - Note first that the bubble sort is a very slow way of sorting; its main
advantage is the simplicity of the algorithm. Observe that the average-case
complexity (n log n) of heapsort is the same as that of quicksort, but its
worst-case complexity (n log n) seems quicker than that of quicksort (n²). However,
empirical evidence seems to indicate that quicksort is superior to heapsort except
on rare occasions.

Suppose a company keeps a file of employee records with the fields
Name, Employee Number, Sex, Salary
Sorting the file with respect to the Name key will yield a different order of the
records than sorting the file with respect to the Employee Number key. The
company may want to sort the file according to the Salary field even though the
field may not uniquely determine the employees. Sorting the file with respect to
the Sex key will likely be useless; it simply separates the employees into two
subfiles, one with the male employees and one with the female employees.
Sorting a file F by reordering the records in memory may be very expensive when
the records are very long. Moreover, the records may be in secondary memory,
where it is even more time-consuming to move records into different locations.
Accordingly, we may prefer to form an auxiliary array POINT containing pointers
to the records in memory and then sort the array POINT with respect to a field
KEY rather than sorting the records themselves. That is, we sort POINT so that
KEY[POINT[1]] ≤ KEY[POINT[2]] ≤ … ≤ KEY[POINT[N]]
Note that choosing a different field KEY will yield a different order of the array
POINT.
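The POINT-array idea can be sketched in C; the record fields and the names and salaries below are hypothetical, invented purely to illustrate sorting pointers instead of records:

```c
#include <assert.h>

/* Hypothetical employee record: only the fields needed for the sketch. */
struct Record { const char *name; int salary; };

/* Sort the index array point[] so that
   rec[point[0]].salary <= rec[point[1]].salary <= ...
   The records themselves are never moved. */
void sort_pointers(const struct Record rec[], int point[], int n)
{
    for (int k = 1; k < n; k++) {      /* simple insertion sort on point[] */
        int p = point[k], j = k - 1;
        while (j >= 0 && rec[point[j]].salary > rec[p].salary) {
            point[j + 1] = point[j];   /* shift an index, not a record */
            j--;
        }
        point[j + 1] = p;
    }
}
```

Choosing a different key field (say, comparing names instead of salaries) would yield a different order of the array point[] while leaving the records untouched, exactly as the text observes.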
INSERTION SORT
Suppose an array A with n elements A[1], A[2], …, A[N] is in memory. The insertion
sort algorithm scans A from A[1] to A[N], inserting each element A[K] into its
proper position in the previously sorted subarray A[1], A[2], …, A[K−1]. That is:
Pass 1. A[1] by itself is trivially sorted.
Pass 2. A[2] is inserted either before or after A[1] so that A[1], A[2] is sorted.
Pass 3. A[3] is inserted into its proper place in A[1], A[2], that is, before A[1],
between A[1] and A[2], or after A[2], so that A[1], A[2], A[3] is sorted.
Pass 4. A[4] is inserted into its proper place in A[1], A[2], A[3] so that
A[1], A[2], A[3], A[4] is sorted.
…
Pass N. A[N] is inserted into its proper place in A[1], A[2], …, A[N−1] so that
A[1], A[2], …, A[N] is sorted.
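The passes above can be sketched directly in C (0-based indices; the variable names echo the K and PTR of the surrounding discussion):

```c
#include <assert.h>

/* Insertion sort: on pass k, a[k] is inserted into its proper place
   in the already-sorted prefix a[0 .. k-1]. */
void insertion_sort(int a[], int n)
{
    for (int k = 1; k < n; k++) {          /* outer loop: index k */
        int item = a[k];
        int ptr = k - 1;                   /* inner loop: pointer ptr */
        while (ptr >= 0 && a[ptr] > item) {
            a[ptr + 1] = a[ptr];           /* shift larger elements right */
            ptr--;
        }
        a[ptr + 1] = item;                 /* drop item into its place */
    }
}
```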
Observe that there is an inner loop, which is essentially controlled by the variable
PTR, and there is an outer loop, which uses K as an index.
Worst Case: n(n−1)/2 = O(n²)
Average Case: n(n−1)/4 = O(n²)
Time may be saved by performing a binary search, rather than a linear search, to
find the location in which to insert A[K] in the subarray A[1], A[2], …, A[K−1].
This requires, on the average, log K comparisons rather than (K−1)/2 comparisons.
However, one still needs to move (K−1)/2 elements forward. Thus the order of
complexity is not changed. Furthermore, insertion sort is usually used only when
n is small, and in such a case, the linear search is about as efficient as the binary
search.
SELECTION SORT
Suppose an array A with n elements A[1], A[2], …, A[N] is in memory. The selection
sort algorithm for sorting A works as follows: first find the smallest element in
the list and put it in the first position; then find the second smallest element
and put it in the second position; and so on.
Pass 1. Find the location LOC of the smallest in the list of n elements
A[1], A[2], …, A[N], and then interchange A[LOC] and A[1]. Then A[1] is sorted.
Pass 2. Find the location LOC of the smallest in the sublist of n−1 elements
A[2], A[3], …, A[N], and then interchange A[LOC] and A[2]. Then
A[1], A[2] is sorted, since A[1] ≤ A[2].
Pass 3. Find the location LOC of the smallest in the sublist of n−2 elements
A[3], A[4], …, A[N], and then interchange A[LOC] and A[3]. Then
A[1], A[2], A[3] is sorted, since A[2] ≤ A[3].
…
Pass N−1. Find the location LOC of the smaller of the elements A[N−1], A[N],
and then interchange A[LOC] and A[N−1]. Then
A[1], A[2], …, A[N] is sorted, since A[N−1] ≤ A[N].
After comparing MIN with the last element A[N], MIN will contain the smallest
among the elements A[K], A[K+1], …, A[N], and LOC will contain its location.
The above process will be stated separately as a procedure.
Procedure: MIN(A, K, N, LOC)
An array A is in memory. This procedure finds the location LOC of the
smallest element among A[K], A[K+1], …, A[N].
(1) Set MIN = A[K] and LOC = K.
(2) Repeat for J = K+1, K+2, …, N:
        If MIN > A[J], then set MIN = A[J] and LOC = J.
    [End of loop.]
(3) Return.
Worst Case: n(n−1)/2 = O(n²)
Average Case: n(n−1)/4 = O(n²)
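Both the MIN procedure and the selection sort passes can be sketched in C (0-based indices; function names are our own):

```c
#include <assert.h>

/* min_loc: the location of the smallest element among a[k .. n-1];
   a C counterpart of the MIN procedure described above. */
int min_loc(const int a[], int k, int n)
{
    int loc = k;
    for (int j = k + 1; j < n; j++)
        if (a[j] < a[loc])
            loc = j;          /* remember the smallest seen so far */
    return loc;
}

/* Selection sort: pass k finds the smallest of a[k .. n-1] and
   interchanges it with a[k]. */
void selection_sort(int a[], int n)
{
    for (int k = 0; k < n - 1; k++) {
        int loc = min_loc(a, k, n);
        int tmp = a[k];       /* interchange a[loc] and a[k] */
        a[k] = a[loc];
        a[loc] = tmp;
    }
}
```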
MERGING
Suppose A is a sorted list with r elements and B is a sorted list with s elements.
The operation that combines the elements of A and B into a single sorted list C
with n = r + s elements is called merging. One simple way to merge is to place
the elements of B after the elements of A and then use some sorting algorithm on
the entire list, but this does not take advantage of the fact that A and B are
individually sorted. A more efficient algorithm is given below. First, however,
we indicate the general idea of the algorithm by means of two examples.
Suppose NA and NB are pointers into A and B, respectively, and PTR is a pointer
into C. At each step we compare A[NA] and B[NB]
and assign the smaller element to C[PTR]. Then we increment PTR by setting
PTR = PTR + 1, and we either increment NA by setting NA = NA + 1 or increment NB
by setting NB = NB + 1, according to whether the new element in C has come from
A or from B. Furthermore, if NA > r, then the remaining elements of B are
assigned to C; or if NB > s, then the remaining elements of A are assigned to C.
The formal statement of the algorithm is as follows:
Algorithm: MERGING(A, R, B, S, C)
Let A and B be sorted arrays with R and S elements, respectively. This
algorithm merges A and B into an array C with N = R + S elements.
1. Set NA = 1, NB = 1 and PTR = 1.
2. Repeat while NA ≤ R and NB ≤ S:
       If A[NA] ≤ B[NB], then set C[PTR] = A[NA] and NA = NA + 1;
       else set C[PTR] = B[NB] and NB = NB + 1.
       Set PTR = PTR + 1.
   [End of loop.]
3. If NA > R, then copy the remaining elements B[NB], …, B[S] into C;
   otherwise copy the remaining elements A[NA], …, A[R] into C.
4. Return.
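The merging procedure can be sketched in C (0-based indices; the names na, nb and ptr echo NA, NB and PTR):

```c
#include <assert.h>

/* Merge sorted a[0 .. r-1] and sorted b[0 .. s-1] into c[0 .. r+s-1]. */
void merge(const int a[], int r, const int b[], int s, int c[])
{
    int na = 0, nb = 0, ptr = 0;
    while (na < r && nb < s)            /* compare the two front elements */
        c[ptr++] = (a[na] <= b[nb]) ? a[na++] : b[nb++];
    while (na < r)                      /* leftover elements of a */
        c[ptr++] = a[na++];
    while (nb < s)                      /* leftover elements of b */
        c[ptr++] = b[nb++];
}
```

Each element of C is produced by at most one comparison, so merging r + s = n elements takes at most n comparisons.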
(1) Reducing the target set: Suppose after the first search we find that A[1]
is to be inserted after B[16]. Then we need only use a binary search on
B[17], …, B[100] to find the proper location to insert A[2], and so on.
(2) Tabbing: The expected location for inserting A[1] in B is near B[20] (that
is, B[s/r]), not near B[50]. Hence we first use a linear search on B[20],
B[40], B[60], B[80] and B[100] to find B[K] such that A[1] ≤ B[K], and then we
use a binary search on B[K−20], B[K−19], …, B[K]. (This is analogous to using
the tabs in a dictionary, which indicate the location of all the words with the
same first letter.)
MERGE-SORT
Suppose an array A with n elements A[1], A[2], …, A[N] is in memory. The
merge-sort algorithm which sorts A will first be described by means of a specific
example.
Suppose the array A contains 14 elements as follows:
72, 37, 43, 25, 59, 91, 64, 13, 84, 22, 54, 47, 80, 33
Each pass of the merge-sort algorithm will start at the beginning of the array A
and merge pairs of sorted subarrays as follows:
Pass 1. Merge each pair of elements to obtain the following list of sorted pairs:
37, 72    25, 43    59, 91    13, 64    22, 84    47, 54    33, 80
Pass 2. Merge each pair of pairs to obtain the following list of sorted quadruplets
25, 37, 43, 72    13, 59, 64, 91    22, 47, 54, 84    33, 80
Pass 3. Merge each pair of sorted quadruplets to obtain the following two sorted
subarrays:
13, 25, 37, 43, 59, 64, 72, 91    and    22, 33, 47, 54, 80, 84
Pass 4. Merge the two-sorted subarrays to obtain the single sorted array
13, 22, 25, 33, 37, 43, 47, 54, 59, 64, 72, 80, 84, 91
The original array A is now sorted.
Description
The above merge-sort algorithm for sorting an array A has the following important
property: after Pass K, the array A will be partitioned into sorted subarrays where
each subarray, except possibly the last, will contain exactly L = 2^K elements.
Hence the algorithm requires at most log n passes to sort an n-element array A.
We will translate the above informal description of merge-sort into a formal
algorithm, which will be divided into two parts. The first part will be a procedure
MERGEPASS, which uses the procedure discussed above to execute a single pass
of the algorithm, and the second part will repeatedly apply MERGEPASS until A is
sorted.
We apply the MERGEPASS procedure to an n-element array A, which consists of a
sequence of sorted subarrays. Moreover, each subarray consists of L elements
except that the last subarray may have fewer than L elements. Dividing N by 2*L,
we obtain the quotient Q, which tells us the number of pairs of L-element sorted
subarrays; that is,
Q = INT(N/(2*L))
(We use INT(X) to denote the integer value of X.) Setting S = 2*L*Q, we get the
total number S of elements in the Q pairs of subarrays. Hence R = N − S denotes
the number of remaining elements. The procedure first merges the initial Q
pairs of L-element subarrays. Then the procedure takes care of the case where
there is an odd number of subarrays (when R ≤ L) or where the last subarray has
fewer than L elements.
Procedure: MERGEPASS(A,N, L, B)
The N-element array A is composed of sorted subarrays where each
subarray has L elements except possibly the last subarray, which
may have fewer than L elements. The procedure merges the pairs of
subarrays of A and assigns them to the array B.
1. Set Q = INT(N/(2*L)), S = 2*L*Q and R = N − S.
2. [Use the procedure MERGE to merge the Q pairs of subarrays.]
   Repeat for J = 1, 2, …, Q:
   (a) Set LB = 1 + (2*J − 2)*L. [Finds the lower bound of the first subarray.]
   (b) Call MERGE(A, L, LB, A, L, LB + L, B, LB).
   [End of loop.]
3. [Only one subarray left?]
   If R ≤ L, then:
       Repeat for J = 1, 2, …, R: Set B[S + J] = A[S + J].
       [End of loop.]
   Else:
       Call MERGE(A, L, S + 1, A, R, L + S + 1, B, S + 1).
   [End of If structure.]
4. Return.
Algorithm:
MERGESORT(A,N)
This algorithm sorts the N-element array A using an auxiliary array
B.
1. Set L = 1. [Initialize the number of elements in the subarray]
2. Repeat Steps 3 to 5 while L < N:
3.     Call MERGEPASS(A, N, L, B).
4.     Call MERGEPASS(B, N, 2*L, A).
5.     Set L = 4*L.
   [End of Step 2 loop.]
6. Exit.
Since we want the sorted array to finally appear in the original array A, we must
execute the procedure MERGEPASS an even number of times.
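MERGEPASS and MERGESORT can be sketched in C; this is a bottom-up version with our own helper names, 0-based indices, and the simplifying assumption that n ≤ 64 so the auxiliary array can be fixed-size:

```c
#include <assert.h>

/* Merge the sorted runs src[lb .. lb+l1-1] and src[lb+l1 .. lb+l1+l2-1]
   into dst, starting at position lb. */
static void merge_runs(const int src[], int lb, int l1, int l2, int dst[])
{
    int i = lb, j = lb + l1, k = lb;
    int end1 = lb + l1, end2 = lb + l1 + l2;
    while (i < end1 && j < end2)
        dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
    while (i < end1) dst[k++] = src[i++];
    while (j < end2) dst[k++] = src[j++];
}

/* MERGEPASS: merge consecutive pairs of length-l runs from a into b;
   a lone leftover run (unpaired, possibly short) is simply copied. */
static void mergepass(const int a[], int n, int l, int b[])
{
    int lb = 0;
    while (lb + l < n) {               /* a pair exists; 2nd run may be short */
        int l2 = (n - lb - l < l) ? n - lb - l : l;
        merge_runs(a, lb, l, l2, b);
        lb += 2 * l;
    }
    while (lb < n) { b[lb] = a[lb]; lb++; }
}

/* MERGESORT: two MERGEPASS calls per iteration (a -> b, then b -> a),
   quadrupling l, so the result lands back in a after an even number
   of passes, as the text requires. */
void mergesort_arr(int a[], int n)
{
    int b[64];                         /* auxiliary array; assumes n <= 64 */
    for (int l = 1; l < n; l *= 4) {
        mergepass(a, n, l, b);
        mergepass(b, n, 2 * l, a);
    }
}
```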
Complexity of the Merge-Sort Algorithm
We use f(n) to denote the number of comparisons needed to sort an n-element
array A using the merge-sort algorithm. Recall that the algorithm requires at
most log n passes. Moreover, each pass merges a total of n elements, and by the
discussion on the complexity of merging, each pass will require at most n
comparisons. Accordingly, for both the worst case and the average case,
f(n) ≤ n log n
Observe that this algorithm has the same order as heapsort and the same
average order as quicksort. The main drawback of merge-sort is that it requires
an auxiliary array with n elements, whereas each of the other sorting algorithms
we have studied requires only a finite number of extra locations, which is
independent of n.
The above results are summarized in the following table:
Algorithm      Worst Case              Average Case            Extra Space
Merge-Sort     n log n = O(n log n)    n log n = O(n log n)    O(n)
Summary
# Sorting and Searching are important operations in computer science.
Sorting refers to the operation of arranging data in some given order, such
as increasing or decreasing, with numerical data, or alphabetically, with
character data. Searching refers to the operation of finding the location of a
given item in a collection of items.
# We measure the complexity of a sorting algorithm by its running time as a
function of the number n of items to be sorted.
# We learned about insertion sort, selection sort and merge-sort.
7
HASHING TECHNIQUES
MAIN POINTS COVERED
Hashing
Hash Functions
Collision Resolution
Open Addressing
(Linear probing and modification)
Chaining
Summary
HASHING
Example
Suppose a company with n employees assigns an employee number to each
employee. We can, in fact, use the employee number as the address of the
record in memory. The search will then require no comparisons at all, but a lot
of space will be wasted.
The idea of using the key to determine the address of a record is an excellent
idea, but it must be modified so that a great deal of space is not wasted. This
modification takes the form of a function H from the set K of keys into the
set L of memory addresses. Such a function,
H: K → L
is called a hash function or hashing function. Unfortunately, such a
function H may not yield distinct values: it is possible that two different keys
k1 and k2 will yield the same hash address. This situation is called a collision,
and some method must be used to resolve it.
Accordingly the topic of hashing is divided into two parts:
(1) Hash function and
(2) Collision resolutions
We will discuss each of these two parts separately.
Hash Functions
The two principal criteria in selecting a hash function H: K → L are as follows.
First of all, the function H should be very easy and quick to compute. Second,
the function H should, as far as possible, uniformly distribute the hash
addresses throughout the set L so that there are a minimum number of collisions.
Division method. Choose a number m larger than the number n of keys in K. The
hash function H is defined by
H(k) = k (mod m)    or    H(k) = k (mod m) + 1
Here k (mod m) denotes the remainder when k is divided by m; the second formula
is used when we want the hash addresses to range from 1 to m rather than from
0 to m − 1.
Example
We can consider the company in the previous example, each of whose 68
employees is assigned a unique 4-digit employee number. Suppose L
consists of 100 two-digit addresses: 00, 01, 02, …, 99. We apply the above
hash functions to each of the following employee numbers:
3205, 7148, 2345
(a) Division method. Choose a prime number m close to 99, such as m = 97. Then
H(3205) = 4,    H(7148) = 67,    H(2345) = 17
That is, dividing 3205 by 97 gives a remainder of 4, dividing 7148 by 97
gives a remainder of 67, and dividing 2345 by 97 gives a remainder of 17. In
the case that the memory addresses begin with 01 rather than 00, we
choose the function H(k) = k (mod m) + 1 to obtain:
H(3205) = 4 + 1 = 5,    H(7148) = 67 + 1 = 68,    H(2345) = 17 + 1 = 18

(b) Midsquare method. The key k is squared, and the hash address is obtained by
deleting digits from both ends of k²; the same positions of k² must be used for
all of the keys:

k:      3205         7148         2345
k²:     10 272 025   51 093 904   5 499 025
H(k):   72           93           99

Observe that the fourth and fifth digits, counting from the right, are chosen
for the hash address.
(c) Folding method. The key k is partitioned into a number of parts, and the
parts are added together, ignoring the final carry, to obtain the hash address.
Chopping the key k into two parts and adding yields the following hash addresses:
H(3205) = 32 + 05 = 37,    H(7148) = 71 + 48 = 19,    H(2345) = 23 + 45 = 68
(Observe that the leading digit of the sum 119 is ignored in computing H(7148).)
Alternatively, one may reverse the digits of the second part before adding; this
yields, for example, H(2345) = 23 + 54 = 77.
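The three hash functions of the example can be sketched in C for 4-digit keys (function names are our own):

```c
#include <assert.h>

/* Division method: m is typically a prime close to the table size. */
unsigned h_div(unsigned k, unsigned m) { return k % m; }

/* Folding method for 4-digit keys: chop k into two 2-digit parts and
   add them, keeping only the last two digits (final carry ignored). */
unsigned h_fold(unsigned k) { return (k / 100 + k % 100) % 100; }

/* Midsquare method for 4-digit keys: square k and extract the fourth
   and fifth digits, counting from the right, of k*k. */
unsigned h_mid(unsigned k) { return (k * k / 1000) % 100; }
```

For instance, h_div(3205, 97) gives 4, h_mid(3205) gives 72, and h_fold(3205) gives 37, matching the worked values above.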
Collision Resolution
Suppose we want to add a new record R with key k to our file F, and the
memory location address H(k) is already occupied. This situation is called
collision. In this subsection we discuss two general ways of resolving
collisions. The particular procedure chosen depends on many
factors. One important factor is the ratio of the number n of keys in K (which
is the number of records in F) to the number m of hash addresses in L. This
ratio, λ = n/m, is called the load factor.
First we show that collisions are almost impossible to avoid. Specifically,
suppose a student class has 24 students and suppose the table has space for
365 records. One random hash function is to choose a student's birthday
as the hash address. Although the load factor λ = 24/365 ≈ 7% is very
small, it can be shown that there is a better than fifty-fifty chance that two of
the students have the same birthday.
The average numbers of probes for a successful search, S(λ), and for an
unsuccessful search, U(λ), using open addressing with linear probing are known
to be the following approximate values:

S(λ) ≈ (1/2)(1 + 1/(1 − λ))    and    U(λ) ≈ (1/2)(1 + 1/(1 − λ)²)
Chaining
Chaining involves maintaining two tables in memory. First of all, as we used
before, there is a table T in memory, which contains the records in F, except
that T now has an additional field LINK that is used so that all records in T
with the same hash address h may be linked together to form a linked list.
Second, there is a hash address table LIST that contains pointers to the
linked lists in T.
Suppose a new record R with key k is added to the file F. We place R in the
first available location in the table T and then add R to the linked list with
pointer LIST[H(k)]. If the linked lists of records are not sorted, then R is
simply inserted at the beginning of its linked list. Searching for a record or
deleting a record is nothing more than searching for a node or deleting a
node from a linked list.
The average numbers of probes, using chaining, for a successful search and
for an unsuccessful search are known to be the following approximate values:

S(λ) ≈ 1 + λ/2    and    U(λ) ≈ e^(−λ) + λ
Here the load factor λ = n/m may be greater than 1, since the number m of
hash addresses in L (not the number of locations in T) may be less than the
number n of records in F.
Example
Let's consider again the data in the previous example, where the 8 records
have the following hash addresses:
Record:   A    B    C    D    E    X    Y    Z
H(k):     4    8    2    11   4    11   5    1
Using chaining, the records will appear in memory as pictured in figure 7-1.
Observe that the location of a record R in table T is not related to its hash
address: a record is simply put in the first available node in the AVAIL list of
table T. In fact, table T need not have the same number of elements as the hash
address table.
Fig. 7-1
The main disadvantage of chaining is that we need 3m memory cells for the
data. Specifically there are m cells for the information field INFO, there are m
cells for the link field LINK, and there are m cells for the pointer array LIST.
Suppose each record requires only 1 word for its information field. Then it
may be more useful to use open addressing with a table with 3m locations,
which has load factor λ ≤ 1/3, than to use chaining to resolve collisions.
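The chaining scheme (table T with INFO and LINK fields plus the pointer table LIST) can be sketched in C as follows; the table sizes and the division-method hash are our own choices for the sketch:

```c
#include <assert.h>

#define M 13        /* number of hash addresses (size of LIST) */
#define TSIZE 32    /* capacity of the record table T */

/* Table T as parallel arrays INFO and LINK, plus the hash address
   table LIST of pointers (indices); -1 plays the role of NULL. */
int info[TSIZE];
int link_[TSIZE];   /* named link_ to avoid clashing with POSIX link() */
int list[M];
int avail = 0;      /* next free location in T */

void init_table(void)
{
    for (int i = 0; i < M; i++)
        list[i] = -1;
    avail = 0;
}

int hash(int k) { return k % M; }   /* division method */

/* Insert key k at the head of its (unsorted) chain. */
int insert(int k)
{
    if (avail >= TSIZE)
        return -1;              /* table T is full */
    int h = hash(k);
    info[avail] = k;
    link_[avail] = list[h];     /* old chain head becomes successor */
    list[h] = avail;            /* new record heads its chain */
    return avail++;
}

/* Search: walk only the one chain that k hashes to. */
int search(int k)
{
    for (int p = list[hash(k)]; p != -1; p = link_[p])
        if (info[p] == k)
            return p;           /* location of the record in T */
    return -1;                  /* not found */
}
```

Note that a record's position in T depends only on insertion order (the next free slot), not on its hash address, exactly as observed for figure 7-1.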
Summary
$ The terminology, which we will be using in our presentation of
hashing, will be oriented towards file management.
$ Suppose we want to add a new record R with key k to our file F, but
suppose the memory location address H(k) is already occupied. This
situation is called collision.
$ Chaining involves maintaining two tables in memory. First of all, as we
used before, there is a table T in memory, which contains the records
in F, except that T now has an additional field LINK that is used so that
all records in T with the same hash address may be linked together to
form a linked list.