DISCOVERING
COMPUTER
SCIENCE
Interdisciplinary Problems,
Principles, and Python
Programming
CHAPMAN & HALL/CRC
TEXTBOOKS IN COMPUTING
Series Editors
This series covers traditional areas of computing, as well as related technical areas, such as
software engineering, artificial intelligence, computer engineering, information systems, and
information technology. The series will accommodate textbooks for undergraduate and gradu-
ate students, generally adhering to worldwide curriculum standards from professional societ-
ies. The editors wish to encourage new and imaginative ideas and proposals, and are keen to
help and encourage new authors. The editors welcome proposals that: provide groundbreaking
and imaginative perspectives on aspects of computing; present topics in a new and exciting
context; open up opportunities for emerging areas, such as multi-media, security, and mobile
systems; capture new developments and applications in emerging fields of computing; and
address topics that provide support for computing, such as mathematics, statistics, life and
physical sciences, and business.
DISCOVERING
COMPUTER
SCIENCE
Interdisciplinary Problems,
Principles, and Python
Programming
Jessen Havill
Denison University
Granville, Ohio, USA
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2016 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface xv
Acknowledgments xxiii
Built-in functions 45
Strings 47
Modules 51
*2.5 BINARY ARITHMETIC 54
Finite precision 55
Negative integers 56
Designing an adder 57
Implementing an adder 58
2.6 SUMMARY 62
2.7 FURTHER DISCOVERY 62
Bibliography 709
Index 713
Preface
Web resources
The text, exercises, and projects often refer to files on the book’s accompanying
web site, which can be found at
http://discoverCS.denison.edu .
This web site also includes pointers for further exploration, links to additional
documentation, and errata.
To students
Learning how to solve computational problems and implement them as computer
programs requires daily practice. Like an athlete, you will get out of shape and fall
behind quickly if you skip it. There are no shortcuts. Your instructor is there to
help, but he or she cannot do the work for you.
With this in mind, it is important that you type in and try the examples
throughout the text, and then go beyond them. Be curious! There are numbered
“Reflection” questions throughout the book that ask you to stop and think about, or
apply, something that you just read. Often, the question is answered in the book
immediately thereafter, so that you can check your understanding, but peeking
ahead will rob you of an important opportunity.
There are many opportunities to delve into topics more deeply. Boxes scattered
throughout the text briefly introduce related, but more technical, topics. For the
most part, these are not strictly required to understand what comes next, but I
encourage you to read them anyway. In the “Further discovery” section of each
chapter, you can find additional pointers to explore chapter topics in more depth.
At the end of most sections are several programming exercises that ask you
to further apply concepts from that section. Often, the exercises assume that you
have already worked through all of the examples in that section. All later chapters
conclude with a selection of more involved interdisciplinary projects that you may
be asked by your instructor to tackle.
The book assumes no prior knowledge of computer science. However, it does
assume a modest comfort with high school algebra and mathematical functions.
Occasionally, trigonometry is mentioned, as is the idea of convergence to a limit,
but these are not crucial to an understanding of the main topics in this book.
To instructors
This book may be appropriate for a traditional CS1 course for majors, a CS0 course
for non-majors (at a slower pace and omitting more material), or an introductory
computing course for students in the natural and/or social sciences.
As suggested above, I emphasize computer science principles and the role of
abstraction, both functional and data, throughout the book. I motivate functions
as implementations of functional abstractions, and point out that strings, lists,
and dictionaries are all abstract data types that allow us to solve more interesting
problems than would otherwise be possible. I introduce the idea of time complexity
[Figure 1: chapter dependency chart. Chapter 5 (Forks in the road), Chapter 6 (Text, documents, and DNA), Chapter 7 (Designing programs), Chapter 8 (Data analysis), Chapter 9 (Flatland), Chapter 10 (Self-similarity and recursion), Chapter 11 (Organizing data), Chapter 12 (Networks), Chapter 13 (Abstract data types).]
intuitively, without formal definitions, in the first chapter and return to it several
times as more sophisticated algorithms are developed. The book uses a spiral
approach for many topics, returning to them repeatedly in increasingly complex
contexts. Where appropriate, I also weave into the book topics that are traditionally
left for later computer science courses. A few of these are presented in boxes that
may be covered at your discretion. None of these topics is introduced rigorously, as
they would be in a data structures course. Rather, I introduce them informally and
intuitively to give students a sense of the problems and techniques used in computer
science. I hope that the tables below will help you navigate the book, and see where
particular topics are covered.
This book contains over 600 end-of-section exercises and over 300 in-text reflection
questions that may be assigned as homework or discussed in class. At the end of
most chapters is a selection of projects (about 30 in all) that students may work on
independently or in pairs over a longer time frame. I believe that projects like these
are crucial for students to develop both problem solving skills and an appreciation
for the many fascinating applications of computer science.
Because this book is intended for a student who may take additional courses in
computer science and learn other programming languages, I intentionally omit some
features of Python that are not commonly found elsewhere (e.g., simultaneous swap,
chained comparisons, enumerate in for loops). You may, of course, supplement
with these additional syntactical features.
There is more in this book than can be covered in a single semester, giving an
instructor the opportunity to tailor the content to his or her particular situation and
interests. Generally speaking, as illustrated in Figure 1, Chapters 1–6 and 8 form the
core of the book, and should be covered sequentially. The remaining chapters can be
covered, partially or entirely, at your discretion, although I would expect that most
instructors will cover at least parts of Chapters 7, 10, 11, and 13. Chapter 7 contains
Chapter outlines
The following tables provide brief overviews of each chapter. Each table’s three
columns, reflecting the three parts of the book’s subtitle, provide three lenses through
which to view the chapter. The first column lists a selection of representative problems
that are used to motivate the material. The second column lists computer science
principles that are introduced in that chapter. Finally, the third column lists Python
programming topics that are either introduced or reinforced in that chapter to
implement the principles and/or solve the problems.
Chapter 9. Flatland
Sample problems: earthquake data, Game of Life, image filters, racial segregation, ferromagnetism, dendrites
Principles: 2-D data, cellular automata, digital images, color models
Programming: 2-D data in list of lists, nested loops, 2-D data in a dictionary
Software assumptions
To follow along in this book and complete the exercises, you will need to have installed
Python 3.4 (or later) on your computer, and have access to IDLE or another
programming environment. The book also assumes that you have installed the
matplotlib and numpy modules. Please refer to Appendix A for more information.
Errata
While I (and my students) have ferreted out many errors, readers will inevitably
find more. You can find an up-to-date list of errata on the book web site. If
you find an error in the text or have another suggestion, please let me know at
havill@denison.edu.
Acknowledgments
I was extraordinarily naïve when I embarked on this project two years ago. “How
hard can it be to put these ideas into print?” Well, much harder than I thought,
as it turns out. I owe debts of gratitude to many who saw me through to the end.
First and foremost, my family not only tolerated me during this period, but
offered extraordinary support and encouragement. Thank you Beth, for your patience
and strength, and all of the time you have made available to me to work on the
book. I am grateful to my in-laws, Roger and Nancy Vincent, who offered me their
place in Wyoming for a month-long retreat in the final stretch. And, to my four
children, Nick, Amelia, Caroline, and Lillian, I promise to make up for lost time.
My colleagues Matt Kretchmar, Ashwin Lall, and David White used drafts in
their classes, and provided invaluable feedback. They have been fantastic sounding
boards, and have graciously provided many ideas for exercises and projects. Students
in Denison University’s CS 111 and 112 classes caught many typos, especially Gabe
Schenker, Christopher Castillo, Christine Schmittgen, Alivia Tacheny, Emily Lamm,
and Ryan Liedke. Dana Myers read much of the book and offered an abundance of
detailed suggestions. Joan Krone also read early chapters and offered constructive
feedback. I am grateful to Todd Feil for his support, and his frank advice after
reading the almost-final manuscript.
I have benefitted tremendously from many conversations about computational
science, geology, and life with my friend and colleague David Goodwin. Project 8.1
is based on an assignment that he has used in his classes. I have also learned a great
deal from collaborations with Jeff Thompson. Jeff also advised me on Section 6.7
and Project 6.2. Frank Hassebrock enthusiastically introduced me to theories of
problem solving in cognitive psychology. And Dee Ghiloni, the renowned cat herder,
has supported me and my work in more ways than I can count.
I am indebted to the following reviewers, who read early chapters and offered
expert critiques: Terry Andres (University of Manitoba), John Impagliazzo (Qatar
University), Daniel Kaplan (Macalester College), Nathaniel Kell (Duke University),
Andrew McGettrick (University of Strathclyde), Hamid Mokhtarzadeh (University
of Minnesota), George Novacky (University of Pittsburgh), and J. F. Nystrom (Ferris
State University).
I could not have completed this book without the Robert C. Good Fellowship
awarded to me by Denison University.
Finally, thank you to Randi Cohen, for believing in this project, and for her
advice and patience throughout.
About the author
Jessen Havill is a Professor of Computer Science and the Benjamin Barney Chair
of Mathematics at Denison University, where he has been on the faculty since 1998.
Dr. Havill teaches courses across the computer science curriculum, as well as an
interdisciplinary elective in computational biology. He was awarded the college’s
highest teaching honor, the Charles A. Brickman Teaching Excellence Award, in
2013.
Dr. Havill is also an active researcher, with a primary interest in the development
and analysis of online algorithms. In addition, he has collaborated with colleagues
in biology and geosciences to develop computational tools to support research
and teaching in those fields. Dr. Havill earned his bachelor’s degree from Bucknell
University and his Ph.D. in computer science from The College of William and
Mary.
CHAPTER 1
What is computation?
We need to do away with the myth that computer science is about computers. Computer
science is no more about computers than astronomy is about telescopes, biology is about
microscopes or chemistry is about beakers and test tubes. Science is not about tools, it is
about how we use them and what we find out when we do.
Computers are the most powerful tools ever invented, but not because of their
versatility and speed, per se. Computers are powerful because they empower
us to innovate and make unprecedented discoveries.
A computer is a machine that carries out a computation, a sequence of simple
steps that transforms some initial information, an input, into some desired result,
the output. Computer scientists harness the power of computers to solve complex
problems by designing solutions that can be expressed as computations. The output
of a computation might be a more efficient route for a spacecraft, a more effective
protocol to control an epidemic, or a secret message hidden in a digital photograph.
Computer science has always been interdisciplinary, as computational problems
arise in virtually every domain imaginable. Social scientists use computational models
to better understand social networks, epidemics, population dynamics, markets,
and auctions. Scholars working in the digital humanities use computational tools to
curate and analyze classic literature. Artists are increasingly incorporating digital
technologies into their compositions and performances. Computational scientists
work in areas related to climate prediction, genomics, particle physics, neuroscience,
and drug discovery.
In this book, we will explore the fundamental problem solving techniques of
computer science, and discover how they can be used to model and solve a variety of
interdisciplinary problems. In this first chapter, we will provide an orientation and
lay out the context in which to place the rest of the book. We will further develop
all of these ideas throughout, so don’t worry if they are not all crystal clear at first.
[Figure 1.1: two familiar technologies viewed as input/output black boxes. A search engine transforms search terms into search results; a GPS device transforms an address (Buxton Inn, 313 E Broadway, Granville, OH 43023) into directions.]
y = 18x + 31
or
f(x) = 18x + 31,
you may have thought about the variable x as a representation of the input and
y, or f(x), as a representation of the output. In this example, when the input is
x = 2, the output is y = 67, or f(x) = 67. The arithmetic that turns x into y is a
very simple (and boring) example of a computation.
Reflection 1.1 What kinds of problems are you interested in? What are their inputs
and outputs? Are the inputs and outputs, as you have defined them, sufficient to
define the problem completely?
To use the technologies illustrated in Figure 1.1 you do not need to understand how
the underlying computation transforms the input to the output; we can think of
the computation as a “black box” and still use the technology effectively. We call
this idea functional abstraction, a very important concept that we often take for
granted. Put simply,
A functional abstraction describes how to use a tool or technology without
necessarily providing any knowledge about how it works.
We exist in a world of abstractions; we could not function without them. We
even think about our own bodies in terms of abstractions. Move your fingers. Did
you need to understand how your brain triggered your nervous and musculoskeletal
systems to make that happen? As far as most of us are concerned, a car is also an
abstraction. To drive a car, do you need to know how turning the steering wheel
turns the car or pushing the accelerator makes it go faster? We understand what
should happen when we do these things, but not necessarily how they happen.
Without abstractions, we would be paralyzed by an avalanche of minutiae.
Reflection 1.2 Imagine that it was necessary to understand how a GPS device
works in order to use it. Or a music player. Or a computer. How would this affect
your ability to use these technologies?
New technologies and automation have introduced new functional abstractions into
everyday life. Our food supply is a compelling example of this. Only a few hundred
years ago, our ancestors knew exactly where their food came from. Inputs of hard
work and suitable weather produced outputs of grain and livestock to sustain a
family. In modern times, we input money and get packaged food; the origins of our
food have become much more abstract.
Reflection 1.3 Think about a common functional abstraction that you use regularly,
such as your phone or a credit card. How has this functional abstraction changed
over time? Can you think of instances in which better functional abstractions have
enhanced our ability to use a technology?
We also use layers of functional abstractions to work more efficiently. For example,
suppose you are the president of a large organization (like a university) that is
composed of six divisions and 5,000 employees. Because you cannot effectively
manage every detail of such a large organization, you assign a vice president to
oversee each of the divisions. You expect each VP to keep you informed about the
general activity and performance of that division, but insulate you from the day-to-
day details. In this arrangement, each division becomes a functional abstraction to
you: you know what each division does, but not necessarily how it does it. Benefitting
from these abstractions, you are now free to focus your resources on more important
organization-level activity. Each VP may utilize a similar arrangement within his
or her division. Indeed, organizations are often subdivided many times until the
number of employees in a unit is small enough to be overseen by a single manager.
Computers are similarly built from many layers of functional abstractions. When
you use a computer, you are presented with a “desktop” abstraction on which you
can store files and use applications (i.e., programs) to do work. That there appear
to be many applications executing simultaneously is also an abstraction. In reality,
some applications may be executing in parallel while others are being executed
one at a time, but alternating so quickly that they just appear to be executing in
parallel. This interface and the basic management of all of the computer’s resources
(e.g., hard drives, network, memory, security) is handled by the operating system.
An operating system is a complicated beast that is often mistakenly described as
“the program that is always running in the background on a computer.” In reality,
the core of an operating system provides several layers of functional abstractions
that allow us to use computers more efficiently.
Computer scientists invent computational processes (e.g., search engines, GPS
software, and operating systems) that are then packaged as functional abstractions
for others to use. But, as we will see, they also harness abstraction to correctly
and efficiently solve real-world problems. These problems are usually complicated
enough that they must be decomposed into smaller problems that human beings can
understand and solve. Once solved, each of these smaller pieces becomes a functional
abstraction that can be used in the solution to the original problem.
The ability to think in abstract terms also enables us to see similarities in prob-
lems from different domains. For example, by isolating the fundamental operations
of sexual reproduction (i.e., mutation and recombination) and natural selection (i.e.,
survival of the fittest), the process of evolution can be thought of as a randomized
computation. From this insight was born a technique called evolutionary computa-
tion that has been used to successfully solve thousands of problems. Similarly, a
technique known as simulated annealing applies insights gained from the process
of slow-cooling metals to effectively find solutions to very hard problems. Other
examples include techniques based on the behaviors of crowds and insect swarms,
the operations of cellular membranes, and how networks of neurons make decisions.
use to carry out this formula. For example, each of these algorithms computes the
volume of a sphere.
1. Divide 4 by 3.
2. Multiply the previous result by π.
3. Repeat the following three times:
   multiply the previous result by r.

or

1. Compute r × r × r.
2. Multiply the previous result by π.
3. Multiply the previous result by 4.
4. Divide the previous result by 3.
Both of these algorithms use the same formula, but they execute the steps in
different ways. This distinction may seem trivial to us but, depending on the level
of abstraction, we may need to be this explicit when “talking to” a computer.
Reflection 1.4 Write yet another algorithm that follows the volume formula.
def sphereVolume(radius):
    volume = (4 / 3) * 3.14159 * (radius ** 3)
    return volume

result = sphereVolume(10)
print(result)
Each line in a program is called a statement. The first statement in this program,
beginning with def, defines a new function. Like an algorithm, a function contains a
sequence of steps that transforms an input into an output. In this case, the function is
named sphereVolume, and it takes a single input named radius, in the parentheses
following the function name. The second line, which is indented to indicate that it is
part of the sphereVolume function, uses the volume formula to compute the volume
of the sphere, and then assigns this value to a variable named volume. The third line
indicates that the value assigned to volume should be “returned” as the function’s
output. These first three lines only define what the sphereVolume function should
do; the fourth line actually invokes the sphereVolume function with input radius
[Figure 1.2: the path from problem to computation. A problem is solved by designing an algorithm; the algorithm is implemented as a high-level program; executing (or "running") the program produces a computation.]
equal to 10, and assigns the result (approximately 4188.79) to the variable named result. The
fifth line, you guessed it, prints the value assigned to result.
If you have never seen a computer program before, this probably looks like
“Greek” to you. But, at some point in your life, so did (4/3)πr³. (What is this
symbol π? How can you do arithmetic with a letter? What is that small 3 doing
up there? There is no “plus” or “times” symbol; is this even math?) The same
can be said for notations like H₂O or 19°C. But now that you are comfortable
with these representations, you can appreciate that they are more convenient and
less ambiguous than representations like “multiply 4/3 by π by the radius cubed”
and “two atoms of hydrogen and one atom of oxygen.” You should think of a
programming language as simply an extension of the familiar arithmetic notation.
But programming languages enable us to express algorithms for problems that reach
well beyond simple arithmetic.
The process that we have described thus far is illustrated in Figure 1.2. We start
with a problem having well-defined input and output. We then solve the problem
by designing an algorithm. Next, we implement the algorithm by expressing it as
a program in a programming language. Programming languages like Python are
often called high-level because they use familiar words like “return” and “print,”
and enable a level of abstraction that is familiar to human problem solvers. (As we
will see in Section 1.4, computers do not really “understand” high-level language
programs.) Finally, executing the program on a computer initiates the computation
that gives us our results. (Executing a program is also called “running” it.) As we
will see soon, this picture is hiding some details, but it is sufficient for now.
Let’s look at another, slightly more complicated problem. Suppose, as part of an
ongoing climate study, we are tracking daily maximum water temperatures recorded
by a drifting buoy in the Atlantic Ocean. We would like to compute the average (or
mean) of these temperatures over one year. Suppose our list of high temperature
readings (in degrees Celsius) starts like this:
18.9, 18.9, 19.0, 19.2, 19.3, 19.3, 19.2, 19.1, 19.4, 19.3, ...
Reflection 1.5 What are the input and output for this problem? Think carefully
about all the information you need in the input. Express your output in terms of the
input.
The input to the mean temperature problem obviously includes the list of tempera-
tures. We also need to know how many temperatures we have, since we need to divide
by that number. The output is the mean temperature of the list of temperatures.
Reflection 1.6 What algorithm can we use to find the mean daily temperature?
Think in terms of the steps we followed in the algorithms to compute the volume of
a sphere.
Of course, we know that we need to add up all the temperatures and divide by the
number of days. But even a direction as simple as “add up all the temperatures” is too
complex for a computer to execute directly. As we will see in Section 1.4, computers
are, at their core, only able to execute instructions on par with simple arithmetic
on two numbers at a time. Any complexity or “intelligence” that we attribute to
computers is actually attributable to human beings harnessing this simple palette of
instructions in creative ways. Indeed, this example raises two necessary characteristics
of computer algorithms: their steps must be both unambiguous and executable by a
computer. In other words, the steps that a computer algorithm follows must correlate
to things a computer can actually do, without inviting creative interpretation by a
human being.
These two requirements are not really unique to computer algorithms. For example,
we hope that new surgical techniques are unambiguously presented and reference
actual anatomy and real surgical tools. Likewise, when an architect designs a
building, she must take into account the materials available and be precise about
their placement. And when an author writes a novel, he must write to his audience,
using appropriate language and culturally familiar references.
To get a handle on how to translate “add up all the temperatures” into something
a computer can understand, let’s think about how we would do this without a
Since this 367-step algorithm is pretty cumbersome to write, and steps 2–366 are
essentially the same, we can shorten the description of the algorithm by substituting
steps 2–366 with
For each temperature t in our list, repeat the following:
add t to the running sum, and assign the result to be the new
running sum.
In this shortened representation, which is called a loop, t stands in for each temper-
ature. First, t is assigned the first temperature in the list, 18.9, which the indented
statement adds to the running sum. Then t is assigned the second temperature, 18.9,
which the indented statement next adds to the running sum. Then t is assigned the
third temperature, 19.0. And so on. With this substitution (in red), our algorithm
becomes:
We can visualize the execution of this loop more explicitly by “unrolling” it into its
equivalent sequence of statements. The statement indented under step 2 is executed
once for each different value of t in the list. Each time, the running sum is updated.
So, if our list of temperatures is the same as before, this loop is equivalent to the
following sequence of statements:
(a) Add 18.9 (t) to 0 (the running sum), and assign 18.9 (the result) to be the new running sum.
(b) Add 18.9 (t) to 18.9 (the running sum), and assign 37.8 (the result) to be the new running sum.
(c) Add 19.0 (t) to 37.8 (the running sum), and assign 56.8 (the result) to be the new running sum.
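If you would like to watch this running sum grow in Python, the following short experiment (an illustration only, not code from the text) mirrors steps (a) through (c):

runningSum = 0
for t in [18.9, 18.9, 19.0]:       # the first three temperatures
    runningSum = runningSum + t
    print(runningSum)              # prints 18.9, then 37.8, then about 56.8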
Another important benefit of the loop representation is that it renders the algorithm
representation independent of the length of the list of temperatures. In Mean Tem-
perature 1, we had to know up front how many temperatures there were because
we needed one statement for each temperature. However, in Mean Temperature
2, there is no mention of 365, except in the final statement; the loop simply repeats
as long as there are temperatures remaining. Therefore, we can generalize Mean
Temperature 2 to handle any number of temperatures. Actually, there is nothing
in the algorithm that depends on these values being temperatures at all, so we
should also generalize it to handle any list of numbers. If we let n denote the length
of the list, then our generalized algorithm looks like the following:
Algorithm Mean
Input: a list of n numbers
1. Initialize the running sum to 0.
2. For each number t in our list, repeat the following:
Add t to the running sum, and assign the result to be the new
running sum.
3. Divide the running sum by n, and assign the result to be the mean.
Output: the mean
As you can see, there are often different, but equivalent, ways to represent an
algorithm. Sometimes the one you choose depends on personal taste and the pro-
gramming language in which you will eventually express it. Just to give you a sense
of what is to come, we can express the Mean algorithm in Python like this:
def mean(values):
    n = len(values)
    sum = 0
    for number in values:
        sum = sum + number
    mean = sum / n
    return mean
Try to match the statements in Mean to the statements in the program. By doing
so, you can get a sense of what each part of the program must do. We will flesh
this out in more detail later, of course; in a few chapters, programs like this will
be second nature. As we pointed out earlier, once you are comfortable with this
notation, you will likely find it much less cumbersome and more clear than writing
everything in full sentences, with the inherent ambiguities that tend to emerge in
the English language.
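As a quick check, you could call this function on the first ten temperature readings listed earlier; the variable name temperatures below is simply an illustrative choice, not something defined in the text.

temperatures = [18.9, 18.9, 19.0, 19.2, 19.3, 19.3, 19.2, 19.1, 19.4, 19.3]
print(mean(temperatures))     # prints the mean of the ten readings, about 19.16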
Exercises
1.2.1. Describe three examples from your everyday life in which an abstraction is beneficial.
Explain the benefits of each abstraction versus what life would be like without it.
1.2.2. Write an algorithm to find the minimum value in a list of numbers. Like the
Mean algorithm, you may only examine one number at a time. But instead of
remembering the current running sum, remember the current minimum value.
1.2.3. You are organizing a birthday party where cookies will be served. On the invite
list are some younger children and some older children. The invite list is given in
shorthand as a list of letters, y for a young child and o for an older child. Each
older child will eat three cookies, while each younger child will eat two cookies.
Write an algorithm that traces through the list, and prints how many cookies are
needed. For example, if the input were y, y, o, y, o then the output should be 12
cookies.
1.2.4. Write an algorithm for each player to follow in a simple card game like Blackjack
or Go Fish. Assume that the cards are dealt in some random order.
1.2.5. Write an algorithm to sort a stack of any 5 cards by value in ascending order. In
each step, your algorithm may compare or swap the positions of any two cards.
1.2.6. Write an algorithm to walk between two nearby locations, assuming the only legal
instructions are “Take s steps forward,” and “Turn d degrees to the left,” where s
and d are positive integers.
(For simplicity, we will assume that everyone answers the phone right away and
every phone call takes the same amount of time.) Is this the best algorithm for
solving the problem? What does “best” even mean?
Reflection 1.7 For the “phone tree” problem, what characteristics would make one
algorithm better than another?
As the question suggests, deciding whether one algorithm is better than another
[Figure 1.3: three phone tree algorithms for eight people A through H. In (a), A calls B, B calls C, and so on through H. In (b), A calls each of the other seven people in turn. In (c), A calls two people, each of whom calls two more, so calls happen concurrently; the label on each call gives the time step in which it is made.]
depends on your criteria. For the “phone tree” problem, your primary criterion may
be to make as few calls as possible yourself, delegating most of the work to others.
The Alphabetical Phone Tree algorithm, graphically depicted in Figure 1.3(a),
is one way to satisfy this criterion. Alternatively, you may feel guilty about making
others do any work, so you decide to make all the calls yourself; this is depicted
in Figure 1.3(b). However, in the interest of community safety, we should really
organize the phone tree so that the last person is notified of the emergency as soon
as possible. Both of the previous two algorithms fail miserably in this regard. In fact,
they both take the longest possible time! A better algorithm would have you call
two people, then each of those people call two people, and so on. This is depicted in
Figure 1.3(c). Notice that this algorithm notifies everyone in only four steps because
more than one phone call happens concurrently; A and B are both making calls
during the second step; and B, C, and D are all making calls in the third step.
If all of the calls were utilizing a shared resource (such as a wireless network),
we might need to balance the time with the number of simultaneous calls. This
may not seem like an issue with only eight people in our phone tree, but it would
become a significant issue with many more people. For example, applying the
algorithm depicted in Figure 1.3(c) to thousands of people would result in thousands
of simultaneous calls.
Let’s now consider how these concerns might apply to algorithms more generally.
[Figure 1.4: Plot of (a) a year’s worth of daily high temperature readings and (b) the temperature readings smoothed over a five-day window.]
Reflection 1.8 What general characteristics might make one algorithm for a par-
ticular problem better than another algorithm for the same problem?
As was the case in the phone tree problem, the most important hallmark of a good
algorithm is speed; given the choice, we almost always want the algorithm that
requires the least amount of time to finish. (Would you rather wait five minutes
for a web search or half of a second?) But there are other attributes as well that
can distinguish one algorithm from another. For example, we saw in Section 1.2
how a long, tedious algorithm can be represented more compactly using a loop;
the more compact version is easier to write down and translate into a program.
The compact version also requires less space in a computer’s memory. Because the
amount of storage space in a computer is limited, we want to create algorithms that
use as little of this resource as possible to store both their instructions and their
data. Efficient resource usage may also apply to network capacity, as in the phone
tree problem on a wireless network. In addition, just as writers and designers strive
to create elegant works, computer scientists pride themselves on writing elegant
algorithms that are easy to understand and do not contain extraneous steps. And
some advanced algorithms are considered to be especially important because they
introduce new techniques that can be applied to solve other hard problems.
A smoothing algorithm
To more formally illustrate how we can evaluate the time required by an algorithm,
let’s revisit the sequence of temperature readings from the previous section. Often,
when we are dealing with large data sets (much longer than this), anomalies can
arise due to errors in the sensing equipment, human fallibility, or in the network used
to send results to a lab or another collection point. We can mask these erroneous
measurements by “smoothing” the data, replacing each value with the mean of
the values in a “window” of values containing it. This technique is also useful for
[Figure 1.5: Plot of (a) the ten high temperature readings and (b) the temperature readings smoothed over a five-day window.]
18.9, 18.9, 19.0, 19.2, 19.3, 19.3, 19.2, 22.1, 19.4, 19.3
The plot of this data in Figure 1.5(a) illustrates this erroneous “bump.”
Now let’s smooth the data by averaging over windows of size five. For each value
in the original list, its window will include itself and the four values that come
before it. Our algorithm will need to compute the mean of each of these windows,
and then add each of these means to a new smoothed list. The last four values do
not have four more values after them, so our smoothed list will contain four fewer
values than our original list. The first window looks like this:
[18.9, 18.9, 19.0, 19.2, 19.3], 19.3, 19.2, 22.1, 19.4, 19.3
mean = 95.3 / 5 = 19.06    (the current window is shown in brackets)
To find the mean temperature for the window, we sum the five values and divide by
5. The result, 19.06, will represent this window in the smoothed list. The remaining
five windows are computed in the same way:
18.9, [18.9, 19.0, 19.2, 19.3, 19.3], 19.2, 22.1, 19.4, 19.3
mean = 95.7 / 5 = 19.14

18.9, 18.9, [19.0, 19.2, 19.3, 19.3, 19.2], 22.1, 19.4, 19.3
mean = 96.0 / 5 = 19.20

18.9, 18.9, 19.0, [19.2, 19.3, 19.3, 19.2, 22.1], 19.4, 19.3
mean = 99.1 / 5 = 19.82

18.9, 18.9, 19.0, 19.2, [19.3, 19.3, 19.2, 22.1, 19.4], 19.3
mean = 99.3 / 5 = 19.86

18.9, 18.9, 19.0, 19.2, 19.3, [19.3, 19.2, 22.1, 19.4, 19.3]
mean = 99.3 / 5 = 19.86
We can see from the plot of this smoothed list in Figure 1.5(b) that the “bump” has
indeed been smoothed somewhat.
We can express our smoothing algorithm using notation similar to our previous
algorithms. Notice that, to find each of the window means in our smoothing algorithm,
we call upon our Mean algorithm from Section 1.2.
Algorithm Smooth
Input: a list of n numbers and a window size w
1. Create an empty list of mean values.
2. For each position d, from 1 to n − w + 1, repeat the following:
(a) invoke the Mean algorithm on the list of numbers between
positions d and d + w − 1;
(b) append the result to the list of mean values.
Output: the list of mean values
Step 1 creates an empty list to hold the window means. We will append the mean
for each window to the end of this list after we compute it. Step 2 uses a loop to
compute the means for all of the windows. This loop is similar to the loop in the
Mean algorithm: the variable d takes on the values between 1 and n − w + 1, like t
took on each value in the list in the Mean algorithm. First, d is assigned the value
1, and the mean of the window between positions 1 and 1 + w − 1 = w is added to
the list of mean values. Then d is assigned the value 2, and the mean of the window
between positions 2 and 2 + w − 1 = w + 1 is added to the list of mean values. And so
on, until d takes on the value of n − w + 1, and the mean of the window between
positions n − w + 1 and (n − w + 1) + w − 1 = n is added to the list of mean values.
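The text does not show a Python version of Smooth at this point, but a minimal sketch might look like the following. It assumes the mean function defined earlier is available; the name smooth, the use of list slicing, and 0-based positions (so d runs from 0 to n − w) are choices made here purely for illustration.

def smooth(values, w):
    means = []                             # step 1: an empty list of mean values
    for d in range(len(values) - w + 1):   # step 2: one window starting at each position
        window = values[d:d + w]           # the w numbers in this window
        means.append(mean(window))         # (a) and (b): compute the mean and append it
    return means

Calling smooth with the ten readings above and w = 5 should reproduce the six window means worked out by hand.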
Reflection 1.9 Carry out each step of the Smooth algorithm with w = 5 and the
ten-element list in our example:
18.9, 18.9, 19.0, 19.2, 19.3, 19.3, 19.2, 22.1, 19.4, 19.3
Compare each step with what we worked out above to convince yourself that the
algorithm is correct.
Reflection 1.10 Based on this analysis, what are the elementary steps in this
algorithm? How many times are they each executed?
The only elementary steps we have identified are creating a list, appending to a
list, and arithmetic operations. Let’s start by counting the number of arithmetic
operations that are required. If each window has size five, then we perform five
addition operations and one division operation each time we invoke the Mean
algorithm, for a total of six arithmetic operations per window. Therefore, the total
number of arithmetic operations is six times the number of windows. In general, the
window size is denoted w and there are n − w + 1 windows. So the algorithm performs
w additions and one division per window, for a total of (w + 1) ⋅ (n − w + 1) arithmetic
operations. For example, smoothing a list of 365 temperatures with a window size
of five days will require (w + 1) ⋅ (n − w + 1) = 6 ⋅ (365 − 5 + 1) = 2,166 arithmetic
operations. In addition to the arithmetic operations, we create the list once and
append to the list once for every window, a total of n − w + 1 times. Therefore, the
total number of elementary steps is
(w + 1) ⋅ (n − w + 1) + 1 + (n − w + 1),
where the first term counts the arithmetic operations, the middle 1 counts creating the list, and the final n − w + 1 counts the appends.
The sum for the second window must be almost the same as the first window, since
they have four numbers in common. The only difference in the second window is
that it loses the first 18.9 and gains 19.3. So once we have the sum for the first
window (95.3), we can get the sum of the second window with only two additional
arithmetic operations: 95.3 − 18.9 + 19.3 = 95.7.
18.9, [18.9, 19.0, 19.2, 19.3, 19.3], 19.2, 22.1, 19.4, 19.3
sum = 95.7
We can apply this process to every subsequent window as well. At the end of the
algorithm, once we have the sums for all of the windows, we can simply divide each
by its window length to get the final list of means. Written in the same manner as
the previous algorithms, our new smoothing algorithm looks like this:
Algorithm Smooth 2
Input: a list of n numbers and a window size w
1. Create an empty list of window sums.
2. Compute the sum of the first w numbers and
append the result to the list of window sums.
3. For each position d, from 2 to n − w + 1, repeat the following:
(a) subtract the number in position d − 1 from the previous window
sum and then add the number in position d + w − 1;
(b) append the result to the list of window sums.
4. For each position d, from 1 to n − w + 1, repeat the following:
(a) divide the dth window sum by w.
Output: the list of mean values
Step 2 computes the sum for the first window and adds it to a list of window sums.
Step 3 then computes the sum for each subsequent window by subtracting the
number that is lost from the previous window and adding the number that is gained.
Finally, step 4 divides all of the window sums by w to get the list of mean values.
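As before, the text does not give Python here, but one possible sketch of Smooth 2 (the function and variable names are illustrative, and positions are 0-based) is:

def smooth2(values, w):
    sums = []                                      # step 1: an empty list of window sums
    windowSum = 0
    for i in range(w):                             # step 2: sum of the first w numbers
        windowSum = windowSum + values[i]
    sums.append(windowSum)
    for d in range(1, len(values) - w + 1):        # step 3: each subsequent window
        windowSum = windowSum - values[d - 1] + values[d + w - 1]
        sums.append(windowSum)
    means = []
    for s in sums:                                 # step 4: divide each window sum by w
        means.append(s / w)
    return means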
Reflection 1.12 As with the previous algorithm, carry out each step of the Smooth
2 algorithm with w = 5 and the ten-element list in our example:
18.9, 18.9, 19.0, 19.2, 19.3, 19.3, 19.2, 22.1, 19.4, 19.3
Does the Smooth 2 algorithm actually require fewer elementary steps than our
first attempt? Let’s look at each step individually. As before, step 1 counts as one
elementary step. Step 2 requires w − 1 addition operations and one append, for a
total of w elementary steps. Step 3 performs an addition, a subtraction, and an
append for every window but the first, for a total of 3(n − w) elementary steps.
Finally, step 4 performs one division operation per window, for a total of n − w + 1
arithmetic operations. This gives us a total of
1 + w + 3(n − w) + (n − w + 1)
elementary steps, where the 1 counts creating the list, w counts the steps for the first window, 3(n − w) counts the steps for the other windows, and n − w + 1 counts the divisions. Combining all of these terms, we find that the time complexity of
Smooth 2 is 4n − 3w + 2. Therefore, our old algorithm requires
((w + 2) ⋅ (n − w + 1) + 1) / (4n − 3w + 2)
times as many operations as the new one. It is hard to tell from this fraction, but
our new algorithm is about w/4 times faster than our old one. To see this, suppose
our list contains n = 10,000 temperature readings. The following table shows the
value of the fraction for increasing window sizes w.
w Speedup
5 1.7
10 3.0
20 5.5
100 25.4
500 123.9
1,000 243.7
When w is small, our new algorithm does not make much difference, but the difference
becomes quite pronounced when w gets larger. In real applications of smoothing on
extremely large data sets containing billions or trillions of items, window sizes can
be as high as w = 100,000. For example, smoothing is commonly used to visualize
statistics over DNA sequences that are billions of units long. So our refined algorithm
can have a marked impact! Indeed, as we will examine further in Section 6.4, it is
often the case that a faster algorithm can reduce the time required for a computation
significantly more than faster hardware.
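If you want to reproduce the speedup table above, a few lines of Python (a throwaway check, not part of the text) will do it:

n = 10000
for w in [5, 10, 20, 100, 500, 1000]:
    oldSteps = (w + 2) * (n - w + 1) + 1      # elementary steps in Smooth
    newSteps = 4 * n - 3 * w + 2              # elementary steps in Smooth 2
    print(w, round(oldSteps / newSteps, 1))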
Exercises
1.3.1. The phone tree algorithm depicted in Figure 1.3(c) comes very close to making
all of the phone calls in three steps. Is it possible to actually achieve three steps?
How?
1.3.2. Suppose the phone tree algorithm illustrated in Figure 1.3(c) was being executed
with a large number of people. We saw one call made during the first time step,
two calls during the second time step, and three calls during the third time step.
(a) How many concurrent calls would be made during time steps 4, 5, and 6?
(b) In general, can you provide a formula (or algorithm) to determine the number
of calls made during any time step t?
1.3.3. Describe the most important criterion for evaluating how good an algorithm is.
Then add at least one more criterion and describe how it would be applied to
algorithms for some everyday problem.
1.3.4. What is the time complexity of the Mean algorithm on Page 10, in terms of n
(the size of the input list)?
[Figure 1.6: Inside a computer. One or more processor cores and a memory are connected by a bus, which also connects secondary storage (HD/SSD), the Internet, and peripherals such as a printer.]
The types of instructions that constitute a machine language are based on the internal
design of a modern computer. As illustrated in Figure 1.6, a computer essentially
consists of one or more processors connected to a memory. A computer’s memory,
often called RAM (short for random access memory), is conceptually organized as
a long sequence of cells, each of which can contain one unit of information. Each
cell is labeled with a unique memory address that allows the processor to reference
it specifically. So a computer’s memory is like a huge sequence of equal-sized post
office boxes, each of which can hold exactly one letter. Each P.O. box number is
analogous to a memory address and a letter is analogous to one unit of information.
The information in each cell can represent either one instruction or one unit of
data.1 So a computer’s memory stores both programs and the data on which the
programs work.
A processor, often called a CPU (short for central processing unit) or a core,
contains both the machinery for executing instructions and a small set of memory
locations called registers that temporarily hold data values needed by the current
instruction. If a computer contains more than one core, as most modern computers
do, then it is able to execute more than one instruction at a time. These instructions
may be from different programs or from the same program. This means that our
definition of an algorithm as a sequence of instructions on Page 7 is not strictly
correct. In fact, an algorithm (or a program) may consist of several semi-independent
sequences of steps called threads that cooperatively solve a problem.
As illustrated in Figure 1.6, the processors and memory are connected by a
communication channel called a bus. When a processor needs a value in memory,
it transmits the request over the bus, and then the memory returns the requested
value the same way. The bus also connects the processors and memory to several
other components that either improve the machine’s efficiency or its convenience,
like the Internet and secondary storage devices like hard drives, solid state drives,
and flash memory. As you probably know, the contents of a computer’s memory are
lost when the computer loses power, so we use secondary storage to preserve data
for longer periods of time. We interact with these devices through a “file system”
abstraction that makes it appear as if our hard drives are really filing cabinets.
When you execute a program or open a file, it is first copied from secondary storage
into memory where the processor can access it.
Machine language
The machine language instructions that a processor can execute are very simple. For
example, consider something as simple as computing x = 2 + 5. In a program, this
statement adds 2 and 5, and stores the result in a memory location referred to by
the variable x. But even this is likely to be too complex for one machine language
instruction. Depending on the computer, the machine language equivalent likely
consists of four instructions. The first instruction loads the value 2 into a register in
the processor. The second instruction loads the value 5 into another register in the
processor. The third instruction adds the values in these two registers, and stores
1 In reality, each instruction or unit of data usually occupies multiple contiguous cells.
the result in a third register. Finally, the fourth instruction stores the value in the
third register into the memory cell with address x.
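Python gives a glimpse of an analogous (though not identical) low-level view: the standard library's dis module displays the byte code that the Python interpreter itself executes. For an assignment such as x = a + b you will typically see separate load, add, and store steps, paralleling the register-based description above; the exact opcode names vary between Python versions, and this snippet is only an illustration, not part of the text.

import dis

# Disassemble a tiny assignment statement. The output generally shows two
# load instructions, an add, and a store (names differ by Python version).
dis.dis("x = a + b")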
From the moment a computer is turned on, its processors are operating in a
continuous loop called the fetch and execute cycle (or machine cycle). In each cycle,
the processor fetches one machine language instruction from memory and executes
it. This cycle repeats until the computer is turned off or loses power. In a nutshell,
this is all a computer does. The rate at which a computer performs the fetch and
execute cycle is loosely determined by the rate at which an internal clock “ticks”
(the processor’s clock rate). The ticks of this clock keep the machine’s operations
synchronized. Modern personal computers have clocks that tick a few billion times
each second; a 3 gigahertz (GHz) processor ticks 3 billion times per second (“giga”
means “billion” and a “hertz” is one tick per second).
So computers, at their most basic level, really are quite dumb; the processor
blindly follows the fetch and execute cycle, dutifully executing whatever sequence
of simple instructions we give it. The frustrating errors that we yell at computers
about are, in fact, human errors. The great thing about computers is not that they
are smart, but that they follow our instructions so very quickly; they can accomplish
an incredible amount of work in a very short amount of time.
Everything is bits
Our discussion so far has glossed over a very important consideration: in what
form does a computer store programs and data? In addition to machine language
instructions, we need to store numbers, documents, maps, images, sounds, presenta-
tions, spreadsheets, and more. Using a different storage medium for each type of
information would be insanely complicated and inefficient. Instead, we need a simple
storage format that can accommodate any type of data. The answer is bits. A bit
is the simplest possible unit of information, capable of taking on one of only two
values: 0 or 1 (or equivalently, off/on, no/yes, or false/true). This simplicity makes
both storing information (i.e., memory) and computing with it (e.g., processors)
relatively simple and efficient.
0100010001010101
This bit sequence can represent each of the following, depending upon how it is
interpreted:
(d) the Intel machine language instruction inc x (inc is short for “increment”);
or
(e) the following 4 × 4 black and white image, called a bitmap (0 represents a
white square and 1 represents a black square).
For now, let’s just look more closely at (a). Integers are represented in a computer
using the binary number system, which is understood best by analogy to the decimal
number system. In decimal, each position in a number has a value that is a power
of ten: from right to left, the positional values are 10⁰ = 1, 10¹ = 10, 10² = 100, etc.
The value of a decimal number comes from adding the digits, each first multiplied
by the value of its position. So the decimal number 1,831 represents the value
1 × 10³ + 8 × 10² + 3 × 10¹ + 1 × 10⁰ = 1000 + 800 + 30 + 1 = 1831.
The binary number system is no different, except that each position represents a
power of two, and there are only two digits instead of ten. From right to left, the
binary number system positions have values 2⁰ = 1, 2¹ = 2, 2² = 4, 2³ = 8, etc. So, for
example, the binary number 110011 represents the value
1 × 2⁵ + 1 × 2⁴ + 0 × 2³ + 0 × 2² + 1 × 2¹ + 1 × 2⁰ = 32 + 16 + 2 + 1 = 51
in decimal.
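You can check conversions like this one in Python. The built-in int function accepts a base, and the positional sum can also be written out directly; this snippet is only an illustration, not code from the text.

print(int("110011", 2))                                        # prints 51
print(1*2**5 + 1*2**4 + 0*2**3 + 0*2**2 + 1*2**1 + 1*2**0)     # also prints 51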
Reflection 1.13 To check your understanding, show why the binary number
1001000 is equivalent to the decimal number 72.
This idea can be extended to numbers with a fractional part as well. In decimal, the
positions to the right of the decimal point have values that are negative powers of
10: the tenths place has value 10⁻¹ = 0.1, the hundredths place has value 10⁻² = 0.01,
etc. So the decimal number 18.31 represents the value
1 × 10¹ + 8 × 10⁰ + 3 × 10⁻¹ + 1 × 10⁻² = 10 + 8 + 0.3 + 0.01 = 18.31.
Similarly, in binary, the positions to the right of the “binary point” have values that
are negative powers of 2. For example, the binary number 11.0011 represents the
value
1 × 2¹ + 1 × 2⁰ + 0 × 2⁻¹ + 0 × 2⁻² + 1 × 2⁻³ + 1 × 2⁻⁴ = 2 + 1 + 0 + 0 + 1/8 + 1/16 = 3 3/16
in decimal. This is not, however, how we derived (b) above. Numbers with fractional
components are stored in a computer using a different format that allows for a much
greater range of numbers to be represented. We will revisit this in Section 2.2.
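Python has no literal for fractional binary strings, but you can evaluate the positional sum for a number like 11.0011 directly (again, just an illustrative check, not code from the text):

value = 1*2**1 + 1*2**0 + 0*2**-1 + 0*2**-2 + 1*2**-3 + 1*2**-4
print(value)     # prints 3.1875, which is 3 3/16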
Reflection 1.14 To check your understanding, show why the binary number
1001.101 is equivalent to the decimal number 9 5/8.
Each row of the truth table represents a different combination of the values of a
and b. These combinations are shown in the first two columns. The last column of
the truth table contains the corresponding values of a and b. We see that a and b
is 0 in all cases, except where both a and b are 1. If we let 1 represent “true” and
0 represent “false,” this conveniently matches our own intuitive meaning of “and.”
The statement “the barn is red and white” is true only if the barn both has red on
it and has white on it.
Second, a or b is equal to 1 if at least one of a or b is equal to 1; otherwise a or
b is equal to 0. This is represented by the following truth table:
a b a or b
0 0 0
0 1 1
1 0 1
1 1 1
Notice that a or b is 1 even if both a and b are 1. This is different from our normal
understanding of “or.” If we say that “the barn is red or white,” we usually mean it
is either red or white, not both. But the Boolean or operator also evaluates to 1 (true) when both operands are true.
(There is another Boolean operator called “exclusive or” that does mean “either/or,”
but we won’t get into that here.)
Finally, the not operator simply inverts a bit, changing a 0 to 1, or a 1 to 0. So,
not a is equal to 1 if a is equal to 0, or 0 if a is equal to 1. The truth table for the
not operator is simple:
a not a
0 1
1 0
(In formal Boolean algebra, a and b is usually represented a ∧ b, a or b is usually
represented a ∨ b, and not a is usually represented as ¬a.)
With these basic operators, we can build more complicated expressions. For example,
suppose we wanted to find the value of the expression not a and b. (Note that the
not operator applies only to the variable immediately after it, in this case a, not
the entire expression a and b. For not to apply to the expression, we would need
parentheses: not (a and b).) We can evaluate not a and b by building a truth table
for it. We start by listing all of the combinations of values for a and b, and then
creating a column for not a, since we need that value in the final expression:
a b not a
0 0 1
0 1 1
1 0 0
1 1 0
Then, we create a column for not a and b by anding the third column with the
second.
a b not a not a and b
0 0 1 0
0 1 1 1
1 0 0 0
1 1 0 0
So we find that not a and b is 1 only when a is 0 and b is 1. Or, equivalently, not
a and b is true only when a is false and b is true. (Think about that for a moment;
it makes sense, doesn’t it?)
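Python’s own and, or, and not operators behave the same way on the values 0 and 1, so we can ask Python to rebuild this truth table for us. The snippet below is only a preview sketch (it uses a for loop and the int function, which appear later in the book):
for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(not a and b))   # one row of the truth table
Running it prints the four rows 0 0 0, 0 1 1, 1 0 0, and 1 1 0, matching the table above.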
A finite state machine consists of a set of states and transitions between states,
where each state represents the amount of progress made toward some goal. For example, a simple elevator, with states
representing floors, is a finite state machine, as illustrated below.
[State diagram: three states G, 1, and 2 (the floors). Pressing up moves G to 1 and 1 to 2; pressing down moves 2 to 1 and 1 to G; pressing down on G or up on 2 leaves the elevator where it is.]
States are represented by circles and transitions are arrows between circles. In this
elevator, there are only up and down buttons (no ability to choose your destination
floor when you enter). The label on each transition represents the button press
event that causes the transition to occur. For example, when we are on the ground
floor and the down button is pressed, we stay on the ground floor. But when the
up button is pressed, we transition to the first floor. Many other simple systems
can also be represented by finite state machines, such as vending machines, subway
doors, traffic lights, and toll booths. The implementation of a finite state machine in
a computer coordinates the fetch and execute cycle, as well as various intermediate
steps involved in executing machine language instructions.
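To make the elevator example concrete, here is one way a finite state machine’s transition table might be sketched in Python. Everything here (the table, the names, the button sequence) is our own illustration, and it uses a Python dictionary, a structure covered much later in the book:
transitions = {('G', 'up'): '1', ('G', 'down'): 'G',
               ('1', 'up'): '2', ('1', 'down'): 'G',
               ('2', 'up'): '2', ('2', 'down'): '1'}
state = 'G'                          # start on the ground floor
for button in ['up', 'up', 'down']:  # a sequence of button presses
    state = transitions[(state, button)]
print(state)                         # prints 1
Each press simply looks up the next state; that lookup-and-move step is all a finite state machine ever does.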
This question of whether a computational system is universal has its roots
in the very origins of computer science itself. In 1936, Alan Turing proposed an
abstract computational model, now called a Turing machine, that he proved could
compute any problem considered to be mathematically computable. As illustrated
in Figure 1.7, a Turing machine consists of a control unit containing a finite state
machine that can read from and write to an infinitely long tape. The tape contains
a sequence of “cells,” each of which can contain a single character. The tape is
initially inscribed with some sequence of input characters, and a pointer attached to
the control unit is positioned at the beginning of the input. In each step, the Turing
machine reads the character in the current cell. Then, based on what it reads and
the current state, the finite state machine decides whether to write a character in
the cell and whether to move its pointer one cell to the left or right. Not unlike
the fetch and execute cycle in a modern computer, the Turing machine repeats
this simple process as long as needed, until a designated final state is reached. The
output is the final sequence of characters on the tape.
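As a rough sketch of this read-write-move loop, here is a tiny Turing machine simulator in Python. The machine, its single state, and all of the names are our own invention for illustration; the machine simply flips every bit on its tape and halts at the first blank cell:
# rules: (state, symbol read) -> (symbol to write, how far to move, next state)
rules = {('flip', '0'): ('1', 1, 'flip'),
         ('flip', '1'): ('0', 1, 'flip'),
         ('flip', ' '): (' ', 0, 'halt')}
tape = list('0110') + [' ']   # the input, followed by a blank cell
position = 0
state = 'flip'
while state != 'halt':
    symbol = tape[position]                      # read the current cell
    write, move, state = rules[(state, symbol)]  # consult the finite state machine
    tape[position] = write                       # write to the current cell
    position = position + move                   # move the pointer
print(''.join(tape))                             # prints 1001 (followed by a blank)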
The Turing machine still stands as our modern definition of computability.
Exercises
1.4.1. Show how to convert the binary number 1101 to decimal.
1.4.2. Show how to convert the binary number 1111101000 to decimal.
1.4.3. Show how to convert the binary number 11.0011 to decimal.
1.4.4. Show how to convert the binary number 11.110001 to decimal.
1.4.5. Show how to convert the decimal number 22 to binary.
1.4.6. Show how to convert the decimal number 222 to binary.
1.4.7. See how closely you can represent the decimal number 0.1 in binary using six places
to the right of the binary point. What is the actual value of your approximation?
1.4.8. Consider the following 6 × 6 black and white image. Describe two plausible ways to
represent this image as a sequence of bits.
1.4.9. Design a truth table for the Boolean expression not (a and b).
1.4.10. Design a truth table for the Boolean expression not a or not b. Compare the
result to the truth table for not (a and b). What do you notice? The relationship
between these two Boolean expressions is one of De Morgan’s laws.
1.4.11. Design a truth table for the Boolean expression not a and not b. Compare the
result to the truth table for not (a or b). What do you notice? The relationship
between these two Boolean expressions is the other of De Morgan’s laws.
1.4.12. Design a finite state machine that represents a highway toll booth controlling a
single gate.
1.4.13. Design a finite state machine that represents a vending machine. Assume that the
machine only takes quarters and vends only one kind of drink, for 75 cents. First,
think about what the states should be. Then design the transitions between states.
Figure 1.8 An expanded (from Figure 1.2) illustration of the layers of functional
abstraction in a computer: a problem is solved by an algorithm, which is implemented
as a high-level program, which is compiled or interpreted into a machine language
program, which is executed (“run”) as a computation on a processor and memory
using Boolean logic.
1.5 SUMMARY
In the first chapter, we developed a top-to-bottom view of computation at many
layers of abstraction. As illustrated in Figure 1.8, we start with a well-defined
problem and solve it by designing an algorithm. This step is usually the most
difficult by far. Professional algorithm designers rely on experience, creativity, and
their knowledge of a battery of general design techniques to craft efficient, correct
algorithms. In this book, we will only scratch the surface of this challenging field.
Next, an algorithm can be programmed in a high-level programming language like
Python, which is translated into machine language before it can be executed on
a computer. The instructions in this machine language program may either be
executed directly on a processor or request that the operating system do something
on their behalf (e.g., saving a file or allocating more memory).
There are several good books available on the life of Alan Turing, the father of
computer science. The definitive biography, Alan Turing: The Enigma, was written
by Andrew Hodges [20].
CHAPTER 2
Elementary computations
It has often been said that a person does not really understand something until after teaching
it to someone else. Actually a person does not really understand something until after
teaching it to a computer, i.e., expressing it as an algorithm.
Donald E. Knuth
American Scientist (1973)
Numbers are some of the simplest and most familiar objects on which we
can compute. Numbers are also quite versatile; they can represent quantities,
measurements, and identifiers, and they provide a means of ordering items. In this chapter, we
will introduce some of the fundamentals of carrying out computations with numbers
in the Python programming language, and point out the ways in which numbers
in a computer sometimes behave differently from numbers in mathematics. In
later chapters, we will build on these ideas to design algorithms that solve more
sophisticated problems requiring objects like text, lists, tables, and networks.
To get started, launch the application called IDLE that comes with every Python
distribution (or another similar programming environment recommended by your
instructor). You should see a window appear with something like this at the top:
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "copyright", "credits" or "license()" for more information.
>>>
The program executing in this window is known as a Python shell. The first line tells
you which version of Python you are using (in this case, 3.4.2). The examples in this
book are based on Python version 3, so make sure that your version number starts
with a 3. If you need to install a newer version, you can find one at python.org.
The
>>>
on the fourth line in the IDLE window is called the prompt because it is prompting
you to type in a Python statement. To start, type in print(’Hello world!’) at
the prompt and hit return.
>>> print(’Hello world!’)
Hello world!
Congratulations, you have just written the traditional first program! This one-
statement program simply prints the characters in single quotes, called a character
string or just string, on the screen. Now you are ready to go!
As you read this book, we highly recommend that you do so in front of a
computer. Python statements that are preceded by a prompt, like those above, are
intended to be entered into the Python shell by the reader. Try out each example
for yourself. Then go beyond the examples, and experiment with your own ideas.
Instead of just wondering, “What if I did this?”, type it in and see what happens!
Similarly, when you read a “Reflection” question, pause for a moment and think
about the answer or perform an experiment to find the answer.
2.2 ARITHMETIC
We have to start somewhere, so it may as well be in elementary school, where you
learned your first arithmetic computations. Python, of course, can do arithmetic
too. At the prompt, type 8 * 9 and hit return.
>>> 8 * 9
72
>>>
The spaces around the * multiplication operator are optional and ignored. In general,
Python does not care if you include spaces in expressions, but you always want to
make your programs readable to others, and spaces often help.
Notice that the shell responded to our input with the result of the computation,
and then gave us a new prompt. The shell will continue this prompt–compute–
respond cycle until we quit (by typing quit()). Recall from Section 1.4 that Python
is an interpreted language; the shell is displaying the interpreter at work. Each time
we enter a statement at the prompt, the Python interpreter transparently converts
that statement to machine language, executes the machine language, and then prints
the result.
Now let’s try something that most calculators cannot do:
>>> 2 ** 100
1267650600228229401496703205376
Now try the same computation with a floating point base:
>>> 2.0 ** 100
1.2676506002282294e+30
This result looks different because adding a decimal point to the 2 caused Python
to treat the value as a floating point number instead of an integer. A floating
point number is any number with a decimal point. Floating point derives its name
from the manner in which the numbers are stored, a format in which the decimal
point is allowed to “float,” as in scientific notation. For example, 2.345 × 10^4 =
23.45 × 10^3 = 0.02345 × 10^6. To learn a bit more about floating point notation, see
Box 2.1. Whenever a floating point number is involved in a computation, the result
of the computation is also a floating point number. Very large floating point numbers,
like 1.2676506002282294e+30, are printed in scientific notation (the e stands for
“exponent”).
It is often convenient to enter numbers in scientific notation also. For example, know-
ing that the earth is about 4.54 billion years old, we can compute the approximate
number of days in the age of the earth with
>>> 4.54e9 * 365.25
1658235000000.0
Similarly, if the birth rate in the United States is 13.42 per 1,000 people, then the
number of babies born in 2012, when the population was estimated to be 313.9
million, was approximately
>>> 13.42e-3 * 313.9e6
4212538.0
Finite precision
In normal arithmetic, 2100 and 2.0100 are, of course, the same number. However, on
the previous page, the results of 2 ** 100 and 2.0 ** 100 were different. The first
answer was correct, while the second was off by almost 1.5 trillion! The problem
is that floating point numbers are stored differently from integers, and have finite
precision, meaning that the range of numbers that can be represented is limited.
Limited precision may mean, as in this case, that we get approximate results, or it
may mean that a value is too large or too small to even be approximated well. For
example, try:
>>> 10.0 ** 500
OverflowError: (34, ’Result too large’)
An overflow error means that the computer did not have enough space to represent
the correct value. A similar fatal error will occur if we try to do something illegal,
like divide by zero:
>>> 10.0 / 0
ZeroDivisionError: division by zero
Division
Python provides two different kinds of division: so-called “true division” and “floor
division.” The true division operator is the slash (/) character and the floor division
operator is two slashes (//). True division gives you what you would probably
expect. For example, 14 / 3 will give 4.666666666666667. On the other hand,
floor division rounds the quotient down to the nearest integer. For example, 14 // 3
will give the answer 4. (Rounding down to the nearest integer is called “taking the
floor” of a number in mathematics, hence the name of the operator.)
>>> 14 / 3
4.666666666666667
>>> 14 // 3
4
When both integers are positive, you can think of floor division as the “long division”
you learned in elementary school: floor division gives the whole quotient without the
remainder. In this example, dividing 14 by 3 gives a quotient of 4 and a remainder
of 2 because 14 is equal to 4 ⋅ 3 + 2. The operator to get the remainder is called the
“modulo” operator. In mathematics, this operator is denoted mod , e.g., 14 mod 3 = 2;
in Python it is denoted %. For example:
>>> 14 % 3
2
To see why the // and % operators might be useful, think about how you would
determine whether an integer is odd or even. An integer is even if it is evenly divisible
by 2; i.e., when you divide it by 2, the remainder is 0. So, to decide whether an
integer is even, we can “mod” the number by 2 and check the answer.
>>> 14 % 2
0
>>> 15 % 2
1
Reflection 2.1 The floor division and modulo operators also work with negative
numbers. Try some examples, and try to infer what is happening. What are the rules
governing the results?
Consider dividing some integer value x by another integer value y. The floor division
and modulo operators always obey the following rules:
1. x % y has the same sign as y, and
2. (x // y) * y + (x % y) is equal to x.
Confirm these observations yourself by computing
• -14 // 3 and -14 % 3
• 14 // -3 and 14 % -3
Order of operations
Now let’s try something just slightly more interesting: computing the area of a circle
with radius 10. (Recall the formula A = πr2 .)
>>> 3.14159 * 10 ** 2
314.159
Python evaluates arithmetic expressions according to the standard mathematical
order of operations, from highest to lowest precedence:
Operators Description
1. ** exponentiation (power)
2. +, - unary positive and negative, e.g., -(4 * 9)
3. *, /, //, % multiplication and division
4. +, - addition and subtraction
For example, suppose a company has 100 employees who each earn $18,000 per year,
plus a chief executive who earns $18 million per year. The average salary across all
101 people is
>>> ((100 * 18000) + 18e6) / 101
196039.60396039604
In this case, we needed parentheses around (100 * 18000) + 18e6 to compute the
total salaries before dividing by the number of employees. However, we also included
parentheses around 100 * 18000 for clarity.
We should also point out that the answers in the previous two examples were
floating point numbers, even though some of the numbers in the original expression
were integers. When Python performs arithmetic with an integer and a floating
point number, it first converts the integer to floating point. For example, in the last
expression, (100 * 18000) evaluates to the integer value 1800000. Then 1800000
is added to the floating point number 18e6. Since these two operands are of different
types, 1800000 is converted to the floating point equivalent 1800000.0 before
adding. Then the result is the floating point value 19800000.0. Finally, 19800000.0
is divided by the integer 101. Again, 101 is converted to floating point first, so
the actual division operation is 19800000.0 / 101.0, giving the final answer. This
process is also depicted below.
In most cases, this works exactly as we would expect, and we do not need to think
about this automatic conversion taking place.
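If you want to watch the conversion happen, the built-in type function reports the type of any value. We use it here only as a check; it is not needed for ordinary arithmetic:
>>> type(100 * 18000)
<class 'int'>
>>> type((100 * 18000) + 18e6)
<class 'float'>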
Complex numbers
Although we will not use them in this book, it is worth pointing out that Python
can also handle complex numbers. A complex number has both a real part and
an imaginary part involving the imaginary unit i, which has the property that
i^2 = −1. In Python, a complex number like 3.2 + 2i is represented as 3.2 + 2j. (The
letter j is used instead of i because in some fields, such as electrical engineering, i
has another well-established meaning that could lead to ambiguities.) Most of the
normal arithmetic operators work on complex numbers as well. For example,
>>> (5 + 4j) + (-4 + -3.1j)
(1+0.8999999999999999j)
>>> (23 + 6j) / (1 + 2j)
(7-8j)
>>> (1 + 2j) ** 2
(-3+4j)
>>> 1j ** 2
(-1+0j)
Exercises
Use the Python interpreter to answer the following questions. Where appropriate, provide
both the answer and the Python expression you used to get it.
2.2.1. The Library of Congress stores its holdings on 838 miles of shelves. Assuming an
average book is one inch thick, how many books would this hold? (There are 5,280
feet in one mile.)
2.2.2. If I gave you a nickel and promised to double the amount you have every hour for
the next 24, how much money would you have at the end? What if I only increased
the amount by 50% each hour, how much would you have? Use exponentiation to
compute these quantities.
2.2.3. The Library of Congress stores its holdings on 838 miles of shelves. How many
round trips is this between Granville, Ohio and Columbus, Ohio?
2.2.4. The earth is estimated to be 4.54 billion years old. The oldest known fossils of
anatomically modern humans are about 200,000 years old. What fraction of the
earth’s existence have humans been around? Use Python’s scientific notation to
compute this.
2.2.5. If you counted at an average pace of one number per second, how many years
would it take you to count to 4.54 billion? Use Python’s scientific notation to
compute this.
2.2.6. Suppose the internal clock in a modern computer can “count” about 2.8 billion
ticks per second. How long would it take such a computer to tick 4.54 billion times?
2.2.7. A hard drive in a computer can hold about a terabyte (240 bytes) of information.
An average song requires about 8 megabytes (8 × 220 bytes). How many songs can
the hard drive hold?
2.2.8. What is the value of each of the following Python expressions? Make sure you
understand why in each case.
(a) 15 * 3 - 2
(b) 15 - 3 * 2
(c) 15 * 3 // 2
(d) 15 * 3 / 2
(e) 15 * 3 % 2
(f) 15 * 3 / 2e0
2.3 WHAT’S IN A NAME?
>>> pi = 3.14159
>>> radius = 10
The equal sign (=) is called the assignment operator because it is used to assign
values to names. In this example, we assigned the value 3.14159 to the name pi and
we assigned the value 10 to the name radius. It is convenient to think of variable
names as “Sticky notes”1 attached to locations in the computer’s memory.
[memory diagram: pi → 3.14159, radius → 10]
Recall from Section 1.4 that numbers are stored in cells in a computer’s memory.
These cells are analogous to post office boxes; the cell’s address is like the number
on the post office box and the value stored in the cell is like the box’s content. In
the picture above, each rectangle containing a number represents a memory cell. A
variable name is a reference to a particular cell, analogous to a Sticky note with
a name on it on a post office box. In this picture, the name pi refers to the cell
containing the value 3.14159 and the name radius refers to the cell containing the
value 10. We do not actually know the addresses of each of these memory cells, but
there is no reason for us to because we will always refer to them by a variable name.
As with a sticky note, a value can be easily reassigned to a different name at a
later time. For example, if we now assign a different value to the name radius, the
radius Sticky note in the picture moves to a different value.
1 “Sticky note” is a registered trademark of the BIC Corporation.
>>> radius = 17
[memory diagram: pi → 3.14159, radius → 17 (the value 10 is left without a name)]
After the reassignment, a reference to a different memory cell containing the value
17 is assigned radius. The value 10 may briefly remain in memory without a name
attached to it, but since we can no longer access that value without a reference to
it, the Python “garbage collection” mechanism will soon free up the memory that it
occupies.
Naming values serves three important purposes. First, assigning oft-used values
to descriptive names can make our algorithms much easier to understand. For
example, if we were writing an algorithm to model competition between two species,
such as wolves and moose (as in Project 4.4), using names for the birth rates of
the two species, like birthRateMoose and birthRateWolves, rather than unlabeled
values, like 0.5 and 0.005, makes the algorithm much easier to read. Second, as we
will see in Section 3.5, naming inputs will allow us to generalize algorithms so that,
instead of being tied to one particular input, they work for a variety of possible
inputs. Third, names will serve as labels for computed values that we wish to use
later, obviating the need to compute them again at that time.
Notice that we are using descriptive names, not just single letters, as is the
convention in algebra. As noted above, using descriptive names is always important
when you are writing programs because it makes your work more accessible to
others. In the “real world,” programming is almost always a collaborative endeavor,
so it is important for others to be able to understand your intentions. Sometimes
assigning an intermediate computation to a descriptive name, even if not required,
can improve readability.
Names in Python can be any sequence of characters drawn from letters, digits,
and the underscore (_) character, but they may not start with a digit. You also
cannot use any of Python’s keywords, shown in Table 2.2. Keywords are elements of
the Python language that have predefined meanings. We saw a few of these already
in the snippets of Python code in Chapter 1 (def, for, in, and return), and we
will encounter most of the others as we progress through this book.
Let’s try breaking some of these naming rules to see what happens.
>>> 49ers = 50
SyntaxError: invalid token
This syntax error occurs because a name may not begin with a digit. A
dash/hyphen/minus sign symbol (-) is not allowed in a name either, because Python
would interpret the symbol as the minus operator. Instead, we can use the underscore
(_) character (i.e., my_age) or vary the capitalization (i.e., myAge) to distinguish
the two words in the name.
In addition to assigning constant values to names, we can assign names to the
results of whole computations.
>>> volume = (4 / 3) * pi * (radius ** 3)
When you assign values to names in the Python shell, the shell does not offer any
feedback. But you can view the value assigned to a variable by simply typing its
name.
>>> volume
20579.50889333333
So we see that we have assigned the value 20579.50889333333 to the name volume,
as depicted below. (The values assigned to pi and radius are unchanged.)
[memory diagram: pi → 3.14159, radius → 17, volume → 20579.50...]
Alternatively, you can display a value using print, followed by the value you wish
to see in parentheses. Soon, when we begin to write programs outside of the shell,
print will be the only way to display values.
>>> print(volume)
20579.50889333333
Now let’s change the value of radius again:
>>> radius = 20
[memory diagram: pi → 3.14159, radius → 20, volume → 20579.50...]
Since the value of volume was based on the value of radius, it makes sense to check
whether the value of volume has also changed. Try it:
>>> volume
20579.50889333333
What’s going on here? While the value assigned to radius has changed, the value
assigned to volume has not. This example demonstrates that assignment is a one-
time event; Python does not “remember” how the value of volume was computed.
Put another way, an assignment is not creating an equivalence between a name and
a computation. Rather, it performs the computation on the righthand side of the
assignment operator only when the assignment happens, and then assigns the result
to the name on the lefthand side. That value remains assigned to the name until
it is explicitly assigned some other value or it ceases to exist. To compute a new
value for volume based on the new value of radius, we would need to perform the
volume computation again.
>>> volume = (4 / 3) * pi * (radius ** 3)
>>> volume
33510.29333333333
Now the value assigned to volume has changed, due to the explicit assignment
statement above.
[memory diagram: pi → 3.14159, radius → 20, volume → 33510.29...]
Let’s try another example. The formula for computing North American wind
chill temperatures, in degrees Celsius, is
W = 13.12 + 0.6215 t + (0.3965 t − 11.37) v^0.16
where t is the ambient temperature in degrees Celsius and v is the wind speed in
km/h.2 To compute the wind chill for a temperature of −3°C and wind speed of 13
km/h, we will first assign the temperature and wind speed to two variables:
2 Technically, wind chill is only defined at or below 10°C and for wind speeds above 4.8 km/h.
>>> temp = -3
>>> wind = 13
Then we type in the formula in Python, assigning the result to the name chill.
>>> chill = 13.12 + 0.6215 * temp + (0.3965 * temp - 11.37) * wind**0.16
>>> chill
-7.676796032159553
Notice once again that changing temp does not change chill:
>>> temp = 4.0
>>> chill
-7.676796032159553
To recompute the value of chill with the new temperature, we need to re-enter
the entire formula above, which will now use the new value assigned to temp.
>>> chill = 13.12 + 0.6215 * temp + (0.3965 * temp - 11.37) * wind**0.16
>>> chill
0.8575160333891443
This process is, of course, very tedious and error-prone. Fortunately, there is a much
better way to define such computations, as we will see in Section 3.5.
It is very important to realize that, despite the use of the equals sign, assignment
is not the same as equality. In other words, when we execute an assignment statement
like temp = 4.0, we are not asserting that temp and 4.0 are equivalent. Instead,
we are assigning a value to a name in two steps:
1. Evaluate the expression on the righthand side of the assignment operator.
2. Assign the resulting value to the name on the lefthand side of the assignment
operator.
Now consider a second assignment statement.
>>> radius = radius + 1
If the equals sign denoted equality, then this second statement would not make
any sense! However, if we interpret the statement using the two-step process, it
is perfectly reasonable. First, we evaluate the expression on the righthand side,
ignoring the lefthand side entirely. Since, at this moment, the value 20 is assigned
to radius, the righthand side evaluates to 21. Second, we assign the value 21 to
the name radius. So this statement has added 1 to the current value of radius, an
operation we refer to as an increment. To verify that this is what happened, check
the value of radius:
>>> radius
21
What if we had not assigned a value to radius before we tried to increment it? To
find out, we have to use a name that has not yet been assigned anything.
>>> trythis = trythis + 1
NameError: name ’trythis’ is not defined
This name error occurred because, when the Python interpreter tried to evaluate
the righthand side of the assignment, it found that trythis was not assigned a
value, i.e., it was not defined. So we need to make sure that we first define any
variable that we try to increment. This sounds obvious but, in the context of some
larger programs later on, it might be easy to forget.
Let’s look at one more example. Suppose we want to increment the number of
minutes displayed on a digital clock. To initialize the minutes count and increment
the value, we essentially copy what we did above.
>>> minutes = 0
>>> minutes = minutes + 1
>>> minutes
1
>>> minutes = minutes + 1
>>> minutes
2
But incrementing is not going to work properly when minutes reaches 59, and we
need the value of minutes to wrap around to 0 again. The solution lies with the
modulo operator: if we mod the incremented value by 60 each time, we will get
exactly the behavior we want. We can see what happens if we reset minutes to 58
and increment a few times:
>>> minutes = 58
>>> minutes = (minutes + 1) % 60
>>> minutes
59
>>> minutes = (minutes + 1) % 60
>>> minutes
0
>>> minutes = (minutes + 1) % 60
>>> minutes
1
When minutes is between 0 and 58, (minutes + 1) % 60 gives the same result
as minutes + 1 because minutes + 1 is less than 60. But when minutes is 59,
(minutes + 1) % 60 equals 60 % 60, which is 0. This example illustrates why
arithmetic using the modulo operator, formally called modular arithmetic, is often
also called “clock arithmetic.”
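The same wrap-around trick works for any cyclic counter. For example, a 24-hour clock would use 24 in place of 60; this small variation is our own sketch rather than an example from the text:
>>> hours = 23
>>> hours = (hours + 1) % 24
>>> hours
0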
Exercises
Use the Python interpreter to answer the following questions. Where appropriate, provide
both the answer and the Python expression you used to get it.
2.3.1. Every cell in the human body contains about 6 billion base pairs of DNA (3 billion
in each set of 23 chromosomes). The distance between each base pair is about 3.4
angstroms (3.4 × 10−10 meters). Uncoiled and stretched, how long is the DNA in
a single human cell? There are about 50 trillion cells in the human body. If you
stretched out all of the DNA in the human body end to end, how long would it
be? How many round trips to the sun is this? The distance to the sun is about
149,598,000 kilometers. Write Python statements to compute the answer to each
of these three questions. Assign variables to hold each of the values so that you
can reuse them to answer subsequent questions.
2.3.2. Set a variable named radius to have the value 10. Using the formula for the area
of a circle (A = πr2 ), assign to a new variable named area the area of a circle with
radius equal to your variable radius. (The number 10 should not appear in the
formula.)
2.3.3. Now change the value of radius to 15. What is the value of area now? Why?
2.3.4. Suppose we wanted to swap the values of two variables named x and y. Why
doesn’t the following work? Show a method that does work.
x = y
y = x
2.3.5. What are the values of x and y at the end of the following? Make sure you
understand why this is the case.
x = 12.0
y = 2 * x
x = 6
2.3.6. What is the value of x at the end of the following? Make sure you understand why
this is the case.
x = 0
x = x + 1
x = x + 1
x = x + 1
2.3.7. What are the values of x and y at the end of the following? Make sure you
understand why this is the case.
x = 12.0
y = 6
y = y * x
2.3.8. Given a variable x that refers to an integer value, show how to extract the individual
digits in the ones, tens and hundreds places. (Use the // and % operators.)
2.4 USING FUNCTIONS
[diagram: the input 4 enters a function f(x) = x^2 + 3, which outputs 19]
Does this look familiar? Like the example problems in Figure 1.1 (at the beginning
of Chapter 1), a mathematical function has an input, an output, and an algorithm
that computes the output from the input. So a mathematical function is another
type of functional abstraction.
Built-in functions
As we suggested above, in a programming language, functional abstractions are
also implemented as functions. In Chapter 3, we will learn how to define our own
functions to implement a variety of functional abstractions, most of which are much
richer than simple mathematical functions. In this section, we will set the stage by
looking at how to use some of Python’s built-in functions. Perhaps the simplest of
these is the Python function abs, which outputs the absolute value (i.e., magnitude)
of its argument.
>>> result2 = abs(-5.0)
>>> result2
5.0
On the righthand side of the assignment statement above is a function call , also
called a function invocation. In this case, the abs function is being called with
argument -5.0. The function call executes the (hidden) algorithm associated with
the abs function, and abs(-5.0) evaluates to the output of the function, in this
case, 5.0. The output of the function is also called the function’s return value.
Equivalently, we say that the function returns its output value. The return value of
the function call, in this case 5.0, is then assigned to the variable name result2.
Two other helpful functions are float and int. The float function converts its
argument to a floating point number, if it is not already.
>>> float(3)
3.0
>>> float(2.718)
2.718
The int function converts its argument to an integer by truncating it, i.e., removing
the fractional part to the right of the decimal point. For example,
>>> int(2.718)
2
>>> int(-1.618)
-1
Normally, we would want to truncate or round the final wind chill temperature to
an integer. (It would be surprising to hear the local television meteorologist tell
us, “the expected wind chill will be −7.676796032159553○ C today.”) We can easily
compute the truncated wind chill with
>>> truncChill = int(chill)
>>> truncChill
-7
There is also a round function that we can use to round the temperature instead.
>>> roundChill = round(chill)
>>> roundChill
-8
Alternatively, if we do not need to retain the original value of chill, we can simply
assign the modified value to the name chill:
>>> chill = round(chill)
>>> chill
-8
Function arguments can be more complex than just single constants and variables;
they can be anything that evaluates to a value. For example, if we want to convert
the wind chill above to Fahrenheit and then round the result, we can do so like this:
>>> round(chill * 9/5 + 32)
18
The expression in parentheses is evaluated first, and then the result of this expression
is used as the argument to the round function.
Not all functions output something. For example, the print function, which
simply prints its arguments to the screen, does not. Try this:
>>> x = print(chill)
-8
>>> print(x)
None
The variable x was assigned whatever the print function returned, which is different
from what it printed. When we print x, we see that it was assigned something called
None. None is a Python keyword that essentially represents “nothing.” Any function
that does not define a return value itself returns None by default. We will see this
again shortly when we learn how to define our own functions.
Strings
As in the first “Hello world!” program, we often use the print function to print
text:
>>> print(’I do not like green eggs and ham.’)
I do not like green eggs and ham.
The print function can also take multiple arguments, separated by commas:
>>> print(’The wind chill is’, chill, ’degrees Celsius.’)
The wind chill is -8 degrees Celsius.
The first and last arguments are strings, and the second is the variable we defined
above. You will notice that a space is inserted between arguments, and there are no
quotes around the variable name chill.
Reflection 2.2 Why do you think quotation marks are necessary around strings?
What error do you think would result from typing the following instead? (Try it.)
>>> print(I do not like green eggs and ham.)
The quotation marks are necessary because otherwise Python has no way to dis-
tinguish text from a variable or function name. Without the quotation marks, the
Python interpreter will try to make sense of each argument inside the parentheses,
assuming that each word is a variable or function name, or a reserved word in the
Python language. Since this sequence of words does not follow the syntax of the
language, and most of these names are not defined, the interpreter will print an
error.
String values can also be assigned to variables and manipulated with the + and
* operators, but the operators have different meanings. The + operator concatenates
two strings into one longer string. For example,
>>> first = ’Monty’
>>> last = ’Python’
>>> name = first + last
>>> print(name)
MontyPython
To insert a space between the first and last names, we can use another + operator:
>>> name = first + ’ ’ + last
>>> print(name)
Monty Python
Alternatively, if we just wanted to print the full name, we could have bypassed the
name variable altogether.
>>> print(first + ’ ’ + last)
Monty Python
Reflection 2.3 Why do we not want quotes around first and last in the state-
ment above? What would happen if we did use quotes?
Placing quotes around first and last would cause Python to interpret them
literally as strings, rather than as variables:
>>> print(’first’ + ’ ’ + ’last’)
first last
Applied to strings, the * operator becomes the repetition operator , which repeats a
string some number of times. The operand on the left side of the repetition operator
is a string and the operand on the right side is an integer that indicates the number
of times to repeat the string.
>>> last * 4
’PythonPythonPythonPython’
>>> print(first + ’ ’ * 10 + last)
Monty Python
We can also interactively query for string input in our programs with the input
function. The input function takes a string prompt as an argument and returns a
string value that is typed in response. For example, the following statement prompts
for your name and prints a greeting.
>>> name = input(’What is your name? ’)
What is your name? George
>>> print(’Howdy,’, name, ’!’)
Howdy, George !
The call to the input function above prints the string ’What is your name? ’ and
then waits. After you type something (above, we typed George, shown in red) and
hit Return, the text that we typed is returned by the input function and assigned
to the variable called name. The value of name is then used in the print function.
Reflection 2.4 How can we use the + operator to avoid the inserted space before
the exclamation point above?
To avoid the space, we can construct the string that we want to print using the
concatenation operator instead of allowing the print function to insert spaces
between the arguments:
>>> print(’Howdy, ’ + name + ’!’)
Howdy, George!
The input function always returns a string. If we want to use the response as a
number, we must convert it first:
>>> text = input(’How old are you? ’)
How old are you? 18
>>> age = int(text)
>>> print(’You have been alive for’, age * 365, ’days!’)
You have been alive for 6570 days!
In the response to the prompt above, we typed 18 (shown in red). Then the input
function assigned what we typed to the variable text as a string, in this case ’18’
(notice the quotes). Then, using the int function, the string is converted to the
integer value 18 (no quotes) and assigned to the variable age. Now that age is a
numerical value, it can be used in the arithmetic expression age * 365.
Reflection 2.5 Type the statements above again, omitting age = int(text):
>>> text = input(’How old are you? ’)
How old are you? 18
>>> print(’You have been alive for’, text * 365, ’days!’)
Because the value of text is a string rather than an integer, the * operator was
interpreted as the repetition operator instead!
Sometimes we want to perform conversions in the other direction, from a numer-
ical value to a string. For example, if we printed the number of days in the following
form instead, it would be nice to remove the space between the number of days and
the period.
>>> print(’The number of days in your life is’, age * 365, ’.’)
The number of days in your life is 6570 .
But the following will not work because we cannot concatenate the numerical value
age * 365 with the string ’.’.
>>> print(’The number of days in your life is’, age * 365 + ’.’)
TypeError: unsupported operand type(s) for +: ’int’ and ’str’
To make this work, we need to first convert age * 365 to a string with the str
function:
>>> print(’The number of days in your life is’, str(age * 365) + ’.’)
The number of days in your life is 6570.
Assuming the value 18 is assigned to age, the str function converts the integer
value 6570 to the string ’6570’ before concatenating it to ’.’.
For a final, slightly more involved, example here is how we could prompt for a
temperature and wind speed with which to compute the current wind chill.
>>> text = input(’Temperature: ’)
Temperature: 2.3
>>> temp = float(text)
>>> text = input(’Wind speed: ’)
Wind speed: 17.5
>>> wind = float(text)
>>> chill = 13.12 + 0.6215 * temp + (0.3965 * temp - 11.37) * wind**0.16
>>> chill = round(chill)
>>> print(’The wind chill is’, chill, ’degrees Celsius.’)
The wind chill is -2 degrees Celsius.
Reflection 2.6 Why do we use float above instead of int? Replace one of the
calls to float with a call to int, and then enter a floating point value (like 2.3).
What happens and why?
Modules
In Python, there are many more mathematical functions available in a module
named math. A module is an existing Python program that contains predefined
values and functions that you can use. To access the contents of a module, we use
the import keyword. To access the math module, type:
>>> import math
After a module has been imported, we can access the functions in the module by
preceding the name of the function we want with the name of the module, separated
by a period (.). For example, to take the square root of 5, we can use the square
root function math.sqrt:
>>> math.sqrt(5)
2.23606797749979
Other commonly used functions from the math module are listed in Appendix B.1.
The math module also contains two commonly used constants: pi and e. Our volume
computation earlier would have been more accurately computed with:
>>> radius = 20
>>> volume = (4 / 3) * math.pi * (radius ** 3)
>>> volume
33510.32163829113
Notice that, since pi and e are variable names, not functions, there are no parentheses
after their names.
Function calls can also be used in longer expressions, and as arguments of other
functions. In this case, it is useful to think about a function call as equivalent to
the value that it returns. For example, we can use the math.sqrt function in the
computation of the volume of a tetrahedron with edge length h = 7.5, using the
formula V = h^3 / (6 √2).
>>> h = 7.5
>>> volume = h ** 3 / (6 * math.sqrt(2))
>>> volume
49.71844555217912
In the parentheses, the value of math.sqrt(2) is computed first, and then multiplied
by 6. Finally, h ** 3 is divided by this result, and the answer is assigned to volume.
If we want the rounded volume, we can use the entire volume computation as the
argument to the round function:
>>> volume = round(h ** 3 / (6 * math.sqrt(2)))
>>> volume
50
round( h ** 3 / (6 * math.sqrt(2)) )
In this expression, h ** 3 evaluates to 421.875 and math.sqrt(2) evaluates to
1.4142...; then 6 * math.sqrt(2) evaluates to 8.4852...; then 421.875 / 8.4852...
evaluates to 49.7184...; and finally round(49.7184...) evaluates to 50.
Now suppose we wanted to find the cosine of a 52° angle. We can use the math.cos
function to compute the cosine, but the Python trigonometric functions expect
their arguments to be in radians instead of degrees. (360 degrees is equivalent to 2π
radians.) Fortunately, the math module provides a function named radians that
converts degrees to radians. So we can find the cosine of a 52° angle like this:
>>> math.cos(math.radians(52))
0.6156614753256583
The function call math.radians(52) is evaluated first, giving the equivalent of 52○
in radians, and this result is used as the argument to the math.cos function:
math.cos( math.radians(52) )
Here, math.radians(52) evaluates to 0.9075..., and then math.cos(0.9075...)
evaluates to 0.6156....
Finally, we note that, if you need to compute with complex numbers, you will want
to use the cmath module instead of math. The names of the functions in the two
modules are largely the same, but the versions in the cmath module understand
complex numbers. For example, attempting to find the square root of a negative
number using math.sqrt will result in an error:
>>> math.sqrt(-1)
ValueError: math domain error
This value error indicates that -1 is outside the domain of values for the math.sqrt
function. On the other hand, calling the cmath.sqrt function to find √−1 returns
the value of i:
>>> import cmath
>>> cmath.sqrt(-1)
1j
Exercises
Use the Python interpreter to answer the following questions. Where appropriate, provide
both the answer and the Python expression you used to get it.
2.4.1. Try taking the absolute value of both −8 and −8.0. What do you notice?
2.4.2. Find the area of a circle with radius 10. Use the value of π from the math module.
2.4.3. The geometric mean of two numbers is the square root of their product. Find the
geometric mean of 18 and 31.
2.4.4. Suppose you have P dollars in a savings account that will pay interest rate r,
compounded n times per year. After t years, you will have
P (1 + r/n)^(nt)
dollars in your account. If the interest were compounded continuously (i.e., with n
approaching infinity), you would instead have
P e^(rt)
dollars after t years, where e is Euler’s number, the base of the natural logarithm.
Suppose you have P = $10, 000 in an account paying 1% interest (r = 0.01),
compounding monthly. How much money will you have after t = 10 years? How
much more money would you have after 10 years if the interest were compounded
continuously instead? (Use the math.exp function.)
2.4.5. Show how you can use the int function to truncate any positive floating point
number x to two places to the right of the decimal point. In other words, you want
to truncate a number like 3.1415926 to 3.14. Your expression should work with
any value of x.
2.4.6. Show how you can use the int function to find the fractional part of any positive
floating point number. For example, if the value 3.14 is assigned to x, you want to
output 0.14. Your expression should work with any value of x.
2.4.7. The well-known quadratic formula, shown below, gives solutions to the quadratic
equation ax^2 + bx + c = 0.
x = (−b ± √(b^2 − 4ac)) / (2a)
Show how you can use this formula to compute the two solutions to the equation
3x^2 + 4x − 5 = 0.
2.4.8. Suppose we have two points (x1, y1) and (x2, y2). The distance between them is
equal to √((x1 − x2)^2 + (y1 − y2)^2).
Show how to compute this in Python. Make sure you test your answer with real
values.
2.4.9. A parallelepiped is a three-dimensional box in which the six sides are parallelograms.
The volume of a parallelepiped is
V = abc √(1 + 2 cos(x) cos(y) cos(z) − cos(x)^2 − cos(y)^2 − cos(z)^2)
where a, b, and c are the edge lengths, and x, y, and z are the angles between the
edges, in radians. Show how to compute this in Python. Make sure you test your
answer with real values.
2.4.10. Repeat the previous exercise, but now assume that the angles are given to you in
degrees.
2.4.11. Repeat Exercise 2.4.4, but prompt for each of the four values first using the input
function, and then print the result.
2.4.12. Repeat Exercise 2.4.7, but prompt for each of the values of a, b, and c first using
the input function, and then print the results.
2.4.13. The following program implements a Mad Lib.
adj1 = input(’Adjective: ’)
noun1 = input(’Noun: ’)
noun2 = input(’Noun: ’)
adj2 = input(’Adjective: ’)
noun3 = input(’Noun: ’)
2.5 BINARY ARITHMETIC
Adding binary numbers works just like adding decimal numbers: we add column by
column, from right to left, carrying whenever a column’s sum requires more than one
digit. For example, let’s add the binary numbers 1110 and 0111. In the rightmost
column, 0 + 1 = 1:
1 1 1 0
+ 0 1 1 1
1
In the next column, we have 1 + 1 = 10. Since the answer contains more than one
bit, we carry the 1.
1
1 1 1 0
+ 0 1 1 1
0 1
In the next column, we have 1 + 1 = 10 again, but with a carry bit as well. Adding
in the carry, we have 10 + 1 = 11 (or 2 + 1 = 3 in decimal). So the answer for the
column is 1, with a carry of 1.
1 1
1 1 1 0
+ 0 1 1 1
1 0 1
Finally, in the leftmost column, with the carry, we have 1 + 0 + 1 = 10. We write the
0 and carry the 1, and we are done.
1 1 1
1 1 1 0
+ 0 1 1 1
1 0 1 0 1
We can easily check our work by converting everything to decimal. The top number
in decimal is 8 + 4 + 2 + 0 = 14 and the bottom number in decimal is 0 + 4 + 2 + 1 = 7.
Our answer in decimal is 16 + 4 + 1 = 21. Sure enough, 14 + 7 = 21.
Finite precision
Although Python integers can store arbitrarily large values, this is not true at the
machine language level. In other words, Python integers are another abstraction
built atop the native capabilities of the computer. At the machine language level
(and in most other programming languages), every integer is stored in a fixed amount
of memory, usually four bytes (32 bits). This is another example of finite precision,
and it means that sometimes the result of an arithmetic operation is too large to fit.
We can illustrate this by revisiting the previous problem, but assuming that we
only have four bits in which to store each integer. When we add the four-bit integers
1110 and 0111, we arrived at a sum, 10101, that requires five bits to be represented.
When a computer encounters this situation, it simply discards the leftmost bit. In
our example, this would result in an incorrect answer of 0101, which is 5 in decimal.
Fortunately, there are ways to detect when this happens, which we leave to you to
discover as an exercise.
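We can mimic this four-bit behavior in Python by keeping only the lowest four bits of a sum with the modulo operator. Python’s own integers never overflow, so the masking below is just our own sketch of what the hardware does:
>>> (0b1110 + 0b0111) % 2 ** 4    # 14 + 7, keeping only the lowest four bits
5
The 0b prefix lets us write an integer directly in binary.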
Negative integers
We assumed above that the integers we were adding were positive, or, in program-
ming terminology, unsigned. But of course computers must also be able to handle
arithmetic with signed integers, both positive and negative.
Everything, even a negative sign, must be stored in a computer in binary. One
option for representing negative integers is to simply reserve one bit in a number to
represent the sign, say 0 for positive and 1 for negative. For example, if we store
every number with eight bits and reserve the first (leftmost) bit for the sign, then
00110011 would represent 51 and 10110011 would represent −51. This approach is
known as sign and magnitude notation. The problem with this approach is that the
computer then has to detect whether a number is negative and handle it specially
when doing arithmetic.
For example, suppose we wanted to add −51 and 102 in sign and magnitude
notation. In this notation, −51 is 10110011 and 102 is 01100110. First, we notice
that 10110011 is negative because it has 1 as its leftmost bit and 01100110 is
positive because it has 0 as its leftmost bit. So we need to subtract positive 51 from
102:
  0 1 1 0 0 1 1 0   ← 102
− 0 0 1 1 0 0 1 1   ← 51
  0 0 1 1 0 0 1 1   ← 51
Borrowing in binary works the same way as in decimal, except that we borrow a
2 (10 in binary) instead of a 10. Finally, we leave the sign of the result as positive
because the largest operand was positive.
To avoid these complications, computers use a clever representation called two’s
complement notation. Integers stored in two’s complement notation can be added
directly, regardless of their sign. The leftmost bit is also the sign bit in two’s
complement notation, and positive numbers are stored in the normal way, with
leading zeros if necessary to fill out the number of bits allocated to an integer. To
convert a positive number to its negative equivalent, we invert every bit to the left
of the rightmost 1. For example, since 51 is represented in eight bits as 00110011,
−51 is represented as 11001101. Since 4 is represented in eight bits as 00000100,
−4 is represented as 11111100.
To illustrate how addition works in two’s complement notation, let’s once again
add −51 and 102:
    1 1 1 1
  1 1 0 0 1 1 0 1   ← −51
+ 0 1 1 0 0 1 1 0   ← 102
1 0 0 1 1 0 0 1 1   ← 51
As a final step in the addition algorithm, we always disregard an extra carry bit. So,
indeed, in two’s complement, −51 + 102 = 51.
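Python’s integers are not limited to eight bits, but we can peek at eight-bit two’s complement patterns by reducing values modulo 256, which keeps only the lowest eight bits. This is our own sketch for checking the example above:
>>> -51 % 256            # the eight-bit pattern for -51, viewed as an unsigned value
205
>>> bin(205)
'0b11001101'
>>> (-51 + 102) % 256    # add, then keep only eight bits
51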
Designing an adder
Let’s look at how to build an adder that takes in two single bit inputs and outputs a
two bit answer. We will name the rightmost bit in the answer the “sum” and the
leftmost bit the “carry.” So we want our abstract adder to look like this:
[diagram: inputs a and b enter a box labeled “+”, which outputs sum and carry]
The two single bit inputs enter on the left side, and the two outputs exit on the
right side. Our goal is to replace the inside of this “black box” with an actual logic
circuit that computes the two outputs from the two inputs.
The first step is to design a truth table that represents what the values of sum
and carry should be for all of the possible input values:
a b carry sum
0 0 0 0
0 1 0 1
1 0 0 1
1 1 1 0
Notice that the value of carry is 0, except for when a and b are both 1, i.e., when
we are computing 1 + 1. Also, notice that, listed in this order (carry, sum), the two
output bits can also be interpreted as a two bit sum: 0 + 0 = 00, 0 + 1 = 1, 1 + 0 = 1,
and 1 + 1 = 10. (As in decimal, a leading 0 contributes nothing to the value of a
number.)
Next, we need to create an equivalent Boolean expression for each of the two
outputs in this truth table. We will start with the sum column. To convert this
column to a Boolean expression, we look at the rows in which the output is 1. In
this case, these are the second and third rows. The second row says that we want
sum to be 1 when a is 0 and b is 1. The and in this sentence is important; in order
for an and expression to be 1, both inputs must be 1. But, in this case, a is 0 so
we need to flip it with not a. The b input is already 1, so we can leave it alone.
Putting these two halves together, we have not a and b. Now the third row says
that we want sum to be 1 when a is 1 and b is 0. Similarly, we can convert this to
the Boolean expression a and not b.
a b carry sum
0 0 0 0
0 1 0 1 ←Ð not a and b
1 0 0 1 ←Ð a and not b
1 1 1 0
Figure 2.1 Simple electrical implementations of an (a) and and (b) or gate.
Finally, let’s combine these two expressions into one expression for the sum column:
taken together, these two rows are saying that sum is 1 if a is 0 and b is 1, or if a is
1 and b is 0. In other words, we need at least one of these two cases to be 1 for the
sum column to be 1. This is just equivalent to (not a and b) or (a and not b). So
this is the final Boolean expression for the sum column.
Now look at the carry column. The only row in which the carry bit is 1 says that
we want carry to be 1 if a is 1 and b is 1. In other words, this is simply a and b. In
fact, if you look at the entire carry column, you will notice that this column is the
same as in the truth table for a and b. So, to compute the carry, we just compute a
and b.
a b carry sum
0 0 0 0
0 1 0 1
1 0 0 1
1 1 1 0
↑ ↑
a and b (not a and b) or (a and not b)
Implementing an adder
To implement our adder, we need physical devices that implement each of the
binary operators. Figure 2.1(a) shows a simple electrical implementation of an and
operator. Imagine that electricity is trying to flow from the positive terminal on
the left to the negative terminal on the right and, if successful, light up the bulb.
The binary inputs, a and b, are each implemented with a simple switch. When the
switch is open, it represents a 0, and when the switch is closed, it represents a 1.
The light bulb represents the output (off = 0 and on = 1). Notice that the bulb will
only light up if both of the switches are closed (i.e., both of the inputs are 1). An
or operator can be implemented in a similar way, represented in Figure 2.1(b). In
[Figure 2.2: standard schematic symbols for the and, or, and not gates]
this case, the bulb will light up if at least one of the switches is closed (i.e., if at
least one of the inputs is 1).
Physical implementations of binary operators are called logic gates. It is inter-
esting to note that, although modern gates are implemented electronically, they can
be implemented in other ways as well. Enterprising inventors have implemented
hydraulic and pneumatic gates, mechanical gates out of building blocks and sticks,
optical gates, and recently, gates made from molecules of DNA.
Logic gates have standard, implementation-independent schematic representa-
tions, shown in Figure 2.2. Using these symbols, it is a straightforward matter to
compose gates to create a logic circuit that is equivalent to any Boolean expression.
For example, the expression not a and b would look like the following:
[circuit diagram: input a passes through a not gate and then, together with b, enters an and gate; the output is not a and b]
Both inputs a and b enter on the left. Input a enters a not gate before the and
gate, so the top input to the and gate is not a and the bottom input is simply b.
The single output of the circuit on the right leaves the and gate with value not a
and b. In this way, logic circuits can be built to an arbitrary level of complexity to
perform useful functions.
The circuit for the carry output of our adder is simply an and gate:
[circuit: a and b enter an and gate whose output is carry]
The circuit for the sum output combines and, or, and not gates:
[circuit: two and gates, each taking one input directly and the other input through a not gate, feed an or gate whose output is sum]
By convention, the solid black circles represent connections between “wires”; if there
is no solid black circle at a crossing, this means that one wire is “floating” above
the other and they do not touch. In this case, by virtue of the connections, the a
input is flowing into both the top and gate and the bottom not gate, while the b
input is flowing into both the top not gate and the bottom and gate. The top and
gate outputs the value of a and not b and the bottom and gate outputs the value
of not a and b. The or gate then outputs the result of oring these two values.
Finally, we can combine these two circuits into one grand adder circuit with
two inputs and two outputs, to replace the “black box” adder we began with. The
shaded box represents the “black box” that we are replacing.
[circuit: the complete adder, with the inputs a and b on the left and the outputs carry and sum on the right]
Notice that the values of both a and b are each now flowing into three different
gates initially, and the two outputs are conceptually being computed in parallel. For
example, suppose a is 0 and b is 1. The figure below shows how this information
flows through the adder to arrive at the final output values.
[the adder circuit with a value labeled on each wire: the inputs a = 0 and b = 1 produce carry = 0 and sum = 1]
Exercises
2.5.1. Show how to add the unsigned binary numbers 001001 and 001101.
2.5.2. Show how to add the unsigned binary numbers 0001010 and 0101101.
2.5.3. Show how to add the unsigned binary numbers 1001 and 1101, assuming that
all integers must be stored in four bits. Convert the binary values to decimal to
determine if you arrived at the correct answer.
2.5.4. Show how to add the unsigned binary numbers 001010 and 101101, assuming that
all integers must be stored in six bits. Convert the binary values to decimal to
determine if you arrived at the correct answer.
2.5.5. Suppose you have a computer that stores unsigned integers in a fixed number of
bits. If you have the computer add two unsigned integers, how can you tell if the
answer is correct (without having access to the correct answer from some other
source)? (Refer back to the unsigned addition example in the text.)
2.5.6. Show how to add the two’s complement binary numbers 0101 and 1101, assuming
that all integers must be stored in four bits. Convert the binary values to decimal
to determine if you arrived at the correct answer.
2.5.7. What is the largest positive integer that can be represented in four bits in two’s
complement notation? What is the smallest negative number? (Think especially
carefully about the second question.)
2.5.8. Show how to add the two’s complement binary numbers 1001 and 1101, assuming
that all integers must be stored in four bits. Convert the binary values to decimal
to determine if you arrived at the correct answer.
2.5.9. Show how to add the two’s complement binary numbers 001010 and 101101,
assuming that all integers must be stored in six bits. Convert the binary values to
decimal to determine if you arrived at the correct answer.
2.5.10. Suppose you have a computer that stores two’s complement integers in a fixed
number of bits. If you have the computer add two two’s complement integers, how
can you tell if the answer is correct (without having access to the correct answer
from some other source)?
2.5.11. Subtraction can be implemented by adding the first operand to the two’s comple-
ment of the second operand. Using this algorithm, show how to subtract the two’s
complement binary number 0101 from 1101. Convert the binary values to decimal
to determine if you arrived at the correct answer.
2.5.12. Show how to subtract the two’s complement binary number 0011 from 0110.
Convert the binary values to decimal to determine if you arrived at the correct
answer.
2.5.13. Copy the completed adder circuit, and show, as we did above, how the two outputs
(carry and sum) obtain their final values when the input a is 1 and the input b is 0.
2.5.14. Convert the Boolean expression not (a and b) to a logic circuit.
2.5.15. Convert the Boolean expression not a and not b to a logic circuit.
2.5.16. The single Boolean operator nand (short for “not and”) can replace all three
traditional Boolean operators. The truth table and logic gate symbol for nand are
shown below.
a b a nand b
0 0 1
0 1 1
1 0 1
1 1 0
[nand gate symbol: inputs a and b, output a nand b]
Show how you can create three logic circuits, using only nand gates, each of which
is equivalent to one of the and, or, and not gates. (Hints: you can use constant
inputs, e.g., inputs that are always 0 or 1, or have both inputs of a single gate be
the same.)
2.6 SUMMARY
To learn how to write algorithms and programs, you have to dive right in! We started
our introduction to programming in Python by performing simple computations
on numbers. There are two numerical types in Python: integers and floating point
numbers. Floating point numbers have a decimal point and sometimes behave
differently than we expect due to their finite precision. Aside from these rounding
errors, most arithmetic operators in Python behave as we would expect, except that
there are two kinds of division: so-called true division and floor division.
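For example, in the Python shell (an added illustration):

>>> 7 / 2       # true division always produces a floating point result
3.5
>>> 7 // 2      # floor division rounds down to an integer
3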
We use descriptive variable names in our programs for two fundamental reasons.
First, they make our code easier to understand. Second, they can “remember” our
results for later so that they do not have to be recomputed. The assignment statement
is used to assign values to variable names. Despite its use of the equals sign, using
an assignment statement is not the same thing as saying that two things are equal.
Instead, the assignment statement evaluates the expression on the righthand side of
the equals sign first, and then assigns the result to the variable on the lefthand side.
Functional abstractions in Python are implemented as functions. Functions take
arguments as input and then return a value as output. When used in an expression,
it is useful to think of a function call as equivalent to the value that it returns. We
used some of Python’s built-in functions, and the mathematical functions provided
by the math module. In addition to numerical values, Python lets us print, input, and
manipulate string values, which are sequences of characters enclosed in quotation
marks.
Because everything in a computer is represented in binary notation, all of these
operators and functions are really computed in binary, and then expressed to us in
more convenient formats. As an example, we looked at how positive and negative
integers are stored in a computer, and how two binary integers can be added together
just by using the basic Boolean operators of and, or, and not.
In the next chapter, we will begin to write our own functions and incorporate
them into longer programs. We will also explore Python’s “turtle graphics” module,
which will allow us to draw pictures and visualize our data.
list of “100 or so Books that Shaped a Century of Science” [35]. Dr. Knuth also
invented the typesetting program TeX, which was used to write this book. He is the
recipient of many international awards, including the Turing Award, named after
Alan Turing, which is considered to be the “Nobel Prize of computer science.”
Guido van Rossum is a Dutch computer programmer who invented the Python
programming language. The assertion that Van Rossum was in a “slightly irreverent
mood” when he named Python after the British comedy show is from the Foreword
to Programming Python [29] by Mark Lutz. IDLE is an acronym for “Integrated
DeveLopment Environment,” but is also considered to be a tribute to Eric Idle, one
of the founders of Monty Python.
The “Hello world!” program is the traditional first program that everyone learns
when starting out. See
http://en.wikipedia.org/wiki/Hello_world_program
https://docs.python.org/3/index.html
There is also a list of links on the book web site, and references to commonly used
classes and functions in Appendix B.
CHAPTER 3
Visualizing abstraction
Donald E. Knuth
Turing Award Lecture (1974)
We may say most aptly that the Analytical Engine weaves algebraical patterns just as the
Jacquard-loom weaves flowers and leaves.
Ada Lovelace
Notes (1843)
Probably not. However, simply plotting the points instantly provides insight:
A picture really is worth a thousand words, especially when we are faced with a
slew of data. To visualize data like this, we will often turn to turtle graphics. To
draw in turtle graphics, we create an abstraction called a “turtle” in a window and
move it with directional commands. As a turtle moves, its “tail” leaves behind a
trail, as shown in Figure 3.1. If we lift a turtle’s tail up, it can move without leaving
a trace. In this chapter, in the course of learning about turtle graphics, we will also
explore how abstractions can be created, used, and combined to solve problems.
An abstract data type (ADT) specifies, for some category of things:
(a) the types of information, called attributes, that we need to maintain about
the things, and
(b) the operations that we are allowed to use to access or modify that information.
For example, a turtle abstract data type specifies that the turtle must maintain the
following attributes about its current state.
Figure 3.1 A turtle graphics window containing two turtles. The blue turtle moved
forward, turned left 45°, and then moved forward again. The red turtle turned left
120°, moved forward, turned left again 90°, and then moved forward again.
The Turtle ADT also describes the following operations that change or report on
the turtle’s state. If the operation requires an argument as input, that is listed in
the second column.
Just as a blueprint describes the structure of a house, but is not actually a house,
the Turtle class, or more abstractly the Turtle ADT, describes the structure (i.e.,
attributes and methods) of a drawing turtle, but is not actually a drawing turtle.
Actual turtles in turtle graphics, like those pictured in Figure 3.1, are called turtle
objects. When we create a new turtle object belonging to the Turtle class, the turtle
object is endowed with its own independent values of orientation, position, color,
and so on, as described in the class definition. For this reason, there can be more
than one turtle object, as illustrated in Figure 3.1.
The distinction between a class and an object can also be loosely described by
analogy to animal taxonomy. A species, like a class, describes a category of animals
sharing the same general (morphological and/or genetic) characteristics. An actual
living organism is an instance of a species, like an object is an instance of a class. For
example, the species of Galápagos giant tortoise (Chelonoidis nigra) is analogous to
a class, while Lonesome George, the famous Galápagos giant tortoise who died in
2012, is analogous to an object of that class. Super Diego, another famous Galápagos
giant tortoise, is a member of the same species but, like another object of the same
class, is a distinct individual with its own unique attributes.
Reflection 3.1 Can you think of another analogy for a class and its associated
objects?
Hexadecimal is used as a convenient shorthand for binary. Because any 4 binary dig-
its can represent the values 0 through 15, they can be conveniently replaced by
a single hexadecimal digit. So the hexadecimal number 100522f10 is equivalent to
000100000000010100100010111100010000 in binary, as shown below:
1 0 0 5 2 2 f 1 0
0001 0000 0000 0101 0010 0010 1111 0001 0000
Instead of displaying this 36 bit binary number, it is more convenient to display the 9
digit hexadecimal equivalent.
the assignment. When a student completes the assignment, she is creating an object
that (hopefully) adheres to those requirements.
To create a turtle object in Python, we call a function with the class’ name,
preceded by the name of the module in which the class resides.
>>> george = turtle.Turtle()
The empty parentheses indicate that we are calling a function with no arguments.
The Turtle() function returns a reference to a Turtle object, which is then assigned
to the name george. You should also notice that a window appears on your screen
with a little arrow-shaped “turtle” in the center. The center of the window has
coordinates (0, 0) and is called the origin. In Figure 3.1, the axes are superimposed
on the window in light gray to orient you to the coordinate system. We can confirm
that george is a Turtle object by printing the object’s value.
>>> george
<turtle.Turtle object at 0x100522f10>
The odd-looking “0x100522f10” is the address in memory where this Turtle object
resides. The address is displayed in hexadecimal, or base 16, notation. The 0x at
the front is a prefix that indicates hexadecimal; the actual hexadecimal memory
address is 100522f10. See Box 3.1 for more about how hexadecimal works.
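If you would like to experiment with hexadecimal yourself, Python's built-in bin and hex functions convert between the two notations (an added illustration):

>>> bin(0x100522f10)     # 0x marks a hexadecimal literal; 0b marks binary
'0b100000000010100100010111100010000'
>>> hex(0b100000000010100100010111100010000)
'0x100522f10'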
To call a method belonging to an object, we precede the name of the method
with the name of the object, separated by a period. For example, to ask george to
move forward 200 units, we write
>>> george.forward(200)
Since the origin has coordinates (0, 0) and george is initially pointing east (toward
positive x values), george has moved to position (200, 0); the forward method
silently changed george’s hidden position attribute to reflect this, which you can
confirm by calling george’s position method.
>>> george.position()
(200.00,0.00)
Notice that we did not change the object’s position attribute directly. Indeed, we do
not even know the name of that attribute because the class definition remains hidden
to us. This is by design. By interacting with objects only through their methods,
and not tinkering directly with their attributes, we maintain a clear separation
between the ADT specification and the underlying implementation. This allows for
the possibility that the underlying implementation may change, to make it more
efficient, for example, without affecting programs that use it. We will discuss these
issues in much greater detail in Chapter 13.
Exercises
3.1.1. Explain the difference between an abstract data type (ADT) and a class.
3.1.2. Give another analogy for the difference between a class and an object. Explain.
3.1.3. Why do we use methods to change the state of a Turtle object instead of directly
changing the values of its attributes?
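The drawing example continues here. After george moves forward, we turn it to the left by 135 degrees with the left method (this call is reconstructed from the discussion that follows):

>>> george.left(135)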
With this method call, we have changed george’s hidden orientation attribute,
which we can confirm by calling the heading method.
>>> george.heading()
135.0
To finish the drawing, we just have to repeat the previous forward and left calls
seven more times! (Hint: see IDLE help for how to retrieve previous statements.)
>>> george.forward(200)
>>> george.left(135)
>>> george.forward(200)
>>> george.left(135)
>>> george.forward(200)
>>> george.left(135)
>>> george.forward(200)
>>> george.left(135)
>>> george.forward(200)
>>> george.left(135)
>>> george.forward(200)
>>> george.left(135)
>>> george.forward(200)
>>> george.left(135)
That was tedious. But before we look at how to avoid similar tedium in the future,
we are going to transition out of the Python shell. This will allow us to save our
programs so that we can easily modify them or fix mistakes, and then re-execute
them without retyping everything. In IDLE, we can create a new, empty program
file by choosing New Window from the File menu. (If you are using a different text editor, the steps are probably very similar.) In the new window, retype (or
copy and paste) the work we have done so far, shown below, plus four additional
lines highlighted in red. (If you copy and paste, be sure to remove the prompt
symbols before each line.)
import turtle
george = turtle.Turtle()
george.hideturtle()
george.speed(6)
george.forward(200)
george.left(135)
george.forward(200)
george.left(135)
george.forward(200)
george.left(135)
george.forward(200)
george.left(135)
george.forward(200)
george.left(135)
george.forward(200)
george.left(135)
george.forward(200)
george.left(135)
george.forward(200)
george.left(135)
screen = george.getscreen()
screen.exitonclick()
The two statements after the first assignment statement hide george and speed up
the drawing a bit. (The argument to the speed method is a number from 0 to 10,
with 1 being slow, 10 being fast, and 0 being fastest.) The second to last statement
assigns to the variable screen an object of the class Screen, which represents the
drawing area in which george lives. The last statement calls the Screen method
exitonclick which will close the window when we click on it.
When you are done typing, save your file by selecting Save As. . . from the File
menu. The file name of a Python program must always end with the extension .py,
for example, george.py. To execute your new program in IDLE, select Run Module
from the Run menu. If IDLE prompts you to save your file again, just click OK. The
program should draw the “flower” shape again. When it is done, click on the turtle
graphics window to dismiss it.
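The eight repeated pairs of forward and left calls can be replaced with a single for loop; this is the loop that appears in the final version of the program later in this section (note the colon at the end of the first line):

for segment in range(8):
    george.forward(200)
    george.left(135)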
It is easy to forget the colon at the end of the for line. If you do forget it, you will be
notified by a syntax error that points to the end of the line containing the for keyword.
In the for loop syntax, for and in are Python keywords, and segment is called
the index variable. The name of the index variable can be anything we want, but
it should be descriptive of its role in the program. In this case, we chose the name
segment to represent one line segment in the shape, which is what the body of the
loop draws. range(8) represents the range of integers between 0 and 7. So the for
loop is essentially saying “execute the body of the loop once for each integer in the
range between 0 and 7.” Since there are 8 integers in this range, the body of the
loop is executed 8 times.
Reflection 3.3 Try changing segment to some other name. Did changing the name
change the behavior of the program?
Let’s look a little more closely at what is happening in this loop. Before each
iteration of the loop, the next value in the range between 0 and 7 is assigned to the
index variable, segment. Then the statements in the body of the loop are executed.
So, quite literally, this loop is equivalent to executing the following 24 statements:
segment = 0              # iteration 1
george.forward(200)
george.left(135)

segment = 1              # iteration 2
george.forward(200)
george.left(135)

segment = 2              # iteration 3
george.forward(200)
george.left(135)

⋮

segment = 6              # iteration 7
george.forward(200)
george.left(135)

segment = 7              # iteration 8
george.forward(200)
george.left(135)
Now that we have a basic flower shape, let’s add some color. To set the color that
the turtle draws in, we use the pencolor method. Insert
george.pencolor('red')
Figure 3.3 A simple geometric “flower,” outlined in red and filled in yellow.
before the for loop, and run your program again. A color can be specified in one of
two ways. First, common colors can be specified with strings such as 'red', 'blue',
and 'yellow'. Notice that, as we saw in the previous chapter, a string must be
enclosed in quotes to distinguish it from a variable or function name. A color can
also be defined by explicitly specifying its red, green, and blue (RGB) components,
as explained in Box 3.2.
Finally, we will specify a color with which to fill the “flower” shape. The fill
color is set by the fillcolor method. The statements that draw the area to be
filled must be contained between calls to the begin_fill and end_fill methods.
To color our flower yellow, precede the for loop with
george.fillcolor('yellow')
george.begin_fill()
Be sure not to indent the call to george.end_fill() into the body of the for loop,
since we want that statement to execute just once, after the loop is finished. Your
flower should now look like Figure 3.3, and the complete flower bloom program
should look like the following:
import turtle

george = turtle.Turtle()
george.hideturtle()
george.speed(6)
george.pencolor('red')
george.fillcolor('yellow')
george.begin_fill()
for segment in range(8):
    george.forward(200)
    george.left(135)
george.end_fill()
screen = george.getscreen()
screen.exitonclick()
Reflection 3.5 Can you figure out why the shape was filled this way?
Exercises
Write a short program to answer each of the following questions. Submit each as a separate
Python program file with a .py extension (e.g., picture.py).
3.2.1. Create two turtles and use them to draw the picture in Figure 3.1.
3.2.2. Write a modified version of the flower bloom program that draws a flower with 18
sides, using an angle of 100°.
3.2.3. Write a program that uses a for loop to draw a square with side length 200.
3.2.4. Write a program that uses a for loop to draw a rectangle with length 200 and
width 100.
3.2.5. Draw an interesting picture using turtle graphics. Consult Appendices B.2 and B.3
for a list of methods. You might want to draw your picture on graph paper first.
import turtle

george = turtle.Turtle()
george.hideturtle()
george.speed(6)
george.pencolor('red')
george.fillcolor('yellow')
george.begin_fill()
for segment in range(8):
    george.forward(200)
    george.left(135)
george.end_fill()
george.up()
george.goto(200, 200)
george.down()
george.pencolor('yellow')
george.fillcolor('purple')
george.begin_fill()
for segment in range(8):
    george.forward(150)
    george.left(135)
george.end_fill()
screen = george.getscreen()
screen.exitonclick()
However, this strategy is a very bad idea. First, it is very time-consuming and
error-prone; when you repeatedly copy and paste, it is very easy to make mistakes,
or forget to make appropriate changes. Second, it makes your code unnecessarily
long and hard to read. Doing this a few times can quickly lead to dozens or hundreds
of lines of dense code. Third, it is difficult to make changes. For example, what if
you copied enough code to draw twenty flowers, and then decided that you wanted
to give all of them six petals instead of eight?
Instead, we can create a new function to draw a flower. Then we can repeatedly
call this function with different arguments to draw different flowers. To create a
function in Python, we use the def keyword, followed by the function name and, for
now, empty parentheses (we will come back to those shortly). As with a for loop,
the def line must end with a colon (:).
def bloom():
The body of the function is then indented relative to the def line. The body of our
new function will consist of the flower bloom code. Insert this new function after
the import statement:
import turtle

def bloom():
    george.pencolor('red')
    george.fillcolor('yellow')
    george.begin_fill()
    for segment in range(8):
        george.forward(200)
        george.left(135)
    george.end_fill()

george = turtle.Turtle()
george.hideturtle()
george.speed(6)
bloom()
screen = george.getscreen()
screen.exitonclick()
The def construct only defines the new function; it does not execute it. We need to
call the function for it to execute. As we saw earlier, a function call consists of the
function name, followed by a list of arguments. Since this function does not have
any arguments (yet), and does not return a value, we can call it with
bloom()
inserted, at the outermost indentation level, where the flower bloom code used to
be (as shown above).
Reflection 3.7 Try running the program with and without the bloom() function
call to see what happens.
Before continuing, let’s take a moment to review what the program is doing. As
illustrated below, execution begins at the top, labeled “start,” so the program first
imports the turtle module.
import turtle                    # (1) start: execution begins here

def bloom():                     # (2) the function is defined but not executed
    george.pencolor('red')       # (5) the body runs only when bloom() is called,
    george.fillcolor('yellow')   #     drawing the flower...
    george.begin_fill()
    for segment in range(8):
        george.forward(200)
        george.left(135)
    george.end_fill()            #     ...then execution returns to the caller

george = turtle.Turtle()         # (3) create and set up the turtle
george.hideturtle()
george.speed(6)
bloom()                          # (4) this call jumps up to the function
screen = george.getscreen()      # (6) execution continues here when bloom returns
screen.exitonclick()             # (7) and runs to the end of the program
Next, the bloom function is defined, but not executed, as signified by the line marked
(2) above. The next three statements (3) define a new Turtle object named george,
hide the turtle, and speed it up a bit. Next, the call to bloom() causes execution to
jump up to the beginning of the function (4). The statements in the function then
draw the flower (5). When the function is complete, execution continues with the
statement after the function call (6), and continues to the end of the program (7).
Function parameters
The function named bloom that we have written is not as useful as it could be
because it always draws the same flower: yellow with segment length 200. To make
the function more general, we want to be able to pass in arguments for values like
the fill color and the segment length. We do this by adding one or more parameters
to the function definition. A parameter is a variable that is assigned the value of an
argument when the function is called. (Parameters and arguments are also called
formal parameters and actual parameters, respectively.) In the new version below,
we have defined two formal parameters to represent the fill color and the
segment length. We have also replaced the old constants 'yellow' and 200 with
the names of these new parameters.
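A sketch of that revised version, reconstructed from the description above (the two parameters take the place of the old constants):

def bloom(fcolor, length):
    george.pencolor('red')
    george.fillcolor(fcolor)       # the fill color is now a parameter
    george.begin_fill()
    for segment in range(8):
        george.forward(length)     # the segment length is now a parameter
        george.left(135)
    george.end_fill()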
Now, to replicate the old behavior, we would call the function with
bloom('yellow', 200)
When this function is called, the value of the first argument 'yellow' is assigned
to the first parameter fcolor and the value of the second argument 200 is assigned
to the second parameter length. Then the body of the function executes. When-
ever fcolor is referenced, it is replaced with 'yellow', and whenever length is
referenced, it is replaced with 200.
Reflection 3.8 After making these changes, run the program again. Then try
running it a few more times with different arguments passed into the bloom function
call. For example, try bloom(’orange’, 50) and bloom(’purple’, 350). What
happens if you switch the order of the arguments in one of these function calls?
Notice that we have just created a functional abstraction! When we want to draw
a flower bloom, we pass two arguments as input to the function, and the function
draws it, without us having to understand how it was drawn.
(Of course, because we wrote the function, we do understand how it was drawn. But
to call it, we do not need to understand how it works.)
We are going to make one more change to this function before moving on,
motivated by the following question.
Reflection 3.9 Look at the variable name george that is used inside the bloom
function. Where is it defined?
When the bloom function executes, the Python interpreter encounters the variable
name george in the first line, but george has not been defined in that function.
Realizing this, Python looks for the name george outside the function. This behavior
is called a scoping rule. The scope of a variable name is the part of the program
where the name is defined, and hence can be used.
The scope of a variable name that is defined inside a function, such as segment
in the bloom function, is limited to that function. Such a variable is called a local
variable. If we tried to refer to segment outside of the bloom function, we would
get an error. We will look at local variables in more detail in Section 3.6.
A variable name that is defined at the outermost indentation level can be
accessed from anywhere in the program, and is called a global variable. In our
program, george and screen are global variable names. It is generally a bad idea
to have any global variables at all in a program, a topic that we will further discuss
in Sections 3.4 and 3.5. But even aside from that issue, we should be concerned that
our function is tied to one specific turtle named george that is defined outside our
function. It would be much better to make the turtle a parameter to the function,
so that we can call it with any turtle we want. Replacing george with a parameter
named tortoise gives us the following modified function:
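A sketch of the modified function, with the global name george replaced by the parameter tortoise throughout:

def bloom(tortoise, fcolor, length):
    tortoise.pencolor('red')
    tortoise.fillcolor(fcolor)
    tortoise.begin_fill()
    for segment in range(8):
        tortoise.forward(length)
        tortoise.left(135)
    tortoise.end_fill()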
We also need to update the function call by passing george as the first argument,
to be assigned to the first parameter, tortoise.
bloom(george, 'yellow', 200)
To finish our flower, let’s create another function that draws a stem. Our stem-
drawing function will take two parameters: tortoise, which is the name of the
turtle object, and length, the length of the stem. For convenience, we will assume
that the stem length is the same as the length of a segment in the associated flower.
Include this function in your program immediately after where the bloom function
is defined.
Since the bloom function nicely returns the turtle to the origin, pointing east, we
will assume that tortoise is in this state when stem is called. We start the function
by setting the pen color to green, and thickening the turtle’s tail by calling the
method pensize. Notice that the pen size is based on the parameter length, so that
it scales properly with different size flowers. Next, we need to move halfway across
the flower and turn south to start drawing the stem. So that we do not draw over
the existing flower, we put the turtle’s tail up with the up method before we move,
and return it to its resting position again with down when we are done. Finally, we
turn right and move the turtle forward to draw a thick green stem. To draw a stem
for our yellow flower, we can insert the function call
stem(george, 200)
after the call to the bloom function. When you run your program, the flower should
look like Figure 3.4.
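A sketch of a stem function consistent with this description follows; the particular pen size and the distances moved here are assumptions, not necessarily the book's exact values.

def stem(tortoise, length):
    tortoise.pencolor('green')
    tortoise.pensize(length / 20)    # assumed thickness; scales with the flower size
    tortoise.up()                    # lift the tail so we do not draw over the bloom
    tortoise.forward(length / 2)     # move halfway across the flower
    tortoise.right(90)               # turn south
    tortoise.down()
    tortoise.forward(length)         # draw the thick green stem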
We now have two functions — two functional abstractions, in fact — that draw
a flower bloom and an associated stem. We can now focus on the larger problem of
creating a virtual garden without thinking further about how the flower-drawing
functions work.
Reflection 3.10 Do you see an opportunity in the program for yet another func-
tional abstraction?
Right, they draw a flower! So we can replace these two lines with another function
(another functional abstraction) that draws a flower:
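A sketch of such a flower function (its full listing is part of Figure 3.5); it simply passes its parameters along to bloom and stem:

def flower(tortoise, fcolor, length):
    bloom(tortoise, fcolor, length)     # draw the bloom first...
    stem(tortoise, length)              # ...then the stem beneath it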
Figure 3.5 The flower program before the flower planting code.
Figure 3.6 The final flower program with the flower planting code.
[Figure 3.7: the hierarchy of abstractions, from the growFlower function, down through the flower function and the Turtle class, to the turtle ADT.]
Because the bloom and stem functions together require a turtle, a fill color and a
length, and we want to be able to customize our flower in these three ways, these
are the parameters to our flower function. We pass all three of these parameters
straight through to the bloom function, and then we pass two of them to the stem
function. Finally, we can call our new flower function with
flower(george, 'yellow', 200)
to accomplish the same thing as the two statements above. The program is shown
in Figure 3.5. Make sure you have all of this down before moving on.
In these three sections, we have learned how to draw with turtle graphics, but
more importantly, we learned how to identify and create abstractions. The Turtle
class implements a turtle abstract data type. Using this class, we have implemented
two functional abstractions with Python functions, bloom and stem. We used these,
in turn, to build another functional abstraction, flower. Creating a hierarchy of
functional abstractions is what allows us to solve large problems. This hierarchy
of abstractions, with one additional layer that will develop next, is depicted in
Figure 3.7.
After this function call, one of the strings in the list will be assigned to fill.
The following function incorporates these two function calls to draw a random
flower at coordinates (x, y). Insert it into your program after the definition of the
flower function, and insert import random at the top. (See Figure 3.6.) Make sure
you understand how this function works before you continue.
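A sketch of such a growFlower function follows; the particular list of colors and the range of segment lengths are assumptions chosen for illustration.

def growFlower(x, y):
    tortoise = turtle.Turtle()                        # a new turtle for each flower
    tortoise.hideturtle()
    tortoise.speed(0)
    length = random.randrange(20, 200)                # assumed range of lengths
    fill = random.choice(['pink', 'yellow', 'red'])   # assumed list of colors
    tortoise.up()                                     # move to the clicked position
    tortoise.goto(x, y)                               # without drawing
    tortoise.down()
    flower(tortoise, fill, length)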
Notice that this function creates a new Turtle object each time it is called (one
per flower).
Next, we want to make our program respond to a mouse click by calling the
growFlower function with arguments equal to the coordinates of the click. This
is accomplished by the onclick method of the Screen class. In particular, the
following statement tells the Python interpreter that the function passed as an
argument (in this case, the growFlower function) should be called every time we
click in the window.
screen.onclick(growFlower)
The onclick function passes the (x, y) coordinates of each mouse click as parameters
to the growFlower function. (The function that we pass into onclick can only take
these two parameters, which is why we did not also include a Turtle object as a
parameter to the growFlower function above.) Insert the statement above after the
call to george.getscreen(). Finally, replace screen.exitonclick() with
screen.mainloop()
as the last line of the program. The mainloop method repeatedly checks for mouse
click events and calls the appropriate function (which we indicated should be
growFlower) when click events happen. This final program is shown in Figure 3.6,
and a possible flower garden is shown in Figure 3.8.
Exercises
Write a short program to answer each of the following questions. Submit each as a separate
Python program file with a .py extension (e.g., picture.py).
3.3.1. Modify the flower function so that it allows different numbers of petals. The
revised function will need to take an additional parameter:
flower(tortoise, fcolor, length, petals)
Note that the original function has the turtle travel a total of 8 ⋅ 135 = 1080 degrees.
When you generalize the number of petals, make sure that the total number of
degrees is still a multiple of 360.
3.3.2. Modify the flower function so that it creates a daffodil-like double bloom like the
one below. The revised function will need two fill color parameters:
flower(tortoise, fcolor1, fcolor2, length)
It might help to know that the distance between any two opposite points of a
bloom is about 1.08 times the segment length.
3.3.3. Write a program that draws the word “CODE,” as shown below. Use the circle
method to draw the arcs of the “C” and “D.” The circle method takes two
arguments: the radius of the circle and the extent of the circle in degrees. For
example, george.circle(100, 180) would draw half of a circle with radius 100.
Making the extent negative draws the arc in the reverse direction. In addition, use
the up method to move the turtle between letters without drawing, and the down
method to resume drawing.
3.3.4. Rewrite your program from Exercise 3.3.3 so that each letter is drawn by its own
function. Then use your functions to draw “DECODE.” (Call your “D” and “E”
functions each twice.)
3.3.5. Write a function
drawSquare(tortoise, width)
that uses the turtle named tortoise to draw a square with the given width. This
function generalizes the code you wrote for Exercise 3.2.3 so that it can draw a
square with any width. Use a for loop.
3.3.6. Write a function
drawRectangle(tortoise, length, width)
that uses the turtle named tortoise to draw a rectangle with the given length
and width. This function generalizes the code you wrote for Exercise 3.2.4 so that
it can draw a rectangle of any size. Use a for loop.
3.3.7. Write a function
drawPolygon(tortoise, sideLength, numSides)
that uses the turtle named tortoise to draw a regular polygon with the given num-
ber of sides and side length. This function is a generalization of your drawSquare
function from Exercise 3.3.5. Use the value of numSides in your for loop and
create a new variable for the turn angle.
3.3.8. Write a function
drawCircle(tortoise, radius)
that calls your drawPolygon function from Exercise 3.3.7 to approximate a circle
with the given radius.
The following are additional exercises that ask you to write functions that do not involve
turtle graphics. Be sure to test each one by calling it with appropriate arguments.
Program structure
Let’s return to the program that we wrote in the previous section (Figure 3.5), and
reorganize it a bit to reflect better programming habits. As shown in Figure 3.9,
every program should begin with documentation that identifies the program’s author
and its purpose. This type of documentation is called a docstring; we will look more
closely at docstrings and other types of documentation in the next section.
We follow this with our import statements. Putting these at the top of our
program both makes our program neater and ensures that the imported modules
are available anywhere later on.
Next, we define all of our functions. Because programs are read by the interpreter
from top to bottom, we should always define our functions above where we call
them. For example, in our flower program, we had to place the definition of the
flower function above the line where we called it. If we had tried to call the flower
function at the top of the program, before it was defined, we would have generated
an error message.
At the end of the flower-drawing program in Figure 3.5, there are six statements
[Figure 3.9 (excerpt): the flower program reorganized with a program docstring, function definitions, and a main function.]

"""
Purpose: Draw a flower
Author: Ima Student
Date: September 15, 2020
CS 111, Fall 2020
"""

def bloom(tortoise, fcolor, length):
    """...
    Parameters:
        tortoise: a Turtle object with which to draw the bloom.
        fcolor: a color string to use to fill the bloom.
        length: the length of each segment of the bloom.
    Return value: None
    """

    tortoise.pencolor('red')          # set tortoise's pen color to red
    tortoise.fillcolor(fcolor)        # and fill color to fcolor
    tortoise.begin_fill()
    for segment in range(8):          # draw a filled 8-sided
        tortoise.forward(length)      # geometric flower bloom
        tortoise.left(135)
    tortoise.end_fill()

def main():
    """Draws a yellow flower with segment length 200, and
    waits for a mouse click to exit.
    """
at the outermost indentation level. Recall that the first and fifth of these statements
define a global variable name that is visible and potentially modifiable anywhere in
the program. When the value assigned to a global variable is modified in a function,
it is called a side effect. In large programs, where the values of global variables
can be potentially modified in countless different places, errors in their use become
nearly impossible to find. For this reason, we should get into the habit of never
using them, unless there is a very good reason, and these are pretty hard to come
by. See Box 3.3 for more information on how global names are handled in Python.
To prevent the use of global variables, and to make programs more readable, we
will move statements at the global level of our programs into a function named main,
and then call main as the last statement in the program, as shown at the end of the
program in Figure 3.9. With this change, the call to the main function is where the
spam = 13

def func1():
    spam = 100

def func2():
    global spam
    spam = 200

func1()
print(spam)
func2()
print(spam)
The first print will display 13 because the assignment statement that is executed in
func1 defines a new local variable; it does not modify the global variable with the same
name. But the second print will display 200 because the global statement in func2
indicates that spam should refer to the global variable with that name, causing the
subsequent assignment statement to change the value assigned to the global variable.
This convention prevents accidental side effects because it forces the programmer to
explicitly decide to modify a global variable. In any case, using global is strongly
discouraged.
action begins in this program. (Remember that the function definitions above only
define functions; they do not execute them.) The main function eventually calls our
flower function, which then calls the bloom and stem functions. Getting used to
this style of programming has an additional benefit: it is very similar to the style of
other common programming languages (e.g., C, C++, Java) so, if you go on to use
one of these in the future, it should seem relatively familiar.
Documentation
Python program documentation comes in two flavors: docstrings and comments. A
docstring is meant to articulate everything that someone needs to know to use a
program or module, or to call a function. Comments, on the other hand, are used to
document individual program statements or groups of statements. In other words, a
docstring explains what a program or function does, while comments explain how
the code works; a docstring describes an abstraction while comments describe what
happens inside the black box. The Python interpreter ignores both docstrings and
comments while it is executing a program; both are intended for human eyes only.
Docstrings
A docstring is enclosed in a matching pair of triple double quotes ("""), and may
occupy several lines. We use a docstring at the beginning of every program to
identify the program’s author and its purpose, as shown at the top of Figure 3.9. (Your instructor may require a different format, so be sure to ask.)
We also use a docstring to document each function that we write, to ensure that the
reader understands what it does. A function docstring should articulate everything
that someone needs to know to call the function: the overall purpose of the function,
and descriptions of the function’s parameters and return value.
The beginning of a function’s docstring is indented on the line immediately
following the def statement. Programmers prefer a variety of different styles for
docstrings; we will use one that closely resembles the style in Google’s official
Python style guide. The following illustrate this docstring style, applied to the three
functions in the program from Figure 3.5. (The bodies of the functions are omitted.)
def bloom(tortoise, fcolor, length):
    """...
    Parameters:
        tortoise: a Turtle object with which to draw the bloom
        fcolor: a color string to use to fill the bloom
        length: the length of each segment of the bloom
    """

def stem(tortoise, length):
    """...
    Parameters:
        tortoise: a Turtle object, located at the bloom starting position
        length: the length of the stem and each segment of the bloom
    """
def flower(tortoise, fcolor, length):
    """...
    Parameters:
        tortoise: a Turtle object with which to draw the flower
        fcolor: a color string to use to fill the bloom
        length: the length of each segment of the bloom
    """
In the first line of the docstring, we succinctly explain what the function does.
This is followed by a parameter section that lists each parameter with its intended
purpose and the class to which it should belong. If there are any assumptions made
about the value of the parameter, these should be stated also. For example, the
turtle parameter of the stem function is assumed to start at the origin of the bloom.
Finally, we describe the return value of the function. We did not have these functions
return anything, so they return None. We will look at how to write functions that
return values in Section 3.5.
Another advantage of writing docstrings is that Python can automatically
produce documentation from them, in response to calling the help function. For
example, try this short example in the Python shell:
>>> def spam(x):
        """Pointlessly returns x.
        Parameters:
            x: some value of no consequence
        Return value:
            x: same value of no consequence
        """
        return x

>>> help(spam)
Help on function spam in module __main__:

spam(x)
    Pointlessly returns x.
    Parameters:
        x: some value of no consequence
    Return value:
        x: same value of no consequence
You can also use help with modules and built-in functions. For example, try this:
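For instance, something along these lines (one possible example; any module or function will do):

>>> import turtle
>>> help(turtle)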
In the documentation that appears, hit the space bar to scroll forward a page and
the Q key to exit.
Comments
A comment is anything between a hash symbol (#) and the end of the line. As with
docstrings, the Python interpreter ignores comments. Comments should generally
be neatly lined up to the right of your code. However, there are times when a longer
comment is needed to explain a complicated section of code. In this case, you might
want to precede that section with a comment on one or more lines by itself.
There is a fine line between under-commenting and over-commenting. As a
general rule, you want to supply high-level descriptions of what your code intends
to do. You do not want to literally repeat what each individual line does, as this is
not at all helpful to someone reading your code. Remember, anyone who reads your
program is going to also be a programmer, so you don’t need to explain what each
Python statement does. Doing so tends to clutter your code and make it harder to
read! As Albert Einstein purportedly said, "Everything should be made as simple as possible, but not simpler."
Here are examples of good comments for the body of the bloom function.
tortoise.pencolor('red')        # set tortoise's pen color to red
tortoise.fillcolor(fcolor)      # and fill color to fcolor
tortoise.begin_fill()
for segment in range(8):        # draw a filled 8-sided
    tortoise.forward(length)    # geometric flower bloom
    tortoise.left(135)
tortoise.end_fill()
Notice that the five lines that draw the bloom are commented together, just to note
the programmer’s intention. In contrast, the following comments illustrate what not
to do; they are both hard to read and uninformative.
tortoise.pencolor('red')        # set tortoise's pen color to red
tortoise.fillcolor(fcolor)      # set tortoise's fill color to fcolor
tortoise.begin_fill()           # begin to fill a shape
for segment in range(8):        # for segment = 0, 1, 2, ..., 7
    tortoise.forward(length)    # move tortoise forward length
    tortoise.left(135)          # turn tortoise left 135 degrees
tortoise.end_fill()             # stop filling the shape
We leave the task of commenting the other functions in this program as an exercise.
Now it is clear that this code is computing the profit generated from selling cups of
something. But the meaning of the numbers is still a mystery. These are examples of
“magic numbers,” so-called in programming parlance because they seem to appear
out of nowhere. There are at least two reasons to avoid “magic numbers.” First,
they make your code less readable and obscure its meaning. Second, they make it
more difficult and error-prone to change your code, especially if you use the same
value multiple times. By assigning these numbers to descriptive variable names, the
code becomes even clearer.
cupsSold = 462
pricePerCup = 3.95
costPerCup = 1.85
fixedCost = 140
profit = (pricePerCup - costPerCup) * cupsSold - fixedCost
What we have now is what we call “self-documenting code.” Since we have named
all of our variables and values with descriptive names, just reading the code is
enough to deduce its intention. These same rules, of course, apply to function names
and parameters. By naming our functions with descriptive names, we make their
purposes clearer and we contribute to the readability of the functions from which we
call them. This practice will continue to be demonstrated in the coming chapters.
In this book, we use a naming convention that is sometimes called “camelCase,”
in which the first letter is in lower case and then the first letters of subsequent
words are capitalized. But other programmers prefer different styles. For example,
some programmers prefer “snake_case,” in which an underscore character is placed
between words (cupsSold would be cups_sold). Unless you are working in an
environment with a specific mandated style, the choice is yours, as long as it results
in self-documenting code.
Exercises
3.4.1. Incorporate all the changes we discussed in this section into your flower-drawing
program, and finish commenting the bodies of the remaining functions.
3.4.2. Write a function that implements your Mad Lib from Exercise 2.4.13, and
then write a complete program (with main function) that calls it. Your Mad Lib
function should take the words needed to fill in the blanks as parameters. Your
main function should get these values with calls to the input function, and then
pass them to your function. Include docstrings and comments in your program.
For example, here is a function version of the example in Exercise 2.4.13 (without
docstrings or comments).
def party(adj1, noun1, noun2, adj2, noun3):
    print('How to Throw a Party')
    print()
    print('If you are looking for a/an', adj1, 'way to')
    print('celebrate your love of', noun1 + ', how about a')
    print(noun2 + '-themed costume party? Start by')
    print('sending invitations encoded in', adj2, 'format')
    print('giving directions to the location of your', noun3 + '.')

def main():
    firstAdj = input('Adjective: ')
    firstNoun = input('Noun: ')
    secondNoun = input('Noun: ')
    secondAdj = input('Adjective: ')
    thirdNoun = input('Noun: ')
    party(firstAdj, firstNoun, secondNoun, secondAdj, thirdNoun)

main()
3.4.3. Study the following program, and then reorganize it with a main function that
calls one or more other functions. Your main function should be very short: create
a turtle, call your functions, and then wait for a mouse click. Document your
program with appropriate docstrings and comments.
import turtle
george = turtle.Turtle()
george.setposition(0, 100)
george.pencolor('red')
george.fillcolor('red')
george.begin_fill()
george.circle(-100, 180)
george.right(90)
george.forward(200)
george.end_fill()
george.up()
george.right(90)
george.forward(25)
george.right(90)
george.forward(50)
george.left(90)
george.down()
george.pencolor('white')
george.fillcolor('white')
george.begin_fill()
george.circle(-50, 180)
george.right(90)
george.forward(100)
george.end_fill()
screen = george.getscreen()
screen.exitonclick()
[diagram: the input 4 enters a function box that computes x ** 2 + 3, and the output 19 comes out]
The return statement defines the output of the function. Remember that the
function definition by itself does not compute anything. We must call the function
for it to be executed. For example, in a program, we might call our function f like
this:
def main():
    y = f(4)
    print(y)

main()
When the assignment statement in main is executed, the righthand side calls the
function f with the argument 4. So the function f is executed next with 4 assigned
to the parameter x. It is convenient to visualize this as follows, with the value 4
replacing x in the function:
def f(x):                # the argument 4 is assigned to the parameter x
    return x ** 2 + 3    # 4 ** 2 + 3 evaluates to 19, which is returned
Next, the value 4 ** 2 + 3 = 19 is computed and returned by the function. This
return value becomes the value associated with the function call f(4), and is
therefore the value assigned to y in main. Since f(4) evaluates to 19, we could
alternatively replace the two lines in main with
def main():
    print(f(4))
Similarly, we can incorporate calls to functions with return values into longer
expressions, such as
y = (3 * f(4) + 2) ** 2
In this example, f(4) evaluates to 19, and then the rest of the expression is evaluated:
y = (3 * f(4) + 2) ** 2
  = (3 *  19  + 2) ** 2
  = (      59     ) ** 2
  = 3481
Recall that, in Section 2.3, we computed quantities like volume and wind chill on
the righthand side of an assignment statement.
volume = (4 / 3) * pi * (radius ** 3)
Then, to recompute the volume with a different value of radius, or wind chill
with a different value of wind, we had to re-type the same formula again. This was
obviously tedious and error-prone, analogous to building a new microwave oven from
scratch every time we wanted to pop a bag of popcorn. What we were missing then
were functions. After creating a function for each of these computations, computing
the results with a different input simply amounts to calling the function again. For
example, the volume of a sphere can be computed with the following function:
import math

def volumeSphere(radius):
    """Computes the volume of a sphere.

    Parameter:
        radius: radius of the sphere

    Return value:
        volume of a sphere with the given radius
    """

    return (4 / 3) * math.pi * (radius ** 3)

def main():
    radius1 = float(input('Radius of first sphere: '))
    radius2 = float(input('Radius of second sphere: '))
    volume1 = volumeSphere(radius1)
    volume2 = volumeSphere(radius2)
    print('The volumes of the spheres are', volume1, 'and', volume2)

main()
The return statement in the function assigns its return value to be the volume
of the sphere with the given radius. Now, as shown in the main function above,
computing the volumes of two different spheres just involves calling the function
twice. Similarly, suppose that inside an empty sphere with radius 20, we had a solid
sphere with radius 10. To compute the empty space remaining inside the larger
sphere, we can compute
emptyVolume = volumeSphere(20) - volumeSphere(10)
            =  33510.3216...   -   4188.7902...
            =  29321.5314...
In addition to defining a function’s return value, the return statement also causes
the function to end and return this value back to the function call. So the return
statement actually does two things. A return statement (a) defines the function's return value, and (b) immediately ends the function's execution, returning control to the point of the call.
Therefore, if we add statements to a function after the return statement, they will
never be executed. For example, the red call to the print function in the following
program is useless.
def volumeSphere(radius):
    """Computes the volume of a sphere.

    Parameter:
        radius: radius of the sphere

    Return value:
        volume of a sphere with the given radius
    """
def volumeSphere(radius):
    """Computes the volume of a sphere.

    Parameter:
        radius: radius of the sphere

    Return value:
        volume of a sphere with the given radius
    """
This will print the desired answer but, because we did not provide a return value,
the function’s return value defaults to None. So if we try to compute
emptyVolume = volumeSphere(20) - volumeSphere(10)
´¹¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¸¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹¶ ´¹¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¸¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹¶
None None
´¹¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¸ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¹ ¶
None - None ???
we get an error because we are trying to assign emptyVolume = None - None, which
is nonsensical.
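In a Python 3 shell, the attempted subtraction fails with a TypeError along these lines (the exact wording can vary between versions):

>>> None - None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'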
Exercises
Write a function for each of the following problems. In each case, make sure the function
returns the specified value (instead of just printing it).
that returns your team’s football score if the number of touchdowns (7 points),
field goals (3 points), and safeties (2 points) are passed as parameters. Then write
a complete program (with main function) that gets these three values using the
input function, passes them to your football function, and then prints the score.
3.5.4. Exercise 2.4.8 asked how to compute the distance between two points. Now write
a function
distance(x1, y1, x2, y2)
that returns the distance between points (x1, y1) and (x2, y2).
3.5.5. The ideal gas law states that P V = nRT where
• P = pressure in atmospheres (atm)
• V = volume in liters (L)
• n = number of moles (mol) of gas
• R = gas constant = 0.08 L atm / mol K
• T = absolute temperature of the gas in Kelvin (K)
Write a function
moles(V, P, T)
that returns the number of moles of an ideal gas in V liters contained at pressure
P and T degrees Celsius. (Be sure to convert Celsius to Kelvin in your function.)
Also write a complete program (with a main function) that gets these three values
using the input function, passes them to your moles function, and then prints the
number of moles of ideal gas.
3.5.6. Suppose we have two containers of an ideal gas. The first contains 10 L of gas at
1.5 atm and 20 degrees Celsius. The second contains 25 L of gas at 2 atm and
30 degrees Celsius. Show how to use two calls to your function in the previous
exercise to compute the total number of moles of ideal gas in the two containers.
Now replace the return statement in your moles function with a call to print
instead. (So your function does not contain a return statement.) Can you still
compute the total number of moles in the same way? If so, show how. If not,
explain why not.
3.5.7. Most of the world is highly dependent upon groundwater for survival. Therefore,
it is important to be able to monitor groundwater flow to understand potential
contamination threats. Darcy’s law states that the flow of a liquid (e.g., water)
through a porous medium (e.g., sand, gravel) depends upon the capacity of the
medium to carry the flow and the gradient of the flow:
Q = K ⋅ dh/dl
where
• K is the hydraulic conductivity of the medium, the rate at which the liquid
can move through it, measured in area/time
• dh/dl is the hydraulic gradient
x = 10
y = 1
swap(x, y)
so that after the function returns, x has the value 1 and y has the value 10? (The
function should not return anything.) If so, write it. If not, explain why not.
3.5.12. Given an integer course grade from 0 to 99, we convert it to the equivalent grade
point according to the following scale: 90–99: 4, 80–89: 3, 70–79: 2, 60–69: 1, < 60:
0. Write a function
gradePoint(score)
that returns the grade point (i.e., GPA) equivalent to the given score. (Hint: use
floor division and the built-in max function which returns the maximum of two or
more numbers.)
3.5.13. The function time.time() (in the time module) returns the current time in
seconds since January 1, 1970. Write a function
year()
that uses this function to return the current year as an integer value.
def windChill(temp, wind):
    """Computes the wind chill index.
    Parameters:
        temp: temperature in degrees Celsius
        wind: wind speed at 10m in km/h
    Return value:
        equivalent wind chill in degrees Celsius, rounded to
        the nearest integer
    """
    chill = 13.12 + 0.6215 * temp - 11.37 * wind ** 0.16 + 0.3965 * temp * wind ** 0.16
    temp = round(chill)
    return temp

def main():
    chilly = windChill(-3, 13)
    print('The wind chill is', chilly)

main()
Notice that we have introduced a variable inside the windChill function named
chill to break up the computation a bit. We assign chill the result of the wind
chill computation, using the parameters temp and wind, and then return this value
rounded to the nearest integer. Because we created the name chill inside the
function windChill, its scope is local to the function. In other words, if we tried to
refer to chill anywhere in the program outside of the function windChill (e.g., in
the main function), we would get the following error:
NameError: name 'chill' is not defined
Local namespaces
To better understand how the scoping rules of local variable and parameter names
work, let’s look more closely at how these names are managed in Python. To make
this a little more interesting, let’s modify our wind chill program a bit more, as
highlighted in red:
def main():
    t = -3
    w = 13
    chilly = windChill(t, w)
    print('The wind chill is', chilly)

main()
In this program, just after we call the windChill function, but just before the values
of the arguments t and w are assigned to the parameters temp and wind, we can
visualize the situation like this:
(Diagram: a box for main containing the names t and w, which refer to -3 and 13, and a box for windChill containing the names temp and wind, which have not yet been assigned values.)
The box around t and w represents the scope of the main function, and the box
around temp and wind represents the scope of the windChill function. In each case,
the scope defines what names have been defined, or have meaning, in that function.
In the picture, we are using arrows instead of affixing the “Sticky notes” directly to
the values to make clear that the names, not the values, reside in their respective
scopes. The names are references to the memory cells in which their values reside.
The scope corresponding to a function in Python is managed with a namespace.
A namespace of a function is simply a list of names that are defined in that function,
together with references to their values. We can view the namespace of a particular
function by calling the locals function from within it. For example, insert the
following statement into the main function, just before the call to windChill:
print('Namespace of main before calling windChill:', locals())

Running the program now prints a line like the following before the rest of the output:

Namespace of main before calling windChill: {'t': -3, 'w': 13}
This is showing us that, at that point in the program, the local namespace in
the main function consists of two names: t, which is assigned the value -3, and w,
which is assigned the value 13 (just as we visualized above). The curly braces ({ })
around the namespace representation indicate that the namespace is a dictionary,
another abstract data type in Python. We will explore dictionaries in more detail in
Chapter 8.
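As a small aside, locals can be called from any function; here is a toy sketch (the function add below is illustrative and not part of the wind chill program):

def add(a, b):
    total = a + b
    print(locals())     # a dictionary of the function's local names and their values
    return total

add(2, 3)               # prints something like {'a': 2, 'b': 3, 'total': 5}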
Returning to the program, when windChill is called and the parameters are as-
signed the values of their respective arguments, we are implicitly assigning temp = t
and wind = w, so the picture changes to this:
(Diagram: t and w in the main box, and temp and wind in the windChill box, now all refer to the values -3 and 13.)
We can see this in the program by inserting another call to locals as the first line
of the windChill function:
print('Local namespace at the start of windChill:', locals())

which prints a line like:

Local namespace at the start of windChill: {'temp': -3, 'wind': 13}
This is showing us that, at the beginning of the windChill function, the only
visible names are temp and wind, which have been assigned the values of t and w,
respectively. Notice, however, that t and w do not exist inside windChill, and there
is no direct connection between t and temp, or between w and wind; rather they are
only indirectly connected through the value to which they are both assigned.
Next, the first statement in the function, the assignment statement involving
the local variable chill inserts the new name chill into the namespace of the
windChill function and assigns it the result of the wind chill computation.
(Diagram: the windChill box now also contains the name chill, referring to the value -7.67…; the main box still contains t and w, referring to -3 and 13.)
Finally, the function reassigns the rounded wind chill value to the local parameter
temp before returning it:
(Diagram: temp in the windChill box now refers to the rounded value -8, while t and w in the main box still refer to -3 and 13.)
To see this sequence of events in the context of the program, insert another call to
locals in windChill, just before the return statement:
print('Local namespace at the end of windChill:', locals())
After the windChill function returns −8, the namespace of windChill, and all of
the local names in that namespace, cease to exist, leaving t and w untouched in the
main namespace. However, a new name, chilly, is created in the main namespace
and assigned the return value of the windChill function:
(Diagram: only the main box remains; it contains t, w, and chilly, referring to -3, 13, and -8.)
To confirm this, add one more call to locals at the end of main:
print('Local namespace in main after calling windChill:', locals())
One important take-away from this example is that when we are dealing with
parameters that are numbers, a parameter and its corresponding argument are not
directly connected; if you change the value assigned to the parameter, as we did to
the parameter temp, it does not change the corresponding argument, in this case, t.
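A minimal sketch of this behavior (the names below are illustrative):

def increment(number):
    number = number + 1    # reassigns only the local parameter name
    return number

x = 5
y = increment(x)
print(x, y)                # prints 5 6 ; the argument x is unchanged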
Names are also created at the top level of the program, outside of any function, in
what is called the global namespace. We can view this namespace by adding a
statement like

print('Global namespace:', globals())

to the end of the program. The result will be something like the following (some names are not shown):

Global namespace: {'main': <function main at 0x10065e290>,
'windChill': <function windChill at 0x10065e440>,
'__name__': '__main__',
'__builtins__': <module 'builtins' (built-in)>, ... }
Notice that the only global names that we created refer to our two functions, main
and windChill. We can think of each of these names as referring to the functions’
respective namespaces, as illustrated below (references for some names are omitted):
def main():
    t = -3
    w = 13
    print('Namespace of main before calling windChill:', locals())
    chilly = windChill(t, w)
    print('The wind chill is', chilly)
    print('Local namespace in main after calling windChill:', locals())

main()
Figure 3.10 The complete wind chill program, with calls to the locals function.
(Diagram: the global namespace __main__ contains the names main and windChill, each referring to a function's namespace; while windChill executes, its namespace holds references to -3, 13, -7.67…, and -8. The builtins namespace is also available.)
The other names defined in the global namespace are standard names defined in every
Python program. The name __name__ refers to the name of the current module,
which, in this case, is '__main__' (not to be confused with the namespace of the
main function) because we are in the file that we initially executed in the Python interpreter.
(Diagram: after the program also imports the math module, the global namespace contains the additional name math, referring to the math module's own namespace, in which names such as pi refer to values like 3.141….)
When we preface each of the function names in the math module with math (e.g.,
math.sqrt(7)), we are telling the Python interpreter to look in the math namespace
for the function.
Maintaining a mental model like this should help you manage the names that
you use in your programs, especially as they become longer.
Exercises
3.6.1. Show how to use the locals function to print all of the local variable names in the
distance function from Exercise 3.5.4. Use your modified function to compute the
distance between the points (3, 7.25) and (9.5, 1). What does the local namespace
look like while the function is executing?
import turtle

def main():
    george = turtle.Turtle()
    sideLength = 200
    drawStar(george, sideLength)
    screen = george.getscreen()
    screen.exitonclick()

main()
Sketch a picture like that on Page 106 depicting the namespaces in this program
just before returning from the drawStar function. Here is a picture to get you
started:
(Starter picture: a box for main and a box for drawStar; the values shown are a Turtle object, 200, and 4.)
3.6.3. In economics, a demand function gives the price a consumer is willing to pay for
an item, given that a particular quantity of that item is available. For example,
suppose that in a coffee bean market the demand function is given by
D(Q) = 45 − 2.3Q/1000,
where Q is the quantity of available coffee, measured in kilograms, and the returned
price is for 1 kg. So, for example, if there are 5000 kg of coffee beans available, the
price will be 45 − (2.3)(5000)/1000 = 33.50 dollars for 1 kg. The following program
computes this value.
def demand(quantity):
    quantity = quantity / 1000
    return 45 - 2.3 * quantity

def main():
    coffee = 5000
    price = demand(coffee)
    print(price)

main()
Sketch a picture like that on Page 106 depicting the namespaces in this program
just before returning from the demand function and also just before returning from
the main function.
3.7 SUMMARY
In this chapter, we made progress toward writing more sophisticated programs. The
key to successfully solving larger problems is to break the problem into smaller,
more manageable pieces, and then treat each of these pieces as an abstract “black
box” that you can use to solve the larger problem. There are two types of “black
boxes,” those that represent things (i.e., data, information) and those that represent
actions. A “black box” representing a thing is described by an abstract data type
(ADT), which contains both hidden data and a set of functions that we can call
to access or modify that data. In Python, an ADT is implemented with a class,
and instances of a class are called objects. The class, such as Turtle, to which an
object belongs specifies what (hidden) data the object has and what methods can be
called to access or modify that data. Remember that a class is the “blueprint” for a
category of objects, but is not actually an object. We “built” new Turtle objects
by calling a function with the class’ name:
george = turtle.Turtle()
Once the object is created, we can do things with it by calling its methods, like
george.forward(100), without worrying about how it actually works.
A “black box” that performs an action is called a functional abstraction. We
implement functional abstractions in Python with functions. Earlier in the chapter,
we designed functions to draw things in turtle graphics, gradually making them
more general (and hence more useful) by adding parameters. We also started using
for loops to create more interesting iterative algorithms. Later in the chapter, we
also looked at how we can add return values to functions, and how to properly think
about all of the names that we use in our programs. By breaking our programs up
into functions, like breaking up a complex organization into divisions, we can more
effectively focus on how to solve the problem at hand.
This increasing complexity becomes much easier to manage if we follow a few
guidelines for structuring and documenting our programs. We laid these out in
Section 3.4, and encourage you to stick to them as we forge ahead.
In the next chapter, we will design iterative algorithms to model quantities that
change over time, like the sizes of dynamic populations. These techniques will also
serve as the foundation for many other algorithms in later chapters.
http://amturing.acm.org .
The second epigraph is from Ada Lovelace, considered by many to be the first
computer programmer. She was born Ada Byron in England in 1815, the daughter of
the Romantic poet Lord Byron. (However, she never knew her father because he left
England soon after she was born.) In marriage, Ada acquired the title “Countess of
Lovelace,” and is now commonly known simply as Ada Lovelace. She was educated
in mathematics by several prominent tutors and worked with Charles Babbage, the
inventor of two of the first computers, the Difference Engine and the Analytical
Engine. Although the Analytical Engine was never actually built, Ada wrote a set
of “Notes” about its design, including what many consider to be the first computer
program. (The quote is from Note A, Page 696.) In her “Notes” she also imagined
that future computers would be able to perform tasks far more interesting than
arithmetic (like make music). Ada Lovelace died in 1852, at the age of 37.
The giant tortoise named Lonesome George was, sadly, the last surviving member
of his subspecies, Chelonoidis nigra abingdonii. The giant tortoise named Super
Diego is a member of a different subspecies, Chelonoidis nigra hoodensis.
The commenting style we use in this book is based on Google’s official Python
style guide, which you can find at
http://google-styleguide.googlecode.com/svn/trunk/pyguide.html .
CHAPTER 4
Growth and decay
Our population and our use of the finite resources of planet Earth are growing exponentially,
along with our technical ability to change the environment for good or ill.
Stephen Hawking
TED talk (2008)
Some of the most fundamental questions asked in the natural and social sciences
concern the dynamic sizes of populations and other quantities over time. For
example, we may be interested in the size of a plant population being affected by
an invasive species, the magnitude of an infection facing a human population, the
quantity of a radioactive material in a storage facility, the penetration of a product
in the global marketplace, or the evolving characteristics of a dynamic social network.
The possibilities are endless.
To study situations like these, scientists develop a simplified model that abstracts
key characteristics of the actual situation so that it might be more easily understood
and explored. In this sense, a model is another example of abstraction. Once we
have a model that describes the problem, we can write a simulation that shows
what happens when the model is applied over time. The power of modeling and
simulation lies in their ability to either provide a theoretical framework for past
observations or predict future behavior. Scientists often use models in parallel with
traditional experiments to compare their observations to a proposed theoretical
framework. These parallel scientific processes are illustrated in Figure 4.1. On the left
is the computational process. In this case, we use “model” instead of “algorithm” to
acknowledge the possibility that the model is mathematical rather than algorithmic.
(We will talk more about this process in Section 7.1.) On the right side is the
parallel experimental process, guided by the scientific method. The results of the
computational and experimental processes can be compared, possibly leading to
model adjustments or new experiments to improve the results.
(Figure 4.1: the parallel computational and experimental processes. Each side interprets its results and evaluates the model and simulation, or the experiment; the two sets of results are compared, possibly leading to adjustments to the model or to additional experiments.)
When we model the dynamic behavior of populations, we will assume that time
ticks in discrete steps and, at any particular time step, the current population size is
based on the population size at the previous time step. Depending on the problem,
a time step may be anywhere from a nanosecond to a century. In general, a new
time step may bring population increases, in the form of births and immigration,
and population decreases, in the form of deaths and emigration. In this chapter, we
will discuss a fundamental algorithmic technique, called an accumulator, that we
will use to model dynamic processes like these. Accumulators crop up in all kinds of
problems, and lie at the foundation of a variety of different algorithmic techniques.
We will continue to see examples throughout the book.
4.1 DISCRETE MODELS

Suppose we manage a fishing pond that is initially stocked with 12,000 largemouth
bass, a population that grows at a rate of 8% per year, a rate that accounts for both
the birth rate and the death rate of the population. The maximum annual fishing
harvest allowed is 1,500 bass. Since this is a popular fishing spot, this harvest is
attained every year. Is our maximum annual harvest sustainable? If not, how long
until the fish population dies out? Should we reduce the maximum harvest? If so,
what should it be reduced to?
We can find the projected population size for any given year by starting with the
initial population size, and then computing the population size in each successive
year based on the size in the previous year. For example, suppose we wanted to
know the projected population size four years from now. We start with the initial
population of largemouth bass, assigned to a variable named population0:
population0 = 12000
Then we want to set the population size at the end of the first year to be this initial
population size, plus 8% of the initial population size, minus 1,500. In other words,
if we let population1 represent the population size at the end of the first year, then
population1 = population0 + 0.08 * population0 - 1500 # 11,460
or, equivalently,
population1 = 1.08 * population0 - 1500 # 11,460.0
The population size at the end of the second year is computed in the same way,
based on the value of population1:
population2 = 1.08 * population1 - 1500 # 10,876.8
Continuing,
population3 = 1.08 * population2 - 1500 # 10,246.94
population4 = 1.08 * population3 - 1500 # 9,566.6952
So this model projects that the bass population in four years will be 9,566 (ignoring
the “fractional fish” represented by the digits to the right of the decimal point).
The process we just followed was obviously repetitive (or iterative), and is
therefore ripe for a for loop. Recall that we used the following for loop in Section 3.2
to draw our geometric flower with eight petals:
for segment in range(8):
    tortoise.forward(length)
    tortoise.left(135)
The keywords for and in are required syntax elements, while the parts in red are
up to the programmer. The variable name, in this case segment, is called an index
variable. The part in red following in is a list of values assigned to the index variable,
one value per iteration. Because range(8) represents a list of integers between 0
and 7, this loop is equivalent to the following sequence of statements:
segment = 0
tortoise.forward(length)
tortoise.left(135)
segment = 1
tortoise.forward(length)
tortoise.left(135)
segment = 2
tortoise.forward(length)
tortoise.left(135)
⋮
segment = 7
tortoise.forward(length)
tortoise.left(135)
Because eight different values are assigned to segment, the loop executes the two
drawing statements in the body of the loop eight times.
In our fish pond problem, to compute the population size at the end of the
fourth year, we performed four computations, namely:
population0 = 12000
population1 = 1.08 * population0 - 1500 # 11460.0
population2 = 1.08 * population1 - 1500 # 10876.8
population3 = 1.08 * population2 - 1500 # 10246.94
population4 = 1.08 * population3 - 1500 # 9566.6952
In each iteration of the loop, we want to compute the current year’s population size
based on the population size in the previous year. In the first iteration, we want to
compute the value that we assigned to population1. And, in the second, third, and
fourth iterations, we want to compute the values that we assigned to population2,
population3, and population4.
But how do we generalize these four statements into one statement that we can
repeat four times? The trick is to notice that, once each variable is used to compute
the next year’s population, it is never used again. Therefore, we really have no
need for all of these variables. Instead, we can use a single variable population,
called an accumulator variable (or just accumulator ), that we multiply by 1.08 and
subtract 1500 from each year. We initialize population = 12000, and then for each
successive year we assign
population = 1.08 * population - 1500
Remember that an assignment statement evaluates the righthand side first. So the
value of population on the righthand side of the equals sign is the previous year’s
population, which is used to compute the current year’s population that is assigned
to population on the lefthand side.
population = 12000
for year in range(4):
    population = 1.08 * population - 1500
This loop is equivalent to the following sequence of statements, where each comment
shows the previous value of population that is used on the righthand side:

population = 12000

year = 0
population = 1.08 * population - 1500    # previous value 12000; new population = 11460.0

year = 1
population = 1.08 * population - 1500    # previous value 11460.0; new population = 10876.8

year = 2
population = 1.08 * population - 1500    # previous value 10876.8; new population = 10246.94

year = 3
population = 1.08 * population - 1500    # previous value 10246.94; new population = 9566.6952
In the first iteration of the for loop, 0 is assigned to year and population is
assigned the previous value of population (12000) times 1.08 minus 1500, which
is 11460.0. Then, in the second iteration, 1 is assigned to year and population is
once again assigned the previous value of population (now 11460.0) times 1.08
minus 1500, which is 10876.8. This continues two more times, until the for loop
ends. The final value of population is 9566.6952, as we computed earlier.
Reflection 4.1 Type in the statements above and add the following statement after
the assignment to population in the body of the for loop:
print(year + 1, int(population))
We see in this example that we can use the index variable year just like any other
variable. Since year starts at zero and the first iteration of the loop is computing
the population size in year 1, we print year + 1 instead of year.
Reflection 4.2 How would you change this loop to compute the fish population in
five years? Ten years?
Changing the number of years to compute is now simple. All we have to do is change
the value in the range to whatever we want: range(5), range(10), etc. If we put
this computation in a function, then we can have the parameter be the desired
number of years:
def pond(years):
    """Simulates a fish population in a fishing pond, and
    prints annual population size. The population
    grows 8% per year with an annual harvest of 1500.
    Parameter:
        years: number of years to simulate
    """
    population = 12000
    for year in range(years):
        population = 1.08 * population - 1500
        print(year + 1, int(population))
    return population

def main():
    finalPopulation = pond(10)
    print('The final population is', finalPopulation)

main()
Reflection 4.3 What would happen if population = 12000 was inside the
body of the loop instead of before it? What would happen if we omitted the
population = 12000 statement altogether?
Reflection 4.4 Use the pond function to answer the original questions: Is this
maximum harvest sustainable? If not, how long until the fish population dies out?
Should the pond manager reduce the maximum harvest? If so, what should it be
reduced to?
Calling this function with a large enough number of years shows that the fish
population drops below zero (which, of course, can’t really happen) in year 14:
1 11460
2 10876
3 10246
⋮
13 392
14 -1076
⋮
This harvesting plan is clearly not sustainable, so the pond manager should reduce
it to a sustainable level. In this case, determining the sustainable level is easy: since
the population grows at 8% per year and the pond initially contains 12,000 fish, we
cannot allow more than 0.08 ⋅ 12000 = 960 fish to be harvested per year without the
population declining.
Reflection 4.5 Generalize the pond function with two additional parameters: the
initial population size and the annual harvest. Using your modified function, compute
the number of fish that will be in the pond in 15 years if we change the annual
harvest to 800.
def pond(years, initialPopulation, harvest):
    """ (docstring omitted) """
    population = initialPopulation
    for year in range(years):
        population = 1.08 * population - harvest
        print(year + 1, int(population))
    return population
The value of the initialPopulation parameter takes the place of our previous
initial population of 12000 and the parameter named harvest takes the place of
our previous harvest of 1500. To answer the question above, we can replace the call
to the pond function from main with:
finalPopulation = pond(15, 12000, 800)
The result that is printed is:
1 12160
2 12332
3 12519
4 12720
⋮
13 15439
14 15874
15 16344
The final population is 16344.338228396558
Reflection 4.6 How would you call the new version of the pond function to replicate
its original behavior, with an annual harvest of 1500?
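As a quick check of the sustainable harvest computed above, we can call the generalized pond function with an annual harvest of 960 (a sketch; any number of years works):

finalPopulation = pond(15, 12000, 960)

Since 1.08 * 12000 - 960 = 12000, every row of the resulting table shows a population of 12000, confirming that a harvest of 960 exactly balances the 8% annual growth.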
Before moving on, let’s look at a helpful Python trick, called a format string, that
enables us to format our table of annual populations in a more attractive way. To
illustrate the use of a format string, consider the following modified version of the
previous function.
def pond(years, initialPopulation, harvest):
    """ (docstring omitted) """
    population = initialPopulation
    print('Year | Population')
    print('-----|-----------')
    for year in range(years):
        population = 1.08 * population - harvest
        print('{0:^4} | {1:>9.2f}'.format(year + 1, population))
    return population
The function begins by printing a table header to label the columns. Then, in the
call to the print function inside the for loop, we utilize a format string to line up
the two values in each row. The syntax of a format string is
'<replacement fields>'.format(<values to format>)
(The parts in red above are descriptive and not part of the syntax.) The period
between the string and format indicates that format is a method of the string
class; we will talk more about the string class in Chapter 6. The parameters of
the format method are the values to be formatted. The format for each value is
specified in a replacement field enclosed in curly braces ({}) in the format string.
In the example in the for loop above, the {0:^4} replacement field specifies
that the first (really the “zero-th”; computer scientists like to start counting at
0) argument to format, in this case year + 1, should be centered (^) in a field
of width 4. The {1:>9.2f} replacement field specifies that population, as the
second argument to format, should be right justified (>) in a field of width 9 as
a floating point number with two places to the right of the decimal point (.2f).
When formatting floating point numbers (specified by the f), the number before
the decimal point in the replacement field is the minimum width, including the
decimal point. The number after the decimal point in the replacement field is the
minimum number of digits to the right of the decimal point in the number. (If we
wanted to align to the left, we would use <.) Characters in the string that are not in
replacement fields (in this case, two spaces with a vertical bar between them) are
printed as-is. So, if year were assigned the value 11 and population were assigned
the value 1752.35171, the above statement would print
 12  |   1752.35

(the ' 12 ' is produced by the {0:^4} replacement field and the '  1752.35' by the
{1:>9.2f} replacement field)
To fill spaces with something other than a space, we can use a fill character
immediately after the colon. For example, if we replaced the second replacement
field with {1:*>9.2f}, the previous statement would print the following instead:
 12  | **1752.35
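These replacement fields are easy to experiment with in the Python shell; for example (a sketch with illustrative values):

>>> '{0:^4} | {1:>9.2f}'.format(12, 1752.35171)
' 12  |   1752.35'
>>> '{0:^4} | {1:*>9.2f}'.format(12, 1752.35171)
' 12  | **1752.35'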
(Figure: a sequence of network snapshots as nodes 2, 3, 4, and 5 are added; in each snapshot the newly added node and its potential new links are shown in red.)
At each step, the red node is added to the network. In each step, the red links
represent all of the potential new connections that could result from the addition of
the new member.
Reflection 4.7 What is the maximum number of new connections that could arise
when each of nodes 2, 3, 4, and 5 are added? In general, what is the maximum
number of new connections that could arise from adding node number n?
node number: 2 3 4 5 ⋯ n
maximum increase in number of links: 1 2 3 4 ⋯ n − 1
Therefore, as shown in the last row, the maximum number of links in a network
with n nodes is the sum of the numbers in the second row:
1 + 2 + 3 + ⋯ + (n − 1).
We will use this sum to represent the potential value of the network.
Let’s write a function, similar to the previous one, that lists the maximum
number of new links, and the maximum total number of links, as new nodes are
added to a network. In this case, we will need an accumulator to count the total
number of links. Adapting our pond function to this new purpose gives us the
following:
def countLinks(totalNodes):
    """Prints a table with the maximum total number of links
    in networks with 2 through totalNodes nodes.
    Parameter:
        totalNodes: the total number of nodes in the network
    Return value:
        the maximum number of links in a network with totalNodes nodes
    """
    totalLinks = 0
    for node in range(totalNodes):
        newLinks = ???
        totalLinks = totalLinks + newLinks
        print(node, newLinks, totalLinks)
    return totalLinks
In this function, we want our accumulator variable to count the total number of
links, so we renamed it from population to totalLinks, and initialized it to
zero. Likewise, we renamed the parameter, which specifies the number of iterations,
from years to totalNodes, and we renamed the index variable of the for loop from
year to node because it will now be counting the number of the node that we are
adding at each step. In the body of the for loop, we add to the accumulator the
maximum number of new links added to the network with the current node (we
will return to this in a moment) and then print a row containing the node number,
the maximum number of new links, and the maximum total number of links in the
network at that point.
Before we determine what the value of newLinks should be, we have to resolve
one issue. In the table above, the node numbers range from 2 to the number of nodes
in the network, but in our for loop, node will range from 0 to totalNodes - 1. This
turns out to be easily fixed because the range function can generate a wider variety
of number ranges than we have seen thus far. If we give range two arguments instead
of one, like range(start, stop), the first argument is interpreted as a starting
value and the second argument is interpreted as the stopping value, producing a
range of values starting at start and going up to, but not including, stop. For
example, range(-5, 10) produces the integers −5, −4, −3, . . . , 8, 9.
To see this for yourself, type range(-5, 10) into the Python shell (or print it
in a program).
Unfortunately, you will get a not-so-useful result, but one that we can fix by
converting the range to a list of numbers. A list, enclosed in square brackets ([ ]),
is another kind of abstract data type that we will make extensive use of in later
chapters. To convert the range to a list, we can pass it as an argument to the list
function:
>>> list(range(-5, 10))
[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Reflection 4.8 What list of numbers does range(1, 10) produce? What about
range(10, 1)? Can you explain why in each case?
Reflection 4.9 Back to our program, what do we want our for loop to look like?
For node to start at 2 and finish at totalNodes, we want our for loop to be
for node in range(2, totalNodes + 1):
Now what should the value of newLinks be in our program? The answer is in the
table we constructed above; the maximum number of new links added to the network
with node number n is n − 1. In our loop, the node number is assigned to the name
node, so we need to add node - 1 links in each step:
newLinks = node - 1
def countLinks(totalNodes):
    """ (docstring omitted) """
    totalLinks = 0
    for node in range(2, totalNodes + 1):
        newLinks = node - 1
        totalLinks = totalLinks + newLinks
        print(node, newLinks, totalLinks)
    return totalLinks

def main():
    links = countLinks(10)
    print('The total number of links is', links)

main()
As with our previous for loop, we can see more clearly what this loop does by
looking at an equivalent sequence of statements. The changing value of node is
highlighted in red.
totalLinks = 0
node = 2
newLinks = node - 1 # newLinks is assigned 2 - 1 = 1
totalLinks = totalLinks + newLinks # totalLinks is assigned 0 + 1 = 1
print(node, newLinks, totalLinks) # prints 2 1 1
node = 3
newLinks = node - 1 # newLinks is assigned 3 - 1 = 2
totalLinks = totalLinks + newLinks # totalLinks is assigned 1 + 2 = 3
print(node, newLinks, totalLinks) # prints 3 2 3
node = 4
newLinks = node - 1 # newLinks is assigned 4 - 1 = 3
totalLinks = totalLinks + newLinks # totalLinks is assigned 3 + 3 = 6
print(node, newLinks, totalLinks) # prints 4 3 6
⋮

node = 10
newLinks = node - 1 # newLinks is assigned 10 - 1 = 9
totalLinks = totalLinks + newLinks # totalLinks is assigned 36 + 9 = 45
print(node, newLinks, totalLinks) # prints 10 9 45
We leave lining up the columns more uniformly using a format string as an exercise.
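As a sanity check on this accumulator (not part of the text), the total after n nodes should equal the closed form n(n − 1)/2; a short sketch:

def countLinksFormula(totalNodes):
    """Closed form for 1 + 2 + ... + (totalNodes - 1)."""
    return totalNodes * (totalNodes - 1) // 2

print(countLinks(10), countLinksFormula(10))    # both print 45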
Reflection 4.10 What does countLinks(100) return? What does this represent?
Organizing a concert
Let’s look at one more example. Suppose you are putting on a concert and need to
figure out how much to charge per ticket. Your total expenses, for the band and
the venue, are $8000. The venue can seat at most 2,000 and you have determined
through market research that the number of tickets you are likely to sell is related
to a ticket’s selling price by the following relationship:
sales = 2500 - 80 * price
According to this relationship, if you give the tickets away for free, you will overfill
your venue. On the other hand, if you charge too much, you won’t sell any tickets
at all. You would like to price the tickets somewhere in between, so as to maximize
your profit. Your total income from ticket sales will be sales * price, so your
profit will be this amount minus $8000.
To determine the most profitable ticket price, we can create a table using a for
loop similar to that in the previous two problems. In this case, we would like to
iterate over a range of ticket prices and print the profit resulting from each choice.
In the following function, the for loop starts with a ticket price of one dollar and
adds one to the price in each iteration until it reaches maxPrice dollars.
def profitTable(maxPrice):
    """Prints a table of profits from a show based on ticket price.
    Parameters:
        maxPrice: maximum price to consider
    """
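    # The loop body is not shown in this excerpt; the following is a sketch based
    # on the description that follows (sales, income, and profit at each whole-dollar price):
    for price in range(1, maxPrice + 1):
        sales = 2500 - 80 * price          # expected ticket sales at this price
        income = sales * price
        profit = income - 8000
        print(price, income, profit)       # the text formats these with a format string
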
def main():
    profitTable(25)

main()
The number of expected sales in each iteration is computed from the value of the
index variable price, according to the relationship above. Then we print the price
and the resulting income and profit, formatted nicely with a format string. As we
did previously, we can look at what happens in each iteration of the loop:
price = 1
sales = 2500 - 80 * price # sales is assigned 2500 - 80 * 1 = 2420
income = sales * price # income is assigned 2420 * 1 = 2420
profit = income - 8000 # profit is assigned 2420 - 8000 = -5580
print(price, income, profit) # prints $ 1.00 $ 2420.00 $-5580.00
price = 2
sales = 2500 - 80 * price # sales is assigned 2500 - 80 * 2 = 2340
income = sales * price # income is assigned 2340 * 2 = 4680
profit = income - 8000 # profit is assigned 4680 - 8000 = -3320
print(price, income, profit) # prints $ 2.00 $ 4680.00 $-3320.00
price = 3
sales = 2500 - 80 * price # sales is assigned 2500 - 80 * 3 = 2260
income = sales * price # income is assigned 2260 * 3 = 6780
profit = income - 8000 # profit is assigned 6780 - 8000 = -1220
print(price, income, profit) # prints $ 3.00 $ 6780.00 $-1220.00
⋮
Reflection 4.11 Run this program and determine what the most profitable ticket
price is.
The profit in the third column increases until it reaches $11,520.00 at a ticket price
of $16, then it drops off. So the most profitable ticket price seems to be $16.
Reflection 4.12 Our program only considered whole dollar ticket prices. How can
we modify it to increment the ticket price by fifty cents in each iteration instead?
The range function can only create ranges of integers, so we cannot ask the range
function to increment by 0.5 instead of 1. But we can achieve our goal by doubling
the range of numbers that we iterate over, and then set the price in each iteration
to be the value of the index variable divided by two.
def profitTable(maxPrice):
    """ (docstring omitted) """
Now when price is 1, the “real price” that is used to compute profit is 0.5. When
price is 2, the “real price” is 1.0, etc.
Reflection 4.13 Does our new function find a more profitable ticket price than
$16?
If we look at the ticket prices around $16, we see that $15.50 will actually make $10
more.
Just from looking at the table, the relationship between the ticket price and the
profit is not as clear as it would be if we plotted the data instead. For example, does
profit rise in a straight line to the maximum and then fall in a straight line? Or is it
a more gradual curve? We can answer these questions by drawing a plot with turtle
graphics, using the goto method to move the turtle from one point to the next.
import turtle

def main():
    george = turtle.Turtle()
    screen = george.getscreen()
    screen.setworldcoordinates(0, -15000, 25, 15000)
    profitPlot(george, 25)
    screen.exitonclick()

main()
Our new main function sets up a turtle and then uses the setworldcoordinates
function to change the coordinate system in the drawing window to fit the points
that we are likely to plot. The first two arguments to setworldcoordinates set the
coordinates of the lower left corner of the window, in this case (0, −15000). So the
minimum visible x value (price) in the window will be zero and the minimum visible
y value (profit) will be −15000. The second two arguments set the coordinates of the
upper right corner of the window, in this case (25, 15000). So the maximum visible
x value (price) will be 25 and the maximum visible y value (profit) will be 15000.
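The profitPlot function itself does not appear in this excerpt; a minimal sketch consistent with the points described below (the parameter names tortoise and maxPrice are assumptions) might look like this:

def profitPlot(tortoise, maxPrice):
    """Plots profit versus ticket price, in half-dollar increments."""
    for price in range(1, 2 * maxPrice + 1):
        realPrice = price / 2
        sales = 2500 - 80 * realPrice
        profit = sales * realPrice - 8000
        tortoise.goto(realPrice, profit)   # draw a segment to the next (price, profit) point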
In the for loop in the profitPlot function, since the first value of realPrice is
0.5, the first goto is
george.goto(0.5, -6770)
which draws a line from the origin (0, 0) to (0.5, −6770). In the next iteration, the
value of realPrice is 1.0, so the loop next executes
george.goto(1.0, -5580)
which draws a line from the previous position of (0.5, −6770) to (1.0, −5580). The
next value of realPrice is 1.5, so the loop then executes
george.goto(1.5, -4430)
which draws a line from (1.0, −5580) to (1.5, −4430). And so on, until realPrice
takes on its final value of 25 and we draw a line from the previous position of
(24.5, 5230) to (25, 4500).
Reflection 4.14 What shape is the plot? Can you see why?
Reflection 4.15 When you run this plotting program, you will notice an ugly line
from the origin to the first point of the plot. How can you fix this? (We will leave
the answer as an exercise.)
Exercises
Write a function for each of the following problems. When appropriate, make sure the function
returns the specified value (rather than just printing it). Be sure to appropriately document
your functions with docstrings and comments.
that prints ticks times starting from midnight, where the clock ticks once each
minute. To simplify matters, the midnight hour can be denoted 0 instead of 12.
For example, clock(100) should print
0:00
0:01
0:02
⋮
0:59
1:00
1:01
⋮
1:38
1:39
To line up the colons in the times and force the leading zero in the minutes, use a
format string like this:
print('{0:>2}:{1:0>2}'.format(hours, minutes))
4.1.7. There are actually three forms of the range function:
• 1 parameter: range(stop)
• 2 parameters: range(start, stop)
• 3 parameters: range(start, stop, skip)
With three arguments, range produces a range of integers starting at the start
value and ending at or before stop - 1, adding skip each time. For example,
range(5, 15, 2)
produces the range of numbers 5, 7, 9, 11, 13 and
range(-5, -15, -2)
produces the range of numbers -5, -7, -9, -11, -13. To print these numbers,
one per line, we can use a for loop:
for number in range(-5, -15, -2):
    print(number)
(a) Write a for loop that prints the even integers from 2 to 100, using the third
form of the range function.
(b) Write a for loop that prints the odd integers from 1 to 100, using the third
form of the range function.
(c) Write a for loop that prints the integers from 1 to 100 in descending order.
(d) Write a for loop that prints the values 7, 11, 15, 19.
(e) Write a for loop that prints the values 2, 1, 0, −1, −2.
(f) Write a for loop that prints the values −7, −11, −15, −19.
4.1.19. Generalize the pond function further to allow for the pond to be annually restocked
with an additional quantity of fish.
4.1.20. Modify the countLinks function so that it prints a table like the following:
| | Links |
| Nodes | New | Total |
| ----- | --- | ----- |
| 2 | 1 | 1 |
| 3 | 2 | 3 |
| 4 | 3 | 6 |
| 5 | 4 | 10 |
| 6 | 5 | 15 |
| 7 | 6 | 21 |
| 8 | 7 | 28 |
| 9 | 8 | 36 |
| 10 | 9 | 45 |
4.1.21. Write a function
growth1(totalDays)
that simulates a population growing by 3 individuals each day. For each day, print
the day number and the total population size.
4.1.22. Write a function
growth2(totalDays)
that simulates a population that grows by 3 individuals each day but also shrinks
by, on average, 1 individual every 2 days. For each day, print the day number and
the total population size.
4.1.23. Write a function
growth3(totalDays)
that simulates a population that increases by 110% every day. Assume that the
initial population size is 10. For each day, print the day number and the total
population size.
4.1.24. Write a function
growth4(totalDays)
that simulates a population that grows by 2 on the first day, 4 on the second day,
8 on the third day, 16 on the fourth day, etc. Assume that the initial population
size is 10. For each day, print the day number and the total population size.
4.1.25. Write a function
sum(n)
that returns the sum of the integers between 1 and n, inclusive. For example,
sum(4) returns 1 + 2 + 3 + 4 = 10. (Use a for loop; if you know a shortcut, don’t
use it.)
that prints the sequence of numbers generated by this game, starting with the
two digit number, and continuing for the given number of iterations. It will be
helpful to know that no number in a sequence will ever have more than three digits.
Execute your function with every integer between 15 and 25, with iterations at
least 30. What do you notice? Can you classify each of these integers into one of
two groups based on the results?
4.1.34. You have $1,000 to invest and need to decide between two savings accounts. The
first account pays interest at an annual rate of 1% and is compounded daily,
meaning that interest is earned daily at a rate of (1/365)%. The second account
pays interest at an annual rate of 1.25% but is compounded monthly. Write a
function
interest(originalAmount, rate, periods)
that computes the interest earned in one year on originalAmount dollars in an
account that pays the given annual interest rate, compounded over the given
number of periods. Assume the interest rate is given as a percentage, not a fraction
(i.e., 1.25 vs. 0.0125). Use the function to answer the original question.
4.1.35. Suppose you want to start saving a certain amount each month in an investment
account that compounds interest monthly. To determine how much money you
expect to have in the future, write a function
invest(investment, rate, years)
that returns the income earned by investing investment dollars every month in
an investment account that pays the given rate of return, compounded monthly
(rate / 12 % each month).
4.1.36. A mortgage loan is charged some rate of interest every month based on the current
balance on the loan. If the annual interest rate of the mortgage is r%, then interest
equal to r/12 % of the current balance is added to the amount owed each month.
Also each month, the borrower is expected to make a payment, which reduces the
amount owed.
Write a function
mortgage(principal, rate, years, payment)
that prints a table of mortgage payments and the remaining balance every month
of the loan period. The last payment should include any remaining balance. For
example, paying $1,000 per month on a $200,000 loan at 4.5% for 30 years should
result in the following table:
4.1.37. Suppose a bacteria colony grows at a rate of 10% per hour and that there are
initially 100 bacteria in the colony. Write a function
bacteria(days)
that returns the number of bacteria in the colony after the given number of days.
How many bacteria are in the colony after one week?
4.1.38. Generalize the function that you wrote for the previous exercise so that it also
accepts parameters for the initial population size and the growth rate. How many
bacteria are in the same colony after one week if it grows at 15% per hour instead?
We saw lists briefly in Section 4.1 when we used the list function to visualize
range values; we will use lists much more extensively in Chapter 8. For now, we
only need to know how to build a list of population sizes in our for loop so that
we can plot them. Let’s look at how to do this in the fishing pond function from
Page 119, reproduced below.
def pond(years, initialPopulation, harvest):
    """ (docstring omitted) """
    population = initialPopulation
    print('Year Population')
    for year in range(years):
        population = 1.08 * population - harvest
        print('{0:^4} {1:>9.2f}'.format(year + 1, population))
    return population
We start by creating an empty list of annual population sizes before the loop:
populationList = [ ]
As you can see, an empty list is denoted by two square brackets with nothing in
between. To add an annual population size to the end of the list, we will use the
append method of the list class. We will first append the initial population size to
the end of the empty list with
populationList.append(initialPopulation)
If we pass in 12000 for the initial population parameter, this will result in
populationList becoming the single-element list [12000]. Inside the loop, we
want to append each value of population to the end of the growing list with
populationList.append(population)
Incorporating this code into our pond function, and deleting the calls to print,
yields:
def pond(years, initialPopulation, harvest):
    """Simulates the fishing pond and collects the annual
    population sizes in a list.
    Parameters:
        years: number of years to simulate
        initialPopulation: the initial population size
        harvest: the size of the annual harvest
    """
    population = initialPopulation
    populationList = [ ]
    populationList.append(initialPopulation)
    for year in range(years):
        population = 1.08 * population - harvest
        populationList.append(population)
    return population
The table below shows how the populationList grows with each iteration by
appending the current value of population to its end, assuming an initial population
of 12,000. When the loop finishes, we have years + 1 population sizes in our list.
Figure 4.2 Plot of population size in our fishing pond model with years = 15.
There is a strong similarity between the manner in which we are appending elements
to a list and the accumulators that we have been talking about in this chapter. In
an accumulator, we accumulate values into a sum by repeatedly adding new values
to a running sum. The running sum changes (usually grows) in each iteration of the
loop. With the list in the for loop above, we are accumulating values in a different
way—by repeatedly appending them to the end of a growing list. Therefore, we call
this technique a list accumulator .
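A minimal illustration of the pattern, separate from the fishing pond example:

squares = []                   # start with an empty list accumulator
for i in range(1, 6):
    squares.append(i * i)      # append one value in each iteration
print(squares)                 # prints [1, 4, 9, 16, 25]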
We now want to use this list of population sizes as the list of y values in a
matplotlib plot. For the x values, we need a list of the corresponding years, which
can be obtained with range(years + 1). Once we have both lists, we can create
a plot by calling the plot function and then display the plot by calling the show
function:
pyplot.plot(range(years + 1), populationList)
pyplot.show()
The first argument to the plot function is the list of x values and the second
parameter is the list of y values. The matplotlib module includes many optional
ways to customize our plots before we call show. Some of the simplest are functions
that label the x and y axes:
pyplot.xlabel('Year')
pyplot.ylabel('Fish Population Size')
Incorporating the plotting code yields the following function, whose output is shown
in Figure 4.2.
def pond(years, initialPopulation, harvest):
    """ (docstring omitted) """
    population = initialPopulation
    populationList = [ ]
    populationList.append(initialPopulation)
    for year in range(years):
        population = 1.08 * population - harvest
        populationList.append(population)
    pyplot.plot(range(years + 1), populationList)    # assumes: import matplotlib.pyplot as pyplot
    pyplot.xlabel('Year')
    pyplot.ylabel('Fish Population Size')
    pyplot.show()
    return population
For more complex plots, we can alter the scales of the axes, change the color and
style of the curves, and label multiple curves on the same plot. See Appendix B.4
for a sample of what is available. Some of the options must be specified as keyword
arguments of the form name = value. For example, to color a curve in a plot red
and specify a label for the plot legend, you would call something like this:
pyplot.plot(x, y, color = 'red', label = 'Bass population')
pyplot.legend()    # creates a legend from labeled lines
Exercises
4.2.1. Modify the countLinks function on Page 123 so that it uses matplotlib to plot
the number of nodes on the x axis and the maximum number of links on the y
axis. Your resulting plot should look like the one in Figure 4.3.
4.2.2. Modify the profitPlot function on Page 128 so that it uses matplotlib to
plot ticket price on the x axis and profit on the y axis. (Remove the tortoise
parameter.) Your resulting plot should look like the one in Figure 4.4. To get the
correct prices (in half dollar increments) on the x axis, you will need to create a
second list of x values and append the realPrice to it in each iteration.
Figure 4.3 Plot for Exercise 4.2.1. Figure 4.4 Plot for Exercise 4.2.2.
4.2.3. Modify your growth1 function from Exercise 4.1.21 so that it uses matplotlib to
plot days on the x axis and the total population on the y axis. Create a plot that
shows the growth of the population over 30 days.
4.2.4. Modify your growth3 function from Exercise 4.1.23 so that it uses matplotlib to
plot days on the x axis and the total population on the y axis. Create a plot that
shows the growth of the population over 30 days.
4.2.5. Modify your invest function from Exercise 4.1.35 so that it uses matplotlib to
plot months on the x axis and your total accumulated investment amount on the
y axis. Create a plot that shows the growth of an investment of $50 per month for
ten years growing at an annual rate of 8%.
A while loop has the general form

while <condition>:
    <body>

where <condition> is replaced with a Boolean expression that evaluates to either True or
False, and the <body> is replaced with statements constituting the body of the loop.
The loop checks the value of the condition before each iteration. If the condition
is true, it executes the statements in the body of the loop, and then checks the
condition again. If the condition is false, the body of the loop is skipped, and the
loop ends.
We will talk more about building Boolean expressions in the next chapter; for now
we will only need very simple ones like population > 0. This Boolean expression is
true if the value of population is positive, and false otherwise. Using this Boolean
expression in the while loop in the following function, we can find the year that
the fish population drops to 0.
def yearsUntilZero(initialPopulation, harvest):
    """Computes the number of years until the fish population drops to zero.
    Parameters:
        initialPopulation: the initial population size
        harvest: the size of the annual harvest
    """
    population = initialPopulation
    year = 0
    while population > 0:
        population = 1.08 * population - harvest
        year = year + 1
    return year
First, notice that if the condition is false when the while loop is first reached, the body never executes at all. A loop that sometimes does not iterate at all is generally not a bad thing, and can
even be used to our advantage. In this case, if population were initially zero, the
function would return zero because the value of year would never be incremented
in the loop. And this is correct; the population dropped to zero in year zero, before
the clock started ticking beyond the initial population size. But it is something that
one should always keep in mind when designing algorithms involving while loops.
Second, a while loop may become an infinite loop. For example, suppose
initialPopulation is 12000 and harvest is 800 instead of 1500. In this case,
as we saw on Page 119, the population size increases every year instead. So the
population size will never reach zero and the loop condition will never be false, so
the loop will iterate forever. For this reason, we must always make sure that the
body of a while loop makes progress toward the loop condition becoming false.
Let’s look at one more example. Suppose we have $1000 to invest and we would
like to know how long it will take for our money to double in size, growing at
5% per year. To answer this question, we can create a loop like the following that
compounds 5% interest each year:
amount = 1000
while ???:
    amount = 1.05 * amount
print(amount)
We want the loop to stop iterating when amount reaches 2000. Therefore, we want
the loop to continue to iterate while amount < 2000.
amount = 1000
while amount < 2000:
    amount = 1.05 * amount
print(amount)
Reflection 4.18 What is printed by this block of code? What does this result tell
us?
Once the loop is done iterating, the final amount is printed (approximately $2078.93),
but this does not answer our question.
Reflection 4.19 How do we figure out how many years it takes for the $1000 to
double?
To answer our question, we need to count the number of times the while loop
iterates, which is very similar to what we did in the yearsUntilZero function. We
can introduce another variable that is incremented in each iteration, and print its
value after the loop, along with the final value of amount:
amount = 1000
while amount < 2000:
    amount = 1.05 * amount
    year = year + 1
print(year, amount)
Reflection 4.20 Make these changes and run the code again. Now what is printed?
Oops, an error message is printed, telling us that the name year is undefined.
The problem is that we did not initialize the value of year before the loop. Therefore,
the first time year = year + 1 was executed, year was undefined on the right
side of the assignment statement. Adding one statement before the loop fixes the
problem:
amount = 1000
year = 0
while amount < 2000:
    amount = 1.05 * amount
    year = year + 1
print(year, amount)
Reflection 4.22 Now what is printed by this block of code? In other words, how
many years until the $1000 doubles?
We will see some more examples of while loops later in this chapter, and again in
Section 5.5.
Exercises
4.3.1. Suppose you put $1000 into the bank and you get a 3% interest rate compounded
annually. How would you use a while loop to determine how long will it take for
your account to have at least $1200 in it?
4.3.2. Repeat the last question, but this time write a function
interest(amount, rate, target)
that takes the initial amount, the interest rate, and the target amount as parameters.
The function should return the number of years it takes to reach the target amount.
4.3.3. Since while loops are more general than for loops, we can emulate the behavior
of a for loop with a while loop. For example, we can emulate the behavior of the
for loop
for counter in range(10):
    print(counter)
with the while loop
counter = 0
while counter < 10:
    print(counter)
    counter = counter + 1
Execute both loops “by hand” to make sure you understand how these loops are
equivalent.
(a) What happens if we omit counter = 0 before the while loop? Why does
this happen?
(b) What happens if we omit counter = counter + 1 from the body of the
while loop? What does the loop print?
(c) Show how to emulate the following for loop with a while loop:
for index in range(3, 12):
    print(index)
(d) Show how to emulate the following for loop with a while loop:
for index in range(12, 3, -1):
    print(index)
4.3.4. In the profitTable function on Page 127, we used a for loop to indirectly consider
all ticket prices divisible by a half dollar. Rewrite this function so that it instead
uses a while loop that increments price by $0.50 in each iteration.
4.3.5. A zombie can convert two people into zombies every day. Starting with just one
zombie, how long would it take for the entire world population (7 billion people)
to become zombies? Write a function
zombieApocalypse()
that returns the answer to this question.
4.3.6. Tribbles increase at the rate of 50% per hour (rounding down if there are an odd
number of them). How long would it take 10 tribbles to reach a population size of
1 million? Write a function
tribbleApocalypse()
that returns the answer to this question.
4.3.7. Vampires can each convert v people a day into vampires. However, there is a band
of vampire hunters that can kill k vampires per day. If a coven of vampires starts
with vampires members, how long before a town with a population of people
becomes a town with no humans left in it? Write a function
vampireApocalypse(v, k, vampires, people)
that returns the answer to this question.
4.3.8. An amoeba can split itself into two once every h hours. How many hours does it
take for a single amoeba to become target amoebae? Write a function
amoebaGrowth(h, target)
that returns the answer to this question.
4.4 CONTINUOUS MODELS
Since both the growth rate of 0.08 and the harvest of 1500 are based on 1 year, we
have divided both of them by 12.
Reflection 4.23 Is the final value of population the same in both cases?
If the initial value of population is 12000, the value of population after one
annual update is 11460.0 while the final value after 12 monthly updates is
11439.753329049303. Because the rate is “compounding” monthly, it reduces the
population more quickly.
This is exactly how bank loans work. The bank will quote an annual percentage
rate (APR) of, say, 6% (or 0.06) but then compound interest monthly at a rate of
6/12% = 0.5% (or 0.005), which means that the actual annual rate of interest you are
being charged, called the annual percentage yield (APY), is actually (1 + 0.005)^12 − 1 ≈
0.0617 = 6.17%. The APR, which is really defined to be the monthly rate times 12,
is sometimes also called the “nominal rate.” So we can say that our fish population
is increasing at a nominal rate of 8%, but updated every month.
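As a quick check of this arithmetic (a small illustrative snippet, not part of the text's program), we can compute the APY corresponding to a 6% APR compounded monthly:

apr = 0.06                           # nominal annual rate (APR)
monthlyRate = apr / 12               # 0.5% per month
apy = (1 + monthlyRate) ** 12 - 1    # effective annual rate (APY)
print(apy)                           # about 0.0617, i.e., 6.17%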
Difference equations
A population model like this is expressed more accurately with a difference equation,
also known as a recurrence relation. If we let P (t) represent the size of the fish
population at the end of year t, then the difference equation that defines our original
model is
P (t) = P (t − 1) + 0.08 ⋅ P (t − 1) − 1500

where the term 0.08 ⋅ P (t − 1) is the previous year's proportional population increase and 1500 is the harvest,
or, equivalently,
P (t) = 1.08 ⋅ P (t − 1) − 1500.
In other words, the size of the population at the end of year t is 1.08 times the size
of the population at the end of the previous year (t − 1), minus 1500. The initial
population or, more formally, the initial condition is P (0) = 12,000. We can find
the projected population size for any given year by starting with P (0), and using
the difference equation to compute successive values of P (t). For example, suppose
we wanted to know the projected population size four years from now, represented
by P (4). We start with the initial condition: P (0) = 12,000. Then, we apply the
difference equation to t = 1:

P (1) = 1.08 ⋅ P (0) − 1500 = 1.08 ⋅ 12,000 − 1500 = 11,460.

Next, we apply it to t = 2:

P (2) = 1.08 ⋅ P (1) − 1500 = 1.08 ⋅ 11,460 − 1500 = 10,876.8.

Continuing,

P (3) = 1.08 ⋅ P (2) − 1500 = 1.08 ⋅ 10,876.8 − 1500 = 10,246.94

and

P (4) = 1.08 ⋅ P (3) − 1500 = 1.08 ⋅ 10,246.94 − 1500 = 9,566.6952.
So this model projects that the bass population in 4 years will be about 9,566. This is the
same process we followed in our for loop in Section 4.1.
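The same hand computation can be written as a short loop; this minimal sketch prints P (1) through P (4) (matching the values above, up to the rounding of intermediate results):

population = 12000                          # P(0)
for t in range(1, 5):
    population = 1.08 * population - 1500   # apply the difference equation
    print(t, population)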
To turn this discrete model into a continuous model, we define a small update
interval, which is customarily named ∆t (∆ represents “change,” so ∆t represents
“change in time”). If, for example, we want to update the size of our population
every month, then we let ∆t = 1/12. Then we express our difference equation as

P (t) = P (t − ∆t) + (0.08 ⋅ P (t − ∆t) − 1500) ⋅ ∆t
This difference equation is defining the population size at the end of year t in terms
of the population size one ∆t fraction of a year ago. For example, if t is 3 and ∆t
is 1/12, then P (t) represents the size of the population at the end of year 3 and
P (t − ∆t) represents the size of the population at the end of "year 2 11/12," equivalent
to one month earlier. Notice that both the growth rate and the harvest number are
scaled by ∆t, just as we did in the for loop on Page 145.
To implement this model, we need to make some analogous changes to the
algorithm from Page 119. First, we need to pass in the value of ∆t as a parameter
so that we have control over the accuracy of the approximation. Second, we need
to modify the number of iterations in our loop to reflect 1/∆t decay events each
year; the number of iterations becomes years ⋅ (1/∆t) = years/∆t. Third, we need
to alter how the accumulator is updated in the loop to reflect this new type of
difference equation. These changes are reflected below. We use dt to represent ∆t.
Figure 4.5 The plot produced by calling pond(15, 12000, 1500, 0.01).
population = initialPopulation
for step in range(1, int(years / dt) + 1):
    population = population + (0.08 * population - harvest) * dt
return population
We start the for loop at one instead of zero because the first iteration of the loop
represents the first time step of the simulation. The initial population size assigned
to population before the loop represents the population at time zero.
To plot the results of this simulation, we use the same technique that we used in
Section 4.2. But we also use a list accumulator to create a list of time values for the
x axis because the values of the index variable step no longer represent years. In
the following function, the value of step * dt is assigned to the variable t, and
then appended to a list named timeList.
def pond(years, initialPopulation, harvest, dt):
    """Approximate the fish population with a continuous model.

    Parameters:
        years: number of years to simulate
        initialPopulation: the initial population size
        harvest: the size of the annual harvest
        dt: value of "Delta t" in the simulation (fraction of a year)

    Return value: the final population size
    """
    population = initialPopulation
    populationList = [initialPopulation]
    timeList = [0]
    t = 0
    for step in range(1, int(years / dt) + 1):
        population = population + (0.08 * population - harvest) * dt
        populationList.append(population)
        t = step * dt
        timeList.append(t)
    pyplot.plot(timeList, populationList)
    pyplot.xlabel('Year')
    pyplot.ylabel('Fish Population Size')
    pyplot.show()
    return population
Figure 4.5 shows a plot produced by calling this function with ∆t = 0.01. Compare
this plot to Figure 4.2. To actually come close to approximating a real continuous
process, we need to use very small values of ∆t. But there are tradeoffs involved in
doing so, which we discuss in more detail later in this section.
Radiocarbon dating
When archaeologists wish to know the ages of organic relics, they often turn to
radiocarbon dating. Both carbon-12 (¹²C) and carbon-14 (or radiocarbon, ¹⁴C)
are isotopes of carbon that are present in our atmosphere in a relatively constant
proportion. While carbon-12 is a stable isotope, carbon-14 is unstable and decays at
a known rate. All living things ingest both isotopes, and die possessing them in the
same proportion as the atmosphere. Thereafter, an organism’s acquired carbon-14
decays at a known rate, while its carbon-12 remains intact. By examining the current
ratio of carbon-12 to carbon-14 in organic remains (up to about 60,000 years old),
and comparing this ratio to the known ratio in the atmosphere, we can approximate
how long ago the organism died.
The annual decay rate (more correctly, the decay constant²) of carbon-14 is
about
k = −0.000121.
Radioactive decay is a continuous process, rather than one that occurs at discrete
intervals. Therefore, Q(t), the quantity of carbon-14 present at the beginning of year
t, needs to be defined in terms of the value of Q(t − ∆t), the quantity of carbon-14
a small ∆t fraction of a year ago. Therefore, the difference equation modeling the
decay of carbon-14 is

Q(t) = Q(t − ∆t) + k ⋅ Q(t − ∆t) ⋅ ∆t
Since the decay constant k is based on one year, we scale it down for an interval
of length ∆t by multiplying it by ∆t. We will represent the initial condition with
Q(0) = q, where q represents the initial quantity of carbon-14.
Although this is a completely different application, we can implement the model
the same way we implemented our continuous fish population model.
def decayC14(originalAmount, years, dt):
    """Approximate the continuous decay of carbon-14.

    Parameters:
        originalAmount: the original quantity of carbon-14 (g)
        years: number of years to simulate
        dt: value of "Delta t" in the simulation (fraction of a year)
    """
    k = -0.000121
    amount = originalAmount
    t = 0
    timeList = [0]          # x values for plot
    amountList = [amount]   # y values for plot
    for step in range(1, int(years / dt) + 1):
        amount = amount + k * amount * dt   # apply the difference equation
        t = step * dt
        timeList.append(t)
        amountList.append(amount)
    pyplot.plot(timeList, amountList)
² The probability of a carbon-14 atom decaying in a very small fraction ∆t of a year is k∆t.
Figure 4.6 Plot of carbon-14 decay generated with decayC14(100, 20000, 0.01).
    pyplot.xlabel('Years')
    pyplot.ylabel('Quantity of carbon-14')
    pyplot.show()
    return amount
Like all of our previous accumulators, this function initializes our accumulator
variable, named amount, before the loop. Then the accumulator is updated in the
body of the loop according to the difference equation above. Figure 4.6 shows an
example plot from this function.
Reflection 4.25 How much of 100 g of carbon-14 remains after 5,000 years of
decay? Try various ∆t values ranging from 1 down to 0.001. What do you notice?
In calculus, this continuous decay process is described by the differential equation dQ/dt = k ⋅ Q(t), whose exact solution is

Q(t) = Q(0) ⋅ e^(kt)

where Q(0) is the original amount of carbon-14. We can use this equation to directly
compute how much of 1000 g of carbon-14 would be left after 2,000 years with
1000 * math.exp(-0.000121 * 2000), which is about 785 g.
Although simple differential equations like this are easily solved if you know calcu-
lus, most realistic differential equations encountered in the sciences are not, making
approximate iterative solutions essential. The approximation technique we are using
in this chapter, called Euler’s method , is the most fundamental, and introduces error
proportional to ∆t. More advanced techniques seek to reduce the approximation error
further.
To get a sense of how decreasing values of ∆t affect the outcome, let’s look
at what happens in our decayC14 function with originalAmount = 1000 and
years = 2000. The table below contains these results, with the theoretically derived
answer in the last row (see Box 4.1). The error column shows the difference between
the result and this value for each value of dt. All values are rounded to three
significant digits to reflect the approximate nature of the decay constant. We can see
from the table that smaller values of dt do indeed provide closer approximations.
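The comparison just described can be scripted. The following sketch assumes the decayC14 function defined earlier (each call will also display its plot) and uses the closed-form value from Box 4.1 as the reference:

import math

k = -0.000121
exact = 1000 * math.exp(k * 2000)        # theoretically derived answer
for dt in [1, 0.1, 0.01, 0.001]:
    approx = decayC14(1000, 2000, dt)    # Euler approximation
    print(dt, approx, approx - exact)    # the error shrinks with dt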
Reflection 4.26 What is the relationship between the value of dt and the error?
What about between the value of dt and the execution time?
Each row in the table represents a computation that took ten times as long as that
in the previous row because the value of dt was ten times smaller. But the error
is also ten times smaller. Is this tradeoff worthwhile? The answer depends on the
situation. Certainly using dt = 0.0001 is not worthwhile because it gives the same
answer (to three significant digits) as dt = 0.001, but takes ten times as long.
These types of tradeoffs — quality versus cost — are common in all fields, and
computer science is no exception. Building a faster memory requires more expensive
technology. Ensuring more accurate data transmission over networks requires more
overhead. And finding better approximate solutions to hard problems requires more
time.
Propagation of errors
In both the pond and decayC14 functions in this section, we skirted a very subtle
error that is endemic to numerical computer algorithms. Recall from Section 2.2
that computers store numbers in binary and with finite precision, resulting in slight
errors, especially with very small floating point numbers. But a slight error can
become magnified in an iterative computation. This would have been the case in
the for loops in the pond and decayC14 functions if we had accumulated the value
of t by adding dt in each iteration with

t = t + dt

To see the problem, suppose dt is 0.0001 and consider the following loop, which adds it to t one million times:

dt = 0.0001
iterations = 1000000
t = 0
for index in range(iterations):
    t = t + dt

The loop accumulates the value 0.0001 one million times, so the correct value for t
is 0.0001 ⋅ 1,000,000 = 100. However, by running the code, we see that the final value
of t is actually 100.00000000219612, a small fraction over the correct answer. In
some applications, even errors this small may be significant. And it can get even
worse with more iterations. Scientific computations can often run for days or weeks,
and the number of iterations involved can blow up errors significantly.
Reflection 4.27 Run the code above with 10 million and 100 million iterations.
What do you notice about the error?
To avoid this kind of error, we instead assigned the product of dt and the current
iteration number, step, to t:
t = step * dt
In this way, the value of t is computed from only one arithmetic operation instead
of many, reducing the potential error.
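A small experiment (an illustration, not from the text) makes the difference between the two approaches visible:

dt = 0.0001
iterations = 1000000

total = 0
for index in range(iterations):
    total = total + dt        # accumulate: tiny rounding errors build up
print(total)                  # slightly more than 100

print(iterations * dt)        # a single multiplication: much closer to 100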
Simulating an epidemic
Real populations interact with each other and the environment in complex ways.
Therefore, to accurately model them requires an interdependent set of difference
equations, called coupled difference equations. In 1927, Kermack and McKendrick
[23] introduced such a model for the spread of infectious disease called the SIR model .
The “S” stands for the “susceptible” population, those who may still acquire the
disease; “I” stands for the “infected” population, and “R” stands for the “recovered”
population. In this model, we assume that, once recovered, an individual has built
an immunity to the disease and cannot reacquire it. We also assume that the total
population size is constant, and no one dies from the disease. Therefore, an individual
moves from a “susceptible” state to an “infected” state to a “recovered” state, where
she remains, as pictured below.
These assumptions apply very well to common viral infections, like the flu.
A virus like the flu travels through a population more quickly when there are
more infected people with whom a susceptible person can come into contact. In
other words, a susceptible person is more likely to become infected if there are more
infected people. Also, since the total population size does not change, an increase
in the number of infected people implies an identical decrease in the number who
are susceptible. Similarly, a decrease in the number who are infected implies an
identical increase in the number who have recovered. Like most natural processes,
the spread of disease is fluid or continuous, so we will need to model these changes
over small intervals of ∆t, as we did in the radioactive decay model.
We need to design three difference equations that describe how the sizes of the
three groups change. We will let S(t) represent the number of susceptible people on
day t, I(t) represent the number of infected people on day t, and R(t) represent
the number of recovered people on day t.
The recovered population has the most straightforward difference equation. The
size of the recovered group only increases; when infected people recover they move
from the infected group into the recovered group. The number of people who recover
at each step depends on the number of infected people at that time and the recovery
rate r: the average fraction of people who recover each day.
Reflection 4.28 What factors might affect the recovery rate in a real outbreak?
Since we will be dividing each day into intervals of length ∆t, we will need to use a
scaled recovery rate of r ⋅ ∆t for each interval. So the difference equation describing
the size of the recovered group on day t is

R(t) = R(t − ∆t) + r ⋅ I(t − ∆t) ⋅ ∆t

Since no one has yet recovered on day 0, we set the initial condition to be R(0) = 0.
Next, we will consider the difference equation for S(t). The size of the susceptible
population only decreases by the number of newly infected people. This decrease
depends on the number of susceptible people, the number of infected people with
whom they can make contact, and the rate at which these potential interactions
produce a new infection. The number of possible interactions between susceptible
and infected individuals at time t − ∆t is simply their product: S(t − ∆t) ⋅ I(t − ∆t).
If we let d represent the rate at which these interactions produce an infection, then
our difference equation is

S(t) = S(t − ∆t) − d ⋅ S(t − ∆t) ⋅ I(t − ∆t) ⋅ ∆t
Reflection 4.29 What factors might affect the infection rate in a real outbreak?
If N is the total size of our population, then the initial condition is S(0) = N − 1
because we will start with one infected person, leaving N − 1 who are susceptible.
The difference equation for the infected group depends on the number of suscep-
tible people who have become newly infected and the number of infected people who
are newly recovered. These numbers are precisely the number leaving the susceptible
group and the number entering the recovered group, respectively. We can simply
copy those from the equations above:

I(t) = I(t − ∆t) + d ⋅ S(t − ∆t) ⋅ I(t − ∆t) ⋅ ∆t − r ⋅ I(t − ∆t) ⋅ ∆t

The initial condition is I(0) = 1, the single infected person with whom we start. A
function implementing this model might begin as follows; completing the body is
left as Exercise 4.4.5.
def SIR(population, days, dt):
    """Simulate the SIR model of infection.

    Parameters:
        population: the population size
        days: number of days to simulate
        dt: the value of "Delta t" in the simulation
            (fraction of a day)
    """
Figure 4.7 shows the output of the model with 2,200 individuals (students at a small
college, perhaps?) over 30 days. We can see that the infection peaks after about 2
weeks, then decreases steadily. After 30 days, the virus is just about gone, with only
about 40 people still infected. We also notice that not everyone is infected after 30
days; about 80 people are still healthy.
Reflection 4.30 Look at the relationships among the three population sizes over
time in Figure 4.7. How do the sizes of the susceptible and recovered populations cause
the decline of the infected population after the second week? What other relationships
do you notice?
When we are implementing a coupled model, we need to be careful to compute the
change in each population size based on population sizes from the previous iteration,
not on sizes that have already been updated in the current iteration.
Figure 4.7 Output from the SIR model with 2,200 individuals over 30 days with
∆t = 0.01.
Interestingly, the SIR model can also be used to model the diffusion of ideas, fads, or
memes in a population. A “contagious” idea starts with a small group of people and
can “infect” others over time. As more people adopt the idea, it spreads more quickly
because there are more potential interactions between those who have adopted the
idea and those who have not. Eventually, people move on to other things and forget
about the idea, thereby “recovering.”
Exercises
Write a function for each of the following problems.
4.4.1. Suppose we have a pair of rabbits, one male and one female. Female rabbits are
mature and can first mate when they are only one month old, and have a gestation
period of one month. Suppose that every mature female rabbit mates every month
and produces exactly one male and one female offspring one month later. No rabbit
ever dies.
(a) Write a difference equation R(m) representing the number of female rabbits
at the end of each month. (Note that this is a discrete model; there are no
∆t’s.) Start with the first month and compute successive months until you
see a pattern. Don’t forget the initial condition(s). To get you started:
• R(0) = 1, the original pair
• R(1) = 1, the original pair now one month old
• R(2) = 2, the original pair plus a newborn pair
• R(3) = 3, the original pair plus a newborn pair plus a one-month-old
pair
• R(4) = 5, the pairs from the previous generation plus two newborn pairs
⋮
• R(m) = ?
(b) Write a function
rabbits(months)
that uses your difference equation from part (a) to compute and return the
number of rabbits after the given number of months. Your function should
also plot the population growth using matplotlib. (The sequence of numbers
generated in this problem is called the Fibonacci sequence.)
4.4.2. Consider Exercise 4.1.37, but now assume that a bacteria colony grows continuously
with a growth rate of 0.1∆t per small fraction ∆t of an hour.
(a) Write a difference equation for B(t), the size of the bacteria colony, using
∆t’s to approximately model continuous growth.
(b) Write a function
bacteria(population, dt, days)
that uses your difference equation from part (a) to compute and return the
number of bacteria in the colony after the given number of days. The other
parameters, population and dt, are the initial size of the colony and the
step size (∆t), respectively. Your function should also plot the population
growth using matplotlib.
4.4.3. Radioactive elements like carbon-14 are normally described by their half-life, the
time required for a quantity of the element to decay to half its original quantity.
Write a function
halflifeC14(originalAmount, dt)
that computes the half-life of carbon-14. Your function will need to simulate
radioactive decay until the amount is equal to half of originalAmount.
The known half-life of carbon-14 is 5,730 ± 40 years. How close does this approxi-
mation come? Try it with different values of dt.
4.4.4. Finding the half-life of carbon-14 is a special case of radiocarbon dating. When an
organic artifact is dated with radiocarbon dating, the fraction of extant carbon-14
is found relative to the content in the atmosphere. Let’s say that the fraction
in an ancient piece of parchment is found to be 70%. Then, to find the age
of the artifact, we can use a generalized version of halflifeC14 that iterates
while amount > originalAmount * 0.7. Show how to generalize the halflifeC14
function in the previous exercise by writing a new function
carbonDate(originalAmount, fractionRemaining, dt)
that returns the age of an artifact with the given fraction of carbon-14 remaining.
Use this function to find the approximate age of the parchment.
4.4.5. Complete the implementation of the SIR simulation. Compare your results to
Figure 4.7 to check your work.
4.4.6. Run your implementation of the SIR model for longer periods of time than that
shown in Figure 4.7. Given enough time, will everyone become infected?
4.4.7. Suppose that enough people have been vaccinated that the infection rate is cut in
half. What effect do these vaccinations have?
4.4.8. The SIS model represents an infection that does not result in immunity. In other
words, there is no “recovered” population; people who have recovered re-enter the
“susceptible” population.
(a) Write difference equations for this model.
(b) Copy and then modify the SIR function (renaming it SIS) so that it imple-
ments your difference equations.
(c) Run your function with the same parameters that we used for the SIR model.
What do you notice? Explain the results.
4.4.9. Suppose there are two predator species that compete for the same prey, but do
not directly harm each other. We might expect that, if the supply of prey was low
and members of one species were more efficient hunters, then this would have a
negative impact on the health of the other species. This can be modeled through
the following pair of difference equations. P (t) and Q(t) represent the populations
of predator species 1 and 2, respectively, at time t.

P (t) = P (t − ∆t) + (bP ⋅ P (t − ∆t) − dP ⋅ P (t − ∆t) ⋅ Q(t − ∆t)) ⋅ ∆t
Q(t) = Q(t − ∆t) + (bQ ⋅ Q(t − ∆t) − dQ ⋅ P (t − ∆t) ⋅ Q(t − ∆t)) ⋅ ∆t

The initial conditions are P (0) = p and Q(0) = q, where p and q represent the initial
population sizes of the first and second predator species, respectively. The values
bP and dP are the birth rate and death rates (or, more formally, proportionality
constants) of the first predator species, and the values bQ and dQ are the birth
rate and death rate for the second predator species. In the first equation, the term
bP ⋅ P (t − ∆t) represents the net number of births per month for the first predator
species, and the term dP ⋅ P (t − ∆t) ⋅ Q(t − ∆t) represents the number of deaths
per month. Notice that this term is dependent on the sizes of both populations:
P (t − ∆t) ⋅ Q(t − ∆t) is the number of possible (indirect) competitions between
individuals for food and dP is the rate at which one of these competitions results
in a death (from starvation) in the first predator species.
This type of model is known as indirect, exploitative competition because the two
predator species do not directly compete with each other (i.e., eat each other), but
they do compete indirectly by exploiting a common food supply.
Write a function
compete(pop1, pop2, birth1, birth2, death1, death2, years, dt)
that implements this model for a generalized indirect competition scenario, plotting
the sizes of the two populations over time. Run your program using
• p = 21 and q = 26
• bP = 1.0 and dP = 0.2; bQ = 1.02 and dQ = 0.25
• dt = 0.001; 6 years
and explain the results. Here is a “skeleton” of the function to get you started.
def compete(pop1, pop2, birth1, birth2, death1, death2, years, dt):
    """ YOUR DOCSTRING HERE """

    pop1List = [pop1]
    pop2List = [pop2]
    timeList = [0]
    for step in range(1, int(years / dt) + 1):
        # YOUR CODE GOES HERE
Suppose an ant starts walking from one end of a rubber rope that is 1 meter long,
covering 10 cm during the first minute. At the end of that minute,
we stretch the rubber rope uniformly by 1 meter, so it is now 2 meters long. During
the next minute, the ant walks another 10 cm, and then we stretch the rope again
by 1 meter. If we continue this process indefinitely, will the ant ever reach the other
end of the rope? If it does, how long will it take?
The answer lies in counting what fraction of the rope the ant traverses in each
minute. During the first minute, the ant walks 1/10 of the distance to the end of the
rope. After stretching, the ant has still traversed 1/10 of the distance because the
portion of the rope on which the ant walked was doubled along with the rest of the
rope. However, in the second minute, the ant’s 10 cm stroll only covers 10/200 = 1/20
of the entire distance. Therefore, after 2 minutes, the ant has covered 1/10 + 1/20
of the rope. During the third minute, the rope is 3 m long, so the ant covers only
10/300 = 1/30 of the distance. This pattern continues, so our problem boils down to
whether the following sum ever reaches 1.
1/10 + 1/20 + 1/30 + ⋯ = (1/10) ⋅ (1 + 1/2 + 1/3 + ⋯)
Naturally, we can answer this question using an accumulator. But how do we add
these fractional terms? In Exercise 4.1.25, you may have computed 1 + 2 + 3 + ⋯ + n:
def sum(n):
    total = 0
    for number in range(1, n + 1):
        total = total + number
    return total
In each iteration of this for loop, we add the value of number to the accumulator
variable total. Since number is assigned the values 1, 2, 3, . . . , n, total has the sum
of these values after the loop. To compute the fraction of the rope traveled by the
ant, we can modify this function to add 1 / number in each iteration instead, and
then multiply the result by 1/10:
def ant(n):
    """Simulates the "ant on a rubber rope" problem. The rope
    is initially 1 m long and the ant walks 10 cm each minute.

    Parameter:
        n: the number of minutes the ant walks

    Return value:
        fraction of the rope traveled by the ant in n minutes
    """
    total = 0
    for number in range(1, n + 1):
        total = total + (1 / number)
    return total * 0.1
To answer our question with this function, we need to try several values of n to see
if we can find a sufficiently high value for which the sum exceeds 1. If we find such
a value, then we need to work with smaller values until we find the value of n for
which the sum first reaches or exceeds 1.
Reflection 4.31 Using at most 5 calls to the ant function, find a range of minutes
that answers the question.
For example, ant(100) returns about 0.52, ant(1000) returns about 0.75,
ant(10000) returns about 0.98, and ant(15000) returns about 1.02. So the ant
will reach the other end of the rope after 10,000–15,000 minutes. As cumbersome as
that was, continuing it to find the exact number of minutes required would be far
worse.
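The trial-and-error search can also be scripted (assuming the ant function defined earlier):

for n in [100, 1000, 10000, 15000]:
    print(n, ant(n))    # about 0.52, 0.75, 0.98, and 1.02, respectively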
Reflection 4.32 How would we write a function to find when the ant first reaches
or exceeds the end of the rope? (Hint: this is similar to the carbon-14 half-life
problem.) We will leave the answer as an exercise.
Reflection 4.33 Knowing that Hn ≈ ln n + 0.577, how can you approximate how
long it will take the ant to reach the end of the rope if it walks only 1 cm each minute?

In this case, the ant's first step covers only 1/100 of the total distance, so the
harmonic sum must reach 100; in other words, we need to solve

100 = ln n + 0.577

for n. This is the same as

e^(100 − 0.577) = n.

In Python, we can find the answer with math.exp(100 - 0.577), which gives about
1.5 × 10^43 minutes, a long time indeed.
Approximating π
The value π is probably the most famous mathematical constant. There have been
many infinite series found over the past 500 years that can be used to approximate
π. One of the most famous is known as the Leibniz series, named after Gottfried
Leibniz, the co-inventor of calculus:
π = 4 ⋅ (1 − 1/3 + 1/5 − 1/7 + ⋯)
Like the harmonic series approximation of the natural logarithm, the more terms
we compute of this series, the closer we get to the true value of π. To compute this
sum, we need to identify a pattern in the terms, and relate them to the values of
the index variable in a for loop. Then we can fill in the red blank line below with
an expression that computes the ith term from the value of the index variable i.
def leibniz(terms):
    """Computes a partial sum of the Leibniz series.

    Parameter:
        terms: the number of terms to add

    Return value:
        the sum of the given number of terms
    """
    sum = 0
    for i in range(terms):
        sum = sum +
    pi = sum * 4
    return pi
To find the pattern, we can write down the values of the index variable next to
the values in the series to identify a relationship:

    i          0     1      2     3      4    ⋯
    ith term   1    −1/3   1/5   −1/7   1/9   ⋯
Ignoring the alternating signs for a moment, we can see that the absolute value of
the ith term is 1/(2i + 1).
To alternate the signs, we use the fact that −1 raised to an even power is 1, while
−1 raised to an odd power is −1. Since the even terms are positive and odd terms
are negative, the final expression for the ith term is

(−1)^i ⋅ 1/(2i + 1).
Therefore, the red assignment statement in our leibniz function should be
sum = sum + (-1) ** i / (2 * i + 1)
Reflection 4.34 Call the completed leibniz function with a series of increasing
arguments. What do you notice about how the values converge to π?
By examining several values of the function, you might notice that they alternate
between being greater and less than the actual value of π. Figure 4.9 illustrates this.
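To see this behavior for yourself, the completed leibniz function can be called with consecutive arguments, for example:

for terms in range(1, 10):
    print(terms, leibniz(terms))    # successive values alternate around pi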
A difference equation can also be used to approximate square roots:

X(k) = (1/2) ⋅ (X(k − 1) + n / X(k − 1)), with X(0) = 1.

Here, n is the value whose square root we want. The approximation of √n will be
better for larger values of k; X(20) will be closer to the actual square root than
X(10).
Similar to our previous examples, we can compute successive values of the
difference equation using iteration. In this case, each value is computed from the
previous value according to the formula above. If we let the variable name x
represent a term in the difference equation, then we can compute the kth term with
the following simple function:
def sqrt(n, k):
    """Approximate the square root of n.

    Parameters:
        n: the number to take the square root of
        k: number of iterations

    Return value:
        the approximate square root of n
    """
    x = 1.0
    for index in range(k):
        x = 0.5 * (x + n / x)
    return x
Reflection 4.35 Call the function above to approximate √10 with various values of
k. What value of k is necessary to match the value given by the math.sqrt function?
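A short comparison script (assuming the sqrt function above) might look like this:

import math

for k in range(1, 8):
    print(k, sqrt(10, k))
print(math.sqrt(10))    # 3.1622776601683795, for comparison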
Exercises
4.5.1. Recall Reflection 4.32: How would we write a function to find when the ant first
reaches or exceeds the end of the rope? (Hint: this is similar to the carbon-14
half-life problem.)
(a) Write a function to answer this question.
(b) How long does it take for the ant to traverse the entire rope?
(c) If the ant walks 5 cm each minute, how long does it take to reach the other
end?
4.5.2. Augment the ant function so that it also produces the plot in Figure 4.8.
4.5.3. The value e (Euler’s number, the base of the natural logarithm) is equal to the
infinite sum
e = 1 + 1/1! + 1/2! + 1/3! + ⋯
Write a function
e(n)
that approximates the value of e by computing and returning the value of n terms of
this sum. For example, calling e(4) should return the value 1+1/1+1/2+1/6 ≈ 2.667.
Your function should call the factorial function you wrote for Exercise 4.1.29 to
aid in the computation.
4.5.4. Calling the factorial function repeatedly in the function you wrote for the
previous problem is very inefficient because many of the same arithmetic operations
are being performed repeatedly. Explain where this is happening.
4.5.5. To avoid the problems suggested by the previous exercise, rewrite the function
from Exercise 4.5.3 without calling the factorial function.
4.5.6. Rather than specifying the number of iterations in advance, numerical algorithms
usually iterate until the absolute value of the current term is sufficiently small. At
this point, we assume the approximation is “good enough.” Rewrite the leibniz
function so that it iterates while the absolute value of the current term is greater
than 10⁻⁶.
4.5.7. Similar to the previous exercise, rewrite the sqrt function so that it iterates while
the absolute value of the difference between the current and previous values of x is
greater than 10⁻¹⁵.
4.5.8. The following expression, discovered in the 14th century by Indian mathematician
Madhava of Sangamagrama, is another way to compute π.
π = √12 ⋅ (1 − 1/(3 ⋅ 3) + 1/(5 ⋅ 3²) − 1/(7 ⋅ 3³) + ⋯)
Write a function
approxPi(n)
that computes n terms of this expression to approximate π. For example,
approxPi(3) should return the value
√12 ⋅ (1 − 1/(3 ⋅ 3) + 1/(5 ⋅ 3²)) ≈ √12 ⋅ (1 − 0.111 + 0.022) ≈ 3.156.
4.5.10. The Nilakantha series, named after Nilakantha Somayaji, a 15th century Indian
mathematician, is another infinite series for π:
π = 3 + 4/(2 ⋅ 3 ⋅ 4) − 4/(4 ⋅ 5 ⋅ 6) + 4/(6 ⋅ 7 ⋅ 8) − 4/(8 ⋅ 9 ⋅ 10) + ⋯
Write a function
nilakantha(terms)
that computes the given number of terms in the Nilakantha series.
4.5.11. The following infinite product was discovered by François Viète, a 16th century
French mathematician:
π = 2 ⋅ (2/√2) ⋅ (2/√(2 + √2)) ⋅ (2/√(2 + √(2 + √2))) ⋅ (2/√(2 + √(2 + √(2 + √2)))) ⋯
Write a function
viete(terms)
that computes the given number of terms in Viète's product. (Look at the
pattern carefully; it is not as hard as it looks if you base the denominator in each
term on the denominator of the previous term.)
4.6 SUMMING UP
Although we have solved a variety of different problems in this chapter, almost all
of the functions that we have designed have the same basic format:
def accumulator( ):
    sum =                          # initialize the accumulator
    for index in range( ):         # iterate some number of times
        sum = sum +                # add something to the accumulator
    return sum                     # return final accumulator value
The functions we designed differ primarily in what is added to the accumulator (the
red statement) in each iteration of the loop. Let’s look at three of these functions
in particular: the pond function from Page 119, the countLinks function from
Page 123, and the solution to Exercise 4.1.27 from Page 134, shown below.
def growth(finalAge):
    height = 95
    for age in range(4, finalAge + 1):
        height = height + 6
    return height

In the growth function, the constant value 6 is added to the accumulator in each iteration:

height = height + 6
In the countLinks function, the value of the index variable, minus one, is added to
the accumulator:
newLinks = node - 1
totalLinks = totalLinks + newLinks
And in the pond function, a factor of the accumulator itself is added in each iteration:
population = population + 0.08 * population # ignoring "- 1500"
(To simplify things, we will ignore the subtraction present in the original program.)
These three types of accumulators grow in three different ways. Adding a constant
value to the accumulator in each iteration, as in the growth function, results in a
final sum that is equal to the number of iterations times the constant value. In other
words, if the initial value is a, the constant added value is c, and the number of
iterations is n, the final value of the accumulator is a + cn. (In the growth function,
a = 95 and c = 6, so the final sum is 95 + 6n.) As n increases, cn increases by a
constant amount. This is called linear growth, and is illustrated by the blue line in
Figure 4.10.
Adding the value of the index variable to the accumulator, as in the countLinks
function, results in much faster growth. In the countLinks function, the final value
Box 4.2: Two ways to derive a formula for the sum

1 + 2 + 3 + ⋯ + (n − 2) + (n − 1) + n

for any positive integer n. The first technique is to add the numbers in the sum from
the outside in. Notice that the sum of the first and last numbers is n + 1. Then, coming
in one position from both the left and right, we find that (n − 1) + 2 = n + 1 as well.
Next, (n − 2) + 3 = n + 1. This pattern will obviously continue, as we are subtracting 1
from the number on the left and adding 1 to the number on the right. In total, there
is one instance of n + 1 for every two terms in the sum, for a total of n/2 instances of
n + 1. Therefore, the sum is
1 + 2 + 3 + ⋯ + (n − 2) + (n − 1) + n = (n/2) ⋅ (n + 1) = n(n + 1)/2.
For example, 1 + 2 + 3 + ⋯ + 8 = (8 ⋅ 9)/2 = 36 and 1 + 2 + 3 + ⋯ + 1000 = (1000 ⋅ 1001)/2 = 500,500.
The second technique to derive this result is more visual. Depict each number in the
sum as a column of circles, as shown on the left below with n = 8.
The first column has n = 8 circles, the second has n − 1 = 7, etc. So the total number of
circles in this triangle is equal to the value we are seeking. Now make an exact duplicate
of this triangle, and place its mirror image to the right of the original triangle, as shown
on the right above. The resulting rectangle has n rows and n + 1 columns, so there are
a total of n(n + 1) circles. Since the number of circles in this rectangle is exactly twice
the number in the original triangle, the number of circles in the original triangle is
n(n + 1)/2. Based on this representation, numbers like 36 and 500,500 that are sums of
this form are called triangular numbers.
of the accumulator is
1 + 2 + 3 + ⋯ + (n − 1)
which is equal to
(1/2) ⋅ n ⋅ (n − 1) = (n² − n)/2.
Box 4.2 explains two clever ways to derive this result. Since this sum is proportional
to n2 , we say that it exhibits quadratic growth, as shown by the red curve in
Figure 4.10. This sum is actually quite handy to know, and it will surface again in
Chapter 11.

Finally, adding a multiple of the accumulator itself in each iteration, as in the pond
function, results in even faster growth. Because the accumulator is multiplied by 1.08
in each iteration, its value after n iterations is proportional to 1.08ⁿ. This kind of
growth, in which the accumulator is repeatedly multiplied by a constant factor greater
than one, is called exponential growth, and it eventually outpaces both linear and
quadratic growth (see Figure 4.10).
Exercises
4.6.1. Decide whether each of the following accumulators exhibits linear, quadratic, or
exponential growth.
(a) sum = 0
    for index in range(n):
        sum = sum + index * 2

(b) sum = 10
    for index in range(n):
        sum = sum + index / 2

(c) sum = 1
    for index in range(n):
        sum = sum + sum

(d) sum = 0
    for index in range(n):
        sum = sum + 1.2 * sum

(e) sum = 0
    for index in range(n):
        sum = sum + 0.01

(f) sum = 10
    for index in range(n):
        sum = 1.2 * sum
4.6.2. Look at Figure 4.10. For values of n less than about 80, the fast-growing exponential
curve is actually below the other two. Explain why.
4.6.3. Write a program to generate Figure 4.10.
4.7 FURTHER DISCOVERY
If you are interested in learning more about population dynamics models, and
computational modeling in general, a great source is Introduction to Computational
Science [52] by Angela Shiflet and George Shiflet.
4.8 PROJECTS
Project 4.1 Parasitic relationships
For this project, we assume that you have read the material on discrete difference
equations on Pages 145–146 of Section 4.4.
A parasite is an organism that lives either on or inside another organism for
part of its life. A parasitoid is a parasitic organism that eventually kills its host.
A parasitoid insect infects a host insect by laying eggs inside it then, when these
eggs later hatch into larvae, they feed on the live host. (Cool, huh?) When the host
eventually dies, the parasitoid adults emerge from the host body.
The Nicholson-Bailey model , first proposed by Alexander Nicholson and Victor
Bailey in 1935 [36], is a pair of difference equations that attempt to simulate the
relative population sizes of parasitoids and their hosts. We represent the size of the
host population in year t with H(t) and the size of the parasitoid population in
year t with P (t). Then the difference equations describing this model are
H(t) = r ⋅ H(t − 1) ⋅ e^(−a⋅P (t−1))
P (t) = c ⋅ H(t − 1) ⋅ (1 − e^(−a⋅P (t−1)))
where
• r is the average number of surviving offspring from an uninfected host,
• c is the average number of eggs that hatch inside a single host, and
• a is a scaling factor describing the searching efficiency or search area of the
parasitoids (higher is more efficient).
The value (1 − e^(−a⋅P (t−1))) is the probability that a host is infected when there are
P (t − 1) parasitoids, where e is Euler’s number (the base of the natural logarithm).
Therefore,
H(t − 1) ⋅ (1 − e^(−a⋅P (t−1)))
is the number of hosts that are infected during year t − 1. Multiplying this by c
gives us P (t), the number of new parasitoids hatching in year t. Notice that the
probability of infection grows exponentially as the size of the parasitoid population
grows. A higher value of a also increases the probability of infection.
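As a quick illustration of this probability (a small sketch using the parameter values from Question 4.1.2; it is not part of the project code you will write):

import math

a = 0.056          # searching efficiency
parasitoids = 12   # P(t - 1)

probability = 1 - math.exp(-a * parasitoids)   # probability a host is infected
print(probability)                             # roughly 0.49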
Question 4.1.1 Similarly explain the meaning of the difference equation for H(t).
(e^(−a⋅P (t−1)) is the probability that a host is not infected.)

Write a function

NB(hostPop, paraPop, r, c, a, years)

that uses these difference equations to plot both population sizes over time. Your
function should plot these values in two different ways (resulting in two different
plots). First, plot the host population size on the x-axis and the parasitoid population
size on the y-axis. So each point represents the two population sizes in a particular
year. Second, plot both population sizes on the y-axis, with time on the x-axis. To
show both population sizes on the same plot, call the pyplot.plot function for
each population list before calling pyplot.show. To label each line and include a
legend, see the end of Section 4.2.
Question 4.1.2 Write a main function that calls your NB function to simulate
initial populations of 24 hosts and 12 parasitoids for 35 years. Use values of r = 2,
c = 1, and a = 0.056. Describe and interpret the results.
Question 4.1.3 Run the simulation again with the same parameters, but this time
assign a to be -math.log(0.5) / paraPop. (This is a = − ln 0.5/12 ≈ 0.058, just
slightly above the original value of a.) What do you observe?
Question 4.1.4 Run the simulation again with the same parameters, but this time
assign a = 0.06. What do you observe?
Question 4.1.5 Based on these three simulations, what can you say about this
model and its sensitivity to the value of a?
A more realistic version of the model limits the growth of the host population with
a carrying capacity, replacing the difference equation for H(t) with

H(t) = e^(r⋅(1 − H(t−1)/K)) ⋅ H(t − 1) ⋅ e^(−a⋅P (t−1))

where K is the carrying capacity. In this new difference equation, the average number
of surviving host offspring, formerly r, is now represented by

e^(r⋅(1 − H(t−1)/K)).
Notice that, when the number of hosts H(t − 1) equals the carrying capacity K,
the exponent equals zero. So the number of surviving host offspring is e^0 = 1. In
general, as the number of hosts H(t − 1) gets closer to the carrying capacity K, the
exponent gets smaller and the value of the expression above gets closer to 1. At the
other extreme, when H(t − 1) is close to 0, the expression is close to e^r. So, overall,
the number of surviving offspring varies between 1 and e^r, depending on how close
H(t − 1) comes to the carrying capacity.
Write a function
NB_CC(hostPop, paraPop, r, c, a, K, years)
that implements this modified model and generates the same plots as the previous
function.
Question 4.1.7 Run your simulation with all three values of a that we used in
Part 1. How do these results differ from the prior simulation?
• the nominal annual interest rate (the monthly rate times twelve)
Write a function that computes the number of months required to pay off the loan with each monthly
payment amount. The interest on the loan balance should compound monthly at a
rate of rate / 100 / 12. Your function should also plot the loan balances, with
each payment amount, over time until both balances reach zero. Then it should
print the length of both repayment periods and how much sooner the loan will be
paid off if the higher monthly payment is chosen. For example, your output might
look like this:
If you pay $500.00 per month, the repayment period will be 13 years
and 11 months.
If you pay $750.00 per month, the repayment period will be 8 years
and 2 months.
If you pay $250.00 more per month, you will repay the loan 5 years
and 9 months earlier.
Question 4.2.1 How long would it take to pay off $20,000 in student loans with a
4% interest rate if you paid $100 per month? Approximately how much would you
have to pay per month to pay off the loan in ten years?
Question 4.2.2 If you run your program to determine how long it would take to
pay off the same loan if you paid only $50 per month, you should encounter a
problem. What is it?
Question 4.2.3 Suppose you are 30 and, after working for a few years, have
managed to save $6,000 for retirement. If you continue to invest $200 per month,
how much will you have when you retire at age 72 if your investment grows 3% per
year? How much more will you have if you invest $50 more per month?
Your savings will last until you are 99 years and 10 months old.
Question 4.2.4 How long will your retirement savings last if you follow the plan
outlined in Question 4.2.3 (investing $200 per month) and withdraw 4% at retirement?
2. Second, the rate of adoption is affected by word of mouth within the pop-
ulation. Members of the population who have already adopted the product
can influence those who have not. The more adopters there are, the more
potential interactions exist between adopters and non-adopters, which boosts
the adoption rate. The fraction of all potential interactions that are between
adopters and non-adopters in the previous time step is

A(t − ∆t) ⋅ (1 − A(t − ∆t))
The fraction of these interactions that result in a new adoption during one
week is called the social contagion. We will denote the social contagion by s.
So the fraction of new adopters due to social contagion during the time step
ending at week t is

s ⋅ A(t − ∆t) ⋅ (1 − A(t − ∆t)) ⋅ ∆t
The social contagion measures how successfully adopters are able to convince
non-adopters that they should adopt the product. At the extremes, if s = 1,
then every interaction between a non-adopter and an adopter results in the
non-adopter adopting the product. On the other hand, if s = 0, then the
current adopters cannot convince any non-adopters to adopt the product.
Putting these two parts together, the difference equation for the Bass diffusion
model is

A(t) = A(t − ∆t) + r ⋅ (1 − A(t − ∆t)) ⋅ ∆t + s ⋅ A(t − ∆t) ⋅ (1 − A(t − ∆t)) ⋅ ∆t
Question 4.3.1 Describe the picture and explain the pattern of new adoptions and
the resulting pattern of total adoptions over the 15-week launch.
Question 4.3.2 Now make r very small but leave s the same (r = 0.00001, s = 1.03),
and answer the same question. What kind of market does this represent?
Question 4.3.3 Now set r to be 100 times its original value and s to be zero
(r = 0.2, s = 0), and answer the first question again. What kind of market does this
represent?
while imitators can be influenced by either group. The numbers of influentials and
imitators in the population are NA and NB , respectively, so the total population
size is N = NA + NB . We will let A(t) now represent the fraction of the influential
population that has adopted the product at time t and let B(t) represent the fraction
of the imitator population that has adopted the product at time t.
The adoption rate of the influentials follows the same difference equation as
before, except that we will denote the adoption rate and social contagion for the
influentials with rA and sA .
A(t) = A(t − ∆t) + rA ⋅ (1 − A(t − ∆t)) ⋅ ∆t + sA ⋅ A(t − ∆t) ⋅ (1 − A(t − ∆t)) ⋅ ∆t
The adoption rate of the imitators will be different because they value the
opinions of both influentials and other imitators. Let rB and sB represent the
adoption rate and social contagion for the imitators. Another parameter, w (between
0 and 1), will indicate how much the imitators value the opinions of the influentials
over the other imitators. At the extremes, w = 1 means that the imitators are
influenced heavily by influentials and not at all by other imitators. On the other
hand, w = 0 means that they are not at all influenced by influentials, but are
influenced by imitators. We will break the difference equation for B(t) into three
parts.
1. First, there is a constant rate of adoptions from among the imitators that have
not yet adopted, just like the first part of the difference equation for A(t):
rB ⋅ (1 − B(t − ∆t)) ⋅ ∆t
2. Second, there is a fraction of the imitators who have not yet adopted who will
be influenced to adopt, through social contagion, by influential adopters.
w ⋅ sB ⋅ A(t − ∆t) ⋅ (1 − B(t − ∆t)) ⋅ ∆t

where A(t − ∆t) is the fraction of influentials who have adopted and (1 − B(t − ∆t))
is the fraction of imitators who have not adopted.
Recall from above that w is the extent to which imitators are more likely to
be influenced by influentials than other imitators.
3. Third, there is a fraction of the imitators who have not yet adopted who will
be influenced to adopt, through social contagion, by other imitators who have
already adopted.
(1 − w) ⋅ sB ⋅ B(t − ∆t) ⋅ (1 − B(t − ∆t)) ⋅ ∆t

where B(t − ∆t) is the fraction of imitators who have adopted and (1 − B(t − ∆t))
is the fraction of imitators who have not adopted.
Putting these three parts together, we have the difference equation modeling the
growth of the fraction of imitators who adopt the product:

B(t) = B(t − ∆t) + rB ⋅ (1 − B(t − ∆t)) ⋅ ∆t + w ⋅ sB ⋅ A(t − ∆t) ⋅ (1 − B(t − ∆t)) ⋅ ∆t + (1 − w) ⋅ sB ⋅ B(t − ∆t) ⋅ (1 − B(t − ∆t)) ⋅ ∆t

Write a function that implements this product diffusion model with influentials and
imitators. The parameters are similar to the previous function (but their names have
been shortened).
The first two parameters are the sizes of the influential and imitator populations,
respectively. The third and fourth parameters are the adoption rate (rA ) and social
contagion (sA ) for the influentials, respectively. The fifth and sixth parameters
are the same values (rB and sB ) for the imitators. The seventh parameter weight
is the value of w. Your function should produce two plots. In the first, plot the
new adoptions for each group, and the total new adoptions, over time (as before,
normalized by dividing by dt). In the second, plot the total adoptions for each group,
and the total adoptions for both groups together, over time. Unlike in the previous
function, plot the numbers of adopters in each group rather than the fractions of
adopters, so that the different sizes of each population are taken into account. (To
do this, just multiply the fraction by the total size of the appropriate population.)
Write a program that uses your function to simulate the same product launch as
before, except now there are 600 influentials and 400 imitators in a total population
of 1000. The adoption rate and social contagion for the influentials are the same as
before (rA = 0.002 and sA = 1.03), but these values are rB = 0 and sB = 0.8 for the
imitators. Use a value of w = 0.6, meaning that the imitators value the opinions of
the influentials over other imitators.
Question 4.3.4 Describe the picture and explain the pattern of new adoptions
and the resulting pattern of total adoptions. Point out any patterns that you find
interesting.
Question 4.3.5 Now set w = 0.01 and rerun the simulation. Describe the new
picture and explain how and why this changes the results.
In the 1920’s, Alfred Lotka and Vito Volterra independently introduced the
now-famous Lotka-Volterra equations to model predator-prey relationships. The
model consists of a pair of related differential equations that describe the sizes of
the two populations over time. We will approximate the differential equations with
discrete difference equations. Let’s assume that the predators are wolves and the
prey are moose. We will represent the sizes of the moose and wolf populations at
the end of month t with M (t) and W (t), respectively. The difference equations
describing the populations of wolves and moose are:

M(t) = M(t − ∆t) + (bM ⋅ M(t − ∆t) − dM ⋅ M(t − ∆t) ⋅ W(t − ∆t)) ⋅ ∆t
W(t) = W(t − ∆t) + (bW ⋅ M(t − ∆t) ⋅ W(t − ∆t) − dW ⋅ W(t − ∆t)) ⋅ ∆t

where

• bM is the moose birth rate (per month)
• dM is the moose death rate, or the rate at which a wolf kills a moose that it
encounters (per month)
• bW is the wolf birth rate, or the moose death rate × how efficiently an eaten
moose produces a new wolf (per month)
• dW is the wolf death rate (per month)

Write a function

PP(preyPop, predPop, dt, months)

that simulates this predator-prey model using the difference equations above. The
parameters preyPop and predPop are the initial sizes of the prey and predator
populations (M (0) and W (0)), respectively, dt (∆t) is the time interval used in the
simulation, and months is the number of months (maximum value of t) for which
to run the simulation. To cut back on the number of parameters, you can assign
constant birth and death rates to local variables inside your function. Start by trying
birthRateMoose = 0.5 # bM
deathRateMoose = 0.02 # dM
birthRateWolves = 0.005 # bW = dM * efficiency of 0.25
deathRateWolves = 0.75 # dW
Your function should plot, using matplotlib, the sizes of the wolf and moose
populations over time, as the simulation progresses. Write a program that calls your
PP function to simulate 500 moose and 25 wolves for 5 years with dt = 0.01.
Question 4.4.1 What happens to the sizes of the populations over time? Why do
these changes occur?
Question 4.4.3 What would the wolf death rate need to be for the wolf population
to die out within five years? Note that the death rate can exceed 1. Try increasing
the value of dW slowly and watch what happens. If it seems like you can never kill
all the wolves, read on.
Killing off the wolves appears to be impossible because the equations you are
using will never let the value reach zero. (Why?) To compensate, we can set either
population to zero when it falls below some threshold, say 1.0. (After all, you can’t
really have a fraction of a wolf.) To do this, insert the following statements into the
body of your for loop after you increment the predator and prey populations, and
try answering the previous question again.
if preyPop < 1.0:
    preyPop = 0.0
if predPop < 1.0:
    predPop = 0.0
(Replace preyPop and predPop with the names you use for the current sizes of
the populations.) As we will see shortly, the first two statements will assign 0 to
preyPop if it is less than 1. The second two statements do the same for predPop.
To incorporate a limited food supply, we can give the moose population a carrying
capacity MCC and replace the moose birth term bM ⋅ M(t − ∆t) ⋅ ∆t in the difference
equation with

bM ⋅ (1 − M(t − ∆t)/MCC) ⋅ M(t − ∆t) ⋅ ∆t.
Notice that now, as the moose population size approaches the carrying capacity, the
birth rate slows.
Question 4.4.4 Why does this change cause the moose birth rate to slow as the
size of the moose population approaches the carrying capacity?
Implement this change to your simulation, setting the moose carrying capacity
to 750, and run it again with the original birth and death rates, with 500 moose
and 25 wolves, for 10 years.
Question 4.4.5 How does the result differ from your previous run? What does the
result demonstrate? Does the moose population reach its carrying capacity of 750?
If not, what birth and/or death rate parameters would need to change to allow this
to happen?
Question 4.4.6 Reinstate the original birth and death rates, and introduce hunting
again; now what would the wolf death rate need to be for the wolf population to die
out within five years?
CHAPTER 5

Forks in the Road
So far, our algorithms have been entirely deterministic; they have done the same
thing every time we execute them with the same inputs. However, the natural
world and its inhabitants (including us) are usually not so predictable. Rather, we
consider many natural processes to be, to various degrees, random. For example, the
behaviors of crowds and markets often change in unpredictable ways. The genetic
“mixing” that occurs in sexual reproduction can be considered a random process
because we cannot predict the characteristics of any particular offspring. We can also
use the power of random sampling to estimate characteristics of a large population
by querying a smaller number of randomly selected individuals.
Most non-random algorithms must also be able to conditionally change course, or
select from among a variety of options, in response to input. Indeed, most common
desktop applications do nothing unless prompted by a key press or a mouse click.
Computer games like racing simulators react to a controller several times a second.
The protocols that govern data traffic on the Internet adjust transmission rates
continually in response to the perceived level of congestion on the network. In this
chapter, we will discover how to design algorithms that can behave differently in
response to input, both random and deterministic.
Monte Carlo simulations repeat a random experiment many, many times, and average over these
trials to arrive at a meaningful result; one run of the simulation, due to its random
nature, does not carry any significance.
In 1827, British botanist Robert Brown, while observing pollen grains suspended
in water under his microscope, witnessed something curious. When the pollen
grains burst, they emitted much smaller particles that proceeded to wiggle around
randomly. This phenomenon, now called Brownian motion, was caused by the
particles’ collisions with the moving water molecules. Brownian motion is now used
to describe the motion of any sufficiently small particle (or molecule) in a fluid.
We can model the essence of Brownian motion with a single randomly moving
particle in two dimensions. This process is known as a random walk. Random walks
are also used to model a wide variety of other phenomena such as markets and the
foraging behavior of animals, and to sample large social networks. In this section,
we will develop a Monte Carlo simulation of a random walk to discover how far
away a randomly moving particle is likely to get in a fixed amount of time.
You may have already modeled a simple random walk in Exercise 3.3.15 by
moving a turtle around the screen and choosing a random angle to turn at each
step. We will now develop a more restricted version of a random walk in which the
particle is forced to move on a two-dimensional grid. At each step, we want the
particle to move in one of the four cardinal directions, each with equal probability.
To simulate random processes, we need an algorithm or device that produces ran-
dom numbers, called a random number generator (RNG). A computer, as described
in Chapter 1, cannot implement a true RNG because everything it does is entirely
predictable. Therefore, a computer either needs to incorporate a specialized device
that can detect and transmit truly random physical events (like subatomic quantum
fluctuations) or simulate randomness with a clever algorithm called a pseudorandom
number generator (PRNG). A PRNG generates a sequence of numbers that appear
to be random although, in reality, they are not.
The Python module named random provides a PRNG in a function named
random. The random function returns a pseudorandom number between zero and
one, but not including one. For example:
>>> import random
>>> random.random()
0.9699738944412686
(Your output will differ.) It is convenient to refer to the range of real numbers
produced by the random function as [0, 1). The square bracket on the left means
that 0 is included in the range, and the parenthesis on the right means that 1 is not
included in the range. Box 5.1 explains a little more about this so-called interval
notation, if it is unfamiliar to you.
To randomly move our particle in one of four cardinal directions, we first use
the random function to assign r to be a random value in [0, 1). Then we divide our
space of possible random numbers into four equal intervals, and associate a direction
with each one.
Now let’s write an algorithm to take one step of a random walk. We will save the
particle’s (x, y) location in two variables, x and y (also called the particle’s x and y
coordinates). To condition each move based on the interval in which r resides, we
will use Python’s if statement. An if statement executes a particular sequence of
statements only if some condition is true. For example, the following statements
assign a pseudorandom value to r, and then implement the first case by incrementing
x when r is in [0, 0.25):
x = 0
y = 0
r = random.random()
if r < 0.25:        # if r < 0.25,
    x = x + 1       #   move to the east
Reflection 5.1 Why do we not need to also check whether r is at least zero?
An if statement is also called a conditional statement because, like the while loops we saw earlier, it makes decisions that are conditioned on a Boolean expression.
(Unlike while loops, however, an if statement is only executed once.) The Boolean
expression in this case, r < 0.25, is true if r is less than 0.25 and false otherwise.
If the Boolean expression is true, the statement(s) that are indented beneath the
condition are executed. On the other hand, if the Boolean expression is false,
the indented statement(s) are skipped, and the statement following the indented
statements is executed next.
In the second case, to check whether r is in [0.25, 0.5), we need to check whether
r is greater than or equal to 0.25 and whether r is less than 0.5. The meaning of
“and” in the previous sentence is identical to the Boolean operator from Section 1.4.
In Python, this condition is represented just as you would expect:
r >= 0.25 and r < 0.5
The >= operator is Python’s representation of “greater than or equal to” (≥). It is
one of six comparison operators (or relational operators), listed in Table 5.1, some
of which have two-character representations in Python. (Note especially that ==
is used to test for equality. We will discuss these operators further in Section 5.4.)
Adding this case to the first case, we now have two if statements:
if r < 0.25:                  # if r < 0.25,
    x = x + 1                 #   move to the east
if r >= 0.25 and r < 0.5:     # if r is in [0.25, 0.5),
    y = y + 1                 #   move to the north
Now if r is assigned a value that is less than 0.25, the condition in the first if
statement will be true and x = x + 1 will be executed. Next, the condition in the
second if statement will be checked. But since this condition is false, y will not be
incremented. On the other hand, suppose r is assigned a value that is between 0.25
and 0.5. Then the condition in the first if statement will be false, so the indented
statement x = x + 1 will be skipped and execution will continue with the second
if statement. Since the condition in the second if statement is true, y = y + 1
will be executed.
To complete our four-way decision, we can add two more if statements:
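# A sketch of the completed four-way decision; the discussion below refers
# to these statements by line number, with the final print on line 9.
if r < 0.25:                  # line 1: if r is in [0, 0.25),
    x = x + 1                 # line 2:   move to the east
if r >= 0.25 and r < 0.5:     # line 3: if r is in [0.25, 0.5),
    y = y + 1                 # line 4:   move to the north
if r >= 0.5 and r < 0.75:     # line 5: if r is in [0.5, 0.75),
    x = x - 1                 # line 6:   move to the west
if r >= 0.75 and r < 1.0:     # line 7: if r is in [0.75, 1),
    y = y - 1                 # line 8:   move to the south
print(x, y)                   # line 9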
There are four possible ways this code could be executed, one for each interval in
which r can reside. For example, suppose r is in the interval [0.5, 0.75). We first
execute the if statement on line 1. Since the if condition is false, the indented
statement on line 2, x = x + 1, is skipped. Next, we execute the if statement on
line 3, and test its condition. Since this is also false, the indented statement on
line 4, y = y + 1, is skipped. We continue by executing the third if statement, on
line 5. This condition, r >= 0.5 and r < 0.75, is true, so the indented statement
on line 6, x = x - 1, is executed. Next, we continue to the fourth if statement
on line 7, and test its condition, r >= 0.75 and r < 1.0. This condition is false,
so execution continues on line 9, after the entire conditional structure, where the
values of x and y are printed. In each of the other three cases, when r is in one of
the three other intervals, a different condition will be true and the other three will
be false. Therefore, exactly one of the four indented statements will be executed for
any value of r.
Reflection 5.2 Is this sequence of steps efficient? If not, what steps could be skipped
and in what circumstances?
The code behaves correctly, but it seems unnecessary to test subsequent conditions
after we have already found the correct case. If there were many more than four
cases, this extra work could be substantial. Here is a more efficient structure:
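# A sketch of the more efficient if/elif structure described below,
# using the same line numbering as the discussion.
if r < 0.25:                  # line 1
    x = x + 1                 # line 2: move to the east
elif r < 0.5:                 # line 3
    y = y + 1                 # line 4: move to the north
elif r < 0.75:                # line 5
    x = x - 1                 # line 6: move to the west
elif r < 1.0:                 # line 7
    y = y - 1                 # line 8: move to the south
print(x, y)                   # line 9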
The keyword elif is short for “else if,” meaning that the condition that follows
an elif is checked only if no preceding condition was true. In other words, as we
sequentially check each of the four conditions, if we find that one is true, then the
associated indented statement(s) are executed, and we skip the remaining conditions
in the group. We were also able to eliminate the lower bound check from each
condition (e.g., r >= 0.25 in the second if statement) because, if we get to an
elif condition, we know that the previous condition was false, and therefore the
value of r must be greater than or equal to the interval being tested in the previous
case.
To illustrate, let’s first suppose that the condition in the first if statement on
line 1 is true, and the indented statement on line 2 is executed. Now none of the
remaining elif conditions are checked, and execution skips ahead to line 9. On
the other hand, if the condition in the first if statement is false, we know that r
must be at least 0.25. Next the elif condition on line 3 would be checked. If this
condition is true, then the indented statement on line 4 would be executed, none of
the remaining conditions would be checked, and execution would continue on line 9.
If the condition in the second if statement is also false, we know that r must be at
least 0.5. Next, the condition on line 5, r < 0.75, would be checked. If it is true,
the indented statement on line 6 would be executed, and execution would continue
on line 9. Finally, if none of the first three conditions is true, we know that r must
be at least 0.75, and the elif condition on line 7 would be checked. If it is true, the
indented statement on line 8 would be executed. And, either way, execution would
continue on line 9.
Reflection 5.3 For each of the four possible intervals to which r could belong, how
many conditions are checked?
Reflection 5.4 Suppose you replace every elif in the most recent version above
with if. What would then happen if r had the value 0.1?
This code can be streamlined a bit more. Since r must be in [0, 1), there is no point
in checking the last condition, r < 1.0. If execution has proceeded that far, r must
be in [0.75, 1). So, we should just execute the last statement, y = y - 1, without
checking anything. This is accomplished by replacing the last elif with an else
statement:
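# A sketch of the streamlined version, with else replacing the last elif:
if r < 0.25:
    x = x + 1                 # move to the east
elif r < 0.5:
    y = y + 1                 # move to the north
elif r < 0.75:
    x = x - 1                 # move to the west
else:
    y = y - 1                 # move to the south
print(x, y)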
The else signals that, if no previous condition is true, the statement(s) indented
under the else are to be executed.
Reflection 5.5 Now suppose you replace every elif with if. What would happen
if r had the value 0.1?
If we (erroneously) replaced the two instances of elif above with if, then the final
else would be associated only with the last if. So if r had the value 0.1, all three
if conditions would be true and all three of the first three indented statements
would be executed. The last indented statement would not be executed because the
last if condition was true.
Reflection 5.6 Suppose we wanted to randomly move a particle on a line instead.
Then we only need to check two cases: whether r is less than 0.5 or not. If r is
less than 0.5, increment x. Otherwise, decrement x. Write an if/else statement
to implement this. (Resist the temptation to look ahead.)
In situations where there are only two choices, an else can just accompany an if.
For example, if we wanted to randomly move a particle on a line, our conditional would
look like:
if r < 0.5:        # if r < 0.5,
    x = x + 1      #   move to the east and finish
else:              # otherwise (r >= 0.5),
    x = x - 1      #   move to the west and finish
print(x)           # executed after the if/else
We are now ready to use our if/elif/else conditional structure in a loop to
simulate many steps of a random walk on a grid. The randomWalk function below
does this, and then returns the distance the particle has moved from the origin.
Parameters:
steps: the number of steps in the random walk
tortoise: a Turtle object
To make the grid movement easier to see, we make the turtle move further in each
step by multiplying each position by a variable moveLength. To try out the random
walk, write a main function that creates a new Turtle object and then calls the
randomWalk function with 1000 steps. One such run is illustrated in Figure 5.1.
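A minimal sketch of such a randomWalk function, assuming a moveLength of 10 and that the returned value is the straight-line distance of the particle from the origin, might look like the following (a third parameter is added to it next):

import math
import random

def randomWalk(steps, tortoise):
    """Simulates a random walk on a grid and returns the final
       distance of the particle from the origin."""
    moveLength = 10                          # assumed scale factor for drawing
    x = 0
    y = 0
    for step in range(steps):
        r = random.random()
        if r < 0.25:
            x = x + 1                        # move east
        elif r < 0.5:
            y = y + 1                        # move north
        elif r < 0.75:
            x = x - 1                        # move west
        else:
            y = y - 1                        # move south
        tortoise.goto(x * moveLength, y * moveLength)
    return math.sqrt(x * x + y * y)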
Drawing each walk is far too slow when we run many trials in a Monte Carlo simulation, so we add a third parameter named draw to randomWalk and move the turtle only when it is True:

    if draw:
        tortoise.goto(x * moveLength, y * moveLength)
    ⋮
Now, when we call randomWalk, we pass in either True or False for our third
argument. If draw is True, then tortoise.goto(⋯) will be executed but, if draw
is False, it will be skipped.
To find the average over many trials, we will call our randomWalk function
repeatedly in a loop, and use an accumulator variable to sum up all the distances.
def rwMonteCarlo(steps, trials):
    """Returns the average distance from the origin over many random walks.

    Parameters:
        steps: the number of steps in the random walk
        trials: the number of random walks
    """
    totalDistance = 0
    for trial in range(trials):
        distance = randomWalk(steps, None, False)
        totalDistance = totalDistance + distance
    return totalDistance / trials
Notice that we have passed in None as the argument for the second parameter
(tortoise) of randomWalk. With False being passed in for the parameter draw,
the value assigned to tortoise is never used, so we pass in None as a placeholder
“dummy value.”
Reflection 5.7 Get ten different estimates of the average distance traveled over
five trials of a 500-step random walk by calling rwMonteCarlo(500, 5) ten times
in a loop and printing the result each time. What do you notice? Do you think five
trials is enough? Now perform the same experiment with 10, 100, 1000, and 10000
trials. How many trials do you think are sufficient to get a reliable result?
def plotDistances(maxSteps, trials):
    """Plots the average random walk distance for a range of walk lengths.

    Parameters:
        maxSteps: the maximum number of steps for the plot
        trials: the number of random walks in each simulation
    """
    distances = [ ]
    stepRange = range(100, maxSteps + 1, 100)
    for steps in stepRange:
        distance = rwMonteCarlo(steps, trials)
        distances.append(distance)
    pyplot.plot(stepRange, distances)    # assumed: plot and display the averages
    pyplot.show()
The function we are seeking has actually been mathematically determined, and is approximately √n. You can confirm this empirically by plotting this function
alongside the simulated results. To do so, insert each of these three statements in
their appropriate locations in the plotDistances function:
    y = [ ]                          # before loop
    y.append(math.sqrt(steps))       # inside loop (assumed form)
    pyplot.plot(stepRange, y)        # after loop (assumed form)
The result is shown in Figure 5.2. This result tells us that after a random particle moves n unit-length steps, it will be about √n units of distance away from where it started, on average. In any particular instance, however, a particle may be much closer or farther away.
As you discovered in Reflection 5.7, the quality of any Monte Carlo approximation
depends on the number of trials. If you call plotDistances a few more times with
different numbers of trials, you should find that the plot of the simulation results
gets “smoother” with more trials. But more trials obviously take longer. This is
another example of the tradeoffs that we often encounter when solving problems.
Histograms
As we increase the number of trials, our average results become more consistent, but
what do the individual trials look like? To find out, we can generate a histogram of
the individual random walk distances. A histogram for a data set is a bar graph
that shows how the items in the data set are distributed across some number of
intervals, which are usually called “buckets” or “bins.”
The following function is similar to rwMonteCarlo, except that it initializes an
empty list before the loop, and appends one distance to the list in every iteration.
Then it passes this list to the matplotlib histogram function hist. The first
argument to the hist function is a list of numbers, and the second parameter is the
number of buckets that we want to use.
def rwHistogram(steps, trials):
    """Displays a histogram of individual random walk distances.

    Parameters:
        steps: the number of steps in the random walk
        trials: the number of random walks
    """
    distances = [ ]
    for trial in range(trials):
        distance = randomWalk(steps, None, False)
        distances.append(distance)
    pyplot.hist(distances, 75)
    pyplot.show()
Exercises
5.1.1. Sometimes we want a random walk to reflect circumstances that bias the probability
of a particle moving in some direction (i.e., gravity, water current, or wind). For
example, suppose that we need to incorporate gravity, so a movement to the north
is modeling a real movement up, away from the force of gravity. Then we might
want to decrease the probability of moving north to 0.15, increase the probability
of moving south to 0.35, and leave the other directions as they were. Show how to
modify the randomWalk function to implement this situation.
5.1.2. Suppose the weather forecast calls for a 70% chance of rain. Write a function
weather()
that prints ’RAIN’ (or something similar) with probability 0.7, and ’SUN!’ other-
wise.
Now write another version that snows with probability 0.66, produces a sunny day
with probability 0.33, and rains cats and dogs with probability 0.01.
5.1.3. Write a function
loaded()
that simulates the rolling of a single “loaded die” that rolls more 1’s and 6’s than it
should. The probability of rolling each of 1 or 6 should be 0.25. The function should
use the random.random function and an if/elif/else conditional construct to
assign a roll value to a variable named roll, and then return the value of roll.
You then close your eyes and throw darts at the circle. Assuming every dart lands
inside the square, the fraction of the darts that land in the circle estimates the
ratio between the area of the circle and the area of the square. We know that the
area of the circle is C = πr² = π ⋅ 1² = π and the area of the square is S = 2² = 4. So
the exact ratio is π/4. With enough darts, f , the fraction (between 0 and 1) that
lands in the circle will approximate this ratio: f ≈ π/4, which means that π ≈ 4f .
To make matters a little simpler, we can just throw darts in the upper right quarter
of the circle (shaded above). The ratio here is the same: (π/4)/1 = π/4. If we place
this quarter circle on x and y axes, with the center of the circle at (0, 0), our darts
will now all land at points with x and y coordinates between 0 and 1.
Use this idea to write a function
montePi(darts)
that approximates the value of π by repeatedly throwing random virtual darts that land at points with x and y coordinates in [0, 1). Count the fraction of darts that land at points within distance 1 of the origin, and return four times this fraction as the approximation of π.
5.1.7. The Good, The Bad, and The Ugly are in a three-way gun fight (sometimes called
a “truel”). The Good always hits his target with probability 0.8, The Bad always
hits his target with probability 0.7, and The Ugly always hits his target with
probability 0.6. Initially, The Good aims at The Bad, The Bad aims at The Good,
and The Ugly aims at The Bad. The gunmen shoot simultaneously. In the next
round, each gunman, if he is still standing, aims at his same target, if that target
is alive, or at the other gunman, if there is one, otherwise. This continues until
only one gunman is standing or all are dead. What is the probability that they all
die? What is the probability that The Good survives? What about The Bad? The
Ugly? On average, how many rounds are there? Write a function
goodBadUgly()
that simulates one instance of this three-way gun fight. Your function should return
1, 2, 3, or 0 depending upon whether The Good, The Bad, The Ugly, or nobody is
left standing, respectively. Next, write a function
monteGBU(trials)
that calls your goodBadUgly function repeatedly in a Monte Carlo simulation to
answer the questions above.
5.1.8. What is printed by the following sequence of statements in each of the cases below?
Explain your answers.
if votes1 >= votes2:
    print('Candidate one wins!')
elif votes1 <= votes2:
    print('Candidate two wins!')
else:
    print('There was a tie.')
5.1.9. There is a problem with the code in the previous exercise. Fix it so that it correctly
fulfills its intended purpose.
5.1.10. What is printed by the following sequence of statements in each of the following
cases? Explain your answers.
majority = (votes1 + votes2 + votes3) / 2
if votes1 > majority:
    print('Candidate one wins!')
if votes2 > majority:
    print('Candidate two wins!')
if votes3 > majority:
    print('Candidate three wins!')
else:
    print('A runoff is required.')
5.1.11. Make the code in the previous problem more efficient and fix it so that it fulfills
its intended purpose.
5.1.12. What is syntactically wrong with the following sequence of statements?
if x < 1:
    print('Something.')
else:
    print('Something else.')
elif x > 3:
    print('Another something.')
5.1.13. What is the final value assigned to result after each of the following code segments?
(a) n = 13
    result = n
    if n > 12:
        result = result + 12
    if n < 5:
        result = result + 5
    else:
        result = result + 2

(b) n = 13
    result = n
    if n > 12:
        result = result + 12
    elif n < 5:
        result = result + 5
    else:
        result = result + 2
def randomScores(n):
    scores = []
    for count in range(n):
        scores.append(random.gauss(80, 10))
    return scores
5.1.15. Determining the number of bins to use in a histogram is part science, part art.
If you use too few bins, you might miss the shape of the distribution. If you use
too many bins, there may be many empty bins and the shape of the distribution
will be too jagged. Experiment with the correct number of bins for 10,000 trials in
the rwHistogram function. At the extremes, create a histogram with only 3 bins and
another with 1,000 bins. Then try numbers in between. What seems to be a good
number of bins? (You may also want to do some research on this question.)
Reflection 5.10 Notice that we are computing R(t) from R(t − 1) at every step.
Does this look familiar?
If you read Section 4.4, you may recognize this as another example of a difference
equation. A difference equation is a function that computes its next value based on
its previous value.
A simple PRNG algorithm, known as a Lehmer pseudorandom number generator, is named after the late mathematician Derrick Lehmer. Dr. Lehmer taught at the
University of California, Berkeley, and was one of the first people to run programs
on the ENIAC, the first electronic computer. A Lehmer PRNG uses the following
difference equation:
R(t) = a ⋅ R(t − 1) mod m
where m is a prime number and a is an integer between 1 and m − 1. For example,
suppose m = 13, a = 5, and the seed R(0) = 1. Then
R(1) = 5 ⋅ R(0) mod 13 = 5 ⋅ 1 mod 13 = 5 mod 13 = 5
R(2) = 5 ⋅ R(1) mod 13 = 5 ⋅ 5 mod 13 = 25 mod 13 = 12
R(3) = 5 ⋅ R(2) mod 13 = 5 ⋅ 12 mod 13 = 60 mod 13 = 8
So this pseudorandom sequence begins 5, 12, 8, . . ..
The next four values are R(4) = 1, R(5) = 5, R(6) = 12, and R(7) = 8. So the
sequence is now 5, 12, 8, 1, 5, 12, 8, . . ..
Implementation
But first, let’s implement the PRNG “black box” in Figure 5.3, based on the Lehmer
PRNG difference equation. It is a very simple function!
def lehmer(r, m, a):
    """Returns the next value of a Lehmer pseudorandom number generator.

    Parameters:
        r: the seed or previous pseudorandom number
        m: a prime number
        a: an integer between 1 and m - 1
    """
    return (a * r) % m
The parameter r represents R(t − 1) and the value returned by the function is the
single value R(t).
Now, for our function to generate good pseudorandom numbers, we need some
better values for m and a. Some particularly good ones were suggested by Keith
Miller and Steve Park in 1988:
m = 2³¹ − 1 = 2,147,483,647 and a = 16,807.
A Lehmer generator with these parameters is often called a Park-Miller pseudorandom number generator. To create a Park-Miller PRNG, we can call the lehmer
function repeatedly, each time passing in the previously generated value and the
Park-Miller values of m and a, as follows.
def randomSequence(length, seed):
    """Returns a list of pseudorandom numbers from a Park-Miller PRNG.

    Parameters:
        length: the number of pseudorandom numbers to generate
        seed: the initial seed
    """
    r = seed
    m = 2**31 - 1
    a = 16807
    randList = [ ]
    for index in range(length):
        r = lehmer(r, m, a)
        randList.append(r)
    return randList
In each iteration, the function appends a new pseudorandom number to the end of
the list named randList. (This is another list accumulator.) The randomSequence
function then returns this list of length pseudorandom numbers. Try printing the
result of randomSequence(100, 1).
Because all of the returned values of lehmer are modulo m, they are all in the
interval [0 . . m − 1]. Since the value of m is somewhat arbitrary, random numbers
are usually returned instead as values from the interval [0, 1), as Python’s random
function does. This is accomplished by simply dividing each pseudorandom number
by m. To modify our randomSequence function to return a list of numbers in [0, 1)
instead of in [0 . . m − 1], we can simply append r / m to randList instead of r.
Reflection 5.13 Make this change to the randomSequence function and call the
function again with seeds 3, 4, and 5. Would you expect the results to look similar
with such similar seeds? Do they?
As an aside, you can set the seed used by Python’s random function by calling the
seed function. The default seed is based on the current time.
The ability to generate an apparently random, but reproducible, sequence by
setting the seed has quite a few practical applications. For example, it is often useful
in a Monte Carlo simulation to be able to reproduce an especially interesting run
by simply using the same seed. Pseudorandom sequences are also used in electronic
car door locks. Each time you press the button on your dongle to unlock the door,
it is sending a different random code. But since the dongle and the car are both
using the same PRNG with the same seed, the car is able to recognize whether the
code is legitimate.
Testing randomness
How can we tell how random a sequence of numbers really is? What does it even
mean to be truly “random?” If a sequence is random, then there must be no way to
predict or reproduce it, which means that there must be no shorter way to describe
the sequence than the sequence itself. Obviously then, a PRNG is not really random
at all because it is entirely reproducible and can be described by a simple formula.
However, for practical purposes, we can ask whether, if we did not know the formula
used to produce the numbers, could we predict them, or any patterns in them?
One simple test is to generate a histogram from the list of values.
Reflection 5.14 Suppose you create a histogram for a list of one million random
numbers in [0, 1). If the histogram contains 100 buckets, each representing an interval
with length 0.01, about how many numbers should be assigned to each bucket?
If the list is random, each bucket should contain about 1%, or 10,000, of the numbers.
The following function generates such a histogram.
def histRandom(length):
    """Displays a histogram of numbers generated by the Park-Miller PRNG.

    Parameter:
        length: the number of pseudorandom numbers to generate
    """
    samples = randomSequence(length, 6)
    pyplot.hist(samples, 100)
    pyplot.show()
Exercises
5.2.1. We can visually test the quality of a PRNG by using it to plot random points
on the screen. If the PRNG is truly random, then the points should be uniformly
distributed without producing any noticeable patterns. Write a function
testRandom(n)
that uses turtle graphics and random.random to plot n random points with x and
y each in [0, 1). Here is a “skeleton” of the function with some turtle graphics set
up for you. Calling the setworldcoordinates function redefines the coordinate
system so that the lower left corner is (0, 0) and the upper right corner is (1, 1).
Use the turtle graphics functions goto and dot to move the turtle to each point
and draw a dot there.
import turtle

def testRandom(n):
    """ your docstring here """

    tortoise = turtle.Turtle()
    screen = tortoise.getscreen()
    screen.setworldcoordinates(0, 0, 1, 1)
    screen.tracer(100)              # only draw every 100 updates
    tortoise.up()
    tortoise.speed(0)
returns a value according to the normal distribution with mean 0 and standard
deviation 0.25.
>>> random.gauss(0, 0.25)
-0.3371607214433552
def sumRandom(n):
    """Returns the sum of n pseudorandom numbers in [0,1).

    Parameter:
        n: the number of pseudorandom numbers to generate
    """
    sum = 0
    for index in range(n):
        sum = sum + random.random()
    return sum
def sumRandomHist(n, trials):
    """Displays a histogram of trials sums of n pseudorandom numbers.

    Parameters:
        n: the number of pseudorandom numbers in each sum
        trials: the number of sums to generate
    """
    samples = [ ]
    for index in range(trials):
        samples.append(sumRandom(n))
    pyplot.hist(samples, 100)
    pyplot.show()
Reflection 5.16 Call sumRandomHist with 100,000 trials and values of n equal to
1, 2, 3, and 10. What do you notice?
Figure 5.5 Results of sumRandomHist(n, 100000) with n = 1, 2, 3, 10.
mean, this also implies that the average of all of the measurements will be close to
the true mean.
Reflection 5.17 Think back to the experiment you ran to answer Reflection 5.9.
The shape of the histogram should have resembled a normal distribution. Can you
use the central limit theorem to explain why?
Exercises
5.3.1. A more realistic random walk has the movement in each step follow a normal
distribution. In particular, in each step, we can change both x and y according to a
normal distribution with mean 0. Because the values produced by this distribution
will be both positive and negative, the particle can potentially move in any direction.
To make the step sizes small, we need to use a small standard deviation, say 0.5:
x = x + random.gauss(0, 0.5)
y = y + random.gauss(0, 0.5)
Modify the randomWalk function so that it updates the position of the particle
in this way instead. Then use the rwMonteCarlo and plotDistances functions
to run a Monte Carlo simulation with your new randomWalk function. As we did
earlier, call plotDistances(1000, 10000). How do your results differ from the
original version?
5.3.2. Write a function
uniform(a, b)
that returns a number in the interval [a, b) using only the random.random function.
(Do not use the random.uniform function.)
5.3.3. Suppose we want a pseudorandom integer between 0 and 7 (inclusive). How can
we use the random.random() and int functions to get this result?
5.3.4. Write a function
randomRange(a, b)
that returns a pseudorandom integer in the interval [a . . b] using only the
random.random function. (Do not use random.randrange or random.randint.)
5.3.5. Write a function
normalHist(mean, stdDev, trials)
that produces a histogram of trials values returned by the gauss function with
the given mean and standard deviation. In other words, reproduce Figure 5.4.
(This is very similar to the sumRandomHist function.)
5.3.6. Write a function
uniformHist(a, b, trials)
that produces a histogram of trials values in the range [a, b] returned by the
uniform function. In other words, reproduce the top left histogram in Figure 5.5.
(This is very similar to the sumRandomHist function.)
5.3.7. Write a function
plotChiSquared(k, trials)
that produces a histogram of trials values, each of which is the sum of k squares
of values given by the random.gauss function with mean 0 and standard deviation
1. (This is very similar to the sumRandomHist function.) The resulting probability
distribution is known as the chi-squared distribution (χ² distribution) with k
degrees of freedom.
if r < 0.5:
    x = x + 1
else:
    x = x - 1
to control a random walk based on a random value of r. The outcome in this case
is based upon the value of the Boolean expression r < 0.5. In Python, Boolean
expressions evaluate to either the value True or the value False, which correspond
to the binary values 1 and 0, respectively, that we worked with in Section 1.4. The
values True and False can be printed, assigned to variable names, and manipulated
just like numeric values. For example, try the following examples and make sure
you understand each result.
>>> print(0 < 1)
True
>>> name = ’Kermit’
>>> print(name == ’Gonzo’)
False
>>> result = 0 < 1
>>> result
True
>>> result and name == ’Kermit’
True
The “double equals” (==) operator tests for equality; it has nothing to do with
assignment. The Python interpreter will remind you if you mistakenly use a single
equals in an if statement. For example, try this:
>>> if r = 0:
if r = 0:
^
SyntaxError: invalid syntax
However, the interpreter will not catch the error if you mistakenly use == in an
assignment statement. For example, try this:
>>> r = 1
>>> r == r + 1 # increment r?
False
  a       b       a and b   a or b   not a
False   False      False    False    True
False   True       False    True     True
True    False      False    True     False
True    True       True     True     False
Figure 5.6 Combined truth table for the three Boolean operators.
When 53000 is assigned to income, the two Boolean expressions income >= 40000
and income <= 65000 are both True, so income >= 40000 and income <= 65000
is also True. However, when 12000 is assigned to income, income >= 40000 is False,
while income <= 65000 is True, so income >= 40000 and income <= 65000 is
now False. We can also incorporate this test into a function that simply returns
the value of the condition:
>>> def middleClass(income):
        return (income >= 40000 and income <= 65000)
>>> middleClass(53000)
True
>>> middleClass(12000)
False
• If income is between 40000 and 65000 (inclusive), it will return True because
both parts of the or expression are True.
• If income < 40000, it will return True because income <= 65000 is True.
• If income > 65000, it will return True because income >= 40000 is True.
Parameters:
employee: average employee pay
ceo: CEO pay
ratio: the fair ratio
This function will not always work properly because, if the average employees’
compensation equals 0, the division operation will result in an error. Therefore,
we have to test whether employee == 0 before attempting the division and, if so,
return False (because not paying employees is obviously never fair). Otherwise, we
want to return the result of the fairness test. The following function implements
this algorithm.
    if employee == 0:
        result = False
    else:
        result = (ceo / employee <= ratio)
    return result
However, with short circuit evaluation, we can simplify the whole function to:
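    # A sketch of the simplified body (the enclosing def and docstring are as
    # above); short-circuit evaluation of "and" skips the division when
    # employee is 0:
    return (employee != 0) and (ceo / employee <= ratio)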
Parameters:
employee: average employee pay
ceo: CEO pay
ratio: the fair ratio
    if employee == 0:
        result = True
    else:
        result = (ceo / employee > ratio)
    return result
However, taking advantage of short circuit evaluation with the or operator, we can
simplify the whole function to:
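    # A sketch of the simplified body, using short-circuit evaluation of "or":
    return (employee == 0) or (ceo / employee > ratio)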
In this case, if (employee == 0) is True, the whole expression returns True with-
out evaluating the division test, thus avoiding an error. On the other hand, if
(employee == 0) is False, the division test is evaluated, and the final result is
equal to the outcome of this test.
Complex expressions
Some situations require Boolean expressions that are more complex than what we
have seen thus far. To illustrate, let’s consider how to test whether a particle in a
random walk at position (x, y) is in one of the four corners of the screen, as depicted
by the shaded regions below.
(Figure: the four shaded corner regions, numbered 1 (upper right), 2 (upper left), 3 (lower left), and 4 (lower right), on axes centered at (0, 0) with inner corner points at (d, d), (-d, d), (-d, -d), and (d, -d).)
There are (at least) two ways we can write a Boolean expression to test this condition.
One way is to test whether the particle is in each corner individually.
1. For the point to be corner 1, x > d and y > d must be True;
2. for the point to be corner 2, x < -d and y > d must be True;
3. for the point to be corner 3, x < -d and y < -d must be True; or
4. for the point to be corner 4, x > d and y < -d must be True.
Since any one of these can be True for our condition to be True, the final test, in the
form of a function, looks like the following. (The “backslash” (\) character below is
the line continuation character . It indicates that the line that it ends is continued
on the next line. This is sometimes handy for splitting very long lines of code.)
>>> def corners(x, y, d):
        return x > d and y > d or x < -d and y > d or \
               x < -d and y < -d or x > d and y < -d

>>> corners(11, -11, 10)
True
>>> corners(11, 0, 10)
False
Although our logic is correct, this expression is only correct if the Python inter-
preter evaluates the and operator before it evaluates the or operator. Otherwise,
if the or operator is evaluated first, then the first expression evaluated will be
y > d or x < -d, shown in red below, which is not what we intended.
   Operators                            Description
1. **                                   exponentiation (power)
2. +, -                                 unary positive and negative, e.g., -(4 * 9)
3. *, /, //, %                          multiplication and division
4. +, -                                 addition and subtraction
5. <, <=, >, >=, !=, ==, in, not in     comparison operators
6. not                                  Boolean not
7. and                                  Boolean and
8. or                                   Boolean or
x > d and y > d or x < -d and y > d or x < -d and y < -d or x > d and y < -d
Therefore, understanding the order in which operators are evaluated, known as
operator precedence, becomes important for more complex conditions such as this.
In this case, the expression is correct because and has precedence over or, as
illustrated in Table 5.2. Also notice from Table 5.2 that the comparison operators
have precedence over the Boolean operators, which is also necessary for our expression
to be correct. If in doubt, you can use parentheses to explicitly specify the intended
order. Parentheses would be warranted here in any case, to make the expression
more understandable:
>>> def corners(x, y, d):
        return (x > d and y > d) or (x < -d and y > d) or \
               (x < -d and y < -d) or (x > d and y < -d)
Now let’s consider an alternative way to think about the corner condition: for a
point to be in one of the corners, it must be true that x must either exceed d or be
less than -d and y must either exceed d or be less than -d. The equivalent Boolean
expression is enclosed in the following function.
>>> def corners2(x, y, d):
        return x > d or x < -d and y > d or y < -d

>>> corners2(11, -11, 10)
True
>>> corners2(11, 0, 10)
True
Reflection 5.20 The second call to corners2 gives an incorrect result. Do you see
the problem? (You might want to draw a picture.)
Due to the operator precedence order, this is a situation where parentheses are
required. Without parentheses, the and expression in red below is evaluated first,
which is not our intention.
To confirm that any Boolean expression is correct, we must create a truth table for
it, and then confirm that every case matches what we intended. In this expression,
there are four separate Boolean “inputs,” one for each expression containing a
comparison operator. In the truth table, we will represent each of these with a letter
to save space:
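a for x > d, b for x < -d, c for y > d, and d for y < -d. (These letter names, and the ordering of the rows from all false to all true, are assumptions.) The truth table for (x > d or x < -d) and (y > d or y < -d) is then:

  a    b    c    d    a or b    c or d    (a or b) and (c or d)
  F    F    F    F      F         F                F
  F    F    F    T      F         T                F
  F    F    T    F      F         T                F
  F    F    T    T      F         T                F
  F    T    F    F      T         F                F
  F    T    F    T      T         T                T
  F    T    T    F      T         T                T
  F    T    T    T      T         T                T
  T    F    F    F      T         F                F
  T    F    F    T      T         T                T
  T    F    T    F      T         T                T
  T    F    T    T      T         T                T
  T    T    F    F      T         F                F
  T    T    F    T      T         T                T
  T    T    T    F      T         T                T
  T    T    T    T      T         T                T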
To fill in the fifth column, we only need to look at the first and second columns.
Since this is an or expression, for each row in the fifth column, we enter a T if there
is at least one T among the first and second columns of that row, or an F otherwise.
Similarly, to fill in the sixth column, we only look at the third and fourth column.
To fill in the last column, we look at the two previous columns, filling in a T in each
row in which the fifth and sixth columns are both T, or an F otherwise.
To confirm that our expression is correct, we need to confirm that the truth
value in each row of the last column is correct with respect to that row’s input
values. For example, let’s look at the fifth, sixth, and eighth rows of the truth table
(in red).
In the fifth row, the last column indicates that the expression is False. Therefore,
it should be the case that a point described by the truth values in the first four
columns of that row is not in one of the four corners. Those truth values say that x
> d is false, x < -d is true, y > d is false, and y < -d is false; in other words, x is
less than -d and y is between -d and d. A point within these bounds is not in any
of the corners, so that row is correct.
Now let’s look at the sixth row, where the final expression is true. The first
columns of that row indicate that x > d is false, x < -d is true, y > d is false, and
y < -d is true; in other words, both x and y are less than -d. A point within these
bounds is in the bottom left corner, so that row is also correct.
The eighth row is curious because it states that both y > d and y < -d are true.
But this is, of course, impossible. Because of this, we say that the implied statement
is vacuously true. In practice, we cannot have such a point, so that row is entirely
irrelevant. There are seven such rows in the table, leaving only four other rows that
are true, matching the four corners.
Parameters:
a, b: two numbers
    if a >= b:
        result = a
    else:
        result = b
    return result
We can simplify this function a bit by returning the appropriate value right in the
if/else statement:
    if a >= b:
        return a
    else:
        return b
It may look strange at first to see two return statements in one function, but it
all makes perfect sense. Recall from Section 3.5 that return both ends the function
and assigns the function’s return value. So this means that at most one return
statement can ever be executed in a function. In this case, if a >= b is true, the
function ends and returns the value of a. Otherwise, the function executes the else
clause, which returns the value of b.
The fact that the function ends if a >= b is true means that we can simplify it
even further: if execution continues past the if part of the if/else, it must be the
case that a >= b is false. So the else is extraneous; the function can be simplified
to:
    if a >= b:
        return a
    return b
This same principle can be applied to situations with more than two cases. Suppose
we wrote a function to convert a percentage grade to a grade point (i.e., GPA) on a
0–4 scale. A natural implementation of this might look like the following:
def assignGP(score):
    """Returns the grade point equivalent of score.

    Parameter:
        score: a score between 0 and 100
    """
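    # A sketch of the body discussed below; the 90 and 80 cutoffs are
    # confirmed by the text, while the cutoffs of 70 and 60 are assumptions.
    if score >= 90:
        return 4
    elif score >= 80:
        return 3
    elif score >= 70:
        return 2
    elif score >= 60:
        return 1
    else:
        return 0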
Reflection 5.22 Why do we not need to check upper bounds on the scores
in each case? In other words, why does the second condition not need to be
score >= 80 and score < 90?
Suppose score was 92. In this case, the first condition is true, so the function
returns the value 4 and ends. Execution never proceeds past the statement return
4. For this reason, the “el” in the next elif is extraneous. In other words, because
execution would never have made it there if the previous condition was true, there is
no need to tell the interpreter to skip testing this condition if the previous condition
was true.
Now suppose score was 82. In this case, the first condition would be false, so
we continue on to the first elif condition. Because we got to this point, we already
know that score < 90 (hence the omission of that check). The first elif condition
is true, so we immediately return the value 3. Since the function has now completed,
there is no need for the “el” in the second elif either. In other words, because
execution would never have made it to the second elif if either of the previous
conditions were true, there is no need to skip testing this condition if a previous
condition was true. In fact, we can remove the “el”s from all of the elifs, and the
final else, with no loss in efficiency at all.
def assignGP(score):
""" (docstring omitted) """
Some programmers find it clearer to leave the elif statements in, and that is fine
too. We will do it both ways in the coming chapters. But, as you begin to see more
algorithms, you will probably see code like this, and so it is important to understand
why it is correct.
Exercises
5.4.1. Write a function
password()
that asks for a username and a password. It should return True if the username is
entered as alan.turing and the password is entered as notTouring, and return
False otherwise.
5.4.2. Suppose that in a game that you are making, the player wins if her score is at
least 100. Write a function
hasWon(score)
that returns True if she has won, and False otherwise.
5.4.3. Suppose you have designed a sensor that people can wear to monitor their health.
One task of this sensor will be to monitor body temperature: if it falls outside the
range 97.9○ F to 99.3○ F, the person may be getting sick. Write a function
monitor(temperature)
that takes a temperature (in Fahrenheit) as a parameter, and returns True if
temperature falls in the healthy range and False otherwise.
Grade     Remark
96-100    Outstanding
90-95     Exceeds expectations
80-89     Acceptable
1-79      Trollish
5.4.10. Write a function that takes two integer values as parameters and returns their sum
if they are not equal and their product if they are.
5.4.11. Write a function
amIRich(amount, rate, years)
that accumulates interest on amount dollars at an annual rate of rate percent
for a number of years. If your final investment is at least double your original
amount, return True; otherwise, return False.
1, 1, 2, 3, 5, 8, 3, 1, 4, 5, 9, 4, 3, 7, 0, . . .
Write a function
mystery(a, b)
that returns the length of the sequence when the last two numbers repeat the
values of a and b for the first time. (When a = 1 and b = 1, the function should
return 62.)
5.4.15. The Chinese zodiac relates each year to an animal in a twelve-year cycle. The
animals for the most recent years are given below.
Write a function
zodiac(year)
that takes as a parameter a four-digit year (this could be any year in the past or
future) and returns the corresponding animal as a string.
5.4.16. A year is a leap year if it is divisible by four, unless it is a century year in which
case it must be divisible by 400. For example, 2012 and 1600 are leap years, but
2011 and 1800 are not. Write a function
leap(year)
that returns a Boolean value indicating whether the year is a leap year.
(Figure: a region of the plane drawn on axes centered at (0, 0), with reference points at (-d, d), (d, d), (-d, -d), and (d, -d).)
and
(y > d) and (x > d or x < -d)
are equivalent.
5.4.25. Write a function
drawRow(tortoise, row)
that uses turtle graphics to draw one row of an 8 × 8 red/black checkerboard. If
the value of row is even, the row should start with a red square; otherwise, it
should start with a black square. You may want to use the drawSquare function
you wrote in Exercise 3.3.5. Your function should only need one for loop and only
need to draw one square in each iteration.
5.4.26. Write a function
checkerBoard(tortoise)
that draws an 8 × 8 red/black checkerboard, using the function you wrote in
Exercise 5.4.25.
import random
def guessingGame(maxGuesses):
    """Plays a guessing game. The human player tries to guess
       the computer's number from 1 to 100.

    Parameter:
        maxGuesses: the maximum number of guesses allowed
    """
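    # A sketch of the body based on the description that follows
    # (the prompt string is an assumption):
    secretNumber = random.randrange(1, 101)
    for guesses in range(maxGuesses):
        myGuess = int(input('Please guess my number: '))
        if myGuess == secretNumber:
            print('You got it!')
        else:
            print('Nope. Try again.')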
The randrange function returns a random integer that is at least the value of its first
argument, but less than its second argument (similar to the way the range function
interprets its arguments). After the function chooses a random integer between 1
and 100, it enters a for loop that will allow us to guess up to maxGuesses times.
The function prompts us for our guess with the input function, and then assigns
the response to myGuess as a string. Because we want to interpret the response as
an integer, we use the int function to convert the string. Once it has myGuess, the
function uses the if/else statement to tell us whether we have guessed correctly.
After this, the loop will give us another guess, until we have used them all up.
Reflection 5.23 Try playing the game by calling guessingGame(20). Does it work?
Is there anything we still need to work on?
1. After we guess correctly, unless we have used up all of our guesses, the loop
iterates again and gives us another guess. Instead, we want the function to
end at this point.
2. It would be much friendlier for the game to tell us whether an incorrect guess
is too high or too low.
Because we are replacing our for loop with a while loop that ends when the guess is correct or the guesses run out, we will now need to manage the index variable manually. We do this by initializing guesses = 0 before
the loop and incrementing guesses in the body of the loop. Here is the function
with these changes:
def guessingGame(maxGuesses):
""" (docstring omitted) """
Reflection 5.24 Notice that we have also included myGuess = 0 before the loop.
Why do we bother to assign a value to myGuess before the loop? Is there anything
special about the value 0? (Hint: try commenting it out.)
If we comment out myGuess = 0, we will see the following error on the line containing
the while loop:
UnboundLocalError: local variable ’myGuess’ referenced before assignment
This error means that we have referred to an unknown variable named myGuess.
The name is unknown to the Python interpreter because we had not defined it
before it was first referenced in the while loop condition. Therefore, we need to
initialize myGuess before the while loop, so that the condition makes sense the first
time it is tested. To make sure the loop iterates at least once, we need to initialize
myGuess to a value that cannot be the secret number. Since the secret number will
be at least 1, 0 works for this purpose. This logic can be generalized as one of two
rules to always keep in mind when designing an algorithm with a while loop:
1. Initialize the condition before the loop. Always make sure that the condition
makes sense and will behave in the intended way the first time it is tested.
2. In each iteration of the loop, work toward the condition eventually becoming
false. Not doing so will result in an infinite loop.
To ensure that our condition will eventually become false, we need to understand
when this happens. For the and expression in the while loop to become false, either
(myGuess != secretNumber) must be false or (guesses < maxGuesses) must be
false.
Reflection 5.25 How do the statements in the body of the loop ensure that eventu-
ally (myGuess != secretNumber) or (guesses < maxGuesses) will be false?
Prompting for a new guess creates the opportunity for the first part to become false,
while incrementing guesses ensures that the second part will eventually become
false. Therefore, we cannot have an infinite loop.
This reasoning is enshrined in the first of De Morgan’s laws, named after 19th
century British mathematician Augustus De Morgan. De Morgan's two laws are:

    not (a and b)   is equivalent to   (not a) or (not b)
    not (a or b)    is equivalent to   (not a) and (not b)
You may recognize these from Exercises 1.4.10 and 1.4.11. The first law says that
a and b is false if either a is false or b is false. The second law says that a or b is
false if both a is false and b is false. Applied to our case, the first law tells us that
the negation of our while loop condition is
(myGuess == secretNumber) or (guesses >= maxGuesses)
2. Friendly hints
Inside the loop, we currently handle two cases: (1) we win, and (2) we do not win
but get another guess. To be friendlier, we should split the second case into two: (2a)
our guess was too low, and (2b) our guess was too high. We can accomplish this by
replacing the not-so-friendly print(’Nope. Try again.’) with another if/else
that decides between the two new subcases:
if myGuess == secretNumber:            # win
    print('You got it!')
else:                                  # try again
    if myGuess < secretNumber:         # too low
        print('Too low. Try again.')
    else:                              # too high
        print('Too high. Try again.')
Now, if myGuess == secretNumber is false, we execute the first else clause, the
body of which is the new if/else construct. If myGuess < secretNumber is true,
we print that the number is too low; otherwise, we print that the number is too
high.
Reflection 5.26 Do you see a way in which the conditional construct above can be
simplified?
The conditional construct above is really just equivalent to a decision between three
disjoint possibilities: (a) the guess is equal to the secret number, (b) the guess is
less than the secret number, or (c) the guess is greater than the secret number. In
other words, it is equivalent to:
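# A sketch of the equivalent three-way conditional:
if myGuess == secretNumber:            # win
    print('You got it!')
elif myGuess < secretNumber:           # too low
    print('Too low. Try again.')
else:                                  # too high
    print('Too high. Try again.')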
In the game, if we incorrectly use our last guess, then two things must be true just
before the if condition is tested: first, myGuess is not equal to secretNumber and,
second, guesses is equal to maxGuesses. So we can incorporate this condition into
the if/elif/else:
if (myGuess != secretNumber) and (guesses == maxGuesses):   # lose
    print('Too bad. You lose.')
elif myGuess == secretNumber:                               # win
    print('You got it!')
elif myGuess < secretNumber:                                # too low
    print('Too low. Try again.')
else:                                                       # too high
    print('Too high. Try again.')
Notice that we have made the previous first if condition into an elif statement
because we only want one of the four messages to be printed. However, here is an
alternative structure that is more elegant:
if myGuess == secretNumber:            # win
    print('You got it!')
elif guesses == maxGuesses:            # lose
    print('Too bad. You lose.')
elif myGuess < secretNumber:           # too low
    print('Too low. Try again.')
else:                                  # too high
    print('Too high. Try again.')
By placing the new condition second, we can leverage the fact that, if we get to the
first elif, we already know that myGuess != secretNumber. Therefore, we do not
need to include it explicitly.
There is a third way to handle this situation that is perhaps even more elegant.
Notice that both of the first two conditions are going to happen at most once, and
at the end of the program. So it might make more sense to put them after the loop.
Doing so also exhibits a nice parallel between these two events and the two parts of
the while loop condition. As we discussed earlier, the negation of the while loop
condition is
(myGuess == secretNumber) or (guesses >= maxGuesses)
So when the loop ends, at least one of these two things is true. Notice that these
two events are exactly the events that define a win or a loss: if the first part is
true, then we won; if the second part is true, we lost. So we can move the win/loss
statements after the loop, and decide which to print based on which part of the
while loop condition became false:
if myGuess == secretNumber:            # win
    print('You got it!')
else:                                  # lose
    print('Too bad. You lose.')
In the body of the loop, with these two cases gone, we will now need to check if we
still get another guess (mirroring the while loop condition) before we print one of
the “try again” messages:
if (myGuess != secretNumber) and (guesses < maxGuesses):
    if myGuess < secretNumber:         # too low
        print('Too low. Try again.')
    else:                              # too high
        print('Too high. Try again.')
Reflection 5.28 Why is it not correct to combine the two if statements above into
a single statement like the following?
if (myGuess != secretNumber) and (guesses < maxGuesses) \
        and (myGuess < secretNumber):
    print('Too low. Try again.')
else:
    print('Too high. Try again.')
Hint: what does the function print when guesses < maxGuesses is false and
myGuess < secretNumber is true?
These changes are incorporated into the final game that is shown below.
import random
def guessingGame(maxGuesses):
""" (docstring omitted) """
def main():
    guessingGame(10)

main()
As you play the game, think about what the best strategy is. How many guesses
do different strategies require? Exercise 5.5.7 asks you to write a Monte Carlo
simulation to compare three different strategies for playing the game.
Exercises
5.5.1. Write a function
ABC()
that prompts for a choice of A, B, or C and uses a while loop to keep prompting
until it receives the string ’A’, ’B’, or ’C’.
5.5.2. Write a function
numberPlease()
that prompts for an integer between 1 and 100 (inclusive) and uses a while loop
to keep prompting until it receives a number within this range.
if outcome == 1:
    print('Player 1 wins!')
elif outcome == -1:
    print('Player 2 wins!')
else:
    print('Player 1 and player 2 tied.')
5.5.5. Write a function
yearsUntilDoubled(amount, rate)
that returns the number of years until amount is doubled when it earns the given
rate of interest, compounded annually. Use a while loop.
5.5.6. The hailstone numbers are a sequence of numbers generated by the following simple
algorithm. First, choose any positive integer. Then, repeatedly follow this rule: if
the current number is even, divide it by two; otherwise, multiply it by three and
add one. For example, suppose we choose the initial integer to be 3. Then this
algorithm produces the following sequence:
3, 10, 5, 16, 8, 4, 2, 1, 4, 2, 1, 4, 2, 1 . . .
For every initial integer ever tried, the sequence always reaches one and then
repeats the sequence 4, 2, 1 forever after. Interestingly, however, no one has ever
proven that this pattern holds for every integer!
Write a function
hailstone(start)
¹See http://en.wikipedia.org/wiki/Rock-paper-scissors-lizard-Spock for the rules.
that prints the hailstone number sequence starting from the parameter start,
until the value reaches one. Your function should return the number of integers
in your sequence. For example, if start were 3, the function should return 8. (Use a
while loop.)
5.5.7. In this exercise, we will design a Monte Carlo simulation to compare the effectiveness
of three strategies for playing the guessing game. Each of these strategies will be
incorporated into the guessing game function we designed in this chapter, but
instead of checking whether the player wins or loses, the function will continue
until the number is guessed, and then return the number of guesses used. We
will also make the maximum possible secret number a parameter, so that we can
compare the results for different ranges of secret numbers.
The first strategy is to just make a random guess each time, ignoring any previous
guesses:
def guessingGame1(maxNumber):
    """Play the guessing game by making random guesses."""

    return guesses
The second strategy is to incrementally try every possible guess from 1 up to
maxNumber, thereby avoiding any duplicates:
def guessingGame2(maxNumber):
"""Play the guessing game by making incremental guesses."""
return guesses
Finally, the third strategy uses the outcomes of previous guesses to narrow in on
the secret number:
def guessingGame3(maxNumber):
"""Play the guessing game intelligently by narrowing in
on the secret number."""
return guesses
Write a Monte Carlo simulation to compare the expected (i.e., average) behavior
of these three strategies. Use a sufficiently high number of trials to get consistent
results. Similarly to what we did in Section 5.1, run your simulation for a range
of maximum secret numbers, specifically 5, 10, 15, . . . , 100, and plot the average
number of guesses required by each strategy for each maximum secret number.
(The x-axis of your plot will be the maximum secret number and the y-axis will be
the average number of guesses.) Explain the results. In general, how many guesses
on average do you think each strategy requires to guess a secret number between 1
and n?
5.6 SUMMARY
In previous chapters, we designed deterministic algorithms that did the same thing
every time we executed them, if we gave them the same inputs (i.e., arguments).
Giving those algorithms different arguments, of course, could change their behavior,
whether it be drawing a different size shape, modeling a different population, or
experimenting with a different investment scenario. In this chapter, we started to
investigate a new class of algorithms that can change their behavior “on the fly,” so
to speak. These algorithms all make choices using Boolean logic, the same Boolean
logic on which computers are fundamentally based. By combining comparison
operators and Boolean operators, we can characterize just about any decision. By
incorporating these Boolean expressions into conditional statements (if/elif/else)
and conditional loops (while), we vastly increase the diversity of algorithms that
we can design. These are fundamental techniques that we will continue to use
and develop over the next several chapters, as we start to work with textual and
numerical data that we read from files and download from the web.
http://physerver.hamilton.edu/Research/Brownian/index.html
The Drunkard’s Walk by Leonard Mlodinow [34] is a very accessible book about
how randomness and chance affect our lives. For more information about generating
random numbers, and the differences between PRNGs and true random number
generators, visit
https://www.random.org/randomness/ .
The Park-Miller random number generator is due to Keith Miller and the late Steve
Park [37].
The Roper Center for Public Opinion Research, at the University of Connecticut,
maintains some helpful educational resources about random sampling and errors in
the context of public opinion polling at
http://www.ropercenter.uconn.edu/education.html .
5.8 PROJECTS
Project 5.1 The magic of polling
According to the latest poll, the president’s job approval rating is at 45%,
with a margin of error of ±3%, based on interviews with approximately
1,500 adults over the weekend.
We see news headlines like this all the time. But how can a poll of 1,500 randomly
chosen people claim to represent the opinions of millions in the general population?
How can the pollsters be so certain of the margin of error? In this project, we will
investigate how well random sampling can really estimate the characteristics of a
larger population. We will assume that we know the true percentage of the overall
population with some characteristic or belief, and then investigate how accurate a
much smaller poll is likely to get.
Suppose we know that 30% of the national population agrees with the statement,
“Animals should be afforded the same rights as human beings.” Intuitively, this
means that, if we randomly sample ten individuals from this population, we should,
on average, find that three of them agree with the statement and seven do not.
But does it follow that every poll of ten randomly chosen people will mirror the
percentage of the larger population? Unlike a Monte Carlo simulation, a poll is
taken just once (or maybe twice) at any particular point in time. To have confidence
in the poll results, we need some assurance that the results would not be drastically
different if the poll had queried a different group of randomly chosen individuals.
For example, suppose you polled ten people and found that two agreed with the
statement, then polled ten more people and found that seven agreed, and then
polled ten more people and found that all ten agreed. What would you conclude?
There is too much variation for this poll to be credible. But what if we polled more
than ten people? Does the variation, and hence the trustworthiness, improve?
In this project, you will write a program to investigate questions such as these
and determine empirically how large a sample needs to be to reliably represent the
sentiments of a large population.
1. Simulate a poll
In conducting this poll, the pollster asks each randomly selected individual whether
he or she agrees with the statement. We know that 30% of the population does, so
there is a 30% chance that each individual answers “yes.” To simulate this polling
process, we can iterate over the number of individuals being polled and count them
as a “yes” with probability 0.3. The final count at the end of the loop, divided by
the number of polled individuals, gives us the poll result. Implement this simulation
by writing a function
poll(percentage, pollSize)
that simulates the polling of pollSize individuals from a large population in which
the given percentage (between 0 and 100) will respond “yes.” The function should
return the percentage (between 0 and 100) of the poll that actually responded “yes.”
Remember that the result will be different every time the function is called. Test
your function with a variety of poll sizes.
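
A minimal sketch of poll along these lines follows; it uses random.random() to decide
each response, and the details may differ from your own solution.

import random

def poll(percentage, pollSize):
    """Simulate polling pollSize individuals, each of whom answers
    "yes" with probability percentage / 100."""
    yesCount = 0
    for individual in range(pollSize):
        if random.random() < percentage / 100:   # this individual says "yes"
            yesCount = yesCount + 1
    return (yesCount / pollSize) * 100            # percentage of "yes" responses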
Test your function with a variety of poll sizes and numbers of trials.
that plots the minimum and maximum percentages returned by calling the function
pollExtremes(percentage, pollSize, trials) for values of pollSize ranging
from minPollSize to maxPollSize, in increments of step. For each poll size, call
your pollExtremes function with
low, high = pollExtremes(percentage, pollSize, trials)
and then append the values of low and high each to its own list for the plot. Your
function should return the margin of error for the largest poll, defined to be (high
- low) / 2. The poll size should be on the x-axis of your plot and the percentage
should be on the y-axis. Be sure to label both axes.
Question 5.1.1 Assuming that you want to balance a low margin of error with the
labor involved in polling more people, what is a reasonable poll size? What margin
of error does this poll size give?
Write a main function (if you have not already) that calls your plotResults
function to investigate an answer to this question. You might start by calling
it with plotResults(30, 10, 1000, 10, 100).
that plots the margin of error in a poll of pollSize individuals, for actual percentages
ranging from minPercentage to maxPercentage, in increments of step. To find
the margin of error for each poll, call the pollExtremes function as above, and
compute (high - low) / 2. In your plot, the percentage should be on the x-axis
and the margin of error should be on the y-axis. Be sure to label both axes.
Question 5.1.2 Does your answer to the previous part change if the actual per-
centage of the population is very low or very high?
You might start to investigate this question by calling the function with
plotErrors(1500, 10, 80, 1, 100).
that simulates the narrow escape problem in a circle with radius 1 and an
opening of openingDegrees degrees. In the circle, the opening will be between
360 − openingDegrees and 360 degrees, as illustrated below (0 and 360 degrees refer
to the same point on the circle).
value of x a bit. Also, the Python arctangent (tan⁻¹) function, math.atan, always
returns an angle between −π/2 and π/2 radians (between −90 and 90 degrees), so
the result needs to be adjusted to be between 0 and 360 degrees. The following
function handles this for you.
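
A sketch of such a helper function is shown below. It uses math.atan2 (rather than
math.atan with explicit quadrant checks), and the name angle is an assumption; the
version supplied with the project may differ.

import math

def angle(x, y):
    """Return the angle of the point (x, y), measured counterclockwise
    from the positive x axis, as a value between 0 and 360 degrees."""
    degrees = math.degrees(math.atan2(y, x))      # between -180 and 180 degrees
    if degrees < 0:
        degrees = degrees + 360
    return degrees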
Below you will find a “skeleton” of the escape function with the loop and draw-
ing code already written. Drawing the partial circle is handled by the function
setupWalls below. Notice that the function uses a while loop with a Boolean flag
variable named escaped controlling the iteration. The value of escaped is initially
False, and your algorithm should set it to True when the particle escapes. Most,
but not all, of the remaining code is needed in the while loop.
if draw:
scale = 300 # scale up drawing
setupWalls(tortoise, openingDegrees, scale, radius)
if draw:
tortoise.goto(x * scale, y * scale) # move particle
if draw:
screen = tortoise.getscreen() # update screen to compensate
screen.update() # for high tracer value
Plot the average numbers of steps for openings ranging from 10 to 180 degrees,
in 10-degree steps, using at least 1,000 trials to get a smooth curve. As this number
of trials will take a few minutes to complete, start with fewer trials to make sure
your simulation is working properly.
In his undergraduate thesis at the University of Pittsburgh, Carey Caginalp [7]
mathematically derived a function that describes these results. In particular, he
proved that the expected time required by a particle to escape an opening width of
α degrees is
T(α) = 1/2 − 2 ln(sin(α/4)).
Plot this function in the same graph as your empirical results. You will notice
that the T (α) curve is considerably below the results from your simulation, which
has to do with the step size that we used (i.e., the value of stepLength in the
escape function). To adjust for this step size, multiply each value returned by
the escapeMonteCarlo function by the square of the step size ((π/128)²) before
you plot it. Once you do this, the results from your Monte Carlo simulation will
be in the same time units used by Caginalp, and should line up closely with the
mathematically derived result.
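
For example, the final comparison plot might be produced with something like the
following sketch, assuming that escapeMonteCarlo(openingDegrees, trials) returns the
average number of steps and that matplotlib is used for plotting.

import math
import matplotlib.pyplot as plt

openings = list(range(10, 181, 10))
simulated = [escapeMonteCarlo(alpha, 1000) * (math.pi / 128) ** 2
             for alpha in openings]               # rescaled simulation results
predicted = [0.5 - 2 * math.log(math.sin(math.radians(alpha / 4)))
             for alpha in openings]               # Caginalp's T(alpha)
plt.plot(openings, simulated, label='simulation')
plt.plot(openings, predicted, label='T(alpha)')
plt.xlabel('opening (degrees)')
plt.ylabel('average escape time')
plt.legend()
plt.show()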
CHAPTER 6

Text, documents, and DNA
So, here’s what I can say: the Library of Congress has more than 3 petabytes of digital
collections. What else I can say with all certainty is that by the time you read this, all the
numbers — counts and amount of storage — will have changed.
The roughly 2000 sequencing instruments in labs and hospitals around the world can
collectively sequence 15 quadrillion nucleotides per year, which equals about 15 petabytes
of compressed genetic data. A petabyte is 2⁵⁰ bytes, or in round numbers, 1000 terabytes.
To put this into perspective, if you were to write this data onto standard DVDs, the resulting
stack would be more than 2 miles tall. And with sequencing capacity increasing at a rate of
around three- to fivefold per year, next year the stack would be around 6 to 10 miles tall. At
this rate, within the next five years the stack of DVDs could reach higher than the orbit of
the International Space Station.
def roar(n):
    return 'r' + ('o' * n) + ('a' * n) + 'r!'

def speak(animal):
    if animal == 'cat':
        word = 'meow.'
    elif animal == 'dog':
        word = 'woof.'
    else:
        word = roar(10)
    print(word)          # (assumed) print the chosen sound

speak('monkey')
Reflection 6.1 What is printed by the code above? Make sure you understand why.
Remember that, when using methods, the name of the object, in this case the string
object named pirate, precedes the name of the method.
Smaller strings contained within larger strings are called substrings. We can use
the operators in and not in to test whether a string contains a particular substring.
For example, the conditions in each of the following if statements are true.
¹ See Appendix B.6 for a list.
Two particularly useful methods that deal with substrings are replace and count.
The replace method returns a new string with all occurrences of one substring
replaced with another substring. For example,
>>> newName = pirate.replace(’Long’, ’Short’)
>>> newName
’Shortbeard’
>>> quote = ’Yo-ho-ho, and a bottle of rum!’
>>> quote2 = quote.replace(’ ’, ’’)
>>> quote2
’Yo-ho-ho,andabottleofrum!’
The second example uses the replace method to delete spaces from a string. The
second parameter to replace, quotes with nothing in between, is called the empty
string, and is a perfectly legal string containing zero characters.
Reflection 6.2 Write a statement to replace all occurrences of brb in a text with
be right back.
The count method returns the number of occurrences of a substring in a string. For
example,
>>> pirate.count(’ear’)
1
>>> pirate.count(’eye’)
0
We can count the number of words in a text by counting the number of “whitespace”
characters, since almost every word must be followed by a space of some kind.
Whitespace characters are spaces, tabs, and newline characters, which are the
hidden characters that mark the end of a line of text. Tab and newline characters
are examples of non-printable control characters. A tab character is denoted by
the two characters \t and a newline character is denoted by the two characters \n.
The wordCount1 function below uses the count method to return the number of
whitespace characters in a string named text.
def wordCount1(text):
    """Approximate the number of words in a string by counting
    the number of spaces, tabs, and newlines.

    Parameter:
        text: a string object
    """
    return text.count(' ') + text.count('\t') + text.count('\n')

def main():
    shortText = 'This is not long.  But it will do. \n' + \
                'All we need is a few sentences.'
    wc = wordCount1(shortText)
    print(wc)

main()
In the string shortText above, note the two spaces after the first period and the
single space before the newline; the string is broken into two parts because it
does not fit on one line.
Reflection 6.4 What answer does the wordCount1 function give in the example
above? Is this the correct answer? If not, why not?
The wordCount1 function returns 16, but there are actually only 15 words in
shortText. We did not get the correct answer because there are two spaces after
the first period, a space and newline character after the second period, and nothing
after the last period. How can we fix this? If we knew that the text contained no
newline characters, and was perfectly formatted with one space between words, two
spaces between sentences, and no space at the end, then we could correct the above
computation by subtracting the number of instances of two spaces in the text and
adding one for the last word:
def wordCount2(text):
    """ (docstring omitted) """
    return (text.count(' ') + text.count('\t') + text.count('\n')
            - text.count('  ') + 1)
But most text is not perfectly formatted, so we need a more sophisticated approach.
Reflection 6.5 If you were faced with a long text, and could only examine one
character at a time, what algorithm would you use to count the number of words?
As a starting point, let’s think about how the count method must work. To count
the number of instances of some character in a string, the count method must
examine each character in the string, and increment an accumulator variable for
each one that matches.
To replicate this behavior, we can iterate over a string with a for loop, just
as we have iterated over ranges of integers. For example, the following for loop
iterates over the characters in the string shortText. Insert this loop into the main
function above to see what it does.
for character in shortText:
print(character)
In this loop, each character in the string is assigned to the index variable character
in order. To illustrate this, we have the body of the loop simply print the character
assigned to the index variable. Given the value assigned to shortText in the main
function above, this loop will print
T
h
i
⋮ (middle omitted)
e
s
.
To count whitespace characters, we can use the same loop but, in the body of the
loop, we need to increment a counter each time the value of character is equal to
a whitespace character:
def wordCount3(text):
    """ (docstring omitted) """

    count = 0
    for character in text:
        if character == ' ' or character == '\t' or character == '\n':
            count = count + 1
    return count
Reflection 6.6 What happens if we replace the or operators with and operators?
Reflection 6.7 What answer does the wordCount3 function give when it is called
with the parameter shortText that we defined above?
When we call this function with shortText, we get the same answer that we did
with wordCount1 (16) because we have simply replicated that function’s behavior.
Reflection 6.8 Now that we have a baseline word count function, how can we make
it more accurate on text with extra spaces?
To improve upon wordCount3, we need to count only the first whitespace or newline
character in any sequence of such characters.
Reflection 6.9 If the value of character is a space, tab, or newline, how can we
tell if it is the first in a sequence of such characters? (Hint: if it is the first, what
must the previous character not be?)
def wordCount4(text):
    """ (docstring omitted) """

    count = 0
    prevCharacter = ' '
    for character in text:
        if character in ' \t\n' and prevCharacter not in ' \t\n':
            count = count + 1
        prevCharacter = character
    return count
Let’s consider two possibilities for the first character in text: either it is a whitespace
character, or it is not. If the first character in text is not a whitespace or newline
character (as would normally be the case), then the first part of the if condition
(character in ’ \t\n’) will be false. Therefore, the initial value of prevCharacter
does not matter. On the other hand, if the first value assigned to character is
a whitespace or newline character, then the first part of the if condition will be
true. But we want to make sure that this character does not count as ending a
word. Setting prevCharacter to a space initially will ensure that the second part of
the if condition (prevCharacter not in ’ \t\n’) is initially false, and prevent
count from being incremented.
Reflection 6.12 Are there any other values that would work for the initial value
of prevCharacter?
Finally, we need to deal with the situation in which the text does not end with a
whitespace character. In this case, the final word would not have been counted, so
we need to increment count by one.
Reflection 6.13 How can we tell if the last character in text is a whitespace or
newline character?
Since prevCharacter will be assigned the last character in text after the loop
completes, we can check its value after the loop. If it is a whitespace character, then
the last word has already been counted; otherwise, we need to increment count. So
the final function looks like this:
def wordCount5(text):
    """Count the number of words in a string.

    Parameter:
        text: a string object

    Return value: the number of words in text
    """
    count = 0
    prevCharacter = ' '
    for character in text:
        if character in ' \t\n' and prevCharacter not in ' \t\n':
            count = count + 1
        prevCharacter = character
    if prevCharacter not in ' \t\n':
        count = count + 1
    return count
Although our examples have used short strings, our function will work on any size
string we want. In the next section, we will see how to read an entire text file or
web page into a string, and then use our word count function, unchanged, to count
the words in the file. In later sections, we will design similar algorithms to carry out
more sophisticated analyses of text files containing entire books and long sequences
of DNA.
Exercises
6.1.1. Write a function
twice(text)
that returns the string text repeated twice, with a space in between. For example,
twice(’bah’) should return the string ’bah bah’.
6.1.2. Write a function
repeat(text, n)
that returns a string that is n copies of the string text. For example, repeat(’AB’,
3) should return the string ’ABABAB’.
6.1.3. Write a function
vowels(word)
that uses the count method to return the number of vowels in the string word.
(Note that word may contain upper and lower case letters.)
6.1.4. Write a function
nospaces(sentence)
that uses the replace string method to return a version of the string sentence in
which all the spaces have been replaced by the underscore (_) character.
6.1.5. Write a function
txtHelp(txt)
that returns an expanded version of the string txt, which may contain texting
abbreviations like “brb” and “lol.” Your function should expand at least four
different texting abbreviations. For example, txtHelp(’imo u r lol brb’) might
return the string ’in my opinion you are laugh out loud be right back’.
6.1.6. Write a function
letters(text)
that prints the characters of the string text, one per line. For example
letters(’abc’) should print
a
b
c
makeEvenParity(bits)
that returns a string consisting of bits with one additional bit concate-
nated so that the returned string has even parity. Your function should
call your evenParity function. For example, makeEvenParity(’110101’)
should return ’1101010’ and makeEvenParity(’110001’) should return
’1100011’.
The second argument to open is the mode to use when working with the file; ’r’
means that we want to read from the file. By default, every file is assumed to contain
text.
² This file can be obtained from the book's website or from Project Gutenberg at
http://www.gutenberg.org/files/2701/2701.txt
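
For example, to open the Moby Dick text for reading, we might write the following
(the variable name inputFile matches its use below):

inputFile = open('mobydick.txt', 'r')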
Figure 6.1 A Finder window in Mac OS X and the partial corresponding tree repre-
sentation. Ovals represent folders and rectangles represent files.
The read method of a file object reads the entire contents of a file into a string.
For example, the following statement would read the entire text from the file into a
string assigned to the variable named text.
text = inputFile.read()
When we are finished with a file, it is important that we close it. Closing a file
signals to the operating system that we are done using it, and ensures that any
memory allocated to the file by the program is released. To close a file, we simply
use the file object’s close method:
inputFile.close()
Let’s look at an example that puts all of this together. The following function reads
in a file with the given file name and returns the number of words in the file using
our wordCount5 function.
def wcFile(fileName):
    """Return the number of words in the file with the given name.
    Parameter:
        fileName: the name of a text file
    """
    inputFile = open(fileName, 'r', encoding='utf-8')
    text = inputFile.read()
    inputFile.close()
    return wordCount5(text)
The optional encoding parameter to the open function indicates how the bits in
the file should be interpreted (we will discuss what UTF-8 is in Section 6.3).
Reflection 6.14 How many words are there in the file mobydick.txt?
Now suppose we want to print a text file, formatted with line numbers to the left of
each line. A “line” is defined to be a sequence of characters that end with a newline
(’\n’) character. Since we need to print a line number to the left of every line, this
problem would be much more easily solved if we could read in a file one line at
a time. Fortunately, we can. In the same way that we can iterate over a range of
integers or the characters of a string, we can iterate over the lines in a file. When we
use a file object as the sequence in a for loop, the index variable is assigned a string
containing each line in the file, one line per iteration. For example, the following
code prints each line in the file object named textFile:
for line in textFile:
print(line)
In each iteration of this loop, the index variable named line is assigned the next
line in the file, which is then printed in the body of the loop. We can easily extend
this idea to our line-printing problem:
def lineNumbers(fileName):
"""Print the contents of the file with the given name
with each line preceded by a line number.
Parameter:
fileName: the name of a text file
The lineNumbers function combines an accumulator with a for loop that reads the
text file line by line. After the file is opened, the accumulator variable lineCount is
initialized to one. Inside the loop, each line is printed using a format string that
precedes the line with the current value of lineCount. At the end of the loop, the
accumulator is incremented and the loop repeats.
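
Filled in completely, lineNumbers might read as follows; the exact format string is
an assumption.

def lineNumbers(fileName):
    """Print the contents of the file with the given name
    with each line preceded by a line number.

    Parameter:
        fileName: the name of a text file
    """
    inputFile = open(fileName, 'r', encoding='utf-8')
    lineCount = 1
    for line in inputFile:
        print('{0:>5} {1}'.format(lineCount, line[:-1]))   # assumed format string
        lineCount = lineCount + 1
    inputFile.close()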
Reflection 6.15 Why does the function print line[:-1] instead of line in each
iteration? What happens if you replace line[:-1] with line?
Reflection 6.16 How would the output change if lineCount was instead incre-
mented in the loop before calling print?
Reflection 6.17 How many lines are there in the file mobydick.txt?
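
To create a file that we can write to, we open it in write mode; for example (a sketch
using the newfile.txt and newTextFile names that appear below):

newTextFile = open('newfile.txt', 'w')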
Opening a file in this way will create a new file named newfile.txt, if a file by
that name does not exist, or overwrite the file by that name if it does exist. (So be
careful!) To append to the end of an existing file, use the ’a’ (“append”) mode
instead. Once the file is open, we can write text to it using the write method:
newTextFile.write(’Hello.\n’)
The write method does not write a newline character at the end of the string by
default, so we have to include one explicitly, if one is desired.
The following function illustrates how we can modify the lineNumbers function
so that it writes the file with line numbers directly to another file instead of printing
it.
Parameters:
fileName: the name of a text file
newFileName: the name of the output text file
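
One way to write such a function is sketched below; the name lineNumbersFile and
the format string are assumptions.

def lineNumbersFile(fileName, newFileName):
    """Write a copy of a text file with each line preceded by a line number.

    Parameters:
        fileName: the name of a text file
        newFileName: the name of the output text file
    """
    inputFile = open(fileName, 'r', encoding='utf-8')
    newTextFile = open(newFileName, 'w')
    lineCount = 1
    for line in inputFile:
        newTextFile.write('{0:>5} {1}\n'.format(lineCount, line[:-1]))
        lineCount = lineCount + 1
    inputFile.close()
    newTextFile.close()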
Remember to always close the new file when you are done. Closing a file to which
we have written ensures that the changes have actually been written to the drive.
To improve efficiency, an operating system does not necessarily write text out to the
drive immediately. Instead, it usually waits until a sufficient amount builds up, and
then writes it all at once. Therefore, if you forget to close a file and your computer
crashes, your program’s last writes may not have actually been written. (This is one
reason why we sometimes have trouble with corrupted files after a computer crash.)
short for ”hypertext transfer protocol”). For example, to open the main web page
at python.org, we can do the following:
>>> import urllib.request as web
>>> webpage = web.urlopen(’http://python.org’)
Once we have the file object, we can read from it just as we did earlier.
>>> text = webpage.read()
>>> webpage.close()
Behind the scenes, the Python interpreter communicates with a web server over the
Internet to read this web page. But thanks to the magic of abstraction, we did not
have to worry about any of those details.
Because the urlopen function does not accept an encoding parameter, the read
function cannot tell how the text of the web page is encoded. Therefore, read returns
a bytes object instead of a string. A bytes object contains a sequence of raw bytes
that are not interpreted in any particular way. To convert the bytes object to a
string before we print it, we can use the decode method, as follows:
>>> print(text.decode(’utf-8’))
<!doctype html>
⋮
(Again, we will see what UTF-8 is in Section 6.3.) What you see when you print
text is HTML (short for hypertext markup language) code for the python.org
home page. HTML is the language in which most web pages are written.
We can download data files from the web in the same way if we know the correct
URL. For example, the U.S. Food and Drug Administration (FDA) lists recalls of
products that it regulates at http://www.fda.gov/Safety/Recalls/default.htm.
The raw data behind this list is also available, so we can write a program that reads
and parses it to gather statistics about recalls over time. Let’s first take a look at
what the data looks like. Because the file is quite long, we will initially print only
the first twenty lines by calling the readline method, which just reads a single line
into a string, in a for loop:
>>> url = ’http://www.fda.gov/DataSets/Recalls/RecallsDataSet.xml’
>>> webpage = web.urlopen(url)
>>> for i in range(20):
... line = webpage.readline()
... print(line.decode(’utf-8’))
...
<RECALLS_DATA>
<PRODUCT>
<DATE>Mon, 11 Aug 2014 00:00:00 -0400</DATE>
<BRAND_NAME><![CDATA[Good Food]]></BRAND_NAME>
<PRODUCT_DESCRIPTION><![CDATA[Carob powder]]></PRODUCT_DESCRIPTION>
<REASON><![CDATA[Salmonella]]></REASON>
<COMPANY><![CDATA[Goodfood Inc.]]></COMPANY>
<COMPANY_RELEASE_LINK>http://www.fda.gov/Safety/Recalls/ucm40969.htm
</COMPANY_RELEASE_LINK>
<PHOTOS_LINK></PHOTOS_LINK>
</PRODUCT>
⋮
</RECALLS_DATA>
This file is in a common format called XML (short for extensible markup language).
In XML, data elements are enclosed in matching pairs of tags. In this file, each
product recall is enclosed in a pair of <PRODUCT> ⋯ </PRODUCT> tags. Within that
element are other elements enclosed in matching tags that give detailed information
about the product.
Reflection 6.18 Look at the (fictitious) example product above enclosed in the
<PRODUCT> ⋯ </PRODUCT> tags. What company made the product? What is the
product called? Why was it recalled? When was it recalled?
Before we can do anything useful with the data in this file, we need to be able to
identify individual product recall descriptions. Notice that the file begins with a
header line that describes the version of XML and the text encoding. Following
some blank lines, we next notice that all of the recalls are enclosed in a matching
pair of <RECALLS_DATA> ⋯ </RECALLS_DATA> tags. So to get to the first product,
we need to read until we find the <RECALLS_DATA> tag. We can do this with a while
loop:
url = 'http://www.fda.gov/DataSets/Recalls/RecallsDataSet.xml'
webpage = web.urlopen(url)
line = ''
while line[:14] != '<RECALLS_DATA>':
    line = webpage.readline()
    line = line.decode('utf-8')
Implicit in the file object abstraction is a file pointer that keeps track of the position
of the next character to be read. So this while loop helpfully moves the file pointer
to the beginning of the line after the <RECALLS_DATA> tag.
Reflection 6.20 Why will line != ’<RECALLS_DATA>’ not work as the condition
in the while loop? (What control character is “hiding” at the end of that line?)
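
The statements being described are essentially the following (the same code reappears
inside the printProducts function below):

line = webpage.readline()                 # read the next line of XML data
line = line.decode('utf-8')
while line[:10] != '</PRODUCT>':          # while still inside the product element
    print(line.rstrip())
    line = webpage.readline()
    line = line.decode('utf-8')
print(line.rstrip())                      # print the closing </PRODUCT> tag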
The two statements before the while loop read and decode one line of XML data, so
that the condition in the while loop makes sense when it is first tested. The while
loop then iterates while we have not read the </PRODUCT> tag that marks the end of
the product element. Inside the loop, we print the line and then read the next line
of data. Since the loop body is not executed when line[:10] == ’</PRODUCT>’,
we add another call to the print function after the loop to print the closing tag.
Reflection 6.21 Why do we call line.rstrip() before printing each line? (What
happens if we omit the call to rstrip?)
Reflection 6.22 What would happen if we did not call readline inside the loop?
Finally, let’s put this loop inside another loop to print and count all of the
product elements. We saw earlier that the product elements are enclosed within
<RECALLS_DATA> ⋯ </RECALLS_DATA> tags. Therefore, we want to print product
elements while we do not see the closing </RECALLS_DATA> tag.
def printProducts():
    """Print the products on the FDA recall list.

    Parameters: none
    """
    url = 'http://www.fda.gov/DataSets/Recalls/RecallsDataSet.xml'
    webpage = web.urlopen(url)
    line = ''
    while line[:14] != '<RECALLS_DATA>':         # skip ahead to the product list
        line = webpage.readline()
        line = line.decode('utf-8')
    productNum = 1
    line = webpage.readline()                    # read first <PRODUCT> line
    line = line.decode('utf-8')
    while line[:15] != '</RECALLS_DATA>':        # while more products
        print(productNum)
        while line[:10] != '</PRODUCT>':         # print one product element
            print(line.rstrip())
            line = webpage.readline()
            line = line.decode('utf-8')
        print(line.rstrip())
        productNum = productNum + 1
        line = webpage.readline()                # read next <PRODUCT> line
        line = line.decode('utf-8')
    webpage.close()
The new pieces are the outer while loop and the statements that manage the
accumulator productNum that counts the products. A loop within a loop, often called a nested loop,
can look complicated, but it helps to think of the previously written inner while
loop (the five black statements in the middle) as a functional abstraction that prints
one product. The outer loop repeats this segment while we do not see the final
</RECALLS_DATA> tag. The condition of the outer loop is initialized by reading
the first <PRODUCT> line before the loop, and the loop moves toward the condition
becoming false by reading the next <PRODUCT> (or </RECALLS_DATA>) line at the
end of the loop.
Although our function only prints the product information, it provides a frame-
work in which to do more interesting things with the data. Exercise 6.5.7 in Section 6.5
asks you to use this framework to compile the number of products recalled for a
particular reason in a particular year.
Exercises
6.2.1. Modify the lineNumbers function so that it only prints a line number on every
tenth line (for lines 1, 11, 21, . . .).
6.2.2. Write a function
wcWeb(url)
that reads a text file from the web at the given URL and returns the number of
words in the file using the final wordCount5 function from Section 6.1. You can test
your function on books from Project Gutenberg at http://www.gutenberg.org.
For any book, choose the “Plain Text UTF-8” or ASCII version. In either case,
the file should end with a .txt file extension. For example,
wcWeb(’http://www.gutenberg.org/cache/epub/98/pg98.txt’)
should return the number of words in A Tale of Two Cities by Charles Dickens.
You can also access a mirrored copy of A Tale of Two Cities from the book web
site at
http://discovercs.denison.edu/chapter6/ataleoftwocities.txt
6.2.3. Write a function
wcLines(fileName)
that uses wordCount5 function from Section 6.1 to print the number of words in
each line of the file with the given file name.
6.2.4. Write a function
pigLatinDict(fileName)
that prints the Pig Latin equivalent of every word in the dictionary file with the
given file name. (See Exercise 6.3.3.) Assume there is exactly one word on each
line of the file. Start by testing your function on small files that you create. An
actual dictionary file can be found on most Mac OS X and Linux computers at
/usr/share/dict/words. There is also a dictionary file available on the book web
site.
6.2.5. Repeat the previous exercise, but have your function write the results to a new file
instead, one Pig Latin word per line. Add a second parameter for the name of the
new file.
6.2.6. Write a function
strip(fileName, newFileName)
that creates a new version of the file with the given fileName in which all whitespace
characters (’ ’, ’\n’, and ’\t’) have been removed. The second parameter is the
name of the new file.
pirate     L    o    n    g    b    e    a    r    d
           0    1    2    3    4    5    6    7    8
          -9   -8   -7   -6   -5   -4   -3   -2   -1
A reference to the entire string of nine characters is assigned to the name pirate.
Each character in the string is identified by an index that indicates its position.
Indices always start at 0. We can access a character directly by referring to its index
in square brackets following the name of the string. For example,
>>> pirate = ’Longbeard’
>>> pirate[0]
’L’
>>> pirate[2]
’n’
Notice that each character is itself represented as a single-character string in quotes.
As indicated in the figure above, we can also use negative indexing, which starts
from the end of the string. For example, in the figure, pirate[2] and pirate[-7]
refer to the same character.
Reflection 6.23 Using negative indexing, how can we always access the last char-
acter in a string, regardless of its length?
The length of a string is returned by the len function:
>>> len(pirate)
9
>>> pirate[len(pirate) - 1]
’d’
>>> pirate[-1]
’d’
>>> pirate[len(pirate)]
IndexError: string index out of range
Reflection 6.24 Why does the last statement above result in an error?
Notice that len is not a method and that it returns the number of characters in the
string, not the index of the last character. The positive index of the last character
in a string is always the length of the string minus one. As shown above, referring
to an index that does not exist will give an index error .
If we need to find the index of a particular character or substring in a string, we
can use the find method. But the find method only returns the position of the
first occurrence. For example,
>>> pirate2 = ’Willie Stargell’
>>> pirate2.find(’gel’)
11
>>> pirate2.find(’ll’)
2
>>> pirate2.find(’jelly’)
-1
In the last example, the substring ’jelly’ was not found, so find returned −1.
Reflection 6.25 Why does −1 make sense as a return value that means “not
found?”
However, in Python, we cannot do so. Strings are immutable, meaning they cannot
be changed in place. Although this may seem like an arbitrary (and inconvenient)
decision, there are good reasons for it. The primary one is efficiency of both space
and time. The memory to store a string is allocated when the string is created.
Later, the memory immediately after the string may be used to store some other
values. If strings could be lengthened by adding more characters to the end, then,
when this happens, a larger chunk of memory would need to be allocated, and the
old string would need to be copied to the new space. The illustration below depicts
what might need to happen if we were allowed to add an ’s’ to the end of pirate.
In this example, the three variable names refer to three different adjacent chunks
of a computer’s memory: answer refers to the value 42, pirate refers to the string
’Longbeard’, and fish refers to the value 17.
answer            pirate                    fish
  42       L  o  n  g  b  e  a  r  d         17

  42              (unused)                   17     L  o  n  g  b  e  a  r  d  s
>>> pirate[:4]
’Long’
>>> pirate[4:]
’beard’
>>> pirate[:]
’Longbeard’
>>> pirate[:-1]
’Longbear’
The second to last expression creates a copy of the string pirate. Since the index
−1 refers to the index of the last character in the string, the last example gives us
the slice that includes everything up to, but not including, the last character.
Reflection 6.26 How would you create a slice of pirate that contains all but the
first character? How about a slice of pirate that evaluates to ’ear’?
Let’s now return to our original question: how can we change the ’e’ in ’Longbeard’
to an ’o’? To accomplish this with slicing, we need to assign the concatenation of
three strings to pirate: the slice of pirate before the ’e’, an ’o’, and the slice of
pirate after the ’e’:
>>> pirate = pirate[:5] + ’o’ + pirate[6:]
>>> pirate
’Longboard’
def copy(text):
    """Return a copy of text.

    Parameter:
        text: a string object
    """
    newText = ''
    for character in text:
        newText = newText + character
    return newText
This technique is really just another version of an accumulator, and is very similar
to the list accumulators that we have been using for plotting. To illustrate how
this works, suppose text was ’abcd’. The table below illustrates how each value
changes in each iteration of the loop.
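
    iteration    character    newText
        1           'a'        'a'
        2           'b'        'ab'
        3           'c'        'abc'
        4           'd'        'abcd'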
In the first iteration, the first character in text is assigned to character, which is
’a’. Then newText is assigned the concatenation of the current value of newText
and character, which is ’’ + ’a’, or ’a’. In the second iteration, character is
assigned ’b’, and newText is assigned the concatenation of the current value of
newText and character, which is ’a’ + ’b’, or ’ab’. This continues for two more
iterations, resulting in a value of newText that is identical to the original text.
This string accumulator technique can form the basis of all kinds of transforma-
tions to text. For example, let’s write a function that removes all whitespace from
a string. This is very similar to what we did above, except that we only want to
concatenate the character if it is not a whitespace character.
def noWhitespace(text):
    """Return a version of text without whitespace.

    Parameter:
        text: a string object
    """
    newText = ''
    for character in text:
        if character not in ' \t\n':
            newText = newText + character
    return newText
Some of the exercises below ask you to write similar functions to modify text in a
variety of ways.
Encoding characters
As we alluded to earlier, in a computer’s memory, each of the characters in a string
must be encoded in binary in some way. Up until recently, most English language
text was encoded in a format known as ASCII.³ ASCII assigns each character a 7-bit
code, and is therefore able to represent 2⁷ = 128 different characters. In memory,
each ASCII character is stored in one byte, with the leftmost bit of the byte being
a 0. So a string is stored as a sequence of bytes, which can also be represented as a
sequence of numbers between 0 and 127.

³ ASCII is an acronym for American Standard Code for Information Interchange.

    0-31     control characters        65-90     upper case letters
    32       space                     91-96     punctuation characters
    33-47    punctuation characters    97-122    lower case letters
    48-57    digits                    123-126   punctuation characters
    58-64    punctuation characters    127       delete

Figure 6.2 An overview of the organization (not to scale) of the ASCII character
set (and the Basic Latin segment of the Unicode character set) with decimal code
ranges. For the complete Unicode character set, refer to http://unicode.org.

For example, the string 'Longbeard' is
stored in memory in ASCII as
01001100 01101111 01101110 01100111 01100010 01100101 01100001 01110010 01100100
   76      111      110      103      98       101      97       114      100
    L        o        n        g       b         e       a         r        d
The middle row contains the decimal equivalents of the binary codes. Figure 6.2
illustrates an overview of the organization of the ASCII character set.
The ASCII character set has been largely supplanted, including in Python, by
an international standard known as Unicode. Whereas ASCII only provides codes
for Latin characters, Unicode encodes over 100,000 different characters from more
than 100 languages, using up to 4 bytes per character. A Unicode string can be
encoded in one of three ways, but is most commonly encoded using a variable-length
system called UTF-8 . Conveniently, UTF-8 is backwards-compatible with ASCII,
so each character in the ASCII character set is encoded in the same 1-byte format
in UTF-8. In Python, we can view the Unicode code (in decimal) for any character
using the ord function. (ord is short for “ordinal.”) For example,
>>> ord(’L’)
76
The chr function is the inverse of ord; given a Unicode code, chr returns the
corresponding character.
>>> chr(76)
’L’
We can use the ord and chr functions to convert between letters and numbers. For
example, when we print a numeric value on the screen using the print function,
each digit in the number must be converted to its corresponding character to be
displayed. In other words, the value 0 must be converted to the character ’0’, the
value 1 must be converted to the character ’1’, etc. The Unicode codes for the digit
characters are conveniently sequential, so the code for any digit character is equal
to the code for ’0’, which is ord(’0’), plus the value of the digit. For example,
>>> ord(’2’)
50
>>> ord(’0’) + 2
50
Therefore, for any one-digit integer, we can get the corresponding character by
passing the value plus ord(’0’) into the chr function. For example:
>>> chr(ord(’0’) + 2)
’2’
The following function generalizes this idea for any one-digit integer named digit:
def digit2String(digit):
    """Converts an integer digit to its string representation.

    Parameter:
        digit: an integer in 0, 1, ..., 9

    Return value:
        the corresponding character '0', '1', ..., '9'
        or None if a non-digit is passed as an argument
    """
    if digit < 0 or digit > 9:
        return None
    return chr(ord('0') + digit)
If digit is not the value of a decimal digit, we need to recognize that and do
something appropriate. In this case, we simply return None.
We can use a similar idea to convert a character into a number. Suppose we
want to convert a letter of the alphabet into an integer representing its position. In
other words, we want to convert the character ’A’ or ’a’ to 1, ’B’ or ’b’ to 2, etc.
Like the characters for the digits, the codes for the upper case and lower case letters
are in consecutive order. Therefore, for an upper case letter, we can subtract the
code for ’A’ from the code for the letter to get the letter’s offset relative to ’A’.
Similarly, we can subtract the code for ’a’ from the code for a lower case letter to
get the lower case letter’s offset with respect to ’a’. Try it yourself:
>>> ord(’D’) - ord(’A’)
3
Since this gives us one less than the value we want, we can simply add one to get
the correct position for the letter.
>>> ord(’D’) - ord(’A’) + 1
4
bits instead. This is called a prefix code because no code is a prefix of another code,
which is essential for decoding the file.
An alternative compression technique, used by the Lempel-Ziv-Welch algorithm, re-
places repeated strings of characters with fixed-length codes. For example, in the
string CANTNAGATANCANCANNAGANT, the repeated sequences CAN and NAG might each be
represented with its own code.
def letter2Index(letter):
    """Returns the position of a letter in the alphabet.
    Parameter:
        letter: an upper case or lower case letter
    """
    if letter >= 'A' and letter <= 'Z':
        return ord(letter) - ord('A') + 1
    return ord(letter) - ord('a') + 1
Notice above that we can compare characters in the same way we compare numbers.
The values being compared are actually the Unicode codes of the characters, but
since the letters and numbers are in consecutive order, the comparisons follow
alphabetical order. We can also compare longer strings in the same way.
>>> ’cat’ < ’dog’
True
>>> ’cat’ < ’catastrophe’
True
>>> ’Cat’ < ’cat’
True
>>> ’1’ < ’one’
True
Reflection 6.27 Why are the expressions ’Cat’ < ’cat’ and ’1’ < ’one’ both
True? Refer to Figure 6.2.
Exercises
6.3.1. Suppose you have a string stored in a variable named word. Show how you would
print
(a) the string length
(b) the first character in the string
(c) the third character in the string
(d) the last character in the string
(e) the last three characters in the string
(f) the string consisting of the second, third, and fourth characters
(g) the string consisting of the fifth, fourth, and third to last characters
(h) the string consisting of all but the last character
6.3.2. Write a function
username(first, last)
that returns a person’s username, specified as the last name followed by an
underscore and the first initial. For example, username(’martin’, ’freeman’)
should return the string ’freeman_m’.
6.3.3. Write a function
piglatin(word)
that returns the pig latin equivalent of the string word. Pig latin moves the first
character of the string to the end, and follows it with ’ay’. For example, pig latin
for ’python’ is ’ythonpay’.
6.3.4. Suppose
quote = ’Well done is better than well said.’
(The quote is from Benjamin Franklin.) Use slicing notation to answer each of the
following questions.
(a) What slice of quote is equal to ’done’?
(b) What slice of quote is equal to ’well said.’?
(c) What slice of quote is equal to ’one is bet’?
(d) What slice of quote is equal to ’Well do’?
6.3.5. Write a function
noVowels(text)
that returns a version of the string text with all the vowels removed. For example,
noVowels(‘this is an example.’) should return the string ’ths s n xmpl.’.
6.3.6. Suppose you develop a code that replaces a string with a new string that consists
of all the even indexed characters of the original followed by all the odd indexed
characters. For example, the string ’computers’ would be encoded as ’cmuesoptr’.
Write a function
encode(word)
that returns the encoded version of the string named word.
6.3.7. Write a function
decode(codeword)
that reverses the process from the encode function in the previous exercise.
6.3.8. Write a function
daffy(word)
that returns a string that has Daffy Duck’s lisp added to it (Daffy would pronounce
the ’s’ sound as though there was a ’th’ after it). For example, daffy("That’s
despicable!") should return the string "That’sth desthpicable!".
6.3.9. Suppose you work for a state in which all vehicle license plates consist of a string
of letters followed by a string of numbers, such as ’ABC 123’. Write a function
randomPlate(length)
that returns a string representing a randomly generated license plate consisting of
length upper case letters followed by a space followed by length digits.
6.3.10. Write a function
int2String(n)
that converts a positive integer value n to its string equivalent, without using the
str function. For example, int2String(1234) should return the string ’1234’.
(Use the digit2String function from this section.)
def letterGrade(grade):
    if grade >= 100:
        return 'A'
    if grade > 59:
        return SOMETHING
    return 'F'
then convert the sum back to a character. For simplicity, assume that the string
contains only lower case letters. Because the sum will likely be greater than 25,
we will need to convert the sum to a number between 0 and 25 by finding the
remainder modulo 26. For example, to find the checksum character for the string
’snow’, we compute (18 + 13 + 14 + 22) mod 26 (because s = 18, n = 13, o = 14
and w = 22), which equals 67 mod 26 = 15. Since 15 = p, we add ’p’ onto the
end of ’snow’ when we transmit this sequence of characters. The last character is
then checked on the receiving end. Write a function
checksum(word)
that returns word with the appropriate checksum character added to the end. For
example, checksum(’snow’) should return ’snowp’. (Hint: use chr and ord.)
6.3.18. Write a function
checksumCheck(word)
that determines whether the checksum character at the end of word (see Exer-
cise 6.3.17) is correct. For example, checksumCheck(’snowp’) should return True,
but checksumCheck(’snowy’) should return False.
6.3.19. Julius Caesar is said to have sent secret correspondence using a simple encryption
scheme that is now known as the Caesar cipher. In the Caesar cipher, each letter
in a text is replaced by the letter some fixed distance, called the shift, away. For
example, with a shift of 3, A is replaced by D, B is replaced by E, etc. At the
end of the alphabet, the encoding wraps around so that X is replaced by A, Y is
replaced by B, and Z is replaced by C. Write a function
encipher(text, shift)
that returns the result of encrypting text with a Caesar cypher with the given
shift. Assume that text contains only upper case letters.
6.3.20. Modify the encipher function from the previous problem so that it either encrypts
or decrypts text, based on the value of an additional Boolean parameter.
Reflection 6.28 Suppose that the variable names word1 and word2 are both
assigned string values. Should a comparison between these two strings, like
word1 < word2, also count as an elementary step?
In the first case, only a comparison of the first characters is required to determine
that the expression is true. However, in the second case, or if the two strings are the
same, we must compare every character to yield an answer. Therefore, assuming one
string is not the empty string, the minimum number of comparisons is 1 and the
maximum number of comparisons is n, where n is the length of the shorter string.
(Exercise 6.4.1 more explicitly illustrates how a string comparison works.)
Put another way, the best-case time complexity of a string comparison is constant
because it does not depend on the input size, and the worst-case time complexity
for a string comparison is directly proportional to n, the length of the shorter string.
Reflection 6.30 Do you think the best-case time complexity or the worst-case time
complexity is more representative of the true time complexity in practice?
Reflection 6.31 What if one or both of the strings are constants? For example, how
many character comparisons are required to evaluate ’buzzards’ == ’buzzword’?
Because both values are constants instead of variables, we know that this comparison
always requires five character comparisons (to find that the fifth characters in the
strings are different). In other words, the number of character comparisons required
by this string comparison is independent of the input to any algorithm containing
it. Therefore, the time complexity is constant.
Let’s now return to a comparison of the wordCount1 and wordCount5 functions.
The final wordCount5 function is reproduced below.
1 def wordCount5(text):
2 """ (docstring omitted) """
3
4 count = 0
5 prevCharacter = ’ ’
6 for character in text:
7 if character in ’ \t\n’ and prevCharacter not in ’ \t\n’:
8 count = count + 1
9 prevCharacter = character
10 if prevCharacter not in ’ \t\n’:
11 count = count + 1
12 return count
Each of the first two statements in the function, on lines 4 and 5, is an elementary
step because its time does not depend on the value of the input, text. These
statements are followed by a for loop on lines 6–9.
Reflection 6.32 Suppose that text contains n characters (i.e., len(text) is n).
In terms of n, how many times does the for loop iterate?
The loop iterates n times, once for each character in text. In each of these iterations,
a new character from text is implicitly assigned to the index variable character,
which we should count as one elementary step per iteration. Then the comparison
on line 7 is executed. Since both character and prevCharacter consist of a single
character, and each of these characters is compared to the three characters in the
string ’ \t\n’, there are at most six character comparisons here in total. Although
count is incremented on line 8 only when this condition is true, we will assume
that it happens in every iteration because we are interested in the worst-case time
complexity. So this increment and the assignment on line 9 add two more elementary
steps to the body of the loop, for a total of nine. Since these nine elementary
steps are executed once for each character in the string text, the total number of
elementary steps in the loop is 9n. Finally, the comparison on line 10 counts as three
more elementary steps, the increment on line 11 counts as one, and the return
statement on line 12 counts as one. Adding all of these together, we have a total of
9n + 7 elementary steps in the worst case.
Now let’s count the number of elementary steps required by the wordCount1 and
wordCount3 functions. Although they did not work as well, they provide a useful
comparison. The wordCount1 function is reproduced below.
def wordCount1(text):
    """ (docstring omitted) """
    return text.count(' ') + text.count('\t') + text.count('\n')
The wordCount1 function simply calls the count method three times to count the
number of spaces, tabs, and newlines in the parameter text. But, as we saw in
the previous section, each of these method calls hides a for loop that is comparing
every character in text to the character argument that is passed into count. We can
estimate the time complexity of wordCount1 by looking at wordCount3, reproduced
below, which we developed to mimic the behavior of wordCount1.
def wordCount3(text):
    """ (docstring omitted) """

    count = 0
    for character in text:
        if character in ' \t\n':
            count = count + 1
    return count
Reflection 6.33 Our analysis predicted that wordCount1 would be about 9/5 times
faster than wordCount5 and about the same as wordCount3. Why do you think
wordCount1 was so much faster than both of them?
This experiment confirmed that the time complexities of all three functions are
proportional to n, the length of the input. But it contradicted our expectation about
the magnitude of the constant ratio between their time complexities. This is not
surprising at all. The discrepancy is due mainly to three factors. First, we missed
some hidden elementary steps in our analyses of wordCount5 and wordCount3.
For example, we did not count the and operation in line 7 of wordCount5 or the
time it takes to “jump” back up to the top of the loop at the end of an iteration.
Second, we did not take into account that different types of elementary steps take
different amounts of real time. Third, and most importantly here, there are various
optimizations that are hidden in the implementations of all built-in Python functions
and methods, like count, that make them faster than our Python code. At our level
of abstraction, these optimizations are not available.
Figure 6.4 Two views of the time complexities n² + 2n + 2 (blue), n² (green), and n
(red).
complexity expressions of algorithms to include just those terms that have an impact
on the growth rate as the input size grows large. For example, we can simplify 9n + 7
to just n because, as n gets very large, 9n + 7 and n grow at the same rate. In other
words, the constants 9 and 7 in the expression 9n + 7 do not impact the rate at
which the expression grows when n gets large. This is illustrated in the following
table, which shows what happens to each expression as n increases by factors of 10.
        n   ↑ factor       9n + 7   ↑ factor
       10       –              97       –
      100      10             907    9.35052
    1,000      10           9,007    9.93054
   10,000      10          90,007    9.99301
  100,000      10         900,007    9.99930
1,000,000      10       9,000,007    9.99993
The second column of each row shows the ratio of the value of n in that row to the
value of n in the previous row, always a factor of ten. The fourth column shows the
same ratios for 9n + 7, which get closer and closer to 10, as n gets larger. In other
words, the values of n and 9n + 7 grow at essentially the same rate as n gets larger.
We call this idea the asymptotic time complexity. So the asymptotic time
complexity of each of wordCount1, wordCount3, and wordCount5 is n. Asymptotic
refers to our interest in arbitrarily large (i.e., infinite) input sizes. An asymptote, a
line that an infinite curve gets arbitrarily close to, but never actually touches, should
be familiar if you have taken some calculus. By saying that 9n + 7 is asymptotically
n, we are saying that 9n + 7 is really the same thing as n, as our input size parameter
n approaches infinity. Algorithms with asymptotic time complexity equal to n
are said to have linear time complexity. We also say that each is a linear-time
algorithm. In general, any algorithm that performs a constant number of operations
on each element in a list or string has linear time complexity. All of our word
count algorithms, and the string comparison that we started with, have linear time
complexity.
Reflection 6.34 The Mean algorithm on Page 10 also has linear time complexity.
Can you explain why?
Exercises
6.4.1. The following function more explicitly illustrates how a string comparison works
“behind the scenes.” The steps variable counts how many individual character
comparisons are necessary.
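A sketch of such a compare function follows. The exact placement of the increments
to steps is a judgment call, so the counts this sketch reports may differ slightly
from the original listing.

def compare(word1, word2):
    """Return -1, 0, or 1 according to whether word1 is less than, equal
    to, or greater than word2, and print the number of comparisons made.
    """
    steps = 3                  # comparisons in the first test of the while loop
    index = 0
    while index < len(word1) and index < len(word2) and \
            word1[index] == word2[index]:
        index = index + 1
        steps = steps + 3      # comparisons in the next test of the while loop
    if index == len(word1) and index == len(word2):
        steps = steps + 2
        result = 0             # case 1: the strings are equal
    elif index == len(word1):
        steps = steps + 2
        result = -1            # case 2: word1 is a prefix of word2
    elif index == len(word2):
        steps = steps + 2
        result = 1             # case 3: word2 is a prefix of word1
    elif word1[index] < word2[index]:
        steps = steps + 1
        result = -1            # case 4: the mismatched character is smaller
    else:
        steps = steps + 1
        result = 1             # case 5: the mismatched character is larger
    print(steps, 'comparisons')
    return result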
The variable index keeps track of the position in the two strings that is currently
being compared. The while loop increments index while the characters at position
index are the same. The while loop ends when one of three things happens:
(a) index reaches the end of word1,
(b) index reaches the end of word2, or
(c) word1[index] does not equal word2[index].
If index reaches all the way to the end of word1 and all the way to the end of
word2, then the two strings have the same length and all of their characters along
the way were equal, which means that the two strings must be equal (case 1). If
index reaches all the way to the end of word1, but index is still less than the
length of word2, then word1 must be a prefix of word2 and therefore word1 <
word2 (case 2). If index reaches all the way to the end of word2, but index is still
less than the length of word1, then we have the opposite situation and therefore
word1 > word2 (case 3). If index did not reach the end of either string, then a
mismatch between characters must have occurred, so we need to compare the last
characters again to figure out which was less (cases 4 and 5).
The variable steps counts the total number of comparisons in the function. This
includes both character comparisons and comparisons involving the length of the
strings. The value of steps is incremented by 3 before the 3 comparisons in each
iteration of the while loop, and by 1 or 2 more in each case after the loop.
Experiment with this function and answer the following questions.
(a) How many comparisons happen in the compare function when
i. ’canny’ is compared to ’candidate’?
ii. ’canny’ is compared to ’danny’?
iii. ’canny’ is compared to ’canny’?
iv. ’can’ is compared to ’canada’?
v. ’canoeing’ is compared to ’canoe’?
(b) Suppose word1 and word2 are the same n-character string. How many com-
parisons happen when word1 is compared to word2?
(c) Suppose word1 (with m characters) is a prefix of word2 (with n characters).
How many comparisons happen when word1 is compared to word2?
(d) The value of steps actually overcounts the number of comparisons in some
cases. When does this happen?
6.4.2. For each of the following code snippets, think carefully about how it must work,
and then indicate whether it represents a linear-time algorithm, a constant-time
algorithm, or something else. Each variable name refers to a string object with
length n. Assume that any operation on a single character is one elementary step.
(a) name = name.upper()
(b) name = name.find(’x’)
(c) name = ’accident’.find(’x’)
(d) newName = name.replace(’a’, ’e’)
(e) newName = name + ’son’
(f) newName = ’jack’ + ’son’
(g) index = ord(’H’) - ord(’A’) + 1
(h) for character in name:
print(character)
(i) for character in ’hello’:
print(character)
(j) if name == newName:
print(’yes’)
(k) if name == ’hello’:
print(’yes’)
(l) if ’x’ in name:
print(’yes’)
(m) for character in name:
x = name.find(character)
6.4.3. What is the asymptotic time complexity of an algorithm that requires each of the
following numbers of elementary steps? Assume that n is the length of the input
in each case.
(a) 7n − 4
(b) 6
(c) 3n² + 2n + 6
(d) 4n³ + 5n + 2n
(e) n log₂ n + 2n
6.4.4. Suppose that two algorithms for the same problem require 12n and n2 elementary
steps. On a computer capable of executing 1 billion steps per second, how long
will each algorithm take (in seconds) on inputs of size n = 10, 10², 10⁴, 10⁶, and
10⁹? Is the algorithm that requires n² steps ever faster than the algorithm that
requires 12n steps?
def count1(text, target):
    """Count the number of occurrences of a single character in a string.

    Parameters:
        text: a string object
        target: a single-character string object
    """
    count = 0
    for character in text:
        if character == target:
            count = count + 1
    return count
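For example, calling the function on a small string:

>>> count1('mississippi', 's')
4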
Reflection 6.35 Assuming target is a letter, how can we modify the function to
count both lower and upper case instances of target? (Hint: see Appendix B.6.)
Reflection 6.36 If we allow target to contain more than one character, how can
we count the number of occurrences of any character in target? (Hint: the word
“in” is the key.)
Reflection 6.37 Can we use the same for loop that we used in the count1 function
to count the number of occurrences of a string containing more than one character?
Iterating over the characters in a string only allows us to “see” one character at a time,
so we only have one character at a time that we can compare to a target string in the
body of the loop. Instead, we need to compare the target string to all multi-character
substrings with the same length in text. For example, suppose we want to search
for the target string ’good’ in a larger string named text. Then we need to check
whether text[0:4] is equal to ’good’, then whether text[1:5] is equal to ’good’,
then whether text[2:6] is equal to ’good’, etc. More concisely, for all values of
index equal to 0, 1, 2, ..., we need to test whether text[index:index + 4]
is equal to ’good’. In general, for all values of index equal to 0, 1, 2, ..., we
need to test whether text[index:index + len(target)] is equal to target. To
examine these slices, we need a for loop that iterates over every index of text,
rather than over the characters in text.
def count(text, target):
    """Count the number of occurrences of one string in another.

    Parameters:
        text: a string object
        target: a string object
    """
    count = 0
    for index in range(len(text)):
        if text[index:index + len(target)] == target:
            count = count + 1
    return count
Let’s look at how count works when we call it with the following arguments:4
result = count(’Diligence is the mother of good luck.’, ’the’)
If we “unwind” the loop, we find that the statements executed in the body of the
loop are equivalent to:
if 'Dil' == 'the':        # compare text[0:3] to 'the'
    count = count + 1     # not executed; count is still 0
if 'ili' == 'the':        # compare text[1:4] to 'the'
    count = count + 1     # not executed; count is still 0
if 'lig' == 'the':        # compare text[2:5] to 'the'
    count = count + 1     # not executed; count is still 0
⋮
if ' th' == 'the':        # compare text[12:15] to 'the'
    count = count + 1     # not executed; count is still 0
if 'the' == 'the':        # compare text[13:16] to 'the'
    count = count + 1     # count is now 1
if 'he ' == 'the':        # compare text[14:17] to 'the'
    count = count + 1     # not executed; count is still 1
⋮
if 'oth' == 'the':        # compare text[18:21] to 'the'
    count = count + 1     # not executed; count is still 1
if 'the' == 'the':        # compare text[19:22] to 'the'
    count = count + 1     # count is now 2
⋮
if 'k.' == 'the':         # compare text[35:38] to 'the'
    count = count + 1     # not executed; count is still 2
if '.' == 'the':          # compare text[36:39] to 'the'
    count = count + 1     # not executed; count is still 2
return count              # return 2
Notice that the last two comparisons can never be true because the strings corre-
sponding to text[35:38] and text[36:39] are too short. Therefore, we never need
to look at a slice that starts after len(text) - len(target). To eliminate these
needless comparisons, we could change the range of indices to
range(len(text) - len(target) + 1)
Reflection 6.39 What is returned by a slice of a string that starts beyond the last
character in the string (e.g., ’good’[4:8])? What is returned by a slice that starts
before the last character but extends beyond the last character (e.g., ’good’[2:10])?
Iterating over the indices of a string is an alternative to iterating over the characters.
For example, the count1 function could alternatively be written as:
⁴ “Diligence is the mother of good luck.” is from The Way to Wealth (1758) by Benjamin Franklin.
def count1(text, target):
    """ (docstring omitted) """
    count = 0
    for index in range(len(text)):
        if text[index] == target:
            count = count + 1
    return count
Compare the two versions of this function, and make sure you understand how the
two approaches perform exactly the same comparisons in their loops.
There are some applications in which iterating over the string indices is necessary.
For example, consider a function that is supposed to return the index of the first
occurrence of a particular character in a string. If the function iterates over the
characters in the string, it would look like this:
Parameters:
text: a string object to search in
target: a single-character string object to search for
This is just like the first version of count1, except we want to return the index of
character when we find that it equals target. But when this happens, we are left
without a satisfactory return value because we do not know the index of character!
Instead, consider a version that iterates over the indices of the string:
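(The following is a sketch; the original listing may differ in minor details.)

def find1(text, target):
    """Return the index of the first occurrence of the single-character
    string target in text, or -1 if target is not found.

    Parameters:
        text: a string object to search in
        target: a single-character string object to search for
    """
    for index in range(len(text)):
        if text[index] == target:
            return index
    return -1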
Now when we find that text[index] == target, we know that the desired character
is at position index, and we can return that value.
Reflection 6.40 What is the purpose of return -1 at the end of the function?
Under what circumstances is it executed? Why is the following alternative imple-
mentation incorrect?
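The alternative implementation, called find1BAD below, differs from find1 only in
where the return -1 statement appears; the following is a sketch.

def find1BAD(text, target):
    """ (an incorrect version of find1) """
    for index in range(len(text)):
        if text[index] == target:
            return index
        return -1      # misplaced: this executes during the first iteration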
Let’s look at the return statements in the correct find1 function first. If
text[index] == target is true for some value of index in the for loop, then
the find1 function is terminated by the return index statement. In this case, the
loop never reaches its “natural” conclusion and the return -1 statement is never
reached. Therefore, the return -1 statement in the find1 function is only executed
if no match for target is found in text.
In the incorrect find1BAD function, the return -1 is misplaced because it
causes the function to always return during the first iteration of the loop! When 0
is assigned to index, if text[index] == target is true, then the value 0 will be
returned. Otherwise, the value −1 will be returned. Either way, the next iteration
of the for loop will never happen. The function will appear to work correctly if
target happens to match the first character in text or if target is not found in
text, but it will be incorrect if target only matches some later character in text.
Since the function does not work for every input, it is incorrect overall.
Just as we generalized the count1 function to count by using slicing, we can
generalize find1 to a function find that finds the first occurrence of any substring.
def find(text, target):
    """Return the index of the first occurrence of target in text,
    or -1 if target does not occur in text.

    Parameters:
        text: a string object to search in
        target: a string object to search for
    """
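    # Body sketch; when len(target) is 1, this behaves exactly like find1.
    # (The original may loop over range(len(text)) instead; either works.)
    for index in range(len(text) - len(target) + 1):
        if text[index:index + len(target)] == target:
            return index
    return -1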
Notice how similar this algorithm is to that of find1 and that, if
len(target) is 1, find does the same thing as find1. Like the wordCount5 function
from Section 6.1, the count1 and find1 functions implement linear-time algorithms.
To see why, let’s look at the find1 function more closely. As we did before, let n
represent the length of the input text. The second input to find1, target, has
length one. So the total size of the input is n + 1. In the find1 function, the most
frequent elementary step is the comparison in the if statement inside the for loop.
Because the function ends when target is found in the string text, the worst-case
(i.e., maximum number of comparisons) occurs when target is not found. In this
case, there are n comparisons, one for every character in text. Since the number of
elementary steps is asymptotically the same as the input size, find1 is a linear-time
algorithm. For this reason, the algorithmic technique used by find1 is known as a
linear search (or sequential search). In Chapter 8, we will see an alternative search
algorithm that is much faster, but it can only be used in cases where the data is
maintained in sorted order.
Reflection 6.41 What are the time complexities of the more general count and
find functions? Assume text has length n and target has length m. Are these
also linear-time algorithms?
A concordance
Now that we have functions to count and search for substrings in text, we can apply
these to whatever text we want, including long text files. For example, we can write
an interactive program to find the first occurrence of any desired word in Moby
Dick :
def main():
    textFile = open('mobydick.txt', 'r', encoding = 'utf-8')
    text = textFile.read()
    textFile.close()
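    # A sketch of the rest of the program; the prompt wording and the
    # printed message are placeholders.
    word = input('Word to find? ')
    print('First occurrence at index', find(text, word))

main()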
We can create a concordance entry for a word by iterating over each line in a text file, and using our find function to decide whether the line contains
the given word. If it does, we print the line.
Reflection 6.42 If we call the find function to search for a word, how do we know
if it was found?
The following function implements the algorithm to print a single concordance entry:
def concordanceEntry(fileName, word):
    """Print each line of a text file that contains a given word.

    Parameters:
        fileName: the name of the text file as a string
        word: the word to search for
    """
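    # Body sketch consistent with the surrounding description; textFile is
    # a placeholder name for the file object.
    textFile = open(fileName, 'r', encoding = 'utf-8')
    for line in textFile:
        found = find(line, word)       # index of word in line, or -1
        if found >= 0:
            print(line.rstrip())
    textFile.close()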
When we call the concordanceEntry function on the text of Moby Dick, searching
for the word “lashed,” we get 14 matches, the first 6 of which are:
things not properly belonging to the room, there was a hammock lashed
blow her homeward; seeks all the lashed sea’s landlessness again;
sailed with. How he flashed at me!--his eyes like powder-pans! is he
I was so taken all aback with his brow, somehow. It flashed like a
with storm-lashed guns, on which the sea-salt cakes!
to the main-top and firmly lashed to the lower mast-head, the strongest
It would be easier to see where “lashed” appears in each line if we could line up the
words like this:
... belonging to the room, there was a hammock lashed
blow her homeward; seeks all the lashed sea’s ...
sailed with. How he flashed at me!...
... all aback with his brow, somehow. It flashed like a
with storm-lashed guns, ...
to the main-top and firmly lashed to the ...
Reflection 6.43 Assume that each line in the text file is at most 80 characters
long. How many spaces do we need to print before each line to make the target words
line up? (Hint: use the value of found.)
In each line in which the word is found, we know it is found at the index assigned
to found. Therefore, there are found characters before the word in that line. We
can make the ends of the target words line up at position 80 if we preface each line
with (80 - len(word) - found) spaces, by replacing the call to print with:
space = ' ' * (80 - len(word) - found)
print(space + line.rstrip())
Finally, these passages are not very useful without knowing where in the text they
belong. So we should add line numbers to each line of text that we print. As in the
lineNumbers function from Section 6.2, this is accomplished by incorporating an
accumulator that is incremented every time we read a line. When we print a line,
we format the line number in a field of width 6 to maintain the alignment that we
introduced previously.
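For example, the print statement might become something like the following, where
lineCount is a placeholder name for the line-number accumulator:

print('{0:6} {1}'.format(lineCount, space + line.rstrip()))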
There are many more enhancements we can make to this function, some of which
we leave as exercises.
Section 6.7 demonstrates how these algorithmic techniques can also be applied
to problems in genomics, the field of biology that studies the function and structure
of the DNA in living cells.
Exercises
6.5.1. For each of the following for loops, write an equivalent loop that iterates over the
indices of the string text instead of the characters.
(a) for character in text:
print(character)
(b) newText = ’’
for character in text:
if character != ’ ’:
newText = newText + character
(c) for character in text[2:10]:
if character >= ’a’ and character <= ’z’:
print(character)
(d) for character in text[1:-1]:
print(text.count(character))
6.5.2. Describe what is wrong with the syntax of each the following blocks of code, and
show how to fix it. Assume that a string value has been previously assigned to
text.
(a) for character in text:
caps = caps + character.upper()
(b) while answer != ’q’:
answer = input(’Word? ’)
print(len(answer))
(c) for index in range(text):
if text[index] != ’ ’:
print(index)
6.5.3. Write a function
prefixes(word)
that prints all of the prefixes of the given word. For example, prefixes(’cart’)
should print
c
ca
car
cart
6.5.4. Modify the find function so that it only finds instances of target that are whole
words.
’<PRODUCT>
<DATE>Mon, 19 Oct 2009 00:00:00 -0400</DATE>
<BRAND_NAME><![CDATA[Good food]]></BRAND_NAME>
<PRODUCT_DESCRIPTION><![CDATA[Cake]]></PRODUCT_DESCRIPTION>
<REASON><![CDATA[Allergen]]></REASON>
<COMPANY><![CDATA[Good food Inc.]]></COMPANY>
<COMPANY_RELEASE_LINK> ⋯ </COMPANY_RELEASE_LINK>
<PHOTOS_LINK></PHOTOS_LINK>
</PRODUCT>’
then the function should return the string ’2009 Allergen’.
(b) Modify the printProducts function from Section 6.2 to create a new function
recalls(reason, year)
that returns the number of recalls issued in the given year for the given reason.
To do this, the new function should, instead of printing each product recall
element, create a string containing the product recall element (like the one
above), and then pass this string into the findReason function. (Replace each
print(line.rstrip()) with a string accumulator statement and eliminate
the statements that implement product numbering. Do not delete anything
else.) With each result returned by findReason, increment a counter if it
contains the reason and year that are passed in as parameters.
import matplotlib.pyplot as pyplot    # assumed to be imported at the top of the program

def dotplot1(text1, text2):
    """Display a dot plot marking the positions at which text1 and text2
    contain the same character.

    Parameters:
        text1: a string object
        text2: a string object
    """
    text1 = text1.lower()
    text2 = text2.lower()
    x = []
    y = []
    for index in range(len(text1)):
        if text1[index] == text2[index]:
            x.append(index)
            y.append(index)
    pyplot.scatter(x, y)                 # scatter plot
    pyplot.xlim(0, len(text1))           # x axis covers entire text1
    pyplot.ylim(0, len(text2))           # y axis covers entire text2
    pyplot.xlabel(text1)
    pyplot.ylabel(text2)
    pyplot.show()
Reflection 6.44 What is the purpose of the calls to the lower method?
Reflection 6.45 Why must we iterate over the indices of the strings rather than
the characters in the strings?
Every time two characters are found to be equal in the loop, the index of the matching
characters is added to both a list of x-coordinates and a list of y-coordinates. These
lists are then plotted with the scatter function from matplotlib. The scatter
function simply plots points without lines attaching them. Figure 6.5 shows the
result for the two sequences above.
Reflection 6.46 Look at Figure 6.5. Which dots correspond to which characters?
Why are there only dots on the diagonal?
We can see that, because this function only recognizes matches at the same index
and most of the identical characters in the two sentences do not line up perfectly, this
function does not reveal the true degree of similarity between them. But if we were
to simply insert two gaps into the strings, the character-by-character comparison
would be quite different:
Text 1: P e t e r P i p e r p i c k e d a p e c k o f p i c k l e d p e p p e r s .
Text 2: P e t e r P e p p e r p i c k e d a p e c k o f p i c k l e d c a p e r s .
A real dot plot compares every character in one sequence to every character in the
other sequence. This means that we want to compare text1[0] to text2[0], then
text1[0] to text2[1], then text1[0] to text2[2], etc., as illustrated below:
        0 1 2 3 4 5 6 7 8 9 ...
text1:  Peter Piper picked a peck of pickled peppers.
        (text1[0] is compared, in turn, to every character of text2)
After we have compared text1[0] to all of the characters in text2, we need to repeat
this process with text1[1], comparing text1[1] to text2[0], then text1[1] to
text2[1], then text1[1] to text2[2], etc., as illustrated below:
        0 1 2 3 4 5 6 7 8 9 ...
text1:  Peter Piper picked a peck of pickled peppers.
        (text1[1] is compared, in turn, to every character of text2)
In other words, for each value of index, we want to compare text1[index] to every
base in text2, not just to text2[index]. To accomplish this, we need to replace
the if statement in dotplot1 with another for loop:
def dotplot(text1, text2):
    """Display a dot plot comparing every character of text1 to every
    character of text2.

    Parameters:
        text1: a string object
        text2: a string object
    """
    text1 = text1.lower()
    text2 = text2.lower()
    x = []
    y = []
    for index in range(len(text1)):
        for index2 in range(len(text2)):
            if text1[index] == text2[index2]:
                x.append(index)
                y.append(index2)
    pyplot.scatter(x, y)
    pyplot.xlim(0, len(text1))
    pyplot.ylim(0, len(text2))
    pyplot.xlabel(text1)
    pyplot.ylabel(text2)
    pyplot.show()
With this change inside the first for loop, each character text1[index] is compared
to every character in text2, indexed by the index variable index2, just as in the
illustrations above. If a match is found, we append index to the x list and index2
to the y list.
Reflection 6.47 Suppose we pass in ’plum’ for text1 and ’pea’ for text2. Write
the sequence of comparisons that would be made in the body of the for loop. How
many comparisons are there?
There are 4 ⋅ 3 = 12 total comparisons because each of the four characters in ’plum’ is compared to each of the three characters in ’pea’.
Figure 6.7 Output from the dotplot function from Exercise 6.6.5 (3-grams).
Reflection 6.48 Just by looking at Figure 6.8, would you conclude that the passage
had been plagiarized? (Think about what a dotplot comparing two random passages
would look like.)
Figure 6.8 A dot plot comparing 6-grams from an original and a plagiarized passage.
Exercises
6.6.1. What is printed by the following nested loop?
text = 'imho'
for index1 in range(len(text)):
    for index2 in range(index1, len(text)):
        print(text[index1:index2 + 1])
6.6.3. The Hamming distance between two bit strings of the same length is the number of positions in which the corresponding bits differ. For example, if two bit sequences differ in two positions, the Hamming distance between them is 2.
Write a function
hamming(bits1, bits2)
that returns the Hamming distance between the two given bit strings. Assume
that the two strings have the same length.
6.6.4. Repeat Exercise 6.6.3, but make it work correctly even if the two strings have differ-
ent lengths. In this case, each “missing” bit at the end of the shorter string counts
as one toward the Hamming distance. For example, hamming(’000’, ’10011’)
should return 3.
6.6.5. Generalize the dotplot function so that it compares n-grams instead of individual
characters.
6.6.6. One might be able to gain insight into a text by viewing the frequency with which
a word appears in “windows” of some length over the length of the text. Consider
the very small example below, in which we are counting the frequency of the “word”
a in windows of size 4 in the string ’abracadabradab’:
2 2
a b r a c a d a b r a d a b
2 1
In this example, the window skips ahead 3 characters each time. So the four
windows’ frequencies are 2, 2, 1, 2, which can be plotted like this:
The numbers on the x-axis are indices of the beginnings of each window. Write a
function
wordFrequency(fileName, word, windowSize, skip)
http://ghr.nlm.nih.gov/handbook/basics/dna
that displays a plot like this showing the frequencies of the string word in windows
of size windowSize, where the windows skip ahead by skip indices each time. The
text to analyze should be read in from the file with the given fileName.
*6.7 GENOMICS
Every living cell contains molecules of DNA (deoxyribonucleic acid) that encode
the genetic blueprint of the organism. Decoding the information contained in DNA,
and understanding how it is used in all the processes of life, is an ongoing grand
challenge at the center of modern biology and medicine. Comparing the DNA of
different organisms also provides insight into evolution and the tree of life. The
lengths of DNA molecules and the sheer quantity of DNA data that have been read
from living cells and recorded in text files require that biologists use computational
methods to answer this challenge.
A genomics primer
As illustrated in Figure 6.9, DNA is a long double-stranded molecule in the shape of
a double helix. Each strand is a chain of smaller units called nucleotides. A nucleotide
consists of a sugar (deoxyribose), a phosphate group, and one of four nitrogenous
bases: adenine (A), thymine (T), cytosine (C), or guanine (G). Each nucleotide in
a molecule can be represented by the letter corresponding to the base it contains,
and an entire strand can be represented by a string of characters corresponding
to its sequence of nucleotides. For example, the following string represents a small
DNA molecule consisting of an adenine nucleotide followed by a guanine nucleotide
followed by a cytosine nucleotide, etc.:
>>> dna = 'agcttttcattctgactg'
(The case of the characters is irrelevant; some sequence repositories use lower case
and some use upper case.) Real DNA sequences are stored in large text files; we
will look more closely at these later.
A base on one strand is connected via a hydrogen bond to a complementary
base on the other strand. C and G are complements, as are A and T. A base and
its connected complement are called a base pair . The two strands are “antiparallel”
in the sense that they are traversed in opposite directions by the cellular machinery
that copies and reads DNA. On each strand, DNA is read in an “upstream-to-
downstream” direction, but the “upstream” and “downstream” ends are reversed
on the two strands. For reasons that are not relevant here, the “upstream” end
is called 5’ (read “five prime”) and the downstream end is called 3’ (read “three
prime”). For example, the sequence in the top strand below, read from 5’ to 3’, is
AGCTT...CTG, while the bottom strand, called its reverse complement, also read 5’
to 3’, is CAGTC...GCT.
5'- A G C T T T T C A T T C T G A C T G - 3'
3'- T C G A A A A G T A A G A C T G A C - 5'
Reflection 6.49 When DNA sequences are stored in text files, only the sequence
on one strand is stored. Why?
RNA (ribonucleic acid) is a similar molecule, but each nucleotide contains a different
sugar (ribose instead of deoxyribose), and a base named uracil (U) takes the place
of thymine (T). RNA molecules are also single-stranded. As a result, RNA tends
to “fold” when complementary bases on the same strand pair with each other. This
folded structure forms a unique three-dimensional shape that plays a significant role
in the molecule’s function.
Some regions of a DNA molecule are called genes because they encode genetic
information. A gene is transcribed by an enzyme called RNA polymerase to produce
a complementary molecule of RNA. For example, the DNA sequence 5’-GACTGAT-3’
would be transcribed into the RNA sequence 3’-CUGACUA-5’. If the RNA is a
messenger RNA (mRNA), then it contains a coding region that will ultimately be
used to build a protein. Other RNA products of transcription, called RNA genes,
are not translated into proteins, and are often instead involved in regulating whether
genes are transcribed or translated into proteins. Upstream from the transcribed
U C A G
U UUU Phe F UCU Ser S UAU Tyr Y UGU Cys C U
UUC Phe F UCC Ser S UAC Tyr Y UGC Cys C C
UUA Leu L UCA Ser S UAA Stop * UGA Stop * A
UUG Leu L UCG Ser S UAG Stop * UGG Trp W G
C CUU Leu L CCU Pro P CAU His H CGU Arg R U
CUC Leu L CCC Pro P CAC His H CGC Arg R C
CUA Leu L CCA Pro P CAA Gln Q CGA Arg R A
CUG Leu L CCG Pro P CAG Gln Q CGG Arg R G
A AUU Ile I ACU Thr T AAU Asn N AGU Ser S U
AUC Ile I ACC Thr T AAC Asn N AGC Ser S C
AUA Ile I ACA Thr T AAA Lys K AGA Arg R A
AUG Met M ACG Thr T AAG Lys K AGG Arg R G
G GUU Val V GCU Ala A GAU Asp D GGU Gly G U
GUC Val V GCC Ala A GAC Asp D GGC Gly G C
GUA Val V GCA Ala A GAA Glu E GGA Gly G A
GUG Val V GCG Ala A GAG Glu E GGG Gly G G
Table 6.1 The standard genetic code that translates between codons and amino acids.
For each codon, both the three letter code and the single letter code are shown for
the corresponding amino acid.
Reflection 6.50 What amino acid sequence is represented by the mRNA sequence
CAU UUU GAG?
Reflection 6.51 Notice that, in the genetic code (Table 6.1), most amino acids are
represented by several similar codons. Keeping in mind that nucleotides can mutate
over time, what evolutionary advantage might this hold?
The complete sequence of an organism’s DNA is called its genome. The size of a
genome can range from 10⁵ base pairs (bp) in the simplest bacteria to 1.5 × 10¹¹
bp in some plants. The human genome contains about 3.2 × 10⁹ (3.2 billion) bp.
Interestingly, the size of an organism’s genome does not necessarily correspond to
its complexity; plants have some of the largest genomes, far exceeding the size of
the human genome.
The subfield of biology that studies the structure and function of genomes is
called genomics. To better understand a genome, genomicists ask questions such as:
• What is the frequency of each base in the genome? What is the frequency of
each codon? For each amino acid, is there a bias toward particular codons?
• How similar are two sequences? Sequence comparison can be used to determine
whether two genes have a shared ancestry, called homology. Sequence com-
parison between homologous sequences can also give clues to an unidentified
gene’s function.
• What genes are regulated together? Identifying groups of genes that are
regulated in the same way can lead to insights into genes’ functions, especially
those related to disease.
Because genomes are so large, questions like these can only be answered computa-
tionally. In the next few pages, we will look at how the methods and techniques from
previous sections can be used to answer some of the questions above. We leave many
additional examples as exercises. We will begin by working with small sequences;
later, we will discuss how to read some longer sequences from files and the web.
Or, we could use a for loop like the count1 function in the previous section:
def gcContent(dna):
    """Compute the GC content of a DNA sequence.

    Parameter:
        dna: a string representing a DNA sequence
    """
    dna = dna.lower()
    count = 0
    for nt in dna:              # nt is short for "nucleotide"
        if nt in 'cg':
            count = count + 1
    return count / len(dna)
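For example, using the short sequence from earlier in this section:

>>> gcContent('agcttttcattctgactg')
0.3888888888888889

Seven of its eighteen bases are c or g.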
Reflection 6.52 Why do we convert dna to lower case at the beginning of the
function?
Because DNA sequences can be in either upper or lower case, we need to account
for both possibilities when we write a function. But rather than check for both
possibilities in our functions, it is easier to just convert the parameter to either
upper or lower case at the beginning.
To gather statistics about the codons in genes, we need to count the number
of non-overlapping occurrences of any particular codon. This is very similar to the
count function from the previous section, except that we need to increment the
index variable by three in each step:
def countCodon(dna, codon):
    """Count the non-overlapping occurrences of a codon in a DNA sequence.

    Parameters:
        dna: a string representing a DNA sequence
        codon: a string representing a codon
    """
    count = 0
    for index in range(0, len(dna) - 2, 3):
        if dna[index:index + 3] == codon:
            count = count + 1
    return count
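For example, with a short made-up sequence:

>>> countCodon('atggtcatgtag', 'atg')
2

The non-overlapping codons here are atg, gtc, atg, and tag.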
Transforming sequences
When DNA sequences are stored in databases, only the sequence of one strand is
stored. But genes and other features may exist on either strand, so we need to be
able to derive the reverse complement sequence from the original. To first compute
the complement of a sequence, we can use a string accumulator, but append the
complement of each base in each iteration.
def complement(dna):
    """Return the complement of a DNA sequence.

    Parameter:
        dna: a string representing a DNA sequence
    """
    dna = dna.lower()
    compdna = ''
    for nt in dna:
        if nt == 'a':
            compdna = compdna + 't'
        elif nt == 'c':
            compdna = compdna + 'g'
        elif nt == 'g':
            compdna = compdna + 'c'
        else:
            compdna = compdna + 'a'
    return compdna
We can iterate over the indices in reverse order by using a range that starts at
len(dna) - 1 and goes down to, but not including, −1 using a step of −1. A function
that uses this technique follows.
def reverse(dna):
    """Return the reverse of a DNA sequence.

    Parameter:
        dna: a string representing a DNA sequence
    """
    revdna = ''
    for index in range(len(dna) - 1, -1, -1):
        revdna = revdna + dna[index]
    return revdna
A more elegant solution simply iterates over the characters in dna in the normal
order, but prepends each character to the revdna string.
def reverse(dna):
    """ (docstring omitted) """
    revdna = ''
    for nt in dna:
        revdna = nt + revdna
    return revdna
To see how this works, suppose dna was ’agct’. The table below illustrates how
each value changes in each iteration of the loop.
Iteration    nt     new value of revdna
    1       'a'     'a' + ''    → 'a'
    2       'g'     'g' + 'a'   → 'ga'
    3       'c'     'c' + 'ga'  → 'cga'
    4       't'     't' + 'cga' → 'tcga'
Finally, we can put these two pieces together to create a function for reverse
complement:
def reverseComplement(dna):
    """Return the reverse complement of a DNA sequence.

    Parameter:
        dna: a string representing a DNA sequence
    """
    return reverse(complement(dna))
This function first computes the complement of dna, then calls reverse on the
result. We can now, for example, use the reverseComplement function to count the
frequency of a particular codon on both strands of a DNA sequence:
countForward = countCodon(dna, 'atg')
countBackward = countCodon(reverseComplement(dna), 'atg')
Comparing sequences
Measuring the similarity between DNA sequences, called comparative genomics, has
become an important area in modern biology. Comparing the genomes of different
species can provide fundamental insights into evolutionary relationships. Biologists
can also discover the function of an unknown gene by comparing it to genes with
known functions in evolutionarily related species.
Dot plots are used heavily in computational genomics to provide visual represen-
tations of how similar large DNA and amino acid sequences are. Consider the following
two sequences. The dotplot1 function in Section 6.5 would show pairwise matches
between the nucleotides in black only.
Sequence 1: a g c t t t g c a t t c t g a c a g
Sequence 2: a c c t t t t a a t t c t g t a c a g
But notice that the last four bases in the two sequences are actually the same; if
you insert a gap in the first sequence, above the last T in the second sequence, then
the number of differing bases drops from eight to four, as illustrated below.
Sequence 1: a g c t t t g c a t t c t g - a c a g
Sequence 2: a c c t t t t a a t t c t g t a c a g
Evolutionarily, if the DNA of the closest common ancestor of these two species
contained a T in the location of the gap, then we interpret the gap as a deletion
in the first sequence. Or, if the closest common ancestor did not have this T, then
we interpret the gap as an insertion in the second sequence. These insertions and
deletions, collectively called indels, are common; therefore, sequence comparison
algorithms must take them into account. So, just as the first dotplot1 function
did not adequately represent the similarity between texts, neither does it for DNA
sequences. Figure 6.10 shows a complete dot plot, using the dotplot function from
Section 6.5, for the sequences above.
As we saw in the previous section, dot plots tend to be more useful when we
reduce the “noise” by instead comparing subsequences of a given length, say ℓ,
within a sliding window. Because there are 4^ℓ different possible subsequences with
length ℓ, fewer matches are likely. Dot plots are also more useful when comparing
sequences of amino acids. Since there are 20 different amino acids, we tend to see
less noise in the plots. For example, Figure 6.11 shows a dot plot for the following
hypothetical small proteins. Each letter corresponds to a different amino acid. (See
Table 6.1 for the meanings of the letters if you are interested.)
seq1 = 'PDAQNPDMSFFKMLFLPESARWIQRTHGKNS'
seq2 = 'PDAQNPDMPLFLMKFFSESARWIQRTHGKNS'
Notice how the plot shows an inversion in one sequence compared to the other,
highlighted in red above.
def readFASTA(filename):
"""Read a FASTA file containing a single sequence and return
the sequence as a string.
Parameter:
filename: a string representing the name of a FASTA file
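    """
    # Body sketch consistent with the description below; inputFile is a
    # placeholder name for the file object.
    inputFile = open(filename, 'r', encoding = 'utf-8')
    header = inputFile.readline()       # skip past the FASTA header line
    dna = ''
    for line in inputFile:
        dna = dna + line[:-1]           # append the line minus its newline
    inputFile.close()
    return dna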
After opening the FASTA file, we read the header line with the readline function.
We will not need this header, so this readline just serves to move the file pointer
past the first line, to the start of the actual sequence. To read the rest of the file,
we iterate over it, line by line. In each iteration, we append the line, without the
newline character at the end, to a growing string named dna. There is a link on the
book website to the FASTA file above so that you can try out this function.
We can also directly access FASTA files by using a URL that sends a query to
NCBI. The URL below submits a request to the NCBI web site to download the
FASTA file for the first segment of the Burmese python genome (with accession
number AEQU02000001).
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore
&id=AEQU02000001&rettype=fasta&retmode=text
import urllib.request as web    # needed for web.urlopen below

def getFASTA(id):
    """Fetch the DNA sequence with the given id from NCBI and return
    it as a string.

    Parameter:
        id: the identifier of a DNA sequence
    """
    prefix1 = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    prefix2 = '?db=nuccore&id='
    suffix = '&rettype=fasta&retmode=text'
    url = prefix1 + prefix2 + id + suffix
    readFile = web.urlopen(url)
    header = readFile.readline()
    dna = ''
    for line in readFile:
        line = line[:-1]
        dna = dna + line.decode('utf-8')
    readFile.close()
    return dna
The function first creates a string containing the URL for the accession number
passed in parameter id. The first common parts of the URL are assigned to the
prefix variables, and the last part of the URL is assigned to suffix. We construct
the custom URL by concatenating these pieces with the accession number. We then
open the URL and use the code from the previous readFASTA function to extract
the DNA sequence and return it as a string.
Reflection 6.55 Use the getFASTA function to read the first segment of the
Burmese python genome (accession number AEQU02000001).
We have just barely touched on the growing field of computational genomics. The
following exercises explore some additional problems and provide many opportunities
to practice your algorithm design and programming skills.
Exercises
6.7.1. Write a function
countACG(dna)
that returns the fraction of nucleotides in the given DNA sequence that are not T.
Use a for loop that iterates over the characters in the string.
6.7.2. Repeat the previous exercise, but use a for loop that iterates over the indices of
the string instead.
6.7.3. Write a function
printCodons(dna)
that prints the codons, starting at the left end, in the string dna. For example,
printCodons(’ggtacactgt’) would print
ggt
aca
ctg
6.7.7. Most vertebrates have much lower density of CG pairs (called CG dinucleotides)
than would be expected by chance. However, they often have relatively high
concentrations upstream from genes (coding regions). For this reason, finding these
so-called “CpG islands” is often a good way to find putative sites of genes. (The “p”
between C and G denotes the phosphodiester bond between the C and G, versus
a hydrogen bond across two complementary strands.) Without using the built-in
count method, write a function
CpG(dna)
that returns the fraction of dinucleotides that are cg. For example, if dna were
atcgttcg, then the function should return 0.5 because half of the sequence is
composed of CG dinucleotides.
6.7.8. A microsatellite or simple sequence repeat (SSR) is a repeat of a short sequence of
DNA. The repeated sequence is typically 2–4 bases in length and can be repeated 10–
100 times or more. For example, cacacacaca is a short SSR of ca, a very common
repeat in humans. Microsatellites are very common in the DNA of most organisms,
but their lengths are highly variable within populations because of replication
errors resulting from “slippage” during copying. Comparing the distribution of
length variants within and between populations can be used to determine genetic
relationships and learn about evolutionary processes.
Write a function
ssr(dna, repeat)
that returns the length (number of repeats) of the first SSR in dna that repeats
the parameter repeat. If repeat is not found, return 0. Use the find method of
the string class to find the first instance of repeat in dna.
6.7.9. Write another version of the ssr function that finds the length of the longest SSR
in dna that repeats the parameter repeat. Your function should repeatedly call
the ssr function in the previous problem. You will probably want to use a while
loop.
6.7.10. Write a third version of the ssr function that uses the function in the previous
problem to find the longest SSR of any dinucleotide (sequence with length 2) in
dna. Your function should return the longest repeated dinucleotide.
6.7.11. Write a function
palindrome(dna)
that returns True if dna is the same as its reverse complement, and False otherwise.
(Note that this is different from the standard definition of palindrome.) For example,
gaattc is a palindrome because it and its reverse complement are the same. These
sequences turn out to be very important because certain restriction enzymes target
specific palindromic sequences and cut the DNA molecule at that location. For
example, the EcoR1 restriction enzyme cuts DNA of the bacteria Escherichia coli
at the sequence gaattc in the following way:
5’--gaattc--3’ 5’--g \ aattc--3’
|||||| ⇒ | \ |
3’--cttaag--5’ 3’--cttaa \ g--5’
Write a function
fix(dna)
that returns a DNA string in which each ambiguous symbol is replaced with one
of the possible bases it represents, each with equal probability. For example, if an
R exists in dna, it should be replaced with either an A with probability 1/2 or a G
with probability 1/2.
6.7.16. Write a function
mark(dna)
that returns a new DNA string in which every start codon ATG is replaced with
>>>, and every stop codon (TAA, TAG, or TGA) is replaced with <<<. Your loop
should increment by 3 in each iteration so that you are only considering non-
overlapping codons. For example, mark(’ttgatggagcattagaag’) should return
the string ’ttg>>>gagcat<<<aag’.
6.7.17. The accession number for the hepatitis A virus is NC_001489. Write a program
that uses the getFASTA function to get the DNA sequence for hepatitis A, and
then the gcContent function from this section to find the GC content of hepatitis
A.
6.7.18. The DNA of the hepatitis C virus encodes a single long protein that is eventually
processed by enzymes called proteases to produce ten smaller proteins. There are
seven different types of the virus. The accession numbers for the proteins produced
by type 1 and type 2 are NP_671491 and YP_001469630. Write a program that uses
the getFASTA function and your dot plot function from Exercise 6.6.5 to produce
a dot plot comparing these two proteins using a window of size 4.
6.7.19. Hox genes control key aspects of embryonic development. Early in development,
the embryo consists of several segments that will eventually become the main
axis of the head-to-tail body plan. The Hox genes dictate what body part each
segment will become. One particular Hox gene, called Hox A1, seems to control
development of the hindbrain. Write a program that uses the getFASTA function
and your dot plot function from Exercise 6.6.5 to produce a dot plot comparing
the human and mouse Hox A1 genes, using a window of size 8. The human Hox A1
gene has accession number U10421.1 and the mouse Hox A1 gene has accession
number NM_010449.4.
6.8 SUMMARY
Text is stored as a sequence of bytes, which we can read into one or more strings.
The most fundamental string algorithms have one of the following structures:
for character in text:
    # process character

for index in range(len(text)):
    # process text[index]
In the first case, consecutive characters in the string are assigned to the for
loop index variable character. In the body of the loop, each character can then
be examined individually. In the second case, consecutive integers from the list
[0, 1, 2, ..., len(text) - 1], which are precisely the indices of the characters
in the string, are assigned to the for loop index variable index. In this case, the
algorithm has more information because, not only can it access the character at
text[index], it also knows where that character resides in the string. The first
choice tends to be more elegant, but the second choice is necessary when the
algorithm needs to know the index of each character, or if it needs to process slices
of the string, which can only be accessed with indices.
We called one special case of these loops a string accumulator :
newText = ''
for character in text:
    newText = newText + character    # or some new character derived from it
Like an integer accumulator and a list accumulator, a string accumulator builds its
result cumulatively in each iteration of the loop. Because strings are immutable, a
string accumulator must create a new string in each iteration that is composed of
the old string with a new character concatenated.
Algorithms like these that perform one pass over their string parameters and
execute a constant number of elementary steps per character are called linear-time
algorithms because their number of elementary steps is proportional to the length
of the input string.
In some cases, we need to compare every character in one string to every character
in a second string, so we need a nested loop like the following:
for index1 in range(len(text1)):
    for index2 in range(len(text2)):
        # process text1[index1] and text2[index2]
If both strings have length n, then a nested loop like this constitutes a quadratic-time
algorithm with time complexity n² (as long as the body of the loop is constant-time)
because every one of n characters in the first string is compared to every one of n
characters in the second string. We will see more loops like this in later chapters.
6.10 PROJECTS
Project 6.1 Polarized politics
The legislative branch of the United States government, with its two-party system,
goes through phases in which the two parties work together productively to pass
laws, and other phases in which partisan lawmaking is closer to the norm. In this
project, we will use data available on the website of the Clerk of the U.S. House
of Representatives (http://clerk.house.gov/legislative/legvotes.aspx) to
analyze the voting behavior and polarization of the U.S. House over time.
The results of each roll call vote are posted in an XML file at a URL of the form
http://clerk.house.gov/evs/<year>/roll<number>.xml. The placeholder <year> is
replaced with the year of the vote and the placeholder <number> is replaced with
the roll call number. For example, the results of roll call
vote 194 from 2010 (the final vote on the Affordable Care Act) are available at
http://clerk.house.gov/evs/2010/roll194.xml
If the roll call number has fewer than three digits, the number is filled with zeros.
For example, the results of roll call vote 11 from 2010 are available at
http://clerk.house.gov/evs/2010/roll011.xml
import urllib.request as web    # needed for web.urlopen below

def main():
    url = 'http://clerk.house.gov/evs/2010/roll194.xml'
    webpage = web.urlopen(url)
    for line in webpage:
        line = line.decode('utf-8')
        print(line.rstrip())
    webpage.close()

main()
Write a function (the partyLine function referred to below) that determines whether House roll call vote number from the given year was a party
party line vote. Use only the recorded vote elements, and no other parts of the XML
files. Do not use any built-in string methods. Be aware that some votes are recorded
as Yea/Nay and some are recorded as Aye/No. Do not count votes that are recorded
in any other way (e.g., “Present” or “Not Voting”), even toward the total numbers
of votes. Test your function thoroughly before moving to the next step.
Question 6.1.1 What type of data should this function return? (Think about making
your life easier for the next function.)
Question 6.1.2 Was the vote on the Affordable Care Act a party line vote?
Question 6.1.3 Choose another vote that you care about. Was it a party line vote?
Next, write a function that uses the partyLine function to return the fraction of votes that were party
line votes during the given year. The parameter maxNumber is the number of the
last roll call vote of the year.
Finally, write a function
plotPartyLine()
that plots the fractions of party line votes for the last 20 years. To keep things
simple, you may assume that there were 450 roll call votes each year. If you would
prefer to count all of the roll call votes, here are the numbers for each of the last 20
years:
Year Number Year Number
2013 641 2003 677
2012 659 2002 484
2011 949 2001 512
2010 664 2000 603
2009 991 1999 611
2008 690 1998 547
2007 1186 1997 640
2006 543 1996 455
2005 671 1995 885
2004 544 1994 507
Question 6.1.4 What fraction of votes from each of years 1994 to 2013 went along
party lines?
Note that collecting this data may take a long time, so make sure that your functions
are correct for a single year first.
Question 6.1.5 Describe your plot. Has American politics become more polarized
over the last 20 years?
Question 6.1.6 Many news outlets report on the issue of polarization. Find a news
story about this topic online, and compare your results to the story. It might be
helpful to think about the motivations of news outlets.
Not every ORF corresponds to a gene. Since there are 4³ = 64 possible codons and 3
different stop codons, one is likely to find a stop codon approximately every 64/3 ≈ 21
codons in a random stretch of DNA. Therefore, we are really only interested in
ORFs that have lengths substantially longer than this. Since such long ORFs are
unlikely to occur by chance, there is good reason to believe that they may represent
a coding region of a gene.
Not all open reading frames on a strand of DNA will be aligned with the left
(5’) end of the sequence. For instance, in the example above, if we only started
searching for open reading frames in the codons aligned with the left end — ggc,
gga, tga, etc. — we would have missed the boxed open reading frame. To find all
open reading frames in a strand of DNA, we must look at the codon sequences with
offsets 0, 1, and 2 from the left end. (See Exercise 6.7.6 for another example.) The
codon sequences with these offsets are known as the three forward reading frames.
In the example above, the three forward reading frames begin as follows:
Because DNA is double stranded, with a reverse complement sequence on the other
strand, there are also three reverse reading frames. In this strand of DNA, the
reverse reading frames would be:
Because genomic sequences are so long, finding ORFs must be performed computa-
tionally. For example, consider the following sequence of 1,260 nucleotides from an
Escherichia coli (or E. coli ) genome, representing only about 0.03% of the complete
genome. (This number represents about 0.00005% of the human genome.) Can you
find an open reading frame in this sequence?
agcttttcattctgactgcaacgggcaatatgtctctgtgtggattaaaaaaagagtgtctgatagcagc
ttctgaactggttacctgccgtgagtaaattaaaattttattgacttaggtcactaaatactttaaccaa
tataggcatagcgcacagacagataaaaattacagagtacacaacatccatgaaacgcattagcaccacc
attaccaccaccatcaccattaccacaggtaacggtgcgggctgacgcgtacaggaaacacagaaaaaag
cccgcacctgacagtgcgggcttttttttcgaccaaaggtaacgaggtaacaaccatgcgagtgttgaag
ttcggcggtacatcagtggcaaatgcagaacgttttctgcgggttgccgatattctggaaagcaatgcca
ggcaggggcaggtggccaccgtcctctctgcccccgccaaaatcaccaaccatctggtagcgatgattga
aaaaaccattagcggtcaggatgctttacccaatatcagcgatgccgaacgtatttttgccgaacttctg
acgggactcgccgccgcccagccgggatttccgctggcacaattgaaaactttcgtcgaccaggaatttg
cccaaataaaacatgtcctgcatggcatcagtttgttggggcagtgcccggatagcatcaacgctgcgct
gatttgccgtggcgagaaaatgtcgatcgccattatggccggcgtgttagaagcgcgtggtcacaacgtt
accgttatcgatccggtcgaaaaactgctggcagtgggtcattacctcgaatctaccgttgatattgctg
aatccacccgccgtattgcggcaagccgcattccggctgaccacatggtgctgatggctggtttcactgc
cggtaatgaaaaaggcgagctggtggttctgggacgcaacggttccgactactccgctgcggtgctggcg
gcctgtttacgcgccgattgttgcgagatctggacggatgttgacggtgtttatacctgcgatccgcgtc
aggtgcccgatgcgaggttgttgaagtcgatgtcctatcaggaagcgatggagctttcttacttcggcgc
taaagttcttcacccccgcaccattacccccatcgcccagttccagatcccttgcctgattaaaaatacc
ggaaatccccaagcaccaggtacgctcattggtgccagccgtgatgaagacgaattaccggtcaagggca
This is clearly not a job for human beings. But, if you spent enough time, you
might spot an ORF in reading frame 2 between positions 29 and 97, an ORF in
reading frame 0 between positions 189 and 254, and another ORF in reading frame
0 between positions 915 and 1073 (to name just a few). These are highlighted in red
below.
agcttttcattctgactgcaacgggcaatATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAattaaaattttattgacttaggtcactaaatactttaaccaa
tataggcatagcgcacagacagataaaaattacagagtacacaacatccATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGAcgcgtacaggaaacacagaaaaaag
cccgcacctgacagtgcgggcttttttttcgaccaaaggtaacgaggtaacaaccatgcgagtgttgaag
ttcggcggtacatcagtggcaaatgcagaacgttttctgcgggttgccgatattctggaaagcaatgcca
ggcaggggcaggtggccaccgtcctctctgcccccgccaaaatcaccaaccatctggtagcgatgattga
aaaaaccattagcggtcaggatgctttacccaatatcagcgatgccgaacgtatttttgccgaacttctg
acgggactcgccgccgcccagccgggatttccgctggcacaattgaaaactttcgtcgaccaggaatttg
cccaaataaaacatgtcctgcatggcatcagtttgttggggcagtgcccggatagcatcaacgctgcgct
gatttgccgtggcgagaaaatgtcgatcgccattatggccggcgtgttagaagcgcgtggtcacaacgtt
accgttatcgatccggtcgaaaaactgctggcagtgggtcattacctcgaatctaccgttgatattgctg
aatccacccgccgtattgcggcaagccgcattccggctgaccacatggtgctgatggctggtttcactgc
cggtaATGAAAAAGGCGAGCTGGTGGTTCTGGGACGCAACGGTTCCGACTACTCCGCTGCGGTGCTGGCG
GCCTGTTTACGCGCCGATTGTTGCGAGATCTGGACGGATGTTGACGGTGTTTATACCTGCGATCCGCGTC
AGGTGCCCGATGCGAGGTTGTTGAagtcgatgtcctatcaggaagcgatggagctttcttacttcggcgc
taaagttcttcacccccgcaccattacccccatcgcccagttccagatcccttgcctgattaaaaatacc
ggaaatccccaagcaccaggtacgctcattggtgccagccgtgatgaagacgaattaccggtcaagggca
To help you get started and organize your project, you can download a “skeleton” of
the program from the book web page. In the program, the viewer function sets up
the turtle graphics window, writes the DNA sequence at the bottom (one character
per x coordinate), and then calls the two functions that you will write. The main
function reads a long DNA sequence from a file and into a string variable, and then
calls viewer.
To display the open reading frames, you will write the function
orf1(dna, rf, tortoise)
to draw colored bars representing open reading frames in one forward reading frame
with offset rf (0, 1, or 2) of string dna using the turtle named tortoise. This
function will be called three times with different values of rf to draw the three
reading frames. As explained in the skeleton program, the drawing function is already
written; you just have to change colors at the appropriate times before calling it.
Hint: Use a Boolean variable inORF in your for loop to keep track of whether you
are currently in an ORF.
Also on the book site are files containing various size prefixes of the genome of a
particular strain of E. coli. Start with the smaller files.
Question 6.2.1 How long do you think an open reading frame should be for us to
consider it a likely gene?
Question 6.2.2 Where are the likely genes in the first 10,000 bases of the E. coli
genome?
Part 2: GC content
The GC content of a particular stretch of DNA is the ratio of the number of C and
G bases to the total number of bases. For example, in the following sequence
TCTACGACGT
the GC content is 0.5 because 5/10 of the bases are either C or G. Because the GC
content is usually higher around coding sequences, this can also give clues about
gene location. (This is actually more true in vertebrates than in bacteria like E.
coli.) In long sequences, GC content can be measured over “windows” of a fixed size,
as in the example we discussed back in Section 1.3. For example, in the tiny example
above, if we measure GC content over windows of size 4, the resulting ratios are
TCTACGACGT
TCTA 0.25
CTAC 0.50
TACG 0.50
ACGA 0.50
CGAC 0.75
GACG 0.75
ACGT 0.50
You will also write a function that plots the GC frequency of the string dna over windows of size window using
the turtle named tortoise. Plot this in the same window as the ORF bars. As
explained in the skeleton code, the plotting function is already written; you just need
to compute the correct GC fractions. As we discussed in Section 1.3, you should not
need to count the GC content anew for each window. Once you have counted the G
and C bases for the first window, you can incrementally modify this count for the
subsequent windows. The final display should look something like Figure 6.12.
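To illustrate the incremental idea on its own (apart from the skeleton's plotting code), the sketch below computes one GC fraction per window, sliding the window one base at a time; the function name gcFractions is an illustrative assumption.

def gcFractions(dna, window):
    """Return a list of GC fractions, one for each window of the given size."""
    dna = dna.upper()
    gcCount = 0
    for base in dna[:window]:                  # count the first window directly
        if base in 'CG':
            gcCount = gcCount + 1
    fractions = [gcCount / window]
    for index in range(1, len(dna) - window + 1):
        if dna[index - 1] in 'CG':             # base that just left the window
            gcCount = gcCount - 1
        if dna[index + window - 1] in 'CG':    # base that just entered the window
            gcCount = gcCount + 1
        fractions.append(gcCount / window)
    return fractions

For the tiny example above, gcFractions('TCTACGACGT', 4) returns [0.25, 0.5, 0.5, 0.5, 0.75, 0.75, 0.5], matching the table of window ratios.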
CHAPTER 7
Designing programs
From then on, when anything went wrong with a computer, we said it had bugs in it.
Grace Murray Hopper
Every hour of planning saves about a day of wasted time and effort.
Steve McConnell
Software Project Survival Guide (1998)
It should go without saying that we want our programs to be correct. That is,
we want our algorithms to produce the correct output for every possible input,
and we want our implementations to be faithful to the design of our algorithms.
Unfortunately, however, in practice, perfectly correct programs are virtually non-
existent. Due to their complexity, virtually every significant piece of software has
errors, or “bugs” in it. But there are techniques that we can use to increase the
likelihood that our functions and programs are correct. First, we can make sure that
we thoroughly understand the problem we are trying to solve and spend quality
time designing a solution, well before we start typing any code. Second, we can
think carefully about the requirements on our inputs and function parameters, and
enforce these requirements to head off problems down the road. And third, we can
adopt a thorough testing strategy to make sure that our programs work correctly. In
this chapter, we will introduce the basics of these three practices, and encourage you
to follow them hereafter. Following these more methodical practices can actually
save time in the long run by preventing errors and more easily discovering those
that do creep into your programs.
Figure 7.1 The design process is cyclic: understand the problem; design an algorithm, using top-down design and critically evaluating each part; implement your algorithm as a program, testing each function as you implement it.
1. Understand the problem. What is the unknown [output]? What are the data
[inputs]? What is the condition [linking the input to the output]?
Figure 7.2 A generic top-down design: main calls function 1 to get the input, function 2 to compute the output from the input, and function 3 to display the output.
The windows over the text will be represented on the x-axis, perhaps by the starting
index of each window. The y-axis will correspond to the number of times the word
occurs in each window.
Design an algorithm
Once we think we understand the problem to be solved, the next step is to design
an algorithm. An important part of this process is to identify smaller subproblems
that can be solved individually, each with its own function. Once written, these
functions can be used as functional abstractions to solve the overall problem. We
saw an example of this in the flower-drawing program illustrated in Figure 3.7. This
technique is called top-down design because it involves starting from the problem to
be solved, and then breaking this problem into smaller pieces. Each of these pieces
might also be broken down into yet smaller pieces, until we arrive at subproblems
that are relatively straightforward to solve. A generic picture of a top-down design
is shown in Figure 7.2. The generic program is broken into three main subproblems
(i.e., functions). First, a function gets the input. Second, a function computes the
output from the input. Third, another function displays the output. The dotted
lines represent the flow of information through return values and parameters. The
function that gets the input returns it to the main program, which then passes it
into the function that computes the output. This function then returns the output
to the main program, which passes it into the function that displays it. In Python,
this generic program looks like the following:
def main():
input = getInput()
output = compute(input)
displayOutput(output)
main()
Moving on from this generic structure, the next step is to break these three sub-
problems into smaller subproblems, as needed, depending on the particular problem.
Identifying these subproblems/functions can be as much an art as a science, but
here are a few guidelines to keep in mind:
Figure 7.3 A top-down design for the word frequency problem: main calls functions that get the text from a file, get the word, get the window length, count the word in a window, plot the window frequencies, and print the average.
2. Functions should be written for subproblems that may be called upon fre-
quently, perhaps with different arguments.
4. The main function should be short, generally serving only to set up the program
and call other functions that carry out the work.
You may find that, during this process, new questions arise about the problem,
sending you briefly back to step one. This is fine; it is better to discover this
immediately and not later in the process. The problem solving process, like the
writing process, is inherently cyclic and fluid, as illustrated in Figure 7.1.
Reflection 7.2 What steps should an algorithm follow to solve our problem?
First, we can specialize the generic design by specifying what each of the three main
subproblems should do.
1. Get inputs: prompt for a file name, a word, and a window length. Read the
text from the file.
2. Compute: count the number of times the word occurs in each window of the
text, and compute the average count per window.
3. Display outputs: plot the frequencies and print the average frequency.
Since each of these subproblems has multiple parts, we can break each subproblem
down further:
1. Get inputs.
(a) Get the text.
i. Prompt for and get a file name.
ii. Read the text from the file with that file name.
(b) Prompt for and get a word.
(c) Prompt for and get a window length.
2. Compute a list of word frequencies and the average word frequency.
(a) In each window, count the frequency of the word.
3. Display outputs:
(a) Plot the frequencies.
(b) Print the average frequency.
This outline is represented visually in Figure 7.3. We have broken the input subprob-
lem into three parts, one for each input that we need. The text input part is further
broken down into getting the file name and then reading the text from that file.
Next, in the computation subproblem, we will need to count the words in each of
the windows that we construct, so it will make sense to create another function that
counts the words in each of those windows. Finally, we break the output subproblem
into two parts, one for the plot and one for the average frequency.
Next, before we implement our top-down design as a program, we need to design
an algorithm for each “non-trivial” subproblem. Each of the input and output
subproblems is pretty straightforward, but the main computation subproblem is not.
It is convenient at this stage to write our algorithm semi-formally in what is known as
pseudocode. Pseudocode does not have an exact definition, but is somewhere between
English and a programming language. When we write algorithms in pseudocode,
we need to be guided by our understanding of what can and cannot be done in a
programming language, but not necessarily write out every detail. For example, in
the algorithm that will find the frequency of the word in every window, we know
that we will have to somehow iterate over these windows using a loop, but we do
not necessarily have to decide at this point the precise form of that loop. Here is
one possible algorithm written in a Python-like pseudocode. We have colored it blue
to remind you that it is not an actual Python function.
getWordCounts(text, word, window_length)
total_word_count = 0
total_window_count = 0
word_count_list = [ ]
for each non-overlapping window:
increment total_window_count
count = instances of word in window
add count to total_word_count
append count to word_count_list
return (total_word_count / total_window_count) and word_count_list
In the algorithm, we are maintaining three main values. The total number of
instances of the word that we have found (total_word_count) and the total number
of windows that we have processed (total_window_count) are needed to compute
the average count per window. The list of word counts (word_count_list), which
stores one word count per window, is needed for the plot. The loop iterates over
all of the windows and, for each one, finds the count of the word in that window
and uses it to update the total word count and the word count list. At the end, we
return the average word count and the list of word counts.
Although analyzing your algorithm is the last step in Polya’s four-step process,
we really should not wait until then to consider the efficiency, clarity, and correctness
of our algorithms. At each stage, we need to critically examine our work and look
for ways to improve it. For example, if our windows overlapped, we could make the
algorithm above more efficient by basing the count in each window on the count in
the previous window, as we did in Section 1.3. Notice that this improvement could
be made before we implement any code.
def getInputs():
    """Prompt for the name of a text file, a word to analyze, and a
    window length. Read and return the text, word, and window length.

    Parameters: none

    Return value: a text, word, and window length
    """
    pass
In Section 7.2, we will discuss a more formal method for thinking about and enforcing
requirements for your parameters and return values. Requirements for parameters
are called preconditions and requirements for return values (and side effects) are
called postconditions. For example, in the getWordCounts function, we are assuming
that text and word are strings, and that windowLength is a positive integer, but
we do not check whether they actually are. If we mistakenly pass a negative number
in for windowLength, an error is likely to ensue. A precondition would state formally
that windowLength must be a positive integer and make sure that the function does
not proceed if it is not.
Once your functions exist, you can begin to implement them, one at a time.
Comment your code as you go. While you are working on a function, test it often
by calling it from main and running the program. By doing so, you can ensure that
you always have working code that is free of syntax errors. If you are running your
program often, then finding syntax errors when they do crop up will be much easier
because they must exist in the last few lines that you wrote.
Once you are pretty sure that a function works, you can initiate a more formal
testing process. This is the subject of Section 7.3. You must make sure that each
function is working properly in isolation before moving on to the next one. Once
you are sure that it works, a function becomes a functional abstraction that you
don’t need to think about any more, making your job easier! Once each of your
functions is working properly, you can assemble your complete program.
For example, we might start by implementing the getInputs function:
def getInputs():
    """ (docstring omitted) """
Now we need to test this function on its own before continuing to the next one.
Call the function from main and just print the results to make sure it is working
correctly.
def main():
    testText, testWord, testLength = getInputs()
    print(testText, testWord, testLength)

main()
If there were syntax errors or something did not work the way you expected, it
is important to fix it now. This process continues for each of the remaining three
functions. Then the final main function ties them all together:
def main():
    text, word, windowLength = getInputs()
    average, wordCounts = getWordCounts(text, word, windowLength)
    plotWordCounts(wordCounts)
    print(word, 'occurs on average', average, 'times per window.')
You can find a finished version of the program on the book web site. An example of
a plot generated by the program is shown in Figure 7.4.
Figure 7.4 Word frequency scatter plot for the word “Pequod” in the text of Moby
Dick (window length = 10,000).
• Are there chunks of code that should be made into functions to improve
readability?
You may very well decide in this process that revisions are necessary, sending you
back up to an earlier step. Again, this is completely normal and encouraged. No
one ever designs a perfect solution on a first try!
Keep these steps and Figure 7.1 in mind as we move forward. We will gradually
begin to solve more complex problems in which these steps will be crucial to success.
def digit2String(digit):
    """Converts an integer digit to its corresponding string
    representation.

    Parameter:
        digit: an integer in 0, 1, ..., 9

    Return value:
        the corresponding character '0', '1', ..., '9'
        or None if a non-digit is passed as an argument
    """
Recall that this function is meant to convert an integer digit between 0 and 9 to its
associated character. For example, if the input were 4, the output should be the
string ’4’. We noted when we designed this function that inputs less than 0 or
greater than 9 would not work, so we returned None in those cases. The requirement
that digit must be an integer between 0 and 9 is called a precondition. A precondition
for a function is something that must be true when the function is called for it to
behave correctly. Put another way, we do not guarantee that the function will give
the correct output when the precondition is false.
The postcondition for the digit2String function should state that the function
returns a string representation of its input.
The use of preconditions and postconditions is called design by contract because
the precondition and postcondition establish a contract between the function designer
and the function caller. The function caller understands that the precondition must
be met before the function is called and the function designer guarantees that, if
the precondition is met, the postcondition will also be met.
Because they describe the input and output of a function, and therefore how
the function can be used, preconditions and postconditions should be included in
the docstring for a function. If they include information about the parameters and
return value, they can replace those parts of the docstring. For example, a new
docstring for digit2String would look like:
def digit2String(digit):
    """Converts an integer digit to its corresponding string representation.

    Precondition: digit is an integer in 0, 1, ..., 9
    Postcondition: returns the corresponding character, or None otherwise
    """
Checking parameters
The digit2String function partially “enforces” the precondition by returning None
if digit is not between 0 and 9. But there is another part of the precondition that
we do not handle.
Reflection 7.4 What happens if the argument passed in for digit is not an integer?
This error happens in the if statement when the value of digit, in this case
’cookies’, is compared to 0. To avoid this rather opaque error message, we should
make sure that digit is an integer at the beginning of the function. We can do so
using the isinstance function, which takes two arguments: a variable name (or
value) and the name of a class. The function returns True if the variable or value
refers to an object (i.e., instance) of the class, and False otherwise. For example,
>>> isinstance(5, int)
True
>>> isinstance(5, float)
False
>>> s = 'a string'
>>> isinstance(s, str)
True
>>> isinstance(s, float)
False
def digit2String(digit):
    """ (docstring omitted) """
Now let’s consider how we might use this function in a computation. For example,
we might want to use digit2String in the solution to Exercise 6.3.10 to convert a
positive integer named value to its corresponding string representation:
def int2String(value):
    """Convert an integer into its corresponding string representation.

    Parameter:
        value: an integer

    Return value: the string representation of value
    """
    intString = ''
    while value > 0:
        digit = value % 10
        value = value // 10
        intString = digit2String(digit) + intString
    return intString
In each iteration of the while loop, the rightmost digit of value is extracted and
value is divided by ten, the digit is converted to a string using digit2String, and
the result is prepended to the beginning of a growing string representation.
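For example, the call int2String(487) builds its result from right to left:

digit = 487 % 10 = 7    value = 487 // 10 = 48    intString = '7'
digit = 48 % 10 = 8     value = 48 // 10 = 4      intString = '87'
digit = 4 % 10 = 4      value = 4 // 10 = 0       intString = '487'

so the function returns the string '487'.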
Reflection 7.5 What happens if value is a floating point number, such as 123.4?
If value is 123.4, then the first value of digit will be 3.4. Because this is not an
integer, digit2String(digit) will return None. This, in turn, will cause an error
when we try to prepend None to the string:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
Because we clearly stated in the docstring of digit2String that the function would
behave this way, by the principle of design by contract, the int2String function, as
the caller of digit2String, is responsible for catching this possibility and handling
it appropriately. In particular, it needs to check whether digit2String returns
None and, if it does, do something appropriate. For example, we could just abort
and return None:
def int2String(value):
    """ (docstring omitted) """
    intString = ''
    while value > 0:
        digit = value % 10
        value = value // 10
        digitStr = digit2String(digit)
        if digitStr != None:
            intString = digitStr + intString
        else:
            return None
    return intString
Reflection 7.6 What are some alternatives to this function returning None in the
event that value is a floating point number?
Assertions
A stricter alternative is to display an error and abort the entire program if a
parameter violates a precondition. This is precisely what happened above when the
Python interpreter displayed a TypeError and aborted when the + operator tried
to concatenate None and a string. These now-familiar TypeError and ValueError
messages are called exceptions. A built-in function “raises an exception” when
something goes wrong and the function cannot continue. When a function raises an
exception, it does not return normally. Instead, execution of the function ends at
the moment the exception is raised and execution instead continues in part of the
Python interpreter called an exception handler. By default, the exception handler
prints an error message and aborts the entire program.
It is possible for our functions to also raise TypeError and ValueError ex-
ceptions, but the mechanics of how to do so are beyond the scope of this book.
We will instead consider just one particularly simple type of exception called an
AssertionError, which may be raised by an assert statement. An assert state-
ment tests a Boolean condition, and raises an AssertionError if the condition is
false. If the condition is true, the assert statement does nothing. For example,
>>> value = -1
>>> assert value < 0
>>> assert value > 0
AssertionError
>>> assert value > 0, 'value must be positive'
AssertionError: value must be positive
The first assert statement above does nothing because the condition being as-
serted is True. But the condition in the second assert statement is False, so an
AssertionError exception is raised. The third assert statement demonstrates that
we can also include an informative message to accompany an AssertionError.
We can replace the if statement in our digit2String function with assertions
to catch both types of errors that we discussed previously:
def digit2String(digit):
    """ (docstring omitted) """
If digit is not an integer, then the first assert statement will display
AssertionError: digit must be an integer
and abort the program. If digit is an integer (and therefore gets past the first
assertion) but is not in the correct range, the second assert statement will display
AssertionError: digit must be in [0..9]
Reflection 7.7 Call this modified version of digit2String from the int2String
function. What happens now when you call int2String(123.4)?
Note that, since the assert statement aborts the entire program, it should only be
used in circumstances in which there is no other reasonable course of action. But
the definition of “reasonable” usually depends on the circumstances.
For another example, let’s look at the count and getWordCounts functions that
we discussed in the previous section. We did not implement the functions there, so
they are shown below. We have added to the count function a precondition and
postcondition, and an assertion that tests the precondition.
def count(text, target):
    """ (docstring omitted) """
    assert isinstance(text, str) and isinstance(target, str), \
        'text and target must be strings'
    count = 0
    for index in range(len(text) - len(target) + 1):
        if text[index:index + len(target)] == target:
            count = count + 1
    return count
def getWordCounts(text, word, windowLength):
    """Count the number of times word occurs in each window of text.

    Parameters:
        text: a string containing a text
        word: a string
        windowLength: the integer length of the windows

    Return values: average count per window and list of window counts
    """
    wordCount = 0
    windowCount = 0
    wordCounts = []
    for index in range(0, len(text) - windowLength + 1, windowLength):
        window = text[index:index + windowLength]
        windowCount = windowCount + 1
        wordsInWindow = count(window, word)
        wordCount = wordCount + wordsInWindow
        wordCounts.append(wordsInWindow)
    return wordCount / windowCount, wordCounts
Reflection 7.8 What are a suitable precondition and postcondition for the
getWordCounts function? Should the precondition state anything about the values
of text or word? (What happens if text is too short?)
The precondition should state that both text and word are strings, and windowLength
is a positive integer. In addition, special attention must be paid to the contents of
text. If the length of text is smaller than windowLength, then there will be no
iterations of the loop. In this case, windowCount will remain at zero and the division
in the return statement will result in an error. Depending on the circumstances,
we might also want to prescribe some restrictions on the value of word but, in
general, this is not necessary for the function to work correctly. Incorporating these
requirements, a suitable precondition might look like this:
Precondition: text and word are string objects and
windowLength is a positive integer and
text contains at least windowLength characters
Not satisfying any one of the precondition requirements will break the
getWordCounts function, so let’s check that the requirements are met with assert
statements. First, we can check that text and word are strings with the following
assert statement:
assert isinstance(text, str) and isinstance(word, str), \
    'first two arguments must be strings'
To ensure that we avoid dividing by zero, we need to assert that the text is at least
as long as windowLength:
assert len(text) >= windowLength, \
    'window length must be shorter than the text'
Finally, it is not a bad idea to add an extra assertion just before the division at the
end of the function:
assert windowCount > 0, 'no windows were found'
Although our previous assertions should prevent this assertion from ever failing,
being a defensive programmer is always a good idea. Incorporating these changes
into the function looks like this:
def getWordCounts(text, word, windowLength):
    """ (docstring omitted) """
    assert isinstance(text, str) and isinstance(word, str), \
        'first two arguments must be strings'
    assert len(text) >= windowLength, \
        'window length must be shorter than the text'
    wordCount = 0
    windowCount = 0
    wordCounts = []
    for index in range(0, len(text) - windowLength + 1, windowLength):
        window = text[index:index + windowLength]
        windowCount = windowCount + 1
        wordsInWindow = count(window, word)
        wordCount = wordCount + wordsInWindow
        wordCounts.append(wordsInWindow)
    assert windowCount > 0, 'no windows were found'
    return wordCount / windowCount, wordCounts
In the next section, we will discuss a process for more thoroughly testing these
functions once we think they are correct.
Exercises
7.2.1. Write a suitable precondition and postcondition for the int2String function.
7.2.2. Modify the int2String function with if statements so that it correctly satisfies
the precondition and postcondition that you wrote in the previous exercise, and
works correctly for all possible inputs.
7.2.3. Modify the int2String function with assert statements so that it correctly satisfies
the precondition and postcondition that you wrote in Exercise 7.2.1, and works
correctly for all possible inputs.
For each of the following functions from earlier chapters, (a) write a suitable precondition
and postcondition, and (b) add assert statements to enforce your precondition. Be sure to
include an informative error message with each assert statement.
*7.3 TESTING
Once we have carefully defined and implemented a function, we need to test it
thoroughly. You have surely been testing your functions all along, but now that we
have written some more involved ones, it is time to start taking a more deliberate
approach.
It is very important to test each function that we write before we move on
to other functions. The process of writing a program should consist of multiple
iterations of design—implement—test, as we illustrated in Figure 7.1 in Section 7.1.
Once you come up with an overall design and identify what functions you need, you
should follow this design—implement—test process for each function individually. If
you do not follow this advice and instead test everything for the first time when
you think you are done, it will likely be very hard to discern where your errors are.
The only way to really ensure that a function is correct is to either mathematically
prove it is correct or test it with every possible input (i.e., every possible parameter
value). But, since both of these strategies are virtually impossible in all but the
most trivial situations, the best we can do is to test our functions with a variety of
carefully chosen inputs that are representative of the entire range of possibilities. In
large software companies, there are dedicated teams whose sole jobs are to design
and carry out tests of individual functions, the interplay between functions, and the
overall software project.
Unit testing
We will group our tests for each function in what is known as a unit test. The “unit”
in our case will be an individual function, but in general it may be any block of
code with a specific purpose. Each unit test will itself be a function, named test_
followed by the name of the function that we are testing. For example, our unit
test function for the count function will be named test_count. Each unit test
function will contain several individual tests, each of which will assert that calling
the function with a particular set of parameters returns the correct answer.
To illustrate, we will design unit tests for the count and getWordCounts func-
tions from the previous section, reproduced below in a simplified form. We have
omitted docstrings and the messages attached to the assert statements, simplified
getWordCounts slightly by removing the list of window frequencies, and simplified
the main function by removing the plot.
def getInputs():
    ⋮   # code omitted

def count(text, target):
    count = 0
    for index in range(len(text) - len(target) + 1):
        if text[index:index + len(target)] == target:
            count = count + 1
    return count

def getWordCounts(text, word, windowLength):
    wordCount = 0
    windowCount = 0
    for index in range(0, len(text) - windowLength + 1, windowLength):
        window = text[index:index + windowLength]
        windowCount = windowCount + 1
        wordsInWindow = count(window, word)
        wordCount = wordCount + wordsInWindow
    return wordCount / windowCount
def main():
    text, word, windowLength = getInputs()
    average = getWordCounts(text, word, windowLength)
    print(word, 'occurs on average', average, 'times per window.')

main()
We will place all of the unit tests for a program in a separate file to reduce clutter
in our program file. If the program above is saved in a file named wordcount.py,
then the unit test file will be named test_wordcount.py, and have the following
structure:
from wordcount import count, getWordCounts

def test_count():
    """Unit test for count"""

def test_getWordCounts():
    """Unit test for getWordCounts"""

def test():
    test_count()
    test_getWordCounts()

test()
The first line imports the two functions that we are testing from wordcount.py into
the global namespace of the test program. Recall from Section 3.6 that a normal
import statement creates a new namespace containing all of the functions from an
imported module. Instead, the
from <module name> import <function names>
form of the import statement imports functions into the current namespace. The
advantage is that we do not have to preface every call to count and getWordCounts
with the name of the module. In other words, we can call
count('This is fun.', 'is')
instead of
wordcount.count('This is fun.', 'is')
The import statement in the program is followed by the unit test functions and a
main test function that calls all of these unit tests.
Regression testing
When we test our program, we will call test() instead of individual unit test
functions. Besides being convenient, this technique has the advantage that, when we
test new functions, we also re-test previously tested functions. If we make changes to
any one function in a program, we want to both make sure that this change worked
and make sure that we have not inadvertently broken something that was working
earlier. This idea is called regression testing because we are making sure that our
program has not regressed to an earlier error state.
Before we design the actual unit tests, there is one more technical matter to
deal with. Although we have not thought about it this way, the import statement
both imports names from another file and executes the code in that file. Therefore,
when test_wordcount.py imports the functions from wordcount.py into the global
namespace, the code in wordcount.py is executed. This means that main will be
called, and therefore the entire program in wordcount.py will be executed, before
test() is called. But we only want to execute main() when we run wordcount.py
as the main program, not when we import it for testing. To remedy this situation,
we need to place the call to main() in wordcount.py inside the following conditional
statement:
if __name__ == '__main__':
    main()
You may recall from Section 3.6 that __name__ is the name of the current module,
which is assigned the value ’__main__’ if the module is run as the main program.
When wordcount.py is imported into another module instead, __name__ is set to
’wordcount’. So this statement executes main() only if the module is run as the
main program.
def test_count():
    quote = 'Diligence is the mother of good luck.'
    assert count(quote, 'the') == 2
    assert count(quote, 'them') == 0
Reflection 7.9 What is printed by the assert statements in the test_count func-
tion?
http://docs.python.org/3/library/development.html
This error tells you which assertion failed so that you can track down the problem.
(In this case, there is no problem; change the assertion back to the correct version.)
On their own, the results of these two tests do not provide enough evidence to
show that the count function is correct. As we noted earlier, we need to choose a
variety of tests that are representative of the entire range of possibilities. The input
that we use for a particular test is called a test case. We can generally divide test
cases into three categories:
Notice that we have chosen inputs that result in a variety of results, including
the case in which the target string is not found.
2. Boundary cases. For the count function, these include cases in which either
parameter is the empty string. We will start with cases in which target is
the empty string.
Reflection 7.10 What should the correct answer be for count(quote, ’’)?
In other words, how many times does the empty string occur in a non-empty
string?
This question is similar to asking how many zeros there are in 10. In mathe-
matics, this number (10/0) is undefined, and Python gives an error. Should
counting the number of empty strings in a string also result in an error? Or
perhaps the answer should be 0; in other words, there are no empty strings in
a non-empty string.
Let’s suppose that we want the function to return zero if target is the empty
string. So we add the following three test cases to the test_count function:
assert count(quote, '') == 0
assert count('a', '') == 0
assert count('', '') == 0
Notice that we passed in strings with three different lengths for text.
Reflection 7.11 Does the count function pass these tests? What does
count(quote, ’’) actually return? Look at the code and try to understand
why. For comparison, what does the built-in string method quote.count(’’)
return?
The return values of both count(quote, ’’) and quote.count(’’) are 38,
which is len(quote) + 1, so the tests fail. This happens because, when target
is the empty string, every test of the if condition, which evaluates to
if text[index:index] == '':
is true. Since the loop iterates len(text) + 1 times, each time incrementing
count, len(text) + 1 is assigned to count at the end.
This is a situation in which testing has turned up a case that we might not
have noticed otherwise. Since our expectation differs from the current behavior,
we need to modify the count function by inserting an if statement before the
for loop:
def count(text, target):
    """ (docstring omitted) """
    if target == '':
        return 0
    count = 0
    for index in range(len(text) - len(target) + 1):
        if text[index:index + len(target)] == target:
            count = count + 1
    return count
In this case, both tests pass. If text is the empty string and target contains
at least one character, then the range of the for loop is empty and the function
returns zero.
In addition to tests with the empty string, we should add a few tests in which
both strings are close to the empty string:
assert count(quote, 'e') == 4
assert count('e', 'e') == 1
assert count('e', 'a') == 0
Notice that we have also included two cases in which the target is close to
something at the beginning and end, but does not actually match.
3. Corner cases. A corner case is any other kind of rare input that might cause
the function to break. For the count function, our boundary cases took care
of most of these. But two other unusual cases to check might be if the text
and target are the same, and if text is shorter than target:
assert count(quote, quote) == 1
assert count('the', quote) == 0
Putting all of these test cases together, our unit test for the count function looks
like this:
def test_count():
    quote = 'Diligence is the mother of good luck.'

    assert count(quote, 'the') == 2       # common cases
    assert count(quote, 'the ') == 1
    assert count(quote, 'them') == 0
    assert count(quote, ' ') == 6

    assert count(quote, '') == 0          # boundary cases
    assert count('a', '') == 0
    assert count('', '') == 0
    assert count('', quote) == 0
    assert count('', 'a') == 0
    assert count(quote, 'e') == 4
    assert count('e', 'e') == 1
    assert count('e', 'a') == 0
    assert count(quote, 'D') == 1
    assert count(quote, 'Di') == 1
    assert count(quote, 'Dx') == 0
    assert count(quote, '.') == 1
    assert count(quote, 'k.') == 1
    assert count(quote, '.x') == 0

    assert count(quote, quote) == 1       # corner cases
    assert count('the', quote) == 0
def test_getWordCounts():
    text = 'Call me Ishmael. Some years ago--never mind how long \
precisely--having little or no money in my purse, and nothing \
particular to interest me on shore, I thought I would sail about \
a little and see the watery part of the world. It is a way I have \
of driving off the spleen and regulating the circulation.'
Reflection 7.13 Why did we use 4 / 15 and 1 / 15 in the tests above instead of
something like 0.267 and 0.067?
Using these floating point approximations of 4/15 and 1/15 would cause the assertions
to fail. For example, try this:
>>> assert 4 / 15 == 0.267 # fails
In general, you should never test for equality between floating point numbers. There
are two reasons for this. First, the value you use may not accurately represent
the correct value that you are testing against. This was the case above. To get
assert 4 / 15 == 0.267 to pass, you would have to add a lot more digits to the
right of the decimal point (e.g., 0.26666 ⋯ 66). But, even then, the number of digits
in the value of 4 / 15 may depend on your specific computer system, so even using
more digits is a bad idea. Second, as we discussed in Sections 2.2 and 4.4, floating
point numbers have finite precision and are therefore approximations. For example,
consider the following example from Section 4.4.
sum = 0
for index in range(1000000):
    sum = sum + 0.0001
assert sum == 100.0
This loop adds one ten-thousandth one million times, so the answer should be one
hundred, as reflected in the assert statement. However, the assert fails because
the value of sum is actually slightly greater than 100 due to rounding errors. To deal
with this inconvenience, we should always test floating point values within a range
instead. For example, the assert statement above should be replaced by
assert sum > 99.99 and sum < 100.01
or
assert sum > 99.99999 and sum < 100.00001
The size of the range that you test will depend on the accuracy that is necessary in
your particular application.
Let’s apply this idea to the familiar volumeSphere function:
import math

def volumeSphere(radius):
    return (4 / 3) * math.pi * (radius ** 3)
To generate some test cases for this function, we would figure out what the answers
should be for a variety of different values of radius and then write assert statements
for each of these test cases. For example, the volume of a sphere with radius 10 is
about 4188.79. So our assert statement should look something like
assert volumeSphere(10) > 4188.789 and volumeSphere(10) < 4188.791
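If you are using Python 3.5 or newer, the math.isclose function provides a convenient way to express the same kind of range test; the tolerance below is an illustrative choice, not a required value.

import math

# passes whenever volumeSphere(10) is within 0.001 of 4188.790
assert math.isclose(volumeSphere(10), 4188.790, abs_tol=0.001)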
Exercises
7.3.1. Finish the unit test for the getWordCounts function. Be sure to include both
boundary and corner cases.
7.3.2. Design a thorough unit test for the digit2String function from Section 7.2.
7.3.3. Design a thorough unit test for the int2String function from Section 7.2.
7.3.4. Design a thorough unit test for the assignGP function below.
def assignGP(score):
    if score >= 90:
        return 4
    if score >= 80:
        return 3
    if score >= 70:
        return 2
    if score >= 60:
        return 1
    return 0
7.3.5. Design a thorough unit test for the volumeSphere function in Exercise 7.2.4.
7.3.6. Design a thorough unit test for the windChill function in Exercise 7.2.6.
7.3.7. Design a thorough unit test for the decayC14 function in Exercise 7.2.9.
7.3.8. Design a thorough unit test for the reverse function in Exercise 7.2.11.
7.3.9. Design a thorough unit test for the find function in Exercise 7.2.12.
7.4 SUMMARY
Designing correct algorithms and programs requires following a careful, reflective
process. We outlined an adaptation of Polya’s How to Solve It with four main steps:
1. Understand the problem.
2. Design an algorithm, starting from a top-down design.
3. Implement your algorithm as a program.
4. Analyze your program for clarity, correctness, and efficiency.
It is important to remember, however, that designing programs is not a linear
process. For example, sometimes after we start programming (step 3), we notice a
more efficient way to organize our functions and return to step 2. And, while testing
a function (step 4), we commonly find errors that need to be corrected, returning us
to step 3. So treat these four steps as guidelines, but always allow some flexibility.
In the last two sections, we formalized the process of designing and testing our
functions. In the design process, we introduced design by contract using preconditions
and postconditions, and the use of assert statements to enforce them. Finally, we
introduced unit testing and regression testing as means for discovering errors in
your functions and programs. In the remaining chapters, we encourage you to apply
design by contract and unit testing to your projects. Although this practice requires
more time up front, it virtually always ends up saving time overall because less time
is wasted chasing down hard-to-find errors.
CHAPTER 8
Data analysis
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”
Sherlock Holmes
The Adventure of the Copper Beeches (1892)
The first example is a list representing hundreds of daily sales, the second example
is a list of the 2012 unemployment rates of the six largest metropolitan areas, and
the third example is a list of votes of a five-member board. The last example is a list
of (x, y) coordinates, each of which is represented by a two-element list. Although
they usually do not in practice, lists can also contain items of different types:
>>> crazy = [15, 'gtaac', [1, 2, 3], max(4, 14)]
>>> crazy
[15, 'gtaac', [1, 2, 3], 14]
Since lists are sequences like strings, they can also be indexed and sliced. But now
indices refer to list elements instead of characters and slices are sublists instead of
substrings:
>>> sales[1]
42
>>> votes[:3]
['yea', 'yea', 'nay']
Now suppose that we are running a small business, and we need to get some
basic descriptive statistics about last week’s daily sales, starting with the average
(or mean) daily sales for the week. Recall from Section 1.2 that, to find the mean of
a list of numbers, we need to first find their sum by iterating over the list. Iterating
over the values in a list is essentially identical to iterating over the characters in a
string, as illustrated below.
def mean(data):
    """Compute the mean of a list of numbers.

    Parameter:
        data: a list of numbers

    Return value: the mean of the numbers in data
    """
    sum = 0
    for item in data:
        sum = sum + item
    return sum / len(data)
In each iteration of the for loop, item is assigned the next value in the list named
data, and then added to the running sum. After the loop, we divide the sum by the
length of the list, which is retrieved with the same len function we used on strings.
If data is the empty list ([ ]), then the value of len(data) is zero, resulting in a
“division by zero” error in the return statement. We have several options to deal
with this. First, we could just let the error happen. Second, if you read Section 7.2,
we could use an assert statement to print an error message and abort. Third, we
could detect this error with an if statement and return something that indicates
that an error occurred. In this case, we will adopt the last option by returning None
and indicating this possibility in the docstring.
1   def mean(data):
2       """Compute the mean of a non-empty list of numbers.
3
4       Parameter:
5           data: a list of numbers
6
7       Return value: the mean of numbers in data or None if data is empty
8       """
9
10      if len(data) == 0:
11          return None
12
13      sum = 0
14      for item in data:
15          sum = sum + item
16      return sum / len(data)
This for loop is yet another example of an accumulator, and is virtually identical
to the countLinks function that we developed in Section 4.1. To illustrate what is
happening, suppose we call mean from the following main function.
def main():
    sales = [32, 42, 11, 15, 58, 44, 16]
    averageSales = mean(sales)
    print('Average daily sales were', averageSales)

main()
The call to mean(sales) above will effectively execute the following sequence of
statements inside the mean function. The changing value of item assigned by the
for loop is highlighted in red. The numbers on the left indicate which line in the
mean function is being executed in each step.
13  sum = 0             # sum is initialized
14  item = 32           # for loop assigns 32 to item
15  sum = sum + item    # sum is assigned 0 + 32 = 32
14  item = 42           # for loop assigns 42 to item
Reflection 8.2 Fill in the missing steps above to see how the function arrives at a
sum of 218.
The mean of a data set does not adequately describe it if there is a lot of variability
in the data, i.e., if there is no “typical” value. In these cases, we need to accompany
the mean with the variance, which is a measure of how much the data varies from
the mean. Computing the variance is left as Exercise 8.1.10.
Now let’s think about how to find the minimum and maximum sales in the list.
Of course, it is easy for us to just look at a short list like the one above and pick out
the minimum and maximum. But a computer does not have this ability. Therefore,
as you think about these problems, it may be better to think about a very long list
instead, one in which the minimum and maximum are not so obvious.
Reflection 8.3 Think about how you would write an algorithm to find the minimum
value in a long list. (Similar to a running sum, keep track of the current minimum.)
As the hint suggests, we want to maintain the current minimum while we iterate
over the list with a for loop. When we examine each item, we need to test whether
it is smaller than the current minimum. If it is, we assign the current item to be the
new minimum. The following function implements this algorithm.
def min(data):
    """Compute the minimum value in a non-empty list of numbers.

    Parameter:
        data: a list of numbers

    Return value: the minimum value in data or None if data is empty
    """
    if len(data) == 0:
        return None

    minimum = data[0]
    for item in data[1:]:
        if item < minimum:
            minimum = item
    return minimum
Before the loop, we initialize minimum to be the first value in the list, using indexing.
Then we iterate over the slice of remaining values in the list. In each iteration, we
compare the current value of item to minimum and, if item is smaller than minimum,
update minimum to the value of item. At the end of the loop, minimum is assigned
the smallest value in the list.
Reflection 8.4 If the list [32, 42, 11, 15, 58, 44, 16] is assigned to data,
then what are the values of data[0] and data[1:]?
Let’s look at a small example of how this function works when we call it with the
list containing only the first four numbers from the list above: [32, 42, 11, 15].
The function begins by assigning the value 32 to minimum. The first value of item is
42. Since 42 is not less than 32, minimum remains unchanged. In the next iteration
of the loop, the third value in the list, 11, is assigned to item. In this case, since 11
is less than 32, the value of minimum is updated to 11. Finally, in the last iteration
of the loop, item is assigned the value 15. Since 15 is greater than 11, minimum is
unchanged. At the end, the function returns the final value of minimum, which is 11.
A function to compute the maximum is very similar, so we leave it as an exercise.
Reflection 8.5 What would happen if we iterated over data instead of data[1:]?
Would the function still work?
If we iterated over the entire list instead, the first comparison would be useless
(because item and minimum would be the same) so it would be a little less efficient,
but the function would still work fine.
Now what if we also wanted to know on which day of the week the minimum
sales occurred? To answer this question, assuming we know how indices correspond
to days of the week, we need to find the index of the minimum value in the list. As
we learned in Chapter 6, we need to iterate over the indices in situations like this:
def minDay(data):
    """Compute the index of the minimum value in a non-empty list.

    Parameter:
        data: a list of numbers

    Return value: the index of the minimum value in data, or -1 if data is empty
    """
    if len(data) == 0:
        return -1

    minIndex = 0
    for index in range(1, len(data)):
        if data[index] < data[minIndex]:
            minIndex = index
    return minIndex
This function performs almost exactly the same algorithm as our previous min
function, but now each value in the list is identified by data[index] instead of
item and we remember the index of current minimum in the loop instead of the
actual minimum value.
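For example, with the week of sales from above, the minimum value 11 is at index 2:

>>> minDay([32, 42, 11, 15, 58, 44, 16])
2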
Reflection 8.6 How can we modify the minDay function to return a day of the
week instead of an index, assuming the sales data starts on a Sunday?
A clever solution is to create a list of the days of the week that are in the
same order as the sales data. Then we can simply use the value of minIndex as an
index into this list to return the correct string.
    days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday']
    return days[minIndex]
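Putting the pieces together, the modified function might look something like the sketch below; the name minSalesDay is just an illustrative choice.

def minSalesDay(data):
    """Return the name of the day with the minimum value in a non-empty
    list of daily sales that starts on a Sunday."""
    minIndex = 0
    for index in range(1, len(data)):
        if data[index] < data[minIndex]:
            minIndex = index
    days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday']
    return days[minIndex]

With the sales list from above, minSalesDay([32, 42, 11, 15, 58, 44, 16]) returns 'Tuesday', since the minimum value 11 occurred at index 2.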
There are many other descriptive statistics that we can use to summarize the
contents of a list. The following exercises challenge you to implement some of them.
Exercises
8.1.1. Suppose a list is assigned to the variable name data. Show how you would
(a) print the length of data
(b) print the third element in data
(c) print the last element in data
(d) print the last three elements in data
(e) print the first four elements in data
(f) print the list consisting of the second, third, and fourth elements in data
8.1.2. In the mean function, we returned None if data was empty. Show how to modify
the following main function so that it properly tests for this possibility and prints
an appropriate message.
def main():
    someData = getInputFromSomewhere()
    average = mean(someData)
    print('The mean value is', average)
8.1.21. The Luhn algorithm is the standard algorithm used to validate credit card numbers
and protect against accidental errors. Read about the algorithm online, and then
write a function
validateLuhn(number)
that returns True if the number is valid and False otherwise. The number parameter
will be a list of digits. For example, to determine if the credit card number 4563
9601 2200 1999 is valid, one would call the function with the parameter [4, 5, 6,
3, 9, 6, 0, 1, 2, 2, 0, 0, 1, 9, 9, 9]. (Hint: use a for loop that iterates
in reverse over the indices of the list.)
def pond(years):
    """ (docstring omitted) """
    population = 12000
    populationList = [population]
    for year in range(1, years + 1):
        population = 1.08 * population - 1500
        populationList.append(population)
    pyplot.plot(range(years + 1), populationList)
    pyplot.show()
    return population
In the first red statement, the list named populationList is initialized to the single-
item list [12000]. In the second red statement, inside the loop, each new population
value is appended to the list. Finally, the list is plotted with the pyplot.plot
function.
We previously called this technique a list accumulator, due to its similarity to
integer accumulators. List accumulators can be applied to a variety of problems.
For example, consider the find function from Section 6.5. Suppose that, instead
of returning only the index of the first instance of the target string, we wanted to
return a list containing the indices of all the instances:
def find(text, target):
    """ (docstring omitted) """
    indexList = []
    for index in range(len(text) - len(target) + 1):
        if text[index:index + len(target)] == target:
            indexList.append(index)
    return indexList
In this function, just as in the pond function, we initialize the list before the loop,
and append an index to the list inside the loop (wherever we find the target string).
At the end of the function, we return the list of indices. The function call
find('Well done is better than well said.', 'ell')
would return the list [1, 26]. On the other hand, the function call
find('Well done is better than well said.', 'Franklin')
would return the empty list [] because the condition in the if statement is never
true.
This list accumulator pattern is so common that there is a shorthand for it called
a list comprehension. You can learn more about list comprehensions by reading the
optional material at the end of this section, starting on Page 368.
[Diagram: the list unemployment stored in memory, with indices 0 through 5 labeling its six values.]
So the value 0.082 is assigned to the name unemployment[0], the value 0.092
is assigned to the name unemployment[1], etc. When we assigned a new value
to unemployment[1] with unemployment[1] = 0.062, we were simply assigning a
new value to the name unemployment[1], like any other assignment statement:
[Diagram: the same list after the assignment, now with 0.062 stored at index 1.]
Suppose we wanted to adjust all of the unemployment rates in this list by subtracting
one percent from each of them. We can do this with a for loop that iterates over
the indices of the list.
>>> for index in range(len(unemployment)):
        unemployment[index] = unemployment[index] - 0.01
>>> unemployment
[0.072, 0.052, 0.081, 0.053, 0.058, 0.042]
Reflection 8.7 Is it possible to achieve the same result by iterating over the values
in the list instead? In other words, does the following for loop accomplish the same
thing? (Try it.) Why or why not?
for rate in unemployment:
    rate = rate - 0.01
This loop does not modify the list because rate, which is being modified, is not a
name in the list. So, although the value assigned to rate is being modified, the list
itself is not. For example, at the beginning of the first iteration, 0.082 is assigned
to rate, as illustrated below.
[Diagram: the loop variable rate refers to the value 0.082, the first item in the list unemployment.]
Then, when the modified value rate - 0.01 is assigned to rate, this only affects
rate, not the original list, as illustrated below.
[Diagram: rate now refers to the new value 0.072, while the list unemployment is unchanged.]
def adjust(rates):
    """Subtract one percent (0.01) from each rate in a list.

    Parameter:
        rates: a list of numbers representing rates (percentages)
    """
    for index in range(len(rates)):
        rates[index] = rates[index] - 0.01
def main():
    unemployment = [0.053, 0.071, 0.065, 0.074]
    adjust(unemployment)
    print(unemployment)

main()
The list named unemployment is assigned in the main function and then passed in
for the parameter rates to the adjust function. Inside the adjust function, every
value in rates is decremented by 0.01. What effect, if any, does this have on the list
assigned to unemployment? To find out, we need to look carefully at what happens
when the function is called.
Right after the assignment statement in the main function, the situation looks
like the following, with the variable named unemployment in the main namespace
assigned the list [0.053, 0.071, 0.065, 0.074].
[Diagram: in the main namespace, the name unemployment refers to the list [0.053, 0.071, 0.065, 0.074].]
Now recall from Section 3.5 that, when an argument is passed to a function, it
is assigned to its associated parameter. Therefore, immediately after the adjust
function is called from main, the parameter rates is assigned the same list as
unemployment:
[Diagram: the names unemployment (in main) and rates (in adjust) both refer to the same list.]
After adjust executes, 0.01 has been subtracted from each value in rates, as the
following picture illustrates.
[Diagram: unemployment and rates still refer to the same list, which now contains the decremented values.]
But notice that, since the same list is assigned to unemployment, these changes will
also be reflected in the value of unemployment back in the main function. In other
words, after the adjust function returns, the picture looks like this:
[Diagram: after adjust returns, unemployment in main refers to the modified list.]
Reflection 8.8 Why does the argument’s value change in this case when it did not
in the parameter passing examples in Section 3.5? What is different?
The difference here is that lists are mutable. When you pass a mutable value as
an argument, any changes to the associated formal parameter inside the function
will be reflected in the value of the argument. Therefore, when we pass a list as an
argument to a function, the values in the list can be changed inside the function.
What if we did not want to change the argument to the adjust function (i.e.,
unemployment), and instead return a new adjusted list? One alternative, illustrated
in the function below, would be to make a copy of rates, using the list method
copy, and then modify this copy instead.
def adjust(rates):
    """ (docstring omitted) """
    ratesCopy = rates.copy()
    for index in range(len(ratesCopy)):
        ratesCopy[index] = ratesCopy[index] - 0.01
    return ratesCopy
The copy method creates an independent copy of the list in memory, and returns a
reference to this new list so that it can be assigned to a variable name (in this case,
ratesCopy). There are other solutions to this problem as well, which we leave as
exercises.
Tuples
Python offers another list-like object called a tuple. A tuple works just like a list,
with two substantive differences. First, a tuple is enclosed in parentheses instead
of square brackets. Second, a tuple is immutable. For example, as a tuple, the
unemployment data would look like (0.053, 0.071, 0.065, 0.074).
Tuples can be used in place of lists in situations where the object being repre-
sented has a fixed length, and individual components are not likely to change. For
example, colors are often represented by their (red, green, blue) components (see
Box 3.2) and two-dimensional points by (x, y).
>>> point = (4, 2)
>>> point
(4, 2)
>>> green = (0, 255, 0)
Reflection 8.9 Try reassigning the first value in point to 7. What happens?
>>> point[0] = 7
TypeError: 'tuple' object does not support item assignment
Tuples are more memory efficient than lists because extra memory is set aside in a
list for a few appends before more memory must be allocated.
The concatenation operator + creates a new list that is the result of “sticking” two
lists together. For example:
>>> unemployment = [0.082, 0.092, 0.091, 0.063, 0.068, 0.052]
>>> unemployment = unemployment + [0.087, 0.101]
>>> unemployment
[0.082, 0.092, 0.091, 0.063, 0.068, 0.052, 0.087, 0.101]
Notice that the concatenation operator combines two lists to create a new list,
whereas the append method inserts a new element into an existing list as the last
element. In other words,
unemployment = unemployment + [0.087, 0.101]
has the same effect on the contents of the list as
unemployment.append(0.087)
unemployment.append(0.101)
However, using concatenation actually creates a new list that is then assigned to
unemployment, whereas using append modifies an existing list. So using append is
usually more efficient than concatenation if you are just adding to the end of an
existing list.
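One way to see the difference (a small illustration, not from the text) is to give the same list a second name and watch what each operation does to it:

>>> a = [1, 2]
>>> b = a            # b refers to the same list object as a
>>> a = a + [3]      # concatenation builds a new list and rebinds a
>>> b
[1, 2]
>>> c = [1, 2]
>>> d = c
>>> c.append(3)      # append modifies the one existing list in place
>>> d
[1, 2, 3]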
Reflection 8.10 How do the results of the following two statements differ? If you
want to add the number 0.087 to the end of the list, which is correct?
unemployment.append(0.087)
and
unemployment.append([0.087])
The list class has several useful methods in addition to append. We will use many
of these in the upcoming sections to solve a variety of problems. For now, let’s look
at four of the most common methods: sort, insert, pop, and remove.
The sort method simply sorts the elements in a list in increasing order. For
example, suppose we have a list of SAT scores that we would like to sort:
>>> scores = [620, 710, 520, 550, 640, 730, 600]
>>> scores.sort()
>>> scores
[520, 550, 600, 620, 640, 710, 730]
Note that the sort and append methods, as well as insert, pop and remove, do
not return new lists; instead they modify the lists in place. In other words, the
following is a mistake:
>>> scores = [620, 710, 520, 550, 640, 730, 600]
>>> newScores = scores.sort()
Reflection 8.11 What is the value of newScores after we execute the statement
above?
Printing the value of newScores reveals that it refers to the value None because
sort does not return anything (meaningful). However, scores was modified as we
expected:
>>> newScores
>>> scores
[520, 550, 600, 620, 640, 710, 730]
The sort method will sort any list that contains comparable items, including
strings. For example, suppose we have a list of names that we want to be in
alphabetical order:
>>> names = ['Eric', 'Michael', 'Connie', 'Graham']
>>> names.sort()
>>> names
['Connie', 'Eric', 'Graham', 'Michael']
Reflection 8.12 What happens if you try to sort a list containing items that cannot
be compared to each other? For example, try sorting the list [3, ’one’, 4, ’two’].
The insert method inserts an item into a list at a particular index. For exam-
ple, suppose we want to insert new names into the sorted list above to maintain
alphabetical order:
>>> names.insert(3, 'John')
>>> names
['Connie', 'Eric', 'Graham', 'John', 'Michael']
>>> names.insert(0, 'Carol')
>>> names
['Carol', 'Connie', 'Eric', 'Graham', 'John', 'Michael']
The first parameter of the insert method is the index where the inserted item will
reside after the insertion.
The pop method is the inverse of insert; pop deletes the list item at a given
index and returns the deleted value. For example,
>>> inMemoriam = names.pop(3)
>>> names
['Carol', 'Connie', 'Eric', 'John', 'Michael']
>>> inMemoriam
'Graham'
If the argument to pop is omitted, pop deletes and returns the last item in the list.
The remove method also deletes an item from a list, but takes the value of an
item as its parameter rather than its position. If there are multiple items in the list
with the given value, the remove method only deletes the first one. For example,
>>> names.remove('John')
>>> names
['Carol', 'Connie', 'Eric', 'Michael']
Reflection 8.13 What happens if you try to remove ’Graham’ from names now?
*List comprehensions
As we mentioned at the beginning of this section, the list accumulator pattern
is so common that there is a shorthand for it called a list comprehension. A list
comprehension allows us to build up a list in a single statement. For example,
suppose we wanted to create a list of the first 15 even numbers. Using a for loop,
we can construct the desired list with:
evens = [ ]
for i in range(15):
evens.append(2 * i)
The same list can be built in a single statement with the list comprehension

evens = [2 * i for i in range(15)]

The first part of the list comprehension is an expression (here 2 * i) representing the items we want in the list. This is the same as the expression that would be passed to the append method if we constructed the list the “long way” with a for loop. The expression is followed by a for loop clause that specifies the values of an index variable for which the expression should be evaluated; this clause is identical to the header of the for loop that we would use to construct the list the “long way.”
A list comprehension can also contain an if clause that filters which values are included in the list. For example, suppose we wanted only those even numbers that are not also divisible by six. Using a for loop, we would write:

evens = [ ]
for i in range(15):
    if 2 * i % 6 != 0:
        evens.append(2 * i)
This can be reproduced with a list comprehension that looks like this:
evens = [2 * i for i in range(15) if 2 * i % 6 != 0]
The corresponding parts of this loop and the list comprehension line up in the same way: the expression 2 * i comes first, then the for clause, then the if clause.
Reflection 8.14 Rewrite the find function on Page 361 using a list comprehension.
The find function can be rewritten as a single list comprehension. Look carefully at the corresponding parts of the original loop version and the list comprehension version, as we did above.
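As a sketch, assuming the find function on Page 361 takes a list and a target value and returns a list of all the indices at which the target appears, the comprehension version might look like this:

def find(data, target):
    """Return a list of the indices of all occurrences of target in data."""
    return [index for index in range(len(data)) if data[index] == target]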
Exercises
8.2.1. Show how to add the string ’grapes’ to the end of the following list using both
concatenation and the append method.
fruit = ['apples', 'pears', 'kiwi']
8.2.2. Write a function
squares(n)
that returns a list containing the squares of the integers 1 through n. Use a for
loop.
8.2.3. (This exercise assumes that you have read Section 6.7.) Write a function
getCodons(dna)
that returns a list containing the codons in the string dna. Your algorithm should
use a for loop.
8.2.4. Write a function
square(data)
that takes a list of numbers named data and squares each number in data in
place. The function should not return anything. For example, if the list [4, 2, 5]
is assigned to a variable named numbers then, after calling square(numbers),
numbers should have the value [16, 4, 25].
8.2.5. Write a function
swap(data, i, j)
that swaps the positions of the items with indices i and j in the list named data.
8.2.6. Write a function
reverse(data)
that reverses the list data in place. Your function should not return anything.
(Hint: use the swap function you wrote above.)
8.2.7. Suppose you are given a list of ’yea’ and ’nay’ votes. Write a function
winner(votes)
that returns the majority vote. For example, winner([’yea’, ’nay’, ’yea’])
should return ’yea’. If there is a tie, return ’tie’.
8.2.8. Write a function
delete(data, index)
that returns a new list that contains the same elements as the list data except
for the one at the given index. If the value of index is negative or exceeds the
length of data, return a copy of the original list. Do not use the pop method. For
example, delete([3, 1, 5, 9], 2) should return the list [3, 1, 9].
8.2.9. Write a function
remove(data, value)
that returns a new list that contains the same elements as the list data except for
those that equal value. Do not use the built-in remove method. Note that, unlike
the built-in remove method, your function should remove all items equal to value.
For example, remove([3, 1, 5, 3, 9], 3) should return the list [1, 5, 9].
Next, show a sequence of calls to the pop method that delete each of the following
items from the final list above.
(a) ’soap’
(b) ’watermelon’
(c) ’bananas’
(d) ’ham’
8.2.17. Given n people in a room, what is the probability that at least one pair of people
shares a birthday? To answer this question, first write a function
sameBirthday(people)
that creates a list of people random birthdays (one birthday for each person) and returns True if any two of the birthdays are the same, and False otherwise. Use the numbers 0 to 364 to represent 365
different birthdays. Next, write a function
birthdayProblem(people, trials)
that performs a Monte Carlo simulation with the given number of trials to
approximate the probability that, in a room with the given number of people,
two people share a birthday.
8.2.18. Write a function that uses your solution to the previous problem to return the
smallest number of people for which the probability of a pair sharing a birthday is
at least 0.5.
8.2.19. Rewrite the squares function from Exercise 8.2.2 using a list comprehension.
8.2.20. (This exercise assumes that you have read Section 6.7.) Rewrite the getCodons
function from Exercise 8.2.3 using a list comprehension.
Tallying values
To get a sense of how we might compute the frequencies of values in a list, let’s
consider the ocean buoy temperature readings that we first encountered in Chapter 1.
As we did then, let’s simplify matters by just considering one week’s worth of data:
temperatures = [18.9, 19.1, 18.9, 19.0, 19.3, 19.2, 19.3]
Now we want to iterate over the list, keeping track of the frequency of each value
that we see. We can imagine using a simple table for this purpose. After we see the
first value in the list, 18.9, we create an entry in the table for 18.9 and mark its
frequency with a tally mark.
Temperature: 18.9
Frequency:   |
The second value in the list is 19.1, so we create another entry and tally mark.
The third value is 18.9, so we add another tally mark to the 18.9 column.
The fourth value is 19.0, so we create another entry in the table with a tally mark.
Continuing in this way with the rest of the list, we get the following final table.

Temperature: 18.9  19.1  19.0  19.2  19.3
Frequency:   ||    |     |     |     ||

Or, equivalently:

Temperature: 18.9  19.1  19.0  19.2  19.3
Frequency:    2     1     1     1     2
Now to find the mode, we find the maximum frequency in the table, and then return
a list of the values with this maximum frequency: [18.9, 19.3].
Dictionaries
Notice how the frequency table resembles the picture of a list on Page 361, except
that the indices are replaced by temperatures. In other words, the frequency table
looks like a generalized list in which the indices are replaced by values that we choose.
This kind of abstract data type is called a dictionary. In a dictionary, each index is
replaced with a more general key. Unlike a list, in which the indices are implicit, a
dictionary in Python (called a dict object) must define the correspondence between
a key and its value explicitly with a key:value pair. To differentiate it from a list, a
dictionary is enclosed in curly braces ({ }). For example, the frequency table above
would be represented in Python like this:
>>> frequency = {18.9: 2, 19.1: 1, 19.0: 1, 19.2: 1, 19.3: 2}
Table 8.1: Some common methods of the list class, and three functions from the random module that are useful for lists. The name m represents an arbitrary list constant or object. Optional parameters are denoted in square brackets. For a complete reference, see http://docs.python.org/3/library/stdtypes.html#mutable-sequence-types.

m.append(item)          appends item to the end of the list m; returns None
m.insert(index, item)   inserts item into the list m at position index; returns None
m.pop([index])          deletes the item in position index from the list m and returns it;
                        if index is omitted, deletes and returns the last item in the list m
m.sort([key, reverse])  sorts the list m in place using a stable sort; if provided, key is a
                        function that returns a key for each list item to sort on; if reverse
                        is True, the list is sorted in reverse order; returns None
m.reverse()             reverses the items in the list m in place; returns None
m.index(item)           returns the index of the first occurrence of item in the list m;
                        raises a ValueError if item is not found in m
m.count(item)           returns the number of occurrences of item in the list m
m.remove(item)          removes the first occurrence of item from the list m and returns
                        None; raises a ValueError if item is not found in m
m.copy()                returns a shallow copy of the list m
random.choice(m)        returns a random element from the non-empty list m
random.shuffle(m)       shuffles the list m in place; returns None
random.sample(m, k)     returns a list of k unique elements chosen from the list m; used
                        for random sampling without replacement

The first pair in frequency has key 18.9 and value 2, the second pair has key 19.1 and value 1, and so on. If you type the dictionary above into the Python shell, and then display it, you might notice something unexpected:

>>> frequency
{19.0: 1, 19.2: 1, 18.9: 2, 19.1: 1, 19.3: 2}

The items appear in a different order than they were entered. This is okay because a dictionary is an unordered collection of key:value pairs. The displayed ordering has to do with the way dictionaries are implemented, using a structure called a hash table. If you are interested, you can learn a bit more about hash tables in Box 8.2.

Each entry in a dictionary object can be referenced using the familiar indexing notation, but with a key inside the square brackets instead of an index. For example:

>>> frequency[19.3]
2
>>> frequency[19.1]
1

Each entry in the dictionary is a reference to a value, in the same way that each entry in a list is a reference to a value. So, as with a list, we can change any value in a dictionary. For example, we can increment the value associated with the key 19.3:

>>> frequency[19.3] = frequency[19.3] + 1

Now let's use a dictionary to implement the algorithm that we developed above to find the frequencies and the mode(s) of a list. To begin, we will create an empty dictionary named frequency in which to record our tally marks:

def mode(data):
    """Compute the mode of a non-empty list.

    Parameter:
        data: a list of items

    Return value: the mode of the items in data
    """

    frequency = { }
Box 8.2: Hash tables

slot index:  0         1    2    3         4         5
contents:    19.0: 1             18.9: 2   19.1: 1
In this illustration, the hash function associates the pair 18.9: 2 with slot index 3,
19.1: 1 with slot index 4, and 19.0: 1 with slot index 0.
The underlying hash table allows us to access a value in a dictionary (e.g.,
frequency[18.9]) or test for inclusion (e.g., key in frequency) in a constant amount
of time because each operation only involves a hash computation and then a direct
access (like indexing in a string or a list). In contrast, if the pairs were stored in a list,
then the list would need to be searched (in linear time) to perform these operations.
Unfortunately, this constant-time access could be foiled if a key is mapped to an occupied
slot, an event called a collision. Collisions can be resolved by using adjacent slots, using
a second hash function, or associating a list of items with each slot. A good hash function
tries to prevent collisions by assigning slots in a seemingly random manner, so that keys
are evenly distributed in the table and similar keys are not mapped to the same slot.
Because hash functions tend to be so good, we can still consider an average dictionary
access to be a constant-time operation, or one elementary step, even with collisions.
Each entry in this dictionary will have its key equal to a unique item in data and its
value equal to the item’s frequency count. To tally the frequencies, we need to iterate
over the items in data. As in our tallying algorithm, if there is already an entry in
frequency with a key equal to the item, we will increment the item’s associated value;
otherwise, we will create a new entry with frequency[item] = 1. To differentiate
between the two cases, we can use the in operator: item in frequency evaluates
to True if there is a key equal to item in the dictionary named frequency.
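A sketch of that tallying loop, written from the description above (it belongs in the body of the mode function we just started), might look like this:

    for item in data:
        if item in frequency:                        # item already has an entry; add to its tally
            frequency[item] = frequency[item] + 1
        else:                                        # first time we have seen this item
            frequency[item] = 1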
The next step is to find the maximum frequency in the dictionary. Since the
frequencies are the values in the dictionary, we need to extract a list of the values,
and then find the maximum in this list. A list of the values in a dictionary object
is returned by the values method, and a list of the keys is returned by the keys
method. For example, if we already had the complete frequency dictionary for the
example above, we could extract the values with
>>> frequency.values()
dict_values([1, 1, 2, 1, 2])
The values method returns a special kind of object called a dict_values object,
but we want the values in a list. To convert the dict_values object to a list, we
simply use the list function:
>>> list(frequency.values())
[1, 1, 2, 1, 2]
The resulting lists will be in whatever order the dictionary happened to be stored,
but the values and keys methods are guaranteed to produce lists in which every key
in a key:value pair is in the same position in the list of keys as its corresponding
value is in the list of values.
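For example, with the frequency dictionary above (whose stored order we saw earlier; the order on your machine may differ), the keys and values line up pairwise:

>>> list(frequency.keys())
[19.0, 19.2, 18.9, 19.1, 19.3]
>>> list(frequency.values())
[1, 1, 2, 1, 2]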
Returning to the problem at hand, we can use the values method to retrieve
the list of frequencies, and then find the maximum frequency in that list:
frequencyValues = list(frequency.values())
maxFrequency = max(frequencyValues)
Finally, to find the mode(s), we need to build a list of all the items in data
with frequency equal to maxFrequency. To do this, we iterate over all the keys
in frequency, appending to the list of modes each key that has a value equal to
maxFrequency. Iterating over the keys in frequency is done with the familiar for
loop. When the for loop is done, we return the list of modes.
modes = [ ]
for key in frequency:
    if frequency[key] == maxFrequency:
        modes.append(key)
return modes
With all of these pieces in place, the complete function looks like the following.
Figure 8.1: A histogram displaying the frequency of each temperature reading in the list [18.9, 19.1, 18.9, 19.0, 19.3, 19.2, 19.3].
def mode(data):
    """ (docstring omitted) """

    frequency = { }
    for item in data:
        if item in frequency:                        # item already has an entry
            frequency[item] = frequency[item] + 1
        else:                                        # first occurrence of item
            frequency[item] = 1

    frequencyValues = list(frequency.values())
    maxFrequency = max(frequencyValues)

    modes = [ ]
    for key in frequency:
        if frequency[key] == maxFrequency:
            modes.append(key)

    return modes
As a byproduct of computing the mode, we have also done all of the work necessary
to create a histogram for the values in data. The histogram is simply a vertical bar
chart with the keys on the x-axis and the height of the bars equal to the frequency
of each key. We leave the creation of a histogram as Exercise 8.3.7. As an example,
Figure 8.1 shows a histogram for our temperatures list.
We can use dictionaries for a variety of purposes beyond counting frequencies.
For example, as the name suggests, dictionaries are well suited for handling trans-
lations. The following dictionary associates a meaning with each of three texting
abbreviations.
>>> translations = {'lol': 'laugh out loud', 'u': 'you', 'r': 'are'}
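For example, we can look up a single abbreviation with translations['lol'], or translate a whole message word by word. The short sketch below (not from the text) uses the dictionary's get method so that words with no entry are left unchanged:

message = 'lol u r funny'
translated = []
for word in message.split():
    translated.append(translations.get(word, word))   # fall back to the original word
print(' '.join(translated))                            # laugh out loud you are funny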
Exercises
8.3.1. Write a function
printFrequencies(frequency)
that prints a formatted table of frequencies stored in the dictionary named
frequency. The key values in the table must be listed in sorted order. For example,
for the dictionary in the text, the table should look something like this:
Key Frequency
18.9 2
19.0 1
19.1 1
19.2 1
19.3 2
(Hint: iterate over a sorted list of keys.)
8.3.2. Write a function
wordFrequency(text)
that returns a dictionary containing the frequency of each word in the string
text. For example, wordFrequency(’I am I.’) should return the dictionary
{’I’: 2, ’am’: 1}. (Hint: the split and strip string methods will be useful;
see Appendix B.)
8.3.3. The probability mass function (PMF) of a data set gives the probability of each
value in the set. A dictionary representing the PMF is a frequency dictionary
with each frequency value divided by the total number of values in the original
data set. For example, the probabilities for the values represented in the table in
Exercise 8.3.1 are shown below.
Key Probability
18.9 2/7
19.0 1/7
19.1 1/7
19.2 1/7
19.3 2/7
Write a function
pmf(frequency)
that returns a dictionary containing the PMF of the frequency dictionary passed
as a parameter.
8.3.4. Write a function
wordFrequencies(fileName)
that prints an alphabetized table of word frequencies in the text file with the given
fileName.
8.3.5. Write a function
firstLetterCount(words)
that takes as a parameter a list of strings named words and returns a dictio-
nary with lower case letters as keys and the number of words in words that
begin with that letter (lower or upper case) as values. For example, if the list
is [’ant’, ’bee’, ’armadillo’, ’dog’, ’cat’], then your function should
return the dictionary {’a’: 2, ’b’: 1, ’c’: 1, ’d’: 1}.
8.3.6. Similar to the Exercise 8.3.5, write a function
firstLetterWords(words)
that takes as a parameter a list of strings named words and returns a
dictionary with lower case letters as keys. But now associate with each
key the list of the words in words that begin with that letter. For ex-
ample, if the list is [’ant’, ’bee’, ’armadillo’, ’dog’, ’cat’], then
your function should return the dictionary {’a’: [’ant’, ’armadillo’],
’b’: [’bee’], ’c’: [’cat’], ’d’: [’dog’]}.
8.3.7. Write a function
histogram(data)
that displays a histogram of the values in the list data using the
bar function from matplotlib. The matplotlib function bar(x, heights,
align = ’center’) draws a vertical bar plot with bars of the given
heights, centered at the x-coordinates in x. The bar function defines
the widths of the bars with respect to the range of values on the x-
axis. Therefore, if frequency is the name of your dictionary, it is best
to pass in range(len(frequency)), instead of list(frequency.keys()),
for x. Then label the bars on the x-axis with the matplotlib function
xticks(range(len(frequency)), list(frequency.keys())).
Their profile is shown below the sequences. The first column of the profile indicates
that there is one sequence with a C in its first position and four sequences with a G
in their first position. The second column of the profile shows that there are two
sequences with A in their second position, two sequences with C in their second
position, and one sequence with G in its second position. And so on. The consensus
sequence for a set of sequences has in each position the most common base in the
profile. The consensus for this list of sequences is shown below the profile.
A profile can be implemented as a list of 4-element dictionaries, one for each
column. A consensus sequence can then be constructed by finding the base with
the maximum frequency in each position. In this exercise, you will build up a
function to find a consensus sequence in four parts.
As you can see, CSV files contain one row of text per line, with columns separated
by commas. The first row contains the names of the fifteen columns in this file, only
six of which are shown here. Each additional row consists of fifteen columns of data
from one earthquake. The first earthquake in this file was first detected at 20:01
UTC (Coordinated Universal Time) on 2013-09-24 at 40.1333 degrees latitude and
−123.863 degrees longitude. The earthquake occurred at a depth of 29 km and had
magnitude 1.8.
The CSV file containing data about all of the earthquakes in the past month is
available on the web at the URL
http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv
(If you have trouble with this file, you can try smaller files by replacing all_month
with 2.5_month or 4.5_month. The numbers indicate the minimum magnitude of
the earthquakes included in the file.)
Reflection 8.15 Enter the URL above into a web browser to see the data file for
yourself. About how many earthquakes were recorded in the last month?
We can read the contents of CSV files in Python using the same techniques that we
used in Section 6.2. We can either download and save this file manually (and then
read the file from our program), or we can download it directly from our program
using the urllib.request module. We will use the latter method.
To begin our function, we will open the URL, and then read the header row
containing the column names, as follows. (We do not actually need the header row;
we just need to get past it to get to the data.)
def plotQuakes():
    """Plot the locations of all earthquakes in the past month.

    Parameters: None
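A minimal sketch of the code that opens the URL and reads past the header row (assuming the quakeFile name used in the loop below, and the all_month.csv URL given earlier) might look like this:

import urllib.request

URL = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv'

quakeFile = urllib.request.urlopen(URL)   # open the CSV file over the web
header = quakeFile.readline()             # read the header row and ignore it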
To visualize fault boundaries with matplotlib, we will need all the longitude (x)
values in one list and all the latitude (y) values in another list. In our plot, we
will color the points according to their depths, so we will also need to extract the
depths of the earthquakes in a third list. To maintain an association between the
latitude, longitude, and depth of a particular earthquake, we will need these lists to
be parallel in the sense that, at any particular index, the longitude, latitude, and
depth at that index belong to the same earthquake. In other words, if we name
these three lists longitudes, latitudes, and depths, then for any value of index,
longitudes[index], latitudes[index], and depths[index] belong to the same
earthquake.
In our function, we next initialize these three lists to be empty and begin to
iterate over the lines in the file:
longitudes = []
latitudes = []
depths = []
for line in quakeFile:
    line = line.decode('utf-8')
To extract the necessary information from each line of the file, we can use
the split method from the string class. The split method splits a string
at every instance of a given character, and returns a list of strings that re-
sult. For example, 'this;is;a;line'.split(';') returns the list of strings
['this', 'is', 'a', 'line']. (If no argument is given, it splits at strings of
whitespace characters.) In this case, we want to split the string line at every comma:
row = line.split(',')
Figure 8.2: Plotted earthquake locations with colors representing depths (yellow dots are shallower, red dots are medium, and blue dots are deeper).
The resulting list named row will have the time of the earthquake at index 0, the
latitude at index 1, the longitude at index 2 and the depth at index 3, followed by
11 additional columns that we will not use. Note that each of these values is a string,
so we will need to convert each one to a number using the float function. After
converting each value, we append it to its respective list:
latitudes.append(float(row[1]))
longitudes.append(float(row[2]))
depths.append(float(row[3]))
quakeFile.close()
Once we have the data from the file in this format, we can plot the earthquake
epicenters, and depict the depth of each earthquake with a color. Shallow (less than
10 km deep) earthquakes will be yellow, medium depth (between 10 and 50 km)
earthquakes will be red, and deep earthquakes (greater than 50 km deep) will be
blue. In matplotlib, we can color each point differently by passing the scatter
function a list of colors, one for each point. To create this list, we iterate over the
depths list and, for each depth, append the appropriate color string to another list
named colors:
colors = []
for depth in depths:
    if depth < 10:
        colors.append('yellow')
    elif depth < 50:
        colors.append('red')
    else:
        colors.append('blue')
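With the colors list in hand, the epicenters can be plotted. A plausible sketch of the scatter call described next (the marker size of 10 is an assumption) is:

import matplotlib.pyplot as pyplot

pyplot.scatter(longitudes, latitudes, 10, color = colors)   # one dot per earthquake epicenter
pyplot.show()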
In the call to scatter above, the third argument is the square of the size of the
point marker, and color = colors is a keyword argument. (We also saw keyword
arguments briefly in Section 4.1.) The name color is the name of a parameter of the
scatter function, for which we are passing the argument colors. (We will only use
keyword arguments with matplotlib functions, although we could also use them in
our functions if we wished to do so.)
The complete plotQuakes function is shown below.
def plotQuakes():
    """Plot the locations of all earthquakes in the past month.

    Parameters: None
    """

    longitudes = []
    latitudes = []
    depths = []
    for line in quakeFile:
        line = line.decode('utf-8')
        row = line.split(',')
        latitudes.append(float(row[1]))
        longitudes.append(float(row[2]))
        depths.append(float(row[3]))
    quakeFile.close()

    colors = []
    for depth in depths:
        if depth < 10:
            colors.append('yellow')
        elif depth < 50:
            colors.append('red')
        else:
            colors.append('blue')
The plotted earthquakes are shown in Figure 8.2 over a map background. (Your
plot will not show the map, but if you would like to add it, look into the Basemap
class from the mpl_toolkits.basemap module.) Geologists use illustrations like
Figure 8.2 to infer the boundaries between tectonic plates.
Reflection 8.16 Look at Figure 8.2 and try to identify the different tectonic plates.
Can you infer anything about the way neighboring plates interact from the depth
colors?
For example, the ring of red and yellow dots around Africa encloses the African
plate, and the dense line of blue and red dots northwest of Australia delineates the
boundary between the Eurasian and Australian plates. The depth information gives
geologists information about the types of the boundaries and the directions in which
the plates are moving. For example, the shallow earthquakes on the west coast of
North America mark a divergent boundary in which the plates are moving away
from each other, while the deeper earthquakes in the Aleutian islands near Alaska
mark a subduction zone on a convergent boundary where the Pacific plate to the
south is diving underneath the North American plate to the north.
Exercises
8.4.1. Modify plotQuakes so that it also reads earthquake magnitudes into a list, and
then draws larger circles for higher magnitude earthquakes. The sizes of the circles
can be changed by passing a list of sizes, similar to the list of colors, as the third
argument to the scatter function.
8.4.2. Modify the function from Exercise 8.3.5 so that it takes a file name as a parameter
and uses the words from this file instead. Test your function using the SCRABBLE®
dictionary on the book web site.1
8.4.3. Modify the function from Exercise 8.3.13 so that it takes a file name as a parameter
and creates a username/password dictionary with the usernames and passwords in
that file before it starts prompting for a username and password. Assume that the
file contains one username and password per line, separated by a space. There is
an example file on the book web site.
8.4.4. Write a function
plotPopulation()
that plots the world population over time from the tab-separated data file on the
book web site named worldpopulation.txt. To read a tab-separated file, split
each line with line.split(’\t’). These figures are U.S. Census Bureau midyear
population estimates from 1950–2050. Your function should create two plots. The
first shows the years on the x-axis and the populations on the y-axis. The second
shows the years on the x-axis and the annual growth rate on the y-axis. The growth
rate is the difference between this year’s population and last year’s population,
divided by last year’s population. Be sure to label your axes in both plots with the
xlabel and ylabel functions.
What is the overall trend in world population growth? Do you have any hypotheses
regarding the most significant spikes and dips?
8.4.5. Write a function
plotMeteorites()
that plots the location (longitude and latitude) of every known meteorite
that has fallen to earth, using a tab-separated data file from the book web
site named meteoritessize.txt. Split each line of a tab-separated file with
line.split(’\t’). There are large areas where no meteorites have apparently
fallen. Is this accurate? Why do you think no meteorites show up in these areas?
8.4.6. Write a function
plotTemps()
that reads a CSV data file from the book web site named madison_temp.csv
to plot several years of monthly minimum temperature readings from Madison,
Wisconsin. The temperature readings in the file are integers in tenths of a degree
Celsius and each date is expressed in YYYYMMDD format. Rather than putting
every date in a list for the x-axis, just make a list of the years that are represented
in the file. Then plot the data and put a year label at each January tick with
pyplot.plot(range(len(minTemps)), minTemps)
pyplot.xticks(range(0, len(minTemps), 12), years)
The first argument to the xticks function says to only put a tick at every twelfth
x value, and the second argument supplies the list of years to use to label those
ticks. It will be helpful to know that the data file starts with a January 1 reading.
1 SCRABBLE® is a registered trademark of Hasbro Inc.
from your volunteers. Before you can submit your petition to the governor, you need
to remove duplicate signatures from the combined list. Rather than try to do this
by hand, you decide to scan all of the names into a file, and design an algorithm to
remove the duplicates for you. Because the signatures are numbered, you want the
final list of unique names to be in their original order.
As we design an algorithm for this problem, we will keep in mind the four steps
outlined in Section 7.1:
1. Understand the problem.
2. Design an algorithm.
Reflection 8.17 Before you read any further, make sure you understand this prob-
lem. If you were given this list of names on paper, what algorithm would you use to
remove the duplicates?
The input to our problem is a list of items, and the output is a new list of unique
items in the same order they appeared in the original list. We will start with an
intuitive algorithm and work through a process of refinement to get progressively
better solutions. In this process, we will see how a critical look at the algorithms we
write can lead to significant improvements.
A first algorithm
There are several different approaches we could use to solve this problem. We will
start with an algorithm that iterates over the items and, for each one, marks any
duplicates found further down the list. Once all of the duplicates are marked, we
can remove them from a copy of the original list. The following example illustrates
this approach with a list containing four unique names, abbreviated A, B, C, and D.
The algorithm starts at the beginning of the list, which contains the name A. Then
we search down the list and record the index of the duplicate, marked in red.
A  B  C  B  D  A  D  B
↑  └──────── search for A ────────┘
Some items, like the next item, C, do not have any duplicates.
A  B  C  B  D  A  D  B
      ↑  └──── search for C ────┘
The next item, D, is not marked so we search for its duplicates down the list.
A  B  C  B  D  A  D  B
            ↑  └── search for D ──┘
Finally, we finish iterating over the list, but find that the remaining items are already
marked as duplicates.
Once we know where all the duplicates are, we can make a copy of the original
list and remove the duplicates from the copy. This algorithm, partially written in
Python, is shown below. We keep track of the “marked” items with a list of their
indices. The portions in red need to be replaced with calls to appropriate functions,
which we will develop soon.
 1  def removeDuplicates1(data):
 2      """Return a list containing only the unique items in data.
 3
 4      Parameter:
 5          data: a list
 6
 7      Return value: a new list of the unique values in data,
 8          in their original order
 9      """
10
11      duplicateIndices = [ ]    # indices of duplicate items
12      for index in range(len(data)):
13          if index is not in duplicateIndices:
14              positions = indices of later duplicates of data[index]
15              duplicateIndices.extend(positions)
16
17      unique = data.copy()
18      for index in duplicateIndices:
19          remove data[index] from unique
20
21      return unique
To implement the red portion of the if statement on line 13, we need to search
for index in the list duplicateIndices. We could use the Python in operator
(if index in duplicateIndices:), but let’s instead revisit how to write a search
function from scratch. Recall that, in Section 6.5, we developed a linear search
algorithm (named find) that returns the index of the first occurrence of a substring
in a string. We can use this algorithm as a starting point for an algorithm to find
an item in a list.
Reflection 8.18 Look back at the find function on Page 283. Modify the function
so that it returns the index of the first occurrence of a target item in a list.
The linear search function, based on the find function but applied to a list, follows.
Parameters:
data: a list object to search in
target: an object to search for
The function iterates over the indices of the list named data, checking whether
each item equals the target value. If a match is found, the function returns the
index of the match. Otherwise, if no match is ever found, the function returns −1.
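Based on that description, a sketch of the linearSearch function might look like this:

def linearSearch(data, target):
    """Return the index of the first occurrence of target in data, or -1 if absent."""
    for index in range(len(data)):
        if data[index] == target:    # found a match at this index
            return index
    return -1                        # no match was ever found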
We can call this function in the if statement on line 13 of removeDuplicates1 to
determine whether the current value of index is already in duplicateIndices.
We want the condition in the if statement to be true if index is not in the list, i.e.,
if linearSearch returns −1. So the if statement should look like:
if linearSearch(duplicateIndices, index) == -1:
In the red portion on line 14, we need to find all of the indices of items equal to
data[index] that occur later in the list. A function to do this will be very similar to
linearSearch, but differ in two ways. First, it must return a list of indices instead
of a single index. Second, it will require a third parameter that specifies where the
search should begin.
Reflection 8.20 Write a new version of linearSearch that differs in the two ways
outlined above.
def linearSearchAll(data, target, start):
    """Find the indices of all occurrences of target in data, at or after index start.

    Parameters:
        data: a list object to search in
        target: an object to search for
        start: the index in data to start searching from

    Return value: a list of the indices of all occurrences of target in data
    """

    found = [ ]
    for index in range(start, len(data)):
        if data[index] == target:
            found.append(index)
    return found
With these two new functions, we can fill in the first two missing parts of our
removeDuplicates1 function:
def removeDuplicates1(data):
    """ (docstring omitted) """

    unique = data.copy()
    for index in duplicateIndices:
        remove data[index] from unique

    return unique
Once we have the list of duplicates, we need to remove them from the list. The
algorithm above suggests using the pop method to do this, since pop takes an index
as a parameter:
unique.pop(index)
However, this is likely to cause a problem. To see why, suppose we have the list
[1, 2, 3, 2, 3] and we want to remove the duplicates at indices 3 and 4:
>>> data = [1, 2, 3, 2, 3]
>>> data.pop(3)
>>> data.pop(4)
The problem is that, after we delete the item at index 3, the list looks like
[1, 2, 3, 3], so the next item we want to delete is at index 3 instead of in-
dex 4. In fact, there is no index 4!
An alternative approach would be to use a list accumulator to build the unique
list up from an empty list. To do this, we will need to iterate over the original list
and append items to unique if their indices are not in duplicateIndices. The
following function replaces the last loop with a new list accumulator loop that uses
this approach instead.
def removeDuplicates1(data):
    """ (docstring omitted) """

    unique = [ ]
    for index in range(len(data)):
        if linearSearch(duplicateIndices, index) == -1:
            unique.append(data[index])
    return unique
We now finally have a working function! (Try it out.) However, this does not mean
that we should immediately leave this problem behind. It is important that we
take some time to critically reflect on our solution. Is it correct? (If you covered
Section 7.3, this would be a good time to develop some unit tests.) Can the function
be simplified or made more efficient? What is the algorithm’s time complexity?
Reflection 8.22 The function above can be simplified a bit. Look at the similarity
between the two for loops. Can they be combined?
The two for loops can, in fact, be combined. If the condition in the first if
statement is true, then this must be the first time we have seen this particular
list item. Therefore, we can append it to the unique list right then. The resulting
function is a bit more streamlined:
 1  def removeDuplicates1(data):
 2      """ (docstring omitted) """
 3
 4      duplicateIndices = [ ]
 5      unique = [ ]
 6      for index in range(len(data)):
 7          if linearSearch(duplicateIndices, index) == -1:
 8              positions = linearSearchAll(data, data[index], index + 1)
 9              duplicateIndices.extend(positions)
10              unique.append(data[index])
11      return unique
Now let’s analyze the asymptotic time complexity of this algorithm in the worst
case. The input to the algorithm is the parameter data. As usual, we will call the
length of this list n (i.e., len(data) is n). The statements on lines 4, 5, and 11 each
count as one elementary step. The rest of the algorithm is contained in the for loop,
which iterates n times. So the if statement on line 7 is executed n times and, in
the worst case, the statements on lines 8–10 are each executed n times as well. But
how many elementary steps are hidden in each of these statements?
Reflection 8.23 How many elementary steps are required in the worst case by the
call to linearSearch on line 7? (You might want to refer back to Page 283, where
we talked about the time complexity of a linear search.)
We saw in Section 6.5 that linear search is a linear-time algorithm (hence the name).
So the number of elementary steps required by line 7 is proportional to the length
of the list that is passed in as a parameter. The length of the list that we pass into
linearSearch in this case, duplicateIndices, will be zero in the first iteration of
the loop, but may contain as many as n − 1 indices in the second iteration. So the
total number of elementary steps in each of these later iterations is at most n − 1,
which is just n asymptotically.
Reflection 8.24 Why could duplicateIndices have length n − 1 after the first
iteration of the loop? What value of data would make this happen?
This is actually a tricky question to answer, as explained in Box 8.3. In a nutshell, the
average time required over a sequence of append calls is constant (or close enough
to it), so it is safe to characterize an append as an elementary step. So it is safe
to say that a call to linearSearchAll requires about n − start elementary steps.
When the linearSearchAll function is called on line 8, the value of index + 1 is
passed in for start. So when index has the value 0, the linearSearchAll function
requires about n − start = n − (index + 1) = n − 1 elementary steps. When index
has the value 1, it requires about n − (index + 1) = n − 2 elementary steps. And so
on. So the total number of elementary steps in line 8 is
(n − 1) + (n − 2) + ⋯ + 1.
We have seen this sum before (see Box 4.2); it is equal to the triangular number
n(n − 1)/2 = n²/2 − n/2,
which is the same as n² asymptotically.
Finally, lines 9 and 10 involve a total of n elementary steps asymptotically. This
is easier to see with line 10 because it involves at most one append per iteration. The
extend method called on line 9 effectively appends all of the values in positions
to the end of duplicateIndices. Although one call to extend may append more
than one index to duplicateIndices at a time, overall, each index of data can
be appended at most once, so the total number of elementary steps over all of the
calls to extend must be proportional to n. In summary, we have determined the
following numbers of elementary steps for each line.
Line    Elementary steps
  4     1
  5     1
  6     n
  7     n²
  8     n²
  9     n
 10     n
 11     1
Since the maximum number of elementary steps for any line is n², this is the asymptotic time complexity of the algorithm. Algorithms with asymptotic time complexity n² are called quadratic-time algorithms.
How do quadratic-time algorithms compare to the linear-time algorithms we
have encountered in previous chapters? The answer is communicated graphically by
Figure 8.3: quadratic-time algorithms are a lot slower than linear-time algorithms.
However, they still tend to finish in a reasonable amount of time, unless n is extremely
large.
Another visual way to understand the difference between a linear-time algorithm
and a quadratic-time algorithm is shown in Figure 8.4. Suppose each square rep-
resents one elementary step. On the left is a representation of the work required
in a linear-time algorithm. If the size of the input to the linear-time algorithm
increases from n − 1 to n (n = 7 in the pictures), then the algorithm must execute one
additional elementary step (in gray). On the right is a representation of the work
involved in a quadratic-time algorithm. If the size of the input to the quadratic-time
algorithm increases from n − 1 to n, then the algorithm gains 2n − 1 additional steps (since n² − (n − 1)² = 2n − 1).
So we can see that the amount of work that a quadratic-time algorithm must do
grows much more quickly than the work required by a linear-time algorithm!
Reflection 8.26 In our current algorithm, how else could we tell if the current
item, data[index], is a duplicate, without referring to duplicateIndices?
Since we are now constructing the list of unique items in the main for loop, we
could decide whether the current item (data[index]) is a duplicate by searching
for it in unique, instead of searching for index in duplicateIndices. This change
eliminates the need for the duplicateIndices list altogether and greatly simplifies
the algorithm, as illustrated below:
def removeDuplicates2(data):
    """ (docstring omitted) """

    duplicateIndices = [ ]
    unique = [ ]
    for index in range(len(data)):
        if linearSearch(unique, data[index]) == -1:
            positions = linearSearchAll(data, data[index], index + 1)
            duplicateIndices.extend(positions)
            unique.append(data[index])
    return unique
In addition, since we are not storing indices any longer, we can iterate over the
items in the list instead of the indices, giving a much cleaner look to the algorithm:
1  def removeDuplicates2(data):
2      """ (docstring omitted) """
3
4      unique = [ ]
5      for item in data:
6          if linearSearch(unique, item) == -1:
7              unique.append(item)
8      return unique
Reflection 8.27 This revised algorithm is certainly more elegant. But is it more
efficient?
To answer this question, let’s revisit our time complexity analysis. The for loop
still iterates n times and, in the worst case, both the call to linearSearch and
the append are still executed in every iteration of the loop. We saw above that
the number of elementary steps executed by the linearSearch function depends
on the length of the list that is passed in. In this case, unique can be no longer
than the number of previous iterations, since at most one item is appended to it in
each iteration. So the length of unique can grow by at most one in each iteration,
meaning that the number of elementary steps executed by linearSearch can also
grow by at most one in each iteration and the total number of elementary steps
executed by all of the calls to linearSearch is at most
1 + 2 + 3 + ⋯ + (n − 1) = n(n − 1)/2

or, once again, n² asymptotically. In summary, the numbers of elementary steps for
each line are now:
Line   Elementary steps
  4    1
  5    n
  6    n²
  7    n
  8    1
Like our previous algorithm, the maximum value in the table is n², so our new
algorithm is also a quadratic-time algorithm.
To find all of the duplicates in a list, it seems obvious that we need to look at
every one of the n items in the list (using a for loop or some other kind of loop).
Therefore, the time complexity of any algorithm for this problem must be at least
n, or linear-time. But is a linear-time algorithm actually possible? Apart from the
loop, the only significant component of the algorithm remaining is the linear search
in line 6. Can the efficiency of this step be improved from linear time to something
better?
Searching for data efficiently, as we do in a linear search, is a fundamental topic
that is covered in depth in later computer science courses. There are a wide variety
of innovative ways to store data to facilitate fast access, most of which are beyond
the scope of an introductory book. However, we have already seen one alternative.
Recall from Box 8.2 that a dictionary is cleverly implemented so that access can be
considered a constant-time operation. So if we can store information about duplicates
in a dictionary, we should be able to perform the search in line 6 in constant time!
The trick is to store the items that we have already seen as keys in a dictionary.
In particular, when we append a value of item to the list of unique items, we also
make item a new key in the dictionary. Then we can test whether we want to
append each new value of item by checking whether item is already a key in seen.
The new function with these changes follows.
def removeDuplicates3(data):
    """ (docstring omitted) """

    seen = { }
    unique = [ ]
    for item in data:
        if item not in seen:
            unique.append(item)
            seen[item] = True
    return unique
In our new function, we associate the value True with each key, but this is an
arbitrary choice because we never actually use these values.
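As a quick check (a hypothetical example, not from the text), the new function produces the same result as the earlier versions for a small list:

data = [1, 2, 3, 1, 3, 1]
print(removeDuplicates3(data))     # prints [1, 2, 3]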
Since every statement in the body of the for loop is now one elementary step, the
removeDuplicates3 function is a linear-time algorithm. As we saw in Figure 8.3, this
makes a significant difference! Exercise 8.5.2 asks you to investigate this difference
for yourself.
Exercises
8.5.1. Show how to modify each of the three functions we wrote so that they each instead
return a list of only those items in data that have duplicates. For example, if data
were [1, 2, 3, 1, 3, 1], the function should return the list [1, 3].
8.5.2. In this exercise, you will write a program that tests whether the linear-time
removeDuplicates3 function really is faster than the first two versions that we
wrote. First, write a function that creates a list of n random integers between
0 and n − 1 using a list accumulator and the random.randrange function. Then
write another function that calls each of the three functions with such a list as
the argument. Time how long each call takes using the time.time function, which
returns the current time in elapsed seconds since a fixed “epoch” time (usually
midnight on January 1, 1970). By calling time.time before and after each call,
you can find the number of seconds that elapsed.
Repeat your experiment with n = 100, 1000, 10, 000, and 100, 000 (this will take a
long time). Describe your results.
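A minimal sketch of the timing pattern described above, assuming data is one of the random lists you create (the rest of the experiment is left to you):

import time

start = time.time()                    # seconds since the epoch
result = removeDuplicates3(data)       # the call being timed
elapsed = time.time() - start
print('removeDuplicates3 took', elapsed, 'seconds')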
8.5.3. In a round-robin tournament, every player faces every other player exactly once,
and the player with the most head-to-head wins is deemed the champion. The
following partially written algorithm simulates a round-robin tournament. Assume
that each of the steps expressed as a comment is one elementary step.
def roundRobin(players):
    # initialize all players' win counts to zero
    for player1 in players:
        for player2 in players:
            if player2 != player1:
                # player1 challenges player2
                # increment the win count of the winner
    # return the player with the most wins (or tie)
8.5.4. The figure below illustrates the contests generated by the roundRobin algorithm in the previous exercise for four players: Amelia, Beth, Caroline, and David.
[Figure: (a) all contests generated by the nested loop, with the redundant contests shown in red; (b) the contests that remain once the redundant ones are removed.]
A line between two players represents the player on the left challenging the player
on the right. Notice, for example, that Amelia challenges Caroline and Caroline
challenges Amelia. Obviously, these are redundant, so we would like to avoid them.
The red lines represent all of the unnecessary contests. The remaining contests
that we need to consider are illustrated on the right side above (b).
(a) Let’s think about how we can design an algorithm to create just the contests
we need. Imagine that we are iterating from top to bottom over the players
list on the left side of (b) above. Notice that each value of player1 on the
left only challenges values of player2 on the right that come after it in the
list. First, Amelia needs to challenge Beth, Caroline, and David. Then Beth
only needs to challenge Caroline and David because Amelia challenged her
in the previous round. Then Caroline only needs to challenge David because
both Amelia and Beth already challenged her in previous rounds. Finally,
when we get to David, everyone has already challenged him, so nothing more
needs to be done.
Modify the nested for loop in the previous exercise so that it implements
this algorithm instead.
(b) What is the asymptotic time complexity of your algorithm?
8.5.5. Suppose you have a list of projected daily stock prices, and you wish to find the
best days to buy and sell the stock to maximize your profit. For example, if the list
of daily stock prices was [3, 2, 1, 5, 3, 9, 2], you would want to buy on day
2 and sell on day 5, for a profit of $8 per share. Similar to the previous exercise,
you need to check all possible pairs of days, such that the sell day is after the buy
day. Write a function
profit(prices)
that returns a tuple containing the most profitable buy and sell days for the given
list of prices. For example, for the list of daily prices above, your function should
return (2, 5).
8.6 LINEAR REGRESSION
We can also think of each row in the table, which represents one student, as an
(x, y) point where x is the value of the independent variable and y is the value of
the dependent variable.
The most common type of regression analysis is called linear regression. A linear
regression finds a straight line that most closely approximates the relationship
between the two variables. The most commonly used linear regression technique,
called the least squares method, finds the line that minimizes the sum of the squares
of the vertical distances between the line and our data points. This is represented
graphically below.
The red line segments in the figure represent the vertical distances between the
points and the dashed line. This dashed line represents the line that results in the
minimum total squared vertical distance for these points.
Mathematically, we are trying to find the line y = mx + b (with slope m and
y-intercept b) for which
∑(y − (mx + b))²
is the minimum. The x and y in this notation represent any one of the points (x, y)
in our data set; (y − (mx + b)) represents the vertical distance between the height of
(x, y) (given by y) and the height of the line at (x, y) (given by mx + b). The upper
case Greek letter sigma (Σ) means that we are taking the sum over all such points
(x, y) in our data set.
To find the least squares line for a data set, we could test all of the possible lines,
and choose the one with the minimum total squared distance. However, since there
are an infinite number of such lines, this “brute force” approach would take a very
long time. Fortunately, the least squares line can be found exactly using calculus.
The slope m of this line is given by
m = (n ⋅ ∑(xy) − ∑ x ⋅ ∑ y) / (n ⋅ ∑(x²) − (∑ x)²)

and the y-intercept b is given by

b = (∑ y − m ∑ x) / n.
Although the notation in these formulas may look imposing, the quantities are really
quite simple:
• n is the number of points (x, y) in the data set
• ∑ x is the sum of the x coordinates of all of the points (x, y)
• ∑ y is the sum of the y coordinates of all of the points (x, y)
• ∑(xy) is the sum of the products of the x and y coordinates of each point (x, y)
• ∑(x²) is the sum of the squares of the x coordinates of all of the points (x, y)
For example, suppose we had only three points: (5, 4), (3, 2), and (8, 3). Then
• ∑ x = 5 + 3 + 8 = 16
• ∑y = 4 + 2 + 3 = 9
• ∑(xy) = (5 ⋅ 4) + (3 ⋅ 2) + (8 ⋅ 3) = 20 + 6 + 24 = 50
• ∑(x²) = 5² + 3² + 8² = 25 + 9 + 64 = 98
Therefore,
m = (n ⋅ ∑(xy) − ∑ x ⋅ ∑ y) / (n ⋅ ∑(x²) − (∑ x)²) = (3 ⋅ 50 − 16 ⋅ 9) / (3 ⋅ 98 − 16²) = 6/38 = 3/19
and
b = (∑ y − m ∑ x) / n = (9 − (3/19) ⋅ 16) / 3 = 41/19.
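To make the arithmetic concrete, the following short sketch (not from the text) computes the four sums for the three example points and plugs them into the formulas:

x = [5, 3, 8]
y = [4, 2, 3]
n = len(x)
sumX = sumY = sumXY = sumX2 = 0
for i in range(n):
    sumX = sumX + x[i]
    sumY = sumY + y[i]
    sumXY = sumXY + x[i] * y[i]
    sumX2 = sumX2 + x[i] * x[i]
m = (n * sumXY - sumX * sumY) / (n * sumX2 - sumX ** 2)
b = (sumY - m * sumX) / n
print(m, b)      # prints approximately 0.1579 (3/19) and 2.1579 (41/19)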
Plugging in these values, we find that the formula for the least squares line is
y = (3/19)x + 41/19,
which is plotted below.
Figure 8.5 Scatter plot showing high school GPA and corresponding cumulative college GPA with regression line.
Once all of the sums are computed, m and b can be computed and returned with
return m, b
The following function uses matplotlib to plot a set of points along with their least squares regression line.

def plotRegression(x, y, xLabel, yLabel):
    """Plot points and their least squares regression line.

    Parameters:
        x: a list of x values (independent variable)
        y: a list of y values (dependent variable)
        xLabel: a string to label the x axis
        yLabel: a string to label the y axis

    Return value: None
    """

    pyplot.scatter(x, y)
    m, b = linearRegression(x, y)
    minX = min(x)
    maxX = max(x)
    pyplot.plot([minX, maxX], [m * minX + b, m * maxX + b], color = 'red')
    pyplot.xlabel(xLabel)
    pyplot.ylabel(yLabel)
    pyplot.show()
Returning to our college admissions problem, suppose the high school GPA
values are in a list named hsGPA and the college GPA values are in a list named
collegeGPA. Then we can get our regression line by calling
plotRegression(hsGPA, collegeGPA, 'HS GPA', 'College GPA')
Reflection 8.29 What can you discern from this plot? Does high school GPA do a
good job of predicting college GPA?
In the exercises below, and in Project 8.4, you will have the opportunity to investigate
this problem in more detail. Projects 8.3 and 8.5 also use linear regression to
approximate the demand curve for an economics problem and predict flood levels
on the Snake River, respectively.
Exercises
8.6.1. Complete the function
linearRegression(x, y)
The function should return the slope and y-intercept of the least squares regression
line for the points whose x and y coordinates are stored in the lists x and y,
respectively. (Your function should use only one loop.)
8.6.2. The table below lists the homework and exam scores for a class (one row per
student). Write a program that uses the completed linearRegression function
from the previous exercise and the plotRegression function above to plot a linear
regression line for this data.
HW Exam
63 73
91 99
81 98
67 82
100 97
87 99
91 96
74 77
26 33
100 98
78 100
59 81
85 38
69 74
8.6.3. On the book web site, you will find a CSV data file named sat.csv that contains
GPA and SAT data for 105 students. Write a function
readData(filename)
that reads the data from this file and returns a tuple of two lists containing the
data in the first and fourth columns of the file (high school GPAs and college
GPAs). Then use the plotRegression function (which will call your completed
linearRegression function from the previous exercise) to plot this data with a
linear regression line to determine whether there is a correlation between high
school GPA and college GPA. (Your plot should look like Figure 8.5.)
8.6.4. A standard way to measure how well a regression line fits a set of data is to
compute the coefficient of determination, often called the R² coefficient, of the
regression line. R² is defined to be

R² = 1 − S/T

where S and T are defined as follows:

• S = ∑(y − (mx + b))², the total squared vertical distance between the points and the regression line
• T = ∑(y − ȳ)², where ȳ is the mean of the y coordinates of all of the points
For example, consider again the three points from the text: (5, 4), (3, 2), and (8, 3).
We saw that these points have regression line

y = (3/19)x + 41/19.

(So m = 3/19 and b = 41/19.) Then
• ȳ = (4 + 2 + 3)/3 = 3
• T = ∑(y − ȳ)² = (4 − 3)² + (2 − 3)² + (3 − 3)² = 2
8.6.5. An alternative linear regression method, called a Deming regression finds the line
that minimizes the squares of the perpendicular distances between the points and
the line rather than the vertical distances. While the traditional least squares
method accounts only for errors in the y values, this technique accounts for errors
in both x and y. The slope and y-intercept of the line found by this method are

m = (syy − sxx + √((syy − sxx)² + 4(sxy)²)) / (2sxy)
and
b = ȳ − mx̄
where
• x̄ = (1/n) ∑ x, the mean of the x coordinates of all of the points (x, y)
• ȳ = (1/n) ∑ y, the mean of the y coordinates of all of the points (x, y)
• sxx = (1/(n − 1)) ∑(x − x̄)²
• syy = (1/(n − 1)) ∑(y − ȳ)²
• sxy = (1/(n − 1)) ∑(x − x̄)(y − ȳ)
Write a function
linearRegressionDeming(x, y)
that computes these values of m and b. Test your function by using it in the
plotRegression function above.
8.7 DATA CLUSTERING
If we know which patients in a data set of biopsy test results have been diagnosed with a malignant tumor, then we can test whether clusters of similar
test results correspond uniformly to the same diagnosis. If they do, the clustering
is evidence that the tests can be used to test for malignancy, and the clusters give
insights into the range of results that correspond to that diagnosis. Because data
clustering can result in deep insights like this, it is another fundamental technique
used in data mining.
Algorithms that cluster items according to their similarity are easiest to under-
stand initially with two-dimensional data (e.g., longitude and latitude) that can
be represented visually. Therefore, before you tackle the tumor diagnosis problem
(as an exercise), we will look at a data set containing the geographic locations of
vehicular collisions in New York City.3
The clustering problem turns out to be very difficult, and there are no known
efficient algorithms that solve it exactly. Instead, people use heuristics. A heuristic
is an algorithm that does not necessarily give the best answer, but tends to work
well in practice. Colloquially, you might think of a heuristic as a “good rule of
thumb.” We will discuss a common clustering heuristic known as k-means clustering.
In k-means clustering, the data is partitioned into k clusters, and each cluster has
an associated centroid , defined to be the mean of the points in the cluster. The
k-means clustering heuristic consists of a number of iterations in which each point
is (re)assigned to the cluster of the closest centroid, and then the centroids of each
cluster are recomputed based on their potentially new membership.
Defining similarity
Similarity among items in a data set can be defined in a variety of ways; here we
will define similarity simply in terms of normal Euclidean distance. If each item in
our data set has m numerical attributes, then we can think of these attributes as a
point in m-dimensional space. Given two m-dimensional points p and q, we can find
the distance between them using the formula
distance(p, q) = √((p[0] − q[0])² + (p[1] − q[1])² + ⋯ + (p[m − 1] − q[m − 1])²).
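Translated directly into Python, this formula might look like the following sketch, using math.sqrt (this is essentially the distance function that Exercise 8.7.1 asks you to write):

import math

def distance(p, q):
    # Return the Euclidean distance between two m-dimensional points,
    # each stored as a tuple of numbers.
    total = 0
    for i in range(len(p)):
        total = total + (p[i] - q[i]) ** 2
    return math.sqrt(total)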
But since we are not going to need to change any element of a point once we read it
from a data file, a better alternative is to store each point in a tuple, as follows:
3 http://nypd.openscrape.com
and to find the distance between tuples p1 and p3, we can compute
√((85 − 86)² + (92 − 54)² + (45 − 33)² + (27 − 16)² + (31 − 54)² + (0 − 0)²) ≈ 47.3.
Because the distance between p1 and p2 is less than the distance between p1 and p3,
we consider the results for patient p1 to be more similar to the results for patient
p2 than to the results for patient p3.
Reflection 8.30 What is the distance between tuples p2 and p3? Which patient’s
results are more similar to each of p2 and p3?
In the first iteration of the heuristic, for every point, we compute the distance
between the point and each of the two centroids, represented by the dashed line
segments. We assign each point to a cluster associated with the centroid to which it
is closest. In this example, the three points circled in red below are closest to the red
centroid and the three points circled in blue below are closest to the blue centroid.
Once every point is assigned to a cluster, we compute a new centroid for each cluster,
defined to be the mean of all the points in the cluster. In our example, the x and y
coordinates of the new centroid of the red cluster are (0.5 + 1.75 + 4)/3 = 25/12 ≈ 2.08
and (5 + 5.75 + 3.5)/3 = 4.75. The x and y coordinates of the new centroid of the
blue cluster are (3.5 + 5 + 7)/3 = 31/6 ≈ 5.17 and (2 + 2 + 1)/3 = 5/3 ≈ 1.67. These are
the new red and blue points shown below.
In the second iteration of the heuristic, we once again compute the closest centroid
for each point. As illustrated below, the point at (4, 3.5) is now grouped with the
lower right points because it is closer to the blue centroid than to the red centroid
(distance((2.08, 4.75), (4, 3.5)) ≈ 2.29 and distance((5.17, 1.67), (4, 3.5)) ≈ 2.17).
Next, we once again compute new centroids. The x and y coordinates of the new
centroid of the red cluster are (0.5 + 1.75)/2 = 1.125 and (5 + 5.75)/2 = 5.375. The x
and y coordinates of the new centroid of the blue cluster are (3.5 + 4 + 5 + 7)/4 = 4.875
and (1 + 2 + 2 + 3.5)/4 = 2.125. These new centroids are shown below.
In the third iteration, when we find the closest centroid for each point, we find that
nothing changes. Therefore, these clusters are the final ones chosen by the heuristic.
The following kmeans function implements this heuristic.

def kmeans(data, k, iterations):
    """Cluster data into k clusters using the k-means heuristic.

    Parameters:
        data: a list of points
        k: the number of desired clusters
        iterations: the number of iterations of the algorithm

    Return value:
        a tuple containing the list of clusters and the list
        of centroids; each cluster is represented by a list
        of indices of the points assigned to that cluster
    """

    n = len(data)
    centroids = random.sample(data, k)        # k initial centroids

    for step in range(iterations):
The function begins by using the sample function from the random module to choose
k random tuples from data to be the initial centroids, assigned to the variable named
centroids. The remainder of the function consists of iterations passes over the
list of points controlled by the for loop above.
Next, within the loop, we first create a list of k empty lists, one for each cluster.
For example, if k was 4 then, after the following three lines, clusters would be
assigned the list [[ ], [ ], [ ], [ ]].
        clusters = []
        for i in range(k):
            clusters.append([])
The first empty list, clusters[0], will be used to hold the indices of the points in
the cluster associated with centroids[0]; the second empty list, clusters[1], will
be used to hold the indices of the points in the cluster associated with centroids[1];
and so on. Some care has to be taken to create this list of lists; clusters = [[]] * k
will not work. Exercise 8.7.6 asks you to investigate this further.
Next, we iterate over the indices of the points in data. In each iteration, we need
to find the centroid that is closest to the point data[dataIndex].
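A sketch of this step, assuming the distance function from Exercise 8.7.1, might look like this:

        for dataIndex in range(n):
            # find the index of the centroid closest to data[dataIndex]
            minIndex = 0
            for clustIndex in range(1, k):
                if distance(data[dataIndex], centroids[clustIndex]) < \
                   distance(data[dataIndex], centroids[minIndex]):
                    minIndex = clustIndex
            clusters[minIndex].append(dataIndex)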
The inner for loop above is essentially the same algorithm that we used in Section 8.1
to find the minimum value in a list. The difference here is that we need to find the
index of the centroid with the minimum distance to data[dataIndex]. (We leave
writing this distance function as Exercise 8.7.1.) Therefore, we maintain the index
of the closest centroid in the variable named minIndex. Once we have found the
final value of minIndex, we append dataIndex to the list of indices assigned to the
cluster with index minIndex.
After we have assigned all of the points to a cluster, we compute each new centroid
with a function named centroid. We leave writing this function as Exercise 8.7.2.
Finally, once the outer for loop finishes, we return a tuple containing the list of
clusters and the list of centroids.
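A sketch of these last steps, assuming the centroid function from Exercise 8.7.2, is:

        for clustIndex in range(k):      # recompute each cluster's centroid
            centroids[clustIndex] = centroid(clusters[clustIndex], data)

    return clusters, centroids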
For our clustering algorithm, we will need a list of accident locations as (longitude,
latitude) tuples. To limit the data in our analysis, we will extract the locations of
only those accidents that occurred in Manhattan (borocode 1) during 2012. The
function to accomplish this is very similar to what we have done previously.
4 The data file contains the rows of the file at http://nypd.openscrape.com/#/collisions.csv.gz (accessed October 1, 2013) that involved a bicyclist.
def readFile(filename):
"""Read locations of 2012 bicycle accidents in Manhattan
into a list of tuples.
Parameter:
filename: the name of the data file
Once we have the data as a list of tuples, we can call the kmeans function to find
the clusters and centroids. If we had enough funding for six bicycle safety programs,
we would call the function with k = 6 as follows:
data = readFile('collisions_cyclists.txt')
(clusters, centroids) = kmeans(data, 6, 100)
To visualize the clusters, we can plot the points with matplotlib, with each cluster
in a different color and centroids represented by stars. The following plotClusters
function does just this. The result is shown in Figure 8.6.
Figure 8.6 Six clusters of bicycle accidents in Manhattan with the centroid for each shown as a star.

def plotClusters(clusters, data, centroids):
    """Plot each cluster of points in its own color, with centroids.

    Parameters:
        clusters: a list of k lists of data indices
        data: a list of points
        centroids: a list of k centroid points

    Return value: None
    """
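    # A possible sketch of the per-cluster plotting step (the color list
    # is an arbitrary choice, not from the text): plot the points of each
    # cluster in its own color.
    colors = ['blue', 'red', 'green', 'orange', 'purple', 'brown']
    for clustIndex in range(len(clusters)):
        x = []
        y = []
        for dataIndex in clusters[clustIndex]:
            x.append(data[dataIndex][0])
            y.append(data[dataIndex][1])
        pyplot.scatter(x, y, 10, color = colors[clustIndex % len(colors)])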
    x = []
    y = []
    for centroid in centroids:            # plot centroids
        x.append(centroid[0])
        y.append(centroid[1])
    pyplot.scatter(x, y, 200, marker = '*', color = 'black')
    pyplot.show()
Reflection 8.31 If you look at the “map” in Figure 8.6, you will notice that the
orientation of Manhattan appears to be more east-west than it is in reality. Do you
know why?
In addition to completing this program, Exercise 8.7.4 asks you to apply k-means
clustering to the tumor diagnosis problem that we discussed at the beginning of the
chapter.
Exercises
8.7.1. Write a function
distance(p, q)
that returns the Euclidean distance between k-dimensional points p and q, each of
which is represented by a tuple with length k.
8.7.2. Write the function
centroid(cluster, data)
that is needed by the kmeans function. The function should compute the centroid
of the given cluster, which is a list of indices of points in data. Remember that
the points in data can be of any length, and the centroid that the function returns
must be a tuple of the same length. It will probably be most convenient to build
up the centroid in a list and then convert it to a tuple using the tuple function.
If the cluster is empty, then return a randomly chosen point from data with
random.choice(data).
8.7.3. Combine the distance and centroid functions that you wrote in the previous
two exercises with the kmeans, readFile, and plotClusters functions to create
a complete program (with a main function) that produces the plot in Figure 8.6.
The data file, collisions_cyclists.txt, is available on the book web site.
8.7.4. In this exercise, you will apply k-means clustering to the tumor diagnosis problem
from the beginning of this section, using a data set of breast cancer biopsy results.
This data set, available on the book web site, contains nine test results for each of
683 individuals.5 The first few lines of the comma-separated data set look like:
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
⋮
The first number in each row is a unique sample number and the last number
in each row is either 2 or 4, where 2 means “benign” and 4 means “malignant.”
5 The data set on the book’s web site is the original file from http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original) with 16 rows containing missing attributes removed.
Of the 683 samples in the data set, 444 were diagnosed as benign and 239 were
diagnosed as malignant. The nine numbers on each line between the sample number
and the diagnosis are test result values that have been normalized to a 1–10 scale.
(There is additional information accompanying the data on the book’s web site.)
The first step is to write a function
readFile(filename)
that reads this data and returns two lists, one of test results and the other of
diagnoses. The test results will be in a list of 9-element tuples and the diagnosis
values will be a list of integers that are in the same order as the test results.
Next, write a program, with the following main function, that clusters the data into
k = 2 groups. If one of these two groups contains predominantly benign samples
and the other contains predominantly malignant samples, then we can conclude
that these test results can effectively discriminate between malignant and benign
tumors.
def main():
    data, diagnosis = readFile('breast-cancer-wisconsin.csv')
    clusters, centroids = kmeans(data, 2, 10)
    for clustIndex in range(2):
        count = {2: 0, 4: 0}
        for index in clusters[clustIndex]:
            count[diagnosis[index]] = count[diagnosis[index]] + 1
        print('Cluster', clustIndex)
        print('  benign:', count[2], 'malignant:', count[4])
To complete the program, you will need to incorporate the distance and centroid
functions from Exercises 8.7.1 and 8.7.2, the kmeans function, and your readFile
function. The data file, breast-cancer-wisconsin.csv, is available on the book
web site. Describe the results.
8.7.5. Given data on 100 customers’ torso length and chest measurements, use k-means
clustering to cluster the customers into three groups: S, M, and L. (Later, you can
use this information to design appropriately sized t-shirts.) You can find a list of
measurements on the book web site.
8.7.6. In the kmeans function, we needed to create a new list of empty clusters at the
beginning of each iteration. To create this list, we did the following:
clusters = []
for i in range(k):
    clusters.append([])
Alternatively, we could have created a list of lists using the * repetition operator
like this:
clusters = [[]] * k
However, this will not work correctly, as evidenced by the following:
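For instance, a short experiment like the following (a sketch, not from the text) illustrates the problem: the repetition operator makes all k inner lists refer to the same list object, so appending to one of them appears to change all of them.

clusters = [[]] * 3
clusters[0].append(4)
print(clusters)      # prints [[4], [4], [4]], not [[4], [], []]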
8.8 SUMMARY
Those who know how to manipulate and extract meaning from data will be well
positioned as the decision makers of the future. Algorithmically, the simplest way to
store data is in a list. We developed algorithms to summarize the contents of a list
with various descriptive statistics, modify the contents of lists, and create histograms
describing the frequency of values in a list. The beauty of these techniques is that
they can be used with a wide variety of data types and applications. But before
any of them can be used on real data, the data must be read from its source and
wrangled into a usable form. To this end, we also discussed basic methods for reading
and formatting tabular data both from local files and from the web.
In later sections, we went beyond simply describing data sets to data mining
techniques that can make predictions from them. Linear regression seeks a linear
pattern in data and then uses this pattern to predict missing data points. The
k-means clustering algorithm partitions data into clusters of like elements to elicit
hidden relationships.
Algorithms that manipulate lists can quickly become much more complicated
than what we have seen previously, and therefore paying attention to their time
complexity is important. To illustrate, we worked through a sequence of three
increasingly more elegant and more efficient algorithms for removing duplicates from
a list. In the end, we saw that the additional time taken to think through a problem
carefully and reduce its time complexity can pay dividends.
The article referenced in Box 8.4 is from The New York Times [12]. The non-
profit Electronic Frontier Foundation (EFF), founded in 1990, works at the forefront
of issues of digital privacy and free speech. To learn more about contemporary
privacy issues, visit its website at http://www.eff.org . For more about ethical
issues in computing in general, we recommend Computer Ethics by Deborah Johnson
and Keith Miller [21].
8.10 PROJECTS
Project 8.1 Climate change
The causes and consequences of global warming are of intense interest. The consensus
of the global scientific community is that the primary cause of global warming is an
increased concentration of “greenhouse gasses,” primarily carbon dioxide (CO2 ) in
the atmosphere. In addition, it is widely believed that human activity is the cause
of this increase, and that it will continue into the future if we do not limit what we
emit into the atmosphere.
To understand the causes of global warming, and to determine whether the
increase in CO2 is natural or human-induced, scientists have reconstructed ancient
climate patterns by analyzing the composition of deep ocean sediment core samples.
In a core sample, the older sediment is lower and the younger sediment is higher.
These core samples contain the remains of ancient bottom-dwelling organisms called
foraminifera that grew shells made of calcium carbonate (CaCO3 ).
Virtually all of the oxygen in these calcium carbonate shells exists as one of two
stable isotopes: oxygen-16 (¹⁶O) and oxygen-18 (¹⁸O). Oxygen-16 is, by far, the
most common oxygen isotope in our atmosphere and seawater at about 99.76% and
oxygen-18 is the second most common at about 0.2%. The fraction of oxygen-18
incorporated into the calcium carbonate shells of marine animals depends upon
the temperature of the seawater. Given the same seawater composition, colder
temperatures will result in a higher concentration of oxygen-18 being incorporated
into the shells. Therefore, by analyzing the ratio of oxygen-18 to oxygen-16 in an
ancient shell, scientists can deduce the temperature of the water at the time the
shell formed. This ratio is denoted δ¹⁸O; higher values of δ¹⁸O represent lower
temperatures.
Similarly, it is possible to measure the relative amounts of carbon isotopes in
calcium carbonate. Carbon can exist as one of two stable isotopes: carbon-12 (¹²C)
and carbon-13 (¹³C). (Recall from Section 4.4 that carbon-14 (¹⁴C) is radioactive
and used for radiocarbon dating.) δ¹³C is a measure of the ratio of carbon-13 to
carbon-12. The value of δ¹³C can decrease, for example, if there were a sudden
injection of ¹²C-rich (i.e., ¹³C-depleted) carbon. Such an event would likely cause
an increase in warming due to the increase in greenhouse gasses.
In this project, we will examine a large data set [60] containing over 17,000
δ¹⁸O and δ¹³C measurements from deep ocean sediment core samples, representing
conditions over a period of about 65 million years. From this data, we will be able to
pyplot.figure(2)
pyplot.figure(3)
pyplot.subplot(2, 1, 1) # arguments are (rows, columns, subplot #)
pyplot.subplot(2, 1, 2)
Question 8.1.1 What do you notice? What do your plots imply about the relation-
ship between carbon and temperature?
Question 8.1.2 What do you notice? What is the maximum CO2 concentration
during this period?
Each of the dates in the file is a string in YYYY-MM-DD format. To convert each of
these date strings to a fractional year (e.g., 2013-07-01 would be 2013.5), we can
use the following formula:
year = float(date[:4]) + (float(date[5:7]) - 1 + (float(date[8:10]) - 1) / 31) / 12
Question 8.1.3 How do these levels compare with the maximum level from the
previous 420,000 years? Is your plot consistent with the pattern of “natural” CO2
concentrations from the previous 420,000 years? Based on these results, what con-
clusions can we draw about the human impact on atmospheric CO2 concentrations?
The labor force consists of all of the people who are available for employment at any given time, whether or not they are actually
employed. So we will define the unemployment rate to be the fraction of the (civilian)
labor force that is unemployed. To compute the unemployment rate for each category
of educational attainment, we will need data from the following ten columns of the
file:
Column index Contents
2 Name of metropolitan area
3 Total population of metropolitan area
11 No high school diploma, total in civilian labor force
15 No high school diploma, unemployed in civilian labor force
25 High school graduate, total in civilian labor force
29 High school graduate, unemployed in civilian labor force
39 Some college, total in civilian labor force
43 Some college, unemployed in civilian labor force
53 College graduate, total in civilian labor force
57 College graduate, unemployed in civilian labor force
Read this data from the file and store it in six lists that contain the names of
the metropolitan areas, the total populations, and the unemployment rates for each
of the four educational attainment categories. Each unemployment rate should be
stored as a floating point value between 0 and 100, rounded to one decimal point.
For example, 0.123456 should be stored as 12.3.
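For example, a single rate might be computed like this (a hypothetical snippet; the variable names are not from the project description):

laborForce = 12345        # total in civilian labor force for one category
unemployed = 1524         # unemployed in that category
rate = round(unemployed / laborForce * 100, 1)
print(rate)               # prints 12.3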
Next, use four calls to the matplotlib plot function to plot the unemployment
rate data in a single figure with the metropolitan areas on the x-axis, and the
unemployment rates for each educational attainment category on the y-axis. Be sure
to label each plot and display a legend. You can place the names of the metropolitan
areas on the x-axis with the xticks function:
pyplot.xticks(range(len(names)), names, rotation = 270, fontsize = 'small')
The first argument is the locations of the ticks on the x-axis, the second argument is
a list of labels to place at those ticks, and the third and fourth arguments optionally
rotate the text and change its size.
Part 3: Analysis
Write a program to answer each of the following questions.
Question 8.2.2 Which of the 30 metropolitan areas have the highest and lowest
unemployment rates for each of the four categories of educational attainment? Use
a single loop to compute all of the answers, and do not use the built-in min and max
functions.
Question 8.2.3 Which of the 30 metropolitan areas has the largest difference
between the unemployment rates of college graduates and those with only a high
school diploma? What is this difference?
Question 8.2.4 Print a formatted table that ranks the 30 metropolitan areas by
the unemployment rates of their college graduates. (Hint: use a dictionary.)
First, write a function that returns a list of n normally distributed prices with mean $4.00 and standard
deviation $1.50. Use this function to generate a list of maximum prices for your
1,000 potential customers, and display a histogram of these maximum prices.
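One possible sketch of such a function (the name maxPrices is only a placeholder assumption):

import random

def maxPrices(n):
    # Return a list of n normally distributed maximum prices
    # with mean $4.00 and standard deviation $1.50.
    prices = []
    for i in range(n):
        prices.append(random.gauss(4.00, 1.50))
    return prices

customers = maxPrices(1000)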
Next, write a function

sales(customers, price)

that returns the number of customers willing to buy espresso if it were priced at the
given price. The first parameter customers is a list containing the maximum price
that each customer is willing to pay. Then write another function
plotDemand(customers, lowPrice, highPrice, step)
that uses your sales function to plot a demand curve. A demand curve has price on
the x-axis and the quantity of sales on the y-axis. The prices on the x-axis should
run from lowPrice to highPrice in increments of step. Use this function to draw
a demand curve for prices from free to $8.00, in increments of a quarter.
Next, write a function that plots your profit at each price. Your function should return the maximum profit,
the price at which the maximum profit is attained, and the number of customers
who buy espresso at that price. Do not use the built-in min and max functions.
Question 8.3.1 How should you price a shot of espresso to maximize your profit?
At this price, how much profit do you expect each day? How many customers should
you expect each day?
Q = b − m ⋅ P,
where Q is the quantity sold and P is the price. Modify your plotDemand function
so that it computes the regression line and plots it with the demand curve.
Question 8.3.2 What is the linear demand function that best approximates your
demand curve?
Suppose you work in a college admissions office and would like to study how
well admissions data (high school GPA and SAT scores) predict success in college.
First, write a function that returns four lists containing the data from sat.csv. The four lists will con-
tain high school GPAs, math SAT scores, verbal (critical reading) SAT scores,
and cumulative college GPAs. (This is an expanded version of the function in
Exercise 8.6.3.)
Then write another function
plotData(hsGPA, mathSAT, crSAT, collegeGPA)
that plots all of this data in one figure. We can do this with the matplotlib subplot
function:
pyplot.figure(1)
pyplot.subplot(4, 1, 1) # arguments are (rows, columns, subplot #)
pyplot.subplot(4, 1, 2)
pyplot.subplot(4, 1, 3)
pyplot.subplot(4, 1, 4)
Reflection 8.32 Can you glean any useful information from these plots?
Question 8.4.1 Judging from the plots, how well do you think each independent
variable predicts college GPA? Is there one variable that is a better predictor than
the others?
Question 8.4.2 Based on the R² values, how well does each independent variable
predict college GPA? Which is the best predictor?
#
# U.S. Geological Survey
# National Water Information System
⋮
#
agency_cd▷site_no▷peak_dt▷peak_tm▷peak_va▷peak_cd▷gage_ht▷...
5s▷15s▷10d▷6s▷8s▷27s▷8s▷...
USGS▷13018750▷1976-06-04▷▷15800▷6▷7.80▷...
USGS▷13018750▷1977-06-09▷▷11000▷6▷6.42▷...
USGS▷13018750▷1978-06-10▷▷19000▷6▷8.64▷...
⋮
USGS▷13018750▷2011-07-01▷▷19900▷6▷8.75▷...
USGS▷13018750▷2012-06-06▷▷16500▷6▷7.87▷...
The file begins with several comment lines preceded by the hash (#) symbol. The
next two lines are header rows; the first contains the column names and the second
contains codes that describe the content of each column, e.g., 5s represents a string
of length 5 and 10d represents a date of length 10. Each column is separated by a
tab character, represented above by a right-facing triangle (▷). The header rows
are followed by the data, one row per year, representing the peak event of that year.
For example, in the first row we have:
So for each year, we essentially have two gauge values: the peak streamflow in cubic
feet per second and the maximum gauge height in feet.
If we had 100 years of gauge height data in this file, we could approximate the
water level of a 100-year flood with the maximum gauge height value. However, our
data set only covers 37 years (1976 to 2012) and, for 7 of those years, the gauge
height value is missing. Therefore, we will need to estimate the 100-year flood level
from the limited data we are given.
First, write a function readData(filename) that returns lists of the peak streamflow and gauge height data (as floating point
numbers) from the Snake River data file above. Your function will need to first read
past the comment section and header lines to get to the data. Because we do not
know how many comment lines there might be, you will need to use a while loop
containing a call to the readline function to read past the comment lines.
Notice that some rows in the data file are missing gauge height information. If
this information is missing for a particular line, use a value of 0 in the list instead.
Your function should return two lists, one containing the peak streamflow rates
and one containing the peak gauge heights. A function can return two values by
simply separating them with a comma, e.g.,
return flows, heights
Then, when calling the function, we need to assign the function call to two variable
names to capture these two lists:
flows, heights = readData('snake_peak.txt')
Next, write a function that returns a list of recurrence intervals for n floods, in order of lowest to highest.
For example if n is 3, the function should return the list [1.33, 2.0, 4.0].
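One possible sketch follows, assuming the usual convention that the flood with rank m (where rank 1 is the largest of the n floods) has recurrence interval (n + 1)/m, which reproduces the example above; the function name is only a placeholder.

def recurrenceIntervals(n):
    # Return the recurrence intervals (n + 1) / m for m = n down to 1,
    # so the list is ordered from lowest to highest interval.
    intervals = []
    for m in range(n, 0, -1):
        intervals.append((n + 1) / m)
    return intervals

print(recurrenceIntervals(3))     # prints [1.333..., 2.0, 4.0]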
After you have written this function, write another function
plotRecurrenceIntervals(heights)
that plots recurrence intervals and corresponding gauge heights (also sorted from
smallest to largest). Omit any missing gauge heights (with value zero). Your resulting
plot should look like Figure 8.7.
Figure 8.7 The peak gauge height for each recurrence interval.
In other words, we would like a function that gives us the peak gauge height for each recurrence interval. Once we have this function, we can plug
in 100 to find the gauge height for a 100-year flood.
What we need is a regression analysis, as we discussed in Section 8.6. But linear
regression only works properly if the data exhibits a linear relationship, i.e., we can
draw a straight line that closely approximates the data points.
Reflection 8.33 Do you think we can use linear regression on the data in Fig-
ure 8.7?
The data in Figure 8.7 clearly do not have a linear relationship, so a linear regression
will not produce a good approximation. The problem is that the x coordinates
(recurrence intervals) are increasing multiplicatively rather than additively; the
recurrence interval for the flood with rank r + 1 is (r + 1)/r times the recurrence
interval for the flood with rank r. However, we will share a trick that allows
us to use linear regression anyway. To illustrate the trick we can use to turn
this non-linear curve into a “more linear” one, consider the plot on the left in
Figure 8.8, representing points (2⁰, 0), (2¹, 1), (2², 2), . . . , (2¹⁰, 10). Like the plot in
Figure 8.7, the x coordinates are increasing multiplicatively; each x coordinate is
twice the one before it. The plot on the right in Figure 8.8 contains the points
that result from taking the logarithm base 2 of each x coordinate (log₂ x), giving
Figure 8.8 On the left is a plot of the points (2⁰, 0), (2¹, 1), (2², 2), . . . , (2¹⁰, 10), and on the right is a plot of the points that result from taking the logarithm base 2 of the x coordinate of each of these points.
(0, 0), (1, 1), (2, 2), . . . , (10, 10). Notice that this has turned an exponential plot into
a linear one.
We can apply this same technique to the plotRecurrenceIntervals function
you wrote above to make the curve approximately linear. Write a new function
plotLogRecurrenceIntervals(heights)
Question 8.5.1 Based on Figure 8.8, what is the estimated river level for a 100-
year flood? How can you find this value exactly in your program? What is the exact
value?
Next, write a function that produces a scatter plot with peak streamflow on the x-axis and the same year’s
gauge height on the y axis. Do not plot data for which the gauge height is missing.
Then also plot the least squares linear regression for this data.
Next, write a function
plotLogRecurrenceIntervals2(flows)
Once you have found the 100-year peak streamflow rate, use the linear regression
formula to find the corresponding 100-year gauge height.
Question 8.5.3 What is the gauge height that corresponds to the 100-year peak
streamflow rate?
Question 8.5.4 Compare the two results. Which one do you think is more accurate?
Why?
Write a function readVotes that returns these voting results as a list containing one list for each ballot. For
example, the file above should be stored in a list that looks like this:
[[’B’, ’A’, ’D’, ’C’], [’D’, ’B’, ’A’, ’C’], [’A’, ’B’, ’C’, ’D’]]
There are three sample voting result files on the book web site. Also feel free to
create your own.
Next, write a function plurality(ballots) that prints the winner (or winners if there is a tie) of the election based on a plurality
count. The parameter of the function is a list of ballots like that returned by the
readVotes function. Your function should first iterate over all of the ballots and
count the number of first-place votes won by each candidate. Store these votes in a
dictionary containing one entry for each candidate. To break the problem into more
manageable pieces, write a “helper function”
printWinners(points)
that determines the winner (or winners if there is a tie) based on this dictio-
nary (named points), and then prints the outcome. Call this function from your
plurality function.
Next, write a function that prints the winner (or winners if there is a tie) of the election based on a Borda
count. Like the plurality function, this function should first iterate over all of the
ballots and count the number of points won by each candidate. To make this more
manageable, write another “helper function” to call from within your loop named
processBallot(points, ballot)
that processes each individual ballot and adds the appropriate points to the dictionary
of accumulated points named points. Once all of the points have been accumulated,
call the printWinners above to determine the winner(s) and print the outcome.
Finally, write a function condorcet(ballots) that prints the Condorcet winner of the election or indicates that there is none. (If
there is a winner, there can only be one.)
Suppose that the list of candidates is assigned to candidates. (Think about how
you can get this list.) To simulate all head-to-head contests between one candidate
named candidate1 and all of the rest, we can use the following for loop:
for candidate2 in candidates:
    if candidate2 != candidate1:
        # head-to-head between candidate1 and candidate2 here
This loop iterates over all of the candidates and sets up a head-to-head contest
between each one and candidate1, as long as they are not the same candidate.
Reflection 8.34 How can we now use this loop to generate contests between all
pairs of candidates?
To generate all of the contests with all possible values of candidate1, we can nest
this for loop in the body of another for loop that also iterates over all of the
candidates, but assigns them to candidate1 instead:
for candidate1 in candidates:
    for candidate2 in candidates:
        if candidate2 != candidate1:
            # head-to-head between candidate1 and candidate2 here
Reflection 8.35 This nested for loop actually generates too many pairs of candi-
dates. Can you see why?
To simplify the body of the nested loop (where the comment is currently), write
another “helper function”
head2head(ballots, candidate1, candidate2)
that returns the candidate that wins in a head-to-head vote between candidate1
and candidate2, or None if there is a tie. Your condorcet function should call this
function for every pair of different candidates. For each candidate, keep track of the
number of head-to-head wins in a dictionary with one entry per candidate.
[Figure: two different delivery tours, each visiting a distribution center and the same set of houses.]
But since, for n locations, there are n! (n factorial) different tours, this is practically
impossible.
Unfortunately, the TSP has many important applications, several of which seem
at first glance to have nothing at all to do with traveling or salespeople, including
circuit board drilling, controlling robots, designing networks, x-ray crystallography,
scheduling computer time, and assembling genomes. In these cases, a heuristic must
be used. A heuristic does not necessarily give the best answer, but it tends to work
well in practice.
For this project, you will design your own heuristic, and then work with a genetic
algorithm, a type of heuristic that mimics the process of evolution to iteratively
improve problem solutions.
def readPoints(filename):
    inputFile = open(filename, 'r')
    points = []
    for line in inputFile:
        values = line.split()
        points.append((float(values[0]), float(values[1])))
    return points
To begin, write the following three functions. The first two will be needed by your
heuristics, and the third will allow you to visualize the tours that you create. To test
your functions, and the heuristics that you will develop below, use the example file
containing the coordinates of 96 African cities (africa.tsp) on the book web site.
1. distance(p, q) returns the distance between points p and q, each of which
is stored as a two-element tuple.
2. tourLength(tour) returns the total length of a tour, stored as a list of tuples.
Remember to include the distance from the last point back to the first point (a sketch follows this list).
3. drawTour(tour) draws a tour using turtle graphics. Use the setworldcoordinates
method to make the coordinates in your drawing window more closely match
the coordinates in the data files you use. For example, for the africa.tsp
data file, the following will work well:
screen.setworldcoordinates(-40, -25, 40, 60)
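As one illustration, tourLength might be sketched as follows, assuming the distance function from item 1 above:

def tourLength(tour):
    # Total length of a tour (a list of (x, y) tuples), including the
    # leg from the last point back to the first.
    total = 0
    for i in range(len(tour)):
        total = total + distance(tour[i], tour[(i + 1) % len(tour)])
    return total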
Mutation occurs when a base in a DNA molecule is replaced with a different base or
when bases are inserted into or deleted from a sequence. Most mutation is the result
of DNA replication errors but environmental factors can also lead to mutations in
DNA.
To apply this technique to the traveling salesperson problem, we first need to
define what we mean by an individual in a population. In genetics, an individual is
represented by its DNA, and an individual’s fitness, for the purposes of evolution, is
some measure of how well it will thrive in its environment. In the TSP, we will have
a population of tours, so an individual is one particular tour — a list of cities. The
most natural fitness function for an individual is the length of the tour; a shorter
tour is more fit than a longer tour.
Recombination and mutation on tours are a bit trickier conceptually than they
are for DNA. Unlike with DNA, swapping two subsequences of cities between two
tours is not likely to produce two new valid tours. For example, suppose we have
two tours [a, b, c, d] and [b, a, d, c], where each letter represents a point.
Swapping the two middle items between the tours will produce the offspring [a, a,
d, d] and [b, b, c, c], neither of which are permutations of the cities. One way
around this is to delete from the first tour the cities in the portion to be swapped
from the second tour, and then insert this portion from the second tour. In the
above example, we would delete points a and d from the first tour, leaving [b,
c], before inserting [a, d] in the middle. Doing likewise for the second tour gives
us children [b, a, d, c] and [a, b, c, d]. But we are not limited in genetic
programming to recombination that more or less mimics that found in nature. A
recombination operation can be anything that creates new offspring by somehow
combining two parents. A large part of this project involves brainstorming about
and experimenting with different techniques.
We must also rethink mutation since we cannot simply replace an arbitrary city
with another city and end up with a valid tour. One idea might be to swap the
positions of two randomly selected cities instead. But there are other possibilities as
well.
Your mission is to improve upon a baseline genetic algorithm for TSP. Be creative!
You may change anything you wish as long as the result can still be considered a
genetic algorithm. To get started, download the baseline program from the book
web site. Try running it with the 96-point instance on the book web site. Take some
time to understand how the program works. Ask questions. You may want to refer
to the Python documentation if you don’t recall how a particular function works.
Most of the work is performed by the following four functions:
Write a histogram function that is called from the report function. (Use a Python dictionary.) This function
should print a frequency chart (based on tour length) that gives you a snapshot of
the diversity in your population. Your histogram function should print something
like this:
Population diversity
1993.2714596455853 : ****
2013.1798076309087 : **
2015.1395212505120 : ****
2017.1005248468230 : ********************************
2020.6881282400334 : *
2022.9044855489917 : *
2030.9623523675089 : *
2031.4773010231959 : *
2038.0257926528227 : *
2040.7438913120230 : *
2042.8148398732630 : *
2050.1916058477627 : *
This will be very helpful as you strive to improve the algorithm: recombination in a
homogeneous population is not likely to get you very far.
Brainstorm ways to improve the algorithm. Try lots of different things,
ranging from tweaking parameters to completely rewriting any of the four functions
described above. You are free to change anything, as long as the result still resembles
a genetic algorithm. Keep careful records of what works and what doesn’t to include
in your submission.
On the book web site is a link to a very good reference [42] that will help you
think of new things to try. Take some time to skim the introductory sections, as
they will give you a broader sense of the work that has been done on this problem.
Sections 2 and 3 contain information on genetic algorithms; Section 5 contains
information on various recombination/crossover operations; and Section 7 contains
information on possible mutation operations. As you will see, this problem has been
well studied by researchers over the past few decades! (To learn more about this
history, we recommend In Pursuit of the Traveling Salesman by William Cook [8].)
CHAPTER 9
Flatland
Suffice it that I am the completion of your incomplete self. You are a Line, but I am a Line of
Lines, called in my country a Square: and even I, infinitely superior though I am to you, am
of little account among the great nobles of Flatland, whence I have come to visit you, in the
hope of enlightening your ignorance.
Edwin A. Abbott
Flatland: A Romance of Many Dimensions (1884)
9.1 TWO-DIMENSIONAL DATA
We worked with several tabular data files in Chapter 8. One of the simplest was the CSV file containing monthly extreme temperature readings in Madison, Wisconsin from Exercise 8.4.6 (madison_temp.csv):
STATION,STATION_NAME,DATE,EMXT,EMNT
GHCND:USW00014837,MADISON DANE CO REGIONAL AIRPORT WI US,19700101,33,-294
GHCND:USW00014837,MADISON DANE CO REGIONAL AIRPORT WI US,19700201,83,-261
GHCND:USW00014837,MADISON DANE CO REGIONAL AIRPORT WI US,19700301,122,-139
⋮
Because all of the data in this file is based on conditions at the same site, the first
two columns are identical in every row. The third column contains the dates of
the first of each month in which data was collected. The fourth and fifth columns
contain the maximum and minimum monthly temperatures, respectively, which are
in tenths of a degree Celsius (i.e., 33 represents 3.3°C). Previously, we would have
extracted this data into three parallel lists containing dates, maximum temperatures,
and minimum temperatures, like this:
def readData():
"""Read monthly extreme temperature data into a table.
Parameters: none
Alternatively, we may wish to extract tabular data into a single table. This is
especially convenient if the data contains many columns. For example, the three
columns above could be stored in a unified tabular structure like the following:
DATE EMXT EMNT
19700101 33 -294
19700201 83 -261
19700301 122 -139
⋮ ⋮ ⋮
We can represent this structure as a list of rows, where each row is a list of values
in that row. In other words, the table above can be stored like this:
[[’19700101’, 33, -294], [’19700201’, 83, -261], [’19700301’, 122, -139], ... ]
To better visualize this list as a table, we can reformat its presentation a bit:
[
[ ’19700101’, 33, -294 ],
[ ’19700201’, 83, -261 ],
[ ’19700301’, 122, -139 ],
⋮
]
Notice that, in the function above, we already have each of these row lists contained
in the list named row. Therefore, to create this structure, we can simply append
each value of row, with the temperature values converted to integers and the first
two redundant columns removed, to a growing list of rows named table:
def readData():
    """ (docstring omitted) """
Since each element of table is a list containing a row of this table, the first row is
assigned to table[0], as illustrated below. Similarly, the second row is assigned to
table[1] and the third row is assigned to table[2].
[ ['19700101', 33, -294], ['19700201', 83, -261], ['19700301', 122, -139], ... ]
       table[0]                 table[1]                  table[2]
Reflection 9.1 How would you access the minimum temperature in February, 1970
(date value ’19700201’) from this list?
The minimum temperature in February, 1970 is the third value in the list table[1].
Since table[1] is a list, we can use indexing to access individual items contained in
it. Therefore, the third value in table[1] is table[1][2], which equals -261 (i.e.,
−26.1°C), as indicated above. Likewise, table[2][1] is the maximum temperature
in March, 1970: 122, or 12.2°C.
Reflection 9.2 In general, how can you access the value in row r and column c?
Notice that, for a particular value table[r][c], r is the index of the row and c is
the index of the column. So if we know the row and column of any desired value, it
is easy to retrieve that value with this convenient notation.
Now suppose we want to search this table for the minimum temperature in a
particular month, given the corresponding date string. To access this value in the
table, we will need both its row and column indices. We already know that the
column index must be 2, since the minimum temperatures are in the third column.
To find the correct row index, we need to search all of the values in the first column
until we find the row that contains the desired string. Once we have the correct row
index r, we can simply return the value of table[r][2]. The following function
does exactly this.
def getMinTemp(table, date):
    """Find the minimum temperature for a given date.

    Parameters:
        table: a table containing extreme temperature data
        date: a date string

    Return value:
        the minimum temperature for the given date
        or None if the date does not exist
    """

    numRows = len(table)
    for r in range(numRows):
        if table[r][0] == date:
            return table[r][2]
    return None
The for loop iterates over the indices of the rows in the table. For each row
with index r, we check if the first value in that row, table[r][0], is equal to the
date we are looking for. If it is, we return the value in column 2 of that row. On the
other hand, if we get all the way through the loop without returning a value, the
desired date must not exist, so we return None.
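For example, using the table built by readData above, we could look up the minimum
temperature for February, 1970 like this (a usage sketch):

table = readData()
print(getMinTemp(table, '19700201'))    # prints -261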
Exercises
From this point on, we will generally not specify what the name and parameters of a function
should be. Instead, we would like you to design the function(s).
9.1.1. Show how the following table can be stored in a list named scores.
Student ID SAT M SAT CR
10305 700 610
11304 680 590
10254 710 730
12007 650 690
10089 780 760
9.1.2. In the list you created above, how do you refer to each of the following?
(a) the SAT M value for student 10089
(b) the SAT CR value for student 11304
(c) the SAT M value for student 10305
(d) the SAT CR value for student 12007
9.1.5. Write a function that does the same thing as the getMinTemp function above, but
returns the maximum temperature for a particular date instead.
9.1.6. If a data set contains a unique key value that is frequently searched, we can
alternatively store the data in a dictionary. Each row in the table can be associated
with its particular key value, which makes searching for a key value very efficient.
For example, the temperature table
[
[ '19700101', 33, -294 ],
[ '19700201', 83, -261 ],
[ '19700301', 122, -139 ]
]
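could instead be stored in a dictionary keyed by its date strings, for example like
this (an illustrative sketch):

{ '19700101': [33, -294],
  '19700201': [83, -261],
  '19700301': [122, -139] }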
9.1.9. Write a function that takes as a parameter a table returned by your function from
Exercise 9.1.7, and repeatedly prompts for a minimum earthquake magnitude.
The function should then create a new table containing the rows corresponding
to earthquakes with at least that magnitude, and then print this table using your
function from Exercise 9.1.8. The output from your function should look similar to
this:
Minimum magnitude (q to quit)? 6.2
9.2 THE GAME OF LIFE

In each step, each of the cells simultaneously observes the states of its neighbors,
and may change its state according to the following rules:
1. If a cell is alive and has fewer than two live neighbors, it dies from loneliness.
2. If a cell is alive and has two or three live neighbors, it remains alive.
3. If a cell is alive and has more than three live neighbors, it dies from overcrowding.
4. If a cell is dead and has exactly three live neighbors, it comes alive.
To see how these rules affect the cells in the Game of Life, consider the initial
configuration in the top left of Figure 9.1. Dead cells are represented by white
squares and live cells are represented by black squares. To apply rule 1 to the initial
configuration, we need to check whether there are any live cells with fewer than two
live neighbors. As illustrated below, there are two such cells, each marked with D.
(In the figure, the two live cells that will die are marked with D and the two dead
cells that will come alive are marked with A.)
According to rule 1, these two cells will die in the next generation. To apply rule
2, we need to check whether there are any live cells that have two or three live
neighbors. Since this rule applies to the other three live cells, they will remain alive
into the next generation. There are no cells that satisfy rule 3, so we move on to
rule 4. There are two dead cells with exactly three live neighbors, marked with A.
According to rule 4, these two cells will come alive in the next generation.
Reflection 9.3 Show what the second generation looks like, after applying these
rules.
The figure in the top center of Figure 9.1 shows the resulting second generation,
followed by generations three, four, and five. After five generations, as illustrated
in the bottom center of Figure 9.1, the grid has returned to its initial state, but
it has moved one cell down and to the right of its initial position. If we continued
computing generations, we would find that it would continue in this way indefinitely,
or until it collides with a border. For this reason, this initial configuration generates
what is known as a “glider.”
[ [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ]
Figure 9.2 Views of the “empty” cellular automaton as a grid and as a list.
Creating a grid
To implement this cellular automaton, we first need to create an empty grid of cells.
For simplicity, we will keep it relatively small, with 10 rows and 10 columns. We
can represent each cell by a single integer value which equals 1 if the cell is alive or
0 if it is dead. For clarity, it is best to assign these values to meaningful names:
ALIVE = 1
DEAD = 0
An initially empty grid, like the one on the left side of Figure 9.2, will be represented
by a list of row lists, each of which contains a zero for every column. For example,
the list on the right side of Figure 9.2 represents the grid to its left.
Reflection 9.4 How can we easily create a list of many zeros?
If the number of columns in the grid is assigned to columns, then each of these rows
can be created with a for loop:
row = []
for c in range(columns):
    row.append(DEAD)
We can then create the entire grid by simply appending a copy of row for every row
in the grid to a list named grid:
def emptyGrid(rows, columns):
    """Create a new grid in which every cell is DEAD.

    Parameters:
        rows: the number of rows in the grid
        columns: the number of columns in the grid
    """

    grid = []
    for r in range(rows):
        row = []
        for c in range(columns):
            row.append(DEAD)
        grid.append(row)
    return grid
Note that we could substitute the single statement row = [DEAD] * columns
for the three statements that construct row because all of the values in each row
are the same.
Initial configurations
The cellular automaton will evolve differently, depending upon the initial configu-
ration of alive and dead cells. (There are several examples of this in the exercises
at the end of this section.) We will assume that all cells are dead initially, except
for those we explicitly specify. Each cell can be conveniently represented by a (row,
column) tuple. The coordinates of the initially live cells can be stored in a list of
tuples, and passed into the following function to initialize the grid. (Recall that a
tuple is like a list, except that it is immutable and enclosed in parentheses instead
of square brackets.)
def initialize(grid, coordinates):
    """Set the cells at the given coordinates to be ALIVE.
    Parameters:
        grid: a grid of values for a cellular automaton
        coordinates: a list of coordinates
    """
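    # Sketch of the body: mark each (row, column) position in the list as alive.
    for (r, c) in coordinates:
        grid[r][c] = ALIVE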
The function iterates over the list of tuples and sets the cell at each position to be
alive. For example, to match the initial configuration in the upper left of Figure 9.1,
we would pass in the list
[(1, 3), (2, 3), (3, 3), (3, 2), (2, 1)]
Notice that by using a generic tuple as the index variable, we can conveniently
assign the two values in each tuple in the list to r and c.
Reflection 9.5 Consider a cell at position (r, c). What are the coordinates of the
eight neighbors of this cell?
The coordinates of the eight neighbors are visualized in the following grid with
coordinates (r, c) in the center.

(r − 1, c − 1)   (r − 1, c)   (r − 1, c + 1)
(r, c − 1)       (r, c)       (r, c + 1)
(r + 1, c − 1)   (r + 1, c)   (r + 1, c + 1)
The following function counts the number of live neighbors of the cell in a given
row and column.

def neighborhood(grid, row, column):
    """Count the live neighbors of a cell.

    Parameters:
        grid: a two-dimensional grid of cells
        row: the row index of a cell
        column: the column index of a cell

    Return value:
        the number of live neighbors of the cell at (row, column)
    """
The list offsets contains tuples with the offsets of all eight neighbors. We iterate
over these offsets, adding each one to the given row and column to get the coordinates
of each neighbor. Then, if the neighbor is on the grid and is alive, we increment a
counter.
Before looking at how to iterate over every cell in the grid, let’s consider how we
can iterate over just the first row.
The first row of grid is named grid[0]. Since grid[0] is a list, we already know
how to iterate over it, either by value or by index.
Reflection 9.7 Should we iterate over the indices of grid[0] or over its values?
Does it matter?
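Since we will need to change the values stored in the grid, it is more useful to
iterate over the indices. For example, we can visit every cell in the first row like
this:

for c in range(columns):
    # update grid[0][c] here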
Notice that, in this loop, the row number stays the same while the column number
(c) increases. We can generalize this idea to iterate over any row with index r by
simply replacing the row index with r:
for c in range(columns):
    # update grid[r][c] here
Now, to iterate over the entire grid, we need to repeat the loop above with values
of r ranging from 0 to rows-1, where rows is assigned the number of rows in the
grid. We can do this by nesting the loop above in the body of another for loop that
iterates over the rows:
for r in range(rows):
    for c in range(columns):
        # update grid[r][c] here
Reflection 9.8 In what order will the cells of the grid be visited in this nested loop?
In other words, what sequence of r,c values does the nested loop generate?
The value of r is initially set to 0. While r is 0, the inner for loop iterates over
values of c from 0 to columns - 1. So the first cells that will be visited are
grid[0][0], grid[0][1], grid[0][2], ..., grid[0][9]
Once the inner for loop finishes, we go back up to the top of the outer for loop.
The value of r is incremented to 1, and the inner for loop executes again. So the
next cells that will be visited are
grid[1][0], grid[1][1], grid[1][2], ..., grid[1][9]
This process repeats with r incremented to 2, 3, . . . , 9, until finally the cells in the
last row are visited:
grid[9][0], grid[9][1], grid[9][2], ..., grid[9][9]
Therefore, visually, the cells in the grid are being visited row by row:
(The figure shows a 10 × 10 grid, with row and column indices 0 through 9, in which
the cells are swept out row by row, left to right within each row.)
Reflection 9.9 How would we change the nested loop so that the cells in the grid
are visited column by column instead?
To visit the cells column by column, we can simply swap the positions of the loops:
for c in range(columns):
    for r in range(rows):
        # update grid[r][c] here
Reflection 9.10 In what order are the cells of the grid visited in this new nested
loop?
In this new nested loop, for each value of c, the inner for loop iterates all of the
values of r, visiting all of the cells in that column. So the first cells that will be
visited are
grid[0][0], grid[1][0], grid[2][0], ..., grid[9][0]
Then the value of c is incremented to 1 in the outer for loop, and the inner for
loop executes again. So the next cells that will be visited are
grid[0][1], grid[1][1], grid[2][1], ..., grid[9][1]
This process repeats with consecutive values of c, until finally the cells in the last
column are visited:
grid[0][9], grid[1][9], grid[2][9], ..., grid[9][9]
Therefore, in this case, the cells in the grid are being visited column by column
instead:
(The figure shows the same 10 × 10 grid, now swept out column by column, top to
bottom within each column.)
Reflection 9.11 Do you think the order in which the cells are updated in a cellular
automaton matters?
In most cases, the order does not matter. We will update the cells row by row.
Since rule 2 does not change the state of any cells, there is no reason to check for it.
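Putting these pieces together, a first attempt at updating every cell in one
generation might look like the following sketch (as the next reflection reveals, it
has a subtle flaw):

for r in range(rows):
    for c in range(columns):
        neighbors = neighborhood(grid, r, c)
        if grid[r][c] == ALIVE and neighbors < 2:       # rule 1
            grid[r][c] = DEAD
        elif grid[r][c] == ALIVE and neighbors > 3:     # rule 3
            grid[r][c] = DEAD
        elif grid[r][c] == DEAD and neighbors == 3:     # rule 4
            grid[r][c] = ALIVE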
Reflection 9.13 There is one problem with the algorithm we have developed to
update cells. What is it? (Think about the values referenced by the neighborhood
function when it is applied to neighboring cells. Are the values from the previous
generation or the current one?)
To see the subtle problem, suppose that we change cell (r, c) from alive to dead.
Then, when the live neighbors of the next cell in position (r + 1, c) are being counted,
the cell at (r, c) will not be counted. But it should have been because it was alive in
the previous generation. To fix this problem, we cannot modify the grid directly
while we are updating it. Instead, we need to make a copy of the grid before each
generation. When we count live neighbors, we will look at the original grid, but
make modifications in the copy. Then, after we have looked at all of the cells, we
can update the grid by assigning the updated copy to the main grid. These changes
are shown below in red.
newGrid = copy.deepcopy(grid)
for r in range(rows):
    for c in range(columns):
        neighbors = neighborhood(grid, r, c)
        if grid[r][c] == ALIVE and neighbors < 2:       # rule 1
            newGrid[r][c] = DEAD
        elif grid[r][c] == ALIVE and neighbors > 3:     # rule 3
            newGrid[r][c] = DEAD
        elif grid[r][c] == DEAD and neighbors == 3:     # rule 4
            newGrid[r][c] = ALIVE
grid = newGrid
The deepcopy function from the copy module creates a completely independent
copy of the grid.
Now that we can simulate one generation, we simply have to repeat this process
to simulate many generations. The complete function is shown below. The grid is
initialized with our emptyGrid and initialize functions, then the nested loop that
updates the grid is further nested in a loop that iterates for a specified number of
generations.
def life(rows, columns, generations, initialCells):
    """Simulate the Game of Life cellular automaton.

    Parameters:
        rows: the number of rows in the grid
        columns: the number of columns in the grid
        generations: the number of generations to simulate
        initialCells: a list of (row, column) tuples indicating
                      the positions of the initially alive cells

    Return value:
        the final configuration of cells in a grid
    """

    grid = emptyGrid(rows, columns)
    initialize(grid, initialCells)
    for g in range(generations):
        newGrid = copy.deepcopy(grid)
        for r in range(rows):
            for c in range(columns):
                neighbors = neighborhood(grid, r, c)
                if grid[r][c] == ALIVE and neighbors < 2:       # rule 1
                    newGrid[r][c] = DEAD
                elif grid[r][c] == ALIVE and neighbors > 3:     # rule 3
                    newGrid[r][c] = DEAD
                elif grid[r][c] == DEAD and neighbors == 3:     # rule 4
                    newGrid[r][c] = ALIVE
        grid = newGrid
    return grid
On the book web site, you can find an enhanced version of this function that
uses turtle graphics to display the evolution of the system with a variety of initial
configurations.
Exercises
9.2.1. Download the enhanced Game of Life program from the book web site and run
it with each of the following lists of coordinates set to be alive in the initial
configuration. Use at least a 50 × 50 grid. Describe what happens in each case.
(a) [(1, 3), (2, 3), (3, 3), (3, 2), (2, 1)]
(b) [(9, 10), (10, 10), (11, 10)]
(c) [(10, c + 1), (10, c + 4), (11, c), (12, c),
(12, c + 4), (13, c), (13, c + 1), (13, c + 2),
(13, c + 3)] with c = COLUMNS - 5
(d) [(r + 1, c + 2), (r + 2, c + 4), (r + 3, c + 1),
(r + 3, c + 2), (r + 3, c + 5), (r + 3, c + 6),
(r + 3, c + 7)] with r = ROWS // 2 and c = COLUMNS // 2
9.2.2. Modify the neighborhood function so that it treats the grid as if all four sides
“wrap around.” For example, in the 7 × 7 grid below on the left, the neighbors of
(4, 6) include (3, 0), (4, 0), and (5, 0). In the grid on the right, the neighbors of
the corner cell (6, 6) include (0, 0), (0, 5), (0, 6), (5, 0), and (6, 0).
9.2.3. Write a function that prints the contents of a parameter named grid, which is a
list of lists. The contents of each row should be printed on one line with spaces in
between. Each row should be printed on a separate line.
5 7 3
1 6 8
9 2 4
The following algorithm generates magic squares with odd-length sides, using the
consecutive numbers 1, 2, 3, . . ..
(a) Randomly put 1 in some position in your square.
(b) Look in the square diagonally to the lower right of the previous square,
wrapping around to the opposite side if you go off the edge to the right or bottom.
i. If this square is unoccupied, put the next number there.
ii. Otherwise, put the next number directly above the previous number
(again wrapping to the bottom if you are on the top row).
(c) Continue step (b) until all the positions are filled.
Write a function that takes an odd integer n as a parameter and returns an n × n
magic square.
9.2.11. A two-dimensional grid can also be stored as a dictionary in which the keys are
tuples representing grid positions. For example, the small grid
[ [0, 1],
[1, 1] ]
would be stored as the following dictionary:
{ (0, 0): 0,
(0, 1): 1,
(1, 0): 1,
(1, 1): 1 }
Rewrite the Game of Life program on the book web site so that it stores the grid
in this way instead. The following four functions will need to change: emptyGrid,
initialize, neighborhood, and life.
9.3 DIGITAL IMAGES

Colors
A digital image is a two-dimensional grid (sometimes called a bitmap) in which each
cell, called a pixel (short for “picture element”), contains a value representing its
color. In a grayscale image, the colors are limited to shades of gray. These shades
are more commonly referred to as levels of brightness or luminance, and in theory
are represented by values between 0 and 1, 0 being black and 1 being white. As we
briefly explained in Box 3.2, each pixel in a color image can be represented by a (red,
green, blue), or RGB , tuple. Each component, or channel, of the tuple represents
the brightness of the respective color. The value (0, 0, 0) is black and (1, 1, 1) is
white. Values between these can represent any color in the spectrum. For example,
(0, 0.5, 0) is a medium green and (1, 0.5, 0) is orange. In practice, each channel is
represented by eight bits (one byte) or, equivalently, a value between 0 and 255. So
black is represented by (0, 0, 0), or
00000000 00000000 00000000 in binary, and white is represented by (255, 255, 255),
or 11111111 11111111 11111111.
RGB is most commonly used for images produced by digital cameras and viewed on
a screen. Another encoding, called CMYK is used for print. See Box 9.2 for more
details.
Reflection 9.14 If we use eight bits to represent the intensity of each channel, can
we still represent any color in the spectrum? If not, how many different colors can
we represent?
Using eight bits per channel, we cannot represent the continuous range of values
between 0 and 1 that would be necessary to represent any color in the spectrum. In
effect, we are only able to represent 254 values between 0 and 1: 1/255, 2/255, . . . ,
254/255. This is another example of how some objects represented in a computer
are limited versions of those existing in nature. Looking at it another way, by using
8 bits per channel, or 24 bits total, we can represent 2²⁴ = 16,777,216 distinct colors.
The good news is that, while this does not include all the colors in the spectrum, it
is greater than the number of colors distinguishable by the human eye.
Reflection 9.15 Assuming eight bits are used for each channel, what RGB tuple
represents pure blue? What tuple represents purple? What color is (0, 128, 128)?
Bright blue is (0, 0, 255) and any tuple with equal parts red and blue, for example
(128, 0, 128) is a shade of purple. The tuple (0, 128, 128), equal parts medium green
and medium blue, is teal.
The digital images produced by digital cameras can be quite large. For example,
a high quality camera can produce an image that is about 5120 pixels wide and
3413 pixels high, and therefore contains a total of 5120 × 3413 = 17,474,560 pixels.
At one byte per pixel, a grayscale image of this size would require about 17 MB of
storage. A 5120 by 3413 color image requires 5120 × 3413 × 3 = 52,423,680 bytes,
or about 50 MB, of storage, since it requires 24 bits, or three bytes, per pixel. In
practice, color image files are compressed to take up much less space. (See Box 9.3.)
Image filters
To illustrate some basic image processing techniques, let’s consider how we can
produce a grayscale version of a color image. An operation such as this is known as
an image filter algorithm. Photo-editing software typically includes several different
image filters for enhancing digital photographs.
To change an image to grayscale, we need to convert every color pixel (an
RGB tuple) to a gray pixel with similar brightness. A white pixel (RGB color
(255, 255, 255)) is the brightest, so we would map this to a grayscale brightness of
255 while a black pixel (RGB color (0, 0, 0)) is the least bright, so we would map
this to a grayscale brightness of 0.
Reflection 9.16 How can we compute the brightness of a color pixel in general?
Consider the RGB color (250, 50, 200). The red and blue channels of this color
contribute a lot of brightness to the color while the green channel does not. So, to
estimate the overall brightness, we can simply average the three values. In this case,
(250 + 50 + 200)/3 ≈ 167. In RGB, any tuple with equal parts red, green, and blue
will be a shade of gray. Therefore, we can encode this shade of gray in RGB with
the tuple (167, 167, 167). A function to perform this conversion is straightforward:
def color2gray(color):
    """Convert a color to a shade of gray.

    Parameter:
        color: a tuple representing an RGB color
    """
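    # Sketch of the body: average the three channels and return a gray color
    # with that average brightness in all three channels.
    brightness = (color[0] + color[1] + color[2]) // 3
    return (brightness, brightness, brightness)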
The parameter color is a three-element tuple of integers between 0 and 255. The
function computes the average of the three channels and returns a tuple representing
a shade of gray with that brightness.
To apply this transformation to an entire image, we need to iterate over the
positions of all of the pixels. Since an image is a two-dimensional object, we can
process its pixels row by row as we did in the previous section:
for r in range(rows):
    for c in range(columns):
        # process the pixel at position (r, c)
To be consistent with the language typically used in image processing, we will use
different names for the variables, however. Rather than referring to the size of an
image in terms of rows and columns, we will use height and width. And we will use
x and y (with (0, 0) in the top left corner) to denote the horizontal and vertical
positions of a pixel instead of the row and column numbers. So the following is
equivalent to the nested loop above:
for y in range(height):
    for x in range(width):
        # process the pixel at coordinates (x, y)
The standard Python module for displaying images (and creating graphical interface
elements like windows and buttons) is called tkinter. (This name is short for
“Tk interface.” Tk is a widely used graphical programming package that predates
Python; tkinter provides an “interface” to Tk.) Because simple image manipulation
in tkinter is slightly more complicated than we would like, we will interact with
tkinter indirectly through a simple class named Image. The Image class is available
in the module image.py on the book web site. Download this file and copy it into
the same folder as your programs for this section.
The following program illustrates how to use the Image class to read a digital
image file, iterate over its pixels, and produce a new image that is a grayscale
version of the original. Each of the methods and functions below is described in
Appendix B.8.
import image

def grayscale(photo):
    """Convert a color image to grayscale.

    Parameter:
        photo: an Image object
    """

    width = photo.width()
    height = photo.height()
    newPhoto = image.Image(width, height, title = 'Grayscale image')
    for y in range(height):
        for x in range(width):
            color = photo.get(x, y)
            newPhoto.set(x, y, color2gray(color))
    return newPhoto

def main():
    penguin = image.Image(file = 'penguin.gif', title = 'Penguin')
    penguinGray = grayscale(penguin)
    penguin.show()
    penguinGray.show()
    image.mainloop()

main()
Let’s look at the grayscale function first. The lone parameter named photo is
the Image object that we want to turn to grayscale. The first two statements
in the function call the width and height methods of photo to get the image’s
dimensions. Then the third statement creates a new, empty Image object with the
same dimensions. This will be our grayscale image. Next, we iterate over all of the
Figure 9.3 The original image of a penguin and the grayscale version.
pixels in photo. Inside the nested loop, we call the get method to get the color of
the pixel at each position (x,y) in photo. The color is returned as a three-element
tuple of integers between 0 and 255. Next, we set the pixel at the same position in
newPhoto to the color returned by the color2gray function that we wrote above.
Once the nested loop has finished, we return the grayscale photo.
In the main function, we create an Image object named penguin from a GIF file
named penguin.gif that can be found on the book web site. (GIF is a common
image file format; see Box 9.3 for more about image files.) We then call the grayscale
function with penguin, and assign the resulting grayscale image to penguinGray.
Finally, we display both images in their own windows by calling the show method of
each one. The mainloop function at the end causes the program to wait until all of
the windows have been closed before it quits the program. The results are shown in
Figure 9.3. This simple filter is just the beginning; we leave several other fun image
filters as exercises.
Transforming images
There are, of course, many other ways we might want to transform an image. For
example, we commonly need to rotate landscape images 90 degrees clockwise. This
is illustrated in Figure 9.4. From the figure, we notice that the pixel in the corner at
(0, 0) in the original image needs to be in position (h − 1, 0) after rotation. Similarly,
the pixel in the corner at (w −1, 0) needs to be in position (h−1, w −1) after rotation.
The transformations for all four corners are shown below.
Before After
(0, 0) ⇒ (h − 1, 0)
(w − 1, 0) ⇒ (h − 1, w − 1)
(w − 1, h − 1) ⇒ (0, w − 1)
(0, h − 1) ⇒ (0, 0)
Reflection 9.17 Do you see a pattern in these transformations? Use this pattern
to infer a general rule about where each pixel at coordinates (x, y) should be in the
rotated image.
The first thing to notice is that the width and height of the image are swapped, so
the x and y coordinates in the original image need to be swapped in the rotated
image. However, just swapping the coordinates leads to the mirror image of what we
want. Notice from the table above that the y coordinate of each rotated corner is the
same as the x coordinate of the corresponding original corner. But the x coordinate
of each rotated corner is h − 1 minus the y coordinate of the corresponding
corner in the original image. So we want to draw each pixel at (x, y) in the original
image at position (h − 1 − y, x) in the rotated image. The following function does
this. Notice that it is identical to the grayscale function, with the exceptions of
parts of two statements in red.
def rotate90(photo):
    """Rotate an image 90 degrees clockwise.

    Parameter:
        photo: an Image object
    """

    width = photo.width()
    height = photo.height()
    newPhoto = image.Image(height, width, title = 'Rotated image')
    for y in range(height):
        for x in range(width):
            color = photo.get(x, y)
            newPhoto.set(height - y - 1, x, color)
    return newPhoto
Let’s look at one more example, and then we will leave several more as exercises.
Suppose we want to reduce the size of an image to one quarter of its original size. In
other words, we want to reduce both the width and height by half. In the process,
we are obviously going to lose three quarters of the pixels. Which ones do we throw
away? One option would be to group the pixels of the original image into 2 × 2 blocks
and choose the color of one of these four pixels for the corresponding pixel in the
reduced image, as illustrated in Figure 9.5. This is accomplished by the following
function. Again, it is very similar to the previous functions.
def reduce(photo):
    """Reduce an image to one quarter of its size.

    Parameter:
        photo: an Image object
    """

    width = photo.width()
    height = photo.height()
    newPhoto = image.Image(width // 2, height // 2,
                           title = 'Reduced image')
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            color = photo.get(x, y)
            newPhoto.set(x // 2, y // 2, color)
    return newPhoto
Although this works, a better option would be to average the three channels of the
four pixels in the block, and use this average color in the reduced image. This is left
as an exercise.
Once we have filters like this, we can combine them in any way we like. For
example, we can create an image of a small, upside down, grayscale penguin:
def main():
    penguin = image.Image(file = 'penguin.gif', title = 'Penguin')
    penguinSmall = reduce(penguin)
    penguinGray = grayscale(penguinSmall)
    penguinRotate1 = rotate90(penguinGray)
    penguinRotate2 = rotate90(penguinRotate1)
    penguinRotate2.show()
    image.mainloop()
By implementing some of the additional filters in the exercises below, you can devise
many more fun creations.
Exercises
9.3.1. Real grayscale filters take into account how different colors are perceived by the
human eye. Human sight is most sensitive to green and least sensitive to blue.
Therefore, for a grayscale filter to look more realistic, the intensity of the green
channel should contribute the most to the grayscale luminance and the intensity of
the blue channel should contribute the least. The following formula is a common
way to weigh these intensities:
Modify the color2gray function in the text so that it uses this formula instead.
9.3.2. The colors in an image can be made “warmer” by increasing the yellow tone. In
the RGB color model, this is accomplished by increasing the intensities of both
the red and green channels. Write a function that returns an Image object that
is warmer than the original by some factor that is passed as a parameter. If the
factor is positive, the image should be made warmer; if the factor is negative, it
should be made less warm.
9.3.3. The colors in an image can be made “cooler” by increasing the intensity of the
blue channel. Write a function that returns an Image object that is cooler than the
original by some factor that is passed as a parameter. If the factor is positive, the
image should be made cooler; if the factor is negative, it should be made less cool.
9.3.4. The overall brightness in an image can be adjusted by increasing the intensity of
all three channels. Write a function that returns an Image object that is brighter
than the original by some factor that is passed as a parameter. If the factor is
positive, the image should be made brighter; if the factor is negative, it should be
made less bright.
9.3.5. A negative image is one in which the colors are the opposite of the original. In
other words, the intensity of each channel is 255 minus the original intensity. Write
a function that returns an Image object that is the negative of the original.
9.3.6. Write a function that returns an Image object that is a horizontally flipped version
of the original. Put another way, the image should be reflected along an imaginary
vertical line drawn down the center. See the example on the left of Figure 9.6.
9.3.7. Write a function that returns an Image object with left half the same as the original
but with right half that is a mirror image of the original. (Imagine placing a mirror
along a vertical line down the center of an image, facing the left side.) See the
example on the right of Figure 9.6.
9.3.8. In the text, we wrote a function that reduced the size of an image to one quarter
of its original size by replacing each 2 × 2 block of pixels with the pixel in the top
left corner of the block. Now write a function that reduces an image by the same
amount by instead replacing each 2 × 2 block with a pixel that has the average
color of the pixels in the block.
9.3.9. An image can be blurred by replacing each pixel with the average color of its eight
neighbors. Write a function that returns a blurred version of the original.
9.3.10. An image can be further blurred by repeatedly applying the blur filter you wrote
above. Write a function that returns a version of the original that has been blurred
any number of times.
9.3.11. Write a function that returns an image that is a cropped version of the original.
The portion of the original image to return will be specified by a rectangle, as
illustrated below.
The function should take in four additional parameters that specify the (x, y)
coordinates of the top left and bottom right corners (shown above) of the crop
rectangle.
9.4 SUMMARY
As we saw in the previous chapter, a lot of data is naturally stored in two-dimensional
tables. So it makes sense that we would also want to store this data in a two-
dimensional structure in a program. We discussed two ways to do this. First, we
can store the data in a list of lists in which each inner list contains one row of data.
Second, in Exercises 9.1.6 and 9.2.11, we looked at how two-dimensional data can
be stored in a dictionary. The latter representation has the advantage that it can
be searched efficiently, if the data has an appropriate key value.
Aside from storing data, two-dimensional structures have many other applications.
Two-dimensional cellular automata are widely used to model a great variety of
phenomena. The Game of Life is probably the most famous, but cellular automata
can also be used to model actual cellular systems, pigmentation patterns on sea
shells, climate change, racial segregation (Project 9.1), ferromagnetism (Project 9.2),
and to generate pseudorandom numbers. Digital images are also stored as two-
dimensional structures, and image filters are simply algorithms that manipulate
these structures.
9.6 PROJECTS
Project 9.1 Modeling segregation
In 1971, Thomas Schelling (who in 2005 was co-recipient of the Nobel Prize in
Economics) proposed a theoretical model for how racial segregation occurs in urban
areas [47]. In the Schelling model , as it is now called, individuals belonging to one of
two groups live in houses arranged in a grid. Let’s call the two groups the Plain-Belly
Sneetches and the Star-Belly Sneetches [50]. Each cell in the grid contains a house
that is either vacant or inhabited by a Plain-Belly or a Star-Belly. Because each
cell represents an individual with its own independent attribute(s), simulations
such as these are known as agent-based simulations. Contrast this approach with
the population models in Chapter 4 in which there were no discernible individuals.
Instead, we were concerned there only with aggregate sizes of populations of identical
individuals.
In an instance of the Schelling model, the grid is initialized to contain some
proportion of Plain-Bellies, Star-Bellies, and unoccupied cells (say 0.45, 0.45, 0.10,
respectively) with their locations chosen at random. At each step, a Sneetch looks at
each of its eight neighbors. (If a neighbor is off the side of the grid, wrap around to the
other side.) If the fraction of a cell’s neighbors that are different from itself exceeds
some “tolerance threshold,” the Sneetch moves to a randomly chosen unoccupied
cell. Otherwise, the Sneetch stays put. For example, if the tolerance threshold is
3/8, then a Sneetch will move if more than three of its neighbors are different. We
would like to answer the following questions.
Question 9.1.1 Are there particular tolerance thresholds at which the two groups
always segregate themselves over time?
Question 9.1.2 Are there tolerance thresholds at which segregation happens only
some of the time? Or does it occur all of the time for some tolerance thresholds and
never for others?
Question 9.1.3 In the patterns you observed when answering the previous questions,
was there a “tipping point” or “phase transition” in the tolerance threshold? In other
words, is there a value of the tolerance threshold that satisfies the following property:
if the tolerance threshold is below this value, then one thing is certain to happen and
if the tolerance threshold is above this value then another thing is certain to happen?
Question 9.1.4 If the cells become segregated, are there “typical” patterns of seg-
regation or is segregation different every time?
Question 9.1.5 The Schelling model demonstrates how a “macro” (i.e., global)
property like segregation can evolve in an unpredictable way out of a sequence of
entirely “micro” (i.e., local) events. (Indeed, Schelling even wrote a book titled
Micromotives and Macrobehavior [48].) Such properties are called emergent. Can
you think of other examples of emergent phenomena?
Question 9.1.6 Based on the outcome of this model, can you conclude that segre-
gation happens because individuals are racist? Or is it possible that something else
is happening?
Question 9.2.1 At what temperature (roughly) does the system reach equilibrium,
i.e., settle into a consistent state that no longer changes frequently?
Question 9.2.2 As you vary the temperature, does the system change its behavior
gradually? Or is the change sudden? In other words, do you observe a “phase
transition” or “tipping point?”
Question 9.2.3 In the Ising model, there is no centralized agent controlling which
atoms change their polarity and when. Rather, all of the changes occur entirely based
on an atom’s local environment. Therefore, the global property of ferromagnetism
occurs as a result of many local changes. Can you think of other examples of this
so-called emergent phenomenon?
some distance from the growing cluster. Each of these new particles will follow a
random walk on the grid (as in Section 5.1), either eventually sticking to a fixed
particle in the cluster or walking far enough away that we abandon it and move
on to the next particle. If the particle sticks, it will remain in that position for the
remainder of the simulation.
To keep track of the position of each fixed particle in the growing cluster, we
will need a 200 × 200 grid. Each element in the grid will be either zero or one,
depending upon whether there is a particle fixed in that location. To visualize the
growing cluster, you will also set up a turtle graphics window with world coordinates
matching the dimensions of your grid. When a particle sticks, place a dot in that
location. Over time, you should see the cluster emerge.
The simulation can be described in five steps:
1. Initialize your turtle graphics window and your grid.
2. Place a seed particle at the origin (both in your grid and graphically), and set
the radius R of the cluster to be 0. As your cluster grows, R will keep track of
the maximum distance of a fixed particle from the origin.
4. Let the new particle perform a random walk on the grid. If, at any step, the
particle comes into contact with an existing particle, it “sticks.” Update your
grid, draw a dot at this location, and update R if this particle is further from
the origin than the previous radius. If the particle does not stick within 200
moves, abandon it and create a new particle. (In a more realistic simulation,
you would only abandon the particle when it wanders outside of a circle with
a radius that is some function of R.)
Additional challenges
There are several ways in which this model can be enhanced and/or explored further.
Here are some ideas:
• Implement a three-dimensional DLA simulation. A video illustrating this can
be found on the book web site. Most aspects of your program will need to be
modified to work in three dimensions on a three-dimensional grid. To visualize
the cluster in three dimensions, you can use a module called vpython.
For instructions on how to install VPython on your computer, visit http:
//vpython.org and click on one of the “Download” links on the left side of
the page.
As of this writing, the latest version of VPython only works with Python 2.7,
so installing this version of Python will be part of the process. To force your
code to behave in most respects like Python 3, even while using Python 2.7,
add the following as the first line of code in your program files, above any
other import statements:
from __future__ import division, print_function
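To draw a particle, you can then call the sphere function. For example, in classic
VPython (version 6) a call along the following lines creates a sphere; the exact
import and argument style may differ slightly between versions:

from visual import *

sphere(pos = vector(0, 0, 0), radius = 1, color = color.blue)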
This particular call draws a blue sphere with radius 1 at the origin. No prior
setup is required in vpython. At any time, you can zoom the image by dragging
while holding down the left and right buttons (or the Option key on a Mac).
You can rotate the image by dragging while holding down the right mouse
button (or the Command key on a Mac). There is very good documentation
for vpython available online at vpython.org.
• Investigate what happens if the particles are “stickier” (i.e., can stick from
further away).
• What if you start with a line of seed particles across the bottom instead of a
single particle? Or a circle of seed particles and grow inward? (Your definition
of distance or radius will need to change, of course, in these cases.)
CHAPTER 10

Self-similarity and recursion
Clouds are not spheres, mountains are not cones, coastlines are not circles, and bark is not
smooth, nor does lightning travel in a straight line.
Benoît Mandelbrot
The Fractal Geometry of Nature (1983)
William Shakespeare
Hamlet (Act II, Scene II)
Have you ever noticed, while lying on your back under a tree, that each branch
of the tree resembles the tree itself? If you could take any branch, from the
smallest to the largest, and place it upright in the ground, it could probably be
mistaken for a smaller tree. This phenomenon, called self-similarity, is widespread in
nature. There are also computational problems that, in a more abstract way, exhibit
self-similarity. In this chapter, we will discuss a computational technique, called
recursion, that we can use to elegantly solve problems that exhibit this property.
As the second quotation above suggests, although recursion may seem somewhat
foreign at first, it really is quite natural and just takes some practice to master.
10.1 FRACTALS
Nature is not geometric, at least not in a traditional sense. Instead, natural structures
are complex and not easily described. But many natural phenomena do share a
common characteristic: if you zoom in on any part, that part resembles the whole.
For example, consider the images in Figure 10.1. In the bottom two images, we can
see that if we zoom in on parts of the rock face and tree, these parts resemble the
whole. (In nature, the resemblance is not always exact, of course.) These kinds of
Figure 10.1 Fractal patterns in nature. Clockwise from top left: a nautilus shell [61],
the coastline of Norway [62], a closeup of a leaf [63], branches of a tree, a rock
outcropping, and lightning [64]. The insets in the bottom two images show how
smaller parts resemble the whole.
Figure 10.2 A tree produced by tree(george, 100, 4). The center figure illustrates
what is drawn by each numbered step of the function. The figure on the right
illustrates the self-similarity in the tree.
A fractal tree
An algorithm for creating a fractal shape is recursive, meaning that it invokes
itself on smaller and smaller scales. Let’s consider the example of the simple tree
shown on the left side of Figure 10.2. Notice that each of the two main branches
of the tree is a smaller tree with the same structure as the whole. As illustrated
in the center of Figure 10.2, to create this fractal tree, we first draw a trunk and
then, for each branch, we draw two smaller trees at 30-degree angles using the
same algorithm. Each of these smaller trees is composed of a trunk and two yet
smaller trees, again drawn with the same tree-drawing algorithm. This process could
theoretically continue forever, producing a tree with infinite complexity. In reality,
however, the process eventually stops by invoking a non-recursive base case. The
base case in Figure 10.2 is a “tree” that consists of only a single line segment.
This recursive structure is shown more precisely on the right side of Figure 10.2.
The depth of the tree is a measure of its distance from the base case. The overall
tree in Figure 10.2 has depth 4 and each of its two main branches is a tree with
depth 3. Each of the two depth 3 trees is composed of two depth 2 trees. Finally,
each of the four depth 2 trees is composed of two depth 1 trees, each of which is
only a line segment.
The following tree function uses turtle graphics to draw this tree.
import turtle

def tree(tortoise, length, depth):
    """Recursively draw a fractal tree.

    Parameters:
        tortoise: a Turtle object
        length: the length of the trunk
        depth: the desired depth of recursion
    """
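    # A sketch of the body; the numbered comments match the steps discussed
    # below, and the exact statements in the original listing may differ.
    if depth <= 1:                                  # base case:
        tortoise.forward(length)                    #   draw a single segment
        tortoise.backward(length)                   #   and retrace it
        return
    tortoise.forward(length)                        # 1: draw the trunk
    tortoise.left(30)                               # 2: turn toward the left branch
    tree(tortoise, length * (2 / 3), depth - 1)     # 3: draw the left subtree
    tortoise.right(60)                              # 4: turn toward the right branch
    tree(tortoise, length * (2 / 3), depth - 1)     # 5: draw the right subtree
    tortoise.left(30)                               # 6: turn back to the trunk heading
    tortoise.backward(length)                       # 7: retrace the trunk
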
def main():
george = turtle.Turtle()
george.left(90)
tree(george, 100, 4)
screen = george.getscreen()
screen.exitonclick()
main()
Let’s look at what happens when we run this program. The initial statements in
the main function initialize the turtle and orient it to the north. Then the tree
function is called with tree(george, 100, 4). On the lines numbered 1–2, the
turtle moves forward length units to draw the trunk, and then turns 30 degrees to
the left. This is illustrated in the center of Figure 10.2; the numbers correspond to
the line numbers in the function. Next, to draw the smaller tree, we call the tree
function recursively on line 3 with two-thirds of the length, and a value of depth
that is one less than what was passed in. The depth parameter controls how long we
continue to draw smaller trees recursively. After the call to tree returns, the turtle
turns 60 degrees to the right on line 4 to orient itself to draw the right tree. On line
5, we recursively call the tree function again with arguments that are identical to
those on line 3. When that call returns, the turtle retraces its steps in lines 6–7 to
return to the origin.
(Figure 10.3 depicts these calls as a tree of boxes, each labeled with its tortoise,
length, and depth parameter values: the call with length 100 and depth 3 invokes two
calls with length 66.67 and depth 2, each of which invokes two calls with length 44.44
and depth 1. The numbers on the connecting lines give the order of the calls.)
The case when depth is at most 1 is called the base case because it does not
make a recursive call to the function; this is where the recursion stops.
Reflection 10.1 Try running the tree-growing function with a variety of parameters.
Also, try changing the turn angles and the amount length is shortened. Do you
understand the results you observe?
Figure 10.3 illustrates the recursive function calls that are made by the tree
function when length is 100 and depth is 3. The top box represents a call to the
tree function with parameters tortoise = george, length = 100, and depth = 3.
This function calls two instances of tree with length = 100 * (2 / 3) = 66.67
and depth = 2. Then, each of these instances of tree calls two instances of tree
with length = 66.67 * (2 / 3) = 44.44 and depth = 1. Because depth is 1,
each of these instances of tree simply draws a line segment and returns.
Reflection 10.2 The numbers on the lines in Figure 10.3 represent the order in
which the recursive calls are made. Can you see why that is?
Reflection 10.3 What would happen if we removed the base case from the algorithm
by deleting the first four statements, so that the line numbered 1 was always the first
statement executed?
The base case is extremely important to the correctness of this algorithm. Without
the base case, the function would continue to make recursive calls forever!
A fractal snowflake
One of the most famous fractal shapes is the Koch curve, named after Swedish
mathematician Helge von Koch. A Koch curve begins with a single line segment
with length ℓ. Then that line segment is divided into three equal parts, each with
length ℓ/3. The middle part of this divided segment is replaced by two sides of an
equilateral triangle (with side length ℓ/3), as depicted in Figure 10.4(b). Next, each
of the four line segments of length ℓ/3 is divided in the same way, with the middle
segment again replaced by two sides of an equilateral triangle with side length ℓ/9,
etc., as shown in Figure 10.4(c)–(d). As with the tree above, this process could
theoretically go on forever, producing an infinitely intricate pattern.
Notice that, like the tree, this shape exhibits self-similarity; each “side” of the
Koch curve is itself a Koch curve with smaller depth. Consider first the Koch curve
in Figure 10.4 with depth 1. It consists of four smaller Koch curves with depth 0
and length `/3. Likewise, the Koch curve with depth 2 consists of four smaller Koch
curves with depth 1 and the Koch curve with depth 3 consists of four smaller Koch
curves with depth 2.
We can use our understanding of this self-similarity to write an algorithm to
produce a Koch curve. To draw a Koch curve with depth d (d > 0) and overall
horizontal length ℓ, we do the following:
1. Draw a Koch curve with depth d − 1 and overall length ℓ/3.
2. Turn left 60 degrees.
3. Draw another Koch curve with depth d − 1 and overall length ℓ/3.
4. Turn right 120 degrees.
5. Draw another Koch curve with depth d − 1 and overall length ℓ/3.
6. Turn left 60 degrees.
7. Draw another Koch curve with depth d − 1 and overall length ℓ/3.
The base case occurs when d = 0. In this case, we simply draw a line with length ℓ.
Reflection 10.4 Follow the algorithm above to draw (on paper) a Koch curve with
depth 1. Then follow the algorithm again to draw one with depth 2.
The following koch function implements this algorithm.

def koch(tortoise, length, depth):
    """Recursively draw a Koch curve.

    Parameters:
        tortoise: a Turtle object
        length: the length of a line segment
        depth: the desired depth of recursion
    """
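    # A sketch of the body, following the seven numbered steps above; the
    # exact statements in the original listing may differ.
    if depth == 0:                              # base case: a straight segment
        tortoise.forward(length)
        return
    koch(tortoise, length / 3, depth - 1)       # 1
    tortoise.left(60)                           # 2
    koch(tortoise, length / 3, depth - 1)       # 3
    tortoise.right(120)                         # 4
    koch(tortoise, length / 3, depth - 1)       # 5
    tortoise.left(60)                           # 6
    koch(tortoise, length / 3, depth - 1)       # 7
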
def main():
george = turtle.Turtle()
koch(george, 400, 3)
screen = george.getscreen()
screen.exitonclick()
main()
Reflection 10.5 Run this program and experiment by calling koch with different
values of length and depth.
Reflection 10.6 Look carefully at Figure 10.5. Can you see where the three indi-
vidual Koch curves are connected?
Parameters:
tortoise: a Turtle object
length: the length of a line segment
depth: the desired depth of recursion
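The Koch snowflake function simply draws three Koch curves, turning 120 degrees to
the right after each one. A sketch of one way to write it (the name snowflake and
the details here are illustrative; the listing in the text may differ slightly):

def snowflake(tortoise, length, depth):
    """ (docstring omitted) """

    for side in range(3):
        koch(tortoise, length, depth)
        tortoise.right(120)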
Reflection 10.7 Insert this function into the previous program and call it from
main. Try making different snowflakes by increasing the number of sides (and
decreasing the right turn angle).
Imagine a Koch snowflake made from infinitely recursive Koch curves. Paradoxically,
while the area inside any Koch snowflake is clearly finite (because it is bounded),
the length of its border is infinite! In fact, the distance between any two points on
its border is infinite! To see this, notice that, at every stage in its construction, each
line segment is replaced with four line segments that are one-third the length of the
original. Therefore, the total length of that “side” increases by one-third. Since this
happens infinitely often, the perimeter of the snowflake continues to grow forever.
Exercises
10.1.1. Modify the recursive tree-growing function so that it branches at random angles
between 10 and 60 degrees (instead of 30 degrees) and it shrinks the trunk/branch
length by a random fraction between 0.5 and 0.75. Do your new trees now look
more “natural”?
10.1.2. The quadratic Koch curve is similar to the Koch curve, but replaces the middle
segment of each side with three sides of a square instead, as illustrated in Figure 10.6.
Write a recursive function
quadkoch(tortoise, length, depth)
that draws the quadratic Koch curve with the given segment length and depth.
10.1.3. Each of the following activities is recursive in the sense that each step can be
considered a smaller version of the original activity. Describe how this is the case
and how the “input” gets smaller each time. What is the base case of each operation
below?
(a) Evaluating an arithmetic expression like 7 + (15 − 3)/4.
(b) The chain rule in calculus (if you have taken calculus).
(c) One hole of golf.
(d) Driving directions to some destination.
10.1.4. Generalize the Koch snowflake function with an additional parameter so that it
can be used to draw a snowflake with any number of sides.
10.1.5. The Sierpinski triangle, depicted in Figure 10.7, is another famous fractal. The
fractal at depth 0 is simply an equilateral triangle. The triangle at depth 1 is
composed of three smaller triangles, as shown in Figure 10.7(b). (The larger outer
triangle and the inner “upside down” triangle are indirect effects of the positions
of these three triangles.) At depth 2, each of these three triangles is replaced by
three smaller triangles. And so on. Write a recursive function
sierpinski(tortoise, p1, p2, p3, depth)
that draws a Sierpinski triangle with the given depth. The triangle’s three corners
should be at coordinates p1, p2, and p3 (all tuples). It will be helpful to also
write two smaller functions that you can call from sierpinski: one to draw a
simple triangle, given the coordinates of its three corners, and one to compute the
midpoint of a line segment.
10.1.6. The Hilbert space-filling curve, shown in Figure 10.8, is a fractal path that visits all
of the cells in a square grid in such a way that close cells are visited close together
in time. For example, the figure below shows how a depth 2 Hilbert curve visits
the cells in an 8 × 8 grid.
Assume the turtle is initially pointing north (up). Then the following algorithm
draws a Hilbert curve with depth d >= 0. The algorithm can be in one of two
Figure 10.9 Sierpinski carpets with depths 0, 1, and 2. (The gray bounding box shows
the extent of the drawing area; it is not actually part of the fractal.)
different modes. In the first mode, steps 1 and 11 turn right, and steps 4 and 8
turn left. In the other mode, these directions are reversed (indicated in square
brackets below). Steps 2 and 10 make recursive calls that switch this mode.
1. Turn 90 degrees to the right [left].
2. Draw a depth d − 1 Hilbert curve with left/right swapped.
3. Draw a line segment.
4. Turn 90 degrees to the left [right].
5. Draw a depth d − 1 Hilbert curve.
6. Draw a line segment.
7. Draw a depth d − 1 Hilbert curve.
8. Turn 90 degrees to the left [right].
9. Draw a line segment.
10. Draw a depth d − 1 Hilbert curve with left and right swapped.
11. Turn 90 degrees to the right [left].
The base case of this algorithm (d < 0) draws nothing. Write a recursive function
hilbert(tortoise, reverse, depth)
that draws a Hilbert space-filling curve with the given depth. The Boolean parame-
ter reverse indicates which mode the algorithm should draw in. (Think about how
you can accommodate both drawing modes by changing the angle of the turns.)
10.1.7. A fractal pattern called the Sierpinski carpet is shown in Figure 10.9. At depth 0, it
is simply a filled square one-third the width of the overall square space containing
the fractal. At depth 1, this center square is surrounded by eight one-third size
Sierpinski carpets with depth 0. At depth 2, the center square is surrounded by
eight one-third size Sierpinski carpets with depth 1. Write a function
carpet(tortoise, upperLeft, width, depth)
that draws a Sierpinski carpet with the given depth. The parameter upperLeft
refers to the coordinates of the upper left corner of the fractal and width refers to
the overall width of the fractal.
10.1.8. Modify your Sierpinski carpet function from the last exercise so that it displays
the color pattern shown in Figure 10.10.
10.2 RECURSION AND ITERATION

Computing the sum of the numbers in a list is a problem we can easily solve iteratively:

def sumList(data):
    """Compute the sum of the values in a list.

    Parameter:
        data: a list of numbers
    """

    sum = 0
    for value in data:
        sum = sum + value
    return sum
Let’s think about how we could achieve the same thing recursively. To solve a
problem recursively, we need to think about how we could solve it if we had a
solution to a smaller subproblem. A subproblem is the same as the original problem,
but with only part of the original input.
In the case of summing the numbers in a list named data, a subproblem would
be summing the numbers in a slice of data. Consider the following example:
data → [ 1,  7, 3, 6 ]
             \_______/
              data[1:]
Since there is no data[0] or data[1:] when data is empty, the method above will
not work in this case. But we can easily check for this case and simply return 0;
this is the base case of the function. Putting these two parts together, we have the
following function:
def sumList(data):
    """ (docstring omitted) """
But does this actually work? Yes, it does. To see why, let’s look at Figure 10.11.
At the top of the diagram, in box (a), is a representation of a main function that
has called sumList with the argument [1, 7, 3, 6]. Calling sumList creates
a new instance of the sumList function, represented in box (b), with data as-
signed the values [1, 7, 3, 6]. Since data is not empty, the function will return
1 + sumList([7, 3, 6]), the value enclosed in the gray box. To evaluate this
return value, we must call sumList again with the argument [7, 3, 6], resulting
in another instance of the sumList function, represented in box (c). The instance
of sumList in box (b) must wait to return its value until the instance in box (c)
returns. Again, since data is not empty, the instance of sumList in box (c) will
return 7 + sumList([3, 6]). Evaluating this value requires that we call sumList
again, resulting in the instance of sumList in box (d). This process continues until
the instance of sumList in box (e) calls sumList([]), creating the instance of
sumList in box (f). Since this value of the data parameter is empty, the instance of
Figure 10.11 The sequence of calls and return values for sumList([1, 7, 3, 6]):

(a) main()                  calls sum = sumList([1, 7, 3, 6]), which returns 17
(b) sumList([1, 7, 3, 6])   returns 1 + sumList([7, 3, 6])  =  1 + 16  =  17
(c) sumList([7, 3, 6])      returns 7 + sumList([3, 6])     =  7 + 9   =  16
(d) sumList([3, 6])         returns 3 + sumList([6])        =  3 + 6   =  9
(e) sumList([6])            returns 6 + sumList([])         =  6 + 0   =  6
(f) sumList([])             returns 0
sumList in box (f) immediately returns 0 to the instance of sumList that called it,
in box (e). Now that the instance of sumList in box (e) has a value for sumList([]),
it can return 6 + 0 = 6 to the instance of sumList in box (d). Since the instance
of sumList in box (d) now has a value for sumList([6]), it can return 3 + 6 = 9
to the instance of sumList in box (c). The process continues up the chain until the
value 17 is finally returned to main.
Notice that the sequence of function calls, moving down the figure from (a) to
(f), only ended because we eventually reached the base case in which data is empty,
which resulted in the function returning without making another recursive call.
Every recursive function must have a non-recursive base case, and each recursive
call must get one step closer to the base case. This may sound familiar; it is very
similar to the way we must think about while loops. Each iteration of a while loop
must move one step closer to the loop condition becoming false.
2. Suppose we could ask an all-knowing oracle for the solution to any subproblem
(but not for the solution to the problem itself ). Which subproblem solution
would be the most useful for solving the original problem?
The most useful solution would be the solution for a sublist that contains all
but one element of the original list, e.g., sumList(data[1:]).
3. How do we find the solution to the original problem using this subproblem
solution? Implement this as the recursive step of our recursive function.
The solution to sumList(data) is data[0] + sumList(data[1:]). Therefore,
the recursive step in our function should be
return data[0] + sumList(data[1:])
4. What are the simplest subproblems that we can solve non-recursively, and what
are their solutions? Implement your answer as the base case of the recursive
function.
The simplest subproblem would be to compute the sum of an empty list, which
is 0, of course. So the base case should be
if len(data) == 0:
return 0
5. For any possible parameter value, will the recursive calls eventually reach the
base case?
Yes, since an empty list will obviously reach the base case and any other list
will result in a sequence of recursive calls that each involve a list that is one
element shorter.
def sumList(data):
    """ (docstring omitted) """
    if len(data) == 0:                    # base case: the sum of an empty list is 0
        return 0
    return data[0] + sumList(data[1:])    # recursive step
Palindromes
Let’s look at another example. A palindrome is any sequence of characters that
reads the same forward and backward. For example, radar, star rats, and now I
won are all palindromes. An iterative function that determines whether a string is a
palindrome is shown below.
def palindrome(s):
    """Determine whether a string is a palindrome.

    Parameter: a string s

    Return value: True if s is a palindrome, False otherwise
    """
    for index in range(len(s) // 2):              # compare characters from the two ends
        if s[index] != s[len(s) - 1 - index]:
            return False
    return True
Let’s answer the five questions above to develop an equivalent recursive algorithm
for this problem.
Reflection 10.11 Second, if you could know whether any slice is a palindrome,
which would be the most helpful?
s → 'now I won', where the slice s[1:-1] is 'ow I wo'
If we begin by looking at the first and last characters and determine that they are
not the same, then we know that the string is not a palindrome. But if they are the
same, then the question of whether the string is a palindrome is decided by whether
the slice that omits the first and last characters, i.e., s[1:-1], is a palindrome. So
it would be helpful to know the result of palindrome(s[1:-1]).
Reflection 10.12 Third, how could we use this information to determine whether
the whole string is a palindrome?
If the first and last characters are the same and s[1:-1] is a palindrome, then s is
a palindrome. Otherwise, s is not a palindrome. In other words, our desired return
value is the answer to the following Boolean expression.
return s[0] == s[-1] and palindrome(s[1:-1])
If the first part is true, then the answer depends on whether the slice is a palindrome
(palindrome(s[1:-1])). Otherwise, if the first part is false, then the entire and
expression is false. Furthermore, due to the short circuit evaluation of the and
operator, the recursive call to palindrome will be skipped.
Reflection 10.13 What are the simplest subproblems that we can solve non-
recursively, and what are their solutions? Implement your answer as the base case.
The simplest string is, of course, the empty string, which we can consider a palin-
drome. But strings containing a single character are also palindromes, since they
read the same forward and backward. So we know that any string with length at
most one is a palindrome. But we also need to think about strings that are obviously
not palindromes. Our discussion above already touched on this; when the first and
last characters are different, we know that the string cannot be a palindrome. Since
this situation is already handled by the Boolean expression above, we do not need a
separate base case for it.
Putting this all together, we have the following elegant recursive function:
def palindrome(s):
    """ (docstring omitted) """
    if len(s) <= 1:                                # base case: length 0 or 1 strings are palindromes
        return True
    return s[0] == s[-1] and palindrome(s[1:-1])   # recursive step
Let’s look more closely at how this recursive function works. On the left side of
Figure 10.12 is a representation of the recursive calls that are made when the
palindrome function is called with the argument ’now I won’. From the main
function in box (a), palindrome is called with ’now I won’, creating the instance of
palindrome in box (b). Since the first and last characters of the parameter are equal
(the s[0] == s[-1] part of the return statement is not shown to make the pictures
less cluttered), the function will return the value of palindrome(’ow I wo’). But,
[Figure 10.12: on the left, the chain of calls made for palindrome('now I won'), ending with
palindrome('I'), whose return value True is passed back up the chain to main; on the right, the
calls made for palindrome('now I win'), where the second instance finds that the first and last
characters differ and False is passed back up to main.]
in order to get this value, it needs to call palindrome again, creating the instance
in box (c). These recursive calls continue until we reach the base case in box (f),
where the length of the parameter is one. The instance of palindrome in box (f)
returns True to the previous instance of palindrome in box (e). Now the instance
in box (e) returns to (d) the value True that it just received from (f). The value of
True is propagated in this way all the way up the chain until it eventually reaches
main, where it is assigned to the variable named pal.
To see how the function returns False, let’s consider the example on the right side
of Figure 10.12. In this example, the recursive palindrome function is called from
main in box (a) with the non-palindromic argument ’now I win’, which creates the
instance of palindrome in box (b). As before, since the first and last characters of the
parameter are equal, the function will return the value of palindrome(’ow I wi’).
Calling palindrome with this parameter creates the instance in box (c). But now,
since the first and last characters of the parameter are not equal, the Boolean expression
returned by the function is False, so the instance of palindrome in box (c) returns
False, and this value is propagated up to the main function.
Guessing passwords
One technique that hackers use to compromise computer systems is to rapidly try
all possible passwords up to some given length.
Reflection 10.14 How many possible passwords are there with length n, if there
are c possible characters to choose from?
c ⋅ c ⋅ c ⋯ c = cⁿ  (a product of n factors of c).
For example, there are 26⁸ (more than 200 billion) different eight-character passwords that use only
lower-case letters. But there are 67¹² (about 8 × 10²¹) different twelve-character passwords that draw
from the lower and upper-case letters, digits, and the five special characters $, &, #, ?, and !, which is why web sites
prompt you to use long passwords with all of these types of characters! When you
use a long enough password and enough different characters, this “guess and check”
method is useless.
Let’s think about how we could generate a list of possible passwords by
first considering the simpler problem of generating all binary strings (or “bit
strings”) of a given length. This is the same problem, but using only two charac-
ters: ’0’ and ’1’. For example, the list of all binary strings with length three is
[’000’, ’001’, ’010’, ’011’, ’100’, ’101’, ’110’, ’111’].
Thinking about this problem iteratively can be daunting. However, it becomes
easier if we think about the problem’s relationship to smaller versions of itself (i.e.,
self-similarity). As shown below, a list of binary strings with a particular length can
be created easily if we already have a list of binary strings that are one bit shorter.
We simply make two copies of the list of shorter binary strings, and then precede
all of the strings in the first copy with a 0 and all of the strings in the second copy
with a 1.
length 1:   0   1
length 2:   00  01  10  11
length 3:   000 001 010 011 100 101 110 111
In the illustration above, the list of binary strings with length 2 is created from two
copies of the list of binary strings with length 1. Then the list of binary strings with
length 3 is created from two copies of the list of binary strings with length 2. In
general, the list of all binary strings with a given length is the concatenation of
(a) the list of all binary strings that are one bit shorter and preceded by zero and
(b) the list of all binary strings that are one bit shorter and preceded by one.
Reflection 10.15 What is the base case of this algorithm?
The base case occurs when the length is 0, and there are no binary strings. However,
the problem says that the return value should be a list of strings, so we will return
a list containing an empty string in this case. The following function implements
our recursive algorithm:
def binary(length):
    """Return a list of all binary strings with the given length.

    Parameter:
        length: the length of the binary strings

    Return value:
        a list of all binary strings with the given length
    """
    if length == 0:
        return ['']
    shorter = binary(length - 1)
    bitStrings0 = []                          # the shorter strings, each preceded by '0'
    for shortString in shorter:
        bitStrings0.append('0' + shortString)
    bitStrings1 = []                          # the shorter strings, each preceded by '1'
    for shortString in shorter:
        bitStrings1.append('1' + shortString)
    return bitStrings0 + bitStrings1
In the recursive step, we assign a list of all bit strings with length that is one shorter
to shorter, and then create two lists of bit strings with the desired length. The
first, named bitStrings0, is a list of bit strings consisting of each bit string in
shorter, preceded by ’0’. Likewise, the second list, named bitStrings1, contains
the shorter bit strings preceded by ’1’. The return value is the list consisting of
the concatenation of bitStrings0 and bitStrings1.
Reflection 10.16 Why will this algorithm not work if we return an empty list
([ ]) in the base case? What will be returned?
If we return an empty list in the base case instead of a list containing an empty
string, the function will return an empty list. To see why, consider what happens
when we call binary(1). Then shorter will be assigned the empty list, which means
that there is nothing to iterate over in the two for loops, and the function returns
the empty list. Since binary(2) calls binary(1), this means that binary(2) will
also return the empty list, and so on for any value of length!
Reflection 10.17 Reflecting the algorithm we developed, the function above con-
tains two nearly identical for loops, one for the ’0’ prefix and one for the ’1’ prefix.
How can we combine these two loops?
We can combine the two loops by repeating a more generic version of the loop for
each of the characters ’0’ and ’1’:
def binary(length):
    """ (docstring omitted) """
    if length == 0:
        return ['']
    shorter = binary(length - 1)
    bitStrings = []
    for character in ['0', '1']:
        for shortString in shorter:
            bitStrings.append(character + shortString)
    return bitStrings
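For example (a quick check, not from the text), binary(3) produces the eight strings listed earlier, in the same order:

print(binary(3))
# ['000', '001', '010', '011', '100', '101', '110', '111']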
We can use a very similar algorithm to generate a list of possible passwords. The
only difference is that, instead of preceding each shorter string with 0 and 1, we
need to precede each shorter string with every character in the set of allowable
characters. The following function, with a string of allowable characters assigned to
an additional parameter, is a simple generalization of our binary function.
def passwords(length, characters):
    """Return a list of all possible passwords with the given length,
    using the given characters.

    Parameters:
        length: the length of the passwords
        characters: a string containing the characters to use

    Return value:
        a list of all possible passwords with the given length,
        using the given characters
    """
    if length == 0:
        return ['']
    shorter = passwords(length - 1, characters)
    passwordList = []
    for character in characters:
        for shorterPassword in shorter:
            passwordList.append(character + shorterPassword)
    return passwordList
Reflection 10.18 How would we call the passwords function to generate a list of
all bit strings with length 5? What about all passwords with length 4 containing the
characters ’abc123’? What about all passwords with length 8 containing lower case
letters? (Do not actually try this last one!)
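One possible set of answers, assuming the passwords function above (the last call is left as a comment because its result would contain hundreds of billions of strings):

bitStrings = passwords(5, '01')                  # all bit strings with length 5
shortPasswords = passwords(4, 'abc123')          # length-4 passwords over 'abc123'
# passwords(8, 'abcdefghijklmnopqrstuvwxyz')     # do not actually run this one!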
Exercises
Write a recursive function for each of the following problems.
10.2.12. Suppose you work for a state in which all vehicle license plates consist of a string
of letters followed by a string of numbers, such as ’ABC 123’. Write a recursive
function
licensePlates(length, letters, numbers)
that returns a list of strings representing all possible license plates of this form,
with length letters and numbers chosen from the given strings. For example,
licensePlates(2, ’XY’, ’12’) should return the following list of 16 possible
license plates consisting of two letters drawn from ’XY’ followed by two digits
drawn from ’12’:
[’XX 11’, ’XX 21’, ’XY 11’, ’XY 21’, ’XX 12’, ’XX 22’,
’XY 12’, ’XY 22’, ’YX 11’, ’YX 21’, ’YY 11’, ’YY 21’,
’YX 12’, ’YX 22’, ’YY 12’, ’YY 22’]
(Hint: this is similar to the passwords function.)
This game is interesting because it is naturally solved using the following recursive
insight. To move n disks from the first peg to the third peg, we must first be able to
1
http://www.cs.wm.edu/~pkstoc/toh.html
Figure 10.14 Illustration of the recursive algorithm for Tower of Hanoi with three
disks, in four panels (a)–(d).
move the bottom (largest) disk on the first peg to the bottom position on the third
peg. The only way to do this is to somehow move the top n−1 disks from the first peg
to the second peg, to get them out of the way, as illustrated in Figure 10.14(a)–(b).
But notice that moving n − 1 disks is a subproblem of moving n disks because it is
the same problem but with only part of the input. The source and destination pegs
are different in the original problem and the subproblem, but this can be handled
by making the source, destination, and intermediate pegs additional inputs to the
problem. Because this step is a subproblem, we can perform it recursively! Once this
is accomplished, we are free to move the largest disk from the first peg to the third
peg, as in Figure 10.14(c). Finally, we can once again recursively move the n − 1
disks from the second peg to the third peg, shown in Figure 10.14(d). In summary,
we have the following recursive algorithm:
1. Recursively move the top n − 1 disks from the source peg to the intermediate
peg, as in Figure 10.14(b).
2. Move one disk from the source peg to the destination peg, as in Figure 10.14(c).
3. Recursively move the n − 1 disks from the intermediate peg to the destination
peg, as in Figure 10.14(d).
Reflection 10.19 What is the base case in this recursive algorithm? In other words,
what is the simplest subproblem that will be reached by these recursive calls?
The simplest base case would be if there were no disks at all! In this case, we simply
do nothing.
We cannot actually write a Python function to move the disks for us, but we can
write a function that gives us instructions on how to do so. The following function
accomplishes this, following exactly the algorithm described above.
def hanoi(n, source, destination, intermediate):
    """Print instructions for moving n disks from the source peg to the
    destination peg.

    Parameters:
        n: the number of disks
        source: the source peg
        destination: the destination peg
        intermediate: the other peg

    Return value: None
    """
    if n >= 1:
        hanoi(n - 1, source, intermediate, destination)
        print('Move a disk from peg', source, 'to peg', destination)
        hanoi(n - 1, intermediate, destination, source)

For example, the call hanoi(8, 'A', 'C', 'B')
will print instructions for moving eight disks from peg A to peg C, using peg B as
the intermediate.
Reflection 10.20 Execute the function with three disks. Does it work? How many
steps are necessary? What about with four and five disks? Do you see a pattern?
Moving three disks requires first moving the top two disks to the intermediate peg
(three moves), then moving the largest disk to the destination peg (one move), and finally moving the two
disks from the intermediate peg to the destination peg, which requires another three
moves, for a total of seven moves.
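For example, with the function above, the call

hanoi(3, 'A', 'C', 'B')

prints the following seven instructions (a quick check, not from the text):

Move a disk from peg A to peg C
Move a disk from peg A to peg B
Move a disk from peg C to peg B
Move a disk from peg A to peg C
Move a disk from peg B to peg A
Move a disk from peg B to peg C
Move a disk from peg A to peg C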
In general, notice that the number of moves required for n disks is the number of
moves required for n − 1 disks, plus one move for the bottom disk, plus the number of
moves required for n − 1 disks again. In other words, if the function M (n) represents
the number of moves required for n disks, then
M(n) = M(n − 1) + 1 + M(n − 1) = 2M(n − 1) + 1.
Does this look familiar? This is a difference equation, just like those in Chapter 4.
In this context, a function that is defined in terms of itself is also called a recurrence
relation. The pattern produced by this recurrence relation is illustrated by the
following table.
n    M(n)
1       1
2       3
3       7
4      15
5      31
⋮       ⋮
Reflection 10.21 Do you see the pattern in the table? What is the formula for
M(n) in terms of n?

M(n) is always one less than 2ⁿ. In other words, the algorithm requires

M(n) = 2ⁿ − 1

moves to solve the game when there are n disks. This expression is called a
closed form for the recurrence relation because it is defined only in terms of n, not
M(n − 1). According to our formula, moving 64 disks would require 2⁶⁴ − 1 =
18,446,744,073,709,551,615 moves. I guess the end of the world is not coming any time soon!
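As a quick sanity check (a small sketch, not from the text), we can evaluate the recurrence relation directly and compare it to the closed form:

def moves(n):
    """Evaluate the recurrence M(n) = 2 * M(n - 1) + 1 with M(0) = 0."""
    if n == 0:
        return 0
    return 2 * moves(n - 1) + 1

for n in range(1, 6):
    print(n, moves(n), 2 ** n - 1)    # the last two columns agree: 1, 3, 7, 15, 31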
Reflection 10.22 What does a subproblem look like in the search problem? What
would be the most useful subproblem to have an answer for?
In the search problem, a subproblem is to search for the target item in a smaller list.
Since we can only “look at” one item at a time, the most useful subproblem will be
to search for the target item in a sublist that contains all but one item. This way,
we can break the original problem into two parts: determine if the target is this one
item and determine if the target is in the rest of the list. It will be convenient to
have this one item be the first item, so we can then solve the subproblem with the
slice starting at index 1. Of course, if the first item is the one we are searching for,
we can avoid the recursive call altogether. Otherwise, we return the index that is
returned by the search of the smaller list.
We have already discussed one base case: if the target item is the first item in the
list, we simply return its index. Another base case would be when the list is empty.
In this case, the item for which we are searching is not in the list, so we return
−1. The following function (almost) implements the recursive algorithm we have
described.
def linearSearch(data, target):
    """Return the index of target in the list data, or -1 if it is absent.

    Parameters:
        data: a list object to search in
        target: an object to search for

    Return value: the index of target in data, or -1 if it is not found
    """
    if len(data) == 0:                        # base case: target cannot be in an empty list
        return -1
    if data[0] == target:                     # base case: found the target at the front
        return ??                             # <-- the red question marks
    return linearSearch(data[1:], target)
Reflection 10.24 What is the problem indicated by the red question marks? Why
can we not just return 0 in that case?
When we find the target at the beginning of the list and are ready to return its
index in the original list, we do not know what it was! We know that it is at index
0 in the current sublist being searched, but we have no way of knowing where this
sublist starts in the original list. Therefore, we need to add a third parameter to
the function that keeps track of the original index of the first item in the sublist
data. In each recursive call, we add one to this argument since the index of the new
front item in the list will be one more than that of the current front item.
With this third parameter, we need to now initially call the function like
position = linearSearch(data, target, 0)
We can now also make this function more efficient by eliminating the need to take
slices of the list in each recursive call. Since we now have the index of the first item
in the sublist under consideration, we can pass the entire list as a parameter each
time and use the value of first in the second base case.
As shown above, this change also necessitates a change in our first base case because
the length of the list is no longer decreasing to zero. Since the intent of the function
is to search in the list between indices first and len(data) - 1, we will consider
the list under consideration to be empty if the value of first is greater than the
last index in the list. Just to be safe, we also make sure that first is at least zero.
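Based on this description, the final version of the function might look like the following sketch (a reconstruction consistent with the surrounding text, not necessarily the book's exact listing):

def linearSearch(data, target, first):
    """ (docstring omitted) """
    if first < 0 or first > len(data) - 1:    # base case: nothing left to search
        return -1
    if data[first] == target:                 # base case: found the target at index first
        return first
    return linearSearch(data, target, first + 1)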
If T(n) denotes the number of comparisons that the recursive linear search makes when n items
remain to be searched, then the number of comparisons made in the base case in which the list is
empty (n = 0) is T(0) = c, for some constant c. In recursive cases,
there are additional comparisons to be made in the recursive call.
Reflection 10.25 How many more comparisons are made in the recursive call to
linearSearch?
The size of the sublist yet to be considered in each recursive call is n − 1, one less
than in the current instance of the function. Therefore, the number of comparisons
in each recursive call must be the number of comparisons required by a linear search
on a list with length n − 1, which is T (n − 1). So the total number of comparisons is
T (n) = T (n − 1) + c.
But this is not very helpful in determining what the time complexity of the linear
search is. To get this recurrence relation into a more useful form, we can think
about the recurrence relation as saying that the value of T (n) is equal to, or can be
replaced by, the value of T (n − 1) + c, as illustrated below:
T(n) → T(n − 1) + c

Likewise,

T(n − 1) = T(n − 1 − 1) + c = T(n − 2) + c,

T(n − 2) = T(n − 2 − 1) + c = T(n − 3) + c,

and

T(n − 3) = T(n − 3 − 1) + c = T(n − 4) + c.
This sequence of substitutions is illustrated in Figure 10.15. The right side of the
figure illustrates the accumulation of c’s as we proceed downward. Since c is the
number of comparisons made before each recursive call, these values on the right
represent the number of comparisons made so far. Notice that the number subtracted
from n in the argument of T at each step is equal to the multiplier in front of the
accumulated c’s at that step. In other words, to the right of each T (n − i), the
accumulated value of c’s is i ⋅ c. When we finally reach T (0), which is the same as
T (n − n), the total on the right must be nc. Finally, we showed above that T (0) = c,
so the total number of comparisons is (n + 1)c. This expression is called the closed
substitution          comparisons so far
T(n)
T(n − 1) + c          c
T(n − 2) + c          2c
T(n − 3) + c          3c
T(n − i) + c          ic
T(1) + c              (n − 1)c
T(0) + c              nc
c                     (n + 1)c

Figure 10.15 An illustration of how to derive a closed form for the recurrence relation
T(n) = T(n − 1) + c.
form of the recurrence relation because it does not involve any values of T (n). Since
(n + 1)c is proportional to n asymptotically, recursive linear search is a linear-time
algorithm, just like the iterative linear search. Intuitively, this should make sense
because the two algorithms essentially do the same thing: they both look at every
item in the list until the target is found.
Exercises
10.4.1. Our first version of the linearSearch function, without the first parameter, can
work if we only need to know whether the target is contained in the list or not.
Write a working version of this function that returns True if the target item is
contained in the list and False otherwise.
10.4.2. Unlike our final version of the linearSearch function, the function you wrote in
the previous exercise uses slicing. Is this still a linear-time algorithm?
10.4.3. Write a new version of recursive linear search that looks at the last item in the
list, and recursively calls the function with the sublist not containing the last item
instead.
10.4.4. Write a new version of recursive linear search that only looks at every other item in
the list for the target value. For example, linearSearch([1, 2, 3, 4, 2], 2, 0)
should return the index 4 because it would not find the target, 2, at index 1.
10.4.5. Write a recursive function
sumSearch(data, total, first)
that returns the first index in data, greater than or equal to first, for which the
sum of the values in data[:index + 1] is greater than or equal to total. If the
sum of all of the values in the list is less than total, the function should return −1.
For example, sumSearch([2, 1, 4, 3], 4) returns index 2 because 2 + 1 + 4 ≥ 4
but 2 + 1 < 4.
3. Combine the solutions to the subproblems into a solution for the original
problem.
In the Tower of Hanoi algorithm, the “combine” step was essentially free. Once the
subproblems had been solved, we were done. But other problems do require this
step at the end.
In this section, we will look at three more relatively simple examples of problems
that are solvable by divide and conquer algorithms.
Day 1 2 3 4 5 6 7 8 9 10
Price 3.90 3.60 3.65 3.71 3.78 4.95 3.21 4.50 3.18 3.53
It is tempting to look for the minimum price ($3.18) and then look for the maximum
price after that day. But this clearly does not work with this example. Even choosing
the second smallest price ($3.21) does not give the optimal answer. The most
profitable choice is to buy on day 2 at $3.60 and sell on day 6 at $4.95, for a profit
of $1.35 per share.
One way to find this answer is to look at all possible pairs of buy and sell dates,
and pick the pair with the maximum profit. (See Exercise 8.5.5.) Since there are
n(n − 1)/2 such pairs, this yields an algorithm with time complexity n². However,
there is a more efficient way. Consider dividing the list of prices into two equal-size
lists (or as close as possible). Then the optimal pair of dates must either reside in
the left half of the list, the right half of the list, or straddle the two halves, with
the buy date in the left half and sell date in the right half. This observation can be
used to design the following divide and conquer algorithm:
1. Divide the problem into two subproblems: (a) finding the optimal buy and
sell dates in the left half of the list and (b) finding the optimal buy and sell
dates in the right half of the list.

2. Conquer the two subproblems by recursively finding the optimal buy and sell
dates in each half.

3. Combine the solutions by choosing the best buy and sell dates from the left
half, the right half, and from those that straddle the two halves.
Reflection 10.27 Is there an easy way to find the best buy and sell dates that
straddle the two halves, with the buy date in the left half and sell date in the right
half ?
At first glance, it might look like the “combine” step would require another recursive
call to the algorithm. But finding the optimal buy and sell dates with this particular
restriction is actually quite easy. The best buy date in the left half must be the one
with the minimum price, and the best sell date in the right half must be the one
with the maximum price. So finding these buy and sell dates simply amounts to
finding the minimum price in the left half of the list and the maximum price in the
right half, which we already know how to do.
Before we write a function to implement this algorithm, let’s apply it to the list
of prices above:
[3.90, 3.60, 3.65, 3.71, 3.78, 4.95, 3.21, 4.50, 3.18, 3.53]
The algorithm divides the list into two halves, and recursively finds the maximum
profit in the left list [3.90, 3.60, 3.65, 3.71, 3.78] and the maximum profit
in the right list [4.95, 3.21, 4.50, 3.18, 3.53]. In each recursive call, the
algorithm is executed again but, for now, let’s assume that we magically get these
two maximum profits: 3.78 − 3.60 = 0.18 in the left half and 4.50 − 3.21 = 1.29 in the
right half. Next, we find the maximum profit possible by holding the stock from the
first half to the second half. Since the minimum price in the first half is 3.60 and
the maximum price in the second half is 4.95, this profit is 1.35. Finally, we return
the maximum of these three profits, which is 1.35.
Reflection 10.28 Since this is a recursive algorithm, we also need a base case.
What is the simplest list in which to find the optimal buy and sell dates?
A simple case would be a list with only two prices. Then obviously the optimal buy
and sell dates are days 1 and 2, as long as the price on day 2 is higher than the price
on day 1. Otherwise, we are better off not buying at all (or equivalently, buying and
selling on the same day). An even easier case would be a list with less than two
prices; then once again we either never buy at all, or buy and sell on the same day.
The following function implements this divide and conquer algorithm, but just
finds the optimal profit, not the actual buy and sell days. Finding these days requires
just a little more work, which we leave for you to think about as an exercise.
 1  def profit(prices):
 2      """Find the maximum achievable profit from a list of daily
 3      stock prices.
 4
 5      Parameter:
 6          prices: a list of daily stock prices
 7
 8      Return value: the maximum profit
 9      """
10
11      if len(prices) <= 1:                     # base case: one price or no prices
12          return 0
13
14      midIndex = len(prices) // 2              # divide: find the middle of the list
15      leftProfit = profit(prices[:midIndex])   # conquer: best profit in the left half
16      rightProfit = profit(prices[midIndex:])  # conquer: best profit in the right half
17      midProfit = max(prices[midIndex:]) - min(prices[:midIndex])   # combine
18
19      return max(leftProfit, rightProfit, midProfit)
The base case covers situations in which the list contains either one price or no
prices. In these cases, the profit is 0. In the divide step, we find the index of the
middle (or close to middle) price, which is the boundary between the left and right
halves of the list. We then call the function recursively for each half. From each of
these recursive calls, we get the maximum possible profits in the two halves. Finally,
[Figure 10.16 shows boxes (a) through (g): the call profit([3.90, 3.60, 3.65, 3.71]) and its six
recursive calls, together with their return values.]

Figure 10.16 A representation of the function calls in the recursive profit function.
The red numbers indicate the order in which the events happen.
for the combine step, we find the minimum price in the left half of the list and
the maximum price in the right half of the list. The final return value is then the
maximum of the maximum profit from the left half of the list, the maximum profit
from the right half of the list, and the maximum profit from holding the stock from
a day in the first half to a day in the second half.
Reflection 10.29 Call this function with the list of prices that we used in the
example above.
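Doing so (a quick check, not from the text) confirms the profit that we computed by hand:

prices = [3.90, 3.60, 3.65, 3.71, 3.78, 4.95, 3.21, 4.50, 3.18, 3.53]
print(profit(prices))    # approximately 1.35: buy at 3.60 on day 2, sell at 4.95 on day 6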
That this algorithm actually works may seem like magic at this point. But, rest
assured, like all recursive algorithms, there is a perfectly good reason why it works.
The process is sketched out in Figure 10.16 for a small list containing just the
first four prices in our example. As in Figures 10.11 and 10.12, each bold rectangle
represents an instance of a function call. Each box contains the function’s name
and argument, and a representation of how the return value is computed. In all
but the base cases, this is the maximum of leftProfit (represented by left),
rightProfit (represented by right), and midProfit (represented by mid). At the
top of Figure 10.16, the profit function is called with a list of four prices. This
results in calling profit recursively with the first two prices and the last two prices.
The first recursive call (on line 15) is represented by box (b). This call to profit
results in two more recursive calls, labeled (c) and (d), each of which is a base case.
These two calls both return the value 0, which is separately assigned to leftProfit
and rightProfit (left and right) in box (b). The profit from holding the stock in
the combine step in box (b), -0.30, is assigned to midProfit (mid). And then the
maximum value, which in this case is 0, is returned back to box (a) and assigned to
leftProfit. Once this recursive call returns, the second recursive call (on line 16) is
made from box (a), resulting in a similar sequence of function calls, as illustrated in
boxes (e)–(g). This second recursive call from (a) results in 0.06 being assigned to
rightProfit. Finally, the maximum of leftProfit, rightProfit, and midProfit,
which is 3.71 - 3.60 = 0.11, is returned by the original function call. The red
numbers on the arrows in the figure indicate the complete ordering of events.
Navigating a maze
Suppose we want to design an algorithm to navigate a robotic rover through an
unknown, obstacle-filled terrain. For simplicity, we will assume that the landscape
is laid out on a grid and the rover is only able to “see” and move to the four grid
cells to its east, south, west, and north in each step, as long as they do not contain
obstacles that the rover cannot move through.
To navigate the rover to its destination on the grid (or determine that the
destination cannot be reached), we can use a technique called depth-first search.
The depth-first search algorithm explores a grid by first exploring in one direction
as far as it can from the source. Then it backtracks to follow paths that branch off
in each of the other three directions. Put another way, a depth-first search divides
the problem of searching for a path to the destination into four subproblems: search
for a path starting from the cell to the east, search for a path starting from the cell
to the south, search for a path starting from the cell to the west, and search for a
path starting from the cell to the north. To solve each of these subproblems, the
algorithm follows this identical procedure again, just from a different starting point.
In terms of the three divide and conquer steps, the depth-first search algorithm
looks like this:
1. Divide the problem into four subproblems. Each subproblem searches for a
path that starts from one of the four neighboring cells.
[Figure 10.17: a sequence of 5 × 5 grids, with rows and columns numbered 0 through 4, showing the
progress of the depth-first search from the source toward the destination; cells on the current path
are colored blue, and cells that have been visited but abandoned are colored light blue.]
From cell (2, 2), the cells to the east, south, and north are blocked, and the cell to the west has already
been visited. Therefore, the depth-first search algorithm returns failure to the cell
at (2, 1). We color cell (2, 2) light blue to indicate that it has been visited, but is
no longer on the path to the destination. From cell (2, 1), the algorithm has already
looked east, so it now moves south to (3, 1), as shown in Figure 10.17(d). In the next
step, shown in Figure 10.17(e), the algorithm moves south again to (4, 1) because
the cell to the east is blocked.
Reflection 10.30 When it is at cell (3, 1), why does the algorithm not “see” the
destination in cell (3, 0)?
It does not yet “see” the destination because it looks east and south before it
looks west. Since there is an open cell to the south, the algorithm will follow that
possibility first. In the next steps, shown in Figure 10.17(f)–(g), the algorithm is able
to move east, and in Figure 10.17(h), it is only able to move north. At this point,
the algorithm backtracks to (4, 1) over three steps, as shown in Figure 10.17(i)–(k),
because all possible directions have already been attempted from cells (3, 3), (4, 3),
and (4, 2). From cell (4, 1), the algorithm next moves west to cell (4, 0), as shown in
Figure 10.17(l), because it has already moved east and there is no cell to the south.
Finally, from cell (4, 0), it cannot move east, south, or west; so it moves north where
it finally finds the destination.
The final path shown in blue illustrates a path from the source to destination.
Of course, this is not the path that the algorithm followed, but this path can now
be remembered for future trips.
Reflection 10.31 Did the depth-first search algorithm find the shortest path?
A depth-first search is not guaranteed to find the shortest, or even a short, path.
But it will find a path if one exists. Another algorithm, called a breadth-first search,
can be used to find the shortest path.
Reflection 10.32 What is the base case of the depth-first search algorithm? For
what types of source cells can the algorithm finish without making any recursive
calls?
There are two kinds of base cases in depth-first search, corresponding to the two
possible outcomes. One base case occurs when the source cell is not a “legal” cell
from which to start a new search. These source cells are outside the grid, blocked, or
already visited. In these cases, we simply return failure. The other base case occurs
when the source cell is the same as the destination cell. In this case, we return
success.
The depth-first search algorithm is implemented by the following function. The
function returns True (success) if the destination was reached by a path and False
(failure) if the destination could not be found.
def dfs(grid, source, dest):
    """Perform a depth-first search for a path from source to dest.

    Parameters:
        grid: a 2D grid (list of lists)
        source: a (row, column) tuple to start from
        dest: a (row, column) tuple to reach

    Return value: True if dest was reached, False otherwise
    """
The variable names BLOCKED, VISITED, and OPEN represent the possible status of
each cell. For example, the grid in Figure 10.17 is represented with
grid = [[BLOCKED, OPEN, BLOCKED, OPEN, OPEN],
[OPEN, OPEN, BLOCKED, OPEN, OPEN],
[BLOCKED, OPEN, OPEN, BLOCKED, OPEN],
[OPEN, OPEN, BLOCKED, OPEN, BLOCKED],
[OPEN, OPEN, OPEN, OPEN, BLOCKED]]
When a cell is visited, its value is changed from OPEN to VISITED by the dfs
function. There is a program available on the book web site that includes additional
turtle graphics code to visualize how the cells are visited in this depth-first search.
Download this program and run it on several random grids.
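Based on the description above, the body of dfs might look like the following sketch (the numeric values chosen here for BLOCKED, OPEN, and VISITED are placeholders of our own; the complete program on the book web site may differ in its details):

BLOCKED = 0    # a cell containing an obstacle
OPEN = 1       # a cell the rover can move to
VISITED = 2    # an open cell that the search has already explored

def dfs(grid, source, dest):
    """Depth-first search for a path from source to dest; return True if one exists."""
    row, column = source
    if (row < 0 or row >= len(grid) or column < 0 or column >= len(grid[0])
            or grid[row][column] != OPEN):
        return False                                        # base case: not a legal cell
    if source == dest:
        return True                                         # base case: destination reached
    grid[row][column] = VISITED                             # mark this cell as visited
    for neighbor in [(row, column + 1), (row + 1, column),  # east, south,
                     (row, column - 1), (row - 1, column)]: # west, north
        if dfs(grid, neighbor, dest):
            return True
    return False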
Reflection 10.33 Our dfs function returns a Boolean value indicating whether
the destination was reached, but it does not actually give us the path (as marked in
blue in Figure 10.17). How can we modify the function to do this?
This modification is actually quite simple, although it may take some time to
understand how it works. The idea is to add another parameter, a list named path,
to which we append each cell after we mark it as visited. The values in this list
contain the sequence of cells visited in the recursive calls. However, we remove the
cell from path if we get to the end of the function where we return False because
getting this far means that this cell is not part of a successful path after all. In our
example in Figure 10.17, initially coloring a cell blue is analogous to appending that
cell to the path, while recoloring a cell light blue when backtracking is analogous
to removing the cell from the path. We leave an implementation of this change as
an exercise. Two projects at the end of this chapter demonstrate how depth-first
search can be used to solve other problems as well.
Exercises
Write a recursive divide and conquer function for each of the following problems. Each of
your functions should contain at least two recursive calls.
10.5.9. In Section 10.2, we developed a recursive function named binary that returned a
list of all binary strings with a given length. We can design an alternative divide
and conquer algorithm for the same problem by using the following insight. The
list of n-bit binary strings with the common prefix p (with length less than n) is
the concatenation of
(a) the list of n-bit binary strings with the common prefix p + ’0’ and
(b) the list of n-bit binary strings with the common prefix p + ’1’.
For example, the list of all 4-bit binary strings with the common prefix 01 is the
list of 4-bit binary strings with the common prefix 010 (namely, 0100 and 0101)
plus the list of 4-bit binary strings with the common prefix 011 (namely, 0110 and
0111).
Write a recursive divide and conquer function
binary(prefix, n)
that uses this insight to return a list of all n-bit binary strings with the given
prefix. To compute the list of 4-bit binary strings, you would call the function
initially with binary(’’, 4).
Formal grammars
A formal grammar defines a set of productions (or rules) for constructing strings of
characters. For example, the following very simple grammar defines three productions
that allow for the construction of a handful of English sentences.
S→N V
N → our dog ∣ the school bus ∣ my foot
V → ate my homework ∣ swallowed a fly ∣ barked
The first production, S → N V, says that the symbol S (a special start symbol) may
be replaced by the string N V . The second production states that the symbol N
(short for “noun phrase”) may be replaced by one of three strings: our dog, the
school bus, or my foot (the vertical bar (|) means “or”). The third production
states that the symbol V (short for “verb phrase”) may be replaced by one of three
other strings. The following sequence represents one way to use these productions
to derive a sentence:

S ⇒ N V ⇒ my foot V ⇒ my foot swallowed a fly
The derivation starts with the start symbol S. Using the first production, S is
replaced with the string N V . Then, using the second production, N is replaced
with the string “my foot”. Finally, using the third production, V is replaced with
“swallowed a fly.”
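To make this concrete (a small sketch of our own, anticipating the dictionary representation of productions used for L-systems later in the chapter), we can carry out this derivation in Python by choosing one alternative for N and one for V:

productions = {'S': 'N V', 'N': 'my foot', 'V': 'swallowed a fly'}

sentence = 'S'
sentence = sentence.replace('S', productions['S'])   # S -> N V
sentence = sentence.replace('N', productions['N'])   # N -> my foot
sentence = sentence.replace('V', productions['V'])   # V -> swallowed a fly
print(sentence)                                       # my foot swallowed a fly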
We can see from the following derivation that parallel application of the single
production very quickly leads to very long strings:
F ⇒ F-F++F-F
⇒ F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F
⇒ F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F-F-F++F-F-F-F++F-F++F-F++F-F-F-F
++F-F++F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F-F-F++F-F-F-F++F-F++F-F++
F-F-F-F++F-F
⇒ F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F-F-F++F-F-F-F++F-F++F-F++F-F-F-F
++F-F++F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F-F-F++F-F-F-F++F-F++F-F++
F-F-F-F++F-F-F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F-F-F++F-F-F-F++F-F+
+F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F-F-F++F-F-F-
F++F-F++F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F-F-F+
+F-F-F-F++F-F++F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F++F-F++F-F-F-F++F
-F-F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F-F-F++F-F-F-F++F-F++F-F++F-F-
F-F++F-F-F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F++F-
F++F-F-F-F++F-F-F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F
⇒⋯
In the first step of the derivation, the production is applied to replace F with
F-F++F-F. In the second step, all four instances of F are replaced with F-F++F-F.
The same process occurs in the third step, and the resulting string grows very
quickly. The number of strings generated from the axiom in a derivation is called
the depth of the derivation. If we stopped the above derivation after the last string
shown, then its depth would be four because four strings were generated from the
axiom.
Like a recursive solution, we are applying the same algorithm (“apply all applicable
productions”) at each stage of a derivation. At each stage, the inputs are growing
larger, and we are getting closer to our desired depth. In other words, we can think
of each step in the derivation as applying the algorithm to a longer string, but
with the depth decreased by one. For example, in the derivation above, applying
a derivation with depth four to the axiom F is the same as applying a derivation
with depth three to F-F++F-F, which is the same as applying a derivation with depth
two to the next string, etc. In general, applying a derivation with depth d to a
string is the same as applying a derivation with depth d − 1 to that string after the
productions have been applied one time. As a recursive algorithm in pseudocode,
this process looks like this:
We will later look more closely at how to represent the productions and implement
the part of the function represented by the comment.
As mentioned above, each symbol in an L-system represents a turtle graphics
command:

• F means "move the turtle forward a fixed distance, drawing a line."
• + means "turn the turtle to the right."
• - means "turn the turtle to the left."
[Figure 10.18: the curves drawn from the strings derived from this L-system at depths 1, 2, 3, and 4.]
The angle of each turn made for a + or - symbol must be specified by the L-system. For the
L-system above, we will specify an angle of 60 degrees:
Axiom: F
Production: F → F-F++F-F
Angle: 60 degrees
Reflection 10.35 Carefully follow the turtle graphics instructions (on graph pa-
per) in each of the first two strings derived from this L-system (F-F++F-F and
F-F++F-F-F-F++F-F++F-F++F-F-F-F++F-F). Do the pictures look familiar?
An annotated sketch of the shorter string is shown below.
[Annotated sketch: the four forward moves (F) of F-F++F-F, separated by a left turn (-), two right
turns (++), and a final left turn (-).]
Starting on the left, we first move forward. Then we turn left 60 degrees and move
forward again. Next, we turn right twice, a total of 120 degrees. Finally, we move
forward, turn left again 60 degrees, and move forward one last time. Does this look
familiar? As shown in Figure 10.18, the strings derived from this L-system produce
Koch curves. Indeed, Lindenmayer systems produce fractals!
Here is another example:
Axiom: FX
Productions: X → X-YF
Y → FX+Y
Angle: 90 degrees
Implementing L-systems
To implement Lindenmayer systems in Python, we need to answer three questions: how should we represent the strings, how should we represent the productions, and how should we repeatedly apply the productions?
Clearly, the axiom and subsequent strings generated by an L-system can be stored
as Python string objects. The productions can conveniently be stored in a dictionary.
For each production, we create an entry in the dictionary with key equal to the
symbol on the lefthand side and value equal to the string on the righthand side.
For example, the productions for the dragon curve L-system would be stored as the
following dictionary:
{’X’: ’X-YF’, ’Y’:’FX+Y’}
newString = ''
for symbol in string:
    if symbol in productions:
        newString = newString + productions[symbol]
    else:
        newString = newString + symbol
To apply the productions again, we want to repeat this process on the new string.
This can be accomplished recursively by calling the same function on newString. To
control how many times we apply the productions, we will need a parameter named
depth that we decrement with each recursive call, precisely the way we did with
fractals in Section 10.1. When depth is 0, we do not want to apply the productions
at all, so we return the string untouched.
def derive(string, productions, depth):
    """Apply the productions of a Lindenmayer system to a string.

    Parameters:
        string: a string of L-system symbols
        productions: a dictionary containing L-system productions
        depth: the number of times the productions are applied

    Return value:
        the new string reflecting the application of productions
    """
    if depth == 0:                    # base case: leave the string untouched
        return string
    newString = ''
    for symbol in string:             # apply every applicable production once
        if symbol in productions:
            newString = newString + productions[symbol]
        else:
            newString = newString + symbol
    return derive(newString, productions, depth - 1)
def main():
    kochProductions = {'F': 'F-F++F-F'}
    result = derive('F', kochProductions, 3)
    print(result)

main()
The main function above derives a string for the Koch curve with depth 3.
Reflection 10.36 Run the program above. Then modify the main function so that
it derives the depth 4 string for the dragon curve.
Of course, Lindenmayer systems are much more satisfying when you can draw them.
We will leave that to you as an exercise. In Project 10.1, we explore how to augment
L-systems so they can produce branching shapes that closely resemble real plants.
Exercises
10.6.1. Write a function
drawLSystem(tortoise, string, angle, distance)
that draws the picture described by the given L-system string. Your function
should correctly handle the special symbols we discussed in this section (F, +, -).
Any other symbols should be ignored. The parameters angle and distance give
the angle the turtle turns in response to a + or - command, and the distance the
turtle draws in response to an F command, respectively. For example, the following
program should draw the smallest Koch curve.
import turtle

def main():
    george = turtle.Turtle()
    screen = george.getscreen()
    george.hideturtle()

    drawLSystem(george, 'F-F++F-F', 60, 20)    # the smallest Koch curve

    screen.update()
    screen.exitonclick()

main()
10.6.2. Apply your drawLSystem function from Exercise 10.6.1 to each of the following
strings:
(a) F-F++F-F++F-F++F-F++F-F++F-F (angle = 60 degrees, distance = 20)
(b) FX-YF-FX+YF-FX-YF+FX+YF-FX-YF-FX+YF+FX-YF+FX+YF (angle = 90 de-
grees, distance = 20)
10.6.3. Write a function
lsystem(axiom, productions, depth, angle, distance, position, heading)
that calls the derive function with the first three parameters, and then calls your
drawLSystem function from Exercise 10.6.1 with the new string and the values of
angle and distance. The last two parameters specify the initial position and
heading of the turtle, before drawLSystem is called. This function combines all of
your previous work into a single L-system generator.
10.6.4. Call your lsystem function from Exercise 10.6.3 on each of the following L-systems:
(a)
Axiom: F
Production: F → F-F++F-F
Angle: 60 degrees
distance = 10, position = (−400, 0), heading = 0, depth = 4
(b)
Axiom: FX
Productions: X → X-YF
Y → FX+Y
Angle: 90 degrees
distance = 5, position = (0, 0), heading = 0, depth = 12
(c)
Axiom: F-F-F-F
Production: F → F-F+F+FF-F-F+F
Angle: 90 degrees
distance = 3, position = (−100, −100), heading = 0, depth = 3
(d)
Axiom: F-F-F-F
Production: F → FF-F-F-F-F-F+F
Angle: 90 degrees
distance = 5, position = (0, −200), heading = 0, depth = 3
10.6.5. By simply changing the axiom, we can turn the L-system for the Koch curve
discussed in the text into an L-system for a Koch snowflake composed of three Koch
curves. Show what the axiom needs to be. Use your lsystem function from
Exercise 10.6.3 to work out and test your answer.
10.7 SUMMARY
Some problems, like many natural objects, “naturally” exhibit self-similarity. In
other words, a problem solution is simply stated in terms of solutions to smaller
versions of itself. It is often easier to see how to solve such problems recursively
than it is iteratively. An algorithm that utilizes this technique is called recursive.
We suggested answering five questions to solve a problem recursively:
1. What does a subproblem look like?
2. Which subproblem solution would be the most useful for solving the original
problem?
3. How do we find the solution to the original problem using this subproblem
solution? Implement this as the recursive step of our recursive function.
4. What are the simplest subproblems that we can solve non-recursively, and what
are their solutions? Implement your answer as the base case of the recursive
function.
5. For any possible parameter value, will the recursive calls eventually reach the
base case?
We designed recursive algorithms for about a dozen different problems in this
chapter to illustrate how widely recursion can be applied. But learning how to solve
problems recursively definitely takes time and practice. The more problems you
solve, the more comfortable you will become!
[Figure 10.20: the contents of the stack after each of the six operations listed in Project 10.1:
[1], [1, 2], [1], [1, 3], [1], and finally an empty stack.]
10.9 PROJECTS
*Project 10.1 Lindenmayer’s beautiful plants
For this project, we assume you have read Section 10.6.
Aristid Lindenmayer was specifically interested in modeling the branching behavior
of plants. To accomplish this, we need to introduce two more symbols: [ and ]. For
example, consider the following L-system:
Axiom: X
Productions: X → F[-X]+X
F → FF
Angle: 30 degrees
These two new symbols involve the use of a simple data structure called a stack .
A stack is simply a list in which we only append to and delete from one end. The
append operation is called a push and the delete operation is called a pop (hence
the name of the list pop method). For example, consider the following sequence of
operations on a hypothetical stack object named stack, visualized in Figure 10.20.
1. stack.push(1)
2. stack.push(2)
3. x = stack.pop()
4. stack.push(3)
5. y = stack.pop()
6. z = stack.pop()
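Since a Python list's append and pop methods provide exactly these operations, the sequence above can be tried directly (a small illustration, not from the text):

stack = []           # an empty list serves as the stack
stack.append(1)      # 1. push 1          stack is now [1]
stack.append(2)      # 2. push 2          stack is now [1, 2]
x = stack.pop()      # 3. pop -> x is 2   stack is now [1]
stack.append(3)      # 4. push 3          stack is now [1, 3]
y = stack.pop()      # 5. pop -> y is 3   stack is now [1]
z = stack.pop()      # 6. pop -> z is 1   stack is now []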
• [ means “push the turtle’s current position and heading on a stack,” and
• ] means “pop a position and heading from the stack and set the turtle’s
current position and heading to these values.”
Let’s now return to the Lindenmayer system above. Applying the productions of
this Lindenmayer system twice results in the following string.
X ⇒ F[-X]+X ⇒ FF[-F[-X]+X]+F[-X]+X
The X symbols are used only in the derivation process and do not have any meaning
for turtle graphics, so we simply skip them when we are drawing. So the string
FF[-F[-X]+X]+F[-X]+X represents the simple “tree” below. On the left is a drawing
of the tree; on the right is a schematic we will use to explain how it was drawn.
[Schematic of the tree: the turtle starts at the origin O heading north, moves up through points a
and b, then branches up and to the left to point c and up and to the right to point d.]
The turtle starts at the origin, marked O, with a heading of 90 degrees (north). The
first two F symbols move the turtle forward from the origin to point a and then
point b. The next symbol, [, means that we push the current position and heading
(b, 90 degrees) on the stack.
stack: (b, 90 degrees)
The next two symbols, -F, turn the turtle left 30 degrees (to a heading of 120 degrees)
and move it forward, to point c. The next symbol is another [, which pushes the
current position and heading, (c, 120 degrees), on the stack. So now the stack
contains two items—(b, 90 degrees) and (c, 120 degrees)—with the last item on top.
stack (top to bottom): (c, 120 degrees), (b, 90 degrees)
The next three symbols, -X], turn the turtle left another 30 degrees (to a heading of
150 degrees), but then restore its heading to 120 degrees by popping (c, 120 degrees)
from the stack.
stack: (b, 90 degrees)
The next three symbols, +X], turn the turtle 30 degrees to the right (to a heading
of 90 degrees), but then pop (b, 90 degrees) from the stack, moving the turtle back
to point b, heading north.
(So, in effect, the previous six symbols, [-X]+X did nothing.) The next two symbols,
+F, turn the turtle 30 degrees to the right (to a heading of 60 degrees) and move it
forward to point d. Similar to before, the last six symbols, [-X]+X, while pushing
states onto the stack, have no visible effect.
Continued applications of the productions in the L-system above will produce
strings that draw the same sequence of trees that we created in Section 10.1. More
involved L-systems will produce much more interesting trees. For example, the
following two L-systems produce the trees in Figure 10.21.
Tree on the left:
Axiom: X
Productions: X → F-[[X]+X]+F[+FX]-X
             F → FF
Angle: 25 degrees

Tree on the right:
Axiom: F
Production: F → FF-[-F+F+F]+[+F-F-F]
Angle: 22.5 degrees
Figure 10.21 Two trees from The Algorithmic Beauty of Plants ([43], p. 25).
to a string situated inside matching square brackets. We will pass the index of the
first character after the left square bracket ([) as an additional parameter:
drawLSystem(tortoise, string, startIndex, angle, distance)
The function will return the index of the matching right square bracket (]). (We
can pretend that there are imaginary square brackets around the entire string
for the initial call of the function, so we initially pass in 0 for startIndex.) The
recursive function will iterate over the indices of the characters in string, starting at
startIndex. (Use a while loop, for reasons we will see shortly.) When it encounters
a non-bracket character, it should do the same thing it did earlier. When the function
encounters a left bracket, it will save the turtle’s current position and heading, and
then recursively call the function with startIndex assigned to the index of the
character after the left bracket. When this recursive call returns, the current index
should be set to the index returned by the recursive call, and the function should
reset the turtle’s position and heading to the saved values. When it encounters a
right bracket, the function will return the index of the right bracket.
For example, the string below would be processed left to right but when the first
left bracket is encountered, the function would be called recursively with index 5
passed in for startIndex.
[ F F F F [ - F F [ - F [ - X ] + X ] + F [ - X ] + X ] + F F [ - F [ - X ] + X ] + F [ - X ] + X ]

(The outer brackets shown are the imaginary ones. Index 0 marks the first character of the actual
string, index 5 is the value passed as startIndex to the first recursive call, drawLSystem(..., 5, ...),
and index 26 is the position of the matching right bracket.)
This recursive call will return 26, the index of the corresponding right bracket, and
the + symbol at index 27 would be the next character processed in the loop. The
function will later make two more recursive calls, marked with the two additional
braces above.
Using this description, rewrite drawLSystem as a recursive function that does not
use an explicit stack. Test your recursive function with the same tree-like L-systems,
as above.
Question 10.1.3 Why can the stack used in Part 1 be replaced by recursion in
Part 2? Referring back to Figures 10.11 and 10.12, how are recursive function calls
similar to pushing and popping from a stack?
Question 10.1.4 Use your program to draw the following additional Lindenmayer
systems. For each one, set distance = 5, position = (0, −300), heading = 90, and
depth = 6.
Axiom: X
Productions: X → F[+X]F[-X]+X
F → FF
Angle: 30 degrees
Axiom: H
Productions: H → HFX[+H][-H]
X → X[-FFF][+FFF]FX
Angle: 25.7 degrees
These two shapes actually have the same perimeter, but the shape on the right
contains only about seven percent of the area of the circle on the left.
In some states, the majority political party has control over periodic redistricting.
Often, the majority exploits this power by drawing district lines that favor their
chances for re-election, a practice that has come to be known as gerrymandering.
These districts often take on bizarre, non-compact shapes.
Several researchers have developed algorithms that redistrict states objectively
to optimize various measures of compactness. For example, the image below on the
left shows a recent district map for the state of Ohio. The image on the right shows
a more compact district map.2
2
These figures were produced by an algorithm developed by Brian Olson and retrieved from
http://bdistricting.com
1. First, we can measure the mean of the distance between each voter and the
centroid of the district. The centroid is the “average point,” computed by
averaging all of the x and y values inside the district. We might expect a
gerrymandered district to have a higher mean distance than a more compact
district. Since we will not actually have information fine enough to compute
this value for individual voters, we will compute the average distance between
the centroid and each pixel in the image of the district.
2. Second, we can measure the standard deviation of the distance between each
pixel and the centroid of the district. The standard deviation measures the
degree of variability from the average. Similar to above, we might expect a
gerrymandered district to have higher variability in this distance. The standard
deviation of a list of values (in this case, distances) is the square root of the
variance. (See Exercise 8.1.10.)
3. Third, we can compare the area of the district to the area of a (perfectly
compact) circle with the same perimeter. In other words, we can define
compactness = (area of the district) / (area of a circle with the same perimeter as the district).

To compute the denominator, we need to
express the area of a circle, which is normally expressed in terms of the radius r (i.e., πr²), in terms of the perimeter p instead. To do this, recall that the formula for the perimeter of a circle is p = 2πr. Therefore, r = p/(2π). Substituting this into the standard formula, we find that the area of a circle with perimeter p is

    πr² = π (p/(2π))² = πp²/(4π²) = p²/(4π).
Finally, incorporating this into the formula above, we have
    compactness = A / (p²/(4π)) = 4πA/p²,

where A is the area of the district and p is its perimeter.
To compute values for the first and second compactness measures, we need a list of
the coordinates of all of the pixels in each district. With this list, we can find the
centroid of the district, and then compute the mean and standard deviation of the
distances from this centroid to the list of coordinates. To compute the third metric,
we need to be able to determine the perimeter and area of each district.
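As a concrete illustration, the three measures might be computed along the following lines once a district's pixel coordinates, perimeter, and area are known. (The function name and parameter names here are only illustrative.)

import math

def compactnessMetrics(points, perimeter, area):
    """Compute the three compactness measures for one district.

    points is a list of (x, y) pixel coordinates inside the district;
    perimeter and area are the values found by the flood fill.
    """

    n = len(points)
    centroidX = sum(x for (x, y) in points) / n            # the "average point"
    centroidY = sum(y for (x, y) in points) / n
    distances = [math.sqrt((x - centroidX) ** 2 + (y - centroidY) ** 2)
                 for (x, y) in points]
    meanDistance = sum(distances) / n                      # first measure
    variance = sum((d - meanDistance) ** 2 for d in distances) / n
    stdDeviation = math.sqrt(variance)                     # second measure
    circleRatio = 4 * math.pi * area / (perimeter ** 2)    # third measure
    return meanDistance, stdDeviation, circleRatio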
Start with a distance threshold for closeness of 100, and adjust as needed.
• points will be a list of coordinates of the pixels that the algorithm visited.
When you call the function, initially pass in an empty list. When the function
returns, this list should be populated with the coordinates in the district.
Your function should return a tuple containing the total perimeter and total area
obtained from a DFS starting at (x, y). The perimeter can be obtained by counting
the number of times the algorithm reaches a pixel that is outside of the region (think
base case), and the area is the total number of pixels that are visited inside the
region. For example, as shown below, the region from Figure 10.22 has perimeter 18
and area 11. The red numbers indicate the order in which the flood fill algorithm
will count each border.
(Figure: the example region, with its 18 units of border numbered in the order the flood fill counts them.)
The centroid of these points, derived by averaging the x and y coordinates, is
(28/11, 15/11) ≈ (2.54, 1.36). Then the mean distance to the centroid is approxi-
mately 1.35 and the standard deviation is approximately 0.54.
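The recursive structure of such a flood fill is sketched below. To keep the image-handling details out of the way, this version works on a grid of cell states ('open', 'visited', or 'outside') rather than on an Image object; in the project itself you would test pixel colors instead and mark visited pixels by coloring them white.

def measureDistrict(grid, x, y, points):
    """Flood fill from (x, y); return (perimeter, area) of the region containing it."""

    if (y < 0 or y >= len(grid) or x < 0 or x >= len(grid[0])
            or grid[y][x] == 'outside'):
        return 1, 0                      # base case: crossed the boundary, one unit of perimeter
    if grid[y][x] == 'visited':
        return 0, 0                      # this pixel has already been counted
    grid[y][x] = 'visited'
    points.append((x, y))
    perimeter = 0
    area = 1                             # this pixel contributes 1 to the area
    for (dx, dy) in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
        p, a = measureDistrict(grid, x + dx, y + dy, points)
        perimeter = perimeter + p
        area = area + a
    return perimeter, area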
Next, write a function compactness(imageName, districts) that computes the three compactness measurements that we discussed above for
the district map with file name imageName. You can find maps for several states on
the book web site. The second parameter districts will contain a list of starting
coordinates (two-element tuples) for the districts on the map. These are also available
on the book web site. Your function should iterate over this list of tuples, and call
your measureDistrict function with x and y set to the coordinates in each one.
The function should return a three-element tuple containing the average value, over
all of the districts, for each of the three metrics. To make sure your flood fill is
working properly, it will also be helpful to display the map (using the show method
of the Image class) and update it (using the update method of the Image class) in
each iteration. You should see the districts colored white, one by one.
For at least three states, compare the existing district map and the more compact
district map, using the three compactness measures. What do you notice?
To drive your program, write a main function that calls the compactness function
with a particular map, and then reports the results. As always, think carefully about
the design of your program and what additional functions might be helpful.
Technical notes
1. The images supplied on the book web site have low resolution to keep the
depth of the recursive calls in check. As a result, your compactness results will
only be a rough approximation. Also, shrinking the image sizes caused some
of the boundaries between districts to become “fuzzy.” As a result, you will
see some unvisited pixels along these boundaries when the flood fill algorithm
is complete.
2. You may also need to raise Python's maximum recursion depth so that the flood fill can finish (see the sketch below). However, set this value carefully. Use the smallest value that works. Setting
the maximum recursion depth too high may crash Python on your computer! If
you cannot find a value that works on your computer, try shrinking the image
file instead (and scaling the starting coordinates appropriately).
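For example, a line like the following near the top of your program raises the limit (the value shown is only an illustrative starting point):

import sys
sys.setrecursionlimit(10000)    # use the smallest value that lets the flood fill finish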
Consider a square grid of sites, each of which is either open or blocked. Now imagine that we pour a liquid uniformly over the top of the grid. The liquid
will fill the open sites at the top and percolate to connected open sites until the
liquid fills all of the open sites that are reachable from an open site at the top. We
say that the grid percolates if at least one open site in the bottom row is full at the
end. For example, the grid below on the left percolates, while the grid on the right
does not.
This system can be used to model a variety of naturally occurring phenomena. Most
obviously, it can model a porous rock being saturated with water. Similarly, it can
model an oil (or natural gas) field; in this case, the top of the grid represents the oil
underground and percolation represents the oil reaching the surface. Percolation
systems can also be used to model the flow of current through a network of transistors,
whether a material conducts electricity, the spread of disease or forest fires, and
even evolution.
We can represent how “porous” a grid is by a vacancy probability, the probability
that any particular site is open. For a variety of applications, scientists are interested
in knowing the probability that a grid with a particular vacancy probability will
percolate. In other words, we would like to know the percolation probability for any
vacancy probability p. Despite decades of research, there is no known mathematical
solution to this problem. Therefore, we will use a Monte Carlo simulation to estimate
it.
Recall from Chapter 5 that a Monte Carlo simulation flips coins (metaphorically
speaking) at each step in a computation. For example, to estimate the distance
traveled by a random walk, we performed many random walks and took the average
final distance from the origin. In this problem, for any given vacancy probability
p, we will create many random grids and then test whether they percolate. By
computing the number that do percolate divided by the total number of trials, we
will estimate the percolation probability for vacancy probability p.
First, write a function that takes a grid as its first parameter and decides whether that grid percolates. The second parameter is a Boolean
that indicates whether the grid should also be drawn using turtle graphics. There
is a skeleton program on the book web site in which the drawing code has already
been written. Notice that some of the functions include a Boolean parameter that
indicates whether the percolation should be visualized.
Next, write a function that plots vacancy probability on the x axis and percolation probability on the
y axis for vacancy probabilities minP, minP + stepP, . . . , maxP. Each percolation
probability should be derived from a Monte Carlo simulation with the given number
of trials.
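Putting the pieces together, the simulation might be organized like the sketch below. The names makeRandomGrid and percolates are illustrative stand-ins for your grid-construction function and for the percolation test described above (the False argument suppresses the drawing).

import random

def makeRandomGrid(n, p):
    """Return an n x n grid in which each site is open (True) with probability p."""
    return [[random.random() < p for col in range(n)] for row in range(n)]

def percolationProbability(n, p, trials):
    """Estimate the probability that an n x n grid with vacancy probability p percolates."""
    count = 0
    for trial in range(trials):
        grid = makeRandomGrid(n, p)
        if percolates(grid, False):     # the percolation test described above
            count = count + 1
    return count / trials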
You should discover a phase transition: if the vacancy probability is less than
a particular threshold value, the system almost certainly does not percolate; but
if the vacancy probability is greater than this threshold value, the system almost
certainly does percolate. What is this threshold value?
CHAPTER 11
Organizing data
Search is an unsolved problem. We have a good 90 to 95% of the solution, but there is a lot
to go in the remaining 10%.
In this age of “big data,” we take search algorithms for granted. Without web search sites that are able to sift through billions of pages in a fraction of a second,
the web would be practically useless. Similarly, large data repositories, such as those
maintained by the U.S. Geological Survey (USGS) and the National Institutes of
Health (NIH), would be useless without the ability to search for specific information.
Even the operating systems on our personal computers now supply integrated search
capabilities to help us navigate our increasingly large collections of files.
To enable fast access to these sets of data, they must be organized in some way. A
method for organizing data is known as a data structure. Hidden data structures
in the implementations of the list and dictionary abstract data types enable their
methods to access and modify their contents quickly. (The data structure behind a
dictionary was briefly explained in Box 8.2.) In this chapter, we will explore one
of the simplest ways to organize data—maintaining it in a sorted list—and the
benefits this can provide. We will begin by developing a significantly faster search
algorithm that can take advantage of knowing that the data is sorted. Then we will
design three algorithms to sort data in a list, effectively creating a sorted list data
structure. If you continue to study computer science, you can look forward to seeing
many more sophisticated data structures in the future that enable a wide variety of
efficient algorithms.
Reflection 11.1 Can we apply this idea to searching in a sorted list of values?
We can search a list in a similar way, except that we usually do not know much
about the distribution of the list’s contents, so it is hard to make that first guess
about where to start. In this case, the best strategy is to start in the middle. After
comparing the item we seek to the middle item, we continue on the half of the list
that must contain the item. Because we are effectively dividing the list into two
halves each time, this algorithm is called binary search.
For example, suppose we wanted to search for the number 70 in the following
sorted list of numbers. (We will use numbers instead of words in our example to
save space.)
(Illustration: a sorted list of 12 values at indices 0 through 11, with left = 0, mid = 5, and right = 11.)
As we hone in on our target, we will update two variables named left and right
to keep track of the first and last indices of the sublist that we are still considering.
The table on the left contains information about individual earthquakes, each of which
is identified with a QuakeID that acts as its key. The last column contains a two-letter
network code that identifies the preferred source of information about that earthquake.
The table on the right contains the names associated with each two-letter code. The
two-letter codes also act as the keys for this table.
Relational databases are queried using a programming language called SQL. A simple
SQL query looks like this:
select Mag from Earthquakes where QuakeID = ’nc72076101’
This query is asking for the magnitude (Mag), from the Earthquakes table, of the
earthquake with QuakeID nc72076101. The response to this query would be the value
1.8. Searching a table quickly for a particular key value is facilitated by an index. An
index is a data structure that maps key values to rows in a table (similar to a Python
dictionary). The key values in the index can be maintained in sorted order so that any
key value, and hence any row, can be found quickly using a binary search. (Database
indices are more commonly maintained in practice in a hash table or a specialized data
structure called a B-tree.)
In addition, we will maintain a variable named mid that is assigned to the index
of the middle value of this sublist. (When there are two middle values, we choose
the leftmost one.) In each step, we will compare the target item to the item at
index mid. If the target is equal to this middle item, we return mid. Otherwise, we
either set right to be mid - 1 (to hone in on the left sublist) or we set left to be
mid + 1 (to hone in on the right sublist).
In the list above, we start by comparing the item at index mid (60) to our target
item (70). Then, because 70 > 60, we decide to narrow our search to the second half
of the list. To do this, we assign left to mid + 1, which is the index of the item
immediately after the middle item. In this case, we assign left to 5 + 1 = 6, as
shown below.
(Now left = 6 and right = 11.)
Then we update mid to be the index of the middle item in this sublist between left
and right, in this case, 8. Next, since 70 is less than the new middle value, 90, we
discard the second half of the sublist by assigning right to mid - 1, in this case,
8 - 1 = 7, as shown below.
(Now left = 6 and right = 7.)
Then we update mid to be 6, the index of the “middle” item in this short sublist.
Finally, since the item at index mid is the one we seek, we return the value of mid.
Reflection 11.2 What would have happened if we were looking for a non-existent
number like 72 instead?
If we were looking for 72 instead of 70, all of the steps up to this point would have
been the same, except that when we looked at the middle item in the last step, it
would not have been equal to our target. Therefore, picking up from where we left
off, we would notice that 72 is greater than our middle item 70, so we update left
to be the index after mid, as shown below.
(Now left = 7 and right = 7.)
Now, since left and right are both equal to 7, mid must be assigned to 7 as well.
Then, since 72 is less than the middle item, 80, we continue to blindly follow the
algorithm by assigning right to be one less than mid.
(Now right = 6, which is to the left of left = 7.)
At this point, since right is to the left of left (i.e., left > right), the sublist
framed by left and right is empty! Therefore, 72 must not be in the list, and we
return −1.
This description of the binary search algorithm can be translated into a Python
function in a very straightforward way:
def binarySearch(keys, target):
    """Search a sorted list of keys for a target value.

    Parameters:
    keys: a list of key values
    target: a value for which to search

    Return value:
    the index of an occurrence of target in keys, or -1 if it is not found
    """

    n = len(keys)
    left = 0
    right = n - 1
    while left <= right:
        mid = (left + right) // 2
        if target < keys[mid]:
            right = mid - 1
        elif target > keys[mid]:
            left = mid + 1
        else:
            return mid
    return -1
Notice that we have named our list parameter keys (instead of the usual data)
because, in real database applications (see Box 11.1), we typically try to match a
target value to a particular feature of a data item, rather than to the entire data
item. This particular feature is known as a key value. For example, in searching a
phone directory, if we enter “Cumberbatch” in the search field, we are not looking
for a directory entry (the data item) in which the entire contents contain only the
word “Cumberbatch.” Instead, we are looking for a directory entry in which just
the last name (the key value) matches Cumberbatch. When the search term is
found, we return the entire directory entry that corresponds to this key value. In
our function, we return the index at which the key value was found which, if we had
data associated with the key, might provide us with enough information to find it
in an associated data structure. We will look at an example of this in Section 11.2.
The binarySearch function begins by initializing left and right to the first
and last indices in the list. Then, while left <= right, we compute the new value
of mid and compare keys[mid] to the target we seek. If they are equal, we simply
return the value of mid. Otherwise, we adjust left or right to hone in on target.
If we get to the point where left > right, the loop ends and we return −1, having
not found target.
Reflection 11.3 Write a main function that calls binarySearch with the list that
we used in our example. Search for 70 and 72.
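For example, something like the following, using a hypothetical 12-element sorted list that is consistent with the figures above (only the values at indices 5 through 8 are pinned down by the example):

def main():
    data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
    print(binarySearch(data, 70))    # found at index 6
    print(binarySearch(data, 72))    # not in the list, so -1

main()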
Reflection 11.5 Under what circumstances will the binary search algorithm per-
form the most comparisons between the target and a list item?
In the worst case, the while loop will iterate all the way until left > right.
Therefore, the worst case number of item comparisons will be necessary when the
item we seek is not found in the list. Let’s start by thinking about the worst case
number of item comparisons for some small lists. First, suppose we have a list with
length n = 4. In the worst case, we first look at the item in the middle of this list,
and then are faced with searching a sublist with length 2. Next, we look at the
middle item of this sublist and, upon not finding the item, search a sublist of length
1. After one final comparison to this single item, the algorithm will return −1. So
we needed a total of 3 comparisons for a list of length 4.
Reflection 11.6 Now what happens if we double the size of the list to n = 8?
After we compare the middle item in a list with length n = 8 to our target, we
are left with a sublist with length 4. We already know that a list with length 4
requires 3 comparisons in the worst case, so a list with length 8 must require 3 + 1 = 4
comparisons in the worst case. Similarly, a list with length 16 must require only one
List length (n)       Worst case comparisons (c)
1                     1
2                     2
4                     3
8                     4
16                    5
⋮                     ⋮
2^10 = 1,024          11
⋮                     ⋮
2^20 ≈ 1 million      21
⋮                     ⋮
2^30 ≈ 1 billion      31

Table 11.1 The worst case number of comparisons for a binary search on lists with increasing lengths.
more comparison than a list with length 8, for a total of 5. And so on. This pattern
is summarized in Table 11.1. Notice that a list with over a billion items requires at
most 31 comparisons!
Reflection 11.7 Do you see the pattern in Table 11.1? For a list of length n, how
many comparisons are necessary in the worst case?
In each row of the table, the length of the list (n) is 2 raised to the power of 1 less
than the number of comparisons (c), or
n = 2^(c−1).

Equivalently, solving for c by taking the base-2 logarithm of both sides,

c = log2 n + 1.
A spelling checker
Now let’s apply our binary search to the spelling checker problem with which we
started this section. We can write a program that reads an alphabetized word
list, and then allows someone to repeatedly enter a word to see if it is spelled
correctly (i.e., is present in the list of words). A list of English words can be found
on computers running Mac OS X or Linux in the file /usr/share/dict/words, or
one can be downloaded from the book web site. This list is already sorted if you
consider an upper case letter to be equivalent to its lower case counterpart. (For
example, “academy” directly precedes “Acadia” in this file.) However, as we saw in
Chapter 6, Python considers upper case letters to come before lower case letters, so
we actually still need to sort the list to have it match Python’s definition of “sorted.”
For now, we can use the sort method; in the coming sections, we will develop our
own sorting algorithms. The following function implements our spelling checker.
def spellcheck():
    """Repeatedly ask for a word to spell-check and print the result.

    Parameters: none

    Return value: None
    """
The function begins by opening the word list file and reading each word (one word
per line) into a list. Since each line ends with a newline character, we slice it off
before adding the word to the list. After all of the words have been read, we sort
the list. The following while loop repeatedly prompts for a word until the letter
q is entered. Notice that we ask for a word before the while loop to initialize the
value of word, and then again at the bottom of the loop to set up for the next
iteration. In each iteration, we call the binary search function to check if the word
is contained in the list. If the word was found (index != -1), we state that it is
spelled correctly. Otherwise, we state that it is not spelled correctly.
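Filling in the body according to this description gives something like the sketch below, in which 'words.txt' stands in for whatever word list file you are using.

def spellcheck():
    """Repeatedly ask for a word to spell-check and print the result."""

    wordFile = open('words.txt', 'r')           # assumed file name
    words = []
    for line in wordFile:
        words.append(line[:-1])                 # slice off the trailing newline
    wordFile.close()
    words.sort()
    word = input('Word to spell-check (q to quit): ')
    while word != 'q':
        index = binarySearch(words, word)
        if index != -1:
            print(word, 'is spelled correctly.')
        else:
            print(word, 'is not spelled correctly.')
        word = input('Word to spell-check (q to quit): ')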
Reflection 11.8 Combine this function with the binarySearch function together
in a program. Run the program to try it out.
def binarySearch(keys, target):
    """Recursively search a sorted list of keys for a target value.

    Parameters:
    keys: a list of key values
    target: a value for which to search

    Return value:
    the index of an occurrence of target in keys
    """
Reflection 11.9 Repeat Reflections 11.3 and 11.4 with the recursive binary search
function. Does the recursive version “look at” the same values of mid?
Reflection 11.10 How many more comparisons are there in a recursive call to
binarySearch?
Since each recursive call divides the size of the list under consideration by (about)
half, the size of the list we are passing into each recursive call is (about) n/2.
Therefore, the number of comparisons in each recursive call must be T (n/2). The
total number of comparisons is then
T (n) = T (n/2) + c.
Now we can use the same substitution method that we used with recursive linear
search to arrive at a closed form expression in terms of n. First, since T (n) =
T (n/2) + c, we can substitute T (n) with T (n/2) + c:
T(n) = T(n/2) + c
     = T(n/4) + 2c
     = T(n/8) + 3c
       ⋮
     = T(n/2^i) + ic
       ⋮
     = (log2 n + 2)c
Figure 11.2 An illustration of how to derive a closed form for the recurrence relation
T (n) = T (n/2) + c.
T(n) = T(n/2) + c.
Now we need to replace T (n/2) with something. Notice that T (n/2) is just T (n)
with n/2 substituted for n. Therefore, using the definition of T (n) above,
T (n/2) = T (n/2/2) + c = T (n/4) + c.
Similarly,
T (n/4) = T (n/4/2) + c = T (n/8) + c
and
T (n/8) = T (n/8/2) + c = T (n/16) + c.
This sequence of substitutions is illustrated in Figure 11.2. Notice that the denomi-
nator under the n at each step is a power of 2 whose exponent is the multiplier in
front of the accumulated c's at that step. In other words, for each denominator 2^i,
the accumulated value on the right is i ⋅ c. When we finally reach T(1) = T(n/n),
the denominator has become n = 2^(log2 n), so i = log2 n and the total on the right must
be (log2 n)c. Finally, we know that T (0) = c, so the total number of comparisons is
T (n) = (log2 n + 2) c.
Exercises
11.1.1. How would you modify each of the binary search functions in this section so that,
when a target is not found, the function also prints the values in keys that would
be on either side of the target if it were in the list?
11.1.2. Write a function that takes a file name as a parameter and returns the number of
misspelled words in the file. To check the spelling of each word, use binary search
to locate it in a list of words, as we did above. (Hint: use the strip string method
to remove extraneous punctuation.)
11.1.3. Write a function that takes three parameters—minLength, maxLength, and step—
and produces a plot like Figure 11.1 comparing the worst case running times
of binary search and linear search on lists with length minLength, minLength +
step, minLength + 2 * step, . . . , maxLength. Use a slice of the list derived from
list(range(maxLength)) as the sorted list for each length. To produce the worst
case behavior of each algorithm, search for an item that is not in the list (e.g., −1).
11.1.4. The function below plays a guessing game against the pseudorandom number
generator. What is the worst case number of guesses necessary for the function to
win the game for any value of n, where n is a power of 2? Explain your answer.
import random
def guessingGame(n):
secret = random.randrange(1, n + 1)
left = 1
right = n
guessCount = 1
guess = (left + right) // 2
while guess != secret:
if guess > secret:
right = guess - 1
else:
left = guess + 1
guessCount = guessCount + 1
guess = (left + right) // 2
return guessCount
Reflection 11.11 Before you read further, think about how you would sort a list of
data items (names, numbers, books, socks, etc.) in some desired order. Write down
your algorithm informally.
The selection sort algorithm is so called because, in each step, it selects the next
smallest value in the list and places it in its proper sorted position, by swapping
it with whatever is currently there. For example, consider the list of numbers
[50, 30, 40, 20, 10, 70, 60]. To sort this list in ascending order, the selection
sort algorithm first finds the smallest number, 10. We want to place 10 in the first
position in the list, so we swap it with the number that is currently in that position,
50, resulting in the modified list
[10, 30, 40, 20, 50, 70, 60]
Next, we find the second smallest number, 20, and swap it with the number in the
second position, 30:
[10, 20, 40, 30, 50, 70, 60]
Then we find the third smallest number, 30, and swap it with the number in the
third position, 40:
[10, 20, 30, 40, 50, 70, 60]
Next, we find the fourth smallest number, 40. But since 40 is already in the fourth
position, no swap is necessary. This process continues until we reach the end of the
list.
Reflection 11.12 Work through the remaining steps in the selection sort algorithm.
What numbers are swapped in each step?
index: 0 1 2 3 4 5 6
data[index]:
50 30 40 20 10 70 60
To begin, we want to search for the smallest value in the list, and swap it with the
value at index 0. We have already seen how to accomplish key parts of this step.
First, writing a function
swap(data, i, j)
to swap two values in a list was the objective of Exercise 8.2.5. In this swap function,
i and j are the indices of the two values in the list data that we want to swap. And
on Page 354, we developed a function to find the minimum value in a list. The min
function used the following algorithm:
minimum = data[0]
for item in data[1:]:
if item < minimum:
minimum = item
The variable named minimum is initialized to the first value in the list, and updated
with smaller values as they are encountered in the list.
Reflection 11.13 Once we have the final value of minimum, can we implement
the first step of the selection sort algorithm by swapping minimum with the value
currently at index 0?
Unfortunately, it is not quite that easy. To swap the positions of two values in data,
we need to know their indices rather than their values. Therefore, in our selection
sort algorithm, we will need to substitute minimum with a reference to the index of
the minimum value. Let’s call this new variable minIndex. This substitution results
in the following alternative algorithm.
n = len(data)
minIndex = 0
for index in range(1, n):
if data[index] < data[minIndex]:
minIndex = index
Notice that this algorithm is really the same as the previous algorithm; we are just
now referring to the minimum value indirectly through its index (data[minIndex])
instead of directly through the variable minimum. Once we have the index of the
minimum value in minIndex, we can swap the minimum value with the value at
index 0 by calling
if minIndex != 0:
swap(data, 0, minIndex)
index: 0 1 2 3 4 5 6
data[index]:
10 30 40 20 50 70 60
In the next step, we need to do almost exactly the same thing, but for the second
smallest value.
Reflection 11.15 How do we find the second smallest value in the list?
Notice that, now that the smallest value is “out of the way” at the front of the
list, the second smallest value in data must be the smallest value in data[1:].
Therefore, we can use exactly the same process as above, but on data[1:] instead.
This requires only four small changes in the code, marked in red below.
minIndex = 1
for index in range(2, n):
if data[index] < data[minIndex]:
minIndex = index
if minIndex != 1:
swap(data, 1, minIndex)
index: 0 1 2 3 4 5 6
data[index]:
10 20 40 30 50 70 60
Similarly, the next step is to find the index of the smallest value starting at index 2,
and then swap it with the value in index 2:
minIndex = 2
for index in range(3, n):
if data[index] < data[minIndex]:
minIndex = index
if minIndex != 2:
swap(data, 2, minIndex)
In our example list, this will find the smallest value in data[2:], 30, at index 3.
Then it will call swap(data, 2, 3), resulting in:
index: 0 1 2 3 4 5 6
data[index]:
10 20 30 40 50 70 60
1  def selectionSort(data):
2      """Sort a list of values in ascending order using the
3         selection sort algorithm.
4
5      Parameter:
6          data: a list of values
7
8      Return value: None
9      """
10
11     n = len(data)
12     for start in range(n - 1):
13         minIndex = start
14         for index in range(start + 1, n):
15             if data[index] < data[minIndex]:
16                 minIndex = index
17         if minIndex != start:
18             swap(data, start, minIndex)
Reflection 11.16 In the outer for loop of the selectionSort function, why is
the last value of start n - 2 instead of n - 1? Think about what steps would be
executed if start were assigned the value n - 1.
Reflection 11.17 What would happen if we called selectionSort with the
list [’dog’, ’cat’, ’monkey’, ’zebra’, ’platypus’, ’armadillo’]? Would
it work? If so, in what order would the words be sorted?
Figure 11.3 A comparison of the execution times of selection sort and the list sort method on small randomly shuffled lists.
Because the comparison operators are defined for both numbers and strings, we
can use our selectionSort function to sort either kind of data. For example, call
selectionSort on each of the following lists, and then print the results. (Remember
to incorporate the swap function from Exercise 8.2.5.)
numbers = [50, 30, 40, 20, 10, 70, 60]
animals = [’dog’, ’cat’, ’monkey’, ’zebra’, ’platypus’, ’armadillo’]
heights = [7.80, 6.42, 8.64, 7.83, 7.75, 8.99, 9.25, 8.95]
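For reference, a minimal version of the swap function might look like this (writing it is the subject of Exercise 8.2.5):

def swap(data, i, j):
    """Swap the values at indices i and j of the list data."""
    data[i], data[j] = data[j], data[i]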
How efficient is selection sort? The dominant work is the comparison on line 15, which is executed once in every iteration of the inner for loop. Therefore, the total number of times that line 15 is executed is
(n − 1) + (n − 2) + (n − 3) + ⋯
Where does this sum stop? To find out, we look at the last iteration of the outer
for loop, when start is n - 2. In this case, the inner for loop runs from n - 1 to
n - 1, for only one iteration. So the total number of steps is
(n − 1) + (n − 2) + (n − 3) + ⋯ + 3 + 2 + 1.
What is this sum equal to? You may recall that we have encountered this sum a few
times before (see Box 4.2):
    (n − 1) + (n − 2) + (n − 3) + ⋯ + 3 + 2 + 1 = n(n − 1)/2 = (1/2)n² − (1/2)n.

Ignoring the constant 1/2 in front of n² and the low order term (1/2)n, we find
that this expression is asymptotically proportional to n²; hence selection sort has
quadratic time complexity.
Figure 11.3 shows the results of an experiment comparing the running time of
selection sort to the sort method of the list class. (Exercise 11.3.4 asks you to
replicate this experiment.) The parabolic blue curve in Figure 11.3 represents the
quadratic time complexity of the selection sort algorithm. The red curve at the
bottom of Figure 11.3 represents the running time of the sort method. Although
this plot compares the algorithms on very small lists on which both algorithms are
very fast, we see a marked difference in the growth rates of the execution times. We
will see why the sort method is so much faster in Section 11.4.
Querying data
Suppose we want to write a program that allows someone to query the USGS
earthquake data that we worked with in Section 8.4. Although we did not use IDs
then, each earthquake was identified by a unique ID such as ak10811825. The first
two characters identify the monitoring network (ak represents the Alaska Regional
Network) and the last eight digits represent a unique code assigned by the network.
So our program needs to search for a given ID, and then return the attributes
associated with that ID, say the earthquake’s latitude, longitude, magnitude, and
depth. The earthquake IDs will act as keys, on which we will perform our searches.
This associated data is sometimes called satellite data because it revolves around
the key.
To use the efficient binary search algorithm in our program, we need to first sort
the data by its key values. When we read this data into memory, we can either read
it into parallel lists, as we did in Section 8.4, or we can read it into a table (i.e., a list
of lists), as we did in Section 9.1. In this section, we will modify our selection sort
algorithm to handle the first option. We will leave the second option as an exercise.
First, we will read the data into two lists: a list of keys (i.e., earthquake IDs) and a
list of tuples containing the satellite data. For example, the satellite data for one earth-
quake that occurred at 19.5223 latitude and -155.5753 longitude with magnitude 1.1
and depth 13.6 km will be stored in the tuple (19.5223, -155.5753, 1.1, 13.6).
Let’s call these two lists ids and data, respectively. (We will leave writing the
function to read the IDs and data into their respective lists as an exercise.) By
design, these two lists are parallel in the sense that the satellite data in data[index]
belongs to the earthquake with ID in ids[index]. When we sort the earthquakes
by ID, we will need to make sure that the associations with the satellite data are
maintained. In other words, if, during the sort of the list ids, we swap the values in
ids[9] and ids[4], we also need to swap data[9] and data[4].
To do this with our existing selection sort algorithm is actually quite simple. We
pass both lists into the selection sort function, but make all of our sorting decisions
based entirely on a list of keys. Then, when we swap two values in keys, we also
swap the values in the same positions in data. The modified function looks like this
(with changes in red):
def selectionSort(keys, data):
    """Sort the list keys in ascending order using the selection sort
       algorithm, rearranging the parallel list data in the same way.

    Parameters:
    keys: a list of keys
    data: a list of data values corresponding to the keys

    Return value: None
    """

    n = len(keys)
    for start in range(n - 1):
        minIndex = start
        for index in range(start + 1, n):
            if keys[index] < keys[minIndex]:
                minIndex = index
        swap(keys, start, minIndex)
        swap(data, start, minIndex)
Once we have the sorted parallel lists ids and data, we can use binary search to
retrieve the index of a particular ID in the list ids, and then use that index to
retrieve the corresponding satellite data from the list data. The following function
does this repeatedly with inputted earthquake IDs.
A main function that ties all three pieces together looks like this:
def main():
ids, data = readQuakes() # left as an exercise
selectionSort(ids, data)
queryQuakes(ids, data)
Exercises
11.2.1. Can you find a list of length 5 that requires more comparisons (on line 15) than
another list of length 5? In general, with lists of length n, is there a worst case list
and a best case list with respect to comparisons? How many comparisons do the
best case and worst case lists require?
11.2.2. Now consider the number of swaps. Can you find a list of length 5 that requires
more swaps (on line 18) than another list of length 5? In general, with lists of
length n, is there a worst case list and a best case list with respect to swaps? How
many swaps do the best case and worst case lists require?
11.2.3. The inner for loop of the selection sort function can be eliminated by using two
built-in Python functions instead:
def selectionSort2(data):
n = len(data)
for start in range(n - 1):
minimum = min(data[start:])
minIndex = start + data[start:].index(minimum)
if minIndex != start:
swap(data, start, minIndex)
Is this function more or less efficient than the selectionSort function we developed
above? Explain.
11.2.4. Suppose we already have a list that is sorted in ascending order, and want to
insert new values into it. Write a function that inserts an item into a sorted list,
maintaining the sorted order, without re-sorting the list.
11.2.5. Write a function that reads earthquake IDs and earthquake satellite data, consisting
of latitude, longitude, depth and magnitude, from the data file on the web at
http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_month.csv
and returns two parallel lists, as described on Page 559. The satellite data for
each earthquake should be stored as a tuple of floating point values. Then use
this function to complete a working version of the program whose main function is
shown on Page 560. (Remember to incorporate the recursive binary search and
the swap function from Exercise 8.2.5.) Look at the above URL in a web browser
to find some earthquake IDs to search for or do the next exercise to have your
program print a list of all of them.
11.2.6. Add to the queryQuakes function on Page 560 the option to print an alphabetical
list of all earthquakes, in response to typing list for the earthquake ID. For
example, the output should look something like this:
Earthquake ID (q to quit): ci37281696
Location: (33.4436667, -116.6743333)
Magnitude: 0.54
Depth: 13.69
Earthquake ID (q to quit):
11.2.7. An alternative to storing the earthquake data in two parallel lists is to store it in
one table (a list of lists or a list of tuples). For example, the beginning of a table
containing the earthquakes shown in the previous exercise would look like this:
[[’ak11406701’, 63.2397, -151.4564, 5.5, 1.3],
[’ak11406705’, 58.9801, -152.9252, 69.2, 2.3],
...
]
Rewrite the readQuakes, selectionSort, binarySearch, and queryQuakes func-
tions so that they work with the earthquake data stored in this way instead. Your
functions should assume that the key value for each earthquake is in column 0.
Combine your functions into a working program that is driven by the following
main function:
def main():
quakes = readQuakes()
selectionSort(quakes)
queryQuakes(quakes)
11.2.8. The Sieve of Eratosthenes is a simple algorithm for generating prime numbers that
has a structure that is similar to the nested loop structure of selection sort. In
this algorithm, we begin by initializing a list of n Boolean values named prime as
follows. (In this case, n = 12.)
prime = F F T T T T T T T T T T
0 1 2 3 4 5 6 7 8 9 10 11
We then iterate over the list with a loop index variable, starting at index 2. Since prime[2] is True, the list value of every multiple of 2 is set to False:

F F T T F T F T F T F T
0 1 2 3 4 5 6 7 8 9 10 11
↑
Next, the loop index variable is incremented to 3 and, since prime[3] is True, the
list value of every multiple of 3 is set to be False.
F F T T F T F T F F F T
0 1 2 3 4 5 6 7 8 9 10 11
↑
F F T T F T F T F F F T
0 1 2 3 4 5 6 7 8 9 10 11
↑
F F T T F T F T F F F T
0 1 2 3 4 5 6 7 8 9 10 11
↑
And so on. How far must we continue to increment index before we know we are
done? Once we are done filling in the list, we can iterate over it one more time to
build the list of prime numbers, in this case, [2, 3, 5, 7, 11]. Write a function
that implements this algorithm to return a list of all prime numbers less than or
equal to a parameter n.
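A sketch of one possible implementation is shown below; the outer loop simply runs all the way up to n, although, as the question above suggests, it could stop earlier.

def sieve(n):
    """Return a list of the prime numbers that are less than or equal to n."""

    prime = [False, False] + [True] * (n - 1)       # 0 and 1 are not prime
    for index in range(2, n + 1):
        if prime[index]:
            for multiple in range(2 * index, n + 1, index):
                prime[multiple] = False             # multiples of index are not prime
    return [value for value in range(n + 1) if prime[value]]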
Suppose we are sorting a hand of cards that we are holding in this order:

50 30 40 20 10 70 60
We start with the second card to the left, 30, and decide whether it should stay
where it is or be inserted to the left of the first card. In this case, it should be
inserted to the left of 50, resulting in the following slightly modified ordering:
30 50 40 20 10 70 60
Then we consider the third card from the left, 40. We see that 40 should be inserted
between 30 and 50, resulting in the following order.
30 40 50 20 10 70 60
Next, we consider 20, and see that it should be inserted all the way to the left,
before 30.
20 30 40 50 10 70 60
This process continues with 10, 70, and 60, at which time the hand is sorted.
Now let's implement this idea for a list. Suppose the first four values have already been inserted into sorted order, so the list looks like this, and itemToInsert is the value 10 at index insertIndex = 4:

index:        0   1   2   3   4   5   6
data[index]: 20  30  40  50  10  70  60
We need to compare itemToInsert to each of the items to the left, first at index
insertIndex - 1, then at insertIndex - 2, insertIndex - 3, etc. When we
come to an item that is less than or equal to itemToInsert or we reach the
beginning of the list, we know that we have found the proper location for the item.
This process can be expressed with a while loop:
itemToInsert = data[insertIndex]
index = insertIndex - 1
while index >= 0 and data[index] > itemToInsert:
index = index - 1
The variable named index tracks which item we are currently comparing to
itemToInsert. The value of index is decremented while it is still at least zero
and the item at position index is still greater than itemToInsert. At the end of the
loop, because the while condition has become false, either index has reached -1 or
data[index] <= itemToInsert. In either case, we want to insert itemToInsert
into position index + 1. In the example above, we would reach the beginning of
the list, so we want to insert itemToInsert into position index + 1 = 0.
To actually insert itemToInsert in its correct position, we need to delete
itemToInsert from its current position, and insert it into position index + 1.
data.pop(insertIndex)
data.insert(index + 1, itemToInsert)
In the insertion sort algorithm, we want to repeat this process for each value of
insertIndex, starting at 1, so we enclose these steps in a for loop:
def insertionSort(data):
    """Sort a list of values in ascending order using the
       insertion sort algorithm.

    Parameter:
    data: a list of values

    Return value: None
    """

    n = len(data)
    for insertIndex in range(1, n):
        itemToInsert = data[insertIndex]
        index = insertIndex - 1
        while index >= 0 and data[index] > itemToInsert:
            index = index - 1
        data.pop(insertIndex)
        data.insert(index + 1, itemToInsert)
Although this function is correct, it performs more work than necessary. To see
why, think about how the pop and insert methods must work, based on the
picture of the list on Page 564. First, to delete (pop) itemToInsert, which is at
position insertIndex, all of the items to the right, from position insertIndex
+ 1 to position n - 1, must be shifted one position to the left. Then, to insert
itemToInsert into position index + 1, all of the items to the right, from position
index + 2 to n - 1, must be shifted one position to the right. So the items from
position insertIndex + 1 to position n - 1 are shifted twice, only to end up back
where they started.
A more efficient technique would be to only shift those items that need to be
shifted, and do so while we are already iterating over them. The following modified
algorithm does just that.
1  def insertionSort(data):
2      """ (docstring omitted) """
3
4      n = len(data)
5      for insertIndex in range(1, n):
6          itemToInsert = data[insertIndex]
7          index = insertIndex - 1
8          while index >= 0 and data[index] > itemToInsert:
9              data[index + 1] = data[index]
10             index = index - 1
11         data[index + 1] = itemToInsert
The red assignment statement in the for loop copies each item at position index
one position to the right. Therefore, when we get to the end of the loop, position
index + 1 is available to store itemToInsert.
Reflection 11.18 To get a better sense of how this works, carefully work through
the steps with the three remaining items to be inserted in the illustration on Page 564.
Reflection 11.19 Write a main function that calls the insertionSort function
to sort the list from the beginning of this section: [50, 30, 40, 20, 10, 70, 60].
Reflection 11.20 What are the minimum and maximum number of iterations
executed by the while loop for a particular value of insertIndex?
In the best case, it is possible that the item immediately to the left of itemToInsert
is less than itemToInsert, and therefore the condition is tested only once. Therefore,
since there are n − 1 iterations of the outer for loop, there are only n − 1 steps
total for the entire algorithm. So, in the best case, insertion sort is a linear-time
algorithm.
On the other hand, in the worst case, itemToInsert may be the smallest item
in the list and the while loop executes until index is less than 0. Since index is
initialized to insertIndex - 1, the condition on line 8 is tested with index equal
to insertIndex - 1, then insertIndex - 2, then insertIndex - 3, etc., until
index is -1, at which time the while loop ends. In all, this is insertIndex + 1
iterations of the while loop. Therefore, when insertIndex is 1 in the outer for
loop, there are two iterations of the while loop; when insertIndex is 2, there are
three iterations of the while loop; when insertIndex is 3, there are four iterations
of the while loop; etc. In total, there are 2 + 3 + 4 + ⋯ iterations of the while loop.
As we did with selection sort, we need to figure out when this pattern ends. Since
the last value of insertIndex is n - 1, the while loop condition is tested at most
n times in the last iteration. So the total number of iterations of the while loop is
2 + 3 + 4 + ⋯ + (n − 1) + n.
Using the same trick we used with selection sort, we find that the total number of
steps is
    2 + 3 + 4 + ⋯ + (n − 1) + n = n(n + 1)/2 − 1 = (1/2)n² + (1/2)n − 1.
Ignoring the constants and lower order terms, this means that insertion sort is also
a quadratic-time algorithm in the worst case.
So which case is more representative of the efficiency of insertion sort, the best
case or the worst case? Computer scientists are virtually always interested in the
worst case over the best case because the best case scenario for an algorithm is
usually fairly specific and very unlikely to happen in practice. On the other hand,
the worst case gives a robust upper limit on the performance of the algorithm.
Although more challenging, it is also possible to find the average case complexity of
some algorithms, assuming that all possible cases are equally likely to occur.
We can see from Figure 11.4 that the running time of insertion sort is almost
identical to that of selection sort in practice. Since the best case and worst case of
selection sort are the same, it appears that the worst case analysis of insertion sort
is an accurate measure of performance on random lists. Both algorithms are still
significantly slower than the built-in sort method, and this difference is even more
apparent with longer lists. We will see why in the next section.
Exercises
11.3.1. Give a particular 10-element list that requires the worst case number of comparisons
in an insertion sort. How many comparisons are necessary for this list?
11.3.2. Give a particular 10-element list that requires the best case number of comparisons
in an insertion sort. How many comparisons are necessary for this list?
11.3.3. Write a function that compares the time required to sort a long list of English
words using insertion sort to the time required by the sort method of the list class.
Figure 11.4 A comparison of the execution times of selection sort, insertion sort, and
the list sort method on small randomly shuffled lists.
We can use the function time in the module of the same name to record the time
required to execute each function. The time.time() function returns the number
of seconds that have elapsed since January 1, 1970; so we can time a function by
getting the current time before and after we call the function, and finding the
difference.
As we saw in Section 11.1, a list of English words can be found on computers
running Mac OS X or Linux in the file /usr/share/dict/words, or one can be
downloaded from the book web site. This list is already sorted if you consider an
upper case letter to be equivalent to its lower case counterpart. However, since
Python considers upper case letters to come before lower case letters, the list is
not sorted for our purposes.
Notes:
• When you read in the list of words, remember that each line will end with a
newline character. Be sure to remove that newline character before adding
the word to your list.
• Make a separate copy of the original list for each sorting algorithm. If you
pass a list that has already been sorted by the sort method to your insertion
sort, you will not get a realistic time measurement because a sorted list is a
best case scenario for insertion sort.
How many seconds did each sort require? (Be patient; insertion sort will take
several minutes!) If you can be really patient, try timing selection sort as well.
At the end of the first pass, we know that the largest item (in blue) is in its correct
location at the end of the list. We now repeat the process above, but stop before
the last item.
2 3 1 4   →   2 3 1 4   →   2 1 3 4
After the second pass, we know that the two largest items (in blue) are in their
correct locations. On this short list, we make just one more pass.
2 1 3 4   →   1 2 3 4
After n − 1 passes, we know that the last n − 1 items are in their correct locations.
Therefore, the first item must be also, and we are done.
1 2 3 4
Write a function that implements this algorithm.
11.3.7. In the bubble sort algorithm, if no items are swapped during a pass over the
list, the list must be in sorted order. So the bubble sort algorithm can be made
somewhat more efficient by detecting when this happens, and returning early if it
does. Write a function that implements this modified bubble sort algorithm. (Hint:
replace the outer for loop with a while loop and introduce a Boolean variable
that controls the while loop.)
11.3.8. Write a modified version of the insertion sort function that sorts two parallel lists
named keys and data, based on the values in keys, like the parallel list version of
selection sort on Page 559.
Merge sort is an example of a divide and conquer algorithm, which solves a problem in three steps:

1. Divide the problem into one or more smaller subproblems.

2. Conquer each subproblem by solving it recursively (or directly, if it is small enough).

3. Combine the solutions to the subproblems into a solution for the original
problem.

In merge sort, the list data is divided into the two halves data[:mid] and data[mid:], each half is sorted recursively, and the two sorted halves are then merged back into data.
Reflection 11.21 Based on Figure 11.5, what are the divide, conquer, and combine
steps in the merge sort algorithm?
The divide step of merge sort is very simple: just divide the list in half. The conquer
step recursively calls the merge sort algorithm on the two halves. The combine step
merges the two sorted halves into the final sorted list. This elegant algorithm can
be implemented by the following function:
¹ The Python sorting algorithm, called Timsort, has elements of both merge sort and insertion sort. If you would like to learn more, go to http://bugs.python.org/file4451/timsort.txt
def mergeSort(data):
    """Recursively sort a list in place in ascending order,
       using the merge sort algorithm.

    Parameter:
    data: a list of values to sort

    Return value: None
    """

    n = len(data)
    if n > 1:
        mid = n // 2                    # divide list in half
        left = data[:mid]
        right = data[mid:]
        mergeSort(left)                 # recursively sort first half
        mergeSort(right)                # recursively sort second half
        merge(left, right, data)        # merge sorted halves into data
The base case in this function is implicit; when n <= 1, the function just returns
because a list containing zero or one values is, of course, already sorted.
All that is left to flesh out mergeSort is to implement the merge function.
Suppose we want to sort the list [60, 30, 50, 20, 10, 70, 40]. As illustrated
in Figure 11.6, the merge sort algorithm first divides this list into the two sublists
left = [60, 30, 50] and right = [20, 10, 70, 40]. After recursively sorting
each of these lists, we have left = [30, 50, 60] and right = [10, 20, 40, 70].
Now we want to efficiently merge these two sorted lists into one final sorted list. We
(c) [30, 50, 60] [10, 20, 40, 70] [10, 20]
(d) [30, 50, 60] [10, 20, 40, 70] [10, 20, 30]
(e) [30, 50, 60] [10, 20, 40, 70] [10, 20, 30, 40]
(f) [30, 50, 60] [10, 20, 40, 70] [10, 20, 30, 40, 50]
(g) [30, 50, 60] [10, 20, 40, 70] [10, 20, 30, 40, 50, 60]
(h) [30, 50, 60] [10, 20, 40, 70] [10, 20, 30, 40, 50, 60, 70]
could, of course, concatenate the two lists and then call merge sort with them. But
that would be far too much work; because the individual lists are sorted, we can do
much better!
Since left and right are sorted, the first item in the merged list must be the
minimum of the first item in left and the first item in right. So we place this
minimum item into the first position in the merged list, and remove it from left or
right. Then the next item in the merged list must again be one of the items at the
front of left or right. This process continues until we run out of items in one of
the lists.
This algorithm is illustrated in Figure 11.7. Rather than delete items from left
and right as we append them to the merged list, we will simply maintain an index
for each list to remember the next item to consider. The red arrows in Figure 11.7
represent these indices which, as shown in part (a), start at the left side of each
list. In parts (a)–(b), we compare the two front items in left and right, append
the minimum (10 from right) to the merged list, and advance the right index. In
parts (b)–(c), we compare the first item in left to the second item in right, again
append the minimum (20 from right) to the merged list, and advance the right
index. This process continues until, after step (g), when the index in left exceeds
the length of the list. At this point, we simply extend the merged list with whatever
is left over in right, as shown in part (h).
Reflection 11.23 Work through steps (a) through (h) on your own to make sure
you understand how the merge algorithm works.
The merge function begins by clearing out the contents of the merged list and
initializing the indices for the left and right lists to zero. The while loop starting
at line 17 constitutes the main part of the algorithm. While both indices still
refer to items in their respective lists, we compare the items at the two indices
and append the smallest to merged. When this loop finishes, we know that either
leftIndex >= len(left) or rightIndex >= len(right). In the first case (lines
25–26), there are still items remaining in right to append to merged. In the second
case (lines 27–28), there are still items remaining in left to append to merged.
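A sketch of the merge function along these lines is shown below (your version may differ in details such as variable names).

def merge(left, right, merged):
    """Merge the sorted lists left and right into the list merged."""

    merged[:] = []                       # clear out the previous contents of merged
    leftIndex = 0                        # index of the next item to consider in left
    rightIndex = 0                       # index of the next item to consider in right
    while leftIndex < len(left) and rightIndex < len(right):
        if left[leftIndex] <= right[rightIndex]:
            merged.append(left[leftIndex])
            leftIndex = leftIndex + 1
        else:
            merged.append(right[rightIndex])
            rightIndex = rightIndex + 1
    if leftIndex >= len(left):           # left is exhausted: append what remains of right
        merged.extend(right[rightIndex:])
    else:                                # right is exhausted: append what remains of left
        merged.extend(left[leftIndex:])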
Reflection 11.24 Write a program that uses the merge sort algorithm to sort the
list in Figure 11.6.
T (n) = 2 T (n/2) + ?
where the question mark represents the number of steps in addition to the recursive
calls, in particular to split data into left and right and to call the merge function.
Reflection 11.25 How many steps must be required to split the list into left and
right?
Since every item in data is copied through slicing to one of the two lists, this must
require about n steps.
Reflection 11.26 How many steps does the merge function require?
Figure 11.8 An illustration of how to derive a closed form for the recurrence relation T(n) = 2T(n/2) + cn: each level of the tree of recursive calls contributes a total of about cn steps, and there are about log2 n levels, so the total number of steps is proportional to n log2 n.
Exercises
11.4.1. To keep things simple, assume that the selection sort algorithm requires exactly
n2 steps and the merge sort algorithm requires exactly n log2 n steps. About how
many times slower is selection sort than merge sort when n = 100? n = 1000? n = 1
million?
11.4.2. Repeat Exercise 11.3.3 with the merge sort algorithm. How does the time required
by the merge sort algorithm compare to that of the insertion sort algorithm and
the built-in sort method?
11.4.3. Add merge sort to the running time plot in Exercise 11.3.4. How does its time
compare to the other sorts?
11.4.4. Our mergeSort function is a stable sort, meaning that two items with the same
value always appear in the sorted list in the same order as they appeared in the
original list. However, if we changed the <= operator in line 18 of the merge function
to a < operator, it would no longer be stable. Explain why.
11.4.5. We have seen that binary search is exponentially faster than linear search in the
worst case. But is it always worthwhile to use binary search over linear search?
The answer, as is commonly the case in the “real world”, is “it depends.” In this
exercise, you will investigate this question. Suppose we have an unordered list of n
items that we wish to search.
(a) If we use the linear search algorithm, what is the time complexity of this
search?
(b) If we use the binary search algorithm, what is the time complexity of this
search?
(c) If we perform n (where n is also the length of the list) individual searches of
the list, what is the time complexity of the n searches together if we use the
linear search algorithm?
(d) If we perform n individual searches with the binary search algorithm, what
is the time complexity of the n searches together?
(e) What can you conclude about when it is best to use binary search vs. linear
search?
11.4.6. Suppose we have a list of n keys that we anticipate needing to search k times. We
have two options: either we sort the keys once and then perform all of the searches
using a binary search algorithm or we forgo the sort and simply perform all of the
searches using a linear search algorithm. Suppose the sorting algorithm requires
exactly n2 /2 steps, the binary search algorithm requires log2 n steps, and the linear
search requires n steps. Assume each step takes the same amount of time.
(a) If the length of the list is n = 1024 and we perform k = 100 searches, which
alternative is better?
(b) If the length of the list is n = 1024 and we perform k = 500 searches, which
alternative is better?
(c) If the length of the list is n = 1024 and we perform k = 1000 searches, which
alternative is better?
11.4.7. Write a function that merges two sorted files into one sorted file. Your function
should take the names of the three files as parameters. Assume that all three files
contain one string value per line. Your function should not use any lists, instead
reading only one item at a time from each input file and writing one item at a
time to the output file. In other words, at any particular time, there should be at
most one item from each file assigned to any variable in your function. You will
know when you have reached the end of one of the input files when a read function
returns the empty string. There are two files on the book web site that you can
use to test your function.
11.5 TRACTABLE AND INTRACTABLE ALGORITHMS

An algorithm with exponential time complexity is, for all practical purposes,
useless with all but the smallest inputs; even if it is correct and will eventually finish,
the answer will come long after anyone who needs it is dead and gone.
To motivate this distinction, Table 11.2 shows the execution times of five al-
gorithms with different time complexities, on a hypothetical computer capable of
executing one billion operations per second. We can see that, for n up to 30, all
five algorithms complete in about a second or less. However, when n = 50, we start
to notice a dramatic difference: while the first four algorithms still execute very
quickly, the exponential-time algorithm requires 13 days to complete. When n = 100,
it requires 41 trillion years, about 3,000 times the age of the universe, to complete.
And the differences only get more pronounced for larger values of n; the first four,
tractable algorithms finish in a “reasonable” amount of time, even with input sizes
of 1 billion,2 while the exponential-time algorithm requires an absurd amount of
time. Notice that the difference between tractable and intractable algorithms holds
no matter how fast our computers get; a computer that is one billion times faster
than the one used in the table can only bring the exponential-time algorithm down
to 10²⁷⁵ years from 10²⁸⁴ years when n = 1,000!
The dramatic time differences in Table 11.2 also illustrate just how important
efficient algorithms are. In fact, advances in algorithm efficiency are often considerably
more impactful than faster hardware. Consider the impact of improving an
algorithm from quadratic to linear time complexity. On an input with n equal to
1 million, we would see execution time improve from 16.7 minutes to 1/1000 of a
second, a factor of one million! According to Moore’s Law (see Box 11.2), such an
increase in hardware performance will not manifest itself for about 30 more years!
² Thirty-one years is admittedly a long time to wait, and no one would actually wait that long,
but it is far shorter than 41 trillion years.
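To see where numbers like these come from, here is a quick back-of-the-envelope check (assuming, as in Table 11.2, a hypothetical machine that executes one billion operations per second):

rate = 10**9                              # operations per second
secondsPerYear = 60 * 60 * 24 * 365

print(2**100 / rate / secondsPerYear)     # exponential, n = 100: about 4e13 (tens of trillions of) years
print((10**6)**2 / rate / 60)             # quadratic, n = 1 million: about 16.7 minutes
print(10**6 / rate)                       # linear, n = 1 million: 0.001 seconds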
Hard problems
Unfortunately, there are many common, simply stated problems for which the only
known algorithms have exponential time complexity. For example, suppose we have
n tasks to complete, each with a time estimate, that we wish to delegate to two
assistants as evenly as possible. In general, there is no known way to solve this
problem that is any better than trying all possible ways to delegate the tasks and
then choosing the best solution. This type of algorithm is known as a brute force or
exhaustive search algorithm. For this problem, there are two possible ways to assign
each task (to one of the two assistants), so there are
2 ⋅ 2 ⋅ 2 ⋅ ⋯ ⋅ 2 = 2ⁿ    (one factor of 2 for each of the n tasks)
possible ways to assign the n tasks. Therefore, the brute force algorithm has
exponential time complexity. Referring back to Table 11.2, we see that even if we
only have n = 50 tasks, the brute force algorithm would be of no use to us.
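A brute force solution to the task delegation problem might look like the following sketch (the function name and the measure of "evenness" used here, the difference between the two assistants' total times, are illustrative assumptions, not from the text):

from itertools import product

def bestDelegation(times):
    """Try all 2**n ways to assign n tasks to two assistants and return
    the smallest achievable difference between their total workloads."""
    best = sum(times)                                        # worst possible imbalance
    for assignment in product([0, 1], repeat=len(times)):    # 2**n possible assignments
        load0 = sum(t for t, a in zip(times, assignment) if a == 0)
        load1 = sum(t for t, a in zip(times, assignment) if a == 1)
        best = min(best, abs(load0 - load1))
    return best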
It turns out that there are thousands of such problems, known as the NP-hard
problems, that, on the surface, do not seem as if they should be intractable, but,
as far as we know, they are. Even more interesting, no one has actually proven
that they are intractable. See Box 11.3 if you would like to know more about this
fascinating problem.
When we cannot solve a problem exactly, one common approach is to instead
use a heuristic. A heuristic is a type of algorithm that does not necessarily give a
correct answer, but tends to work well in practice. For example, a heuristic for the
task delegation problem might assign the tasks in order, always assigning the next
task to the assistant with the least to do so far. Although this will not necessarily
give the best solution, it may be “good enough” in practice.
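A sketch of this heuristic, under the same illustrative assumptions as the brute force sketch above:

def greedyDelegation(times):
    """Assign each task, in the given order, to whichever assistant
    currently has the smaller total workload; return the two workloads."""
    loads = [0, 0]
    for t in times:
        if loads[0] <= loads[1]:      # assistant 0 has the least to do so far
            loads[0] = loads[0] + t
        else:
            loads[1] = loads[1] + t
    return loads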
11.6 SUMMARY
Sorting and searching are perhaps the most fundamental problems in computer
science for good reason. We have seen how simply sorting a list can exponentially
decrease the time it takes to search it, using the binary search algorithm. Since
binary search is one of those algorithms that “naturally” exhibits self-similarity, we
designed both iterative and recursive algorithms that implement the same idea. We
also designed two basic sorting algorithms named selection sort and insertion sort.
Each of these algorithms can sort a short list relatively quickly, but they are both
very inefficient when it comes to larger lists. By comparison, the recursive merge
sort algorithm is very fast. Merge sort has the added advantage of being an external
sorting algorithm, meaning we can adapt it to sort very large data sets that cannot
be brought into a computer’s memory all at once.
Although the selection and insertion sort algorithms are quite inefficient compared
to merge sort, they are still tractable, meaning that they will finish in a “reasonable”
amount of time. In fact, all algorithms with time complexities that are polynomial
functions of their input sizes are considered to be tractable. On the other hand,
exponential-time algorithms are called intractable because even when their input
sizes are relatively small, they require eons to finish.
11.8 PROJECTS
Project 11.1 Creating a searchable database
Write a program that allows for the interactive search of a data set, besides the
earthquake data from Section 11.2, downloaded from the web or the book web site.
It may be data that we have worked with elsewhere in this book or it may be a new
data set. Your program, at a minimum, should behave like the program that we
developed in Section 11.2. In particular, it should:
1. Read the data into a table or parallel lists.
2. Sort the data by an appropriate key value, so that it can be searched with binary search.
3. Interactively allow someone to query the data (by a key value). When a key
value is found, the program should print the satellite data associated with
that key. Use binary search to search for the key and return the results.
Project 11.2 Binary search trees

[Figure 11.10: a binary search tree in which the root 50 has children 30 and 70, node 30 has children 20 and 40, node 70 has left child 60, and node 20 has left child 10. Figure 11.11: the same tree after 35 has been inserted as the left child of 40.]

For example, to insert the value 35 into the binary search tree in Figure 11.10, we start at the root; since 35 < 50, we move
to the left. Next, since 35 > 30, we move to the right. Then, since 35 < 40, we
move to the left. Since there is no node in this position, we create a new node with
35 and insert it there, as the left child of the node containing 40.
Question 11.2.1 How would the values 5, 25, 65, and 75 be inserted into the binary
search tree in Figure 11.10?
Question 11.2.2 Does the order in which items are inserted affect what the tree
looks like? After the four values in the previous question are inserted into the binary
search tree, insert the value 67. Would the binary search tree be different if 67 were
inserted before 65?
Searching a binary search tree follows the same process, except that we check
whether the target value is equal to the key in each node that we visit. If it is,
we return success. Otherwise, we move to the left or right, as we did above. If we
eventually end up in a position without a node, we know that the target value was
not found. For example, if we want to search for 20 in the binary search tree in
Figure 11.10, we would start at the root and first move left because 20 < 50. Then
we move left again because 20 < 30. Finally, we return success because we found our
target. If we were searching for 25 instead, we would have moved right when we arrived
at node 20, but finding no node there, we would have returned failure.
Question 11.2.3 What nodes would be visited in searches for 10, 25, 55, and 60
in the binary search tree in Figure 11.10?
In Python, we can represent a node in a binary search tree with a three-item list.
As illustrated below, the first item in the list is the key, the second item is a list
representing the left child node, and the third item is a list representing the right
child node.
[50, [ ], [ ]]
The list above represents a single node with no left or right child. Or, equivalently,
we can think of the two empty lists as representing “empty” left and right children.
To insert a child, we simply insert into one of the empty lists the items representing
the desired node. For example, to make 70 the right child of the node above, we
would insert a new node containing 70 into the second list:
[50, [ ], [70, [ ], [ ]]]
To insert 60 as the left child of 70, we would insert a new node containing 60 into
the first list in 70:
[50, [ ], [70, [60, [ ], [ ]], [ ]]]
The list above now represents the root and the two nodes to the right of the root in
Figure 11.10. Notice that an entire binary search tree can be represented by its root
node. The complete binary search tree in Figure 11.10 looks like this:
bst = [50, [30, [20, [10, [], []], []], [40, [], []]],
[70, [60, [], []], []]]
Question 11.2.4 Parse the list above to understand how it represents the binary
search tree in Figure 11.10.
This representation quickly becomes difficult to read. But, luckily, we will rely on
our functions to read them instead of us.
Let’s now implement the insert and search algorithms we discussed earlier, using
this list implementation. To make our code easier to read, we will define three
constant values representing the indices of the key, left child, and right child in a
node:
KEY = 0
LEFT = 1
RIGHT = 2
So if node is the name of a binary search tree node, then node[KEY] is the node’s
key value, node[LEFT] is the node’s left child, and node[RIGHT] is the node’s right
child.
The following function inserts a new node into a binary search tree:
def insert(root, key):
    """Insert a new node into the binary search tree rooted at root.

    Parameters:
        root: the list representing the BST
        key: the key value to insert
    """
    current = root
    while current != [ ]:
        if key <= current[KEY]:
            current = current[LEFT]
        else:
            current = current[RIGHT]
    current.extend([key, [ ], [ ]])
The variable named current keeps track of where we are in the tree during the
insertion process. The while loop proceeds to “move” current left or right until
current reaches an empty node. At that point, the loop ends, and the algorithm
inserts a new node containing key by inserting key and two empty lists into the
empty list assigned to current. (Recall that the extend method effectively appends
each item in its list argument to the end of the list.) To use this function to insert
the value 35 into our binary search tree named bst above, as in Figure 11.11, we
would call insert(bst, 35).
The function to search a binary search tree is very similar:
def search(root, key):
    """Search for a key in the binary search tree rooted at root.

    Parameters:
        root: the list representing the BST
        key: the key value to search for

    Return value: a Boolean indicating whether key was found
    """
    current = root
    while current != [ ] and current[KEY] != key:
        if key < current[KEY]:
            current = current[LEFT]
        else:
            current = current[RIGHT]
    return current != [ ]
The only differences in the search function are (a) the loop now also ends if we
find the desired key value in the node assigned to current, and (b) at the end of
the loop, we return False (failure) if current ends at an empty node and True
otherwise.
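As a quick check, using the bst list built earlier and the function names as reconstructed above:

insert(bst, 35)            # 35 becomes the left child of the node containing 40 (Figure 11.11)
print(search(bst, 20))     # True:  visits 50, 30, 20
print(search(bst, 25))     # False: falls off the tree to the right of 20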
In this project, you will work with a data set of your choice, downloaded from
the web or the book web site. It may be data that we have worked with elsewhere
in this book or it may be a new data set. Your data must contain two or more
attributes per entry, one of which will be an appropriate key value. The remaining
attributes will constitute the satellite data associated with the entry.
Question 11.2.5 Is searching a binary search tree as efficient as using the binary
search algorithm to search in a sorted list? In what situations might a binary search
tree not be as efficient? Explain your answers.
Write a main function that puts all of the pieces together to create a program that
reads your data set and allows repeated queries of the data.
Part 4: Recursion
Every node in a binary search tree is the root of a subtree. In this way, binary
search trees exhibit self-similarity. The subtrees rooted by a node’s left and right
children are called the node’s left subtree and right subtree, respectively. Exploiting
this self-similarity, we can think about inserting into (or searching) a binary search
tree with root r as recursively solving one of two subproblems: inserting into the
left subtree of r or inserting into the right subtree of r. Write recursive versions of
the insert and search functions that use this self-similarity.
Part 5: Sorting
Once data is in a binary search tree, we have a lot of information about how it is
ordered. We can use this structure to create a sorted list of the data. Notice that a
sorted list of the keys in a binary search tree consists of a sorted list of the keys in
the left subtree, followed by the root of the tree, followed by a sorted list of the keys
in the right subtree. Using this insight, write a recursive function bstSort(root)
that returns a sorted list of the keys in a binary search tree. Then use this function
to add an option to your query function from Part 3 that prints the list of keys
when requested.
Question 11.2.6 How efficient do you think this sorting algorithm is? How do you
think it compares to the sorting algorithms we discussed in this chapter? (Remember
to take into account the time it takes to insert the keys into the binary search tree.)
CHAPTER 12
Networks
Fred Jones of Peoria, sitting in a sidewalk cafe in Tunis and needing a light for his cigarette,
asks the man at the next table for a match. They fall into conversation; the stranger is an
Englishman who, it turns out, spent several months in Detroit studying the operation of
an interchangeable-bottlecap factory. “I know it’s a foolish question,” says Jones, “but did
you ever by any chance run into a fellow named Ben Arkadian? He’s an old friend of mine,
manages a chain of supermarkets in Detroit. . . ”
“Arkadian, Arkadian,” the Englishman mutters. “Why, upon my soul, I believe I do!
Small chap, very energetic, raised merry hell with the factory over a shipment of defective
bottlecaps.”
“No kidding!” Jones exclaims in amazement.
“Good lord, it’s a small world, isn’t it?”
Stanley Milgram
The Small-World Problem (1967)
What do Facebook, food webs, the banking system, and our brains all have
in common? They are all networks: systems of interconnected units that
exchange information over the links between them. There are networks all around us:
social networks, road networks, protein interaction networks, electrical transmission
networks, the Internet, networks of seismic faults, terrorist networks, networks of
political influence, transportation networks, and semantic networks, to name a few.
The continuous and dynamic local interactions in large networks such as these
make them extraordinarily complex and hard to predict. Learning more about
networks can help us combat disease, terrorism, and power outages. Realizations that
some networks are emergent systems that develop global behaviors based on local
interactions have improved our understanding of insect colonies, urban planning,
and even our brains. Too little understanding of networks has had unfortunate
consequences, such as when invasive species have been introduced into poorly
understood ecological networks.
As with the other types of “big data,” we need computers and algorithms to
understand large and complex networks. In this chapter, we will begin by discussing
[Figure 12.1: three different drawings of the same graph, with nodes A, B, C, D, and E.]
how we can represent networks in algorithms so that we can analyze them. Then we
will develop an algorithm to find the distance between any two nodes in a network.
Recent discoveries have shown that many real networks exhibit a “small-world
property,” meaning that the average distance between nodes is relatively small. In
later sections, we will computationally investigate the characteristics of small-world
networks and their ramifications for solving real problems.
12.1 MODELING WITH GRAPHS

[Figure 12.2: a social network graph whose nodes are Amelia, Beth, Caroline, Cathy, Dave, Lillian, and Nick.]

One way to represent a graph is with an adjacency matrix, a table containing one row
and one column for each node: the entry in a given row and column is 1 if there is a
link between the corresponding pair of nodes, and 0 if there
is no link. An adjacency matrix for the network in Figure 12.2 can be represented
by the following table.
Amelia Beth Caroline Cathy Dave Lillian Nick
Amelia 0 1 1 0 0 1 1
Beth 1 0 0 1 1 0 1
Caroline 1 0 0 0 0 1 1
Cathy 0 1 0 0 1 0 0
Dave 0 1 0 1 0 0 0
Lillian 1 0 1 0 0 0 1
Nick 1 1 1 0 0 1 0
For example, the first row in the adjacency matrix indicates that Amelia is only
connected to Beth, Caroline, Lillian, and Nick. In Python, we would represent this
matrix with the following nested list.
graph = [[0, 1, 1, 0, 0, 1, 1],
[1, 0, 0, 1, 1, 0, 1],
[1, 0, 0, 0, 0, 1, 1],
[0, 1, 0, 0, 1, 0, 0],
[0, 1, 0, 1, 0, 0, 0],
[1, 0, 1, 0, 0, 0, 1],
[1, 1, 1, 0, 0, 1, 0]]
[Figure 12.5: An expanded social network in which Kevin, Tyler, Christina, and Ted have been added. The nodes and links in red are additions to the graph in Figure 12.2.]
Although the nodes’ labels are not stored in the adjacency matrix itself, they could
be stored separately as strings in a list. The index of each string in the list should
equal the row and column of the corresponding node in the adjacency matrix.
Reflection 12.1 Create an adjacency matrix for the graph in Figure 12.1. (Re-
member that all three pictures depict the same graph.)
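The adjacency list dictionary that the next paragraph describes (on page 590 of the original) is not reproduced in this excerpt; reconstructed from the adjacency matrix above, it would look something like this:

graph = {'Amelia':   ['Beth', 'Caroline', 'Lillian', 'Nick'],
         'Beth':     ['Amelia', 'Cathy', 'Dave', 'Nick'],
         'Caroline': ['Amelia', 'Lillian', 'Nick'],
         'Cathy':    ['Beth', 'Dave'],
         'Dave':     ['Beth', 'Cathy'],
         'Lillian':  ['Amelia', 'Caroline', 'Nick'],
         'Nick':     ['Amelia', 'Beth', 'Caroline', 'Lillian']}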
Each key in this dictionary, a string, represents a node, and each corresponding
value is a list of strings representing the nodes to which the key node is connected.
Notice that, if two nodes are connected, that information is stored in both nodes’
lists. For example, there is a link connecting Amelia and Beth, so Beth is in Amelia’s
list and Amelia is in Beth’s list.
Reflection 12.2 Create an adjacency list for the graph in Figure 12.1.
Making friends
Social networking sites often have an eerie ability to make good suggestions about
who you should add to your list of “connections” or “friends.” One way they do this
is by examining the connections of your connections (or “friends-of-friends”). For
example, consider the expanded social network graph in Figure 12.5. Dave currently
has only three friends. But his friends have an additional seven friends that an
algorithm could suggest to Dave.
Reflection 12.3 Who are the seven friends-of-friends of Dave in Figure 12.5?
In graph terminology, the connections of a node are called the node’s neighborhood,
and the size of a node’s neighborhood is called its degree. In the graph in Figure 12.5,
Dave’s neighborhood contains Beth, Cathy and Christina, and therefore his degree
is three.
Reflection 12.4 How can you compute the degree of a node from the graph’s
adjacency matrix? What about from the graph’s adjacency list?
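The friendsOfFriends function that the next paragraph describes line by line is not reproduced in this excerpt; the following reconstruction is consistent with that description, although the original's exact wording and line numbering may differ:

def friendsOfFriends(network, node):
    """Return a list of friend suggestions for node: unique neighbors of
    node's neighbors that are not node itself and not already its neighbors."""
    suggestions = [ ]
    neighbors = network[node]                      # the neighborhood of node
    for neighbor in neighbors:
        for friendOfFriend in network[neighbor]:   # each neighbor's neighbors
            if (friendOfFriend != node and
                    friendOfFriend not in neighbors and
                    friendOfFriend not in suggestions):
                suggestions.append(friendOfFriend)
    return suggestions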
On line 12, network[node] is the list of nodes to which node is connected in the
adjacency list named network. We assign this list to neighbors, and then iterate
over it on line 13. On line 14, in the inner for loop, we then iterate over the list
of each neighbors’ neighbors. In the if statement, we choose suggestions to be a
list of unique neighbors-of-neighbors that are not the node itself or neighbors of the
node.
Reflection 12.5 Look carefully at the three-part if statement in the function above.
How does each part contribute to the desired characteristics of suggestions listed
above?
Reflection 12.6 Insert the additional nodes and links from Figure 12.5 (in red)
into the dictionary on Page 590. Then call the friendsOfFriends function with
this graph to find new friend suggestions for Dave.
In the next section, we will design an algorithm to find paths to nodes that are
farther away. The ability to compute the distance between nodes will also allow us
to better characterize and understand large networks.
Exercises
12.1.1. Besides those presented in this section, describe three other examples of networks.
12.1.2. Draw the network represented by the following adjacency matrix.
A B C D E
A 0 1 1 0 0
B 1 0 0 1 1
C 1 0 0 0 1
D 0 1 0 0 1
E 0 1 1 1 0
[The figure for Exercise 12.1.4 shows a network with eight nodes, numbered 1 through 8.]
12.1.5. What is the neighborhood of each of the nodes in the network from Exercise 12.1.4?
12.1.6. What is the degree of each node in the network from Exercise 12.1.4? Which
node(s) have the maximum degree?
12.1.7. Are the networks in each of the following pairs the same or different? Why?
[(a) and (b): each part shows a pair of network drawings to compare.]
12.1.8. A graph can be represented in a file by listing one link per line, with each link
represented by a pair of nodes. For example, the graph below is represented by the
file on the right. Write a function that reads such a file and returns an adjacency
list (as a dictionary) for the graph. Notice that, for each line A B in the file, your
function will need to insert node B into the list of neighbors of A and insert node A
into the list of neighbors of B.
[The drawing shows nodes A, B, C, D, and E.] The file graph.txt on the right contains:
A B
A C
A D
B E
C D
C E
12.1.9. In this chapter, we are generally assuming that all graphs are undirected, meaning
that each link represents a mutual relationship between two nodes. For example, if
there is a link between nodes A and B, then this means that A is friends with B
and B is friends with A, or that one can travel from city A to city B and from
city B to city A. However, the relationships between nodes in some networks are
not mutual or do not exist in both directions. Such a network is more accurately
represented by a directed graph (or digraph), in which links are directed from one
node to another. In a picture, the directions are indicated by arrows. For example, in
the directed graph below, one can go directly from node A to node B, but not vice
versa. However, one can go in both directions between nodes B and E.
[The drawing shows nodes A, B, C, D, and E with directed links.] The file digraph.txt on the right contains:
A B
A D
B E
C A
D C
E B
E C
(a) Give three examples of networks that are better represented by a directed
graph.
(b) How would an adjacency list representation of a directed graph differ from
that of an undirected graph?
(c) Write a function that reads a file representing a directed graph (see the
example above), and returns an adjacency list (as a dictionary) representing
that directed graph.
12.1.10. Write a function that returns the maximum degree in a network represented by an
adjacency list (dictionary).
12.1.11. Write a function that returns the average degree in a network represented by an
adjacency list (dictionary).
12.2 SHORTEST PATHS

Reflection 12.7 Are shortest paths always unique? Is there another shortest path
between Dave and Lillian?
There may be many shortest paths between two nodes in a network, but in most
applications we are concerned with just finding one.
Shortest paths can be computed using an algorithm called breadth-first search
(BFS). A breadth-first search explores outward from a source node, first visiting
all nodes with distance one from the source, then all nodes with distance two, etc.,
until it has visited every reachable node in the network. In other words, the BFS
algorithm incrementally pushes its “frontier” of visited nodes outward from the
source. When the algorithm finishes, it has computed the distances between the
source node and every other node.
For example, suppose we wanted to discover the distance from Beth to every
other person in the social network in Figure 12.5, reproduced below.
[Figure 12.5 is reproduced here.]
We begin at the source, Beth, and visit all of Beth's neighbors: Amelia, Cathy, Dave, and Nick.
[The figure is shown again with Beth's four neighbors marked as visited.]
Since these nodes are one hop away from the source, we label them with distance
1. These nodes now comprise the “frontier” being explored by the algorithm. In
the next round, we explore all unvisited neighbors of the nodes on this frontier, as
shown below.
[The figure is shown again with Caroline, Christina, Lillian, and Ted newly visited.]
As indicated by the red links, Christina is visited from Dave, Ted is visited from
Cathy, and both Caroline and Lillian are visited from Amelia. Notice that Caroline
and Lillian could have been visited from Nick as well. The decision is arbitrary,
depending, as we will see, on the order in which nodes are considered by the
algorithm. Since all four of these nodes are neighbors of a node with distance 1,
we label them with distance 2. Finally, in the third round, we visit all unvisited
neighbors of the new frontier of nodes, as shown below.
[The final figure shows Kevin and Tyler, the last two unvisited nodes, newly visited.]
Since these newly visited nodes are all neighbors of a node labeled with distance
2, we label all of them with distance 3. At this point, all of the nodes have been
visited, and the final label of each node gives its distance from the source.
Reflection 12.8 If you also studied the depth-first search algorithm in Section 10.5,
compare and contrast that approach with breadth-first search.
In an algorithm, keeping track of the nodes on the current frontier could get
complicated. The trick is to use a queue. A queue is a list in which items are always
inserted at the end and deleted from the front. The insertion operation is called
enqueue and the deletion operation is called dequeue.
[A figure shows a queue stored in a list with indices 0 through 6: items are dequeued from the front (index 0) and enqueued at the end.]
In Python, an enqueue can be implemented by appending an item to the end of the list:
queue.append(item) # enqueue an item
And then a dequeue can be implemented by “popping” the front item from the list:
item = queue.pop(0) # dequeue an item
Reflection 12.10 Why can we not explore outward from these newly visited neigh-
bors right away? Why do they need to be stored in the queue for later?
We need to wait because there may be nodes further ahead in the queue that have
smaller distances from the source. For the algorithm to work correctly, we have to
explore outward from these nodes first.
The following Python function, bfs(network, source), implements the breadth-first search algorithm (only its body, lines 11–26, is shown here).
11 visited = { }
12 distance = { }
13 for node in network:
14     visited[node] = False
15     distance[node] = float('inf')
16 visited[source] = True
17 distance[source] = 0
18 queue = [source]
19 while queue != [ ]:
20     front = queue.pop(0)                 # dequeue front node
21     for neighbor in network[front]:
22         if not visited[neighbor]:
23             visited[neighbor] = True
24             distance[neighbor] = distance[front] + 1
25             queue.append(neighbor)       # enqueue visited node
26 return distance
The function maintains two dictionaries: visited keeps track of whether each node
has been visited and distance keeps track of the distance from the source to each
node. Lines 11–17 initialize the dictionaries. Every node, except the source, is marked
as unvisited and assigned an initial distance of infinity (∞) because we do not yet
know which nodes can be reached from the source. (The expression float('inf')
creates a special value representing ∞ that is greater than every other floating point
value.) The source is marked as visited and assigned distance zero. On line 18, the
queue is initialized to contain just the source node. Then, while the queue is not
empty, the algorithm repeatedly dequeues the front node (line 20), and explores all
neighbors of this node (lines 21–25). If a neighbor has not yet been visited (line 22),
it is marked as visited (line 23), assigned a distance that is one greater than the
node from which it is being visited (line 24), and then enqueued (line 25). Once the
queue is empty, we know that all reachable nodes have been visited, so we return
the distance dictionary, which now contains the distance to each node.
Reflection 12.11 Call the bfs function with the graph that you created earlier, to
find the distances from Beth to all other nodes.
Reflection 12.12 What does it mean if the bfs function returns a distance of ∞
for a node?
If the final distance is ∞, then the node must not have been visited by the algorithm,
which means that there is no path to it from the source.
Reflection 12.13 If you just want the distance between two particular nodes, named
source and dest, how can you use the bfs function to find it?
The bfs function finds the distance from a source node to every node, so we just
need to call bfs and then pick out the particular distance we are interested in:
allDistances = bfs(graph, source)
distance = allDistances[dest]
But what if we want an actual shortest path, and not just its length? Consider the path from Beth to Tyler in the example above: Dave was visited from Beth, Christina was visited from Dave, and Tyler was visited from Christina. Since we incremented the distance to Tyler by one each time we crossed one of
these links, this path must be a shortest path! Therefore, all we have to do is
remember this order of nodes, as we visit them. This is accomplished by adding
another dictionary, named predecessor, to the bfs function:
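The modified function is not reproduced in this excerpt; a reconstruction that adds the predecessor dictionary to the bfs function shown above, and (as an assumption here) returns it along with distance, might look like this:

def bfs(network, source):
    """Breadth-first search that records each node's distance from source
    and its predecessor on a shortest path from source."""
    visited = { }
    distance = { }
    predecessor = { }
    for node in network:
        visited[node] = False
        distance[node] = float('inf')
        predecessor[node] = None                      # no predecessor yet
    visited[source] = True
    distance[source] = 0
    queue = [source]
    while queue != [ ]:
        front = queue.pop(0)                          # dequeue front node
        for neighbor in network[front]:
            if not visited[neighbor]:
                visited[neighbor] = True
                distance[neighbor] = distance[front] + 1
                predecessor[neighbor] = front         # visited from front
                queue.append(neighbor)                # enqueue visited node
    return distance, predecessor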
The predecessor dictionary remembers the predecessor (the node that comes
before) of each node on the shortest path to it from the source. This value is
assigned on line 20: the predecessor of each newly visited node is assigned to be the
node from which it is visited.
Reflection 12.14 How can we use the final values in the predecessor dictionary
to construct a shortest path between the source node and another node?
To construct the path to any particular node, we need to follow the predecessors
backward from the destination. As we follow them, we will insert each one into the
front of a list so they are in the correct order when we are done. For example, in the
previous example, to find the shortest path from Beth to Tyler, we start at Tyler.
path = ['Tyler']
Tyler's predecessor was Christina, so we insert Christina into the front of the list:
path = ['Christina', 'Tyler']
Christina's predecessor was Dave, so we next insert Dave into the front of the list:
path = ['Dave', 'Christina', 'Tyler']
Finally, Dave's predecessor was Beth. Since Beth was the source, we stop, and insert Beth at the front to complete the path. The following function implements this idea.
def path(network, source, dest):         # function name assumed; not shown in this excerpt
    """Return a shortest path from source to dest in network.

    Parameters:
        network: a graph represented by a dictionary
        source: the source node in network
        dest: the destination node in network

    Return value: a list of the nodes on a shortest path from source to dest
    """
    distances, allPredecessors = bfs(network, source)   # assumes the modified bfs returns both dictionaries
    path = [ ]
    current = dest
    while current != source:
        path.insert(0, current)
        current = allPredecessors[current]
    path.insert(0, source)
    return path
With the destination node initially assigned to current, the while loop inserts each
value assigned to current into the front of the list named path and then moves
current one step closer to the source by reassigning it the predecessor of current.
When current reaches the source, the loop ends and we insert the source as the
first node in the path.
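For example, using the expanded social network dictionary from Reflection 12.6 and the reconstructed functions above, we would expect something like the following (the exact path found can depend on the order in which neighbors happen to be visited):

print(path(graph, 'Beth', 'Tyler'))    # ['Beth', 'Dave', 'Christina', 'Tyler']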
In the next section, we will use information about shortest paths to investigate
a special kind of network called a small-world network.
Exercises
12.2.1. List the order in which nodes are visited by the bfs function when it is called to
find the distance between Ted and every other node in the graph in Figure 12.5.
(There is more than one correct answer.)
12.2.2. List the order in which nodes are visited by the bfs function when it is called to
find the distance between Caroline and every other node in the graph in Figure 12.5.
(There is more than one correct answer.)
12.2.3. By modifying one line of code, the visited dictionary can be completely removed
from the bfs function. Show how.
12.2.4. Write a function that uses the bfs function to return the distance in a graph
between two particular nodes. The function should take three parameters: the
graph, the source node, and the destination node.
12.2.5. We say that a graph is connected if there is a path between any pair of nodes.
The breadth-first search algorithm can be used to determine whether a graph is
connected. Show how to modify the bfs algorithm so that it returns a Boolean
value indicating whether the graph is connected.
12.2.6. A depth-first search algorithm (see Section 10.5) can also be used to determine
whether a graph is connected. Recall that a depth-first search recursively searches
as far from the source as it can, and then backtracks when it reaches a dead end.
Writing a depth-first search algorithm for a graph is actually much easier than
writing the one in Section 10.5 because there are fewer base cases to deal with.
12.3 IT'S A SMALL WORLD...

[Figure 12.6: a network of 24 nodes organized into three clusters, centered around nodes 5, 12, and 18, joined by a few shortcut links. Figure 12.7: the same 24 nodes arranged in a grid (mesh) network.]

In a small-world network, the average distance between nodes is short, even though the number of links is
small relative to the number possible. The keys to a small-world network are a
high degree of clustering and a few long-range shortcuts that facilitate short paths
between clusters. A cluster is a set of nodes that are highly connected among
themselves. In your social network, you probably participate in several clusters:
family, friends at school, friends at home, co-workers, teammates, etc. Many of the
members of each of these clusters are probably also connected to one another, but
members of different clusters might be far apart if you did not act as a shortcut
link between them.
Although it is too small to really be called a small-world network, the network
in Figure 12.6 illustrates these ideas. The graph contains three clusters of nodes,
centered around nodes 5, 12 and 18, that are connected by a few shortcut links
(e.g., the links between nodes 5 and 18 and between nodes 14 and 23). These two
characteristics together give an average distance between nodes of about 2.42. On
the other hand, the highly structured grid or mesh network in Figure 12.7 has an
average node distance of about 3.33. Both of these graphs have 24 nodes and 38
links, so they are both sparse relative to the (24 ⋅ 23)/2 = 276 possible links that
they could have.
Clustering coefficients
The extent to which the neighborhood of a node is clustered is measured by its local
clustering coefficient. The local clustering coefficient of a node is the number of
links between its neighbors, divided by the total number of possible links between
neighbors. For example, consider the cluster on the left below surrounding the blue
node in the center.
The blue node has five neighbors, with six links between them (in red). Notice that
each of these links, together with two black links, forms a closed cycle, called a
triangle. So we can also think about the local clustering coefficient as counting these
triangles. As shown on the right, there are four dashed links between neighbors of
the blue node (i.e., four additional triangles) that are not present on the left, for a
total of ten possible links altogether. So the local clustering coefficient of the blue
node is 6/10 = 0.6. (The clustering coefficient will always be between 0 and 1.)
Reflection 12.15 In general, if a node has k neighbors, how many possible links
are there between pairs of these neighbors?
Each of the k neighbors could be linked to each of the other k − 1 neighbors, and each such link involves two of them, so there are at most k(k − 1)/2 possible links.
Reflection 12.16 If you had a small local clustering coefficient in your social
network (i.e., if your friends are not friends with each other), what implications
might this have?
It has been suggested that situations like this breed instability. Imagine that, instead
of a social network, we are talking about a network of nations and links represent
the existence of diplomatic relations. A nation with diplomatic relations with many
other nations that are enemies of each other is likely in a stressful situation. It might
be helpful to detect such situations in advance to curtail potential conflicts.
To compute the local clustering coefficient for a node, we need to iterate over
all of the node’s neighbors and count for each one the number of links between it
and the other neighbors of the node. Then we divide this number by the maximum
possible number of links between the node’s neighbors. This is accomplished by the
following function.
def clusteringCoefficient(network, node):   # function name assumed; not shown in this excerpt
    """Return the local clustering coefficient of a node.

    Parameters:
        network: a graph represented by a dictionary
        node: a node in the network
    """
    neighbors = network[node]
    numNeighbors = len(neighbors)
    if numNeighbors <= 1:
        return 0
    numLinks = 0
    for neighbor1 in neighbors:
        for neighbor2 in neighbors:
            if neighbor1 != neighbor2 and neighbor1 in network[neighbor2]:
                numLinks = numLinks + 1
    return numLinks / (numNeighbors * (numNeighbors - 1))
This function is relatively straightforward. The two for loops iterate over every
possible pair of neighbors, and the if statement checks for a link between unique
neighbors. However, this process effectively counts every link twice, so at the end
we divide by numNeighbors * (numNeighbors - 1) (i.e., k(k − 1)), which is twice
what we discussed previously.
Reflection 12.17 Do you see why the function counts every link twice? How can
we fix this?
The function effectively counts every link twice because it checks whether each
neighbor is in every other neighbor’s list of adjacent nodes. Therefore, for any two
connected neighbors, call them A and B, we are counting the link once when we see
A in the list of adjacent nodes of B and again when we see B in the list of adjacent
nodes of A.
To count each link just once, we can use the following trick. In the list of
neighbors, we first check whether the node at index 0 is connected to the nodes at
indices 1 and above, then whether the node at index 1 is connected to the nodes at
indices 2 and above, and so on, so that each pair of neighbors is considered exactly once:
def clusteringCoefficient(network, node):   # improved version; name assumed as above
    """Return the local clustering coefficient of a node, counting each link once.

    Parameters:
        network: a graph represented by a dictionary
        node: a node in the network
    """
    neighbors = network[node]
    numNeighbors = len(neighbors)
    if numNeighbors <= 1:
        return 0
    numLinks = 0
    for index1 in range(len(neighbors) - 1):
        for index2 in range(index1 + 1, len(neighbors)):
            neighbor1 = neighbors[index1]
            neighbor2 = neighbors[index2]
            if neighbor1 != neighbor2 and neighbor1 in network[neighbor2]:
                numLinks = numLinks + 1
    return numLinks / (numNeighbors * (numNeighbors - 1) / 2)
Once we have this function, to compute the clustering coefficient for the network,
we just have to call it for every node, and compute the average. We leave this, and
writing a function to compute the average distance, as exercises.
Scale-free networks
In addition to having short paths and high clustering, researchers soon discovered
that most small-world networks also contain a few highly connected (i.e., high degree)
nodes called hubs that facilitate even shorter paths. In the network in Figure 12.6,
nodes 5, 12, and 18 are hubs because their degrees are large relative to the other
nodes in the network.
The existence of hubs in a large network can be seen by plotting for each node
degree in the network the fraction of the nodes that have that degree. This is called
the degree distribution of the network. The degree distribution for a network with a
few hubs will show the vast majority of nodes having relatively small degree and
just a few nodes having very large degrees. For example, Figure 12.8 shows such a
plot for a small portion of the web. Each node represents a web page and a directed
link from one node to another represents a hyperlink from the first page to the
Figure 12.8 The degree distribution of 875,713 nodes in the web network.
second page.1 In this network, 99% of the nodes have degree at most 25, while just
a few have degrees that are much higher. (In fact, 98% of the nodes have degrees at
most 20 and 90% have degrees at most 15.) These few hubs with high degree enable
a small average distance and a clustering coefficient of about 0.37.
Reflection 12.19 In the web network from Figure 12.8, the degree of a node is the
number of hyperlinks from that page. How do you think the degree distribution might
change if we instead counted the number of hyperlinks to each page?
Networks with this characteristic shape to their degree distributions are called
scale-free networks. The name comes from the observation that the fraction of nodes
with degree d is roughly (1/d)^a, for some small value of a. Such functions are called
“scale-free” because their plots have the same shape regardless of the scale at which
you view them. A scale-free degree distribution is very different from the normal
distribution that seems to describe most natural phenomena, which is why this
discovery was so interesting.
Reflection 12.20 How could recognizing that a network is scale-free and then
identifying the hubs have practical importance?
[Figure 12.9: a Delta Air Lines route map (effective November 2014), showing an airline network organized around a few hub airports.]
is structured in this way, as are airline networks (see Figure 12.9). Also, because
so many of the nodes in a scale-free network are relatively unimportant, scale-free
networks tend to be very robust when subjected to random attacks or damage.
Some have speculated that, because some natural networks are scale-free, they may
represent an evolutionary advantage. On the other hand, because the few hubs are
so important, a directed attack on a hub can cause the network to fail. (Have you
ever noticed the havoc that ensues when an airline hub is closed due to weather?) A
directed attack on a hub can also be advantageous if we want the network to fail.
For example, if we suspect that the network through which an epidemic is traveling
is scale-free, we may have a better chance of stopping it if we vaccinate the hubs.
Exercises
12.3.1. Write a function that returns the average local clustering coefficient for a network.
Test your function by calling it on some of the networks on the book web site. You
will need the function assigned in Exercise 12.1.8 to read these files.
12.3.2. Write a function that returns the average distance between every pair of nodes in
a network. If two nodes are not connected by a path, assign their distance to be
equal to the number of nodes in the network (since this is longer than any possible
path). Test your function by calling it on some of the networks on the book web
site. You will need the function assigned in Exercise 12.1.8 to read these files.
12.3.3. The closeness centrality of a node is the total distance between it and all other
nodes in the network. By this measure, the node with the smallest value is the
most central (and perhaps most influential) node in the network. Write a function
that computes the closeness centrality of a node. Your function should take two
parameters: the network and a node. Test your function by calling it on some
of the networks on the book web site. You will need the function assigned in
Exercise 12.1.8 to read these files.
12.3.4. Using the function you wrote in the previous exercise, write another function that
returns the most central node in a network (with the smallest closeness centrality).
12.3.5. Write a function that plots the degree distribution of a network, producing a plot
like that in Figure 12.8. Test your function on a small network first. Then call
your function on the large Facebook network (with 4,039 nodes and 88,234 links)
that is available on the book web site. (You will need the function assigned in
Exercise 12.1.8 to read these files.) Is the network scale-free?
import random

def randomGraph(n, p):
    """Return a uniform random graph as an adjacency list (dictionary).
    Parameters:
        n: the number of nodes
        p: the probability that two nodes are connected
    """
    graph = { }
    for node in range(n):                    # label nodes 0, 1, ..., n-1
        graph[node] = [ ]                    # graph has n nodes, 0 links
    # the link-adding loop below is a reconstruction; it is not shown in this excerpt
    for node1 in range(n):
        for node2 in range(node1 + 1, n):
            if random.random() < p:          # add each possible link with probability p
                graph[node1].append(node2)
                graph[node2].append(node1)
    return graph
Because we will get a different random graph every time we call this function, any
characteristics that we want to measure will have to be averages over several different
random graphs with the same values of n and p. To illustrate, let’s compute the
average distance, clustering coefficient, and degree distribution for uniform random
graphs with the same number of nodes and links as the graphs in Figures 12.6 and
12.7. Recall that those graphs had 24 nodes and 38 links.
We cannot specify the number of links specifically in a uniform random graph, but
we can set the probability so that we are likely to get a particular number, on
average, over many trials. In particular, we want 38 out of a possible (24 ⋅ 23)/2
links, so we set
p = 38 / ((24 ⋅ 23)/2) = 38/276 ≈ 0.14.
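In general, to aim for an expected number of links m in a uniform random graph with n nodes, we divide m by the number of possible links; a trivial helper (not from the text):

def linkProbability(n, m):
    """Return the probability p that yields an expected m links among n nodes."""
    return m / (n * (n - 1) / 2)        # e.g., linkProbability(24, 38) is about 0.14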
Averaging over 20,000 uniform random graphs, each generated by calling
randomGraph(24, 0.14), we find that the average distance between nodes is about
4.32 and the average clustering coefficient is about 0.12. The table below compares
these results to what we computed previously for the other two graphs.
Graph Average distance Clustering coefficient
Figure 12.6 (clusters) 2.42 0.59
Figure 12.7 (grid) 3.33 0
Uniform random 4.32 0.12
The random graph with the same number of nodes and edges has a slightly longer
average distance and a markedly smaller clustering coefficient than the graph in
Figure 12.6 with the three clusters. Because these graphs are so small, these numbers
alone, while suggestive, are not very strong evidence that random graphs do not
have the small-world or scale-free properties. So let’s also look at the average degree
distribution of the random graphs, shown in Figure 12.10. The shape of the degree
distribution is quite different from that of a scale-free network, and is much closer
to a normal distribution. Because the probability of adding an edge was relatively
low, the average degree was only about 3 and there were a number of nodes with 0
degree, causing the plot to “run into” the y-axis. If we perform the same experiment
with p = 0.5, as shown in Figure 12.11, we get a much clearer bell curve. These
distributions show that random graphs do not have hubs; instead, the nodes all
tend to have about the same degree. So there is definitely something non-random
happening to generate scale-free networks.
Reflection 12.22 What kind of process do you think might create a scale-free
network with a few high-degree nodes?
The presumed process at play has been dubbed preferential attachment or, colloquially,
the “rich get richer” phenomenon. The idea is relatively intuitive: popular
people, destinations, and web pages tend to get more popular over time as word of
them moves through the network.
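The text does not give code for this process, but a rough sketch of the "rich get richer" idea, in which each new node links to one existing node chosen with probability proportional to that node's degree, might look like this (names and details are illustrative):

import random

def preferentialAttachment(n):
    """Grow an n-node graph in which each new node links to one existing
    node chosen with probability proportional to that node's degree."""
    graph = {0: [1], 1: [0]}             # start with a single link
    endpoints = [0, 1]                   # one entry per unit of degree
    for newNode in range(2, n):
        target = random.choice(endpoints)       # degree-proportional choice
        graph[newNode] = [target]
        graph[target].append(newNode)
        endpoints.extend([newNode, target])     # both endpoints gain one degree
    return graph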
Exercises
12.4.1. Show how to create a uniform random graph with 30 nodes and 50 links, on
average.
12.4.2. Exercise 12.3.1 asked you to write a function that returns the clustering coefficient
for a graph. Use this function to write another function
avgCCRandom(n, p, trials)
that returns the average clustering coefficient, over the given number of trials, of
random graphs with the given values of n and p.
12.4.3. Exercise 12.3.2 asked you to write a function that returns the average distance
between any two nodes in a graph. Use this function to write another function
avgDistanceRandom(n, p, trials)
that returns the average of this value, over the given number of trials, for random
graphs with the given values of n and p.
12.4.4. Exercise 12.3.5 asked you to write a function to plot the degree distribution of
a graph and then call the function on the large Facebook network on the book
web site. This network has 4,039 nodes and 88,234 links. To compare the degree
distribution of this network to a random graph of the same size, write a function
degreeDistributionRandom(n, p, trials)
that plots the average degree distribution, over the given number of trials, of
random graphs with the given values of n and p. Then use this function to plot the
degree distribution of random graphs with 4,039 nodes and an average of 88,234
links. What do you notice?
12.4.5. We say that a graph is connected if there is a path between any pair of nodes.
Random graphs that are generated with a low probability p are unlikely to be
connected, while random graphs generated with a high probability p are very likely
to be connected. But for what value of p does this transition between disconnected
and connected graphs occur?
To determine whether a graph is connected, we can use either a breadth-first search
(as in Exercise 12.2.5) or a depth-first search (as in Exercise 12.2.6). In either case,
we start from any node in the network, and try to visit all of the other nodes. If
the search is successful, then the graph must be connected. Otherwise, it must not
be connected.
12.5 SUMMARY
In this chapter, we took a peek at one of the more exciting interdisciplinary areas
in which computer scientists have become engaged. Networks are all around us,
some obvious and some not so obvious. But they can all be described using the
language of graphs. The shortest path and distance between any two nodes in a
graph can be found with the breadth-first search algorithm. Graphs in which the
distance between any two nodes is relatively short and the clustering coefficient is
relatively high are called small-world networks. Networks that are also characterized
by a few high-degree hubs are called scale-free networks. Scientists have discovered
over the last two decades that virtually all large-scale natural and human-made
networks are scale-free. Knowing this about a network can give a lot of information
about how the network works and about its vulnerabilities.
12.7 PROJECTS
Project 12.1 Diffusion of ideas and influence
In this project, you will investigate how memes, ideas, beliefs, and information
propagate through social networks, and who in a network has the greatest influence.
We will simulate diffusion through a network with the independent cascade model.
In this simplified model, an idea originates at one node (called the seed) and each
of the neighbors of this node adopts it with some fixed probability (called the
propagation probability). This probability measures how influential each node is. In
reality, of course, people have different degrees of influence, but we will assume here
that everyone is equally influential. Once the idea has propagated to one or more
new nodes, each of these nodes gets an opportunity to propagate it to each of their
neighbors in the next round. This process continues through successive rounds until
all of the nodes that have adopted the idea have had a chance to propagate it to
their neighbors.
For example, suppose node 1 in the network below is the seed of an idea.
1 2 1 2
4 5 3 4 5 3
6 7 8 6 7 8
1 2 1 2
4 5 3 4 5 3
6 7 8 6 7 8
Round 2 Round 3
In round 1, node 1 is successful in spreading the idea to node 2, but not to node 4.
In round 2, the idea spreads from node 2 to both node 3 and node 5. In round 3, the
idea spreads from node 3 to node 7, but node 3 does not successfully influence node
8. Node 5 also attempts to influence node 4, but is unsuccessful. In round 4 (not
shown), node 7 attempts to spread the idea to nodes 6 and 8, but is unsuccessful,
and the process completes.
Write a function that reads in a network file with this format and returns an adjacency
list representation of the network.
Then write a function diffusion that simulates the independent cascade model starting
from the given seed, using propagation probability p. The function should return the
total number of nodes in the network who adopted the idea (i.e., were successfully influenced).
Next, write a function rateSeed that calls diffusion trials times with the given values
of network, seed, and p, and returns the average number of nodes influenced over that many trials.
Finally, write a function
maxInfluence(network, p, trials)
that calls the rateSeed function for every node in the network, and returns the
most influential node.
Combine these functions into a complete program that finds the most influential
node(s) in the small Facebook network, using a propagation probability of 0.05.
Then answer the following questions. You may write additional functions if you
wish.
Question 12.1.1 Which node(s) turned out to be the most influential in your
experiments?
Question 12.1.2 Were the ten most influential nodes the same as the ten nodes
with the most friends?
Question 12.1.3 Can you find any relationship between a person’s number of
friends and how influential the person is?
Project 12.2

Write a function that reads in a network file with this format and returns an adjacency
list representation of the network.
Next, write a function infection that simulates this infection spreading in network, starting from a randomly chosen
source node, and using infection probability p. The fourth parameter, vaccinated,
is a list of nodes that are immune to the infection. An immune node should never
become infected, and therefore cannot infect others either; more on this in Part 3.
The function should return when all infected nodes have had a chance to infect their
neighbors. The function should return the total number of nodes who were infected.
Then write a simulation function that compares the two vaccination strategies. If its last parameter randomSim is True, the function should perform the simulation
with random vaccinations. Otherwise, it should perform the simulation with your
improved strategy. The fourth parameter, numVaccs, is the number of individuals
to vaccinate. For every number of vaccinations, the simulation should call the
infection function trials times with the given network and infection probability
p. The function should return a list of numVaccs + 1 values representing the average
number of infections that resulted with 0, 1, 2, . . . , numVaccs immune nodes.
Create a program that calls this function once for each strategy with the following
arguments:
• 500 trials
• 50 vaccinations
Then your program should produce a plot showing for both strategies how the
number of infections changes as the number of vaccinations increases.
Question 12.2.2 Why did you think that your strategy would be effective?
Question 12.2.3 How successful was your strategy in reducing infections, compared
to the random strategy?
Question 12.2.4 Given an unlimited number of vaccinations, how many are re-
quired for the infection to not spread at all?
Project 12.3

[Figure 12.12: a small actor network in which Donald Sutherland, Madonna, John Belushi, Penny Marshall, Tom Hanks, and Kevin Bacon are nodes, and links are labeled with the movies Animal House, A League of Their Own, and Apollo 13.]
The files are tab-separated (the ▷ symbols represent tabs). The first entry on each
line is the name of a movie. The movie is followed by a list of actors that appeared
in that movie.
Write a function
createNetwork(filename)
that takes the name of one of these data files as a parameter, and returns an
actor network (as an adjacency list) in which two actors are connected if they have
appeared in the same movie. For example, for the very short file above, the network
would look like that in Figure 12.12. Each link in this network is labeled with
the name of a movie in which the two actors appeared. This is not necessary to
determine a Bacon number, but it is necessary to display all the relationships, as
required by the traditional “Six Degrees of Kevin Bacon” game (more on this later).
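One possible way to build such a network is sketched below, assuming an adjacency list stored as a dictionary of dictionaries so that each link can carry the name of a shared movie (a plain list of neighbors would also work if the labels are not needed).

def createNetwork(filename):
    """Return an actor network built from a tab-separated movie file (a sketch)."""
    network = { }                              # actor -> {co-star: shared movie}
    inputFile = open(filename, 'r', encoding='utf-8')
    for line in inputFile:
        fields = line.strip().split('\t')      # movie title, then its cast
        movie = fields[0]
        cast = fields[1:]
        for actor in cast:
            if actor not in network:
                network[actor] = { }
            for costar in cast:
                if costar != actor:
                    network[actor][costar] = movie   # label the link with the movie
    inputFile.close()
    return network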
The following files were created from the IMDb data, and are available on the
book web site.
File Movies Actors Description
movies2005.txt 560 24,276 MPAA-rated movies from 2005
movies2012.txt 518 25,225 MPAA-rated movies from 2012
movies2013.txt 517 24,190 MPAA-rated movies from 2013
movies2000p.txt 8,396 362,798 MPAA-rated movies from 2000–2014
movies2005p.txt 5,590 251,881 MPAA-rated movies from 2005–2014
movies2010p.txt 2,547 117,908 MPAA-rated movies from 2010–2014
movies_mpaa.txt 12,404 514,780 All movies rated by MPAA
movies_all.txt 565,509 6,147,773 All movies
Start by testing your function with the smaller files, building up to working with
the entire database. Keep in mind that, when working with large networks, this and
your other functions below may take a few minutes to execute. For each file, we give
the number of movies, the total number of actors in all casts (not unique actors),
and a description.
that plots the degree distribution of a network using matplotlib. Again, start with
small files and work your way up.
Question 12.3.1 What does your plot show? Is the network scale-free?
Question 12.3.2 Where does Kevin Bacon fall on your plot? Is he a hub? If so, is
he unique in being a hub?
Question 12.3.3 Which actors have the ten largest degrees? (They might not be
who you expect.)
that repeatedly prompts for the names of two actors, and then prints the distance
between them in the network. Your function should call bfs to find the shortest
distance.
For an extra challenge, modify the algorithm (and your adjacency list) so that it
also prints how the two actors are related. For example, querying the oracle with
Kevin Bacon and Zooey Deschanel might print
Zooey Deschanel and Kevin Bacon have distance 2.
1. Kevin Bacon was in "The Air I Breathe (2007)" with Clark Gregg.
2. Clark Gregg was in "(500) Days of Summer (2009)" with Zooey Deschanel.
that displays this chart for the given network. For example, given the file
movies2013.txt, your function should print the following:
Bacon Number Frequency
------------ ---------
0 1
1 152
2 2816
3 12243
4 4230
5 253
6 47
7 0
8 0
infinity 819
An actor has infinite Bacon number if he or she cannot be reached from Kevin
Bacon.
Once your function is working, call it with some of the larger files.
Question 12.3.4 What does the chart tell you about Kevin Bacon’s place in the
movie universe?
CHAPTER 13
Abstract data types

Algorithms + Data Structures = Programs
Niklaus Wirth
Textbook title (1976)
Throughout this book, we have stressed the importance of both functional and
data abstractions in problem solving. By decomposing algorithms into groups
of functional abstractions, each implemented by a Python function, we have been
able to more easily solve complex problems. Each functional abstraction provides
a solution to a subproblem that we can use to solve our overall problem, without
having to worry about how it works.
We have also made extensive use of a variety of data abstractions, called abstract
data types (ADTs), such as turtles, strings, lists, and dictionaries. Each ADT is
defined by the information it can store and a set of functions that operate on that
information. An ADT is called abstract because its description is independent of
its implementation. Barbara Liskov, the pioneering researcher and Turing Award
winner who first developed the idea of an abstract data type, explained,
What we desire from an abstraction is a mechanism which permits the expression
of relevant details and the suppression of irrelevant details. In the case of
programming, the use which may be made of an abstraction is relevant; the
way in which the abstraction is implemented is irrelevant.
In this chapter, we will discover how to implement new classes based on ADT
descriptions. Programming languages like Python that allow us to create our own
classes are called object-oriented programming languages. As we will see, object-
oriented languages support the ability to create classes that behave the same way
the built-in classes do. For instance, we will define new classes on which we can use
both operators (e.g., +, in) and built-in functions (e.g., str, len). Implementing
an ADT as a class also makes using the ADT more convenient and simultaneously
protects the attributes of each object from being corrupted by mistake. Later in
this chapter, we will design classes on which to build a bird-flocking simulation in
which each bird, implemented as an object, has its own independent attributes and
can interact with other birds in the flock.
def centroid(points):
"""Compute the centroid of a list of points.
Parameter:
points: a list of two-element tuples
n = len(points)
if n == 0:
return None
sumX = 0
sumY = 0
for point in points:
sumX = sumX + point[0]
sumY = sumY + point[1]
return (sumX / n, sumY / n)
def main():
homes = [(0.5, 5), (3.5, 2), (4, 3.5), (5, 2), (7, 1)]
central = centroid(homes)
print(central)
main()
To keep things simple for now, instead of longitude and latitude, we will assume
that x and y represent the east-west and north-south distances (in km), respectively,
from some fixed point (0, 0). (In Project 13.1, we will look at how to compute
distance between geographical points.)
For example, the points in the list named homes above are shown in black below
and the computed centroid is shown in blue.
(0, 0)
We can simplify working with points by designing a new abstract data type for a
point, and then implementing this ADT with a class. For example, we can define
what it means to add two points and divide a point by a number. But, instead
of designing an ADT that is specific to a geometric point, we will define a more
general ADT for an ordered pair of numbers. An ordered pair is any pair of numbers
in which we attribute different meanings to the values in the different positions.
In other words, the ordered pair (3, 14) is different from the ordered pair (14, 3).
We can, for example, use this ADT to represent a geographic location, the (x, y)
position of a particle, a (row, column) position in a grid, or a vector.
Remember that an ADT serves as a blueprint for a category of data. The
ADT defines both data, called attributes and the operations (or functions) that we
are allowed to perform with those attributes. For a pair of numbers, the obvious
attributes are the two numbers, which we will simply name a and b.
The operations of an abstract data type can be divided into four types:
1. constructors, which create new instances of the ADT,
2. accessors, which retrieve information from an instance without changing it,
3. mutators, which change the attribute values of an instance, and
4. destructors, which destroy an instance when it is no longer needed.
Reflection 13.2 In which category does each of the nine operations above fall?
Of the nine operations we identified above for the Pair ADT, the first is the constructor
because it creates a new instance of a Pair. The next five operations are accessors
because they give information that is derived from the attribute values of an ADT
instance, without modifying it. The set and scale operations are mutators because
they change the values of the attributes of an instance of the Pair ADT. Finally, the
last operation is the destructor. In other words:
1. constructor: create
2. accessors: get information
3. mutators: change values
4. destructor: destroy
Implementing a class
In Python, we implement an abstract data type as a class. A class serves as a
blueprint for a category of data. Objects are particular instances of a class. In
Section 3.1, we described this difference by analogy to a species and the organisms
that belong to that species. The species description is like a class; it describes
a category of organisms but is not an organism itself. The individual organisms
belonging to a species are like objects from the same class. For example, in
george = turtle.Turtle()
diego = turtle.Turtle()
distinct objects are assigned to the names george and diego, both belonging to the
class Turtle (which is contained in the module named turtle.py).
When we implement an ADT as a class, the attributes defined by the ADT
are assigned to a set of variables called instance variables. The operations defined
by the ADT are implemented as methods. Instance variables remain hidden to a
programmer using the class, and are only accessed or modified indirectly through
methods. The Turtle class contains several hidden instance variables that store
each Turtle object’s position, color, heading, and whether its tail is up or down.
The Turtle class also defines several familiar methods, such as forward/backward,
left/right, speed, and up/down that we can call to indirectly interact with these
instance variables.
Let’s now implement the Pair abstract data type as a class. The definition of a
new class begins with the keyword class followed by the name of the class and, of
course, a colon. The class’ methods are indented inside the class.
class Pair:
"""An ordered pair class."""
def __init__(self):
"""Constructor initializes a Pair object to (0, 0).
Parameter:
self: a Pair object
The constructor is implicitly called when we call the function bearing the name
of the class to create a new object. For example, when we created the two Turtle
objects above, we invoked the Turtle constructor twice. To invoke the constructor
of the Pair class to create a new Pair object, we call
pair = Pair()
The parameter of the __init__ method, named self, is an implicit reference to the
object on which the method is being called. In the assignment statement above, when
Pair() implicitly calls the constructor, self is assigned to the new object being
created. This same object is implicitly returned by the constructor and assigned to
pair. We never explicitly pass anything in for self. We will see more examples of
this soon.
Inside the constructor method, two instance variables, self._a and self._b,
corresponding to the two attributes in the ADT, are assigned the value 0. These
instance variables are also commented by the descriptions of their corresponding
attributes in the ADT. An instance is another word for an object. Attributes are
called instance variables because every instance of the class (i.e., an object) has
its own copy of the variable. Every instance variable name is preceded by self.
to signify that it belongs to the particular object assigned to self. Since self is
assigned to the new object being created, the assignment statement above is, in
effect, creating a new Pair object named pair and assigning
pair._a = 0
pair._b = 0
The underscore (_) character before each instance variable name is a Python
convention that indicates that the instance variables should be private, meaning
that they should never be accessed from outside the class. However, Python does
not actually prevent someone from accessing private names (as other languages do).
More specifically, any name in a class that is preceded by at least one underscore
and not followed by two underscores like __init__ is private by convention.
Reflection 13.3 Why do you think it is important to keep instance variables pri-
vate?
The scope of an instance variable is the entire object. This means that we can access
and change the values of self._a and self._b from within every method that we
define for the class. In contrast, any variable name defined inside a method that is
not preceded by self is considered to be a local variable within that method.
A constructor can also take additional parameters that are used to initialize the
object’s instance variables. In the case of the Pair class, it would be convenient to
be able to pass the pair’s a and b values into the constructor. This is accomplished
by the following modified constructor:
def __init__(self, a = 0, b = 0):
    """Constructor initializes a Pair object to (a, b).
    Parameter:
        self: a Pair object
    Return value: None
    """
    self._a = a    # first value of the pair
    self._b = b    # second value of the pair
Reflection 13.4 How would you create a new Pair object named myPair with value
(0, 18)?
Earlier, we said that abstract data types also have destructors that destroy ADT
instances when we are done with them. However, Python destroys objects automati-
cally, so we do not need explicit destructors.
Accessor methods
Next, let’s add three accessor methods to the Pair class, based on the ADT
specification. (Notice that the descriptions of the methods in the docstrings are
taken from the ADT specification.)
def getFirst(self):
"""Return the first value of self.
Parameter:
self: a Pair object
return self._a
def getSecond(self):
"""Return the second value of self.
Parameter:
self: a Pair object
return self._b
def get(self):
"""Return the (a, b) tuple representing self.
Parameter:
self: a Pair object
The methods getFirst and getSecond return the values of the self._a and
self._b instance variables, respectively. The get method returns a tuple containing
both self._a and self._b. For example,
pair3 = Pair(3, 14)
a = pair3.getFirst() # a is assigned 3
b = pair3.getSecond() # b is assigned 14
pair = pair3.get() # pair is assigned the tuple (3, 14)
Notice that, like the __init__ method, the first (and only) parameter to these
methods is self. Again, a reference to the object on which the method is called is
assigned to self. For example, when we call pair3.getFirst(), the object pair3
is implicitly passed in for the parameter self, even though it is not passed in the
parentheses following the name of the method. This means that, inside the method,
self._a really refers to pair3._a, the value of the instance variable _a that belongs
to the object named pair3.
Reflection 13.5 Take a moment, if you haven’t already, to type in what we have
so far for the Pair class. (You might omit the docstrings to save time.) Then write
a main function that creates the new Pair object from the previous Reflection and
prints its value using the get method as follows.
def main():
myPair = Pair(0, 18)
print(myPair.get())
Reflection 13.6 Create another Pair object in the main function and print its
values as well.
Arithmetic methods
As in the centroid function, we sometimes want to add (or subtract) the corre-
sponding elements in a pair. So we can define addition of two pairs (a, b) and (c, d)
as the pair (a + c, b + d). For example, (3, 8) + (4, 5) = (7, 13). If we represented pairs
as two-element tuples, then an addition function would look like this:
def add(pair1, pair2):
    """Return the component-wise sum of two pairs.
    Parameters:
        pair1: a tuple representing a pair
        pair2: a tuple representing a pair
    Return value: a tuple representing pair1 + pair2
    """
    return (pair1[0] + pair2[0], pair1[1] + pair2[1])
To use this function to find the sum of (3, 8) and (4, 5), we would do the
following:
duo1 = (3, 8)
duo2 = (4, 5)
sum = add(duo1, duo2) # sum is assigned (7, 13)
print(sum) # prints "(7, 13)"
To implement this functionality for Pair objects, we need to write a similar add
method for the Pair class. Since we are writing a method, one of the points involved
in the sum calculation will be assigned to self and the other will be assigned to a
second parameter. The implementation of the new method is shown below.
def add(self, pair2):
    """Return a new Pair representing the component-wise sum
    of self and pair2.
    Parameters:
        self: a Pair object
        pair2: another Pair object
    Return value: a Pair object representing self + pair2
    """
    return Pair(self._a + pair2._a, self._b + pair2._b)
Now to find the sum of two Pair objects named duo1 and duo2, we would call this
method as follows:
duo1 = Pair(3, 8)
duo2 = Pair(4, 5)
sum = duo1.add(duo2) # sum is assigned Pair(7, 13)
print(sum.get()) # prints "(7, 13)"
In this case, duo1 is assigned to self and duo2 is assigned to the second parameter
pair2. The method creates and returns a new Pair object, as emphasized in red.
Reflection 13.7 Add the add method to your Pair class. Then, define two new
Pair objects in your main function and compute their sum.
Reflection 13.8 Using the add method as a template, write a subtract method
that subtracts another Pair object from self.
Mutator methods
Our Pair ADT also includes two mutator operations that change the values of the
attributes: set and scale. Implementing methods for these operations is no harder
than implementing the previous methods.
def set(self, a, b):
    """Set the first and second values of self.
    Parameters:
        self: a Pair object
        a: a number representing a new first value for self
        b: a number representing a new second value for self
    Return value: None
    """
    self._a = a
    self._b = b

def scale(self, scalar):
    """Multiply both values in self by scalar.
    Parameters:
        self: a Pair object
        scalar: a number by which to scale the values in self
    Return value: None
    """
    self.set(self._a * scalar, self._b * scalar)
The set method, which is almost identical to the constructor, sets the first and
second values of a pair object to the values passed in for the parameters a and b,
respectively. For example,
pair1 = Pair() # pair1 is assigned Pair(0, 0)
pair1.set(3, 14) # pair1 is now Pair(3, 14)
print(pair1.get()) # prints "(3, 14)"
changes the self._a instance variable of pair1 from 0 to 3, and the self._b
instance variable of pair1 from 0 to 14. Then, when we call pair1.get(), we get
the tuple (3, 14).
The scale method takes as a parameter a numerical value (which is called a
scalar in mathematics) and uses the set method to multiply the values of self._a
and self._b by scalar. Notice that, when we call a method of the class from
within another method, we need to preface the name of the called method with
self. just as we do with instance variables.
Reflection 13.9 How can we write the scale method without calling self.set?
Reflection 13.10 Add the set and scale methods to your class. Then, in your
main function, use the set method to change the value of myPair to (43, 23).
Let’s now revisit the centroid function, but pass in a list of Pair objects instead
of a list of tuples.
def centroid(points):
"""Compute the centroid of a list of Pair objects.
Parameter:
points: a list of Pair objects
Return value: a Pair object representing the centroid of the points
"""
n = len(points)
if n == 0:
return None
sum = Pair() # sum is the Pair (0, 0)
for point in points:
sum = sum.add(point) # sum = sum + point
sum.scale(1 / n) # divide sum by n
return sum
def main():
homes = [Pair(0.5, 5), Pair(3.5, 2), Pair(4, 3.5),
Pair(5, 2), Pair(7, 1)]
central = centroid(homes)
print(central.get())
main()
Notice that we have replaced the two sumX and sumY variables with a single sum
variable, initialized to the pair (0, 0). Inside the for loop, each value of point,
which is now a Pair object, is added to sum using the add method. After the loop,
we use the scale method to multiply the point by 1 / n which, of course, is the
same as dividing by n. In the main function, we assign homes to be a list of Pair
objects instead of tuples. Printing the value of the centroid at the end is slightly
more cumbersome because we have to convert it to a tuple first, but we will fix that
in the next section.
Reflection 13.11 Add the code above after your Pair class. What is the value of
the centroid?
Documenting a class
Just as we have consistently documented programs and functions, we have also
documented the Pair class and its methods. Notice that, in the Pair class, there are
docstrings for the class itself immediately after the class line and for each method,
just as we have done for functions. These docstrings provide documentation on your
class to anyone using it. Once the class is defined, calling
help(Pair)
will produce a summary of your class constructed from your docstrings.
class Pair(builtins.object)
| An ordered pair class.
|
| Methods defined here:
|
| __init__(self, a=0, b=0)
| Constructor initializes a Pair object to (a, b).
|
| Parameter:
| self: a Pair object
|
| Return value: None
|
| add(self, pair2)
| Return a new Pair representing the component-wise sum
| of self and pair2.
|
| Parameters:
| self: a Pair object
| pair2: another Pair object
|
| Return value: a Pair object representing self + pair2
⋮
Reflection 13.12 In your program, call help(Pair) to view the complete docu-
mentation for the Pair class.
Exercises
13.1.1. Name two accessor methods and two mutator methods in the Turtle class.
13.1.2. Name two accessor methods and two mutator methods in the list class.
13.1.3. Add a method to the Pair class named round that rounds the two values to the
nearest integers.
13.1.4. Suppose you are tallying the votes in an election between two candidates. Write a
program that repeatedly prompts for additional votes for both candidates, stores
these votes in a Pair object, and then adds this Pair object to a running sum of
votes, also stored in Pair object. For example, your program output may look like
this:
Enter votes (q to quit): 1 2
Enter votes (q to quit): 2 4
Enter votes (q to quit): q
Candidate 1: 3 votes
Candidate 2: 6 votes
13.1.5. Suppose you are writing code for a runner’s watch that keeps track of a list of
split times and total elapsed times. While the timer is running, and the split
button is pressed, the time elapsed since the last split is recorded in a Pair object
along with the total elapsed time so far. For example, if the split button were
pressed at 65, 67, and 62 second intervals, the list of (split, elapsed) pairs would be
[(65, 65), (67, 132), (62, 194)] (where a tuple represents a Pair object).
Write a function that is to be called when the split button is pressed that updates
this list of Pair objects. Your function should take two parameters: the list of
Pair objects and the current split time.
13.1.6. A data logging program for a jetliner periodically records the time along with the
current altitude in a Pair object. Write a function that takes such a list of Pair
objects as a parameter and plots the data using matplotlib.
13.1.7. Write a function that returns the distance between two two-dimensional points,
each represented as a Pair object.
13.1.8. Write a function that returns the average distance between a list of points, each
represented by a Pair object and a given site, also represented as a Pair object.
13.1.9. The file africa.txt, available on the book web site, contains (longitude, latitude)
locations for cities on the African continent. The following program should read
this file into a list of Pair objects, find the closest and farthest pairs of points
in the list, and then plot all of the points using turtle graphics, coloring the
closest pair blue and farthest pair red. To finish this program, add a method
draw(self, tortoise, color) to the Pair class that plots a Pair object as an
(x, y) point, and write the functions named closestPairs and farthestPairs.
import turtle
class Pair:
# fill in here from the text
def closestPairs(points):
pass
def farthestPairs(points):
pass
def main():
points = []
inputFile = open('africa.txt', 'r', encoding = 'utf-8')
for line in inputFile:
values = line.split()
longitude = float(values[0])
latitude = float(values[1])
p = Pair(longitude, latitude)
points.append(p)
george = turtle.Turtle()
screen = george.getscreen()
screen.setworldcoordinates(-37, -23, 37, 58)
george.hideturtle()
george.speed(0)
screen.tracer(10)
for point in points:
point.draw(george, 'black')
cpoint1.draw(george, 'blue')
cpoint2.draw(george, 'blue')
fpoint1.draw(george, 'red')
fpoint2.draw(george, 'red')
screen.update()
screen.exitonclick()
main()
13.1.10. In this chapter, we implemented a Pair ADT with a class in which the two values
were stored in two variables. These two variables comprised the extremely simple
data structure that we used to implement the ADT. Now rewrite the Pair class so
that it stores its two values in a two-element list instead. The way in which the
class’ methods are called should remain exactly the same. In other words, the way
someone uses the class (the ADT specification) must remain the same even though
the implementation (the data structure) changes.
13.1.11. Write a BankAccount class that has a single instance variable (the available
balance), a constructor that takes the initial balance as a parameter, and methods
getBalance (which should return the amount left in the account), deposit (which
should deposit a given amount into the account), and withdraw (which should
remove a given amount from the account).
13.1.12. Using your BankAccount class from the previous exercise, write a program that
prompts for an initial balance, creates a BankAccount object with this balance,
and then repeatedly prompts for deposits or withdrawals. After each transaction, it
should update the BankAccount object and print the current balance. For example:
Initial balance? 100
(D)eposit, (W)ithdraw, or (Q)uit? d
Amount = 50
Your balance is now $150.00
(D)eposit, (W)ithdraw, or (Q)uit? w
Amount = 25
Your balance is now $125.00
(D)eposit, (W)ithdraw, or (Q)uit? q
13.1.13. Write a class that represents a U.S. president. The class should include instance
variables for the president’s name, party, home state, religion, and age when he
or she took office. The constructor should initialize the president’s name to a
parameter value, but initialize all other instance variables to default values (empty
strings or zero). Write accessor and mutator methods for all five instance variables.
13.1.14. On the book web site is a tab-separated file containing a list of all U.S. presidents
with the five instance variables mentioned in the previous exercise. Write a function
that reads this information and returns a list of president objects (using the class
you wrote in the previous exercise) representing all of the presidents in the file.
Also, write a function that, given a list of president objects and an age, prints a
table with all presidents who were at least that old when they took office, along
with their ages when they took office.
13.1.15. Write a Movie class that has as instance variables the movie title, the movie year,
and a list of actors (all of which are initialized in the constructor). Write accessor
and modifier functions for all the instance variables and an addActor method that
adds an actor to the list of actors in the movie. Finally, write a method that takes
as a parameter another movie object and checks whether the two movies have any
common actors.
There is a program on the book web site with which to test your class. The program
reads actors from a movie file (like those used in Project 12.3), and then prompts
for movie titles. For each movie, you can print the actors, add an actor, and check
whether the movie has actors in common with another movie.
13.1.16. Write a class representing a U.S. senator. The Senator class should contain instance
variables for the senator’s name, political party, home state, and a list of committees
on which they serve. The constructor should initialize all of the instance variables
to parameter values, except for the list of committees, which should be initialized
to an empty list. Add accessor methods for all four instance variables, plus a
mutator method that adds a committee to a senator’s list of committees.
13.1.17. On the book web site is a function that reads a list of senators from a file and
returns a list of senator objects, using the Senator class that you wrote in the
previous exercise. Write a program that uses this function to create a list of Senator
objects, and then iterates over the list of Senator objects, printing each senator’s
name, party, and committees. Then your program should prompt repeatedly for
the name of a committee, and print the names and parties of all senators who are
on that committee. For example,
Alexander, Lamar (R)
Committee on Appropriations
Committee on Energy and Natural Resources
Committee on Health, Education, Labor, and Pensions
Committee on Rules and Administration
Ayotte, Kelly (R)
Commission on Security and Cooperation in Europe
Special Committee on Aging
⋮
13.1.20. This exercise assumes you read Section 6.7. Write a Sequence class to represent a
DNA, RNA, or amino acid sequence. The class should store the type of sequence,
a sequence identifier (or accession number), and the sequence itself. Identify and
implement at least three useful methods, in addition to the constructor.
String representations
When we print a Pair object like
counts = Pair(3, 14)
print(counts)
we would ideally get a convenient representation that shows the values in the ordered
pair. Instead, we get
<__main__.Pair object at 0x10215ab50>
To produce a more useful string representation, we can define a special method
named __str__ in the Pair class. Then calling str(counts)
is identical to calling
counts.__str__()
Since the print function implicitly calls str on an object to get the string rep-
resentation that it prints, defining the __str__ method also dictates how print
behaves.
The following __str__ method for the Pair class returns a string representing
the pair in parentheses (like a tuple).
def __str__(self):
"""Return an ’(a, b)’ string representation of self.
Parameter:
self: a Pair object
With this method added to the class, if we want to include a string representation
of a Pair object in a larger string, we can now use the str function. For example,
print('The current votes are ' + str(counts) + '.')
will print
The current votes are (3, 14).
Reflection 13.13 Add the __str__ method to your Pair class. Then print some
of the Pair objects in your main function.
Arithmetic
Just as we can define the special __str__ method in the Pair class to define how
str and print behave with Pair objects, we can define other special methods to
define how operators work with Pair objects. For example, the __add__ method is
implicitly called when the + operator is used. An assignment statement like
name = first + last
is identical to
name = first.__add__(last)
The ability to define this special method for each class is precisely what allows us
to use the + operator in different ways on different objects (e.g., numbers, strings,
lists). We can implement the + operator on Pair objects by simply changing the
name of our add method to __add__:
With this special method defined, we can carry out our previous example as follows
instead:
duo1 = Pair(3, 8)
duo2 = Pair(4, 5)
sum = duo1 + duo2 # sum is assigned Pair(7, 13)
print(sum) # prints "(7, 13)"
Reflection 13.14 Incorporate the __add__ method into your Pair class and ex-
periment with adding Pair objects.
Reflection 13.15 The behavior of the − operator is similarly defined by the __sub__
method. Modify our previous subtract method so that it is called when the − operator
is used with Pair objects.
The __mul__ method, which is implicitly called when the * operator is used, can
similarly be defined to multiply a Pair by a number:

def __mul__(self, scalar):
    """Return a new Pair with the values of self multiplied by scalar.
    Parameters:
        self: a Pair object
        scalar: a number
    Return value: a Pair object representing self * scalar
    """
    return Pair(self._a * scalar, self._b * scalar)
Like the __add__ method, this method returns a new Pair object, but this time it
is simply the values in self multiplied by a number.
Reflection 13.16 How is the __mul__ method different from scale? (What does
each return?)
With this new method, we can easily scale pairs of numbers in one statement:
bets = Pair(150, 100)
double = bets * 2 # double is assigned Pair(300, 200)
A similar special method named __truediv__ defines how the / operator behaves,
returning a new Pair with the values of self divided by a number. Applying the
new addition and (true) division operators and the __str__ method
to our centroid function from the previous section makes the code much more
elegant and natural.
def centroid(points):
""" (docstring omitted) """
n = len(points)
if n == 0:
return None
sum = Pair()
for point in points:
sum = sum + point
return sum / n
def main():
homes = [Pair(0.5, 5), Pair(3.5, 2), Pair(4, 3.5),
Pair(5, 2), Pair(7, 1)]
central = centroid(homes)
print(central)
main()
Comparison
We can also overload the comparison operators ==, <, <=, etc. using the following
special methods.
Operator == != < <= > >=
Method __eq__ __ne__ __lt__ __le__ __gt__ __ge__
The special method named __eq__ defines how the == operator behaves with a class.
It is natural to say that two pairs are equal if their corresponding values are equal,
as the following method implements.
def __eq__(self, pair2):
    """Return True if self and pair2 are equal, False otherwise.
    Parameters:
        self: a Pair object
        pair2: another Pair object
    Return value:
        True if the corresponding values of self and pair2 are equal;
        False otherwise
    """
    return (self._a == pair2._a) and (self._b == pair2._b)
The special method named __lt__ defines how the < operator behaves with a
class. If duo1 and duo2 are two Pair objects, then duo1 < duo2 should return
True if duo1._a < duo2._a, or if duo1._a == duo2._a and duo1._b < duo2._b.
Otherwise, it should return False.
def __lt__(self, pair2):
    """Return True if self is less than pair2, False otherwise.
    Parameters:
        self: a Pair object
        pair2: another Pair object
    Return value: a Boolean value indicating whether self < pair2
    """
    if self._a < pair2._a:
        return True
    return (self._a == pair2._a) and (self._b < pair2._b)
Suppose we store the number of wins and ties in a Pair object for each of three
teams. If a team is ranked higher when it has more wins and the number of ties is
used to rank teams with the same number of wins, then the comparison operators
we defined can be used to decide rankings.
wins1 = Pair(6, 2) # 6 wins, 2 ties
wins2 = Pair(6, 4) # 6 wins, 4 ties
wins3 = Pair(6, 2) # 6 wins, 2 ties
Reflection 13.19 Add these two new methods to your Pair class. Experiment with
some comparisons, including those we did not implement, in your main function.
Indexing
When an element in a string, list, tuple, or dictionary is accessed with indexing, a
special method named __getitem__ is implicitly called. For example, if maxPrices
and minPrices are lists, then
range = maxPrices[0] - minPrices[0]
is equivalent to
range = maxPrices.__getitem__(0) - minPrices.__getitem__(0)
Similarly, when a value is assigned to an indexed element, a special method named
__setitem__ is implicitly called. For example, if temperatures is a list, then
temperatures[1] = 18.9
is equivalent to
temperatures.__setitem__(1, 18.9)
We can define indexing for the Pair class, with index 0 referring to the first value
and index 1 to the second, by writing a __getitem__ method:

def __getitem__(self, index):
    """Return the value of self with the given index.
    Parameters:
        self: a Pair object
        index: an integer (0 or 1)
    Return value: the indexed value of self, or None for any other index
    """
    if index == 0:
        return self._a
    if index == 1:
        return self._b
    return None
Reflection 13.20 Is this behavior consistent with what happens when you use an
erroneous index with a list?
When we use an erroneous index with an object from one of the built-in classes, we
get an IndexError. We will look at how to implement this alternative behavior in
Sections 13.5 and 13.6.
As an example, suppose we have a Pair object defined as follows:
counts = Pair(12, 15)
With the new __getitem__ method, we can retrieve the individual values in counts
with
first = counts[0]
second = counts[1]
Assigning to an index is handled by a __setitem__ method:

def __setitem__(self, index, value):
    """Set the value of self with the given index to value.
    Parameters:
        self: a Pair object
        index: an integer (0 or 1)
        value: a number to which to set a value in self
    Return value: None
    """
    if index == 0:
        self._a = value
    elif index == 1:
        self._b = value
The __setitem__ method assigns self._a or self._b to the given value if index
is 0 or 1, respectively.
With the new __setitem__ method, we can assign a new value to counts with
counts[0] = 14
counts[1] = 16
print(counts) # prints "(14, 16)"
With these indexing methods defined, we can now use indexing within other methods,
where it is convenient. For example, we can use indexing in the __add__ method to
get the individual values and in the set method to assign new values.
def set(self, a, b):
    """ (docstring omitted) """
    self[0] = a
    self[1] = b
Reflection 13.22 Add the two indexing methods to your Pair class. Then modify
the __lt__ method so that it uses indexing to access values of self._a and self._b
instead.
You can find a summary of these and other special methods in Appendix B.9.
Exercises
13.2.1. Implement alternative __mul__ and __truediv__ methods for the Pair class that
multiply two Pair objects. The product of two Pair objects pair1 and pair2 is
a Pair object in which the first value is the product of the first values of pair1
and pair2, and the second value is the product of the second values of pair1 and
pair2. Division is defined similarly.
13.2.2. Implement the remaining four comparison operators (!=, <=, >, >=) for the Pair
class.
13.2.3. Rewrite your linearRegression function from Exercise 8.6.1 so that it takes a
list of Pair objects as a parameter.
13.2.4. Add a __str__ method to the president class that you wrote in Exercise 13.1.13.
The method should return a string containing the president’s name and political
party, for example, ’Kennedy (D)’. Also, write a function that, given a list of pres-
ident objects and a state abbreviation, prints the presidents in this list (indirectly
using the new __str__ method) that are from that state.
13.2.5. Add a __lt__ method to the president class that you wrote in Exercise 13.1.13.
The method should base its results on a comparison of the presidents’ ages.
13.2.6. Add a __str__ method to the Senator class from Exercise 13.1.16 that prints the
name of the senator followed by their party, for example, ’Franken, Al (D)’.
13.2.7. Rewrite the distance function from Exercise 13.1.7 so that it uses indexing to get
the first and second values from each pair.
13.2.8. Modify your program from Exercise 13.1.17 so that it uses the new __str__ method
that you wrote in the previous exercise.
13.2.9. Write a class that represents a rational number (i.e., a number that can be
represented as a fraction). The constructor for your class should take as parameters
a numerator and denominator. In addition, implement the following methods for
your class:
• arithmetic: __add__, __sub__, __mul__, __truediv__
• comparison: __lt__, __eq__, __le__
• __str__
When you are done, you should be able to perform calculations like the following:
a = Rational(3, 2) # 3/2
b = Rational(1, 3) # 1/3
sum = a + b
print(sum) # should print 11/6
print(a < b) # should print False
13.3 MODULES
When we define a new class, we usually intend it to be broadly useful in a variety of
programs. Therefore, we should save each class as a module in its own file. Then we
can use the import statement to import the class definition into a program. This is
precisely the way we have been using the Turtle class all along.
For example, if we save our Pair class in its own module named pair.py, then
we can write a very simple program using the Pair class that looks like the following:
import pair
def main():
pair1 = pair.Pair(3, 8)
pair2 = pair.Pair(4, 5)
sum = pair1 + pair2
difference = pair1 - pair2
print(sum, difference)
main()
As you may have seen in Section 7.3, the import statement both imports names
from another file and executes the code in that file. Therefore, if a module like
pair.py calls its own main function, the main function in pair.py will be executed
when pair is imported into a program, before the main function in the program is
executed. To prevent this from happening, we can place the module’s call to main()
inside the following conditional statement:
if __name__ == '__main__':
main()
"""pair.py"""
class Pair:
def __init__(self, a = 0, b = 0):
⋮
def main():
myPair = Pair(0, 18)
print(myPair)
if __name__ == '__main__':
main()
Without the red if statement, when pair is imported into the program above,
the main function in pair.py will be executed before the main function in the
program. But the main function in pair.py will not be called if we include the
red if statement. The name __name__ is the name of the current module, which
is assigned the value ’__main__’ if the module is run as the main program. But
when pair.py is imported into another module instead, __name__ is set to ’pair’.
So this statement executes main() only if the module is run as the main program.
Namespaces, redux
Recall from Section 3.6 that each function and module defines its own namespace.
The same is true for classes and objects, as illustrated in Figure 13.1. When we
import the pair module with import pair, a new namespace is created with the
name pair. Inside the pair namespace is the name Pair, which refers to the Pair
class. The Pair class also has a namespace containing the names of all of the methods
of that class, as shown on the right side of Figure 13.1. And each of these method
names refers to a function with its own namespace (not shown in the figure).
When a new object is created, its namespace contains the names of all of its
instance variables. For example, in Figure 13.1, the namespace for each of the Pair
objects named duo1 and duo2 contains the instance variable names _a and _b. In
addition, every object namespace contains the special name __class__, which refers
to the class to which the object belongs. In our example, the name __class__ in
the namespaces for duo1 and duo2 refers to the Pair class. The __class__ name
connects an object to its methods. When we call a method for an object such as
duo1.get(), this is equivalent to calling duo1.__class__.get(duo1). (Yuck!)
To avoid having to preface the name of the class with the name of the module,
we could substitute
import pair
with
from pair import Pair
or
from pair import *
[Figure 13.1: a namespace diagram showing the builtins, math, __main__, and pair namespaces; the pair namespace contains the Pair class, and the Pair objects duo1 (with values 3, 8) and duo2 (with values 4, 5) each have their own namespace containing _a, _b, and __class__.]
The * character is a wildcard character that represents all of the names in the
module. Using this alternative import style imports the names in the module into
the current local namespace instead of creating a separate namespace. Since we
always import modules at the top of our programs, the local namespace into which
these names are imported is the global namespace.
Exercises
13.3.1. Create a new module for your BankAccount class from Exercise 13.1.11. Then
rewrite your program from Exercise 13.1.12 so that it imports the BankAccount
class from your new module.
13.3.2. Consider the following short program that uses the BankAccount class that you
wrote in Exercise 13.1.11. The program assumes that the BankAccount class is in
a module named bankaccount.py.
import bankaccount
def main():
account = bankaccount.BankAccount(100)
account.deposit(50)
print('Your balance is ${0:<4.2f}'.format(account.getBalance()))
main()
Draw a namespace diagram like that in Figure 13.1 that includes namespaces for
__main__, the main function, the bankaccount module, the BankAccount class,
and the object named account.
13.3.3. Create a new module for the president class that you wrote in Exercise 13.1.13.
Then rewrite your program from Exercise 13.1.14 so that it imports your president
class from the new module.
13.3.4. Consider the following (silly) program that uses the president class that you wrote
in Exercise 13.1.13. The program assumes that your president class is named
President and is in a module named president.py.
import president
def main():
taft = president.President(’Taft’)
taft.setAge(51)
washington = president.President(’Washington’)
washington.setAge(57)
main()
Draw a namespace diagram like that in Figure 13.1 that includes namespaces for
__main__, the main function, the president module, the President class, and
both President objects.
13.4 A FLOCKING SIMULATION
The operations for a World ADT involve getting its dimensions, getting, setting,
and deleting the agent in a particular position, and querying the neighborhood of
any position. We will represent a position with an (x, y) tuple, where x is a column
number and y is a row number.
class World:
"""A two-dimensional world class."""
def __init__(self, width, height):
"""Construct a new world with the given width and height."""
self._width = width
self._height = height
self._agents = { }
def getWidth(self):
"""Return the width of self."""
return self._width
def getHeight(self):
"""Return the height of self."""
return self._height
The agents in the world are stored in the dictionary self._agents, in which each key
is a position and each value is the agent at that position. Entries are added only as
agents are added to the world.
to store the agents and assume that the rest of the world is empty. In contrast,
the implementation that we used in Chapter 9 explicitly stores every cell, whether
or not it is occupied. If most cells are always occupied, then this implementation
is perfectly reasonable. However, in situations where most cells are not occupied
at any particular time, as will be the case in our boids simulation, the dictionary
implementation is more efficient. In these situations, we say that the space is sparse
or a sparse matrix.
Two-dimensional indexing
To get or change the agent in a particular position in the world, we can define
indexing for the World class using the __getitem__ and __setitem__ methods. In
our two-dimensional world, an index needs to be an (x, y) pair. So we can define the
__getitem__ and __setitem__ methods for our World class to interpret the index
parameter as a two-element tuple that we directly use as a key in our dictionary.
def __getitem__(self, position):
"""Return the agent at the given position, or None if it is empty."""
if position in self._agents:
return self._agents[position]
return None
With these methods, we can get and set particular positions in the world by simply
indexing with tuples. As a simple example, we can place an integer value (in lieu of
an agent for now) in the world at position (2, 1) and then print the value at that
position.
myWorld = World(10, 10)
myWorld[2, 1] = 5
print(myWorld[2, 1])
Notice that we did not need to put parentheses around the tuple values that we
used in the square brackets; Python will automatically wrap the pair in parentheses
and interpret it as a tuple. If we try to get an agent from a position that is empty,
the __getitem__ method returns None. The __setitem__ method does nothing if
we try to insert an agent into a position that is occupied or out of bounds.
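The __setitem__ method itself is not reproduced here; a minimal sketch consistent with that description might look like the following (the exact bounds test is an assumption).

def __setitem__(self, position, agent):
    """Place agent at position if it is in bounds and unoccupied (a sketch)."""
    (x, y) = position
    if (0 <= x < self._width) and (0 <= y < self._height) \
            and (position not in self._agents):
        self._agents[position] = agent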
Reflection 13.23 How can a simulation determine whether a position in the world
is occupied?
When an agent moves, we will need to delete it from its current position in the world.
To delete an item from a Python list or dictionary, we can use the del operator.
For example,
del frequency[18.9]
deletes the (key, value) pair with key equal to 18.9 from the dictionary named
frequency. We can mimic this behavior in a World object by defining the
__delitem__ method.
def __delitem__(self, position):
"""Delete the agent at the given position, if there is one."""
if position in self._agents:
del self._agents[position]
With this method defined, we can delete an agent from a particular position like
this:
del myWorld[2, 1]
The neighbors method returns a list of the agents within a given distance of a position:

def neighbors(self, position, distance):
"""Return a list of agents within distance of position."""
neighbors = []
for otherPosition in self._agents:
if (position != otherPosition) and \
(_distance(position, otherPosition) <= distance):
neighbors.append(self._agents[otherPosition])
return neighbors
The _distance function (not shown) is a private function defined outside the class
that returns the distance between two positions.
Reflection 13.24 Why is the _distance function not a method of World instead?
We decided to not make _distance a method because it does not need to access
any attributes of the class. The leading underscore in its name prevents the function
from being imported into other modules.
def stepAll(self):
"""All agents advance one step in the simulation."""
agents = list(self._agents.values())
for agent in agents:
agent.step()
import math
class World:
"""A two-dimensional world class."""
def __init__(self, width, height):
"""Construct a new world with the given width and height."""
self._width = width
self._height = height
self._agents = { }
def getWidth(self):
"""Return the width of self."""
return self._width
def getHeight(self):
"""Return the height of self."""
return self._height
def __getitem__(self, position):
"""Return the agent at the given position, or None if it is empty."""
if position in self._agents:
return self._agents[position]
return None
def __delitem__(self, position):
"""Delete the agent at the given position, if there is one."""
if position in self._agents:
del self._agents[position]
def neighbors(self, position, distance):
"""Return a list of agents within distance of position."""
neighbors = []
for otherPosition in self._agents:
if (position != otherPosition) and \
(_distance(position, otherPosition) <= distance):
neighbors.append(self._agents[otherPosition])
return neighbors
def stepAll(self):
"""All agents advance one step in the simulation."""
agents = list(self._agents.values())
for agent in agents:
agent.step()
In each step of an agent-based simulation, each agent carries out some application-
specific tasks, such as querying its neighbors and moving to a new location. We
encapsulate this activity in a function named step. In our boids simulation, in each
step, a boid will look at its neighbors, adjust its velocity according to the three rules
of the boids model, and then move to its new position. These intermediate steps
will be handled by the neighbors and move operations below.
Vectors
Velocity is represented by a vector. A vector is simply an ordered pair, like a
geometric point, but it represents a quantity with both magnitude and direction
(e.g., velocity, force, or displacement). A vector ⟨x, y⟩ is often represented by a
directed line segment that extends from the origin (0, 0) to the point (x, y), as
shown below on the left.
[Figure: on the left, a vector ⟨x, y⟩ drawn as a directed line segment from the origin (0, 0) to the point (x, y); on the right, the same vector with magnitude r and angle α, showing x = r cos α and y = r sin α.]
The angle α that the vector makes with the horizontal axis is the direction of the
vector and the length of the line segment is the vector’s magnitude (or length).
If a vector represents velocity, then α is the direction of movement and the magnitude
is the speed. We know from the Pythagorean theorem that the magnitude of the
vector is √(x² + y²). If you know some trigonometry, you also know that the angle
α = tan⁻¹(y/x). Also, if you only know the magnitude r and angle α, you can find
the vector ⟨x, y⟩ with x = r cos α and y = r sin α, as illustrated above on the right.
To facilitate some of the computations that will be necessary in our boids
simulation, we have written a new Vector class, based on the Pair class, but with
a few additional methods that apply specifically to vectors. You can download the
vector.py module from the book web site. We will not dwell on its details here,
but instead include an abbreviated description given by the help function.
class Vector(builtins.object)
| A two-dimensional vector class.
|
| Methods defined here:
|
| __add__(self, vector2)
| Return the Vector that is self + vector2.
|
| __getitem__(self, index)
| Return the value of the index-th coordinate of self.
|
| __init__(self, vector=(0, 0))
| Constructor initializes a Vector object to <x, y>.
|
| __mul__(self, scalar)
| Return the Vector <x * scalar, y * scalar>.
|
| __setitem__(self, index, value)
| Set the index-th coordinate of self to value.
|
| __str__(self)
| Return an ’<x, y>’ string representation of self.
|
| __sub__(self, vector2)
| Return the Vector that is self - vector2.
|
| __truediv__(self, scalar)
| Return the Vector <x / scalar, y / scalar>.
|
| angle(self)
| Return the angle made by self (in degrees).
|
| diffAngle(self, vector2)
| Return the angle (in degrees) between self and vector2.
|
| dotProduct(self, vector2)
| Return the dot product of self and vector2,
| which is the cosine of the angle between them.
|
| get(self)
| Return the (x, y) tuple representing self.
|
| magnitude(self)
| Return the magnitude (length) of self.
|
| scale(self, scalar)
| Multiply the coordinates in self by a scalar value.
|
| set(self, x, y)
| Set the two coordinates in self.
|
| turn(self, angle)
| Rotate self by the given angle (in degrees).
|
| unit(self)
| Return a unit vector in the same direction as self.
class Boid:
"""A boid in a agent-based flocking simulation."""
self._world = myWorld
(x, y) = (random.randrange(self._world.getWidth()),
random.randrange(self._world.getHeight()))
while self._world[x, y] != None:
(x, y) = (random.randrange(self._world.getWidth()),
random.randrange(self._world.getHeight()))
self._position = Pair(x, y)
self._world[x, y] = self
self._velocity = Vector((random.uniform(-1, 1),
random.uniform(-1, 1))).unit()
self._turtle = turtle.Turtle()
self._turtle.speed(0)
self._turtle.up()
self._turtle.setheading(self._velocity.angle())
The constructor assigns the new Boid object a random, unoccupied position in
self._world. Notice that this involves three steps: finding an unoccupied position
using a while loop, assigning this position (a Pair object) to the instance variable
self._position, and placing the new Boid object (self) in self._world at that
position. Next, the instance variable self._velocity is assigned a Vector object
with random value between ⟨−1, −1⟩ and ⟨1, 1⟩ (which covers every angle between
0 and 360 degrees). The unit method scales the velocity vector so that it has
magnitude (speed) 1. Finally, we add an instance variable for a Turtle object to
visualize the boid, and set the turtle’s initial heading to the angle of the velocity
vector.
Reflection 13.25 What angle does the vector ⟨1, 1⟩ represent? What about ⟨−1, −1⟩
and ⟨0, −1⟩?
Reflection 13.26 How many total instance variables does each Boid object have?
Moving a boid
In each step of the simulation, a boid will move to a new location based on
its current velocity (which will change periodically based on its interaction with
neighboring boids). The move method below moves the Boid object by adding the
x and y coordinates of its velocity to the x and y coordinates of its current position,
respectively.
def move(self):
"""Move self to a new position in its world."""
self._turtle.setheading(self._velocity.angle())
width = self._world.getWidth()
height = self._world.getHeight()
After the boid’s new position is assigned to newX and newY, if this new position is
not occupied in the boid’s world, the Boid object moves to it. Finally, if the boid is
approaching a boundary, we rotate its velocity by some angle, so that it turns in the
next step. The constant value TURN_ANGLE along with some other named constants
will be defined at the top of the boid module. To avoid unnaturally abrupt turns,
we use a small angle like
TURN_ANGLE = 30
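The rest of the move method is not shown above; a sketch of the complete method, based on this description (the exact boundary test and the use of the turtle's goto method are assumptions), might look like this.

def move(self):
    """Move self to a new position in its world (a sketch)."""
    self._turtle.setheading(self._velocity.angle())
    width = self._world.getWidth()
    height = self._world.getHeight()
    newX = self._position[0] + self._velocity[0]    # new position from current velocity
    newY = self._position[1] + self._velocity[1]
    if self._world[newX, newY] == None:              # only move into an unoccupied position
        del self._world[self._position[0], self._position[1]]
        self._position = Pair(newX, newY)
        self._world[newX, newY] = self
        self._turtle.goto(newX, newY)
    if newX < 1 or newX > width - 2 or newY < 1 or newY > height - 2:
        self._velocity.turn(TURN_ANGLE)              # turn away from the boundary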
def step(self):
"""Advance self one step in the flocking simulation."""
self._velocity = newVelocity.unit()
self.move()
The private _match, _center, and _avoid methods will compute each of the three
individual velocities. The boids model suggests that the weights assigned to the
rules decrease in order of rule number. So avoidance should have the highest weight,
velocity matching the next highest weight, and centering the lowest weight. For
example, the following values follow these guidelines.
PREV_WEIGHT = 0.5
AVOID_WEIGHT = 0.25
MATCH_WEIGHT = 0.15
CENTER_WEIGHT = 0.1
Note that because we always scale the resulting vector to a unit vector, these weights
need not always sum to 1. Once we have the complete simulation, we can modify
these weights to induce different behaviors.
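The computation of newVelocity is omitted from the step method shown earlier; one plausible sketch combines the previous velocity with the three rule vectors using the weights above.

def step(self):
    """Advance self one step in the flocking simulation (a sketch)."""
    newVelocity = self._velocity * PREV_WEIGHT \
                + self._avoid() * AVOID_WEIGHT \
                + self._match() * MATCH_WEIGHT \
                + self._center() * CENTER_WEIGHT
    self._velocity = newVelocity.unit()
    self.move()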
In each of the _match, _center and _avoid methods, we will need to get a list
of boids within some distance and viewing angle. This is accomplished by the Boid
method named neighbors, shown below.
The method begins by calling the neighbors method of the World class to get a list
of boids within the given distance. Then we iterate over these neighbors, and check
whether each one is visible within the given viewing angle. Doing this requires a little
vector algebra. In a nutshell, neighborDir is the vector pointing in the direction
of the neighbor named boid, from the point of view of this boid (i.e., self). The
diffAngle method of the Vector class computes the angle between neighborDir
and the velocity of this boid. If this angle is within the boid’s viewing angle, we add
the neighboring boid to the list of visible neighbors to return.
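The neighbors method itself is not reproduced here; a sketch based on this description (the parameter names and the exact angle test are assumptions) might look like this.

def neighbors(self, distance, angle):
    """Return a list of visible boids within distance of self (a sketch)."""
    nearby = self._world.neighbors(self._position.get(), distance)
    visible = []
    for boid in nearby:
        # vector pointing from this boid toward its neighbor
        neighborDir = Vector((boid._position[0] - self._position[0],
                              boid._position[1] - self._position[1]))
        if self._velocity.diffAngle(neighborDir) <= angle:
            visible.append(boid)
    return visible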
With this infrastructure in place, methods that follow the three rules are relatively
straightforward. Let’s review them before continuing:
1. Avoid collisions with obstacles and nearby flockmates.
2. Attempt to match the velocity (heading plus speed) of nearby flockmates.
3. Attempt to move toward the center of the flock to avoid predators.
Since rule 2 is slightly easier than the other two, let’s implement that one first.
def _match(self):
"""Return the average velocity of neighboring boids."""
The method first gets a list of visible neighbors, according to a distance and viewing
angle that we will define shortly. If there are no such neighbors, the method returns
the zero vector ⟨0, 0⟩. Otherwise, we return the average velocity of the neighbors,
normalized to a unit vector. Since we have overloaded the addition and division
operators for Vector objects, as we did for the Pair class, finding the average of a
list of vectors is no harder than finding an average of a list of numbers!
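A sketch of the body of _match, based on this description (the constant names MATCH_DISTANCE and MATCH_ANGLE are assumptions), might look like this.

def _match(self):
    """Return the average velocity of neighboring boids (a sketch)."""
    flock = self.neighbors(MATCH_DISTANCE, MATCH_ANGLE)   # assumed constant names
    if len(flock) == 0:
        return Vector((0, 0))
    sum = Vector((0, 0))
    for boid in flock:
        sum = sum + boid._velocity
    return (sum / len(flock)).unit()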
The method to implement the first rule is similar, but now we want to find the
velocity vector that points away from the average position of close boids.
def _avoid(self):
"""Return a velocity away from close neighbors."""
The first part of the method finds the average position (rather than velocity) of
neighboring boids. Then we find the vector that points away from this average
position by subtracting the average position from the position of the boid. This is
illustrated below.
[Figure: a boid at self._position = (x, y) and the average position (a, b) of its close neighbors; the vector (x − a, y − b) points away from the neighbors.]
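A corresponding sketch of _avoid (again, the constant names are assumptions) might look like the following; _center can be written the same way with the subtraction reversed.

def _avoid(self):
    """Return a unit velocity pointing away from close neighbors (a sketch)."""
    flock = self.neighbors(AVOID_DISTANCE, AVOID_ANGLE)   # assumed constant names
    if len(flock) == 0:
        return Vector((0, 0))
    sumPositions = Vector((0, 0))
    for boid in flock:
        sumPositions = sumPositions + Vector(boid._position.get())
    average = sumPositions / len(flock)                   # average neighbor position (a, b)
    away = Vector(self._position.get()) - average         # points from (a, b) toward self
    return away.unit()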
def _center(self):
"""Return a velocity toward center of neighboring flock."""
In the _center method, we perform the subtraction the other way around to
produce a vector in the opposite direction. The distance and viewing angle will
also be different from those in the _avoid method. We want AVOID_DISTANCE to be
much smaller than CENTER_DISTANCE so that boids avoid only other boids that are
very close to them. We will use the following values to start.
import turtle
from world import *
from boid import *
WIDTH = 100
HEIGHT = 100
NUM_BIRDS = 30
ITERATIONS = 2000
def main():
worldTurtle = turtle.Turtle()
screen = worldTurtle.getscreen()
screen.setworldcoordinates(0, 0, WIDTH - 1, HEIGHT - 1)
screen.tracer(NUM_BIRDS)
worldTurtle.hideturtle()
screen.update()
screen.exitonclick()
main()
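The statements that create the flock and run the simulation are omitted from main above; a sketch of what might fill that gap (the exact placement and details are assumptions) follows.

# inside main, after the screen is set up (a sketch)
world = World(WIDTH, HEIGHT)            # create the world
for count in range(NUM_BIRDS):          # create the flock; each Boid adds itself
    Boid(world)
for count in range(ITERATIONS):         # run the simulation
    world.stepAll()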
The complete program is available on the book web site. Figure 13.2 shows a
sequence of four screenshots of the simulation.
Reflection 13.29 Run the program a few times to see what happens. Then try
changing the following constant values. What is the effect in each case?
(a) TURN_ANGLE = 90
(b) PREV_WEIGHT = 0
(c) AVOID_DISTANCE = 8
(d) CENTER_WEIGHT = 0.25
Exercises
13.4.1. Remove each of the three rules from the simulation, one at a time, by setting its
corresponding weight to zero. What is the effect of removing each one? What can
you conclude about the importance of each rule to successful flocking?
13.4.2. Implement the __eq__ method for the Vector class. Then modify the step method
so that it slightly alters the boid’s heading with some probability if the new velocity
is the same as the old velocity (which will happen if it has no neighbors).
13.4.3. Modify the move method so that a boid randomly turns either left or right when it
approaches a boundary. What is the effect?
[Figure: the characters of the string 'code' are pushed onto a stack one at a time (steps 1-4) and then popped off (steps 5-8); because they come off in reverse order, the popped characters spell 'edoc'.]
In addition to the push and pop functions, it is useful to have a function that allows
us to peek at the item on the top of the stack without deleting it, and a function
that tells us when the stack is empty.
class Stack:
    """A stack class."""

    def __init__(self):
        self._stack = []    # the items in the stack

    def top(self):
        """Return the item on the top of the stack."""

        if len(self._stack) > 0:
            return self._stack[-1]
        raise IndexError('stack is empty')

    def pop(self):
        """Return and delete the item on the top of the stack."""

        if len(self._stack) > 0:
            return self._stack.pop()
        raise IndexError('stack is empty')

    def push(self, item):
        """Insert an item on the top of the stack."""

        self._stack.append(item)

    def isEmpty(self):
        """Return true if the stack is empty, false otherwise."""

        return len(self._stack) == 0
We implement the stack with a list named self._stack, and define the end of the
list to be the top of the stack. We then use the append and pop list methods to push
and pop items to and from the top of the stack, respectively. (This is the origin of
the name for the pop method; without an argument, the pop method deletes the
item from the end of the list.)
Reflection 13.30 The stack ADT contained a length attribute. Why do we not
need to include this in the class?
We do not include the length attribute because the list that we use to store the stack
items maintains its length for us; to determine the length of the stack internally, we
can just refer to len(self._stack). For example, in the top method, we need to
make sure the stack is not empty before attempting to access the top element. If we
tried to return self._stack[-1] when self._stack is empty, the following error
would result:
IndexError: list index out of range
Likewise, in the pop method of Stack, if we tried to return self._stack.pop()
when self._stack is empty, we would get this error:
IndexError: pop from empty list
These are not very helpful error messages, both because they refer to a list instead
of the stack and because the first refers to an index, which is not a concept that
is relevant to a stack. In the top and pop methods, we remedy this by using the
raise statement. Familiar errors like IndexError, ValueError, and SyntaxError
are called exceptions. The raise statement “raises” an exception which, by default,
causes the program to end with an error message. The raise statement in red in
the top and pop methods raises an IndexError and prints a more helpful message
after it:
IndexError: stack is empty
Although we will not discuss exceptions in any more depth here, you may be
interested to know that you can also create your own exceptions and respond to (or
“catch”) exceptions in ways that do not result in your program ending. (Look up
try and except clauses.)
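For instance, a caller could catch the exception raised by popping an empty stack, instead of letting the program end, with code along these lines (a minimal sketch using the Stack class above):

import stack

myStack = stack.Stack()
try:
    item = myStack.pop()                 # raises IndexError('stack is empty')
except IndexError as error:
    print('could not pop:', error)       # the program keeps running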
Reflection 13.31 If you have not already done so, type the Stack class into a
module named stack.py. Then run the following program.
import stack

def main():
    myStack = stack.Stack()
    myStack.push('one')
    print(myStack.pop())    # prints "one"
    print(myStack.pop())    # exception

main()
Now let’s explore a few ways we can use our new Stack class. First, the following
function uses a stack to reverse a string, using the algorithm we described above.
import stack

def reverse(text):
    """ (docstring omitted) """

    characterStack = stack.Stack()
    for character in text:
        characterStack.push(character)

    reverseText = ''
    while not characterStack.isEmpty():
        character = characterStack.pop()
        reverseText = reverseText + character

    return reverseText
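As a quick check, reversing the string from the figure at the beginning of this section behaves as expected:

print(reverse('code'))    # prints "edoc"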
Reflection 13.32 If the string ’class’ is passed in for text, what is assigned to
characterStack after the for loop? What string is assigned to reverseText at the
end of the function?
For another example, let’s write a function that converts an integer value to its
corresponding representation in any other base. In Section 1.4 and Box 3.1, we saw
that we can represent integers in other bases by simply using a number other than
10 as the base of each positional value in the number. For example, the numbers
234 in base 10, 11101010 in base 2, EA in base 16, and 1414 in base 5 all represent
the same value, as shown below. To avoid ambiguity, we use a subscript with the
base after any number not in base 10.
11101010₂ = 1 × 2⁷ + 1 × 2⁶ + 1 × 2⁵ + 0 × 2⁴ + 1 × 2³ + 0 × 2² + 1 × 2¹ + 0 × 2⁰
           = 128 + 64 + 32 + 0 + 8 + 0 + 2 + 0
           = 234.

1414₅ = 1 × 5³ + 4 × 5² + 1 × 5¹ + 4 × 5⁰
      = 125 + 100 + 5 + 4
      = 234.
To represent an integer value in another base, we can repeatedly divide by the base
and add the digit representing the remainder to the front of a growing string. For
example, to convert 234 to its base 5 representation:
1. 234 ÷ 5 = 46 remainder 4
2. 46 ÷ 5 = 9 remainder 1
3. 9 ÷ 5 = 1 remainder 4
4. 1 ÷ 5 = 0 remainder 1
In each step, we divide the quotient obtained in the previous step (in red) by the
base. At the end, the remainders (underlined), in reverse order, comprise the final
representation. Since we obtain the digits in the opposite order that they appear in
the number, we can push each one onto a stack as we get it. Then, after all of the
digits are found, we can pop them off the stack and append them to the end of a
string representing the number in the desired base.
import stack

def convert(number, base):
    """Convert a non-negative integer to its representation in another base.

    Parameters:
        number: a non-negative integer
        base: the base in which to represent the value

    Return value: a string representing number in the given base
    """

    digitStack = stack.Stack()
    while number > 0:
        digit = number % base
        digitStack.push(digit)
        number = number // base

    numberString = ''
    while not digitStack.isEmpty():
        digit = digitStack.pop()
        if digit < 10:
            numberString = numberString + str(digit)
        else:
            numberString = numberString + chr(ord('A') + digit - 10)

    return numberString
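Using the value from the example above, the function can be exercised like this:

print(convert(234, 5))     # prints "1414"
print(convert(234, 2))     # prints "11101010"
print(convert(234, 16))    # prints "EA"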
Exercises
13.5.1. Add a __str__ method to the Stack class. Your representation should indicate
which end of the stack is the top, and whether the stack is empty.
13.5.2. Enhance the convert function so that it also works correctly for negative integers.
13.5.3. Write a function that uses a stack to determine whether a string is a palindrome.
13.5.4. Write a function that uses a stack to determine whether the parentheses in an
arithmetic expression (represented as a string) are balanced. For example, the
parentheses in the string ’(1+(2+3)*(4 - 5))’ are balanced but the parentheses
in the strings ’(1+2+3)*(4-5))’ and ’(1+(2+3)*(4-5)’ are not. (Hint: push left
parentheses onto a stack and pop when a right parenthesis is encountered.)
13.5.5. When a function is called recursively, a representation (which includes the values
of its local variables) is pushed onto the top of a stack. The use of a stack can
be seen in Figures 10.11, 10.12, and 10.16. When a function returns, it is popped
from the top of the stack and execution continues with the function instance that
is on the stack below it.
Recursive algorithms can be rewritten iteratively by replacing the recursive calls
with an explicit stack. Rewrite the recursive depth-first function from Section 10.5
iteratively using a stack. Start by creating a stack and pushing the source position
onto it. Then, inside a while loop that iterates while the stack is not empty, pop
a position and check whether it can be visited. If it can, mark it as visited and
push the four neighbors of the position onto the stack. The function should return
True if the destination can be reached, or False otherwise.
13.5.6. Write a class that implements a queue abstract data type, as defined below.
13.5.7. Write a class named Pointset that contains a collection of points. Inside the class,
the points should be stored in a list of Pair objects (from Section 13.2). Your class
should implement the following methods:
• insert(self, x, y) adds a point (x, y) to the point set
• length(self) returns the number of points
• centroid(self) returns the centroid as a Pair object (see Section 13.2)
• closestPairs(self) returns the two closest points as Pair objects (see
Exercise 13.1.9)
• farthestPairs(self) returns the two farthest points as Pair objects
• diameter(self) returns the distance between the two farthest points
• draw(self, tortoise, color) draws the points with the given turtle and
color
Use your class to implement a program that is equivalent to that in Exercise 13.1.9.
Your program should initialize a new Pointset object (instead of a list), insert each
new point into the Pointset object, call your closestPairs, farthestPairs, and
centroid methods, draw all of the points with the draw method of your Pointset
class, and draw the closest pair, farthest pair and centroid in different colors using
the draw method of the Pair class (as assigned in that exercise).
def main():
    worldSeries = { }
    worldSeries[1903] = 'Boston Americans'
    ⋮
    worldSeries[1979] = 'Pittsburgh Pirates'
    ⋮
    worldSeries[2014] = 'San Francisco Giants'

main()
In this section, we will implement our own Dictionary abstract data type to illustrate
how more complex collection types can be implemented. Formally, a dictionary ADT
simply contains a collection of (key, value) pairs and a length.
[Figure: a hash table with six slots, numbered 0 through 5; three of the slots hold the (key, value) pairs (19.0, 1), (18.9, 2), and (19.1, 1) of a frequency dictionary, while the remaining slots are empty.]
Hash tables
Although there are several other ways that a dictionary ADT could be implemented
(such as with a binary search tree, as in Project 11.2), let’s look at how to do so
with our own hash table implementation. When implementing a hash table, there
are two main issues that we need to consider. First, we need to decide what our
hash function should be. Second, we need to decide how to handle collisions, which
occur when more than one key is mapped to the same slot by the hash function.
We will leave an answer to the second question as Project 13.2.
A hash function should ideally spread keys uniformly throughout the hash table
to efficiently use the allocated space and to prevent collisions. Assuming that all
key values are integers, the simplest hash functions have the form
hash(key) = key mod n,
where n is the number of slots in the hash table. Hash functions of this form are
said to use the division method.
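In Python, a division-method hash function might be written as the following minimal sketch. (The name divisionHash is only for illustration; in the Dictionary class below, the same idea appears as a private _hash method that uses the number of slots in the table for n.)

def divisionHash(key, n):
    """Hash an integer key into a table with n slots using the division method."""
    return key % n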
Reflection 13.33 What range of possible slot indices is given by this hash function?
To prevent patterns from emerging in the slot assignments, we typically want n to
be a prime number.
Reflection 13.34 If n were even, what undesirable pattern might emerge in the
slot indices assigned to keys? (For example, what if all keys were odd?)
Reflection 13.35 Suppose n = 11. What is the value of hash(key) for key values
7, 11, 14, 25, and 100?
There are many better hash functions that have been developed, but the topic is
outside the scope of our discussion here. If you are interested, we give some pointers
to additional reading at the end of the chapter. In the exercises, we discuss how you
might define a hash function for keys that are strings.
class Dictionary:
    """A dictionary class."""

    def __init__(self):
        """Construct a new Dictionary object."""

        self._table = [None] * 11    # an empty hash table with 11 slots
        self._length = 0             # the number of (key, value) pairs

    def insert(self, key, value):
        """Insert a (key, value) pair into self."""

        index = self._hash(key)
        self._table[index] = (key, value)
        self._length = self._length + 1
The constructor creates an empty hash table with 11 slots. We represent an empty
slot with None. When a (key, value) pair is inserted into a slot, it will be represented
with a (key, value) tuple. The _hash method implements our simple hash function.
The leading underscore (_) character signifies that this method is private and should
never be called from outside the class. (The leading underscore also tells the help
function not to list this method.) We will discuss this further below. The insert
method begins by using the hash function to map the given key value to the index
of a slot in the hash table. Then the inserted (key, value) tuple is assigned to
this slot (self._table[index]) and the number of entries in the dictionary is
incremented.
Reflection 13.36 With this implementation, how would you create a new dictionary
and insert (1903, Boston Americans), (1979, Pittsburgh Pirates), and (2014, San
Francisco Giants) into it (as in the main function at the beginning of this section)?
worldSeries = Dictionary()
worldSeries.insert(1903, 'Boston Americans')
worldSeries.insert(1979, 'Pittsburgh Pirates')
worldSeries.insert(2014, 'San Francisco Giants')
Reflection 13.37 Using the insert method as a template, write the delete
method.
The delete method is very similar to the insert method if the key to delete exists
in the hash table. In this case, the only differences are that we do not need to pass in
a value, we assign None to the slot instead of a tuple, and we decrement the number
of items.
Reflection 13.38 How do we determine whether the key that we want to delete is
contained in the hash table?
If the slot to which the key is mapped by the hash function contains the value
None, then the key must not exist. But even if the slot does not contain None, it
may be that a different key resides there, so we still need to compare the value of
the key in the slot with the value of the key we wish to delete. Since the key is
the first value in the tuple assigned to self._table[index], the key value is in
self._table[index][0]. Therefore, the required test looks like this:
if self._table[index] != None and self._table[index][0] == key:
    # delete pair
else:
    # raise a KeyError exception
The KeyError exception is the exception raised when a key is not found in Python’s
built-in dictionary. The fleshed-out delete method is shown below.
    def delete(self, key):
        """Delete the pair with a given key."""

        index = self._hash(key)
        if self._table[index] != None and \
           self._table[index][self._KEY] == key:
            self._table[index] = None
            self._length = self._length - 1
        else:
            raise KeyError('key was not found')
To prevent the use of “magic numbers,” we define two private constant values, _KEY
and _VALUE, that correspond to the indices of the key and value in a tuple. Because
they are defined inside the class, but not in any method, these are class variables.
Unlike an instance variable, which can have a unique value in every object belonging
to the class, a class variable is shared by every object in the class. In other words, if
there were ten Dictionary objects in a program, there would be ten independent
instance variables named _length, one in each of the ten objects. But there would be
only one class variable named _KEY. A class variable can be referred to by prefacing
it by either the name of the class or by self.
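For illustration, the two class variables might be defined near the top of the class, outside of any method, roughly like this (a sketch of their placement, not necessarily the book's exact code):

class Dictionary:
    """A dictionary class."""

    _KEY = 0      # class variable: index of the key in a stored (key, value) tuple
    _VALUE = 1    # class variable: index of the value in a stored (key, value) tuple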
Reflection 13.39 How do we delete (2014, ’San Francisco Giants’) from the
worldSeries Dictionary object that we created previously?
To delete this pair, we simply call
worldSeries.delete(2014)
Reflection 13.40 Now write the lookup method for the Dictionary class.
To retrieve a value corresponding to a key, we once again find the index corresponding
to the key and check whether the key is present. If it is, we return the corresponding
value. Otherwise, we raise an exception.
    def lookup(self, key):
        """Return the value associated with a key."""

        index = self._hash(key)
        if self._table[index] != None and \
           self._table[index][self._KEY] == key:
            return self._table[index][self._VALUE]
        else:
            raise KeyError('key was not found')
Reflection 13.41 How do we look up the 1979 World Series champion in the
worldSeries Dictionary object that we created previously?
To look up the winner of the 1979 World Series, we simply call the lookup method
with the key 1979:
champion = worldSeries.lookup(1979)
print(champion) # prints "Pittsburgh Pirates"
We round out the class with a method that returns the number of (key, value) pairs
in the dictionary. In the built-in collection types, the length is returned by the
built-in len function. We can define the same behavior for our Dictionary class by
defining the special __len__ method.
    def __len__(self):
        """Return the number of (key, value) pairs."""

        return self._length

    def isEmpty(self):
        """Return true if the dictionary is empty, false otherwise."""

        return len(self) == 0
With the __len__ method defined, we can find out how many entries are in
the worldSeries dictionary with len(worldSeries). Notice that in the isEmpty
method above, we also call len. Since self refers to a Dictionary object, len(self)
implicitly invokes the __len__ method of Dictionary as well. The result is used to
indicate whether the Dictionary object is empty.
The following main function combines the previous examples to illustrate the
use of our class.
def main():
    worldSeries = Dictionary()
    worldSeries.insert(1903, 'Boston Americans')
    worldSeries.insert(1979, 'Pittsburgh Pirates')
    worldSeries.insert(2014, 'San Francisco Giants')

    print(worldSeries.lookup(1979))    # prints "Pittsburgh Pirates"
    worldSeries.delete(2014)
    print(len(worldSeries))            # prints 2
    print(worldSeries.lookup(2014))    # KeyError
Implementing indexing
As we already discussed, Python’s built-in dictionary class implements insertion
and retrieval of (key, value) pairs with indexing rather than explicit methods as
we used above. We can also mimic this behavior by defining the __getitem__
and __setitem__ methods, as we did for the Pair class earlier. The __getitem__
method takes a single index parameter and the __setitem__ method takes an index
and a value as parameters, just as our lookup and insert methods do. Therefore,
to use indexing with our Dictionary class, we only have to rename our existing
methods, as follows.
    def __setitem__(self, key, value):
        """Insert a (key, value) pair into self."""

        index = self._hash(key)
        self._table[index] = (key, value)
        self._length = self._length + 1

    def __getitem__(self, key):
        """Return the value associated with a key."""

        index = self._hash(key)
        if self._table[index] != None and \
           self._table[index][self._KEY] == key:
            return self._table[index][self._VALUE]
        else:
            raise KeyError('key was not found')
The most general method for deleting an item from a Python collection is to use
the del operator. For example,
del frequency[18.9]
deletes the (key, value) pair with key equal to 18.9 from the dictionary named
frequency. We can mimic this behavior by renaming our delete method to
__delitem__, as follows.
    def __delitem__(self, key):
        """Delete the pair with a given key."""

        index = self._hash(key)
        if self._table[index] != None and \
           self._table[index][self._KEY] == key:
            self._table[index] = None
            self._length = self._length - 1
        else:
            raise KeyError('key was not found')
With these three changes, the following main function is equivalent to the one above.
def main():
    worldSeries = Dictionary()
    worldSeries[1903] = 'Boston Americans'
    worldSeries[1979] = 'Pittsburgh Pirates'
    worldSeries[2014] = 'San Francisco Giants'

    print(worldSeries[1979])    # prints "Pittsburgh Pirates"
    del worldSeries[2014]
    print(len(worldSeries))     # prints 2
    print(worldSeries[2014])    # KeyError
Calling the built-in help function on our Dictionary class now summarizes its methods as follows (notice that the private _hash method, whose name begins with an underscore, is not listed):

class Dictionary(builtins.object)
| A dictionary class.
|
| Methods defined here:
|
| __delitem__(self, key)
| Delete the pair with a given key.
|
| __getitem__(self, key)
| Return the value associated with a key.
|
| __init__(self)
| Construct a new Dictionary object.
|
| __len__(self)
| Return the number of (key, value) pairs.
|
| __setitem__(self, key, value)
| Insert a (key, value) pair into self.
|
| isEmpty(self)
| Return true if the dictionary is empty, false otherwise.
Exercises
13.6.1. Add a private method named _printTable to the Dictionary class that prints
the contents of the underlying hash table. For example, for the dictionary created
in the main function above, the method should print something like this:
0: (1903, 'Boston Americans')
1: (2014, 'San Francisco Giants')
2: None
3: None
4: None
5: None
6: None
7: None
8: None
9: None
10: (1979, 'Pittsburgh Pirates')
13.6.2. Add a __str__ method to the Dictionary class. The method should return a
string similar to that printed for a built-in Python dictionary. It should not divulge
any information about the underlying hash table implementation. For example,
for the dictionary displayed in the previous exercise, the method should return a
string like this:
13.6.3. Implement use of the in operator for the Dictionary class by adding a method
named __contains__. The method should return True if a key is contained in the
Dictionary object, or False otherwise.
13.6.4. Implement Dictionary methods named items, keys, and values that return the
list of (key, value) tuples, keys, and values, respectively. In any sequence of calls
to these methods, with no modifications to the Dictionary object between the
calls, they must return lists in which the orders of the items correspond. In other
words, the first value returned by values will correspond to the first key returned
by keys, the second value returned by values will correspond to the second key
returned by keys, etc. The order of the tuples in the list returned by items must
be the same as the order of the lists returned by keys and values.
13.6.5. Show how to use the keys method that you wrote in the previous exercise to print
a list of keys and values in a Dictionary object in alphabetical order by key.
13.6.6. Use the Dictionary class to solve Exercise 8.3.4.
13.6.7. Use the Dictionary class to implement the removeDuplicates3 function on
Page 401.
13.6.8. If keys are string values, then a new hash function is needed. One simple idea is
to sum the Unicode values corresponding to the characters in the string and then
return the sum modulo n. Implement this new hash function for the Dictionary
class.
13.6.9. Design and implement a hash function for keys that are tuples.
13.6.10. Write a class named Presidents that maintains a list of all of the U.S. presidents.
The constructor should take the number of presidents as a parameter and initialize
a list of empty slots, each containing the value None. Then add __setitem__ and
__getitem__ methods that insert and return, respectively, a President object
(from Exercise 13.1.13) with the given chronological number (starting at 1). Be sure
to check in each method whether the parameters are valid. Also add a __str__
method that returns a complete list of the presidents in chronological order. If
a president is missing, replace the name with question marks. For example, the
following code
presidents = Presidents(44)
washington = President('George Washington')
kennedy = President('John F. Kennedy')
presidents[1] = washington
presidents[35] = kennedy
print(presidents[35])    # prints "John F. Kennedy"
print(presidents)
should print
John F. Kennedy
1. George Washington
2. ???
3. ???
⋮
34. ???
35. John F. Kennedy
36. ???
⋮
44. ???
Also include a method that does the same thing as the function from Exer-
cise 13.1.14. In other words, your method should, given an age, print a table with
all presidents who were at least that old when they took office, along with their
ages when they took office.
13.6.11. Write a class that stores all of the movies that have won the Academy Award for
Best Picture. Inside the class, use a list of Movie objects from Exercise 13.1.15.
Your class should include __getitem__ and __setitem__ methods that return and
assign the movie winning the award in the given edition of the ceremony. (The 87th
Academy Awards were held in 2015.) In addition, include a __str__ method that
returns a string containing the complete list of the winning titles in chronological
order. If a movie is missing, replace the title with question marks. Finally, include
a method that checks whether the winners in two given editions have actors in
common (by calling the method from Exercise 13.1.15). You can find a complete
list of winners at http://www.oscars.org/oscars/awards-databases
13.6.12. Write a class named Roster that stores information about all of the students
enrolled in a course. In the Roster class, store the information about the students
using a list of Student objects from Exercise 13.1.18. Each student will also have
an associated ID number, which can be used to access the student through the
class’ __getitem__ and __setitem__ methods. Also include a __len__ method
that returns the number of students enrolled, a method that returns the average
of all of the exam grades for all of the students, a method that returns the average
overall grade (using the weights in Exercise 13.1.18), and a __str__ method that
returns a string representing the complete roster with current grades. Use of this
class is illustrated with the following short segment of code:
roster = Roster()
alice = Student('Alice Miller')
bob = Student('Bob Smith')
roster[101] = alice
roster[102] = bob
roster[101].addExam(100)
roster[102].addExam(85)
print(roster.examAverage()) # prints 92.5
print(roster.averageGrade()) # prints 46.25
print(roster) # prints:
# 101 Alice Miller 50.00
# 102 Bob Smith 42.50
13.7 SUMMARY
As the quotation at the beginning of the chapter so concisely puts it, a program
consists of two main parts: the algorithm and the data on which the algorithm
works. When we design an algorithm using an object-oriented approach, we begin
by identifying the main objects in our problem, and then define abstract data types
for them. When we design a new ADT, we need to identify the data that the ADT
will contain and the operations that will be allowed on that data. These operations
are generally organized into four categories: constructors, a destructor, accessors,
and mutators.
In an object-oriented programming language like Python, abstract data types are
implemented as classes. A Python class contains a set of functions called methods
and a set of instance variables whose names are preceded by self within the class.
The name self always refers to the object on which a method is called. A class
can also define the meaning of several special methods that dictate how operators
and built-in functions behave on the class. A class defines a data structure
that implements the specification given by the abstract data type. There may be
many different data structures that one can use to implement a particular abstract
data type. For example, the Pair ADT from the beginning of this chapter may be
implemented with two individual variables, a list of length two, a two-element tuple,
or a dictionary with two entries.
To illustrate how classes are used in larger programs, we designed an agent-based
simulation of flocking birds or schooling fish. This simulation consists
of two main classes that interact with each other: an agent class and a class for the
world that the agents inhabit. Agent-based simulations can be used in a variety of
disciplines including sociology, biology, economics, and the physical sciences.
13.9 PROJECTS
Project 13.1 Tracking GPS coordinates
A GPS (short for Global Positioning System) receiver (like those in most mobile
phones) is a small computing device that uses signals from four or more GPS
satellites to compute its three-dimensional position (latitude, longitude, altitude)
on Earth. By recording this position data over time, a GPS device is able to track
moving objects. The use of such tracking data is now ubiquitous. When we go
jogging, our mobile phones can track our movements to record our route, distance,
and speed. Companies and government agencies that maintain vehicle fleets use
GPS to track their locations to streamline operations. Biologists attach small GPS
devices to animals to track their migration behavior.
In this project, you will write a class that stores a sequence of two-dimensional
geographical points (latitude, longitude) with their timestamps. We will call such
a sequence a track. We will use tracking data that the San Francisco Municipal
Transportation Agency (SF MTA) maintains for each of the vehicles (cable cars,
streetcars, coaches, buses, and light rail) in its Municipal Railway (“Muni”) fleet.
For example, the following table shows part of a track for the Powell/Hyde cable
car in metropolitan San Francisco.
Time stamp             Longitude     Latitude
2014-12-01 11:03:03    −122.41144    37.79438
2014-12-01 11:04:33    −122.41035    37.79466
2014-12-01 11:06:03    −122.41011    37.7956
2014-12-01 11:07:33    −122.4115     37.79538
2014-12-01 11:09:03    −122.4115     37.79538
Negative longitude values represent longitudes west of the prime meridian and
negative latitude values represent latitudes south of the equator. After you implement
your track class, write a program that uses it with this data to determine the Muni
route that is closest to any particular location in San Francisco.
An abstract data type for a track will need the following pair of attributes.
In addition, the ADT needs operations that allow us to add new data to the track,
draw a picture of the track, and compute various characteristics of the track.
Question 13.1.1 What does the _distance method do? Why is its name preceded
with an underscore?
Question 13.1.2 What is the purpose of the degToPix function that is passed as
a parameter to the draw method?
After you write each method, be sure to thoroughly test it before moving on to the
next one. For this purpose, design a short track consisting of four to six points and
times, and write a program that tests each method on this track.
http://www.sfmta.com/maps/muni-system-map
registers the function named clickMap as the function to be called when a mouse
click happens. Then the last function call
screen.mainloop()
initiates an event loop in the background that calls the registered clickMap function
on each mouse click. The x and y coordinates of the mouse click are passed as
parameters into clickMap when it is called.
The tracking data is contained in a comma-separated (CSV) file on the book
web site named muni_tracking.csv. The file contains over a million individual
data points, tracking the movements of 965 vehicles over a 24-hour period. The only
function in muni.py that is left to be written is readTracks, which should read
this data and return a dictionary of Track objects, one per vehicle. Each vehicle is
labeled in the file with a unique vehicle tag, which you should use for the Track
objects’ names and as the keys in the dictionary of Track objects.
Question 13.1.3 Why is tracks a global variable? (There had better be a very
good reason!)
Question 13.1.4 What are the purposes of the seven constant named values in all
caps at the top of the program?
Question 13.1.6 Study the clickMap function carefully. What does it do?
Once you have written the readTracks function, test the program. The program
also uses the closestDistance method that you wrote for the Track class, so you
may have to debug that method to get the program working correctly.
Question 13.2.1 How do the implementations of the insert, delete, and lookup
functions need to change to implement chaining?
Question 13.2.2 With your answer to the previous question in mind, what is the
worst case time complexity of each of these operations if there are n items in the
hash table?
for work, and is named for the largest city in the zone. This city name will be the key
for your database, so you will need a hash function that maps strings to hash table
indices. Exercise 13.6.8 suggested one simple way to do this. Do some independent
research to discover at least one additional hash function that is effective for general
strings. Implement this new hash function.
Question 13.2.3 According to your research, why is the hash function you discov-
ered better than the one from Exercise 13.6.8?
1. Print a table like the following of all commuting zone data, alphabetized by
state then by commuting zone name. (Hints: (a) create another Dictionary
object as you read the data file; (b) the sort method sorts a list of tuples by
the first element in the tuple.)
AK
Anchorage: 13.4%
Barrow: 10.0%
Bethel: 5.2%
Dillingham: 11.8%
Fairbanks: 16.0%
Juneau: 12.6%
Ketchikan: 12.0%
Kodiak: 14.7%
Kotzebue: 6.5%
Nome: 4.7%
Sitka: 7.1%
Unalaska: 13.0%
Valdez: 15.4%
AL
Atmore: 4.8%
Auburn: 3.5%
⋮
2. Print a table, like the following, alphabetized by state, of the average probability
for each state. (Hint: use another Dictionary object.)
State Percent
----- -------
AK 11.0%
AL 5.4%
AR 7.2%
⋮
3. Print a table, formatted like that above, of the states with the five lowest and
five highest average probabilities. To do this, it may be helpful to know about
the following trick with the built-in sort method. When the sort method
sorts a list of tuples or lists, it compares the first elements in the tuples or lists.
For example, if values = [(0, 2), (2, 1), (1, 0)], then values.sort()
will reorder the list to be [(0, 2), (1, 0), (2, 1)]. To have the sort
method use another element as the key on which to sort, you can define a
simple function like this:
def getSecond(item):
    return item[1]

values.sort(key = getSecond)
When the list named values is sorted above, the function named
getSecond is called for each item in the list and the return value is
used as the key to use when sorting the item. For example, suppose
values = [(0, 2), (2, 1), (1, 0)]. Then the keys used to sort the three
items will be 2, 1, and 0, respectively, and the final sorted list will be
[(1, 0), (2, 1), (0, 2)].
Slime world
In our simulation, the slime mold’s world will consist of a grid of square patches,
each of which contains some non-negative level of the chemical cAMP. The cAMP
will be deposited by the slime mold (explained next). In each time step, the chemical
in each patch should:
1. Diffuse to the eight neighboring patches. In other words, after the chemical in
a patch diffuses, 1/8 of it will be added to the chemical in each neighboring
patch. (Note: this needs to be done carefully; the resulting levels should be as
if all patches diffused simultaneously.)
Slime world will be modeled as an instance of a class (that you will create) called
World. Each patch in slime world will be modeled as an instance of a class called
Patch (that you will also create). The World class should contain a grid of Patch
objects. You will need to design the variables and methods needed in these new
classes.
There is code on the book web site to visualize the level of chemical in each
patch. Higher levels are represented with darker shades of green on the turtle’s
canvas. Although it is possible to recolor each patch with a Turtle during each time
step, it is far too slow. The supplied code modifies the underlying canvas used in
the implementation of the turtle module.
Amoeboid behavior
At the outset of the simulation, the world will be populated with some number of
slime mold amoeboids at random locations on the grid. At each time step in the
simulation, a slime mold amoeboid will:
1. “Sniff” the level of the chemical cAMP at its current position. If that level
is above some threshold, it will next sniff for chemical SNIFF_DISTANCE units
ahead and SNIFF_DISTANCE units out at SNIFF_ANGLE degrees to the left and
right of its current position. SNIFF_ANGLE and SNIFF_DISTANCE are parameters
that can be set in the simulation. In the graphic below, the slime mold is
represented by a red triangle pointing at its current heading and SNIFF_ANGLE
is 45 degrees. The X’s represent the positions to sniff.
[Figure: a grid of patches with x-coordinates x, x + 1, x + 2 and y-coordinates y, y + 1, y + 2; a red triangle marks the slime mold cell and its heading, and X's mark the patches that it sniffs, one straight ahead and one on each side at SNIFF_ANGLE degrees.]
Notice that neither the current coordinates of the slime mold cell nor the
coordinates to sniff are necessarily integers. You will want to write a function that will
round coordinates to find the patch in which they reside. Once it ascertains
the levels in each of these three patches, it will turn toward the highest level.
The simulation
The main loop of the simulation will involve iterating over some number of time
steps. In each time step, every slime mold amoeboid and every patch must execute
the steps outlined above.
Download a bare-bones skeleton of the classes needed in the project from the
book web site. These files contain only the minimum amount of code necessary to
accomplish the drawing of cAMP levels in the background (as discussed earlier).
Before you write any Python code, think carefully about how you want to design
your project. Draw pictures and map out what each class should contain. Also map
out the main event loop of your simulation. (This will be an additional file.)
As of this writing, the latest version of VPython only works with Python 2.7,
so installing this version of Python will be part of the process. To force your code
to behave in most respects like Python 3, even while using Python 2.7, add the
following as the first line of code in your program files, above any other import
statements:
from __future__ import division, print_function
The three-dimensional coordinate system in VPython looks like this, with the
positive z-axis coming out of the screen toward you.
• The constructor should require a list or tuple parameter to initialize the vector.
The length of the parameter will dictate the length of the Vector object. For
example,
velocity = Vector((1, 0, 0))
will assign a three-dimensional Vector object representing the vector ⟨1, 0, 0⟩.
• Add a __len__ method that returns the length of the Vector object.
• You can delete the angle and turn methods, as you will no longer need them.
• The dot product of two vectors is the sum of the products of corresponding
elements. For example, the dot product of ⟨1, 2, 3⟩ and ⟨4, 5, 6⟩ is 1⋅4+2⋅5+3⋅6 =
4 + 10 + 18 = 32. The dotproduct method needs to be generalized to compute
this quantity for vectors of any length.
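A sketch of the generalized method might look like this; it assumes that the __len__ method described above and indexing (as in the Pair class) are available for Vector objects:

    def dotproduct(self, other):
        """Return the dot product of self and another Vector of the same length."""

        total = 0
        for index in range(len(self)):
            total = total + self[index] * other[index]
        return total

With this definition, Vector((1, 2, 3)).dotproduct(Vector((4, 5, 6))) would return 32, matching the example above.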
In the display constructor, the width and height give the dimensions of the
window and the range argument gives the visible range of points on either
side of the origin. The background color is a deep blue; feel free to change it.
Then, to place a yellow sphere in the center to represent the light, create a
new sphere object like this:
self._light = visual.sphere(scene = self._scene,
                            pos = (0, 0, 0),
                            color = visual.color.yellow)
• In the stepAll method, make the light follow the position of the mouse with
self._light.pos = self._scene.mouse.pos
You can change the position of any visual object by changing the value of
its pos instance variable. In the display object (self._scene), mouse refers
to the mouse pointer within the VPython window.
• Finally, generalize the _distance function, and remove all code from the class
that limits the position of an agent.
• The move method will need to be generalized to three dimensions, but you
can remove all of the code that keeps the boids within a boundary.
• The _avoid, _center, and _match methods can remain mostly the same,
except that you will need to replace instances of Pair with Vector. Also, have
the _avoid method avoid the light in addition to avoiding other boids.
• Write a new method named _light that returns a unit vector pointing toward
the current position of the light. Incorporate this vector into your step method
with another weight
LIGHT_WEIGHT = 0.3
WIDTH = 60
HEIGHT = 60
DEPTH = 60
NUM_MOTHS = 20

def main():
    sky = World(WIDTH, HEIGHT, DEPTH)
    for index in range(NUM_MOTHS):
        moth = Boid(sky)
    while True:
        visual.rate(25)    # 1/25 of a second elapses between computations
        sky.stepAll()

main()
APPENDIX A
Installing Python
The Python programs in this book require Python version 3.4 or later, and two
additional modules: numpy and matplotlib. Generally speaking, you have two
options for installing this software on your computer.
https://store.continuum.io/cshop/anaconda/
Be sure to install the version that includes Python 3.4 or later (not Python 2.7).
Macintosh
1. Download Python 3.4 (or later) from
http://www.python.org/downloads/
and follow the directions to install it. The Python installation and IDLE can
then be found in the Python 3.4 folder in your Applications folder.
2. To install numpy and matplotlib, open the Terminal application and type
pip3 install matplotlib
in the window. This command will download and install matplotlib in addi-
tion to other modules on which matplotlib depends (including numpy). When
you are done, you can quit Terminal.
http://www.barebones.com/products/textwrangler/
To run a Python program in TextWrangler, save the file with a .py extension,
and then select Run from the #! (pronounced “shebang”) menu. To ensure
that TextWrangler uses the correct version of Python, you may need to include
the following as the first line of every program:
#!/usr/bin/env python3
Windows
1. Download Python 3.4 (or later) from
http://www.python.org/downloads/
and follow the directions to install it. After installation, you should find IDLE
in the Start menu, or you can use Windows’ search capability to find it. The
Python distribution is installed in the C:\Python34 folder.
2. To install numpy, go to
http://sourceforge.net/projects/numpy/files/NumPy/
and then click on the highest version number next to a folder icon
(1.9.1 at the time of this writing). Next download the file that looks like
numpy-...-python3.4.exe (e.g., numpy-1.9.1-win32-superpack-python3.4.exe).
Save the file and double-click to install it.
3. To install matplotlib, go to
http://matplotlib.org/downloads.html .
In the section at the top of the page, labeled “Latest stable version”
(version 1.4.3 at the time of this writing), find the file that looks like
matplotlib-...-py3.4.exe (e.g., matplotlib-1.4.3.win32-py3.4.exe)
and click to download it. Save the file and double-click to install it.
4. Open the Command Prompt application (in the Accessories folder) and type
the following commands, one at a time, in the window.
cd \python34\scripts
pip3 install six
pip3 install python-dateutil
pip3 install pyparsing
The first command changes to the folder containing the pip3 program, which
is used to install additional modules in Python. The next three commands
use pip3 to install three modules that are needed by matplotlib. When you
are done, you can quit Command Prompt.
APPENDIX B
Python library reference

The following tables provide a convenient reference for the most common Python
functions and methods used in this book. You can find a complete reference
for the Python standard library at https://docs.python.org/3/library/ .
Image(width, height, [title = 'Title'])
    returns a new empty Image object with the given width and height;
    optionally sets the title of the image window displayed by show

Image(file = 'file.gif', [title = 'Title'])
    returns a new Image object containing the image in the given GIF file;
    optionally sets the title of the image window displayed by show

mainloop()
    waits until all image windows have been closed, then quits the program
Computer Science & Engineering
Chapman & Hall/CRC Textbooks in Computing

Discovering Computer Science: Interdisciplinary Problems, Principles, and Python
Programming introduces computational problem solving as a vehicle of discovery in a
wide variety of disciplines. With a principles-oriented introduction to computational
thinking, the text provides a broader and deeper introduction to computer science.

Organized around interdisciplinary problem domains, rather than programming
language features, each chapter guides students through increasingly sophisticated
algorithmic and programming techniques. The author uses a spiral approach to
introduce Python language features in increasingly complex contexts as the book
progresses.

The text places programming in the context of fundamental computer science
principles, such as abstraction, efficiency, and algorithmic techniques, and offers
overviews of fundamental topics that are traditionally put off until later courses.

The book includes 30 well-developed independent projects that encourage students
to explore questions across disciplinary boundaries. Each is motivated by a problem
that students can investigate by developing algorithms and implementing them as
Python programs.

The book's accompanying website, http://discoverCS.denison.edu, includes sample
code and data files, pointers for further exploration, errata, and links to Python
language references.

Containing over 600 homework exercises and over 300 integrated reflection questions,
this textbook is appropriate for a first computer science course for computer science
majors, an introductory scientific computing course, or, at a slower pace, any
introductory computer science course.

Jessen Havill