Practical Scientific Computing in Python
A Workbook
John D. Hunter
Fernando Pérez
Andrew Straw
Contents

Chapter 1. Introduction
Chapter 2. Simple non-numerical problems
  1. Sorting quickly with QuickSort
  2. Dictionaries for counting words
Chapter 7. Statistics
  1. Descriptive statistics
  2. Statistical distributions
CHAPTER 1
Introduction
This document contains a set of small problems, drawn from many different fields, meant to
illustrate commonly useful techniques for using Python in scientific computing.
All problems are presented in a similar fashion: the task is explained including any necessary
mathematical background and a ‘code skeleton’ is provided that is meant to serve as a starting
point for the solution of the exercise. In some cases, some example output of the expected solution,
figures or additional hints may be provided as well.
The accompanying source download for this workbook contains the complete solutions, which
are not part of this document for the sake of brevity.
For several examples, the provided skeleton contains pre-written tests which validate the correctness of the expected answers. When you have completed the exercise successfully, you should
be able to run it from within IPython and see something like this (illustrated using a trapezoidal
rule problem, whose solution is in the file trapezoid.py):
In [7]: run trapezoid.py
....
----------------------------------------------------------------------
Ran 4 tests in 0.003s
OK
This message tells you that 4 automatic tests were successfully executed. The idea of including automatic tests in your code is a common one in modern software development, and Python includes in its standard library two modules for automatic testing, with slightly different functionality: unittest and doctest. These tests were written using the unittest system, whose complete documentation can be found here: http://docs.python.org/lib/module-unittest.html.
Other exercises will illustrate the use of the doctest system, since it provides complementary
functionality.
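As a taste of what that looks like, here is a minimal, self-contained doctest example (not one of the workbook's exercises): the expected input/output pairs live directly in the docstring, and doctest.testmod() finds and checks them when the file is run as a script.

def square(x):
    """Return the square of x.

    >>> square(3)
    9
    >>> square(-2)
    4
    """
    return x*x

if __name__ == '__main__':
    import doctest
    doctest.testmod()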
CHAPTER 2

Simple non-numerical problems

1. Sorting quickly with QuickSort
def qsort(lst):
    """Return a sorted copy of the input list."""
    raise NotImplementedError

if __name__ == '__main__':
    from unittest import main, TestCase
    import random

    class qsortTestCase(TestCase):
        def test_sorted(self):
            seq = range(10)
            sseq = qsort(seq)
            self.assertEqual(seq, sseq)

        def test_random(self):
            tseq = range(10)
            rseq = range(10)
            random.shuffle(rseq)
            sseq = qsort(rseq)
            self.assertEqual(tseq, sseq)

    main()
Hints.
• Python has no particular syntactic requirements for implementing recursion, but it does have a maximum recursion depth. This value can be queried with sys.getrecursionlimit() and changed with sys.setrecursionlimit().
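For reference, one possible recursive solution is sketched below (the accompanying source download contains the authors' version; this one simply partitions around the first element):

def qsort(lst):
    """Return a sorted copy of the input list."""
    if len(lst) <= 1:
        return list(lst)
    pivot, rest = lst[0], lst[1:]
    # partition the remaining elements around the pivot
    less = [x for x in rest if x < pivot]
    greater = [x for x in rest if x >= pivot]
    return qsort(less) + [pivot] + qsort(greater)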
2. Dictionaries for counting words
def word_freq(text):
    """Return a dictionary of word frequencies for the given text."""
    # XXX you need to write this

def print_vk(lst):
    """Print a list of value/key pairs nicely formatted in key/value order."""
    # Find the longest key: remember, the list has value/key pairs, so the key
    # is element [1], not [0]
    longest_key = max(map(lambda x: len(x[1]), lst))
    # Make a format string out of it
    fmt = '%' + str(longest_key) + 's -> %s'
    # Do the actual printing
    for v, k in lst:
        print fmt % (k, v)

def freq_summ(freqs, n=10):
    """Print a simple summary of a word frequencies dictionary.

    Inputs:
      - freqs: a dictionary of word frequencies.

    Optional inputs:
      - n: the number of items to print"""

    # sort the (count, word) pairs so the least frequent come first
    items = sorted([(v, k) for k, v in freqs.items()])

    print '%d least frequent words:' % n
    print_vk(items[:n])
    print
    print '%d most frequent words:' % n
    print_vk(items[-n:])

if __name__ == '__main__':
    text = XXX
    # You need to read the contents of the file HISTORY.gz. Do NOT unzip it
    # manually; look at the gzip module from the standard library and the
    # read() method of file objects.
    freqs = word_freq(text)
    freq_summ(freqs, 20)
Hints.
• The print_vk function is already provided for you as a simple way to summarize your
results.
• You will need to read the compressed file HISTORY.gz. Python has facilities to do this
without having to manually uncompress it.
• Consider ‘words’ simply the result of splitting the input text into a list, using any form
of whitespace as a separator. This is obviously a very naïve definition of ‘word’, but it
shall suffice for the purposes of this exercise.
• Python strings have a .split() method that allows for very flexible splitting. You can
easily get more details on it in IPython:
In [2]: a = 'somestring'

In [3]: a.split?
Type: builtin_function_or_method
Base Class: <type 'builtin_function_or_method'>
Namespace: Interactive
Docstring:
S.split([sep [,maxsplit]]) -> list of strings
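As a sketch of the file-reading step, using only the standard library (the filename is the one given above):

import gzip
# gzip.open returns a file-like object; read() yields the uncompressed text
text = gzip.open('HISTORY.gz').read()
words = text.split()   # split on any whitespace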
CHAPTER 3

Working with files, the internet, and numpy arrays

This section is a general overview to show how easy it is to load and manipulate data on the file system and over the web using python's built-in data structures and numpy arrays. The goal is to exercise basic programming skills, like building filenames or web addresses to automate certain tasks (loading a series of data files, or downloading a bunch of related files off the web), as well as to illustrate basic numpy and pylab skills.

Listing 3.1
a = 2 # 2 volt amplitude
f = 10 # 10 Hz frequency
sigma = 0.5 # 0.5 volt standard deviation noise
# create the t and v arrays; see the scipy commands arange, sin, and randn
t = XXX # an evenly sampled time array
v = XXX # a noisy sine wave
# create a 2D array X and put t in the 1st column and v in the 2nd;
# see the numpy command zeros
X = XXX
# save the output file as ASCII; see the pylab command save
XXX
# plot the arrays t vs v, label the x-axis, y-axis and title, and save
# the output figure as noisy_sine.png. See the pylab commands plot,
# xlabel, ylabel, title, grid, savefig, show
XXX
and the graph will look something like Figure 1.

The second part of this exercise is to write a script which loads the data from the file into an array X using the load command, extracts the columns into arrays t and v, and computes the RMS (root-mean-square) intensity of the signal, sqrt(mean(v**2)).
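A minimal sketch of this second part (assuming the first part saved its output as noisy_sine.dat; the filename is illustrative):

import numpy as npy
from pylab import load

X = load('noisy_sine.dat')       # two columns: time and voltage
t = X[:, 0]
v = X[:, 1]
rms = npy.sqrt(npy.mean(v**2))   # root-mean-square intensity
print 'RMS intensity:', rms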
The Yahoo Finance page for a given stock has an entry called “Historical Prices” which will take you to a page where you can download the price history of your stock. Near the bottom of this page you should see a “Download To Spreadsheet” link – instead of clicking on it, right click it and choose “Copy Link Location”, and paste this into a python script or ipython session as a string named url. Eg, for SPY:
url = 'http://ichart.finance.yahoo.com/table.csv?' +\
      's=SPY&d=9&e=20&f=2007&g=d&a=0&b=29&c=1993&ignore=.csv'
I’ve broken the url into two strings so they will fit on the page. If you spend a little time looking at this pattern, you can probably figure out what is going on. The URL is encoding the information about the stock: the variable s for the stock ticker, d, e and f for the latest month, day and year, and a, b and c for the start month, day and year. This is handy to know, because below we will write some code to automate some downloads for a stock universe.
One of the great things about python is its “batteries included” standard library, which includes support for dates, csv files and internet downloads. The example interactive session below shows how, in just a few lines of code using python's urllib for retrieving information from the internet and matplotlib's csv2rec function for loading numpy record arrays, we are ready to get to work analyzing some web-based data. Comments have been added to a copy-and-paste from the interactive session.
# import a couple of libraries we’ll be needing
In [23]: import urllib
In [24]: import matplotlib.mlab as mlab
# this will grab that web file and save it as ’SPY.csv’ on our local
# filesystem
In [27]: urllib.urlretrieve(url, ’SPY.csv’)
Out[27]: (’SPY.csv’, <httplib.HTTPMessage instance at 0x2118210>)
# here we use the UNIX command head to peek into the file, which is
# comma separated and contains various types: dates, ints, floats
In [28]: !head SPY.csv
Date,Open,High,Low,Close,Volume,Adj Close
2007-10-19,153.09,156.48,149.66,149.67,295362200,149.67
2007-10-18,153.45,154.19,153.08,153.69,148367500,153.69
2007-10-17,154.98,155.09,152.47,154.25,216687300,154.25
2007-10-16,154.41,154.52,153.47,153.78,166525700,153.78
2007-10-15,156.27,156.36,153.94,155.01,161151900,155.01
2007-10-12,155.46,156.35,155.27,156.33,124546700,156.33
2007-10-11,156.93,157.52,154.54,155.47,233529100,155.47
2007-10-10,156.04,156.44,155.41,156.22,101711100,156.22
2007-10-09,155.60,156.50,155.03,156.48,94054300,156.48
# csv2rec will import the file into a numpy record array, inspecting
# the columns to determine the correct data type
In [29]: r = mlab.csv2rec(’SPY.csv’)
# the dtype attribute shows you the field names and data types.
# O4 is a 4 byte python object (datetime.date), f8 is an 8 byte
# float, i4 is a 4 byte integer and so on. The > and < symbols
# indicate the byte order of multi-byte data types, eg big endian
# or little endian
# Each of the columns is stored as a numpy array, but the types are
# preserved. Eg, the adjusted closing price column adj_close is a
# floating point type, and the date column is a python datetime.date
In [31]: print r.adj_close
[ 149.67 153.69 154.25 ..., 34.68 34.61 34.36]
In [32]: print r.date
[2007-10-19 00:00:00 2007-10-18 00:00:00 2007-10-17 00:00:00 ...,
1993-02-02 00:00:00 1993-02-01 00:00:00 1993-01-29 00:00:00]
For your exercise, you'll elaborate on the code here to do a batch download of a number of stock tickers in a defined stock universe. Define a function fetch_stock(ticker) which takes a stock ticker symbol as an argument and returns a numpy record array. Select the rows of the record array where the date is greater than 2003-01-01, and plot the returns (p − p0)/p0, where p are the prices and p0 is the initial price, by date for each stock on the same plot. Create a legend for the plot using the matplotlib legend command, and print out a sorted list of final returns (eg assuming you bought in 2003 and held to the present) for each stock. Here is the exercise skeleton:
Listing 3.2
"""
Download historical pricing record arrays for a universe of stocks
from Yahoo Finance using urllib. Load them into numpy record arrays
using matplotlib.mlab.csv2rec, and do some batch processing -- make
date vs price charts for each one, and compute the return since 2003
for each stock. Sort the returns and print out the tickers of the 4
biggest winners
"""
import os, datetime, urllib
import matplotlib.mlab as mlab # contains csv2rec
import numpy as npy
import pylab as p
def fetch_stock(ticker):
    """
    Download the CSV file for stock with ticker and return a numpy
    record array. Save the CSV file as TICKER.csv where TICKER is the
    stock's ticker symbol.

    Extra credit for supporting a start date and end date, and
    checking to see if the file already exists on the local file
    system before re-downloading it
    """
    fname = '%s.csv' % ticker
    url = XXX # create the url for this ticker

    # note that the CSV file is sorted most recent date first, so you
    # will probably want to sort the record array so most recent date
    # is last
    XXX
    return r
# define the stock universe; these tickers are just an example, use your own
tickers = ['AAPL', 'IBM', 'GE', 'MSFT', 'SPY']

# we'll store a list of each return and ticker for analysis later
data = []  # a list of (return, ticker) for each stock
fig = p.figure()
for ticker in tickers:
    print 'fetching', ticker
    r = fetch_stock(ticker)

    # plot the returns by date for each stock using pylab.plot, adding
    # a label for the legend
    XXX

# now sort the data by returns and print the results for each stock
XXX
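One possible way to fill in the last two XXX blocks above, sketched under the assumption that fetch_stock returns the record array sorted oldest date first:

import datetime

# inside the loop: select post-2003 rows and plot returns by date
mask = r.date > datetime.date(2003, 1, 1)
price = r.adj_close[mask]
returns = (price - price[0])/price[0]
p.plot(r.date[mask], returns, label=ticker)
data.append((returns[-1], ticker))

# after the loop: sort by total return, report, and show the legend
data.sort()
for ret, ticker in data:
    print '%s: %1.1f%%' % (ticker, 100*ret)
p.legend(loc='upper left')
p.show()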
3. Loading and saving binary data

numpy arrays can be saved to and loaded from disk in raw binary form using the tofile method and the fromfile function. The session below shows the round trip; note that fromfile returns a flat array, since the raw binary file stores no shape information, so the original shape must be restored by hand:

In [18]: x = npy.random.rand(5,2)

In [19]: print x
[[ 0.56331918 0.519582 ]
[ 0.22685429 0.18371135]
[ 0.19384767 0.27367054]
[ 0.35935445 0.95795884]
[ 0.37646642 0.14431089]]

In [20]: x.tofile(file('myx.dat', 'wb'))

In [22]: y = npy.fromfile(file('myx.dat', 'rb'))

# the raw binary file carries no shape information, so y comes back flat
In [23]: y.shape
Out[23]: (10,)

# restore the original shape before printing
In [24]: y.shape = 5, 2

In [25]: print y
[[ 0.56331918 0.519582 ]
[ 0.22685429 0.18371135]
[ 0.19384767 0.27367054]
[ 0.35935445 0.95795884]
[ 0.37646642 0.14431089]]
The advantage of numpy tofile and fromfile over ASCII data is that the data storage is compact and the reads and writes are very fast. It is a bit of a pain that metadata like the array datatype and shape are not stored: in this format just the raw binary numeric data is saved, so you will have to keep track of the data type and shape by other means. This is a good solution if you need to port binary data files between different packages, but if you know you will always be working in python, you can use the python pickle module to preserve all metadata (pickle also works with all standard python data types, but has the disadvantage that other programs and applications cannot easily read it).
# create a 6,3 array of random integers
In [36]: x = (256*numpy.random.rand(6,3)).astype(numpy.int)
In [37]: print x
[[173 38 2]
[243 207 155]
[127 62 140]
[ 46 29 98]
[ 0 46 156]
[ 20 177 36]]
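For comparison, here is a quick sketch of the pickle round trip for the array above (the filename mydata.pkl is illustrative):

import cPickle as pickle

# dump the array to disk; dtype and shape metadata travel with it
pickle.dump(x, file('mydata.pkl', 'wb'))

# load it back: y is a (6,3) integer array identical to x
y = pickle.load(file('mydata.pkl', 'rb'))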
CHAPTER 4

Elementary Numerics

1. Wallis' product for π

Wallis' formula expresses π as the infinite product

π = 2 \prod_{i=1}^{\infty} \frac{4i^2}{4i^2-1}.

Listing 4.1 contains a skeleton with no implementation but with some plotting commands already inserted, so that you can visualize the convergence rate of this formula as more terms are kept.
Listing 4.1
#!/usr/bin/env python
"""Simple demonstration of Python's arbitrary-precision integers."""

def pi(n):
    """Compute pi using n terms of Wallis' product:

    pi(n) = 2 \prod_{i=1}^{n}\frac{4i^2}{4i^2-1}."""
    XXX
# This part only executes when the code is run as a script, not when it is
# imported as a library
if __name__ == '__main__':
    import numpy as N
    import pylab as P

    # Simple convergence demo.

    # XXX Build an array 'nrange' of increasing numbers of terms, and compute
    # the difference between pi(n) at each of them and the true value (numpy's
    # 16-digit value)
    nrange = XXX
    diff = XXX

    # Make a new figure and build a semilog plot of the difference so we can
    # see the quality of the convergence
    P.figure()

    # Line plot with red circles at the data points
    P.semilogy(nrange, diff, '-o', mfc='red')
    P.show()
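One possible floating-point implementation of the pi function (a sketch; the arbitrary-precision flavor alluded to in the docstring would instead accumulate exact integer numerators and denominators):

def pi(n):
    """Compute pi using n terms of Wallis' product."""
    prod = 1.0
    for i in range(1, n+1):
        term = 4.0*i*i
        prod *= term/(term - 1.0)  # the i-th factor 4i^2/(4i^2-1)
    return 2.0*prod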
2. Trapezoidal rule
In this exercise, you are tasked with implementing the simple trapezoid rule formula for numerical integration. If we want to compute the definite integral

(2)    \int_a^b f(x)\,dx
we can partition the integration interval [a, b] into smaller subintervals, and approximate the area
under the curve for each subinterval by the area of the trapezoid created by linearly interpolating
between the two function values at each end of the subinterval. This is graphically illustrated in
Figure 2, where the blue line represents the function f (x) and the red line represents the successive
linear segments.
The area under f(x) (the value of the definite integral) can thus be approximated as the sum of the areas of all these trapezoids. If we denote by x_i (i = 0, \dots, n, with x_0 = a and x_n = b) the abscissas where the function is sampled, then
(3)    \int_a^b f(x)\,dx \approx \frac{1}{2}\sum_{i=1}^{n} (x_i - x_{i-1})\left(f(x_{i-1}) + f(x_i)\right).
The common case of using equally spaced abscissas with spacing h = (b − a)/n reads simply
(4)    \int_a^b f(x)\,dx \approx \frac{h}{2}\sum_{i=1}^{n} \left(f(x_{i-1}) + f(x_i)\right).
One frequently receives the function values already precomputed, y_i = f(x_i), so equation (3) becomes
(5)    \int_a^b f(x)\,dx \approx \frac{1}{2}\sum_{i=1}^{n} (x_i - x_{i-1})\,(y_{i-1} + y_i).
Listing 4.2 contains a skeleton for this problem, written in the form of two incomplete functions
and a set of automatic tests (in the form of unit tests, as described in the introduction).
Listing 4.2
#!/usr/bin/env python
"""Simple trapezoid-rule integrator."""

import numpy as N

def trapz(x, y):
    """Simple trapezoid rule applied to a pair of sampled arrays.

    Inputs:
      - x,y: arrays of the same length. Raises ValueError if the
        lengths differ.

    Output:
      - The result of applying the trapezoid rule to the input, assuming that
        y[i] = f(x[i]) for some function f to be integrated."""

    raise NotImplementedError
def trapzf(f, a, b, npts=100):
    """Simple trapezoid-based integrator.

    Inputs:
      - f: function to be integrated.
      - a, b: endpoints of the integration interval.

    Optional inputs:
      - npts(100): the number of equally spaced points to sample f at, between
        a and b.

    Output:
      - The value of the trapezoid-rule approximation to the integral."""

    raise NotImplementedError
if __name__ == '__main__':
    # Simple tests for trapezoid integrator, when this module is called as a
    # script from the command line.
    import unittest
    import numpy.testing as ntest

    def square(x):
        "the test integrand"
        return x**2

    class trapzTestCase(unittest.TestCase):
        def test_err(self):
            self.assertRaises(ValueError, trapz, range(2), range(3))

        def test_call(self):
            x = N.linspace(0, 1, 100)
            y = N.array(map(square, x))
            ntest.assert_almost_equal(trapz(x, y), 1./3, 4)

    class trapzfTestCase(unittest.TestCase):
        def test_square(self):
            ntest.assert_almost_equal(trapzf(square, 0, 1), 1./3, 4)

        def test_square2(self):
            ntest.assert_almost_equal(trapzf(square, 0, 3, 350), 9.0, 4)

    unittest.main()
In this exercise, you’ll need to write two functions, trapz and trapzf. trapz applies
the trapezoid formula to pre-computed values, implementing equation (5), while trapzf takes a
function f as input, as well as the total number of samples to evaluate, and computes eq. (4).
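As a reference point, one possible vectorized implementation of trapz, directly transcribing equation (5) with numpy slicing:

import numpy as N

def trapz(x, y):
    """Trapezoid rule for sampled data y[i] = f(x[i])."""
    x, y = N.asarray(x, float), N.asarray(y, float)
    if len(x) != len(y):
        raise ValueError('x and y must have the same length')
    # interval widths times the average of the endpoint heights
    return 0.5*((x[1:] - x[:-1])*(y[:-1] + y[1:])).sum()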
3. Newton’s method
Consider the problem of solving for t in
(6)    \int_0^t f(s)\,ds = u
where f(s) is a monotonically increasing function of s and u > 0.
This problem can be simply solved if seen as a root finding question. Let
(7)    g(t) = \int_0^t f(s)\,ds - u,
then we just need to find the root for g(t), which is guaranteed to be unique given the conditions
above.
The SciPy library includes an optimization package that contains a Newton-Raphson solver
called scipy.optimize.newton. This solver can optionally take a known derivative for the
function whose roots are being sought, and in this case the derivative is simply
(8)    \frac{dg(t)}{dt} = f(t).
For this exercise, implement the solution for the test function

f(t) = t\sin^2(t),

using u = 1/4.
Listing 4.3 contains a skeleton that includes, for comparison, the correct numerical value.
Listing 4.3
#!/usr/bin/env python
"""Root finding using SciPy's Newton's method routines."""

import scipy.integrate
import scipy.optimize

quad = scipy.integrate.quad
newton = scipy.optimize.newton

# Use u=0.25
def g(t): XXX

# main
tguess = 10.0

t = XXX # use newton, starting from tguess, to find the root of g

print
print "To six digits, the answer in this case is t==1.06601."
CHAPTER 5
Linear algebra
Like matlab, numpy and scipy have support for fast linear algebra, built upon the highly optimized LAPACK, BLAS and ATLAS fortran linear algebra libraries. Unlike matlab, in which everything is a matrix or vector and the '*' operator always means matrix multiplication, the default object in numpy is an array, and the '*' operator on arrays means element-wise multiplication. Instead, numpy provides a matrix class if you want to do standard matrix-matrix multiplication with the '*' operator, or the dot function if you want to do matrix multiplies with plain arrays. The basic linear algebra functionality is found in numpy.linalg.
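In the session below, X and Y are assumed to be two 3x3 arrays of random numbers, created along these lines:

In [4]: import numpy as npy

In [5]: X = npy.random.rand(3,3)

In [6]: Y = npy.random.rand(3,3)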
# the matrix class will create matrix objects that support matrix
# multiplication with *
In [7]: Xm = npy.matrix(X)
In [8]: Ym = npy.matrix(Y)
In [9]: print Xm*Ym
[[ 0.10670678 0.68340331 0.39236388]
[ 0.27840642 1.14561885 0.62192324]
[ 0.48192134 1.32314856 0.51188578]]
1. Glass Moiré patterns

Listing 5.1

"""
Moiré patterns from random dots.

http://en.wikipedia.org/wiki/Moir%C3%A9_pattern

See L. Glass, 'Moire effect from random dots', Nature 223, 578-580 (1969).
"""
from numpy import cos, sin, pi, matrix
import numpy as npy
import numpy.linalg as linalg
from pylab import figure, show
def csqrt(x):
    'sqrt function that handles x<0 by returning sqrt(-x)j'
    XXX

def myeig(M):
    """
    Compute eigenvalues and eigenvectors analytically.

    Solve the quadratic:

      lambda^2 - tau*lambda + Delta = 0

    where tau = trace(M) and Delta = determinant(M).
    """
    XXX
name = 'saddle'
#sx, sy, angle = XXX

#name = 'center'
#sx, sy, angle = XXX
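A sketch of the eigenvalue half of myeig via the quadratic formula in the docstring, using cmath's complex sqrt in place of the csqrt helper (the eigenvector computation is left to the exercise):

from cmath import sqrt as csqrt

def myeig(M):
    'analytic eigenvalues of a 2x2 matrix M via the quadratic formula'
    a, b = M[0,0], M[0,1]
    c, d = M[1,0], M[1,1]
    tau = a + d                      # trace
    delta = a*d - b*c                # determinant
    disc = csqrt(tau**2 - 4*delta)   # complex when tau^2 < 4*Delta
    return (tau + disc)/2.0, (tau - disc)/2.0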
CHAPTER 6

Signal processing
numpy and scipy provide many of the essential tools for digital signal processing. scipy.signal provides basic tools for digital filter design and filtering (eg Butterworth filters), a linear systems toolkit, standard waveforms such as square waves and sawtooth functions, and some basic wavelet functionality. scipy.fftpack provides a suite of tools for Fourier domain analysis, including 1D, 2D, and ND discrete Fourier transform and inverse functions, in addition to other tools such as analytic signal representations via the Hilbert transform (numpy.fft also provides basic FFT functions). pylab provides matlab-compatible functions for computing and plotting standard time series analyses, such as histograms (hist), auto and cross correlations (acorr and xcorr), power spectra and coherence spectra (psd, csd, cohere and specgram).
1. Convolution
The output of a linear system is given by the convolution of its impulse response function with
the input. Mathematically
(9)    y(t) = \int_0^t x(\tau)\, r(t - \tau)\, d\tau
This fundamental relationship lies at the heart of linear systems analysis. It is used to model the
dynamics of calcium buffers in neuronal synapses, where incoming action potentials are represented
as Dirac δ-functions and the calcium stores are represented with a response function with multiple
exponential time constants. It is used in microscopy, in which the image distortions introduced by
the lenses are deconvolved out using a measured point spread function to provide a better picture of
the true image input. It is essential in structural engineering to determine how materials respond
to shocks.
The impulse response function r is the system response to a pulsatile input. For example, in
Figure 1 below, the response function is the sum of two exponentials with different time constants
and signs. This is a typical function used to model synaptic current following a neuronal action
potential. The figure shows three δ inputs at different times and with different amplitudes. The
corresponding impulse response for each input is shown following it, and is color coded with the
impulse input color. If the system response is linear, by definition, the response to a sum of
inputs is the sum of the responses to the individual inputs, and the lower panel shows the sum
of the responses, or equivalently, the convolution of the impulse response function with the input
function.
In Figure 1, the summing of the impulse response function over the three inputs is conceptually and visually easy to understand. Some find the concept of a convolution of an impulse response function with a continuous time function, such as a sinusoid or a noise process, conceptually more difficult. It shouldn't be. By the sampling theorem, we can represent any finite bandwidth continuous time signal as a train of Dirac-δ functions, where the height of the δ function at each time point is simply the amplitude of the signal at that time point. The only requirement is that the sampling frequency be at least twice the highest spectral frequency in the signal (the Nyquist rate). See Figure 2 for a representation of a delta function sampling of a damped, oscillatory, exponential function.
In the exercise below, we will convolve a sample from the normal distribution (white noise) with
a double exponential impulse response function. Such a function acts as a low pass filter, so the
resultant output will look considerably smoother than the input. You can use numpy.convolve
to perform the convolution numerically.
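numpy.convolve itself is simple to drive. A quick sketch (the time constants and sample spacing here are arbitrary illustrative choices):

import numpy as npy

dt = 0.01                                # sample spacing, in seconds
t = npy.arange(0.0, 10.0, dt)
x = npy.random.randn(len(t))             # white noise input
r = npy.exp(-t/0.5) - npy.exp(-t/1.0)    # double exponential impulse response
# the 'full' output has length len(x)+len(r)-1; keep the first len(x) samples
y = dt*npy.convolve(x, r, mode='full')[:len(t)]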
We also explore the important relationship that a convolution in the temporal (or spatial) domain becomes a multiplication in the spectral domain, which is mathematically much easier to work with:

Y = R \cdot X

where Y, X, and R are the Fourier transforms of the respective variables in the temporal convolution equation above. The Fourier transform of the impulse response function serves as an amplitude weighting and phase shifting operator for each frequency component. Thus, we can get deeper insight into the effects of the impulse response function r by studying the amplitude and phase spectrum of its transform R. In the example below, however, we simply use the multiplication property to perform the same convolution in Fourier space to confirm the numerical result from numpy.convolve.
Listing 6.1
"""
In signal processing, the output of a linear system to an arbitrary
input is given by the convolution of the impule response function (the
system response to a Dirac-delta impulse) and the input signal.
Mathematically:
where x(t) is the input signal at time t, y(t) is the output, and r(t)
is the impulse response function.
* using numpy.convolve
def impulse_response(t):
’double exponential response function’
return XXX
# now inverse fft and extract the real part, just the part up to
# len(x)
yi = XXX
2. FFT image denoising

In the output of numpy's FFT functions, the transformation values beyond the midpoint of the frequency spectrum (the Nyquist frequency) correspond to the values for negative frequencies and are simply the mirror image of the positive frequencies below the Nyquist (this is true for the 1D, 2D and ND FFTs in numpy).

In this exercise we will compute the 2D spatial frequency spectrum of the luminance image, zero out the high frequency components, and inverse transform back into the spatial domain. We can plot the input and output images with the pylab.imshow function, but the images must first be scaled to be within the 0..1 luminance range. For best results, it helps to amplify the image by some scale factor, and then clip it, setting all values greater than one to one. This serves to enhance contrast among the darker elements of the image, so it is not completely dominated by the brighter segments.
Listing 6.2
#!/usr/bin/env python
"""Image denoising example using 2-dimensional FFT."""
import numpy as N
import pylab as P
import scipy as S
def mag_phase(F):
    """Return magnitude and phase components of spectrum F."""
    XXX

def plot_spectrum(F):
    """Display the magnitude of spectrum F, scaled and clipped to [0,1]."""
    # XXX Set M to the magnitude of F (see mag_phase), amplified by some
    # scale factor to bring up the darker regions.

    # XXX Next, clip all values larger than one to one. You can set all
    # elements of an array which satisfy a given condition with array indexing
    # syntax: ARR[ARR<VALUE] = NEWVALUE, for example.

    # Display: this one already works, if you did everything right with M
    P.imshow(M, P.cm.Blues)
# 'main' script
im = XXX # make an image array from the file 'moonlanding.png', using the
         # pylab imread() function. You will need to just extract the red
         # channel from the MxNx4 RGBA matrix to represent the grayscale
         # intensities

F = XXX # Compute the 2d FFT of the input image. Look for a 2-d FFT in N.dft.

# XXX Call ff a copy of the original transform, to be filtered below. Numpy
# arrays have a copy method for this purpose.

# XXX Set r and c to be the number of rows and columns of the array. Look for
# the shape attribute...
# The code below already works, if you did everything above right.
P.figure()
P.subplot(221)
P.title(’Original image’)
P.imshow(im, P.cm.gray)
P.subplot(222)
P.title(’Fourier transform’)
plot_spectrum(F)
P.subplot(224)
P.title(’Filtered Spectrum’)
plot_spectrum(ff)
P.subplot(223)
P.title(’Reconstructed Image’)
P.imshow(im_new, P.cm.gray)
P.show()
CHAPTER 7

Statistics
R, a statistical package based on S, is viewed by some as the best statistical software on the planet, and in the open source world it is the clear choice for sophisticated statistical analysis. Like python, R is an interpreted language written in C with an interactive shell. Unlike python, which is a general purpose programming language, R is a specialized statistical language. Since python is an excellent glue language, with facilities for providing a transparent interface to FORTRAN, C, C++ and other languages, it should come as no surprise that you can harness R's immense statistical power from python, through the rpy third-party extension library.

However, R is not without its warts. As a language, it lacks python's elegance and advanced programming constructs and idioms. It is also GPL, which means you cannot distribute code based upon it unhindered: the code you distribute must be GPL as well (python, and the core scientific extension libraries, carry a more permissive license which supports distribution in closed source, proprietary applications).

Fortunately, the core scientific libraries for python (primarily numpy and scipy.stats) provide a wide array of statistical tools, from basic descriptive statistics (mean, variance, skew, kurtosis, correlation, . . . ) to hypothesis testing (t-tests, χ² tests, analysis of variance, general linear models, . . . ) to analytical and numerical tools for working with almost every discrete and continuous statistical distribution you can think of (normal, gamma, Poisson, Weibull, lognormal, Lévy stable, . . . ).
1. Descriptive statistics
The first step in any statistical analysis should be to describe, characterize and, importantly, visualize your data. The normal distribution (aka Gaussian or bell curve) lies at the heart of much of formal statistical analysis, and normal distributions have the tidy property that they are completely characterized by their mean and variance. As you may have observed in your interactions with family and friends, most of the world is not normal, and many statistical analyses are flawed by summarizing data with just the mean and standard deviation (the square root of the variance) and associated significance tests (eg the T-test) as if it were normally distributed data.

In the exercise below, we write a class to provide descriptive statistics of a data set passed into the constructor, with class methods to pretty print the results and to create a battery of standard plots which may show structure missing in a casual analysis. Many new programmers, or even experienced programmers used to a procedural environment, are uncomfortable with the idea of classes, having heard their geekier programmer friends talk about them but not really being sure what to do with them. There are many interesting things one can do with classes (aka object oriented programming), but at their heart they are a way of bundling data with methods that operate on that data. The self variable is special in python and is how the class refers to its own data and methods. Here is a toy example, sketched with illustrative names:
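In [114]: class MyData:
   .....:     def __init__(self, x):
   .....:         self.x = x
   .....:     def sumsquare(self):
   .....:         'return the sum of squares of the stored data'
   .....:         return (self.x**2).sum()
   .....:

In [115]: import numpy

In [116]: mydata = MyData(numpy.random.rand(100))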
In [117]: mydata.sumsquare()
Out[117]: 29.6851135284
Listing 7.1
import scipy.stats as stats
from matplotlib.mlab import detrend_linear, load
import numpy
import pylab

# placeholder so that the skeleton runs; replace each XXX with real code
XXX = None
class Descriptives:
    """
    a helper class for basic descriptive statistics and time series plots
    """
    def __init__(self, samples):
        self.samples = numpy.asarray(samples)
        self.N = XXX        # the number of samples
        self.median = XXX   # sample median
        self.min = XXX      # sample min
        self.max = XXX      # sample max
        self.mean = XXX     # sample mean
        self.std = XXX      # sample standard deviation
        self.var = XXX      # sample variance
        self.skew = XXX     # the sample skewness
        self.kurtosis = XXX # the sample kurtosis
        self.range = XXX    # the sample range max-min

    def __repr__(self):
        """
        Create a string representation of self; pretty print all the
        attributes.
        """
        descriptives = (
            'N = %d' % self.N,
            XXX # the rest here
        )
        return '\n'.join(descriptives)

    def plots(self, figfunc, Fs=1, fmt='-o'):
        """
        Create a battery of standard time series plots, using figfunc
        (a figure-generating function, eg pylab.figure).

        keyword args:
          Fs: the sampling frequency of the data
          fmt: the plot line format string

        Return an object c whose attributes are the created axes,
        eg c.ax1, c.ax2, ...
        """
        XXX # create the figure and axes and plot the data
        return c
if __name__ == '__main__':
    # load the data in filename fname into the list data, which is a
    # list of floating point values, one value per line. Note you
    # will have to do some extra parsing
    data = []
    #fname = 'data/nm560.dat' # tree rings in New Mexico 837-1987
    fname = 'data/hsales.dat' # home sales
    for line in file(fname):
        line = line.strip()
        XXX # parse the line and append the value to data
    desc = Descriptives(data)
    print desc

    c = desc.plots(pylab.figure, Fs=12, fmt='-o')
    c.ax1.set_title(fname)

    pylab.show()
2. Statistical distributions
We explore a handful of the statistical distributions in the scipy.stats module and the connections between them. The organization of the distribution functions in scipy.stats is quite elegant, with each distribution providing random variates (rvs), analytical moments (mean, variance, skew, kurtosis), analytic density (pdf, cdf) and survival functions (sf, isf) (where available), and tools for fitting empirical distributions to the analytic distributions (fit).

In the exercise below, we will simulate a radioactive particle emitter, and look at the empirical distribution of waiting times compared with the expected analytical distributions. Our radioactive particle emitter has an equal likelihood of emitting a particle in any equal time interval, and emits particles at a rate of 20 Hz. We will discretely sample time at a high frequency, record a 1 if a particle is emitted and a 0 otherwise, and then look at the distribution of waiting times between emissions. The probability of a particle emission in one of our sample intervals (assumed to be very small compared to the average interval between emissions) is proportional to the rate and the sample interval ∆t, ie p(∆t) = α∆t, where α is the emission rate in particles per second.

The waiting times between the emissions should follow an exponential distribution (see scipy.stats.expon) with a mean of 1/α. In the exercise below, you will generate a long array of emissions, and compute the waiting times between single emissions, between every 2 emissions, and between every 10 emissions. These should approach a 1st order gamma (aka exponential) distribution, a 2nd order gamma, and a 10th order gamma (see scipy.stats.gamma). Use the probability density functions for these distributions in scipy.stats to compare your simulated distributions and moments with the analytic versions provided by scipy.stats. With 10 waiting times, we should be approaching a normal distribution, since we are summing 10 waiting times and, under the central limit theorem, the sum of independent samples from a finite variance process approaches the normal distribution (see scipy.stats.norm). In the final part of the exercise below, you will be asked to approximate the 10th order gamma distribution with a normal distribution. The results should look something like those in Figure 2.
Listing 7.2
"""
Illustrate the connections between the uniform, exponential, gamma and
normal distributions by simulating waiting times from a radioactive
source using the random number generator. Verify the numerical
results by plotting the analytical density functions from scipy.stats
"""
import numpy
import scipy.stats
from pylab import figure, show, close
show()
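A sketch of the core simulation step, under the assumptions stated above (a rate α of 20 Hz, and a sample interval dt chosen so that α·dt is much smaller than 1):

import numpy

alpha = 20.0    # emission rate, particles per second
dt = 0.001      # sample interval in seconds; alpha*dt << 1
N = 1000000     # number of time samples

# 1 if a particle was emitted in a given interval, 0 otherwise
emit = numpy.random.rand(N) < alpha*dt

# indices of the emissions; successive differences are waiting times
ind = numpy.nonzero(emit)[0]
wait = numpy.diff(ind)*dt

# the sample mean should approach the analytic mean 1/alpha
print 'mean waiting time:', wait.mean(), ' expected:', 1.0/alpha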
Figure 2.