Python Data Analysis
by Bernd Klein
bodenseo
© 2021 Bernd Klein
All rights reserved. No portion of this book may be reproduced or used in any
manner without written permission from the copyright owner.
www.python-course.eu
Data Analysis With Python by Bernd Klein
Numpy Tutorial
Numpy Tutorial: Creating Arrays
Data Type Objects, dtype
Numerical Operations on Numpy Arrays
Numpy Arrays: Concatenating, Flattening and Adding Dimensions
Python, Random Numbers and Probability
Weighted Probabilities
Synthetical Test Data With Python
Numpy: Boolean Indexing
Matrix Multiplication, Dot and Cross Product
Reading and Writing Data Files
Overview of Matplotlib
Format Plots
Matplotlib Tutorial
Shading Regions with fill_between()
Matplotlib Tutorial: Spines and Ticks
Matplotlib Tutorial, Adding Legends and Annotations
Matplotlib Tutorial: Subplots
Matplotlib Tutorial: Gridspec
GridSpec using SubplotSpec
Matplotlib Tutorial: Histograms and Bar Plots
Matplotlib Tutorial: Contour Plots
Introduction into Pandas
Data Structures
Accessing and Changing values of DataFrames
Pandas: groupby
Reading and Writing Data
Dealing with NaN
Binning in Python and Pandas
Expenses and Income Example
Net Income Method Example
NUMERICAL PROGRAMMING WITH PYTHON
Numerical Computing defines an area of computer science and mathematics dealing with algorithms for
numerical approximations of problems from mathematical or numerical analysis, in other words: Algorithms
solving problems involving continuous variables. Numerical analysis is used to solve science and engineering
problems.
Data science is an interdisciplinary subject which includes, for example, statistics and computer science, especially programming and problem solving skills. Data science includes everything which is necessary to create and prepare data, to manipulate, filter and cleanse data and to analyse data. Data can be both structured and unstructured. We could also say that data science includes all the techniques needed to extract and gain information and insight from data.
Data science is an umbrella term which incorporates data analysis, statistics, machine learning and other related scientific fields in order to understand and analyze data.
Another term occurring quite often in this context is "Big Data". Big Data is for sure one of the most often used buzzwords in the software-related marketing world. Marketing managers have found out that using this term can boost the sales of their products, regardless of whether they are really dealing with big data or not. The term is often used in fuzzy ways.
Big data is data which is too large and complex for conventional data-processing application software to deal with. The problems include capturing and collecting data, data storage, searching, visualization and querying of the data, and so on. Big data is often characterized by attributes such as the following:
• volume:
the sheer amount of data, whether gigabytes, terabytes, petabytes or exabytes
• velocity:
the speed of arrival and processing of data
The big question is how useful Python is for these purposes. If we used only Python without any special modules, it would perform poorly on the previously mentioned tasks. We will describe the necessary tools in the following chapter.
The principal disadvantage of MATLAB compared to Python is the cost: Python with NumPy, SciPy, Matplotlib and Pandas is completely free, whereas MATLAB can be very expensive. "Free" means both "free" as in "free beer" and "free" as in "freedom"! Even though MATLAB has a huge number of additional toolboxes available, Python has the advantage that it is a more modern and complete programming language. Python is continually becoming more powerful through a rapidly growing number of specialized modules.
Python in combination with Numpy, Scipy, Matplotlib and Pandas can be used as a complete replacement for MATLAB.
INTRODUCTION
NumPy is a module for Python. The name is an acronym for "Numeric Python" or "Numerical Python". It is pronounced /ˈnʌmpaɪ/ (NUM-py) or, less often, /ˈnʌmpi/ (NUM-pee). It is an extension module for Python, mostly written in C. Its precompiled mathematical and numerical functions and functionalities guarantee great execution speed.
SciPy (Scientific Python) is often mentioned in the same breath with NumPy. SciPy needs NumPy, since it is based on the data structures of NumPy and their basic creation and manipulation functions. It extends the capabilities of NumPy with further useful functions for minimization, regression, Fourier transformation and many others.
Both NumPy and SciPy are not part of a basic Python installation. They have to be installed after the Python
installation. NumPy has to be installed before installing SciPy.
(Comment: The diagram of the image on the right side is the graphical visualisation of a matrix with 14 rows and 20 columns. It's a so-called Hinton diagram. The size of a square within this diagram corresponds to the magnitude of the corresponding value of the depicted matrix. The colour determines whether the value is positive or negative. In our example, red denotes negative values and green denotes positive values.)
NumPy is based on two earlier Python modules dealing with arrays. One of these is Numeric. Numeric is, like NumPy, a Python module for high-performance numeric computing, but it is obsolete nowadays. Another predecessor of NumPy is Numarray, which is a complete rewrite of Numeric but is deprecated as well. NumPy is a merger of those two, i.e. it is built on the code of Numeric and the features of Numarray.
When we say "Core Python", we mean Python without any special modules, i.e. especially without NumPy. To use NumPy, it first has to be imported:

import numpy

But you will hardly ever see this. Numpy is usually imported under the alias np:

import numpy as np
Our first simple Numpy example deals with temperatures. Given is a list with values, e.g. temperatures in
Celsius:
cvalues = [20.1, 20.8, 21.9, 22.5, 22.7, 22.3, 21.8, 21.2, 20.9, 20.1]
C = np.array(cvalues)
print(C)
[20.1 20.8 21.9 22.5 22.7 22.3 21.8 21.2 20.9 20.1]
Let's assume we want to turn the values into degrees Fahrenheit. This is very easy to accomplish with a numpy array. The solution to our problem can be achieved by simple scalar multiplication:
print(C * 9 / 5 + 32)
[68.18 69.44 71.42 72.5 72.86 72.14 71.24 70.16 69.62 68.18]
The array C has not been changed by this expression:
print(C)
[20.1 20.8 21.9 22.5 22.7 22.3 21.8 21.2 20.9 20.1]
Compared to this, the solution for our Python list looks awkward:
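The listing itself is missing in this copy; the list-based solution presumably used a list comprehension like this:

fvalues = [x * 9 / 5 + 32 for x in cvalues]
print(fvalues)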
So far, we referred to C as an array. The internal type is "ndarray" or to be even more precise "C is an instance
of the class numpy.ndarray":
type(C)
Output: numpy.ndarray
In the following, we will use the terms "array" and "ndarray" in most cases synonymously.
If you use the jupyter notebook, you might be well advised to include the following line of code to prevent an external window from popping up and to have your diagram included in the notebook:

%matplotlib inline
The code to generate a plot for our values looks like this:

import matplotlib.pyplot as plt

plt.plot(C)
plt.show()
The function plot uses the values of the array C for the values of the ordinate, i.e. the y-axis. The indices of the
array C are taken as values for the abscissa, i.e. the x-axis.
To calculate the memory consumption of the list from the above picture, we will use the function getsizeof
from the module sys.
The size of a Python list consists of the general list information, the size needed for the references to the
elements and the size of all the elements of the list. If we apply sys.getsizeof to a list, we get only the size
without the size of the elements. In the previous example, we made the assumption that all the integer
elements of our list have the same size. Of course, this is not valid in general, because memory consumption
will be higher for larger integers.
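The measuring code itself is missing in this copy; a sketch of what it presumably looked like (the list values are a guess, only the element count of four matters for the sizes shown below):

from sys import getsizeof as size

lst = [24, 12, 57, 42]                      # hypothetical values
size_of_list_object = size(lst)             # size of the list structure only
size_of_elements = len(lst) * size(lst[0])  # 28 bytes for each small integer
total_list_size = size_of_list_object + size_of_elements
print("Size without the size of the elements: ", size_of_list_object)
print("Size of all the elements: ", size_of_elements)
print("Total size of list, including elements: ", total_list_size)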
We will check now how the memory usage changes with the number of integer elements in the list. We also look at an empty list:
lst = []
print("Empty list size: ", size(lst))
Size without the size of the elements: 104
Size of all the elements: 112
Total size of list, including elements: 216
Empty list size:  72
We can conclude from this that for every new element, we need another eight bytes for the reference to the
new object. The new integer object itself consumes 28 bytes. The size of a list "lst" without the size of the
elements can be calculated with:
64 + 8 * len(lst)
To get the complete size of an arbitrary list of integers, we have to add the sum of all the sizes of the integers.
We will examine now the memory consumption of a numpy.array. To this purpose, we will have a look at the
implementation in the following picture:
We will create the numpy array of the previous diagram and calculate the memory usage:
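The listing is missing in this copy; presumably it created a three-element array, which matches the sizes discussed below:

a = np.array([24, 12, 57])
print(size(a))

120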
We get the memory usage for the general array information by creating an empty array:
e = np.array([])
print(size(e))
96
We can see that the difference between the empty array "e" and the array "a" with three integers is 24 bytes. This means that an arbitrary integer array of length n in numpy needs

96 + n * 8 bytes

whereas we have seen that a list of integers needs

64 + 8 * len(lst) + len(lst) * 28 bytes

This is a minimum estimation, as Python integers can use more than 28 bytes.
When we define a Numpy array, numpy automatically chooses a fixed integer size, in our example "int64". We can determine the size of the integers when we define an array. Needless to say, this changes the memory requirement:
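The example itself is missing here; a sketch, assuming 8-bit and 64-bit integers:

a = np.array([24, 12, 57], np.int8)
print(size(a))   # 99 bytes: 96 bytes of general information plus 3 * 1 byte

a = np.array([24, 12, 57], np.int64)
print(size(a))   # 120 bytes: 96 + 3 * 8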
import time

size_of_vec = 1000

def pure_python_version():
    t1 = time.time()
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]
    return time.time() - t1

def numpy_version():
    t1 = time.time()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.time() - t1
t1 = pure_python_version()
t2 = numpy_version()
print(t1, t2)
print("Numpy is in this example " + str(t1/t2) + " faster!")
0.0010614395141601562 5.2928924560546875e-05
Numpy is in this example 20.054054054054053 faster!
An easier and, above all, better way to measure the times is to use the timeit module. We will use the Timer class in the following script.
The constructor of a Timer object takes a statement to be timed, an additional statement used for setup, and a
timer function. Both statements default to 'pass'.
The statements may contain newlines, as long as they don't contain multi-line string literals.
A Timer object has a timeit method. timeit is called with a parameter number:
timeit(number=1000000)
The main statement will be executed "number" times. This executes the setup statement once, and then returns
the time it takes to execute the main statement a "number" of times. It returns the time in seconds.
import numpy as np
from timeit import Timer

size_of_vec = 1000
X_list = range(size_of_vec)
Y_list = range(size_of_vec)
X = np.arange(size_of_vec)
Y = np.arange(size_of_vec)

def pure_python_version():
    Z = [X_list[i] + Y_list[i] for i in range(len(X_list))]

def numpy_version():
    Z = X + Y

# the Timer objects were not defined in this copy of the text;
# a standard setup looks like this:
timer_obj1 = Timer("pure_python_version()",
                   "from __main__ import pure_python_version")
timer_obj2 = Timer("numpy_version()",
                   "from __main__ import numpy_version")

for i in range(3):
    t1 = timer_obj1.timeit(10)
    t2 = timer_obj2.timeit(10)
    print("time for pure Python version: ", t1)
    print("time for Numpy version: ", t2)
    print(f"Numpy was {t1 / t2:7.2f} times faster!")
time for pure Python version: 0.0021230499987723306
time for Numpy version: 0.0004346180066931993
Numpy was 4.88 times faster!
time for pure Python version: 0.003020321993972175
time for Numpy version: 0.00014882600225973874
Numpy was 20.29 times faster!
time for pure Python version: 0.002028984992648475
time for Numpy version: 0.0002098319964716211
Numpy was 9.67 times faster!
The repeat() method is a convenience to call timeit() multiple times and return a list of results:
print(timer_obj1.repeat(repeat=3, number=10))
print(timer_obj2.repeat(repeat=3, number=10))
[0.0030275019962573424, 0.002999588003149256, 0.0022120869980426505]
[6.104000203777105e-05, 0.0001641790004214272, 1.904800592456013e-05]
NUMPY TUTORIAL: CREATING ARRAYS
ARANGE
arange returns evenly spaced values within a given interval. The values are generated within the half-open interval [start, stop). If the function is used with integers, it is nearly equivalent to the Python built-in function range, but arange returns an ndarray rather than a range object. If the 'start' parameter is not given, it will be set to 0. The end of the interval is determined by the parameter 'stop'. Usually, the interval will not include this value, except in some cases where 'step' is not an integer and floating point round-off affects the length of the output ndarray. The spacing between two adjacent values of the output array is set with the optional parameter 'step'. The default value for 'step' is 1. If the parameter 'step' is given, the 'start' parameter cannot be omitted, i.e. it has to be given as well. The type of the output array can be specified with the parameter 'dtype'. If it is not given, the type will be automatically inferred from the other input arguments.
import numpy as np
a = np.arange(1, 10)
print(a)
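[1 2 3 4 5 6 7 8 9]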
Be careful, if you use a float value for the step parameter, as you can see in the following example:
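The original example is missing in this copy; a sketch that reproduces the effect (whether the end value appears depends on floating point round-off):

x = np.arange(1, 1.3, 0.1)
print(x)

[1.  1.1 1.2 1.3]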
The help of arange has to say the following about the stop parameter: "End of interval. The interval does not include this value, except in some cases where step is not an integer and floating point round-off affects the length of out." This is what happened in our example.
The following usage of arange is a bit offbeat: why should we use float values if we want integers as result? Anyway, the result might be confusing. Before arange starts, it will round the start value, end value and the stepsize:
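The example itself is missing in this copy; a sketch of what it presumably looked like:

x = np.arange(0.5, 10.4, 0.8, int)
print(x)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12]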
This result defies all logical explanation. A look at the help text sheds some light on it: "When using a non-integer step, such as 0.1, the results will often not be consistent. It is better to use numpy.linspace for these cases." Yet using linspace is not an easy workaround in some situations, because the number of values has to be known.

LINSPACE

linspace(start, stop, num=50, endpoint=True, retstep=False)

linspace returns an ndarray, consisting of 'num' equally spaced samples in the closed interval [start, stop] or the half-open interval [start, stop). Whether a closed or a half-open interval is returned depends on whether 'endpoint' is True or False. The parameter 'start' defines the start value of the sequence which will be created. 'stop' is the end value of the sequence, unless 'endpoint' is set to False. In the latter case, the resulting sequence consists of all but the last of 'num + 1' evenly spaced samples, which means that 'stop' is excluded. Note that the step size changes when 'endpoint' is False. The number of samples to be generated can be set with 'num', which defaults to 50. If the optional parameter 'endpoint' is set to True (the default), 'stop' will be the last sample of the sequence; otherwise, it is not included.
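Two short illustrations (a sketch, not part of the original listing):

import numpy as np

print(np.linspace(1, 10))        # 50 values in the closed interval [1, 10]
print(np.linspace(1, 10, 7))     # [ 1.   2.5  4.   5.5  7.   8.5 10. ]
print(np.linspace(1, 10, 7, endpoint=False))  # 7 values in [1, 10)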
ZERO-DIMENSIONAL ARRAYS

It's possible to create arrays of zero dimensions in numpy, i.e. scalar-like arrays:
import numpy as np
x = np.array(42)
print("x: ", x)
print("The type of x: ", type(x))
print("The dimension of x:", np.ndim(x))
x: 42
The type of x: <class 'numpy.ndarray'>
The dimension of x: 0
ONE-DIMENSIONAL ARRAYS

We have already encountered a one-dimensional array, better known to some as a vector, in our initial example. What we have not mentioned so far, but what you may have assumed, is the fact that numpy arrays are containers of items of the same type, e.g. only integers. The homogeneous type of the array can be determined with the attribute "dtype", as we can learn from the following example:
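The example itself is missing in this copy; a minimal sketch:

F = np.array([1, 1, 2, 3, 5, 8, 13, 21])
print(F)
print("Type of the array: ", F.dtype)
print("Dimension of F: ", np.ndim(F))

[ 1  1  2  3  5  8 13 21]
Type of the array:  int64
Dimension of F:  1

TWO- AND MULTIDIMENSIONAL ARRAYS

Of course, arrays of NumPy are not limited to one dimension. The listing that produced the following output is missing as well; it presumably defined a three-dimensional array like this:

B = np.array([[[111, 112], [121, 122]],
              [[211, 212], [221, 222]],
              [[311, 312], [321, 322]]])
print(B)
print(B.ndim)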
[[[111 112]
  [121 122]]

 [[211 212]
  [221 222]]

 [[311 312]
  [321 322]]]
3
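THE SHAPE OF AN ARRAY

The function "shape" returns the shape of an array as a tuple with the sizes of each dimension. (The definition of x is missing in this copy; it is reconstructed here from the reshaped output further below:)

x = np.array([[67, 63, 87],
              [77, 69, 59],
              [85, 87, 99],
              [79, 72, 71],
              [63, 89, 93],
              [68, 92, 78]])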
print(np.shape(x))
(6, 3)
print(x.shape)
(6, 3)
The shape of an array tells us also something about the order in which the indices
are processed, i.e. first rows, then columns and after that the further dimensions.
x.shape = (3, 6)
print(x)
[[67 63 87 77 69 59]
[85 87 99 79 72 71]
[63 89 93 68 92 78]]
x.shape = (2, 9)
print(x)
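[[67 63 87 77 69 59 85 87 99]
 [79 72 71 63 89 93 68 92 78]]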
You might have guessed by now that the new shape must correspond to the number of elements of the array, i.e. the total size of the new array must be the same as the old one. An exception will be raised, if this is not the case.
x = np.array(11)
print(np.shape(x))
()
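The definition of B is missing in this copy; any array of that shape reproduces the output, e.g.:

B = np.ones((4, 2, 3))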
print(B.shape)
(4, 2, 3)
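INDEXING AND SLICING

(The defining listing for A is missing in this copy; a sketch in which only the second row is pinned down by the outputs below:)

A = np.array([[3.4, 8.7, 9.9],
              [1.1, -7.8, -0.7],
              [4.1, 12.3, 4.2]])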
Single indexing behaves the way you will most probably expect:
print(A[1][0])
1.1
We accessed an element in the second row, i.e. the row with the index 1, and the first column (index 0). We accessed it the same way we would have done with an element of a nested Python list.
You have to be aware of the fact that this way of accessing multi-dimensional arrays can be highly inefficient. The reason is that we create an intermediate array A[1], from which we access the element with the index 0. So it behaves similarly to this:
tmp = A[1]
print(tmp)
print(tmp[0])
[ 1.1 -7.8 -0.7]
1.1
There is another way to access elements of multi-dimensional arrays in Numpy: We use only one pair of
square brackets and all the indices are separated by commas:
print(A[1, 0])
1.1
We assume that you are familiar with the slicing of lists and tuples. The syntax is the same in numpy for one-dimensional arrays, but it can be applied to multiple dimensions as well.
A[start:stop:step]
We illustrate the operating principle of "slicing" with some examples. We start with the easiest case, i.e. the
slicing of a one-dimensional array:
S = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(S[2:5])
print(S[:4])
print(S[6:])
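[2 3 4]
[0 1 2 3]
[6 7 8 9]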
We will illustrate the multidimensional slicing in the following examples. The ranges for each dimension are
separated by commas:
A = np.array([
[11, 12, 13, 14, 15],
[21, 22, 23, 24, 25],
[31, 32, 33, 34, 35],
[41, 42, 43, 44, 45],
[51, 52, 53, 54, 55]])
print(A[:3, 2:])
[[13 14 15]
[23 24 25]
[33 34 35]]
print(A[3:, :])
[[41 42 43 44 45]
[51 52 53 54 55]]
The following two examples use the third parameter "step". The reshape function is used to construct the two-
dimensional array. We will explain reshape in the following subchapter:
X = np.arange(28).reshape(4, 7)
print(X)
[[ 0 1 2 3 4 5 6]
[ 7 8 9 10 11 12 13]
[14 15 16 17 18 19 20]
[21 22 23 24 25 26 27]]
print(X[::2, ::3])
[[ 0 3 6]
[14 17 20]]
If the number of objects in the selection tuple is less than the dimension N, then : is assumed
for any subsequent dimensions:
A = np.array(
[ [ [45, 12, 4], [45, 13, 5], [46, 12, 6] ],
[ [46, 14, 4], [45, 14, 5], [46, 11, 5] ],
[ [47, 13, 2], [48, 15, 5], [52, 15, 1] ] ])
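The indexing example itself is missing here; presumably something like:

print(A[1:3, 0:2])   # equivalent to A[1:3, 0:2, :]

[[[46 14  4]
  [45 14  5]]

 [[47 13  2]
  [48 15  5]]]

Attention: whereas slicing creates new objects for lists and tuples, a slicing operation on a numpy array creates a view on the original array's data. If we modify a view, the original array is modified as well: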
A = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
S = A[2:6]
S[0] = 22
S[1] = 23
print(A)
[ 0 1 22 23 4 5 6 7 8 9]
Doing a similar thing with lists, we can see that we get a copy:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
lst2 = lst[2:6]
lst2[0] = 22
lst2[1] = 23
print(lst)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
If you want to check, if two array names share the same memory block, you can use the function
np.may_share_memory.
np.may_share_memory(A, B)
To determine if two arrays A and B can share memory, the memory bounds of A and B are computed. The function returns True if they overlap, and False otherwise. The function may give false positives, i.e. if it returns True, it just means that the arrays may share the same memory.
np.may_share_memory(A, S)
Output: True
The following code shows a case, in which the use of may_share_memory is quite useful:
A = np.arange(12)
B = A.reshape(3, 4)
A[0] = 42
print(B)
[[42 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
But we saw that if we change an element of one array the other one is changed as well. This fact is reflected
by may_share_memory:
np.may_share_memory(A, B)
Output: True
The result above is a "false positive" example for may_share_memory in the sense that somebody might conclude that the arrays are the same, which is not the case.
ARRAYS WITH ONES AND ZEROS

There are two ways of initializing arrays with ones or zeros: ones(shape) and zeros(shape) return new arrays of the given shape, filled with ones or zeros respectively. The default dtype is float64:

import numpy as np
E = np.ones((2, 3))
print(E)
F = np.ones((3, 4), dtype=int)
print(F)
[[1. 1. 1.]
[1. 1. 1.]]
[[1 1 1 1]
[1 1 1 1]
[1 1 1 1]]
What we have said about the method ones() is valid for the method zeros() analogously, as we can see in the
following example:
Z = np.zeros((2,4))
print(Z)
[[0. 0. 0. 0.]
[0. 0. 0. 0.]]
x = np.array([2,5,18,14,4])
E = np.ones_like(x)
print(E)
Z = np.zeros_like(x)
print(Z)
[1 1 1 1 1]
[0 0 0 0 0]
There is also a way of creating an array with the empty function. It creates and returns a reference to a new array of given shape and type, without initializing the entries. Sometimes the entries are zeros, but you shouldn't be misled: usually, they are arbitrary values.
np.empty((2, 4))
Output: array([[0., 0., 0., 0.],
[0., 0., 0., 0.]])
COPYING ARRAYS
NUMPY.COPY()
copy(obj, order='K')
Parameter  Meaning
order      The possible values are {'C', 'F', 'A', 'K'}. This parameter controls the memory layout of the copy: 'C' means C-order, 'F' means Fortran-order, 'A' means 'F' if the object 'obj' is Fortran contiguous and 'C' otherwise, and 'K' means match the layout of 'obj' as closely as possible.
x = np.array([[42,22,12],[44,53,66]], order='F')
y = x.copy()
x[0,0] = 1001
print(x)
print(y)
[[1001 22 12]
[ 44 53 66]]
[[42 22 12]
[44 53 66]]
print(x.flags['C_CONTIGUOUS'])
print(y.flags['C_CONTIGUOUS'])
False
True
NDARRAY.COPY()

There is also an ndarray method 'copy', which can be directly applied to an array. It is similar to the above function, but the default value for the order argument is different.
a.copy(order='C')
Parameter Meaning
order The same as with numpy.copy, but 'C' is the default value for order.
import numpy as np
x = np.array([[42,22,12],[44,53,66]], order='F')
y = x.copy()
x[0,0] = 1001
print(x)
print(x.flags['C_CONTIGUOUS'])
print(y.flags['C_CONTIGUOUS'])
[[1001 22 12]
[ 44 53 66]]
[[42 22 12]
[44 53 66]]
False
True
IDENTITY ARRAY
In linear algebra, the identity matrix, or unit matrix, of size n is the n × n square matrix with ones on the main
diagonal and zeros elsewhere.
• identity
• eye
identity(n, dtype=None)
The parameters:
Parameter  Meaning
n          An integer number defining the number of rows and columns of the output, i.e. 'n' x 'n'.
dtype      An optional argument, defining the data-type of the output. The default is 'float'.
The output of identity is an 'n' x 'n' array with its main diagonal set to one, and all other elements are 0.
import numpy as np
np.identity(4)
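Output: array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])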
Another way to create identity arrays is provided by the function eye. This function also creates arrays with ones on a diagonal, which does not have to be the main diagonal.
It returns a 2-D array with ones on the diagonal and zeros elsewhere.

eye(N, M=None, k=0, dtype=float)

Parameter  Meaning
N          An integer defining the number of rows in the output.
M          An optional integer for setting the number of columns in the output. If it is None, it defaults to 'N'.
k          Defines the position of the diagonal. The default is 0, which refers to the main diagonal. A positive value refers to an upper diagonal, and a negative value to a lower diagonal.
eye returns an ndarray of shape (N,M). All elements of this array are equal to zero, except for the 'k'-th
diagonal, whose values are equal to one.
The principle of operation of the parameter 'k' of the eye function is illustrated in the following diagram:
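The example listing is missing in this copy; a sketch:

import numpy as np
print(np.eye(5, 8, k=1, dtype=int))

[[0 1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0]]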
EXERCISES:
1) Create an arbitrary one dimensional array called "v".
2) Create a new array which consists of the odd indices of the previously created array "v".
3) Create a new array in reverse order from "v".
4) What will be the output of the following program?

a = np.array([1, 2, 3, 4, 5])
b = a[1:4]
b[0] = 200
print(a[1])

5) Create a two-dimensional array called "m".
6) Create a new array from m, in which the elements of each row are in reverse order.
7) Create a new array from m, in which the rows are in reverse order.
8) Create an array from m, where columns and rows are in reverse order.
9) Cut off the first and last row and the first and last column.
import numpy as np

1)
a = np.array([3, 8, 12, 18, 7, 11, 30])
2)
odd_elements = a[1::2]
3) reverse_order = a[::-1]
4) The output will be 200, because slices are views in numpy and not copies.
5) m = np.array([ [11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])
6) m[::,::-1]
7) m[::-1]
8) m[::-1,::-1]
9) m[1:-1,1:-1]
DTYPE
The data type object 'dtype' is an instance of the numpy.dtype class. It can be created with numpy.dtype.
Before we start with a complex data structure like the previous data, we want to introduce dtype in a very simple example. We define an int16 data type and call this type i16. (We have to admit that this is not a nice name, but we use it only here!) The elements of the list 'lst' are turned into i16 values to create the two-dimensional array A.
import numpy as np
i16 = np.dtype(np.int16)
print(i16)

# the definition of lst was missing in this copy; these float values
# are chosen to match the printed int16 array A below
lst = [[3.4, 8.7, 9.9],
       [1.1, -7.8, -0.7],
       [4.1, 12.3, 4.67]]
A = np.array(lst, dtype=i16)
print(A)
int16
[[ 3 8 9]
[ 1 -7 0]
[ 4 12 4]]
We introduced a new name for a basic data type in the previous example. This has nothing to do with the
structured arrays, which we mentioned in the introduction of this chapter of our dtype tutorial.
STRUCTURED ARRAYS
ndarrays are homogeneous data objects, i.e. all elements of an array have to be of the same data type. Structured dtypes, on the other hand, allow us to define a separate data type for each column.
import numpy as np
dt = np.dtype([('density', np.int32)])
# the array definition was missing here; reconstructed from the output
x = np.array([(393,), (337,), (256,)], dtype=dt)
print(x)

[(393,) (337,) (256,)]
We can access the content of the density column by indexing x with the key 'density'. It looks like accessing a
dictionary in Python:
print(x['density'])
[393 337 256]
You may wonder that we have used 'np.int32' in our definition, while the internal representation shows '<i4'. In the dtype definition, we can use the type directly (e.g. np.int32) or we can use a string (e.g. 'i4'). So we could have defined our dtype like this as well:
dt = np.dtype([('density', 'i4')])
x = np.array([(393,), (337,), (256,)],
dtype=dt)
print(x)
[(393,) (337,) (256,)]
The 'i' means integer and the 4 means 4 bytes. What about the less-than sign in front of i4 in the result? We could have written '<i4' in our definition as well. We can prefix a type with the '<' and '>' signs: '<' stands for little-endian and '>' for big-endian byte ordering:
# little-endian ordering
dt = np.dtype('<d')
print(dt.name, dt.byteorder, dt.itemsize)
# big-endian ordering
dt = np.dtype('>d')
print(dt.name, dt.byteorder, dt.itemsize)
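On a little-endian Linux machine, as mentioned below, the output is presumably:

float64 = 8
float64 > 8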
The equal character '=' stands for 'native byte ordering', defined by the operating system. In our case this
means 'little-endian', because we use a Linux computer.
Another thing in our density array might be confusing. We defined the array with a list containing one-tuples. So you may ask yourself if it is possible to use tuples and lists interchangeably. This is not possible: the tuples are used to define the records, in our case consisting solely of a density, and the list is the 'container' for the records. In other words: the tuples define the atomic elements of the structure and the lists the dimensions.
Now we will add the country name, the area and the population number to our data type:
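(The dtype definition and the creation of population_table are missing in this copy; a reconstruction based on the printed values below:)

dt = np.dtype([('country', 'S20'), ('density', 'i4'),
               ('area', 'i4'), ('population', 'i4')])
population_table = np.array([
    ('Netherlands', 393, 41526, 16928800),
    ('Belgium', 337, 30510, 11007020),
    ('United Kingdom', 256, 243610, 62262000),
    ('Germany', 233, 357021, 81799600),
    ('Liechtenstein', 205, 160, 32842),
    ('Italy', 192, 301230, 59715625),
    ('Switzerland', 177, 41290, 7301994),
    ('Luxembourg', 173, 2586, 512000),
    ('France', 111, 547030, 63601002),
    ('Austria', 97, 83858, 8169929),
    ('Greece', 81, 131940, 11606813),
    ('Ireland', 65, 70280, 4581269),
    ('Sweden', 20, 449964, 9515744),
    ('Finland', 16, 338424, 5410233),
    ('Norway', 13, 385252, 5033675)], dtype=dt)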
print(population_table['density'])
print(population_table['country'])
print(population_table['area'][2:5])
[393 337 256 233 205 192 177 173 111 97 81 65 20 16 13]
[b'Netherlands' b'Belgium' b'United Kingdom' b'Germany' b'Liechtenstein'
 b'Italy' b'Switzerland' b'Luxembourg' b'France' b'Austria' b'Greece'
 b'Ireland' b'Sweden' b'Finland' b'Norway']
[243610 357021 160]
np.savetxt("population_table.csv",
population_table,
fmt="%s;%d;%d;%d",
delimiter=";")
It is highly probable that you will need to read in the previously written file at a later date. This can be
achieved with the function genfromtxt.
x = np.genfromtxt("population_table.csv",
dtype=dt,
delimiter=";")
print(x)
There is also a function "loadtxt", but it is more difficult to use, because it returns the strings as binary strings!
To overcome this problem, we can use loadtxt with a converter function for the first column.
x = np.loadtxt("population_table.csv",
dtype=dt,
converters={0: lambda x: x.decode('utf-8')},
delimiter=";")
print(x)
[('Netherlands', 393, 41526, 16928800) ('Belgium', 337, 30510, 11007020)
 ('United Kingdom', 256, 243610, 62262000)
 ('Germany', 233, 357021, 81799600)
 ('Liechtenstein', 205, 160, 32842) ('Italy', 192, 301230, 59715625)
 ('Switzerland', 177, 41290, 7301994)
 ('Luxembourg', 173, 2586, 512000) ('France', 111, 547030, 63601002)
 ('Austria', 97, 83858, 8169929) ('Greece', 81, 131940, 11606813)
 ('Ireland', 65, 70280, 4581269) ('Sweden', 20, 449964, 9515744)
 ('Finland', 16, 338424, 5410233) ('Norway', 13, 385252, 5033675)]
1. Exercise:
Define a structured array with two columns. The first column contains the product ID, which can
be defined as an int32. The second column shall contain the price for the product. How can you
print out the column with the product IDs, the first row and the price for the third article of this
structured array?
2. Exercise:
Figure out a data type definition for time records with entries for hours, minutes and seconds.
SOLUTIONS:
Solution to the first exercise:
import numpy as np
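(The definition of the structured array is missing here; reconstructed from the outputs below:)

dt = np.dtype([('productID', np.int32), ('price', np.float64)])
stock = np.array([(34765, 603.76),
                  (45765, 439.93),
                  (99661, 344.19),
                  (12129, 129.39)], dtype=dt)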
print(stock[1])
print(stock["productID"])
print(stock[2]["price"])
print(stock)
(45765, 439.93)
[34765 45765 99661 12129]
344.19
[(34765, 603.76) (45765, 439.93) (99661, 344.19) (12129, 129.39)]
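Solution to the second exercise:

(The dtype definition is missing in this copy; a sketch consistent with the usage below:)

time_type = np.dtype([('time', [('h', int), ('min', int), ('sec', int)]),
                      ('temperature', float)])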
times = np.array([((11, 42, 17), 20.8), ((13, 19, 3), 23.2)],
                 dtype=time_type)
print(times)
print(times['time'])
print(times['time']['h'])
print(times['temperature'])
[((11, 42, 17), 20.8) ((13, 19, 3), 23.2)]
[(11, 42, 17) (13, 19, 3)]
[11 13]
[ 20.8 23.2]
EXERCISE
This exercise should be closer to real life examples. Usually, we have to create or get the data for our applications from somewhere, e.g. from a file:
import pickle
fh = open("cities_and_times.pkl", "br")
cities_and_times = pickle.load(fh)
print(cities_and_times[:30])
[('Amsterdam', 'Sun', (8, 52)), ('Anchorage', 'Sat', (23, 52)),
 ('Ankara', 'Sun', (10, 52)), ('Athens', 'Sun', (9, 52)),
 ('Atlanta', 'Sun', (2, 52)), ('Auckland', 'Sun', (20, 52)),
 ('Barcelona', 'Sun', (8, 52)), ('Beirut', 'Sun', (9, 52)),
 ('Berlin', 'Sun', (8, 52)), ('Boston', 'Sun', (2, 52)),
 ('Brasilia', 'Sun', (5, 52)), ('Brussels', 'Sun', (8, 52)),
 ('Bucharest', 'Sun', (9, 52)), ('Budapest', 'Sun', (8, 52)),
 ('Cairo', 'Sun', (9, 52)), ('Calgary', 'Sun', (1, 52)),
 ('Cape Town', 'Sun', (9, 52)), ('Casablanca', 'Sun', (7, 52)),
 ('Chicago', 'Sun', (1, 52)), ('Columbus', 'Sun', (2, 52)),
 ('Copenhagen', 'Sun', (8, 52)), ('Dallas', 'Sun', (1, 52)),
 ('Denver', 'Sun', (1, 52)), ('Detroit', 'Sun', (2, 52)),
 ('Dubai', 'Sun', (11, 52)), ('Dublin', 'Sun', (7, 52)),
 ('Edmonton', 'Sun', (1, 52)), ('Frankfurt', 'Sun', (8, 52)),
 ('Halifax', 'Sun', (3, 52)), ('Helsinki', 'Sun', (9, 52))]
NUMERICAL OPERATIONS ON NUMPY ARRAYS

We know how to apply arithmetic operators to scalars, e.g. 42 + 5. It is also extremely easy to use all these operators on arrays, and even on two arrays.
USING SCALARS
Let's start with adding scalars to arrays:
import numpy as np
lst = [2, 3, 7.9, 3.3, 6.9, 0.11, 10.3, 12.9]
v = np.array(lst)
print(v + 2)

[ 4.    5.    9.9   5.3   8.9   2.11 12.3  14.9 ]
Multiplication, subtraction, division and exponentiation are as easy as the previous addition:
print(v * 2.2)
[ 4.4 6.6 17.38 7.26 15.18 0.242 22.66 28.38 ]
print(v - 1.38)
[ 0.62 1.62 6.52 1.92 5.52 -1.27 8.92 11.52]
print(v ** 2)
print(v ** 1.5)
[4.00000000e+00 9.00000000e+00 6.24100000e+01 1.08900000e+01
 4.76100000e+01 1.21000000e-02 1.06090000e+02 1.66410000e+02]
[2.82842712e+00 5.19615242e+00 2.22044815e+01 5.99474770e+00
 1.81248172e+01 3.64828727e-02 3.30564215e+01 4.63323753e+01]
We started this example with a list lst, which we turned into the array v. Do you know how to perform the
above operations on a list, i.e. multiply, add, subtract and exponentiate every element of the list with a scalar?
We could use a for loop for this purpose. Let us do it for the addition without loss of generality. We will add
the value 2 to every element of the list:
res = []
for val in lst:
    res.append(val + 2)
print(res)

[4, 5, 9.9, 5.3, 8.9, 2.11, 12.3, 14.9]
Even though this solution works, it is not the Pythonic way to do it. We would rather use a list comprehension for this purpose than the clumsy solution above. If you are not familiar with this approach, you may consult our chapter on list comprehension in our Python course.
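A sketch of the list comprehension version:

res = [val + 2 for val in lst]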
Even though we have already measured the time consumed by Numpy compared to plain Python, we will compare these two approaches as well:
%timeit v + 1
1000000 loops, best of 3: 1.69 µs per loop
lst = list(v)
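The corresponding measurement for the list version is missing in this copy; presumably something like:

%%timeit
[val + 2 for val in lst]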
If we use another array instead of a scalar, the elements of both arrays will be component-wise combined:
import numpy as np
A = np.array([ [11, 12, 13], [21, 22, 23], [31, 32, 33] ])
B = np.ones((3,3))
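The combining operation itself is missing here; it presumably looked like this:

print("Adding the two arrays: ")
print(A + B)

Adding the two arrays: 
[[12. 13. 14.]
 [22. 23. 24.]
 [32. 33. 34.]]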
MATRIX MULTIPLICATION:
For this purpose, we can use the dot product. Using the previous arrays, we can calculate the matrix
multiplication:
np.dot(A, B)
Output: array([[ 36., 36., 36.],
[ 66., 66., 66.],
[ 96., 96., 96.]])
dot(a, b, out=None)

For 2-D arrays the dot product is equivalent to matrix multiplication. For 1-D arrays it is the same as the inner product of vectors (without complex conjugation). For N-dimensional arrays it is a sum product over the last axis of 'a' and the second-to-last axis of 'b':

dot(a, b)[i, j, k, m] = sum(a[i, j, :] * b[k, :, m])
Parameter  Meaning
a, b       The arrays to be "dotted".
out        An optional output argument. It must have the exact kind of the value that would be returned if it was not used. This is a performance feature; if these conditions are not met, an exception is raised instead of attempting to be flexible.
The function dot returns the dot product of 'a' and 'b'. If 'a' and 'b' are both scalars or both 1-D arrays, then a scalar is returned; otherwise an array will be returned.
It will raise a ValueError if the last dimension of 'a' does not have the same size as the second-to-last dimension of 'b'.
We will begin with the cases in which both arguments are scalars or one-dimensional arrays:
print(np.dot(3, 4))
x = np.array([3])
y = np.array([4])
print(x.ndim)
print(np.dot(x, y))
x = np.array([3, -2])
y = np.array([-4, 1])
print(np.dot(x, y))
12
1
12
-14
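(The definitions of the two-dimensional arrays A and B are missing in this copy; the following reconstruction is consistent with all the results below:)

A = np.array([[1, 2, 3], [3, 2, 1]])
B = np.array([[2, 3, 4, -2], [1, -1, 2, 3], [1, 2, 3, 0]])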
# the following must hold:
print(A.shape[-1] == B.shape[-2], A.shape[1])
print(np.dot(A, B))
(True, 3)
[[ 7 7 17 4]
[ 9 9 19 0]]
We can learn from the previous example that the number of columns of the first two-dimensional array has to be the same as the number of rows of the second two-dimensional array.
It's getting really vexing, if we use 3-dimensional arrays as the arguments of dot.
import numpy as np
X = np.array( [[[3, 1, 2],
[4, 2, 2],
[2, 4, 1]],
[[3, 2, 2],
[4, 4, 3],
[4, 1, 1]],
[[2, 2, 1],
[3, 1, 3],
[3, 2, 3]]])
Y = np.array( [[[2, 3, 1],   # this first block of Y was missing in this
                [2, 2, 4],   # copy; it is reconstructed from the result R
                [3, 4, 4]],
               [[1, 4, 1],
                [4, 1, 2],
                [4, 1, 2]],
[[1, 2, 3],
[4, 1, 1],
[3, 1, 4]]])
R = np.dot(X, Y)
print("The shapes:")
print(X.shape)
print(Y.shape)
print(R.shape)
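The shapes:
(3, 3, 3)
(3, 3, 3)
(3, 3, 3, 3)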
The Result R:
[[[[14 19 15]
[15 15 9]
[13 9 18]]
[[18 24 20]
[20 20 12]
[18 12 22]]
[[15 18 22]
[22 13 12]
[21 9 14]]]
[[[16 21 19]
[19 16 11]
[17 10 19]]
[[25 32 32]
[32 23 18]
[29 15 28]]
[[13 18 12]
[12 18 8]
[11 10 17]]]
[[[11 14 14]
[14 11 8]
[13 7 12]]
[[17 23 19]
[19 16 11]
[16 10 22]]
[[19 25 23]
[23 17 13]
[20 11 23]]]]
To demonstrate how the dot product works in the three-dimensional case, we will use arrays with different shapes:
import numpy as np
X = np.array(
[[[3, 1, 2],
[4, 2, 2]],
[[-1, 0, 1],
[1, -1, -2]],
[[3, 2, 2],
[4, 4, 3]],
[[2, 2, 1],
[3, 1, 3]]])
Y = np.array(
[[[2, 3, 1, 2, 1],
[2, 2, 2, 0, 0],
[3, 4, 0, 1, -1]],
[[1, 4, 3, 2, 2],
[4, 1, 1, 4, -3],
[4, 1, 0, 3, 0]]])
R = np.dot(X, Y)
print(R)
[[[[ 14 19 5 8 1]
[ 15 15 10 16 3]]
[[ 18 24 8 10 2]
[ 20 20 14 22 2]]]
[[[ 1 1 -1 -1 -2]
[ 3 -3 -3 1 -2]]
[[ -6 -7 -1 0 3]
[-11 1 2 -8 5]]]
[[[ 16 21 7 8 1]
[ 19 16 11 20 0]]
[[ 25 32 12 11 1]
[ 32 23 16 33 -4]]]
[[[ 11 14 6 5 1]
[ 14 11 8 15 -2]]
[[ 17 23 5 9 0]
[ 19 16 10 19 3]]]]
i = 0
for j in range(X.shape[1]):
    for k in range(Y.shape[0]):
        for m in range(Y.shape[2]):
            fmt = "    sum(X[{}, {}, :] * Y[{}, :, {}] :  {}"
            arguments = (i, j, k, m, sum(X[i, j, :] * Y[k, :, m]))
            print(fmt.format(*arguments))
Hopefully, you have noticed that we have created the elements of R[0], one after the other.
print(R[0])
[[[14 19 5 8 1]
[15 15 10 16 3]]
[[18 24 8 10 2]
[20 20 14 22 2]]]
This means that we could have created the array R by applying the sum products in the way above. To "prove" this, we will create an array R2 by using the sum product, which is equal to R in the following example:

R2 = np.zeros(R.shape, dtype=int)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        for k in range(Y.shape[0]):
            for m in range(Y.shape[2]):
                R2[i, j, k, m] = sum(X[i, j, :] * Y[k, :, m])

print(np.array_equal(R, R2))

True
import numpy as np
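MATRICES

(The definitions of A and B, together with some explanatory text, are missing in this copy; the arrays are reconstructed from the printed results. The * operator works elementwise on ndarrays:)

A = np.array([[1, 2, 3], [2, 2, 2], [3, 3, 3]])
B = np.array([[3, 2, 1], [1, 2, 3], [-1, -2, -3]])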
R = A * B
print(R)
[[ 3 4 3]
[ 2 4 6]
[-3 -6 -9]]
If we want a real matrix product, we can cast the arrays into matrices, for which the * operator means matrix multiplication:

MA = np.mat(A)
MB = np.mat(B)
R = MA * MB
print(R)
[[ 2 0 -2]
[ 6 4 2]
[ 9 6 3]]
COMPARISON OPERATORS
We are used to comparison operators from Python, when we apply them to integers, floats or strings, for example. We use them to test for True or False. If we compare two arrays, we don't get a simple True or False as a return value. The comparisons are performed elementwise, which means that we get a Boolean array as a result:

import numpy as np
A = np.array([ [11, 12, 13], [21, 22, 23], [31, 32, 33] ])
B = np.array([ [11, 102, 13], [201, 22, 203], [31, 32, 303] ])
A == B
Output: array([[ True, False, True],
[False, True, False],
[ True, True, False]], dtype=bool)
It is possible to compare complete arrays for equality as well. We use array_equal for this purpose.
array_equal returns True if two arrays have the same shape and elements, otherwise False will be returned.
print(np.array_equal(A, B))
print(np.array_equal(A, A))
False
True
LOGICAL OPERATORS
We can also apply the logical 'or' and the logical 'and' to arrays elementwise. We can use the functions 'logical_or' and 'logical_and' for this purpose.
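(The Boolean arrays a and b are missing in this copy; a reconstruction matching the output:)

a = np.array([[True, True], [False, False]])
b = np.array([[True, False], [True, False]])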
print(np.logical_or(a, b))
print(np.logical_and(a, b))
[[ True True]
[ True False]]
[[ True False]
[False False]]
We will see in the following that we can also apply operators on arrays, if they have different shapes. Yet, it
works only under certain conditions.
BROADCASTING
Numpy provides a powerful mechanism, called broadcasting, which allows us to perform arithmetic operations on arrays of different shapes. This means that we have a smaller array and a larger array, and we transform or apply the smaller array multiple times to perform some operation on the larger array. In other words: under certain conditions, the smaller array is "broadcasted" in a way that it has the same shape as the larger array.
With the aid of broadcasting we can avoid loops in our Python program. The looping occurs implicitly in the Numpy implementations, i.e. in C. We also avoid creating unnecessary copies of our data.
We demonstrate the operating principle of broadcasting in three simple and descriptive examples.

FIRST EXAMPLE:
import numpy as np
A = np.array([ [11, 12, 13], [21, 22, 23], [31, 32, 33] ])
B = np.array([1, 2, 3])
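The broadcast multiplication itself is missing here; it presumably looked like this:

print("Multiplication with broadcasting: ")
print(A * B)

Multiplication with broadcasting: 
[[11 24 39]
 [21 44 69]
 [31 64 99]]

B is treated as if it were the following 3x3 array: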
B = np.array([[1, 2, 3],] * 3)
print(B)
[[1 2 3]
[1 2 3]
[1 2 3]]
SECOND EXAMPLE:
For this example, we need to know how to turn a row vector into a column vector:
B = np.array([1, 2, 3])
A * B[:, np.newaxis]
Output: array([[11, 12, 13],
[42, 44, 46],
[93, 96, 99]])
THIRD EXAMPLE:
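(The definitions for this example are missing; a reconstruction matching the output:)

A = np.array([10, 20, 30])
B = np.array([1, 2, 3])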
A[:, np.newaxis] * B
Output: array([[10, 20, 30],
[20, 40, 60],
[30, 60, 90]])
ANOTHER WAY TO DO IT
Doing it without broadcasting:
import numpy as np
A = np.array([ [11, 12, 13], [21, 22, 23], [31, 32, 33] ])
B = np.array([1, 2, 3])
B = B[np.newaxis, :]
B = np.concatenate((B, B, B))
print("Multiplication: ")
print(A * B)
print("... and now addition again: ")
print(A + B)
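Multiplication: 
[[11 24 39]
 [21 44 69]
 [31 64 99]]
... and now addition again: 
[[12 14 16]
 [22 24 26]
 [32 34 36]]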
Using 'tile':
import numpy as np
A = np.array([ [11, 12, 13], [21, 22, 23], [31, 32, 33] ])
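The definition of B is missing here; with tile it presumably was:

B = np.tile(np.array([1, 2, 3]), (3, 1))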
print(B)
print("Multiplication: ")
print(A * B)
print("... and now addition again: ")
print(A + B)
[[1 2 3]
[1 2 3]
[1 2 3]]
Multiplication:
[[11 24 39]
[21 44 69]
[31 64 99]]
... and now addition again:
[[12 14 16]
[22 24 26]
[32 34 36]]
DISTANCE MATRIX
In mathematics, computer science and especially graph theory, a distance matrix is a matrix or a two-dimensional array which contains the distances between the elements of a set, taken pairwise. The size of this two-dimensional array is n x n, if the set consists of n elements.
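(The list of distances is missing in this copy; only its first twelve entries, which are all the example uses, can be recovered from the output:)

dist2barcelona = [0, 1498, 1063, 1968, 1498, 1758,
                  1469, 1472, 2230, 2391, 1138, 505]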
dists = np.array(dist2barcelona[:12])
print(dists)
print(np.abs(dists - dists[:, np.newaxis]))
[ 0 1498 1063 1968 1498 1758 1469 1472 2230 2391 1138 505]
[[ 0 1498 1063 1968 1498 1758 1469 1472 2230 2391 1138 505]
[1498 0 435 470 0 260 29 26 732 893 360 993]
[1063 435 0 905 435 695 406 409 1167 1328 75 558]
[1968 470 905 0 470 210 499 496 262 423 830 1463]
[1498 0 435 470 0 260 29 26 732 893 360 993]
[1758 260 695 210 260 0 289 286 472 633 620 1253]
[1469 29 406 499 29 289 0 3 761 922 331 964]
[1472 26 409 496 26 286 3 0 758 919 334 967]
[2230 732 1167 262 732 472 761 758 0 161 1092 1725]
[2391 893 1328 423 893 633 922 919 161 0 1253 1886]
[1138 360 75 830 360 620 331 334 1092 1253 0 633]
[ 505 993 558 1463 993 1253 964 967 1725 1886 633 0]]
3-DIMENSIONAL BROADCASTING
A = np.array([ [[3, 4, 7], [5, 0, -1] , [2, 1, 5]],
[[1, 0, -1], [8, 2, 4], [5, 2, 1]],
[[2, 1, 3], [1, 9, 4], [5, -2, 4]]])
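(The second operand of this broadcasting example is missing in this copy; the following B is reconstructed from the result:)

B = np.array([[3, 4, 7], [1, 0, -1], [1, 2, 3]])
A * B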
Output: array([[[ 9, 16, 49],
        [ 5,  0,  1],
        [ 2,  2, 15]],

       [[ 3,  0, -7],
        [ 8,  0, -4],
        [ 5,  4,  3]],

       [[ 6,  4, 21],
        [ 1,  0, -4],
        [ 5, -4, 12]]])
We will use the following transformations in our chapter on Images Manipulation and Processing:
B = np.array([1, 2, 3])
B = B[np.newaxis, :]
print(B.shape)
B = np.concatenate((B, B, B)).transpose()
print(B.shape)
B = B[:, np.newaxis]
print(B.shape)
print(B)
print(A * B)
(1, 3)
(3, 3)
(3, 1, 3)
[[[1 1 1]]

 [[2 2 2]]

 [[3 3 3]]]
[[[ 3 4 7]
[ 5 0 -1]
[ 2 1 5]]
[[ 2 0 -2]
[16 4 8]
[10 4 2]]
[[ 6 3 9]
[ 3 27 12]
[15 -6 12]]]
NUMPY ARRAYS: CONCATENATING, FLATTENING AND ADDING DIMENSIONS

FLATTENING ARRAYS

There are two methods to flatten a multidimensional array:

• flatten()
• ravel()
FLATTEN
flatten is an ndarray method with an optional keyword parameter "order". order can have the values "C", "F" and "A". The default of order is "C". "C" means to flatten C style in row-major ordering, i.e. the rightmost index "changes the fastest", or in other words: in row-major order, the row index varies the slowest and the column index the quickest, so that a[0, 1] follows a[0, 0].
"F" stands for Fortran column-major ordering. "A" means preserve the C/Fortran ordering.
import numpy as np
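(The definition of A is missing in this copy; reconstructed from the outputs below:)

A = np.array(range(24))
A = A.reshape((3, 4, 2))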
Flattened_X = A.flatten()
print(Flattened_X)
print(A.flatten(order="C"))
print(A.flatten(order="F"))
print(A.flatten(order="A"))
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[ 0  8 16  2 10 18  4 12 20  6 14 22  1  9 17  3 11 19  5 13 21  7 15 23]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
RAVEL
The order of the elements in the array returned by ravel() is normally "C-style".
ravel(a, order='C')
'C': C-like order, with the last axis index changing fastest, back to the first axis index changing slowest. "C" is the default!
'F': Fortran-like index order with the first index changing fastest, and the last index changing slowest.
'A': Fortran-like index order if the array "a" is Fortran contiguous in memory, C-like order otherwise.
'K': read the elements in the order they occur in memory, except for reversing the data when strides are negative.
print(A.ravel())
print(A.ravel(order="A"))
print(A.ravel(order="F"))
print(A.ravel(order="A"))
print(A.ravel(order="K"))
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[ 0  8 16  2 10 18  4 12 20  6 14 22  1  9 17  3 11 19  5 13 21  7 15 23]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
RESHAPE
The method reshape() gives a new shape to an array without changing its data, i.e. it returns a new array with a
new shape.
reshape(a, newshape, order='C')

Parameter  Meaning
a          Array to be reshaped.
newshape   The new shape, given as an int or a tuple of ints. It has to be compatible with the original shape.
order      Read the elements of 'a' using this index order: 'C', 'F' or 'A', as with flatten and ravel.
X = np.array(range(24))
Y = X.reshape((3,4,2))
Y
Output: array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15]],

       [[16, 17],
        [18, 19],
        [20, 21],
        [22, 23]]])
CONCATENATING ARRAYS
In the following example we concatenate three one-dimensional arrays to one array. The elements of the
second array are appended to the first array. After this the elements of the third array are appended:
x = np.array([11,22])
y = np.array([18,7,6])
z = np.array([1,3,5])
c = np.concatenate((x,y,z))
print(c)
[11 22 18 7 6 1 3 5]
If we are concatenating multidimensional arrays, we can concatenate the arrays according to an axis. The arrays must have the same shape, except in the dimension corresponding to this axis. The default value is axis = 0:
x = np.array(range(24))
x = x.reshape((3,4,2))
y = np.array(range(100,124))
y = y.reshape((3,4,2))
z = np.concatenate((x,y))
print(z)
[[[  0   1]
  [  2   3]
  [  4   5]
  [  6   7]]

 [[  8   9]
  [ 10  11]
  [ 12  13]
  [ 14  15]]

 [[ 16  17]
  [ 18  19]
  [ 20  21]
  [ 22  23]]

 [[100 101]
  [102 103]
  [104 105]
  [106 107]]

 [[108 109]
  [110 111]
  [112 113]
  [114 115]]

 [[116 117]
  [118 119]
  [120 121]
  [122 123]]]
z = np.concatenate((x,y),axis = 1)
print(z)
[[[  0   1]
  [  2   3]
  [  4   5]
  [  6   7]
  [100 101]
  [102 103]
  [104 105]
  [106 107]]

 [[  8   9]
  [ 10  11]
  [ 12  13]
  [ 14  15]
  [108 109]
  [110 111]
  [112 113]
  [114 115]]

 [[ 16  17]
  [ 18  19]
  [ 20  21]
  [ 22  23]
  [116 117]
  [118 119]
  [120 121]
  [122 123]]]
ADDING NEW DIMENSIONS

New dimensions can be added to an array by using slicing and np.newaxis, e.g. to turn a row vector into a column vector:

x = np.array([2, 5, 18, 14, 4])
y = x[:, np.newaxis]
print(y)
[[ 2]
[ 5]
[18]
[14]
[ 4]]
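VECTOR STACKING

(The definitions of A and B are missing in this copy; reconstructed from the outputs:)

A = np.array([3, 4, 5])
B = np.array([1, 9, 0])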
print(np.row_stack((A, B)))
print(np.column_stack((A, B)))
np.shape(A)
[[3 4 5]
[1 9 0]]
[[3 1]
[4 9]
[5 0]]
Output: (3,)
A = np.array([[3, 4, 5],
[1, 9, 0],
[4, 6, 8]])
np.column_stack((A, A, A))
Output: array([[3, 4, 5, 3, 4, 5, 3, 4, 5],
[1, 9, 0, 1, 9, 0, 1, 9, 0],
[4, 6, 8, 4, 6, 8, 4, 6, 8]])
np.dstack((A, A, A))
Output: array([[[3, 3, 3],
        [4, 4, 4],
        [5, 5, 5]],

       [[1, 1, 1],
        [9, 9, 9],
        [0, 0, 0]],

       [[4, 4, 4],
        [6, 6, 6],
        [8, 8, 8]]])
REPEATING PATTERNS, THE "TILE" METHOD

In another use case you may have a two-dimensional array like np.array([[1, 2], [3, 4]]), which you intend to use as a building block to construct an array with the shape (6, 8):
array([[1, 2, 1, 2, 1, 2, 1, 2],
[3, 4, 3, 4, 3, 4, 3, 4],
[1, 2, 1, 2, 1, 2, 1, 2],
[3, 4, 3, 4, 3, 4, 3, 4],
[1, 2, 1, 2, 1, 2, 1, 2],
[3, 4, 3, 4, 3, 4, 3, 4]])
This is what np.tile is for:

tile(A, reps)

'reps' is usually a tuple (or list) which defines the number of repetitions along the corresponding axes / directions. If we set reps to (3, 4), for example, A will be repeated 3 times in the direction of the rows and 4 times in the direction of the columns. We demonstrate this in the following example:
import numpy as np
x = np.array([ [1, 2], [3, 4]])
np.tile(x, (3,4))
Output: array([[1, 2, 1, 2, 1, 2, 1, 2],
[3, 4, 3, 4, 3, 4, 3, 4],
[1, 2, 1, 2, 1, 2, 1, 2],
[3, 4, 3, 4, 3, 4, 3, 4],
[1, 2, 1, 2, 1, 2, 1, 2],
[3, 4, 3, 4, 3, 4, 3, 4]])
import numpy as np
x = np.array([ 3.4])
y = np.tile(x, (5,))
print(y)
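[3.4 3.4 3.4 3.4 3.4]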
If we stick to writing reps in the tuple or list form, or consider reps = 5 as an abbreviation for reps = (5,), the following is true:
If 'reps' has length n, the dimension of the resulting array will be the maximum of n and A.ndim.
If 'A.ndim < n, 'A' is promoted to be n-dimensional by prepending new axes. So a shape (5,) array is promoted
to (1, 5) for 2-D replication, or shape (1, 1, 5) for 3-D replication. If this is not the desired behavior, promote
'A' to n-dimensions manually before calling this function.
Thus for an array 'A' of shape (2, 3, 4, 5), a 'reps' of (2, 2) is treated as (1, 1, 2, 2).
Further examples:
import numpy as np
x = np.array([[1, 2], [3, 4]])
print(np.tile(x, 2))
[[1 2 1 2]
[3 4 3 4]]
import numpy as np
x = np.array([[1, 2], [3, 4]])
print(np.tile(x, (2, 1)))
[[1 2]
[3 4]
[1 2]
[3 4]]
import numpy as np
x = np.array([[1, 2], [3, 4]])
print(np.tile(x, (2, 2)))
[[1 2 1 2]
[3 4 3 4]
[1 2 1 2]
[3 4 3 4]]
INTRODUCTION
"Every American should have above average income, and my
Administration is going to see they get it."
The programming language Python and even the numerical modules Numpy and Scipy will not help us in
understanding the everyday problems mentioned above, but Python and Numpy provide us with powerful
functionalities to calculate problems from statistics and probability theory.
Warning:
Note that the pseudo-random generators in the random module should NOT be used for security purposes. Use os.urandom() or SystemRandom, if you require a cryptographically secure pseudo-random number generator.
The default pseudo-random number generator of the random module was designed with the focus on modelling and simulation, not on security. So you shouldn't generate sensitive information such as passwords, secure tokens, session keys and similar things by using random. When we say that you shouldn't use the random module, we mean the basic functionalities "randint", "random", "choice" and the likes. There is one exception, as you will learn in the next paragraph: SystemRandom
The SystemRandom class offers a suitable way to overcome this security problem. The methods of this class use an alternate random number generator, which uses tools provided by the operating system (such as /dev/urandom on Unix or CryptGenRandom on Windows).
As there has been great concern that Python developers might inadvertently make serious security errors, even though the warning is included in the documentation, Python 3.6 comes with a new module "secrets" with a CSPRNG (Cryptographically Strong Pseudo Random Number Generator).
Let's start with creating random float numbers with the random function of the random module. Please
remember that it shouldn't be used to generate sensitive information:
import random
random_number = random.random()
print(random_number)
0.34330263184538523
We will show an alternative and secure approach in the following example, in which we will use the class
SystemRandom of the random module. It will use a different random number generator. It uses sources which
are provided by the operating system. This will be /dev/urandom on Unix and CryptGenRandom on windows.
The random method of the SystemRandom class generates a float number in the range from 0.0 (included) to
1.0 (not included):
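The example is missing in this copy; a sketch:

import random

crypto = random.SystemRandom()
print(crypto.random())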
Quite often you will need more than one random number. We can create a list of random numbers by
repeatedly calling random().
import random
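The definition of random_list is missing in this copy; a sketch consistent with its use below:

def random_list(n, secure=True):
    """Return a list of n random floats, cryptographically secure by default."""
    if secure:
        random_float = random.SystemRandom().random
    else:
        random_float = random.random
    return [random_float() for _ in range(n)]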
print(random_list(10, secure=False))
[0.9702685982962019, 0.5095131905323179, 0.9324278634720967, 0.9750405405778308, 0.9750927470224396, 0.2383439553695087, 0.03591694433088444, 0.9203791901577599, 0.07793301506800698, 0.46915245764066404]
The "simple" random function of the random module is a lot faster as we can see in the following:
%%timeit
random_list(100)
10000 loops, best of 3: 158 µs per loop
%%timeit
random_list(100, secure=False)
100000 loops, best of 3: 8.64 µs per loop
crypto = random.SystemRandom()
[crypto.random() for _ in range(10)]
Output: [0.5832874631978111,
0.7494815897496974,
0.6982338101218046,
0.5164288598133177,
0.15423895558995826,
0.9447842390510461,
0.08095707071826808,
0.5407159221282145,
0.6124979567571185,
0.15764744205801628]
Alternatively, you can use a list comprehension to create a list of random float numbers (the timed statement is missing in this copy; presumably it was the comprehension itself):

%%timeit
[random.random() for _ in range(100)]
The fastest and most efficient way will be using the random package of the numpy module:
import numpy as np
np.random.random(10)
Output: array([ 0.0422172 ,  0.98285327,  0.40386413,  0.34629582,  0.25666744,
        0.69242112,  0.9231164 ,  0.47445382,  0.63654389,  0.06781786])
%%timeit
np.random.random(100)
The slowest run took 16.56 times longer than the fastest. This cou
ld mean that an intermediate result is being cached.
100000 loops, best of 3: 2.1 µs per loop
Warning:
The random package of the Numpy module apparently, even though it doesn't say so in the documentation, is completely deterministic as well, since it also uses the Mersenne Twister sequence!
It's very easy to create a list of random numbers satisfying the condition that they sum up to one. This way, we turn them into values which could be used as probabilities. We can use any of the methods explained above to normalize a list of random values: all we have to do is divide every value by the sum of the values. The easiest way will be using numpy again, of course:
import numpy as np
list_of_random_floats = np.random.random(100)
sum_of_values = list_of_random_floats.sum()
print(sum_of_values)
normalized_values = list_of_random_floats / sum_of_values
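We can check that the normalized values sum up to one (a quick sketch):

print(normalized_values.sum())   # 1.0, up to floating point error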
We assume that you don't use and don't like weak passwords like "123456", "password", "qwerty" and the likes. Believe it or not, these passwords always rank in the top 10. So are you looking for a safe password? Do you want to create passwords with Python? But don't use some of the functions ranking in the top 10 of the search results, because they may use the random function of the random module.
We will define a strong random password generator, which uses the SystemRandom class. This class uses, as we have already mentioned, a cryptographically strong pseudo-random number generator:
import random

sr = random.SystemRandom()   # this definition was missing in this copy

def generate_password(length,
                      valid_chars=None):
    """ generate_password(length, valid_chars) -> password
        length: the length of the created password
        valid_chars: a string with the characters which are
        allowed to occur in the password
    """
    if valid_chars is None:
        valid_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        valid_chars += valid_chars.lower() + "0123456789"
    password = ""
    counter = 0
    while counter < length:
        rnum = sr.randint(0, 128)
        char = chr(rnum)
        if char in valid_chars:
            password += chr(rnum)
            counter += 1
    return password
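Usage could look like this (a sketch):

print(generate_password(12))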
Everybody is familiar with creating random integer numbers without computers. If you roll a die, you create a random number between 1 and 6. In terms of probability theory, we would call "the rolling of the die" an experiment with a result from the set of possible outcomes {1, 2, 3, 4, 5, 6}. This set is also called the sample space of the experiment.
How can we simulate the rolling of a die in Python? We don't need Numpy for this aim. "Pure" Python and its
random module is enough.
import random
outcome = random.randint(1,6)
print(outcome)
4
import random
import numpy as np
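The listing the following comment refers to is missing in this copy; presumably:

outcome = np.random.randint(1, 7, size=10)
print(outcome)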
You may have noticed that we used 7 instead of 6 as the second parameter. randint from numpy.random uses a "half-open" interval, unlike randint from the Python random module, which uses a closed interval!
This function returns random integers from 'low' (inclusive) to 'high' (exclusive). In other words: randint
returns random integers from the "discrete uniform" distribution in the "half-open" interval ['low', 'high'). If
'high' is None or not given in the call, the results will range from [0, 'low'). The parameter 'size' defines the
shape of the output. If 'size' is None, a single int will be the output. Otherwise the result will be an array. The
parameter 'size' defines the shape of this array. So size should be a tuple. If size is defined as an integer n, this
is considered to be the tuple (n,).
import numpy as np
print(np.random.randint(1, 7))
print(np.random.randint(1, 7, size=1))
print(np.random.randint(1, 7, size=10))
print(np.random.randint(1, 7, size=(10,)))  # the same as the previous one
print(np.random.randint(1, 7, size=(5, 4)))
5
[3]
[1 4 3 5 5 1 5 4 5 6]
[2 1 4 3 2 1 6 5 3 3]
[[4 1 3 1]
[6 4 5 6]
[2 5 5 1]
[4 3 2 3]
[6 2 6 5]]
Simulating the rolling of a die is usually not a security-relevant issue, but if you want to create
cryptographically strong pseudo random numbers you should use the SystemRandom class again:
import random
crypto = random.SystemRandom()
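The SystemRandom instance offers the same methods as the random module, so a cryptographically strong throw of a die is simply:

print(crypto.randint(1, 6))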
We have learned how to simulate the rolling of a die with Python. We assumed that our die is fair, i.e. the
probability for each face is equal to 1/6. How can we simulate throwing a crooked or loaded die? The randint
functions of both modules are not suitable for this purpose, because every outcome is equally likely.
First we want to have a look at other useful functions of the random module.
"Having a choice" or "having choices" in real life is better than not having a choice. Even though some people
might complain, if they have too much of a choice. Life means making decisions. There are simple choices
like "Do I want a boiled egg?", "Soft or Hard boiled?" "Do I go to the cinema, theater or museum? Other
choices may have further reaching consequences like choosing the right job, study or what is the best
programming language to learn.
Let's do it with Python. The random module contains the right function for this purpose.This function can be
used to choose a random element from a non-empty sequence.
This means that we are capable of picking a random character from a string or a random element from a list or
a tuple, as we can see in the following examples. You want to have a city trip within Europe and you can't
decide where to go? Let Python help you:
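The list possible_destinations is not shown in the text; a plausible definition, consistent with the outputs below, might look like this:

from random import choice

possible_destinations = ["Berlin", "Hamburg", "Munich",
                         "Amsterdam", "London", "Paris",
                         "Zurich", "Heidelberg", "Strasbourg",
                         "Augsburg", "Milan", "Rome"]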
print(choice(possible_destinations))
Strasbourg
The choice function of numpy.random is more convenient, because it provides further
possibilities. The default call, i.e. if no further parameters are used, behaves like choice of the random
module:
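Note that the numpy version has to be imported first, shadowing the choice function of the random module:

from numpy.random import choice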
print(choice(possible_destinations))
Augsburg
x1 = choice(possible_destinations, size=3)
print(x1)
x2 = choice(possible_destinations, size=(3, 4))
print(x2)
['London' 'Augsburg' 'London']
[['Strasbourg' 'London' 'Rome' 'Berlin']
['Berlin' 'Paris' 'Munich' 'Augsburg']
['Heidelberg' 'Paris' 'Berlin' 'Rome']]
You might have noticed that the city names can have multiple occurrences. We can prevent this by setting the
optional parameter "replace" to "False":
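A sketch of such a call; with replace=False every city appears at most once:

x3 = choice(possible_destinations, size=(3,), replace=False)
print(x3)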
Setting the "size" parameter to a non None value, leads us to the sample function.
A sample can be understood as a representative part from a larger group, usually called a "population".
The module numpy.random contains a function random_sample, which returns random floats in the half open
interval [0.0, 1.0). The results are from the "continuous uniform" distribution over the stated interval. This
function takes just one parameter "size", which defines the output shape. If we set size to (3, 4) e.g., we will
get an array with the shape (3, 4) filled with random elements:
import numpy as np
x = np.random.random_sample((3, 4))
print(x)
If we call random_sample with an integer, we get a one-dimensional array. An integer has the same effect as if
we use a one-tuple as an argument:
x = np.random.random_sample(7)
print(x)
y = np.random.random_sample((7,))
print(y)
[ 0.07729483  0.07947532  0.27405822  0.34425005  0.2968612   0.27234156
  0.41580785]
[ 0.19791769  0.64537929  0.02809775  0.2947372   0.5873195   0.55059448
  0.98943354]
You can also generate arrays with values from an arbitrary interval [a, b), where a has to be less than b. It can
be done like this:
(b - a) * random_sample() + a
Example:
a = -3.4
b = 5.9
A = (b - a) * np.random.random_sample((3, 4)) + a
print(A)
[[ 5.87026891 -0.13166798 5.56074144 3.48789786]
[-2.2764547 4.84050253 0.71734827 -0.7357672 ]
[ 5.8468095 4.56323308 0.05313938 -1.99266987]]
The standard module random of Python has a more general function "sample", which produces samples from a
population. The population can be a sequence or a set.
sample(population, k)
If you want to choose a sample within a range of integers, you can - or better, you should - use range as the
argument for the population.
In the following example we produce six numbers out of the range from 1 to 49 (inclusive). This corresponds
to a drawing of the German lottery:
import random
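The call itself is not included above; with sample and range it looks like this:

print(random.sample(range(1, 50), 6))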
You may also have asked yourself whether the random modules of Python can create "real" or "true" random
numbers, which are e.g. equivalent to an ideal die. The truth is that most random numbers used in computer
programs are pseudo-random. The numbers are generated in a predictable way, because the algorithm is
deterministic. Pseudo-random numbers are good enough for many purposes, but they may not be good enough
for simulating "true" random events like rolling dice or lottery drawings.
The website RANDOM.ORG claims to offer true random numbers. They use the randomness which comes
from atmospheric noise. The numbers created this way are for many purposes better than the pseudo-random
number algorithms typically used in computer programs.
INTRODUCTION
In the previous chapter of our tutorial, we introduced the
random module. We got to know the functions 'choice' and
'sample'. We used 'choice' to choose a random element from
a non-empty sequence and 'sample' to choose k unique
random elements from a population sequence or set. The
probability is evenly distributed, i.e. each element of the
sequence or set has the same probability of being chosen.
This is exactly what we want if we simulate the rolling of dice.
But what about loaded dice? Loaded dice are designed to
favor some results over others, for whatever reasons. In our
previous chapter we had a look at the following examples:
from random import choice, sample
print(choice("abcdefghij"))
Like we said before, the chances for the elements of the sequence to be chosen are evenly distributed. So the
chance of getting 'scientist' as a return value of the call choice(professions) is 1/4. This is out of touch with
reality. There are surely more scientists and engineers in the world than there are priests and philosophers. Just
like with the loaded die, we again need a weighted choice.
We will devise a function "weighted_choice", which returns a random element from a sequence like
random.choice, but the elements of the sequence will be weighted.
import numpy as np
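If we use, for example, the weights 1/5, 1/2 and 3/10, the cumulated weights which the following reasoning is based on can be calculated like this:

weights = np.array([1/5, 1/2, 3/10])
cum_weights = weights.cumsum()
print(cum_weights)

[0.2 0.7 1. ]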
If we create a random number x between 0 and 1 by using random.random(), the probability for x to lie within
the interval [0, cum_weights[0]) is equal to 1/5. The probability for x to lie within the interval
[cum_weights[0], cum_weights[1]) is equal to 1/2 and finally, the probability for x to lie within the interval
[cum_weights[1], cum_weights[2]) is 3/10.
Now you are able to understand the basic idea of how weighted_choice operates:
import numpy as np
from random import random

def weighted_choice(objects, weights):
    """ returns a random element from the sequence of 'objects',
        the likelihood of the elements is weighted according
        to the sequence of 'weights' """
    weights = np.array(weights, dtype=np.float64)
    sum_of_weights = weights.sum()
    # standardization:
    np.multiply(weights, 1 / sum_of_weights, weights)
    weights = weights.cumsum()
    x = random()
    for i in range(len(weights)):
        if x < weights[i]:
            return objects[i]
Example:
Suppose we have a "loaded" die with P(6)=3/12 and P(1)=1/12. The probability for each of the other
outcomes is equally likely, i.e. P(2) = P(3) = P(4) = P(5) = p. Because all probabilities have to add up to 1, we have

1 - P(1) - P(6) = 4 · p

that means

p = (1 - 1/12 - 3/12) / 4 = 1/6
We call weighted_choice with 'faces_of_die' and the 'weights' list. Each call corresponds to a throw of the
loaded die.
We can show that if we throw the die a large number of times, for example 10,000 times, we get roughly the
probability values of the weights:
from collections import Counter

faces_of_die = [1, 2, 3, 4, 5, 6]
weights = [1/12, 1/6, 1/6, 1/6, 1/6, 3/12]
outcomes = []
n = 10000
for _ in range(n):
outcomes.append(weighted_choice(faces_of_die, weights))
c = Counter(outcomes)
for key in c:
c[key] = c[key] / n
print(sorted(c.values()), sum(c.values()))
[0.0832, 0.1601, 0.1614, 0.1665, 0.1694, 0.2594] 1.0
We define a list of cities and a list with their corresponding populations. The probability of a city being chosen
should be proportional to its population:
cities = ["Frankfurt",
"Stuttgart",
"Freiburg",
"München",
"Zürich",
"Hamburg"]
populations = [736000, 628000, 228000, 1450000, 409241, 1841179]
total = sum(populations)
weights = [ round(pop / total, 2) for pop in populations]
print(weights)
for i in range(10):
print(weighted_choice(cities, populations))
[0.14, 0.12, 0.04, 0.27, 0.08, 0.35]
Frankfurt
Hamburg
Stuttgart
Hamburg
Hamburg
Zürich
München
Stuttgart
München
Frankfurt
Parameter Meaning

a         A 1-dimensional array-like object or an int. If it is an array-like object, the function will return a
          random sample from its elements. If it is an int, it behaves as if we called it with np.arange(a).

size      An optional parameter defining the output shape. If the given shape is, e.g., (m, n, k), then m * n * k
          samples are drawn. The default is None, in which case a single value will be returned.

replace   An optional boolean parameter. It is used to define whether the output sample will be drawn with or
          without replacement.

p         An optional 1-dimensional array-like object, which contains the probabilities associated with each entry
          in a. If it is not given, the sample assumes a uniform distribution over all entries in a.
We will base our first exercise on the popularity of programming languages as stated by the "TIOBE index"1:
professions = ["scientist",
"philosopher",
"engineer",
"priest",
"programmer"]
choice(professions, p=probabilities)
Output: 'programmer'
Let us use the function choice to create a sample from our professions. To get two professions chosen, we set
the size parameter to the shape (2, ). In this case multiple occurrences are possible. The top ten
programming languages in August 2019 were:
Programming Language   Percentage in August 2019   Change to August 2018
C                      15.154%                     +0.19%
C#                     3.842%                      +0.30%
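The definitions of 'programming_languages' and 'weights' are not shown above. Judging from the normalized weights in the output below, they correspond to the TIOBE ratings of Java, C, Python and C++, so a plausible reconstruction is:

import numpy as np
from numpy.random import choice

programming_languages = ["Java", "C", "Python", "C++"]
weights = np.array([16.028, 15.154, 10.020, 6.057])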
# normalization
weights /= sum(weights)
print(weights)
for i in range(10):
print(choice(programming_languages, p=weights))
[0.33826638 0.32135307 0.21141649 0.12896406]
Java
C++
Python
Python
C
C++
C
C
Python
C
WEIGHTED SAMPLE
In the previous chapter on random numbers and probability, we introduced the function 'sample' of the module
'random' to randomly extract a sample from a group of objects like lists or tuples. Every object
had the same likelihood to be drawn, i.e. to be part of the sample.
In real life there will of course be situations in which some or all objects have different
probabilities. We will start again by defining a function of our own. This function will use the previously
defined 'weighted_choice' function.
Let's assume we have eight candies, coloured "red", "green", "blue", "yellow", "black", "white", "pink", and
"orange". Our friend Peter has the "weighted" preferences 1/24, 1/6, 1/6, 1/12, 1/12, 1/24, 1/8, 7/24 for
these colours. He is allowed to take 3 candies:
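The function weighted_sample itself is not shown in the text. A minimal sketch, built on the weighted_choice function from above, draws without replacement and renormalizes the weights after every draw:

def weighted_sample(population, weights, k):
    """ draws k unique elements from population,
        weighted by the given weights, without replacement """
    population = list(population)
    weights = list(weights)
    sample = []
    while len(sample) < k:
        chosen = weighted_choice(population, weights)
        index = population.index(chosen)
        # remove the chosen element and renormalize the weights:
        population.pop(index)
        weights.pop(index)
        total = sum(weights)
        weights = [w / total for w in weights]
        sample.append(chosen)
    return sample

candies = ["red", "green", "blue", "yellow",
           "black", "white", "pink", "orange"]
weights = [1/24, 1/6, 1/6, 1/12, 1/12, 1/24, 1/8, 7/24]
for i in range(10):
    print(weighted_sample(candies, weights, 3))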
['green', 'orange', 'pink']
['green', 'orange', 'black']
['yellow', 'black', 'pink']
['red', 'green', 'yellow']
['yellow', 'black', 'pink']
['green', 'orange', 'yellow']
['red', 'blue', 'pink']
['white', 'yellow', 'pink']
['green', 'blue', 'pink']
['orange', 'blue', 'black']
Let's approximate the likelihood for an orange candy to be included in the sample:
n = 100000
orange_counter = 0
for i in range(n):
if "orange" in weighted_sample(candies, weights, 3):
orange_counter += 1
print(orange_counter / n)
0.71015
It was not really necessary to write this function, because we can use the choice function of NumPy for
this purpose as well. All we have to do is assign the shape '(3, )' to the optional parameter 'size'. Let us redo the
previous example by substituting weighted_sample with a call of np.random.choice:
n = 100000
orange_counter = 0
for i in range(n):
if "orange" in np.random.choice(candies,
p=weights,
size=(3,),
replace=False):
orange_counter += 1
print(orange_counter / n)
0.71276
In addition, the function 'np.random.choice' gives us the possibility to allow repetitions, as we can see in the
following example:
countries = ["Germany", "Switzerland",
             "Austria", "Netherlands",
"Belgium", "Poland",
"France", "Ireland"]
weights = np.array([83019200, 8555541, 8869537,
17338900, 11480534, 38413000,
67022000, 4857000])
weights = weights / sum(weights)
for i in range(4):
print(np.random.choice(countries,
p=weights,
size=(3,),
replace=True))
['Germany' 'Belgium' 'France']
['Poland' 'Poland' 'Poland']
['Germany' 'Austria' 'Poland']
['France' 'France' 'France']
CARTESIAN CHOICE
The function cartesian_choice is named after the Cartesian product from set theory
CARTESIAN PRODUCT
The Cartesian product is an operation which returns a set from multiple sets. The result set from the Cartesian
product is called a "product set" or simply the "product".
For two sets A and B, the Cartesian product A × B is the set of all ordered pairs (a, b) where a ∈ A and b ∈ B:
A x B = { (a, b) | a ∈ A and b ∈ B }
If we have n sets A1, A2, ... An, we can build the Cartesian product correspondingly:

A1 x A2 x ... x An = { (a1, a2, ... an) | a1 ∈ A1, a2 ∈ A2, ... an ∈ An }
We will now write a function cartesian_choice, which takes an arbitrary number of iterables as arguments and
returns a list which consists of random choices from each iterable in the respective order.
Mathematically, we can see the result of the function cartesian_choice as an element of the Cartesian product
of the iterables which have been passed as arguments.
import numpy as np

def cartesian_choice(*iterables):
    """
    A list with random choices from each iterable of iterables
    is being created in respective order.
    """
    res = []
    for population in iterables:
        res.append(np.random.choice(population))
    return res

cartesian_choice(["The", "A"],
                 ["red", "green", "blue", "yellow", "grey"],
                 ["car", "house", "fish", "light"],
                 ["smells", "dreams", "blinks", "shines"])
Output: ['The', 'green', 'house', 'shines']
import numpy as np

def weighted_cartesian_choice(*iterables):
    """
    An arbitrary number of tuples or lists,
    each consisting of a population and weights.
    weighted_cartesian_choice returns a list
    with a choice from each population
    """
    res = []
    for population, weights in iterables:
        # normalize weights:
        weights = np.array(weights) / sum(weights)
        lst = np.random.choice(population, p=weights)
        res.append(lst)
    return res
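A short usage sketch with made-up populations and weights:

articles = (["The", "A"], [2, 1])
nouns = (["house", "car", "fish"], [3, 2, 1])
verbs = (["shines", "blinks", "swims"], [1, 1, 2])
print(weighted_cartesian_choice(articles, nouns, verbs))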
The same function can also be implemented with the standard module random:

import random

def weighted_cartesian_choice(*iterables):
    """
    A list with weighted random choices from each iterable of iterables
    is being created in respective order.
    """
    res = []
    for population, weights in iterables:
        # random.choices returns a list with one element:
        res.append(random.choices(population, weights)[0])
    return res

from collections import Counter

# a frequency check with a made-up two-element population:
c = Counter(weighted_cartesian_choice((["a", "b"], [1, 4]))[0]
            for _ in range(10000))
wsum = sum(c.values())
for key in c:
    print(key, c[key] / wsum)
RANDOM SEED
A random seed, - also called "seed state", or just "seed" - is a number
used to initialize a pseudorandom number generator. When we called
random.random() we expected and got a random number between 0
and 1. random.random() calculates a new random number by using the
previously produced random number. What about the first time we use
random in our program? Yes, there is no previously created random
number. If a random number generator is called for the first time, it will
have to create a first "random" number.
The seed number itself doesn't need to be random for the
algorithm to create values which follow a probability distribution in a
pseudorandom manner. Yet, the seed matters in terms of security: if you
know the seed, you could for example generate the secret encryption key which is based on this seed.
Random seeds are in many programming languages generated from the state of the computer system, which is
in lots of cases the system time.
This is true for Python as well. Help on random.seed says that if you call the function with None or no
argument it will seed "from current time or from an operating system specific randomness source if available."
import random
help(random.seed)
For version 2 (the default), all of the bits are used if *a* is a str,
bytes, or bytearray. For version 1, the hash() of *a* is used instead.
The seed function allows you to get a deterministic sequence of random numbers. You can repeat this
sequence whenever you need it again, e.g. for debugging purposes.
import random
random.seed(42)
for _ in range(10):
print(random.randint(1, 10), end=", ")
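If we seed with 42 again, we get exactly the same sequence of numbers:

random.seed(42)
for _ in range(10):
    print(random.randint(1, 10), end=", ")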
n = 1000
values = []
frequencies = {}
print(values[:10])
[173.49123947564414, 183.47654360102564, 186.96893210720162, 214.90676059797428,
 199.69909520396007, 183.31521532331496, 157.85035192965537, 149.56012897536849,
 187.39026585633607, 219.33242481612143]
The following program plots the random values, which we have created before. We haven't covered matplotlib
so far, so it's not necessary to understand the code:
%matplotlib inline
import matplotlib.pyplot as plt
freq = list(frequencies.items())
freq.sort()
plt.plot(*list(zip(*freq)))
import random
def random_ones_and_zeros(p):
""" p: probability 0 <= p <= 1
returns a 1 with the probability p
"""
x = random.random()
if x < p:
return 1
else:
return 0
n = 1000000
sum(random_ones_and_zeros(0.8) for i in range(n)) / n
It might be a great idea to implement a task like this with a generator. If you are not familiar with how
Python generators work, we recommend consulting the chapter on generators and iterators of our
Python tutorial.
import random
def random_ones_and_zeros(p):
while True:
x = random.random()
yield 1 if x < p else 0
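The helper firstn, which yields the first n values of a generator, is assumed here from our chapter on generators; a minimal version looks like this:

def firstn(generator, n):
    for _ in range(n):
        yield next(generator)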
n = 1000000
sum(x for x in firstn(random_ones_and_zeros(0.8), n)) / n
Output: 0.799762
Our generator random_ones_and_zeros can be seen as a sender, which emits ones and zeros with a probability
of p and (1-p) respectively.
We will write now another generator, which is receiving this bitstream. The task of this new generator is to
read the incoming bitstream and yield another bitstream with ones and zeros with a probability of 0.5 without
knowing or using the probability p. It should work for an arbitrary probability value p.2
def ebitter(bitstream):
    while True:
        bit1 = next(bitstream)
        bit2 = next(bitstream)
        if bit1 + bit2 == 1:
            bit3 = next(bitstream)
            if bit2 + bit3 == 1:
                yield 1
            else:
                yield 0
def ebitter2(bitstream):
n = 1000000
sum(x for x in firstn(ebitter(random_ones_and_zeros(0.8)), n)) / n
Output: 0.49975
n = 1000000
sum(x for x in firstn(ebitter2(random_ones_and_zeros(0.8)), n)) / n
Output: 0.500011
Underlying theory:
Each bit of our stream is 1 with probability p and 0 with probability (1-p). A pair of consecutive bits can have
the values 01, 10, 00 or 11. The probability P(01) = (1-p) · p and the probability P(10) = p · (1-p), so that the
combined probability that the two consecutive bits are either 01 or 10 (or that the sum of the two
bits is 1) is 2 · p · (1-p).
Now we look at another bit Bi+2. What is the probability that both
Bi + Bi+1 = 1
and
Bi+1 + Bi+2 = 1?
The possible outcomes satisfying these conditions and their corresponding probabilities can be found in the
following table:

Probability       Bi   Bi+1   Bi+2
p² · (1 - p)      1    0      1
p · (1 - p)²      0    1      0

We will denote the outcome sum(Bi, Bi+1)=1 as X1 and correspondingly the outcome sum(Bi+1, Bi+2)=1 as X2.
So, the joint probability P(X1, X2) = p² · (1-p) + p · (1-p)², which can be rearranged to p · (1-p).
We start with an array 'sales' of sales figures for the year 1997:
import numpy as np
The aim is to create a comma separated list like the ones you get from Excel. The file should contain the sales
figures, which we don't know, for all the shops, which we don't have, spanning the years from 1997 to 2016.
We will add random values to our sales figures year after year. For this purpose we construct an array with
growthrates. The growthrates can vary between a minimal percent value (min_percent) and maximum percent
value (max_percent):
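The construction of the growthrates array is not shown above; using the interval formula (b - a) * random_sample() + a from earlier in this chapter, a sketch with assumed bounds could look like this:

min_percent = 0.98   # assumption: worst case corresponds to -2 %
max_percent = 1.06   # assumption: best case corresponds to +6 %
growthrates = (max_percent - min_percent) * np.random.random_sample(12) + min_percent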
To get the new sales figures after a year, we multiply the sales array "sales" with the array "growthrates":
sales * growthrates
Output: array([ 1289.20412146, 2239.649113  , 1645.82833205, 2036.24919608,
                1032.80143634, 2074.41825668,  719.40621318, 1047.91173209,
                 567.42139317, 1814.15165757, 1893.62920988, 2044.21783979])
To get a more sustainable sales development, we change the growthrates only every four years.
This is our complete program, which saves the data in a file called sales_figures.csv:
import numpy as np
fh = open("sales_figures.csv", "w")
We will use this file in our chapter on reading and writing in Numpy.
EXERCISES
1. Let's do some more die rolling. Prove empirically - by writing a simulation program - that the
   probability for the combined event "an even number is rolled" (E) and "a number greater than
   2 is rolled" is 1/3.
2. The file "universities_uk.txt" contains a list of universities in the United Kingdom by
   enrollment from 2013-2014 (data from Wikipedia:
   https://en.wikipedia.org/wiki/List_of_universities_in_the_United_Kingdom_by_enrollment#cite_note-1).
   Write a function which returns a tuple (universities, enrollments, total_number_of_students) with
   - universities: list of University names
   - enrollments: corresponding list with enrollments
   - total_number_of_students: over all universities
   Now you can enroll 100,000 fictional students with a likelihood corresponding to the real
   enrollments.
3.
Let me take you back in time and space in our next exercise.
We will travel back into ancient Pythonia (Πηθωνια). It was
the time when king Pysseus ruled as the benevolent dictator
for life. It was the time when Pysseus sent out his
messengers throughout the world to announce that the time
had come for his princes Anacondos (Ανακονδος), Cobrion
(Κομπριον), Boatos (Μποατος) and Addokles (Ανδοκλης) to
marry. So, they organized the toughest programming
contests amongst the fair and brave amazons, better known
as Pythonistas of Pythonia. Finally, only eleven amazons
were left to choose from:
On the day they arrived, the chances to be drawn in the lottery were the same for every amazon, but Pysseus
wanted the lottery to be postponed to some day in the future. The probability changes every day: It will be
lowered by 1/13 for the first seven amazons and it will be increased by 1/12 for the last four amazons.
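A minimal simulation for the first exercise; the list names are chosen to match the print statements below:

import random

outcomes = [random.randint(1, 6) for _ in range(10000)]
even_pips = [x for x in outcomes if x % 2 == 0]
greater_two = [x for x in outcomes if x > 2]
combined = [x for x in outcomes if x % 2 == 0 and x > 2]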
print(len(even_pips) / len(outcomes))
print(len(greater_two) / len(outcomes))
print(len(combined) / len(outcomes))
0.5061
0.6719
0.3402
• First we will write the function "process_datafile" to process our data file:
def process_datafile(filename):
""" process_datafile -> (universities,
enrollments,
total_number_of_students)
universities: list of University names
enrollments: corresponding list with enrollments
total_number_of_students: over all universities
"""
    universities = []
    enrollments = []
    with open(filename) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            # assumption about the file format: each line contains the
            # university name and the enrollment figure, separated by
            # the last whitespace on the line
            name, enrollment = line.rsplit(maxsplit=1)
            universities.append(name)
            enrollments.append(int(enrollment.replace(",", "")))
    return universities, enrollments, sum(enrollments)

universities, enrollments, total_students = process_datafile("universities_uk.txt")
for i in range(14):
    print(universities[i], end=": ")
    print(enrollments[i])
print("Total number of students enrolled in the UK: ", total_students)
Open University in England: 123490
University of Manchester: 37925
University of Nottingham: 33270
Sheffield Hallam University: 33100
University of Birmingham: 32335
Manchester Metropolitan University: 32160
University of Leeds: 30975
Cardiff University: 30180
University of South Wales: 29195
University College London: 28430
King's College London: 27645
University of Edinburgh: 27625
Northumbria University: 27565
University of Glasgow: 27390
Total number of students enrolled in the UK:  2299380
We now want to enroll a virtual student randomly at one of the universities. To get a weighted list suitable for
our weighted_choice function, we have to normalize the values in the list enrollments:
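# divide every enrollment by the total to get weights summing up to 1:
normalized_enrollments = [enrollment / total_students
                          for enrollment in enrollments]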
The exercise asks us to "enroll" 100,000 fictional students. This can easily be accomplished
with a loop:
from collections import Counter

outcomes = []
n = 100000
for i in range(n):
    outcomes.append(weighted_choice(universities, normalized_enrollments))
c = Counter(outcomes)
print(c.most_common(20))
[('Open University in England', 5529), ('University of Manchester', 1574),
 ('University of Nottingham', 1427), ('University of Birmingham', 1424),
 ('Sheffield Hallam University', 1410), ('Manchester Metropolitan University', 1408),
 ('Cardiff University', 1334), ('University of Leeds', 1312),
 ('University of South Wales', 1264), ('University of Plymouth', 1218),
 ('University College London', 1209), ('Coventry University', 1209),
 ('University of the West of England', 1197), ('University of Edinburgh', 1196),
 ("King's College London", 1183), ('University of Glasgow', 1181),
 ('University of Central Lancashire', 1176), ('Nottingham Trent University', 1174),
 ('University of Sheffield', 1160), ('Northumbria University', 1154)]
The bunch of amazons is implemented as a list, while we choose a set for Pysseus' favorites. The weights at
the beginning are 1/11 for all, i.e. 1/len(amazons).
Every loop cycle corresponds to a new day. Every time we start a new loop cycle, we will draw "n" samples of
Pythonistas to calculate the ratio of the number of times the sample is equal to the king's favorites divided by
the number of draws. This is an estimate of the probability that the king's favorites are drawn on that day.
We can use both the function "weighted_sample" and "weighted_sample_alternative" to do the drawing.
import time

# amazons, weights and Pytheusses_favorites are assumed to be
# defined as described above
n = 1000
counter = 0
prob = 1 / 330
days = 0
factor1 = 1 / 13
factor2 = 1 / 12
start = time.perf_counter()
while prob < 0.9:
    for i in range(n):
        the_chosen_ones = weighted_sample_alternative(amazons, weights, 4)
        if set(the_chosen_ones) == Pytheusses_favorites:
            counter += 1
    prob = counter / n
    counter = 0
    weights[:7] = [ p - p*factor1 for p in weights[:7] ]
    weights[7:] = [ p + p*factor2 for p in weights[7:] ]
    weights = [ x / sum(weights) for x in weights]
    days += 1
print(time.perf_counter() - start)
The following is a solution without round-off errors. We will use Fraction from the module fractions.
import time
from fractions import Fraction

# assumed initializations, analogous to the previous version:
n = 1000
counter = 0
prob = Fraction(1, 330)
days = 0
factor1 = Fraction(1, 13)
factor2 = Fraction(1, 12)
start = time.perf_counter()
while prob < 0.9:
    #print(prob)
    for i in range(n):
        the_chosen_ones = weighted_sample_alternative(amazons, weights, 4)
        if set(the_chosen_ones) == Pytheusses_favorites:
            counter += 1
    prob = Fraction(counter, n)
    counter = 0
    weights[:7] = [ p - p*factor1 for p in weights[:7] ]
    weights[7:] = [ p + p*factor2 for p in weights[7:] ]
    weights = [ x / sum(weights) for x in weights]
    days += 1
print(time.perf_counter() - start)
We can see that the solution with fractions is beautiful but very slow, whereas the greater precision doesn't
play a role in our case.
So far, we haven't used the power of Numpy. We will do this in the next implementation of our problem:
import time
import numpy as np
n = 1000
counter = 0
prob = 1 / 330
days = 0
factor1 = 1 / 13
factor2 = 1 / 12
# here 'weights' is assumed to be a numpy array
start = time.perf_counter()
while prob < 0.9:
    for i in range(n):
        the_chosen_ones = weighted_sample_alternative(amazons, weights, 4)
        if set(the_chosen_ones) == Pytheusses_favorites:
            counter += 1
    prob = counter / n
    counter = 0
    weights[:7] = weights[:7] - weights[:7] * factor1
    weights[7:] = weights[7:] + weights[7:] * factor2
    weights = weights / np.sum(weights)
    days += 1
print(time.perf_counter() - start)
FOOTNOTES:
1
The TIOBE index or The TIOBE Programming Community index is - according to the website - "an indicator
of the popularity of programming languages. The index is updated once a month. The ratings are based on the
number of skilled engineers world-wide, courses and third party vendors. Popular search engines such as
Google, Bing, Yahoo!, Wikipedia, Amazon, YouTube and Baidu are used to calculate the ratings. It is
important to note that the TIOBE index is not about the best programming language or the language in which
most lines of code have been written."
2
I am thankful to Dr. Hanno Baehr who introduced me to the problem of "Random extraction" when
participating in a Python training course in Nuremberg in January 2014. Hanno outlined some bits of the
theoretical framework. During a night session in a pub called "Zeit & Raum" (English: "Time & Space") I
implemented a corresponding Python program to back the theoretical solution empirically.
• find_interval
• weighted_choice
• cartesian_choice
• weighted_cartesian_choice
• weighted_sample
The sets Di are the data sets from which we want to deduce our synthetical data.
In the actual implementation, the sets will be tuples or lists for practical reasons.
The process of creating synthetic data can be defined by two functions "synthesizer" and "synthesize".
Usually, the word synthesizer is used for a computerized electronic device which produces sound. Our
synthesizer produces strings or alternatively tuples with data, as we will see later.
The function synthesize - which may also be a generator, like in our implementation - takes no arguments, and
the result of a function call synthesize() will be
• a list or a tuple t = (d1, d2, ... dn) where di is drawn at random from Di
• or a string which contains the elements str(d1), str(d2), ... str(dn) where di is also drawn at
random from Di
Let us start with a simple example. We have a list of firstnames and a list of surnames. We want to hire
employees for an institute or company. Of course, it will be a lot easier in our synthetical Python environment
to find and hire specialists than in real life. The function "cartesian_choice" from the bk_random module and
the concatenation of the randomly drawn firstnames and surnames is all it takes.
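The name lists themselves are not included in the text; the following lists are reconstructed from the sample output further down, so take them as assumptions:

firstnames = ["John", "Eve", "Jane", "Paul", "Frank", "Laura",
              "Robert", "Kathrin", "Roger", "Simone", "Bernard",
              "Sarah", "Yvonne"]
surnames = ["Baker", "Smith", "Smiley", "Bychan", "Moore",
            "Chopman", "Rampman", "Miller", "Singer", "Cook"]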
import bk_random
number_of_specialists = 15
employees = set()
while len(employees) < number_of_specialists:
employee = bk_random.cartesian_choice(firstnames, surnames)
employees.add(" ".join(employee))
print(employees)
This was easy enough, but we want to do it now in a more structured way, using the synthesizer approach we
mentioned before. The code for the case in which the parameter "weights" is not None is still missing in the
following implementation:
import bk_random

# the outer function definition is an assumption, reconstructed
# from the extended version further down:
def synthesizer(data, weights=None, format_func=None, repeats=True):
    def synthesize():
        if not repeats:
            memory = set()
        while True:
            res = bk_random.cartesian_choice(*data)
            if not repeats:
                sres = str(res)
                while sres in memory:
                    res = bk_random.cartesian_choice(*data)
                    sres = str(res)
                memory.add(sres)
            if format_func:
                yield format_func(res)
            else:
                yield res
    return synthesize
# creating the generator; the exact call is an assumption:
recruit_employee = synthesizer((firstnames, surnames),
                               format_func=lambda x: " ".join(x),
                               repeats=False)
employee = recruit_employee()
for _ in range(15):
    print(next(employee))
Sarah Baker
Frank Smiley
Simone Smiley
Frank Bychan
Sarah Moore
Simone Chopman
Frank Chopman
Eve Rampman
Bernard Miller
Simone Bychan
Jane Singer
Roger Smith
John Baker
Robert Cook
Kathrin Cook
Every name, i.e. first name and last name, had the same likelihood to be drawn in the previous example. This is
not very realistic, because in countries like the US or England we will expect names like Smith and Miller to
occur more often than names like Rampman or Bychan. We will extend our synthesizer function with
additional code for the "weighted" case, i.e. weights is not None. If weights are given, we will have to use the
function weighted_cartesian_choice from the bk_random module. If "weights" is set to None, we will have to
call the function cartesian_choice. We put this decision into a separate subfunction of synthesizer to keep the
function synthesize clearer.
We do not want to fiddle around with probabilities between 0 and 1 when defining the weights, so we take the
detour with integers, which we normalize afterwards.
import bk_random

# the outer function definition and the 'choice' subfunction
# are assumptions, consistent with the description above:
def synthesizer(data, weights=None, format_func=None, repeats=True):
    def choice(data, weights):
        if weights:
            return bk_random.weighted_cartesian_choice(*zip(data, weights))
        else:
            return bk_random.cartesian_choice(*data)

    def synthesize():
        if not repeats:
            memory = set()
        while True:
            res = choice(data, weights)
            if not repeats:
                sres = str(res)
                while sres in memory:
                    res = choice(data, weights)
                    sres = str(res)
                memory.add(sres)
            if format_func:
                yield format_func(res)
            else:
                yield res
    return synthesize
employee = recruit_employee()
WINE EXAMPLE
Let's imagine that you have to describe a dozen wines. Most
probably a nice thought for many, but I have to admit that it
is not for me. The main reason is that I am not a wine drinker!
If you have defined your lists, you can use the synthesize
function.
import bk_random
def describe(data):
body, adv, adj, noun, adj2 = data
format_str = "This wine is %s with a %s %s %s\nleading to"
format_str += " a lingering %s finish!"
return format_str % (body, adv, adj, noun, adj2)
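The word lists and the synthesizer call are not shown in the text; a minimal reconstruction from the sample output below, fed into the synthesizer from the previous section, could look like this:

bodies = ["light-bodied", "medium-bodied", "full-bodied"]
adverbs = ["dynamically", "energistically", "proactively",
           "synergistically", "uniquely", "distinctively",
           "holisticly", "authoritatively"]
adjectives = ["unctuous", "toasty", "flamboyant", "structured",
              "mocha", "vanilla", "tobacco", "flowers", "fruits"]
nouns = ["aroma", "bouquet", "flavour"]
finishes = ["juicy", "complex", "vanilla", "chocolate",
            "unoaked", "tobacco", "velvetly"]

describe_wine = synthesizer((bodies, adverbs, adjectives, nouns, finishes),
                            format_func=describe, repeats=True)
wine = describe_wine()
for i in range(1, 13):
    print(str(i) + ". wine:")
    print(next(wine))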
2. wine:
This wine is medium-bodied with a energistically unctuous bouquet
leading to a lingering vanilla finish!
3. wine:
This wine is medium-bodied with a synergistically flamboyant flavour
leading to a lingering unoaked finish!
4. wine:
This wine is light-bodied with a uniquely toasty flavour
leading to a lingering juicy finish!
5. wine:
This wine is full-bodied with a holisticly flowers flavour
leading to a lingering tobacco finish!
6. wine:
This wine is full-bodied with a energistically toasty flavour
leading to a lingering chocolate finish!
7. wine:
This wine is full-bodied with a proactively tobacco bouquet
leading to a lingering velvetly finish!
8. wine:
This wine is full-bodied with a authoritatively mocha aroma
leading to a lingering juicy finish!
9. wine:
This wine is light-bodied with a dynamically vanilla flavour
leading to a lingering juicy finish!
10. wine:
This wine is medium-bodied with a dynamically structured flavour
leading to a lingering complex finish!
11. wine:
This wine is full-bodied with a distinctively fruits flavour
leading to a lingering complex finish!
12. wine:
For example:
For practical reasons, we will reduce the countries to France, Italy, Switzerland and Germany in the following
example implementation:
helpers = []
print(helpers)
import numpy as np
A = np.array([4, 7, 3, 4, 2, 8])
print(A == 4)
[ True False False  True False False]
print(A < 5)
[ True False True True True False]
B = np.array([[42,56,89,65],
[99,88,42,12],
[55,42,17,18]])
print(B>=42)
[[ True True True True]
[ True True True False]
[ True True False False]]
import numpy as np
A = np.array([
[12, 13, 14, 12, 16, 14, 11, 10, 9],
[11, 14, 12, 15, 15, 16, 10, 12, 11],
[10, 12, 12, 15, 14, 16, 10, 12, 12],
B = A < 15
B.astype(int)
Output: array([[1, 1, 1, 1, 0, 1, 1, 1, 1],
[1, 1, 1, 0, 0, 0, 1, 1, 1],
[1, 1, 1, 0, 1, 0, 1, 1, 1],
[1, 1, 0, 0, 1, 0, 0, 1, 1],
[1, 1, 0, 1, 1, 1, 0, 1, 1],
[1, 0, 0, 1, 1, 1, 0, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1]])
If you have a close look at the previous output, you will see that the upper case 'A' is hidden in the array B.
FANCY INDEXING
We will index an array C in the following example by using a Boolean mask. It is called fancy indexing, if
arrays are indexed by using boolean or integer arrays (masks). The result will be a copy and not a view.
In our next example, we will use the Boolean mask of one array to select the corresponding elements of
another array. The new array R contains all the elements of C where the corresponding value of (A<=5) is
True.
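The arrays A and C are not defined in the text; the following values are reconstructed so that they are consistent with the outputs shown below:

import numpy as np

A = np.array([4, 7, 2, 8, 6, 9, 5])
C = np.array([123, 188, 190, 99, 77, 111, 100])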
R = C[A<=5]
print(R)
[123 190 100]
C[[0, 2, 3, 1, 4, 1]]
Output: array([123, 190, 99, 188, 77, 188])
EXERCISES
Extract from the array np.array([3,4,6,10,24,89,45,43,46,99,100]) with Boolean masking all the numbers
which are not divisible by 3, all which are divisible by 5, and all which are divisible by 3 and 5. Finally, set all
elements of the array which are divisible by 3 to 42.
SOLUTIONS
import numpy as np
A = np.array([3,4,6,10,24,89,45,43,46,99,100])

div3 = A[A%3!=0]
print("Elements of A not divisible by 3:")
print(div3)

div5 = A[A%5==0]
print("Elements of A divisible by 5:")
print(div5)

div15 = A[(A%3==0) & (A%5==0)]
print("Elements of A, which are divisible by 3 and 5:")
print(div15)
print("------------------")

A[A%3==0] = 42
print("""New values of A after setting the elements of A,
which are divisible by 3, to 42:""")
print(A)
Elements of A not divisible by 3:
[ 4 10 89 43 46 100]
Elements of A divisible by 5:
[ 10 45 100]
Elements of A, which are divisible by 3 and 5:
[45]
------------------
New values of A after setting the elements of A,
which are divisible by 3, to 42:
[ 42 4 42 10 42 89 42 43 46 42 100]
For an ndarray a both numpy.nonzero(a) and a.nonzero() return the indices of the elements of a that are non-
zero. The indices are returned as a tuple of arrays, one for each dimension of 'a'. The corresponding non-zero
values can be obtained with:
a[numpy.nonzero(a)]
import numpy as np
a = np.array([[0, 2, 3, 0, 1],
              [1, 0, 0, 7, 0],
              [5, 0, 0, 1, 0]])
print(a.nonzero())
(array([0, 0, 0, 1, 1, 2, 2]), array([1, 2, 4, 0, 3, 0, 3]))
If you want to group the indices by element, you can use transpose:
transpose(nonzero(a))
np.transpose(a.nonzero())
Output: array([[0, 1],
[0, 2],
[0, 4],
[1, 0],
[1, 3],
[2, 0],
[2, 3]])
a[a.nonzero()]
Output: array([2, 3, 1, 1, 7, 5, 1])
The function 'nonzero' can be used to obtain the indices of an array, where a condition is True. In the following
script, we create the Boolean array B >= 42:
B = np.array([[42,56,89,65],
[99,88,42,12],
[55,42,17,18]])
np.nonzero(B >= 42) yields the indices of B where the condition is true:
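print(np.nonzero(B >= 42))

(array([0, 0, 0, 0, 1, 1, 1, 2, 2]), array([0, 1, 2, 3, 0, 1, 2, 0, 1]))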
Calculate the prime numbers between 0 and 100 by using a Boolean array.
Solution:
import numpy as np

# Sieve of Eratosthenes with a Boolean array (one possible solution):
is_prime = np.ones(101, dtype=bool)
is_prime[:2] = False               # 0 and 1 are not prime
for i in range(2, int(np.sqrt(101)) + 1):
    if is_prime[i]:
        is_prime[i*i::i] = False   # multiples of i are not prime
print(np.nonzero(is_prime))
(array([ 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 4
7, 53, 59,
61, 67, 71, 73, 79, 83, 89, 97]),)
Similar functions:
• flatnonzero: return indices that are non-zero in the flattened version of the input array
• Matrix addition
• Matrix subtraction
• Matrix multiplication
• Scalar product
• Cross product
• and lots of other operations on matrices
• +
• -
• *
• /
• %
are applied on the elements; this means that the arrays have to have the same shape.
import numpy as np
x = np.array([1, 5, 2])
y = np.array([7, 4, 1])
x + y
Output: array([8, 9, 3])
x * y
Output: array([ 7, 20, 2])
x - y
Output: array([-6,  1,  1])
x / y
Output: array([0.14285714, 1.25 , 2. ])
x = np.array([3, 2])
y = np.array([5, 1])
z = x + y
z
Output: array([8, 3])
We can see from the definition of the scalar product that it can be used to calculate the cosine of the angle
between two vectors.
x = np.array([1, 2, 3])
y = np.array([-7, 8, 9])
np.dot(x, y)
Output: 36
>>> dot = np.dot(x, y)
>>> x_modulus = np.sqrt((x*x).sum())
>>> y_modulus = np.sqrt((y*y).sum())
>>> cos_angle = dot / x_modulus / y_modulus  # cosine of angle between x and y
MATRIX PRODUCT
The matrix product of two matrices can be calculated if the number of columns of the left matrix is equal to
the number of rows of the second or right matrix.
The product of an (l × m)-matrix A = (a_ij) with i=1...l, j=1...m and an (m × n)-matrix B = (b_ij) with i=1...m,
j=1...n is a matrix C = (c_ij) with i=1...l, j=1...n, which is calculated like this:

c_ij = a_i1 · b_1j + a_i2 · b_2j + ... + a_im · b_mj

If we want to perform matrix multiplication with two numpy arrays (ndarray), we have to use the dot product:
So, what's the price in Euro of these chocolates: A costs 2.98 Euro per 100
g, B costs 3.90 Euro and C only 1.99 Euro per 100 g.
If we have to calculate how much each of them had to pay, we can use Python, NumPy and Matrix
multiplication:
import numpy as np
NumPersons = np.array([[100, 175, 210],
[90, 160, 150],
[200, 50, 100],
[120, 0, 310]])
Price_per_100_g = np.array([2.98, 3.90, 1.99])
Price_in_Cent = np.dot(NumPersons,Price_per_100_g)
Price_in_Euro = Price_in_Cent / 100
Price_in_Euro
Output: array([13.984, 11.907, 9.9 , 9.745])
x = np.array([0, 0, 1])
y = np.array([0, 1, 0])
np.cross(x, y)
Output: array([-1, 0, 0])
np.cross(y, x)
Output: array([1, 0, 0])
There are lots of ways for reading from file and writing to data
files in numpy. We will discuss the different ways and
corresponding functions in this chapter:
• savetxt
• loadtxt
• tofile
• fromfile
• save
• load
• genfromtxt
In the following simple example, we define an array x and save it as a textfile with savetxt:
import numpy as np
x = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]], np.int32)
np.savetxt("test.txt", x)
The file "test.txt" is a textfile and its content looks like this:
Attention: The above output has been created on the Linux command prompt!
Parameter Meaning

newline   A string (e.g. "\n", "\r\n" or ",\n") which will end a line instead of the default line ending.

comments  A string that will be prepended to the 'header' and 'footer' strings, to mark them as comments. The
          hash tag '#' is used as the default.
y = np.loadtxt("test.txt")
print(y)
[[ 1. 2. 3.]
[ 4. 5. 6.]
[ 7. 8. 9.]]
y = np.loadtxt("test2.txt", delimiter=",")
print(y)
[[ 1. 2. 3.]
[ 4. 5. 6.]
[ 7. 8. 9.]]
def time2float_minutes(time):
if type(time) == bytes:
time = time.decode()
t = time.split(":")
minutes = float(t[0])*60 + float(t[1]) + float(t[2]) * 0.05 / 3
return minutes
You might have noticed that we check the type of time for bytes. The reason for this is the use of our function
"time2float_minutes" in loadtxt in the following example. The keyword parameter converters contains a
dictionary which can hold a function for a column (the key of the column corresponds to the key of the
dictionary) to convert the string data of this column into a float. The string data is a byte string. That is why
we had to convert it into a unicode string in our function:
y = np.loadtxt("times_and_temperatures.txt",
converters={ 0: time2float_minutes})
print(y)
[[ 360. 20.1]
[ 361.5 16.1]
[ 363. 16.9]
...,
[ 1375.5 22.5]
[ 1377. 11.1]
[ 1378.5 15.2]]
TOFILE
The data of the ndarray A is always written in 'C' order, regardless of the order of A.
The data file written by this method can be reloaded with the function fromfile() .
Parameter Meaning

sep       The string 'sep' defines the separator between array items for text output. If it is empty (''), a
          binary file is written, equivalent to file.write(a.tostring()).

format    Format string for text file output. Each entry in the array is formatted to text by first converting
          it to the closest Python type, and then using 'format' % item.
Remark:
Information on endianness and precision is lost. Therefore it may not be a good idea to use the function to
archive data or transport data between machines with different endianness. Some of these problems can be
overcome by outputting the data as text files, at the expense of speed and file size.
fh = open("test6.txt", "bw")
x.tofile(fh)
FROMFILE
fromfile is a function to read in data which has been written with the tofile function. It's possible to read binary
data if the data type is known. It's also possible to parse simply formatted text files. The data from the file is
turned into an array.
Parameter Meaning

file      'file' can be either a file object or the name of the file to read.

dtype     Defines the data type of the array which will be constructed from the file data. For binary files, it
          is used to determine the size and byte-order of the items in the file.

count     Defines the number of items which will be read. -1 means all items will be read.

sep       The string 'sep' defines the separator between the items, if the file is a text file. If it is empty
          (''), the file will be treated as a binary file. A space (" ") in a separator matches zero or more
          whitespace characters. A separator consisting solely of spaces has to match at least one whitespace.
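The structured dtype dt used in the following calls can be read off from the dtype in the output:

import numpy as np

dt = np.dtype([('time', [('min', np.int64), ('sec', np.int64)]),
               ('temp', np.float64)])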
fh = open("test4.txt", "rb")
np.fromfile(fh, dtype=dt)
Output: array([((4294967296, 12884901890), 1.0609978957e-313),
((30064771078, 38654705672), 2.33419537056e-313),
((55834574860, 64424509454), 3.60739284543e-313),
((81604378642, 90194313236), 4.8805903203e-313),
((107374182424, 115964117018), 6.1537877952e-313),
((133143986206, 141733920800), 7.42698527006e-313),
((158913789988, 167503724582), 8.70018274493e-313),
((184683593770, 193273528364), 9.9733802198e-313)],
      dtype=[('time', [('min', '<i8'), ('sec', '<i8')]), ('temp', '<f8')])
import numpy as np
import os
fh = open("test4.txt", "rb")
# 4 * 32 = 128
fh.seek(128, os.SEEK_SET)
Attention:
It can cause problems to use tofile and fromfile for data storage, because the binary files generated
are not platform independent. There is no byte-order or data-type information saved by tofile . Data can
be stored in the platform independent .npy format using save and load instead.
import numpy as np
from tempfile import TemporaryFile

outfile = TemporaryFile()
x = np.arange(10)
np.save(outfile, x)
print(x)
genfromtxt is slower than loadtxt, but it is capable of coping with missing data. It processes the file data in two
passes: at first it converts the lines of the file into strings, thereupon it converts the strings into the requested
data type. loadtxt on the other hand works in one go, which is the reason why it is faster.
RECFROMCSV(FNAME, **KWARGS)
This is not really another way to read in csv data. 'recfromcsv' is basically a shortcut for a call of genfromtxt
with keyword arguments preset for csv files, i.e. delimiter=",", names=True and dtype=None.
INTRODUCTION
Matplotlib is a plotting library like GNUplot. The main
advantage over GNUplot is the fact that Matplotlib is a
Python module. Due to the growing interest in Python, the
popularity of matplotlib is continually rising as well.
Another characteristic of matplotlib is its steep learning curve, which means that users usually make rapid
progress after having started. The official website has the following to say about this: "matplotlib tries to
make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts,
errorcharts, scatterplots, etc, with just a few lines of code."
We will use the pyplot submodule of matplotlib. pyplot provides a procedural interface to the object-oriented
plotting library of matplotlib.
Its plotting commands are chosen in a way that they are similar to Matlab both in naming and with the
arguments.
What we see is a continuous graph, even though we provided discrete data for the Y values. By adding a
format string to the function call of plot, we can create a graph with discrete values, in our case blue circle
markers. The format string defines the way how the discrete points have to be rendered.
days = range(1, 9)
celsius_values = [25.6, 24.1, 26.7, 28.3, 27.5, 30.5, 32.8, 33.1]
fig, ax = plt.subplots()
ax.plot(days, celsius_values)
ax.set(xlabel='Day',
ylabel='Temperature in Celsius',
title='Temperature Graph')
plt.show()
We can specify an arbitrary number of x, y, fmt groups in one plot function. We will extend our previous
temperature example to demonstrate this. We provide two lists with temperature values, one for the minimum
and one for the maximum values:
days = list(range(1,9))
celsius_min = [19.6, 24.1, 26.7, 28.3, 27.5, 30.5, 32.8, 33.1]
celsius_max = [24.8, 28.9, 31.3, 33.0, 34.9, 35.6, 38.4, 39.2]
fig, ax = plt.subplots()
ax.set(xlabel='Day',
ylabel='Temperature in Celsius',
title='Temperature Graph')
ax.plot(days, celsius_min,
days, celsius_min, "oy",
days, celsius_max,
days, celsius_max, "or")
plt.show()
fig, ax = plt.subplots()
ax.set(xlabel='Day',
ylabel='Temperature in Celsius',
title='Temperature Graph')
ax.plot(days, celsius_min)
ax.plot(days, celsius_min, "oy")
ax.plot(days, celsius_max)
ax.plot(days, celsius_max, "or")
plt.show()
plt.xlabel("Years")
plt.ylabel("Values")
plt.title("Bar Chart Example")
plt.plot()
plt.show()
gaussian_numbers = np.random.normal(size=10000)
gaussian_numbers
plt.hist(gaussian_numbers, bins=20)
plt.title("Gaussian Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
x = np.arange(0, 11)
y1 = np.random.randint(2, 7, (11,))
y2 = np.random.randint(9, 14, (11,))
y3 = np.random.randint(15, 25, (11,))
# Markers: https://matplotlib.org/api/markers_api.html
plt.scatter(x, y1)
plt.scatter(x, y2, marker='v', color='r')
plt.scatter(x, y3, marker='^', color='m')
plt.title('Scatter Plot Example')
plt.show()
idxes = [ 1, 2, 3, 4, 5, 6, 7, 8, 9]
y1 = [23, 42, 33, 43, 8, 44, 43, 18, 21]
y2 = [9, 31, 25, 14, 17, 17, 42, 22, 28]
y3 = [18, 29, 19, 22, 18, 16, 13, 32, 21]
plt.stackplot(idxes,
y1, y2, y3)
plt.title('Stack Plot Example')
plt.show()
# Pie chart, where the slices will be ordered and plotted counter-
clockwise:
labels = 'C', 'Python', 'Java', 'C++', 'C#'
sizes = [13.38, 11.87, 11.74, 7.81, 4.41]
explode = (0, 0.1, 0, 0, 0)  # only "explode" the 2nd slice (i.e. 'Python')
# Pie chart, where the slices will be ordered and plotted counter-
clockwise:
labels = 'C', 'Python', 'Java', 'C++', 'C#', 'others'
sizes = [13.38, 11.87, 11.74, 7.81, 4.41]
sizes.append(100 - sum(sizes))
explode = (0, 0.1, 0, 0, 0, 0)  # only "explode" the 2nd slice (i.e. 'Python')
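The plotting call itself is missing above; the canonical Matplotlib pie call, as it could look with the data we just defined, is:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.pie(sizes, explode=explode, labels=labels,
       autopct='%1.1f%%', shadow=True, startangle=90)
ax.axis('equal')   # equal aspect ratio, so the pie is drawn as a circle
plt.show()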
FORMAT PARAMETER
import matplotlib.pyplot as plt
What we see is a continuous graph, even though we provided discrete data for the Y values. By adding a
format string to the function call of plot, we can create a graph with discrete values, in our case blue circle
markers. The format string defines the way how the discrete points have to be rendered.
The following format string characters are accepted to control the line style or marker:
=============================================
character description
=============================================
'-' solid line style
'--' dashed line style
'-.' dash-dot line style
':' dotted line style
'.' point marker
',' pixel marker
'o' circle marker
'v' triangle_down marker
'^' triangle_up marker
'<' triangle_left marker
'>' triangle_right marker
'1' tri_down marker
'2' tri_up marker
'3' tri_left marker
'4' tri_right marker
's' square marker
'p' pentagon marker
==================
character color
==================
'b' blue
'g' green
'r' red
'c' cyan
'm' magenta
'y' yellow
'k' black
'w' white
==================
As you may have guessed already, you can add X values to the plot function. We will use the multiples of 3
starting at 0 up to 21 as the X values of the plot in the following example:
# our X values:
days = list(range(0, 22, 3))
print("Values of days:", days)
# our Y values:
celsius_values = [25.6, 24.1, 26.7, 28.3, 27.5, 30.5, 32.8, 33.1]
plt.plot(days, celsius_values)
plt.show()
As we have already mentioned a figure may also contain more than one axes:
fig, ax = plt.subplots()
print(type(fig))
print(type(ax))
The function subplots can be used to create a figure and a set of subplots. In our previous example, we
called the function without parameters, which creates a figure with a single included axes. We will see later
how to create more than one axes with subplots.
We will demonstrate in the following example, how we can go on with these objects to create a plot. We can
see that we apply plot directly on the axis object ax . This leaves no possible ambiguity as in the
plt.plot approach:
fig, ax = plt.subplots()
ax.plot(X, Y)
LABELS ON AXES
We can improve the appearance of our graph by adding labels to the axes. We also want to give our plot a
headline or let us call it a title to stay in the terminology of Matplotlib. We can accomplish this by using
the set method of the axis object ax :
days = list(range(1,9))
celsius_values = [25.6, 24.1, 26.7, 28.3, 27.5, 30.5, 32.8, 33.1]
fig, ax = plt.subplots()
ax.plot(days, celsius_values)
ax.set(xlabel='Day',
ylabel='Temperature in Celsius',
title='Temperature Graph')
plt.show()
We can specify an arbitrary number of x, y, fmt groups in a plot function. We will extend our previous
temperature example to demonstrate this. We provide two lists with temperature values, one for the minimum
and one for the maximum values:
days = list(range(1,9))
celsius_min = [19.6, 24.1, 26.7, 28.3, 27.5, 30.5, 32.8, 33.1]
celsius_max = [24.8, 28.9, 31.3, 33.0, 34.9, 35.6, 38.4, 39.2]
fig, ax = plt.subplots()
ax.set(xlabel='Day',
ylabel='Temperature in Celsius',
title='Temperature Graph')
ax.plot(days, celsius_min,
days, celsius_min, "oy",
days, celsius_max,
days, celsius_max, "or")
We could have used four plot calls in the previous code instead of one, even though this is not very attractive:
fig, ax = plt.subplots()
ax.set(xlabel='Day',
ylabel='Temperature in Celsius',
title='Temperature Graph')
ax.plot(days, celsius_min)
ax.plot(days, celsius_min, "oy")
ax.plot(days, celsius_max)
ax.plot(days, celsius_max, "or")
plt.show()
fig, ax = plt.subplots()
ax.plot(days, celsius_values)
ax.set(xlabel='Day',
ylabel='Temperature in Celsius',
title='Temperature Graph')
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
We can use linewidth to set the width of a line as the name implies.
import numpy as np
import matplotlib.pyplot as plt
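# X and the curves F1 to F4 are not defined in the text;
# plausible sample curves (an assumption) could be:
X = np.linspace(0, 2 * np.pi, 100)
F1 = np.sin(X)
F2 = np.sin(2 * X)
F3 = np.sin(3 * X)
F4 = np.cos(X)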
fig, ax = plt.subplots()
ax.plot(X, F1, color="blue", linewidth=2.5, linestyle="-")
ax.plot(X, F2, color="red", linewidth=1.5, linestyle="--")
ax.plot(X, F3, color="green", linewidth=2, linestyle=":")
ax.plot(X, F4, color="grey", linewidth=2, linestyle="-.")
plt.show()
fig, ax = plt.subplots()
ax.scatter(3, 7, s=42)
Output: <matplotlib.collections.PathCollection at 0x7fd1ee805e20>
It is possible to shade or colorize regions between two curves. We are filling the region between the X axis and
the graph of sin(2*X) in the following example:
import numpy as np
import matplotlib.pyplot as plt
n = 256
X = np.linspace(-np.pi,np.pi,n,endpoint=True)
Y = np.sin(2*X)
fig, ax = plt.subplots()
ax.plot (X, Y, color='blue', alpha=1.0)
ax.fill_between(X, 0, Y, color='blue', alpha=.2)
plt.show()
Parameter   Meaning

where       If None, default to fill between everywhere. If not None, it is an N-length numpy boolean array and
            the fill will only happen over the regions where where==True.

interpolate If True, interpolate between the two lines to find the precise point of intersection. Otherwise, the
            start and end points of the filled region will only occur on explicit values in the x array.
import numpy as np
import matplotlib.pyplot as plt
n = 256
X = np.linspace(-np.pi,np.pi,n,endpoint=True)
Y = np.sin(2*X)
fig, ax = plt.subplots()
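# the fill call itself is missing above; a sketch using the 'where'
# parameter, shading only the regions where sin(2*X) is positive:
ax.plot(X, Y, color='blue', alpha=1.0)
ax.fill_between(X, 0, Y, where=Y > 0, color='blue', alpha=.2)
plt.show()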
We will move around the spines in the course of this chapter so that they
form a 'classical' coordinate system: one where we have an x axis and a y
axis, both going through the origin, i.e. the point (0, 0).
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plt.show()
CUSTOMIZING TICKS
Matplotlib has so far - in all our previous examples - automatically taken over the task of spacing points on the
axis. We can see for example that the X axis in our previous example was numbered -6, -4, -2, 0,
2, 4, 6, whereas the Y axis was numbered -1.0, 0, 1.0, 2.0, 3.0.
xticks is a method, which can be used to get or to set the current tick locations and the labels. The same is true
for yticks:
xticks = ax.get_xticks()
xticklabels = ax.get_xticklabels()
print(xticks, xticklabels)
for i in range(6):
print(xticklabels[i])
As we said before, we can also use xticks to set the location of xticks:
Now, we will set both the locations and the labels of the xticks:
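A minimal sketch with made-up tick positions and labels:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
plt.xticks([1, 2, 3], ['one', 'two', 'three'])
plt.show()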
Let's get back to our previous example with the trigonometric functions. Most people might consider factors of
Pi to be more appropriate for the X axis than the integer labels:
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plt.show()
There is an easier way to set the values of the xticks so that we do not have to calculate them manually. We use
plt.MultipleLocator with np.pi/2 as argument:
import numpy as np
import matplotlib.pyplot as plt
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
plt.show()
import numpy as np
import matplotlib.pyplot as plt
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.set_xticklabels([r'$-2\pi$', r'$-\frac{3\pi}{2}$', r'$-\pi$',
r'$-\frac{\pi}{2}$', 0, r'$\frac{\pi}{2}$',
r'$+\pi$', r'$\frac{3\pi}{2}$', r'$+2\pi$'])
ax.plot(X, F1, X, F2)
plt.show()
print(ax.get_xticklabels())
<a list of 4 Text xticklabel objects>
Let's increase the fontsize and make the font semi-transparent:
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.set_xticklabels([r'$-2\pi$', r'$-\frac{3\pi}{2}$', r'$-\pi$',
r'$-\frac{\pi}{2}$', 0, r'$\frac{\pi}{2}$',
r'$+\pi$', r'$\frac{3\pi}{2}$', r'$+2\pi$'])
ax.legend(loc='lower left')
plt.show()
ADDING A LEGEND
legend(*args, **kwargs)
All we have to do to create a legend for lines, which already exist on the axes, is to simply call the function
"legend" with an iterable of strings, one for each legend item.
We have mainly used trigonometric functions in our previous chapter. For a change we want to use now
polynomials. We will use the Polynomial class which we have defined in our chapter Polynomials.
fig, ax = plt.subplots()
X = np.linspace(-2, 3, 50, endpoint=True)
F = p(X)
F_derivative = p_der(X)
ax.plot(X, F)
ax.plot(X, F_derivative)
If we add a label to the plot function, the value will be used as the label in the legend command. There is
another argument that we can add to the legend function: we can define the location of the legend inside of
the axes plot with the parameter loc:
If we add a label to the plot function, the values will be used in the legend command:
fig, ax = plt.subplots()
X = np.linspace(-2, 3, 50, endpoint=True)
F = p(X)
F_derivative = p_der(X)
ax.plot(X, F, label="p")
ax.plot(X, F_derivative, label="derivation of p")
ax.legend(loc='upper left')
It might be even more interesting to see the actual function in mathematical notation in our legend. Our
polynomial class is capable of printing the function in LaTeX notation.
print(p)
print(p_der)
-0.8x^4 + 2.3x^3 + 0.5x^2 + 1x^1 + 0.2
-3.2x^3 + 6.8999999999999995x^2 + 1.0x^1 + 1
p = Polynomial(2, 3, -4, 6)
p_der = p.derivative()
fig, ax = plt.subplots()
X = np.linspace(-2, 3, 50, endpoint=True)
F = p(X)
F_derivative = p_der(X)
ax.plot(X, F, label="$" + str(p) + "$")
ax.plot(X, F_derivative, label="$" + str(p_der) + "$")
ax.legend(loc='upper left')
In many cases we don't know what the result may look like before we plot it. It could be, for example, that the
legend will overshadow an important part of the lines. If you don't know what the data may look like, it may
be best to use 'best' as the argument for loc. Matplotlib will automatically try to find the best possible location
for the legend:
fig, ax = plt.subplots()
X = np.linspace(-2, 3, 50, endpoint=True)
F = p(X)
F_derivative = p_der(X)
ax.plot(X, F, label="$" + str(p) + "$")
ax.plot(X, F_derivative, label="$" + str(p_der) + "$")
ax.legend(loc='best')
We will go back to trigonometric functions in the following examples. These examples show that
loc='best' automatically picks a suitable position for the legend.
import numpy as np
import matplotlib.pyplot as plt
plt.show()
plt.show()
We demonstrate how easy it is to annotate plots in matplotlib with the annotate method. We
will annotate the local maximum and the local minimum of a function. In its simplest form the annotate
method needs two arguments annotate(s, xy), where s is the text string for the annotation and xy
is the position of the point to be annotated:
p = Polynomial(1, 0, -12, 0)
p_der = p.derivative()
fig, ax = plt.subplots()
X = np.arange(-5, 5, 0.1)
F = p(X)
F_derivative = p_der(X)
ax.grid()
ax.plot(X, F, label="p")
ax.plot(X, F_derivative, label="derivative of p")
ax.legend(loc='best')
plt.show()
p = Polynomial(1, 0, -12, 0)
p_der = p.derivative()
fig, ax = plt.subplots()
X = np.arange(-5, 5, 0.1)
F = p(X)
F_derivative = p_der(X)
ax.grid()
ax.plot(X, F, label="p")
ax.plot(X, F_derivative, label="derivative of p")
ax.annotate("local maximum",
xy=(-2, p(-2)),
xytext=(-1, p(-2)+35),
arrowprops=dict(facecolor='orange'))
ax.annotate("local minimum",
xy=(2, p(2)),
xytext=(-2, p(2)-40),
arrowprops=dict(facecolor='orange', shrink=0.05))
ax.annotate("inflection point",
xy=(0, p(0)),
xytext=(-3, -30),
arrowprops=dict(facecolor='orange', shrink=0.05))
ax.legend(loc='best')
plt.show()
We have to provide some information for the parameters of annotate that we used in our previous example.

The xy and the xytext locations of our example are in data coordinates. There are other coordinate systems available that we can choose. The coordinate systems of xy and xytext can be specified with string values assigned to xycoords and textcoords. The default value is 'data':

Value             Meaning
figure points     points from the lower left corner of the figure
figure pixels     pixels from the lower left corner of the figure
figure fraction   (0, 0) is the lower left of the figure and (1, 1) the upper right
axes fraction     (0, 0) is the lower left of the axes and (1, 1) the upper right

Additionally, we can also specify the properties of the arrow. To do so, we have to provide a dictionary of arrow properties to the parameter arrowprops:

Key      Meaning
shrink   move the tip and base some percent away from the annotated point and text
Of course, the sine function has "boring" and interesting values. Let's assume that you are especially interested in the value of 3 · sin(3π/4).
import numpy as np
print(3 * np.sin(3 * np.pi / 4))
The numerical result doesn't look special, but if we do a symbolic calculation for the above expression, we get 3/√2. Now we want to label this point on the graph. We can do this with the annotate function.
import numpy as np
import matplotlib.pyplot as plt
X = np.linspace(-2 * np.pi, 2 * np.pi, 100)
F1 = 3 * np.sin(X)
fig, ax = plt.subplots()
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
#plt.xticks(np.arange(-2 * np.pi, 2.5 * np.pi, np.pi / 2))
ax.set_xticklabels([r'$-2\pi$', r'$-\frac{3\pi}{2}$', r'$-\pi$',
                    r'$-\frac{\pi}{2}$', 0, r'$\frac{\pi}{2}$',
                    r'$+\pi$', r'$\frac{3\pi}{2}$', r'$+2\pi$'])
ax.plot(X, F1)
# annotate the point (3*pi/4, 3/sqrt(2)); the text position is chosen freely:
x = 3 * np.pi / 4
ax.annotate(r'$(\frac{3\pi}{4}, \frac{3}{\sqrt{2}})$',
            xy=(x, 3 * np.sin(x)), xytext=(x + 0.5, 2.8),
            arrowprops=dict(facecolor='orange', shrink=0.05))
plt.show()
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(X, F1)   # X and F1 as defined above
plt.show()
fig, axs = plt.subplots(3, 3)   # assumed grid; the original creation line was lost
for ax in axs.flat:
    ax.set(xlim=(0, 1), ylim=(0, 1), xticks=[], yticks=[], aspect=1)
fig.tight_layout(pad=0.2)
plt.show()
In the simplest case this might mean that you have one curve and you want another curve printed over it. This is not a problem, because it is enough to put the two plots in your script, as we have seen before. The more interesting case is if you want, for example, two plots next to each other: in one figure, but in two subplots. The idea is to have more than one graph in one window, with each graph appearing in its own subplot.

The function subplots creates a figure and a set of subplots. It is a wrapper function to make it convenient to create common layouts of subplots, including the enclosing figure object, in a single call.
This function returns a figure and an Axes object or an array of Axes objects.
If we call this function without any parameters - like we do in the following example - a Figure object and one
Axes object will be returned:
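The following minimal snippet shows the types of the two return values:

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
print(type(fig))   # a matplotlib Figure
print(type(ax))    # a single Axes object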
Parameter     Meaning

sharey        analogous to sharex. When subplots have a shared x-axis along a column, only the x tick labels of the bottom subplot are created. Similarly, when subplots have a shared y-axis along a row, only the y tick labels of the first column subplot are created.

subplot_kw    dict, optional. Dict with keywords passed to the ~matplotlib.figure.Figure.add_subplot call used to create each subplot.

gridspec_kw   dict, optional. Dict with keywords passed to the ~matplotlib.gridspec.GridSpec constructor used to create the grid the subplots are placed on.

**fig_kw      All additional keyword arguments are passed to the .pyplot.figure call.
In the previous chapters of our tutorial, we saw already the simple case of creating one figure and one axes.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2) + np.cos(x)
fig, ax = plt.subplots()
ax.plot(x, y)   # assumed; the plotting lines were lost in extraction
plt.show()
We will now demonstrate how to create two subplots next to each other. By setting the parameter sharey to True, we make sure that both subplots share the y-axis:
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)   # assumed creation, lost in extraction
ax1.text(0.5, 0.5,
         "left",
         color="green",
         fontsize=18,
         ha='center')
ax2.text(0.5, 0.5,
"right",
color="green",
fontsize=18,
ha='center')
plt.show()
<Figure size 432x288 with 0 Axes>
We demonstrate in the following example how to create a subplot for polar plotting. We achieve this by creating a key 'polar' in the subplot_kw dictionary and setting it to True:
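A minimal sketch of such a polar subplot (the plotted function is an arbitrary stand-in):

import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots(subplot_kw=dict(polar=True))
theta = np.linspace(0, 2 * np.pi, 200)
ax.plot(theta, np.abs(np.sin(3 * theta)))   # an arbitrary example curve
plt.show()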
rows, cols = 2, 3
fig, ax = plt.subplots(rows, cols,
sharex='col',
sharey='row')
plt.show()
fig, ax = plt.subplots(2,
                       sharex='col', sharey='row')
ax[0].text(0.5, 0.5,
           "top",
           color="green",
           fontsize=18,
           ha='center')
ax[1].text(0.5, 0.5,
           "bottom",
           color="green",
           fontsize=18,
           ha='center')
plt.show()
We will create now a similar structure with two subplots on top of each other containing polar plots:
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return np.sin(x) - x * np.cos(x)

def fp(x):
    """ The derivative of f """
    return x * np.sin(x)

# subplot_kw=dict(polar=True) creates the polar subplots announced above
# (assumed; the original call was garbled in extraction):
fig, ax = plt.subplots(2,
                       subplot_kw=dict(polar=True))
plt.show()
fig, ax = plt.subplots(2, 2)   # assumed creation, lost in extraction
ax[0, 0].set_facecolor('xkcd:salmon')
ax[1, 1].text(0.5, 0.5,
              'ax[1, 1]',
              ha='center', va='center',
              fontsize=20,
              color="y")
ax[1, 0].set_facecolor((0.8, 0.6, 0.5))
ax[0, 1].set_facecolor((1, 1, 0.5))
plt.show()
Let us get rid of the ticks again. This time we cannot use plt.xticks(()) and plt.yticks(()) .
We have to use the set_xticks(()) and set_yticks(()) methods instead.
python_course_green = "#476042"
fig = plt.figure(figsize=(6, 4))
sub1 = plt.subplot(2, 2, 1)
sub1.set_xticks(())
sub1.set_yticks(())
sub1.text(0.5, 0.5, 'subplot(2,2,1)', ha='center', va='center',
size=20, alpha=.5)
sub2 = plt.subplot(2, 2, 2)
sub2.set_xticks(())
sub2.set_yticks(())
sub2.text(0.5, 0.5, 'subplot(2,2,2)', ha='center', va='center',
size=20, alpha=.5)
sub3 = plt.subplot(2, 2, 3)
sub3.set_xticks(())
sub3.set_yticks(())
sub3.text(0.5, 0.5, 'subplot(2,2,3)', ha='center', va='center',
size=20, alpha=.5)
# the fourth subplot uses python_course_green as its background colour;
# this is what the variable above is defined for (reconstructed):
sub4 = plt.subplot(2, 2, 4, facecolor=python_course_green)
sub4.set_xticks(())
sub4.set_yticks(())
sub4.text(0.5, 0.5, 'subplot(2,2,4)', ha='center', va='center',
          size=20, alpha=.5, color="y")
fig.tight_layout()
plt.show()
The previous examples are solely showing how to create a subplot design. Usually, you want to write Python
programs using Matplotlib and its subplot features to depict some graphs. We will demonstrate how to
populate the previous subplot design with some example graphs:
import numpy as np
from numpy import e, pi, sin, exp, cos
import matplotlib.pyplot as plt
def f(t):
return exp(-t) * cos(2*pi*t)
def fp(t):
return -2*pi * exp(-t) * sin(2*pi*t) - e**(-t)*cos(2*pi*t)
def g(t):
    return sin(t) * cos(1/(t+0.1))

python_course_green = "#476042"
fig = plt.figure(figsize=(6, 4))
t = np.arange(0.0, 5.0, 0.02)   # assumed sample points; the original line was lost
plt.plot(t, g(t))
plt.tight_layout()
plt.show()
The following example shows nothing special. We will remove the xticks and play around with the size of the figure and the subplots. To do this, we introduce the keyword parameter figsize of 'figure' and the function 'subplots_adjust' along with its keyword parameters bottom, left, top, right:
Alternative Solution:

As the first three elements of the 2x3 grid have to be joined, we can choose a tuple notation, in our case (1,3) in (2,3,(1,3)), to define that the first three elements of a notional 2x3 grid are joined:
fig = plt.figure(figsize=(6, 4))
fig.subplots_adjust(bottom=0.025, left=0.025, top=0.975, right=0.975)
How can you create a subplot design of a 3x2 grid, where the complete first column is spanned?
Solution:
SUBPLOTS WITH GRIDSPEC
'matplotlib.gridspec' contains a class GridSpec. It can be used as an alternative to subplot to specify the
geometry of the subplots to be created. The basic idea behind GridSpec is a 'grid'. A grid is set up with a
number of rows and columns. We have to define after this, how much of the grid a subplot should span.
The following example shows the trivial or simplest case, i.e. a 1x1 grid:
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

fig = plt.figure()
gs = GridSpec(1, 1)
ax = fig.add_subplot(gs[0, 0])
plt.show()
We could have used some of the parameters of GridSpec, e.g. we can define that our graph should begin at 20 % from the bottom and 15 % from the left side of the available figure area:
fig = plt.figure()
gs = GridSpec(1, 1,
bottom=0.2,
left=0.15,
top=0.8)
ax = fig.add_subplot(gs[0,0])
plt.show()
The next example shows a more complex case with a more elaborate grid design:
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
G = gridspec.GridSpec(3, 3)
axes_1 = plt.subplot(G[0, :])    # reconstructed; the subplot
axes_2 = plt.subplot(G[1, :-1])  # definitions were lost in extraction
axes_3 = plt.subplot(G[1:, -1])
axes_4 = plt.subplot(G[-1, 0])
axes_5 = plt.subplot(G[-1, -2])
plt.tight_layout()
plt.show()
We will use now the grid specification from the previous example to populate it with the graphs of some
functions:
plt.figure(figsize=(6, 4))
G = gridspec.GridSpec(3, 3)
# the other axes of the previous design would be created and populated analogously:
axes_5 = plt.subplot(G[-1, -2])
axes_5.plot([1, 2, 3, 4], [7, 5, 4, 3.8])
plt.tight_layout()
plt.show()
fig = plt.figure()
X = [1, 2, 3, 4, 5, 6, 7]
Y = [1, 3, 4, 2, 5, 8, 6]
# assumed axes creation (left, bottom, width, height in figure fractions);
# the original add_axes calls were lost in extraction:
axes1 = fig.add_axes([0.1, 0.1, 0.8, 0.8])   # main axes
axes2 = fig.add_axes([0.2, 0.5, 0.4, 0.3])   # inset axes
# main figure
axes1.plot(X, Y, 'r')
axes1.set_xlabel('x')
axes1.set_ylabel('y')
axes1.set_title('title')
# insert
axes2.plot(Y, X, 'g')
axes2.set_xlabel('y')
axes2.set_ylabel('x')
axes2.set_title('title inside');
SETTING THE PLOT RANGE

It's possible to configure the ranges of the axes. This can be done by using the set_ylim and set_xlim methods on the axes object. With axis('tight') we create automatically "tightly fitted" axes ranges:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 5, 0.25)
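A minimal sketch of the three variants (default ranges, tight axes, and custom limits); the plotted power functions are arbitrary stand-ins:

import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 5, 0.25)
fig, axes = plt.subplots(1, 3, figsize=(10, 3))
axes[0].plot(x, x**2, x, x**3)
axes[0].set_title("default axes ranges")
axes[1].plot(x, x**2, x, x**3)
axes[1].axis('tight')
axes[1].set_title("tight axes")
axes[2].plot(x, x**2, x, x**3)
axes[2].set_ylim([0, 60])
axes[2].set_xlim([2, 5])
axes[2].set_title("custom axes range")
plt.show()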
LOGARITHMIC SCALE

It is also possible to set a logarithmic scale for one or both axes. This functionality is in fact only one application of a more general transformation system in Matplotlib. Each of the axes' scales is set separately using the set_xscale and set_yscale methods, which accept one parameter (with the value "log" in this case):
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
x = np.arange(0, 5, 0.25)
ax.set_yscale("log")
plt.show()
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(1, 7, 0.1)
fig, ax1 = plt.subplots()   # assumed creation, lost in extraction
ax1.plot(x, 2 * np.pi * x, lw=2, color="blue")
ax1.set_ylabel(r"Circumference $(cm)$", fontsize=16, color="blue")
for label in ax1.get_yticklabels():
label.set_color("blue")
ax2 = ax1.twinx()
ax2.plot(x, np.pi * x ** 2, lw=2, color="darkgreen")
ax2.set_ylabel(r"area $(cm^2)$", fontsize=16, color="darkgreen")
for label in ax2.get_yticklabels():
label.set_color("darkgreen")
The following topics are not directly related to subplotting, but we want to present them to round up the
introduction into the basic possibilities of matplotlib. The first one shows how to define grid lines and the
second one is quite important. It is about saving plots in image files.
GRID LINES
import numpy as np
import matplotlib.pyplot as plt
def f(t):
return np.exp(-t) * np.cos(2*np.pi*t)
def g(t):
return np.sin(t) * np.cos(1/(t+0.1))
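The actual grid-line demonstration did not survive the layout; a minimal sketch using the function f from above:

import numpy as np
import matplotlib.pyplot as plt

def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t = np.arange(-1.0, 5.0, 0.05)
fig, ax = plt.subplots()
ax.plot(t, f(t))
ax.grid(color='grey', linestyle='dashed', linewidth=0.5)   # customized grid lines
plt.show()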
SAVING FIGURES
The savefig method can be used to save figures to a file:
fig.savefig("filename.png")
It is possible to optionally specify the DPI and to choose between different output formats:
fig.savefig("filename.png", dpi=200)
Output can be generated in the formats PNG, JPG, EPS, SVG, PGF and PDF.
MATPLOTLIB TUTORIAL: GRIDSPEC
We create a figure with four axes in the following code. We have covered this in our previous chapter.
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
fig = plt.figure(constrained_layout=True)
spec = gridspec.GridSpec(ncols=2, nrows=2, figure=fig)
ax1 = fig.add_subplot(spec[0, 0])
ax2 = fig.add_subplot(spec[0, 1])
ax3 = fig.add_subplot(spec[1, 0])
ax4 = fig.add_subplot(spec[1, 1])
The importance and power of GridSpec unfolds when we create subplots that span rows and columns.
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
fig = plt.figure(constrained_layout=True)
gs = fig.add_gridspec(3, 3)
ax1 = fig.add_subplot(gs[0, :])
ax1.set_title('gs[0, :]')
ax2 = fig.add_subplot(gs[1, :-1])
ax2.set_title('gs[1, :-1]')
ax3 = fig.add_subplot(gs[1:, -1])
ax3.set_title('gs[1:, -1]')
ax4 = fig.add_subplot(gs[-1, 0])
ax4.set_title('gs[-1, 0]')
ax5 = fig.add_subplot(gs[-1, -2])
ax5.set_title('gs[-1, -2]')
The module matplotlib.gridspec is also indispensable for creating subplots of different widths via a couple of methods.
The method shown here is similar to the one above and initializes a uniform grid specification, and then uses
numpy indexing and slices to allocate multiple "cells" for a given subplot.
fig = plt.figure(constrained_layout=True)
spec = fig.add_gridspec(ncols=2, nrows=2)
optional_params = dict(xy=(0.5, 0.5),
xycoords='axes fraction',
va='center',
ha='center')
ax = fig.add_subplot(spec[0, 0])
ax.annotate('GridSpec[0, 0]', **optional_params)
fig.add_subplot(spec[0, 1]).annotate('GridSpec[0, 1:]', **optional_params)
fig.add_subplot(spec[1, 0]).annotate('GridSpec[1:, 0]', **optional_params)
fig.add_subplot(spec[1, 1]).annotate('GridSpec[1:, 1:]', **optional_params)
Another option is to use the width_ratios and height_ratios parameters. These keyword
arguments are lists of numbers. Note that absolute values are meaningless, only their relative ratios matter.
That means that width_ratios=[2, 4, 8] is equivalent to width_ratios=[1, 2, 4] within
equally wide figures. For the sake of demonstration, we'll blindly create the axes within for loops since we
won't need them later.
fig5 = plt.figure(constrained_layout=True)
widths = [2, 3, 1.5]
heights = [1, 3, 2]
spec5 = fig5.add_gridspec(ncols=3, nrows=3, width_ratios=widths,
height_ratios=heights)
for row in range(3):
    for col in range(3):
        ax = fig5.add_subplot(spec5[row, col])
        label = 'Width: {}\nHeight: {}'.format(widths[col], heights[row])
        ax.annotate(label, (0.1, 0.5), xycoords='axes fraction', va='center')
fig10 = plt.figure(constrained_layout=True)
gs0 = fig10.add_gridspec(1, 2)
gs00 = gs0[0].subgridspec(2, 3)
gs01 = gs0[1].subgridspec(3, 2)
for a in range(2):
    for b in range(3):
        fig10.add_subplot(gs00[a, b])
        fig10.add_subplot(gs01[b, a])
The PSD is a common plot in the field of signal processing. NumPy has many useful libraries for computing a
PSD. Below we demo a few examples of how this can be accomplished and visualized with Matplotlib.
import numpy as np
import matplotlib.pyplot as plt

dt = 0.01
t = np.arange(0, 10, dt)
nse = np.random.randn(len(t))
r = np.exp(-t / 0.05)
cnse = np.convolve(nse, r) * dt
cnse = cnse[:len(t)]
s = 0.1 * np.sin(2 * np.pi * t) + cnse
plt.subplot(211)
plt.plot(t, s)
plt.subplot(212)
plt.psd(s, 512, 1 / dt)
plt.show()
This type of flexible grid alignment has a wide range of uses. I most often use it when creating multi-axes
histogram plots like the ones shown here:
However, we are primarily interested in how to create charts and histograms in this chapter. A splendid way to
create such charts consists in using Python in combination with Matplotlib.
What is a histogram? A formal definition can be: It's a graphical representation of a frequency distribution of
some numerical data. Rectangles with equal width have heights with the associated frequencies.
If we construct a histogram, we start with distributing the range of possible x values into usually equal sized
and adjacent intervals or bins.
We start now with a practical Python program. We create a histogram with 10,000 random numbers:

import numpy as np
import matplotlib.pyplot as plt
gaussian_numbers = np.random.normal(size=10000)   # assumed definition; the text below refers to 10,000 values
plt.hist(gaussian_numbers)
plt.title("Gaussian Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
Let's take a closer look at the return values of hist. To create the histogram, the values of gaussian_numbers are divided into equal intervals, i.e. the "bins". The interval limits calculated by hist are obtained in the second component of the return tuple. In our example, they are denoted by the variable bins. n[i] contains the number of values of gaussian_numbers that lie within the interval with the boundaries bins[i] and bins[i + 1], so n is an array of frequencies. The last return value of hist is a list of patches, which correspond to the rectangles with their properties.
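A sketch of capturing the three return values (variable names as in the text):

n, bins, patches = plt.hist(gaussian_numbers, bins=10)
print("n: ", n)        # the frequencies per bin
print("bins: ", bins)  # the 11 interval limits for 10 bins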
Let's increase the number of bins. 10 bins are not a lot if you imagine that we have 10,000 random values. To do so, we set the keyword parameter bins to 100:
plt.hist(gaussian_numbers, bins=100)
plt.show()

By setting the parameter orientation to "horizontal", we can draw the histogram horizontally:
plt.hist(gaussian_numbers,
bins=100,
orientation="horizontal")
plt.show()
Another important keyword parameter of hist is density, which replaces the deprecated normed parameter. If set to True, the first component of the return tuple - that is, the frequencies - is normalized to form a probability density, i.e. the area (or the integral) under the histogram sums to 1.
If both the parameters 'density' and 'stacked' are set to 'True', the sum of the histograms is normalized to 1.
With the parameters edgecolor and color we can define the line color and the color of the surfaces:
plt.hist(gaussian_numbers,
bins=100,
density=True,
stacked=True,
edgecolor="#6A9662",
color="#DDFFDD")
plt.show()
Okay, you want to see the data depicted as a plot of cumulative values? We can plot it as a cumulative histogram:
plt.hist(gaussian_numbers,
bins=100,
stacked=True,
cumulative=True)
plt.show()
BAR PLOTS
Now we come to one of the most commonly used chart types, well known even among non-scientists. A bar
chart is composed of rectangles that are perpendicular to the x-axis and that rise up like columns. The width of
the rectangles has no mathematical meaning.
# 'formatter', 'GDP' and 'years' were defined before this snippet; their
# definitions were lost in extraction (formatter is e.g. a FuncFormatter):
fig, ax = plt.subplots()
ax.yaxis.set_major_formatter(formatter)
ax.bar(x=np.arange(len(GDP)), # The x coordinates of the bars.
       height=GDP, # the height(s) of the bars
color="green",
align="center",
tick_label=years)
ax.set_ylabel('GDP in $')
ax.set_title('Largest Economies by nominal GDP in 2018')
plt.show()
land_GDP_per_capita = []
with open('data/GDP.txt') as fh:
    for line in fh:
        index, *land, gdp, population = line.split()
        land = " ".join(land)
        gdp = int(gdp.replace(',', ''))
        population = int(population.replace(',', ''))
        per_capita = int(round(gdp * 1000000 / population, 0))
        land_GDP_per_capita.append((land, per_capita))
# hypothetical data; the original definitions of names and the cup counts
# were lost in extraction:
names = ['Mary', 'Paul', 'Billy', 'Franka']
last_week_cups = [23, 35, 11, 12]
this_week_cups = [18, 31, 17, 16]
fig, ax = plt.subplots()
width = 0.35
ticks = np.arange(len(names))
ax.bar(ticks, last_week_cups, width, label='Last week')
ax.bar(ticks + width, this_week_cups, width, align="center",
       label='This week')
ax.set_ylabel('Cups of Coffee')
ax.set_title('Coffee Consumption')
ax.set_xticks(ticks + width/2)
ax.set_xticklabels(names)
ax.legend(loc='best')
plt.show()
election_results_per_year = {}
with open('data/german_election_results.txt') as fh:
    fh.readline()
    for line in fh:
        year, *results = line.rsplit()
        election_results_per_year[year] = [float(x) for x in results]
election_results_per_party = list(zip(*election_results_per_year.values()))
years = list(election_results_per_year.keys())
width = 0.9 / len(parties)   # 'parties' and the per-party ax.bar calls were lost in extraction
ticks = np.arange(len(years))
ax.set_ylabel('Percentages of Votes')
ax.set_title('German Elections')
ax.set_xticks(ticks + 0.45)
ax.set_xticklabels(years)
ax.legend(loc='best')
plt.show()
We also change the creation of the bars by assigning a color value to the parameter 'color':
election_results_per_year = {}
with open('data/german_election_results.txt') as fh:
    fh.readline()
    for line in fh:
        year, *results = line.rsplit()
        election_results_per_year[year] = [float(x) for x in results]
election_results_per_party = list(zip(*election_results_per_year.values()))
years = list(election_results_per_year.keys())
width = 0.9 / len(parties)
ticks = np.arange(len(years))
ax.set_ylabel('Percentages of Votes')
ax.set_title('German Elections')
ax.set_xticks(ticks + 0.45)
ax.set_xticklabels(years)
ax.legend(loc='best')
plt.show()
The stacked bar chart stacks bars that represent different groups on top of each other. The height of the
resulting bar shows the combined result or summation of the individual groups.
Stacked bar charts are great to depict the total and at the same time providing a view of how the single parts
are related to the sum.
Stacked bar charts are not suited for datasets where some groups have negative values. In such cases, grouped
bar charts are the better choice.
# hypothetical data; the original definitions were lost in extraction:
names = ['Mary', 'Paul', 'Billy', 'Franka']
water = np.array([10, 12, 9, 8])
coffee = np.array([23, 35, 11, 12])
tea = np.array([6, 4, 2, 5])
fig, ax = plt.subplots()
width = 0.35
ticks = np.arange(len(names))
ax.bar(ticks, tea, width, label='Tea', bottom=water+coffee)
ax.bar(ticks, coffee, width, align="center", label='Coffee',
       bottom=water)
ax.bar(ticks, water, width, align="center", label='Water')
CONTOUR PLOT
CREATING A "MESHGRID"
n, m = 7, 7
start = -3
# assumed np.meshgrid creation; the original lines were lost:
X, Y = np.meshgrid(np.arange(start, start + n),
                   np.arange(start, start + m))
print(X)
print(Y)
We can visualize our meshgrid if we add the following code to our previous program:
fig, ax = plt.subplots()
ax.scatter(X, Y, color="green")
ax.set_title('Regular Grid, created by Meshgrid')
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
import numpy as np
Z = np.sqrt(X**2 + Y**2)
print(Z)
[[4.24264069 3. 4.24264069]
[3.16227766 1. 3.16227766]
[3.16227766 1. 3.16227766]
[4.24264069 3. 4.24264069]]
fig = plt.figure(figsize=(6,5))
left, bottom, width, height = 0.1, 0.1, 0.8, 0.8
ax = fig.add_axes([left, bottom, width, height])
Z = np.sqrt(X**2 + Y**2)
cp = ax.contour(X, Y, Z)
ax.clabel(cp, inline=True,
fontsize=10)
ax.set_title('Contour Plot')
ax.set_xlabel('x (cm)')
ax.set_ylabel('y (cm)')
plt.show()
FILLED CONTOURS
fig = plt.figure(figsize=(6,5))
left, bottom, width, height = 0.1, 0.1, 0.8, 0.8
ax = fig.add_axes([left, bottom, width, height])
cp = plt.contourf(X, Y, Z)
plt.colorbar(cp)
ax.set_title('Contour Plot')
ax.set_xlabel('x (cm)')
ax.set_ylabel('y (cm)')
plt.show()
INDIVIDUAL COLOURS
import numpy as np
import matplotlib.pyplot as plt
plt.figure()
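The example itself was lost; a minimal sketch of individually coloured contour lines via the colors parameter (data and colour values are assumptions):

import numpy as np
import matplotlib.pyplot as plt
X, Y = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
Z = np.sqrt(X**2 + Y**2)
plt.figure()
cp = plt.contour(X, Y, Z, colors=['red', 'green', 'blue', 'orange'])
plt.clabel(cp, inline=True, fontsize=8)
plt.show()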
LEVELS
The levels were decided automatically by contour and contourf so far. They can be defined manually, by
providing a list of levels as a fourth parameter. Contour lines will be drawn for each value in the list, if we use
contour. For contourf, there will be filled colored regions between the values in the list.
import numpy as np
import matplotlib.pyplot as plt
Z = np.sqrt(X ** 2 + Y ** 2 )
plt.figure()
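A sketch of manually chosen levels with contourf (the level values are arbitrary):

import numpy as np
import matplotlib.pyplot as plt
X, Y = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
Z = np.sqrt(X ** 2 + Y ** 2)
plt.figure()
levels = [0.0, 1.0, 2.0, 3.0, 4.0]
cp = plt.contourf(X, Y, Z, levels)
plt.colorbar(cp)
plt.show()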
import matplotlib.pyplot as plt
import numpy as np
y, x = np.ogrid[-1:2:100j, -1:1:100j]
plt.contour(x.ravel(),
y.ravel(),
x**2 + (y-((x**2)**(1.0/3)))**2,
[1],
colors='red',)
plt.axis('equal')
plt.show()
INTRODUCTION
It has never been easier to take a picture than it is
today. All you normally need is a cell phone. These
are the essentials to take and view a picture. Taking
photos is free if we don't include the cost of the
mobile phone, which is often bought for other
purposes anyway. A generation ago, amateur and real
artists needed specialized and often expensive
equipment, and the cost per image was far from free.
In order to provide you with the necessary knowledge, this chapter of our Python tutorial deals with basic
image processing and manipulation. For this purpose we use the modules NumPy, Matplotlib and SciPy.
We start with the scipy package misc . The helpfile says that scipy.misc contains "various utilities that
don't have another home". For example, it also contains a few images, such as the following:
from scipy import misc
import matplotlib.pyplot as plt
ascent = misc.ascent()
plt.gray()
plt.imshow(ascent)
plt.show()
from scipy import misc
import matplotlib.pyplot as plt
ascent = misc.ascent()
plt.axis("off") # removes the axis and the ticks
plt.gray()
plt.imshow(ascent)
plt.show()
ascent.shape
Output: (512, 512)
import scipy.misc
face = scipy.misc.face()
print(face.shape)
print(face.max())
print(face.dtype)
plt.axis("off")
plt.gray()
plt.imshow(face)
plt.show()
(768, 1024, 3)
255
uint8
print(img[:3])
[[[0.4117647 0.5686275 0.8 ]
[0.40392157 0.56078434 0.7921569 ]
[0.40392157 0.5686275 0.79607844]
...
[0.48235294 0.62352943 0.81960785]
[0.47843137 0.627451 0.81960785]
[0.47843137 0.62352943 0.827451 ]]
plt.axis("off")
imgplot = plt.imshow(img)
lum_img = img[:,:,1]
plt.axis("off")
imgplot = plt.imshow(lum_img)
windmills = plt.imread('windmills.png')
We want to tint the image now. This means we will "mix" our colours with white. This will increase the
lightness of our image. For this purpose, we write a Python function, which takes an image and a percentage
value as a parameter. Setting 'percentage' to 0 will not change the image, setting it to one means that the image
will be completely whitened:
import numpy as np
import matplotlib.pyplot as plt
windmills = plt.imread('windmills.png')
horizontal_brush = vertical_gradient_line(windmills)
tinted_windmills = windmills * horizontal_brush
plt.axis("off")
plt.imshow(tinted_windmills)
We will tint the image now from right to left by setting the reverse parameter of our Python function to "True":
vertical_brush = horizontal_gradient_line(windmills)
tinted_windmills = windmills * vertical_brush   # the brush factor was lost in extraction
plt.imshow(tinted_windmills)
A tone is produced either by the mixture of a color with gray, or by both tinting and shading.
charlie = plt.imread('Chaplin.png')
plt.gray()
print(charlie)
plt.imshow(charlie)
plt.imshow(colored)
We will use different colormaps in the following example. The colormaps can be found in
matplotlib.pyplot.cm.datad:
plt.cm.datad.keys()
import numpy as np
import matplotlib.pyplot as plt
charlie = plt.imread('Chaplin.png')
# colormaps plt.cm.datad
# cmaps = set(plt.cm.datad.keys())
cmaps = {'afmhot', 'autumn', 'bone', 'binary', 'bwr', 'brg',
         'CMRmap', 'cool', 'copper', 'cubehelix', 'Greens'}
# assumed loop; the original subplot creation was lost in extraction:
fig = plt.figure(figsize=(6, 5))
for i in range(10):
    sub = fig.add_subplot(2, 5, i + 1)
    sub.imshow(charlie*0.0002, cmap=cmaps.pop())
    sub.set_yticks([])
    sub.set_xticks([])
plt.show()
INTRODUCTION
As you may have noticed, each of the pages in our various tutorials is introduced by an eye-candy picture, which has been created with great care to enrich the content. One of those
images has been the raison d'être of this chapter. We want to
demonstrate how we created the picture for our chapter on
Decorators. The idea was to play with decorators in "real
life", small icons with images of small workers painting a
room and on the other hand blending this with the "at" sign,
the Python symbol for decorator. It is also a good example
of how to create a watermark.
Finally, we will use the original image, the shaded image, plus an image with a binary at sign with the
conditional numpy where function to create the final image. The final image contains the at sign as a
watermark, cut out from the shaded image.
TILING AN IMAGE
The function imag_tile, which we are going to design, can be best explained with the following diagram:
imag_tile(img, n, m)

creates a tiled image by appending an image "img" n times in horizontal direction. After this, we append the strip consisting of n img images m times in vertical direction.
In the following code, we use a picture of painting decorators as the tile image:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def imag_tile(img, n, m=1):
    """ tile img n times horizontally and m times vertically
        (signature reconstructed from the calls below) """
    if n == 1:
        tiled_img = img
    else:
        lst_imgs = []
        for i in range(n):
            lst_imgs.append(img)
        tiled_img = np.concatenate(lst_imgs, axis=1)
    if m > 1:
        lst_imgs = []
        for i in range(m):
            lst_imgs.append(tiled_img)
        tiled_img = np.concatenate(lst_imgs, axis=0)
    return tiled_img
basic_pattern = mpimg.imread('decorators_b2.png')
decorators_img = imag_tile(basic_pattern, 3, 3)
plt.axis("off")
plt.imshow(decorators_img)
type(basic_pattern)
Output: numpy.ndarray
The first three rows of our image basic_pattern look like this:
basic_pattern[:3]
The innermost lists of our image contain the pixels. We have three values corresponding to the R, G, and B values; this means that we have a 24-bit RGB PNG image, eight bits for each of R, G, B.

PNG images might also be 32-bit images (RGBA), where the fourth value "A" is used for transparency, or single-channel grayscale images.

It's easy to access individual pixels by indexing, e.g. the pixel in row 100 and column 28:
basic_pattern[100, 28]
Output: array([0.9019608, 0.8901961, 0.8627451], dtype=float32)
As we have seen, the pixels are float (float32) values between 0 and 1. Matplotlib plotting can handle both
float32 and uint8 for PNG images. For all other formats it will be only uint8.
We can use the slicing function to crop parts of an image. We will use this to make sure that both images have
the same size.
at_img = mpimg.imread('at_sign.png')
FIRST EXAMPLE
We have everything together now to create the blended image. Our at sign picture consists of black and white
pixels. The blended image is constructed like this: Let p=(n, m) be an arbitrary pixel in the n-th row and m-th
column of the image at_image. If the value of this pixel is not black or dark gray, we will use the pixel at
position (n, m) from the picture decorators_img, otherwise, we will use the corresponding pixel from
tinted_decorator_img. The where function of numpy is ideal for this task:
print(at_img.shape,
decorators_img.shape,
tinted_decorator_img.shape)
#basic_pattern = mpimg.imread('decorators2.png')
img2 = np.where(at_img > [0.1, 0.1, 0.1],
decorators_img,
tinted_decorator_img)
plt.axis("off")
plt.imshow(img2)
(1077, 771, 3) (1077, 771, 3) (1077, 771, 3)
Output: <matplotlib.image.AxesImage at 0x7f0c18ce3a10>
mpimg.imsave('decorators_with_at.png', img2)
We want to use now a different image as a "watermark". Instead of the at sign, we want to use now a director's
chair. We will create the image from the top of this page.
plt.axis("off")
plt.imshow(sea)
Output: <matplotlib.image.AxesImage at 0x7f0c18cc4450>
plt.axis("off")
plt.imshow(director_chair)
In the following, we blend together the images director_chair, decorators_img and sea by using where of
numpy once more:
#sea2 = mpimg.imread('the_sea2.png')
img = np.where(director_chair > [0.9, 0.9, 0.9],
decorators_img,
sea)
plt.axis("off")
plt.imshow(img)
mpimg.imsave('decorators_with_chair.png', img)
We could have used "Image.open" from PIL instead of mpimg.imread from matplotlib to read in the pictures.
There is a crucial difference or a potential "problem" between these two ways: The image we get from imread
has values between 0 and 1, whereas Image.open consists of values between 0 and 255. So we might have to
divide all the pixels by 255, if we have to work with an image read in by mpimg.imread:
img = Image.open("director_chair.jpg")
img = img.resize((at_img.shape[1], at_img.shape[0]))
img = np.asarray(img)
plt.axis("off")
plt.imshow(img)
print(img[100, 129])
[27 27 27]
img = img / 255      # scale to the 0..1 range of mpimg.imread (27/255 = 0.10588...)
print(img[100, 129])
[0.10588235 0.10588235 0.10588235]
INTRODUCTION INTO PANDAS
Pandas is a software library written for the Python programming language. It is used for data manipulation and
analysis. It provides special data structures and operations for the manipulation of numerical tables and time
series. Pandas is free software released under the three-clause BSD license.
We will start with the following two important data structures of Pandas:
• Series and
• DataFrame
SERIES
A Series is a one-dimensional labelled array-like object. It is capable of holding any data type, e.g. integers,
floats, strings, Python objects, and so on. It can be seen as a data structure with two arrays: one functioning as
the index, i.e. the labels, and the other one contains the actual data.
We define a simple Series object in the following example by instantiating a Pandas Series object with a list.
We will later see that we can use other data objects for example Numpy arrays and dictionaries as well to
instantiate a Series object.
import pandas as pd
S = pd.Series([11, 28, 72, 3, 5, 8])
S
Output: 0 11
1 28
2 72
3 3
4 5
5 8
dtype: int64
We haven't defined an index in our example, but we see two columns in our output: The right column contains
our data, whereas the left column contains the index. Pandas created a default index starting with 0 going to 5,
which is the length of the data minus 1.
We can directly access the index and the values of our Series S:
print(S.index)
print(S.values)
RangeIndex(start=0, stop=6, step=1)
[11 28 72 3 5 8]
So far our Series has not been very different from an ndarray of NumPy. This changes as soon as we start defining Series objects with individual indices:
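The defining example was lost in extraction; the following reconstruction is consistent with all the outputs of the subsequent examples:

import pandas as pd
fruits = ['apples', 'oranges', 'cherries', 'pears']
quantities = [20, 33, 52, 10]
S = pd.Series(quantities, index=fruits)
print(S)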
A big advantage over NumPy arrays is obvious from the previous example: we can use arbitrary indices.

If we add two series with the same indices, we get a new series with the same index and the corresponding values will be added:

The indices do not have to be the same for the Series addition. The index of the result will be the "union" of both indices. If an index occurs in only one of the two Series, the resulting value for it will be NaN.
In principle, the indices can be completely different, as in the following example. We have two indices. One is
the Turkish translation of the English fruit names:
INDEXING
print(S['apples'])
20
However, Series objects can also be accessed by multiple indexes at the same time. This can be done by
packing the indexes into a list. This type of access returns a Pandas Series again:
import numpy as np
print((S + 3) * 4)
print("======================")
print(np.sin(S))
apples 92
oranges 144
cherries 220
pears 52
dtype: int64
======================
apples 0.912945
oranges 0.999912
cherries 0.986628
pears -0.544021
dtype: float64
PANDAS.SERIES.APPLY

apply(func, convert_dtype=True, args=())

The function "func" will be applied to the Series and it returns either a Series or a DataFrame, depending on "func".

Parameter       Meaning
func            a function, which can be a NumPy function that will be applied to the entire Series, or a Python function that will be applied to every single value of the series
convert_dtype   a boolean value. If it is set to True (default), apply will try to find a better dtype for elementwise function results. If False, leave as dtype=object
args            positional arguments which will be passed to the function "func" additionally to the values from the series
Example:
S.apply(np.log)
Output: apples 2.995732
oranges 3.496508
cherries 3.951244
pears 2.302585
dtype: float64
We can also use Python lambda functions. Let's assume we have the following task: we test the amount of fruit of every kind, and if there are less than 50 available, we will augment the stock by 10:
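A sketch of this with a lambda function:

S.apply(lambda x: x if x > 50 else x + 10)
# with the values above: apples 30, oranges 43, cherries 52, pears 20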
Similar to numpy arrays, we can filter Pandas Series with a Boolean array:
S[S>30]
Output: oranges 33
cherries 52
dtype: int64
"apples" in S
We can even use a dictionary to create a Series object. The resulting Series contains the dict's keys as the indices and the dict's values as the values.
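The dictionary itself was lost in extraction; it can be reconstructed from the output below:

cities = {"London": 8615246, "Berlin": 3562166, "Madrid": 3165235,
          "Rome": 2874038, "Paris": 2273305, "Vienna": 1805681,
          "Bucharest": 1803425, "Hamburg": 1760433, "Budapest": 1754000,
          "Warsaw": 1740119, "Barcelona": 1602386, "Munich": 1493900,
          "Milan": 1350680}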
city_series = pd.Series(cities)
print(city_series)
London 8615246
Berlin 3562166
Madrid 3165235
Rome 2874038
Paris 2273305
Vienna 1805681
Bucharest 1803425
Hamburg 1760433
Budapest 1754000
Warsaw 1740119
Barcelona 1602386
Munich 1493900
Milan 1350680
dtype: int64
my_cities = ["London", "Paris", "Zurich", "Berlin",
             "Stuttgart", "Hamburg"]   # reconstructed from the output below
my_city_series = pd.Series(cities,
                           index=my_cities)
my_city_series
Output: London 8615246.0
Paris 2273305.0
Zurich NaN
Berlin 3562166.0
Stuttgart NaN
Hamburg 1760433.0
dtype: float64
Due to the NaN values, the population values of the other cities are turned into floats. There is no missing data in the following example, so the values are int:
my_cities = ["London", "Paris", "Berlin", "Hamburg"]   # assumed: only cities contained in the dict
my_city_series = pd.Series(cities,
                           index=my_cities)
my_city_series
We can see, that the cities, which are not included in the dictionary, get the value NaN assigned. NaN stands
for "not a number". It can also be seen as meaning "missing" in our example.
We can check for missing values with the methods isnull and notnull:
my_city_series = pd.Series(cities,
index=my_cities)
print(my_city_series.isnull())
London False
Paris False
Zurich True
Berlin False
Stuttgart True
Hamburg False
dtype: bool
print(my_city_series.notnull())
London True
Paris True
Zurich False
Berlin True
Stuttgart False
Hamburg True
dtype: bool
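For the following two outputs, S has to be a Series with the index 'a' to 'd' and one missing value; the actual values here are hypothetical:

S = pd.Series([20, 33, np.nan, 10], index=['a', 'b', 'c', 'd'])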
pd.isnull(S)
Output: a False
b False
c True
d False
dtype: bool
pd.notnull(S)
Output: a True
b True
c False
d True
dtype: bool
It's possible to filter out missing data with the Series method dropna. It returns a Series which consists only of
non-null data:
print(my_city_series.dropna())
London 8615246.0
Paris 2273305.0
Berlin 3562166.0
Hamburg 1760433.0
dtype: float64
In many cases you don't want to filter out missing data, but you want to fill in appropriate data for the empty
gaps. A suitable method in many situations will be fillna:
print(my_city_series.fillna(0))
Okay, that's not what we call "fill in appropriate data for the empty gaps". If we call fillna with a dict, we can
provide the appropriate data, i.e. the population of Zurich and Stuttgart:
We still have the problem that integer values - which means values which should be integers, like numbers of people - are converted to float as soon as we have NaN values. We can solve this problem now with the method 'fillna':

my_city_series = pd.Series(cities,
                           index=my_cities)
# assumed continuation: fill the missing values, then convert back to int:
my_city_series = my_city_series.fillna(0).astype(int)
my_city_series
DATAFRAME
The underlying idea of a DataFrame is based on
spreadsheets. We can see the data structure of a DataFrame
as tabular and spreadsheet-like. A DataFrame logically
corresponds to a "sheet" of an Excel document. A
DataFrame has both a row and a column index.
import pandas as pd
This result is not what we have intended or expected. The reason is that concat used 0 as the default for the
axis parameter. Let's do it with "axis=1":
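The creation of the three shop Series was lost in extraction; the following definitions reproduce the output below:

years = range(2014, 2018)
shop1 = pd.Series([2409.14, 2941.01, 3496.83, 3119.55], index=years, name='Zürich')
shop2 = pd.Series([1203.45, 3441.62, 3007.83, 3619.53], index=years, name='Winterthur')
shop3 = pd.Series([3412.12, 3491.16, 3457.19, 1963.10], index=years, name='Freiburg')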
print("------")
shops_df2 = pd.concat([shop1, shop2, shop3], axis=1)
print(shops_df2)
Zürich Winterthur Freiburg
2014 2409.14 1203.45 3412.12
2015 2941.01 3441.62 3491.16
2016 3496.83 3007.83 3457.19
2017 3119.55 3619.53 1963.10
------
Zürich Winterthur Freiburg
2014 2409.14 1203.45 3412.12
2015 2941.01 3441.62 3491.16
2016 3496.83 3007.83 3457.19
2017 3119.55 3619.53 1963.10
This was nice, but what kind of data type is our result?
print(type(shops_df))
<class 'pandas.core.frame.DataFrame'>
A DataFrame has a row and column index; it's like a dict of Series with a common index.
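For the following DataFrame examples, the dictionary cities is redefined (a reconstruction, consistent with the outputs below):

cities = {"name": ["London", "Berlin", "Madrid", "Rome", "Paris",
                   "Vienna", "Bucharest", "Hamburg", "Budapest",
                   "Warsaw", "Barcelona", "Munich", "Milan"],
          "population": [8615246, 3562166, 3165235, 2874038, 2273305,
                         1805681, 1803425, 1760433, 1754000, 1740119,
                         1602386, 1493900, 1350680],
          "country": ["England", "Germany", "Spain", "Italy", "France",
                      "Austria", "Romania", "Germany", "Hungary",
                      "Poland", "Spain", "Germany", "Italy"]}
city_frame = pd.DataFrame(cities)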
city_frame.columns.values
Output: array(['name', 'population', 'country'], dtype=object)
We can see that an index (0,1,2, ...) has been automatically assigned to the DataFrame. We can also assign a
custom index to the DataFrame object:
We can also define and rearrange the order of the columns at the time of creation of the DataFrame. This also makes sure that we will have a defined ordering of our columns if we create the DataFrame from a dictionary. Dictionaries are not ordered, as you have seen in our chapter on Dictionaries in our Python tutorial, so we cannot know in advance what the ordering of our columns will be:
city_frame = pd.DataFrame(cities,
columns=["name",
"country",
"population"])
city_frame
Output:
name country population
Now we want to rename our columns. For this purpose, we will use the DataFrame method 'rename'. This method supports two calling conventions:

• (index=index_mapper, columns=columns_mapper, ...)
• (mapper, axis={'index', 'columns'}, ...)

We will rename the columns of our DataFrame into Turkish names in the following example. We set the parameter inplace to True, so that our DataFrame is changed instead of a new one being returned:

city_frame.rename(columns={"name":"Soyadı",
"country":"Ülke",
"population":"Nüfus"},
inplace=True)
city_frame
Output:
Soyadı Ülke Nüfus
We want to create a more useful index in the following example. We will use the country name as the index,
i.e. the list value associated to the key "country" of our cities dictionary:
city_frame
Output:
name population
Alternatively, we can change an existing DataFrame. We can use the method set_index to turn a column into an index. "set_index" does not work in-place; it returns a new data frame with the chosen column as the index:
city_frame = pd.DataFrame(cities)
city_frame2 = city_frame.set_index("country")
print(city_frame2)
We saw in the previous example that the set_index method returns a new DataFrame object and doesn't change
the original DataFrame. If we set the optional parameter "inplace" to True, the DataFrame will be changed in
place, i.e. no new object will be created:
city_frame = pd.DataFrame(cities)
city_frame.set_index("country", inplace=True)
print(city_frame)
name population
country
England London 8615246
Germany Berlin 3562166
Spain Madrid 3165235
Italy Rome 2874038
France Paris 2273305
Austria Vienna 1805681
Romania Bucharest 1803425
Germany Hamburg 1760433
Hungary Budapest 1754000
Poland Warsaw 1740119
Spain Barcelona 1602386
Germany Munich 1493900
Italy Milan 1350680
So far we have accessed DataFrames via the columns. It is often necessary to select certain rows via the index
names. We will demonstrate now, how we can access rows from DataFrames via the locators 'loc' and 'iloc'.
We will not cover 'ix' because it is deprecated and will be removed in the future.
city_frame = pd.DataFrame(cities,
columns=("name", "population"),
index=cities["country"])
print(city_frame.loc["Germany"])
name population
Germany Berlin 3562166
Germany Hamburg 1760433
Germany Munich 1493900
It is also possible to extract several rows simultaneously by choosing more than one index label. To do this, we use a list of indices:
print(city_frame.loc[["Germany", "France"]])
name population
Germany Berlin 3562166
Germany Hamburg 1760433
Germany Munich 1493900
France Paris 2273305
We will also need to select pandas DataFrame rows based on conditions which are applied to column values. We can use the operators '>', '>=', '==', '<=', '<' and '!=' for this purpose. We select all cities with a population of more than two million in the following example:
condition = city_frame.population>2000000
condition
Output: England True
Germany True
Spain True
Italy True
France True
Austria False
Romania False
Germany False
Hungary False
Poland False
Spain False
Germany False
Italy False
Name: population, dtype: bool
print(city_frame.loc[condition])
name population
England London 8615246
Germany Berlin 3562166
Spain Madrid 3165235
Italy Rome 2874038
France Paris 2273305
It is also possible to logically combine more than one condition with & and | :
condition1 = (city_frame.population>1500000)
condition2 = (city_frame['name'].str.contains("m"))
print(city_frame.loc[condition1 & condition2])
name population
Italy Rome 2874038
Germany Hamburg 1760433
We use a logical or | in the following example to see all cities of the Pandas DataFrame, where either the
city name contains the letter 'm' or the population number is greater than three million:
condition1 = (city_frame.population>3000000)
condition2 = (city_frame['name'].str.contains("m"))
print(city_frame.loc[condition1 | condition2])
name population
England London 8615246
Germany Berlin 3562166
Spain Madrid 3165235
Italy Rome 2874038
Germany Hamburg 1760433
The iloc method of a Pandas DataFrame object can be used to select rows and columns by number, i.e. in
the order that they appear in the data frame. iloc allows selections of the rows, as if they were numbered
by integers 0 , 1 , 2 , ....
df = city_frame.iloc[3]
print(df)
To get a DataFrame with selected rows by numbers, we use a list of integers. We can see that we can change
the order of the rows and we are also able to select rows multiple times:
df = city_frame.iloc[[3, 2, 0, 5, 0]]
print(df)
name population
Italy Rome 2874038
Spain Madrid 3165235
England London 8615246
Austria Vienna 1805681
England London 8615246
The DataFrame object of Pandas provides a method to sum both columns and rows. Before we explain the usage of the sum method, we will create a new DataFrame object on which we will apply our examples. We will start by creating an empty DataFrame without columns but with an index. We populate this DataFrame by adding columns with random values:
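A sketch of such a construction (the creation code was lost; the random values will of course differ from the printed ones):

import numpy as np
import pandas as pd
names = ['Zürich', 'Freiburg', 'München', 'Konstanz', 'Saarbrücken']
shops = pd.DataFrame(index=range(2014, 2019))
for name in names:
    shops.insert(loc=len(shops.columns), column=name,
                 value=(np.random.uniform(0.7, 1, (5,)) * 1000).round(2))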
print(shops)
Zürich Freiburg München Konstanz Saarbrücken
2014 872.51 838.22 961.17 934.99 796.42
2015 944.47 943.27 862.66 784.23 770.94
2016 963.22 859.97 818.13 965.38 995.74
2017 866.11 731.42 811.37 955.21 836.36
2018 790.95 837.39 941.92 735.93 837.23
shops.sum()

We can see that it summed up all the columns of our DataFrame. What about calculating the sum of the rows? We can do this by using the axis parameter of sum.
shops.sum(axis=1)
Output: 2014 4226.30
2015 4458.91
2016 4696.46
2017 4186.20
2018 4284.65
dtype: float64
Suppose you only want the sums for the first, the third and the last column and for all the years. Of course, you could also achieve this via the column names, if they are known; both variants are sketched below:
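A sketch of both variants (positional selection with iloc, and selection by the known column names):

shops.iloc[:, [0, 2, -1]].sum()                     # by position
shops[['Zürich', 'München', 'Saarbrücken']].sum()   # by name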
We can use "cumsum" to calculate the cumulative sum over the years:
x = shops.cumsum()
print(x)
Zürich Freiburg München Konstanz Saarbrücken
2014 780.38 908.77 952.88 729.34 854.93
2015 1748.43 1719.42 1759.91 1707.65 1749.80
2016 2738.43 2698.64 2563.32 2640.33 2740.95
2017 3553.88 3424.87 3298.14 3599.63 3691.35
2018 4504.85 4182.09 3999.67 4557.86 4608.05
Using the keyword parameter axis with the value 1, we can build the cumulative sum over the rows:
x = shops.cumsum(axis=1)
print(x)
Zürich Freiburg München Konstanz Saarbrücken
2014 780.38 1689.15 2642.03 3371.37 4226.30
2015 968.05 1778.70 2585.73 3564.04 4458.91
2016 990.00 1969.22 2772.63 3705.31 4696.46
2017 815.45 1541.68 2276.50 3235.80 4186.20
2018 950.97 1708.19 2409.72 3367.95 4284.65
x is a Pandas Series. We can reassign the previously calculated cumulative sums to the population column:

x = city_frame["population"].cumsum()   # assumed; this line was lost in extraction
city_frame["population"] = x
print(city_frame)
Instead of replacing the values of the population column with the cumulative sum, we want to add the cumulative population sum as a new column with the name "cum_population".
city_frame = pd.DataFrame(cities,
columns=["country",
"population",
"cum_population"],
index=cities["name"])
city_frame
We can see that the column "cum_population" is set to NaN, as we haven't provided any data for it.
city_frame["cum_population"] = city_frame["population"].cumsum()
city_frame
We can also include a column name which is not contained in the dictionary, when we create the DataFrame
from the dictionary. In this case, all the values of this column will be set to NaN:
city_frame = pd.DataFrame(cities,
columns=["country",
"area",
"population"],
index=cities["name"])
print(city_frame)
There are two ways to access a column of a DataFrame. The result is in both cases a Series:
# in a dictionary-like way:
print(city_frame["population"])
London 8615246
Berlin 3562166
Madrid 3165235
Rome 2874038
Paris 2273305
Vienna 1805681
Bucharest 1803425
Hamburg 1760433
Budapest 1754000
Warsaw 1740119
Barcelona 1602386
Munich 1493900
Milan 1350680
Name: population, dtype: int64
# as an attribute
print(city_frame.population)
print(type(city_frame.population))
<class 'pandas.core.series.Series'>
p = city_frame.population   # "p" is used below; the assignment was lost in extraction
p
Output: London 8615246
Berlin 3562166
Madrid 3165235
Rome 2874038
Paris 2273305
Vienna 1805681
Bucharest 1803425
Hamburg 1760433
Budapest 1754000
Warsaw 1740119
Barcelona 1602386
Munich 1493900
Milan 1350680
Name: population, dtype: int64
From the previous example, we can see that we have not copied the population column. "p" is a view on the
data of city_frame.
The column area is still not defined. We can set all elements of the column to the same value:
city_frame["area"] = 1572
print(city_frame)
In this case, it will be definitely better to assign the exact area to the cities. The list with the area values needs
to have the same length as the number of rows in our DataFrame.
city_frame["area"] = area
print(city_frame)
country area population
London England 1572.00 8615246
Berlin Germany 891.85 3562166
Madrid Spain 605.77 3165235
Rome Italy 1285.00 2874038
Paris France 105.40 2273305
Vienna Austria 414.60 1805681
Bucharest Romania 228.00 1803425
Hamburg Germany 755.00 1760433
Budapest Hungary 525.20 1754000
Warsaw Poland 517.00 1740119
Barcelona Spain 101.90 1602386
Munich Germany 310.40 1493900
Milan Italy 181.80 1350680
Let's assume, we have only the areas of London, Hamburg and Milan. The areas are in a series with the correct
indices. We can assign this series as well:
city_frame = pd.DataFrame(cities,
columns=["country",
"area",
"population"],
index=cities["name"])
some_areas = pd.Series([1572, 755, 181.8],
                       index=['London', 'Hamburg', 'Milan'])   # assumed values, taken from the table above
city_frame['area'] = some_areas
print(city_frame)
In the previous example we have added the column area at creation time. Quite often it will be necessary to
add or insert columns into existing DataFrames. For this purpose the DataFrame class provides a method
"insert", which allows us to insert a column into a DataFrame at a specified location:
Parameter          Meaning
loc                int
allow_duplicates   If allow_duplicates is False, an Exception will be raised if the column is already contained in the DataFrame.
city_frame = pd.DataFrame(cities,
columns=["country",
"population"],
index=cities["name"])
growth = {"Switzerland": [3.0, 1.8, 1.1, 1.9],
          "Germany": [4.1, 3.6, 0.4, 0.1],
          "Italy": [1.7, 0.6, -2.3, -1.9],
          "Greece": [-5.4, -8.9, -6.6, -3.3]}   # reconstructed from the output below
growth_frame = pd.DataFrame(growth, index=["2010", "2011", "2012", "2013"])
growth_frame
You like to have the years in the columns and the countries in the rows? No problem, you can transpose the
data:
growth_frame.T
Output:
2010 2011 2012 2013
growth_frame = growth_frame.T
growth_frame2 = growth_frame.reindex(["Switzerland",
"Italy",
"Germany",
"Greece"])
print(growth_frame2)
2010 2011 2012 2013
Switzerland 3.0 1.8 1.1 1.9
Italy 1.7 0.6 -2.3 -1.9
Germany 4.1 3.6 0.4 0.1
Greece -5.4 -8.9 -6.6 -3.3
import numpy as np
import pandas as pd
Another interesting question is about the speed of both methods in comparison. We will measure the time behaviour in the following code examples:
replace(self,
to_replace=None,
value=None,
inplace=False,
limit=None,
regex=False,
method='pad')
This method replaces values given in to_replace with value . This differs from updating with .loc
or .iloc , which requires you to specify a location to update with some value. With replace it is possible to
replace values in a Series or DataFrame without knowing where they occur.
We will explain the way of working of the method replace by discussing the different data types of the
parameter to_replace individually:
Data Type   Processing
numeric     numeric values in the DataFrame (or Series) which are equal to to_replace will be replaced with value
str         if the parameter to_replace consists of a string, all the strings of the DataFrame (or Series) which exactly match this string will be replaced by the value of the parameter value
regex       all strings inside of the DataFrame (or Series) which match the regular expression of to_replace will be replaced with value
Let's start with a Series example. We will change one value into another one. Be aware of the fact that
replace by default creates a copy of the object in which all the values are replaced. This means that the
parameter inplace is set to False by default.
If we really want to change the object s is referencing, we should set the inplace parameter to True :
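A sketch of the simple case (the Series values are arbitrary):

s = pd.Series([27, 33, 13, 19])
s.replace(13, 42)          # returns a copy; with inplace=True, s itself changes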
We can also change multiple values into one single value, as you can see in the following example.
s = pd.Series([0, 1, 2, 3, 4])
s.replace([0, 1, 2], 42, inplace=True)
s
Output: 0 42
1 42
2 42
3 3
4 4
dtype: int64
We will show now, how replace can be used on DataFrames. For this purpose we will recreate the example
from the beginning of this chapter of our tutorial. It will not be exactly the same though. More people have
learned Python now, but unfortunately somebody has put in some spelling errors in the word Python. Pete has
become a programmer now:
import pandas as pd
df
Output:
last job language
Both the ambitious Dorothee and the enthusiastic Pete have finished a degree in Computer Science in the meantime. So we have to change our DataFrame accordingly:
df.replace("programmer",
"computer scientist",
inplace=True)
df
to_replace can be a list consisting of str, regex or numeric objects. If to_replace and value are both lists, they must be the same length. If regex=True, all of the strings in both lists will be interpreted as regexes; otherwise, they will match directly. We slightly changed our DataFrame for the next example. The first names build a column now!
Now it is time to do something about the surnames in our DataFrame. All these names sound the same - at
least for German speakers. Let us assume that we just found out that they should all be spelled the same, i.e.
Mayer. Now the time has come for regular expression. The parameter to_replace can also be used with
regular expressions. We will use a regular expression to unify the surnames. We will also fix the misspellings
of Python:
df.replace(to_replace=[r'M[ea][iy]e?r', r'P[iy]th[eo]n'],
value=['Mayer', 'Python'],
regex=True,
inplace=True)
df
Dictionaries can be used to specify different replacement values for different existing values. For example,
{'a': 'b', 'y': 'z'} replaces the value 'a' with 'b' and 'y' with 'z'. To use a dict in this way the
value parameter should be None .
If replace is applied on a DataFrame, a dict can specify that different values should be replaced in
different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column 'a' and the value
'z' in column 'b' and replaces these values with whatever is specified in value . The value parameter
should not be None in this case. You can treat this as a special case of passing two lists except that you are
specifying the column to search in.
If we use nested dictionaries like {'A': {'foo': 'bar'}} , they are interpreted like this: look in
column 'A' for the value 'foo' and replace it with 'bar'. The value parameter has to be None in this case.
0 42 foo green
1 1 bar red
2 2 vloo blue
3 33 blee yellow
4 4 vloo green
When the parameter value is None and the parameter to_replace is a scalar, list or tuple, the
method replace will use the parameter method to decide which replacement to perform. The possible
values for method are pad , ffill , bfill , None . The default value is pad .
First of all, you have to know that pad and ffill are equivalent.
We will explain the way of working with the following Python example code. We will replace the values NN
and LN of the following DataFrame in different ways by using the parameter method .
import pandas as pd
df = pd.DataFrame({
'name':['Ben', 'Kate', 'Agnes', 'Ashleigh', 'Tom'],
'job':['programmer', 'NN', 'NN', 'engineer', 'teacher'],
'language':['Java', 'Python', 'LN', 'LN', 'C']})
The picture above shows how the following call to replace works. In words: every occurrence of NN in the job column will be replaced. The value is defined by the parameter method. ffill means 'forward fill', i.e. we will take the value preceding the first NN as the fill value. This is why we will replace the values by 'programmer':
# method is ffill (forward fill)
df.replace(to_replace='NN',
value=None,
method='ffill')
2 Agnes programmer LN
3 Ashleigh engineer LN
4 Tom teacher C
Instead of using a single value for to_replace we can also use a list or tuple. We will replace in the
following all occurrences of NN and LN accordingly, as we can see in the following picture:
df.replace(to_replace=['NN', 'LN'],
value=None,
method='ffill')
4 Tom teacher C
We will show now what happens, if we use bfill (backward fill). Now the occurrences of 'LN' become 'C'
instead of Python. We also turn the 'NN's into engineers instead of programmers.
2 Agnes engineer C
3 Ashleigh engineer C
4 Tom teacher C
df.replace('NN',
value=None,
inplace=True,
method='bfill')
df.replace('LN',
value=None,
inplace=True,
method='ffill')
df
Output:
name job language
4 Tom teacher C
OTHER EXAMPLES
We will present some more examples, which are taken from the help file of loc :
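The DataFrame of these examples (as in the pandas documentation of loc):

import pandas as pd
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
df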
            max_speed  shield
cobra               1       2
viper 4 5
sidewinder 7 8
df.loc['viper']
Output: max_speed 4
shield 5
Name: viper, dtype: int64
df.loc[['viper', 'sidewinder']]
Output:
max_speed shield
viper 4 5
sidewinder 7 8
df.loc['cobra', 'shield']
Output: 2
Slice with labels for row and single label for column. As mentioned above, note that both the start and stop of
the slice are included.
df.loc['cobra':'viper', 'max_speed']
Output: cobra    1
        viper    4
        Name: max_speed, dtype: int64
df.loc[df['shield'] > 6]
Output:
max_speed shield
sidewinder 7 8
With an additional column label, only that column is returned:

df.loc[df['shield'] > 6, ['max_speed']]
Output:
            max_speed
sidewinder          7

Setting values is also possible; the following sets the shield values of the listed rows to 50:

df.loc[['viper', 'sidewinder'], ['shield']] = 50
df
Output:
            max_speed  shield
cobra               1       2
viper               4      50
sidewinder          7      50
df.loc['cobra'] = 10
df
Output:
max_speed shield
cobra 10 10
viper 4 50
sidewinder 7 50
df.loc[:, 'max_speed'] = 30
df
Output:
max_speed shield
cobra 30 10
viper 30 50
sidewinder 30 50
We can also set the values of all rows matching a condition:

df.loc[df['shield'] > 35] = 0
df
Output:
            max_speed  shield
cobra              30      10
viper               0       0
sidewinder          0       0
'Applying' means that a function is applied to each of the groups independently, e.g. an aggregation like a sum or a mean.
import pandas as pd
import numpy as np
import random
nvalues = 30
# we create random values, which will be used as the Series values:
values = np.random.randint(1, 20, (nvalues,))
fruits = ["bananas", "oranges", "apples", "clementines", "cherries", "pears"]
fruits_index = np.random.choice(fruits, (nvalues,))
s = pd.Series(values, index=fruits_index)
grouped = s.groupby(s.index)
grouped
Output: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fda331c1050>
We can see that we get a SeriesGroupBy object, if we apply groupby on the index of our series
object s . The result of this operation grouped is iterable. In every step we get a tuple object returned,
which consists of an index label and a series object. The series object is s reduced to this label.
grouped = s.groupby(s.index)
We could have got the same result - except for the order - without using groupby, with the following Python code.
import pandas as pd
beverages = pd.DataFrame({'Name': ['Robert', 'Melinda', 'Brenda',
                                   'Samantha', 'Melinda', 'Robert',
                                   'Melinda', 'Brenda', 'Samantha'],
                          'Coffee': [3, 0, 2, 2, 0, 2, 0, 1, 3],
                          'Tea': [0, 4, 2, 0, 3, 0, 3, 2, 0]})
beverages
Output:
Name Coffee Tea
0 Robert 3 0
1 Melinda 0 4
2 Brenda 2 2
3 Samantha 2 0
4 Melinda 0 3
5 Robert 2 0
6 Melinda 0 3
7 Brenda 1 2
8 Samantha 3 0
It's simple, and we've already seen in the previous chapters of our tutorial how to calculate the total number of
coffee cups. The task is to sum a column of a DataFrame, i.e. the 'Coffee' column:
beverages['Coffee'].sum()
Output: 13
beverages[['Coffee', 'Tea']].sum()
Output: Coffee 13
Tea 14
dtype: int64
'groupby' has not been necessary for the previous tasks. Let's have a look at our DataFrame again. We can see
that some of the names appear multiple times. So it will be very interesting to see how many cups of coffee
and tea each person drank in total. That means we are applying 'groupby' to the 'Name' column. Thereby we
split the DataFrame. Then we apply 'sum' to the results of 'groupby':
res = beverages.groupby(['Name']).sum()
print(res)
Coffee Tea
Name
Brenda 3 4
Melinda 0 10
Robert 5 0
Samantha 5 0
We can see that the names are now the index of the resulting DataFrame:
print(res.index)
Index(['Brenda', 'Melinda', 'Robert', 'Samantha'], dtype='objec
t', name='Name')
print(res.columns)
Index(['Coffee', 'Tea'], dtype='object')
We can also calculate the average number of coffee and tea cups the persons had:
beverages.groupby(['Name']).mean()
Output:
          Coffee       Tea
Name
Brenda       1.5  2.000000
Melinda      0.0  3.333333
Robert       2.5  0.000000
Samantha     2.5  0.000000
ANOTHER EXAMPLE
The following Python code is used to create the data we will use in our next groupby example. It is not
necessary to understand the following Python code for the content following afterwards. The module faker
has to be installed. In case of an Anaconda installation this can be done by executing one of the following
commands in a shell:

conda install -c conda-forge faker
pip install faker

from faker import Faker
fake = Faker('de_DE')
number_of_names = 10
names = []
for _ in range(number_of_names):
names.append(fake.first_name())
data = {}
workweek = ("Monday", "Tuesday", "Wednesday", "Thursday", "Frida
y")
weekend = ("Saturday", "Sunday")
            Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
Kenan            0        0          5         9       2         2       9
Jovan            9        1          5         1       0         0       0
Stanislaus       6        7          5         1       1         5       3
Adelinde         2        2          5         8       7         4       9
Cengiz           2        7          3         8       4         6       9
Edeltraud        4        7          9         9       7         9       7
Sara             7        1          7         0       7         8       3
Gerda            9        8          7         0       8         5       8
Tilman           5        1          9         4       7         5       5
Roswita          1        8          5         3       5         3       9
print(names)
['Kenan', 'Jovan', 'Stanislaus', 'Adelinde', 'Cengiz', 'Edeltrau
d', 'Sara', 'Gerda', 'Tilman', 'Roswita']
            Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
Ortwin           0        2          6         1       3         8       0
Mara             9        6          1         8       5         4       8
Siegrun          2        3          1         6       6         8       7
Sylvester        3        3          9         9       6         2       8
Metin            7        5          4         9       5         3       9
Adeline          3        5          0         4       2         9       7
Utz              9        7          8         1       2         3       2
Susan            2        7          6         7       4         4       0
Gisbert          4        1          8         3       6         9       5
Senol            9        0          8         2       5         7       2
def is_weekend(day):
if day in {'Saturday', 'Sunday'}:
return "Weekend"
else:
return "Workday"
data_df.groupby(by=is_weekend, axis=1).sum()
           Weekend  Workday
Ortwin           8       12
Mara            12       29
Siegrun         15       18
Sylvester       10       30
Metin           12       30
Adeline         16       14
Utz              5       27
Susan            4       26
Gisbert         14       22
Senol            9       24
EXERCISES
EXERCISE 1
import pandas as pd
product_prices = pd.DataFrame(d)
product_prices
Output:
products colours customer_price non_customer_price
EXERCISE 2
EXERCISE 3
Read in the project_times.txt file from the data1 directory. The rows of this file contain, comma
separated, the date, the name of the programmer, the name of the project and the time the programmer spent
on the project.
EXERCISE 4
Create a DataFrame containing the total times spent on the projects per day by all the programmers.
EXERCISE 5
Calculate the total time spent on each project over the whole month.
EXERCISE 6
Calculate the total time each programmer spent on the projects over the whole month.
EXERCISE 7
Rearrange the DataFrame with a MultiIndex consisting of the date and the project names, the columns should
be the programmer names and the data of the columns the time of the programmers spent on the projects.
time
programmer Antonie Elise Fatima Hella Mariola
date project
2020-01-01 BIRDY NaN NaN NaN 1.50 1.75
NSTAT NaN NaN 0.25 NaN 1.25
XTOR NaN NaN NaN 1.00 3.50
2020-01-02 BIRDY NaN NaN NaN 1.75 2.00
NSTAT 0.5 NaN NaN NaN 1.75
SOLUTIONS
SOLUTION TO EXERCISE 1
x = product_prices.groupby("products").mean()
x
products
SOLUTION TO EXERCISE 2
x = product_prices.groupby("colours").sum()
x
Output:
customer_price non_customer_price
colours
SOLUTION TO EXERCISE 3
import pandas as pd
df = pd.read_csv("data1/project_times.txt", index_col=0)
df
date
times_per_day = df.groupby(df.index).sum()
print(times_per_day[:10])
time
date
2020-01-01 9.25
2020-01-02 6.00
2020-01-03 2.50
2020-01-06 5.75
2020-01-07 15.00
2020-01-08 13.25
2020-01-09 10.25
2020-01-10 17.00
2020-01-13 4.75
2020-01-14 10.00
SOLUTION TO EXERCISE 5
df.groupby(['project']).sum()
Output:
time
project
BIRDY 9605.75
NSTAT 8707.75
XTOR 6427.50
SOLUTION TO EXERCISE 6
df.groupby(['programmer']).sum()
programmer
Antonie 1511.25
Elise 80.00
Fatima 593.00
Hella 10642.00
Mariola 11914.75
SOLUTION TO EXERCISE 7
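The construction of x is not shown in the text; presumably the times were grouped by date, project and programmer (an assumption consistent with the desired output of Exercise 7):

x = df.groupby([df.index, 'project', 'programmer']).sum()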
x = x.unstack()
x
date project
x = x.fillna(0)
print(x[:10])
All the powerful data structures like the Series and the DataFrames would avail to nothing if the Pandas
module didn't provide powerful functionalities for reading in and writing out data. It is not only a matter of
having functions for interacting with files. To be useful to data scientists, it also needs functions which
support the most important data formats, like delimiter-separated values (CSV/DSV) and Excel files.
DELIMITER-SEPARATED VALUES
Most people take csv files as a synonym for delimiter-separated values files. They leave out of account the
fact that csv is an acronym for "comma separated values", which is not the case in many situations. Pandas
also uses "csv" in contexts in which "dsv" would be more appropriate.
We call a text file a "delimited text file" if it contains text in DSV format.
For example, the file dollar_euro.txt is a delimited text file and uses tabs (\t) as delimiters.
• DataFrame.from_csv
• read_csv
There is no big difference between those two functions, but they have different default values in some cases
and read_csv has more parameters. We will focus on read_csv, because DataFrame.from_csv is kept inside
Pandas only for reasons of backwards compatibility.
exchange_rates = pd.read_csv("data1/dollar_euro.txt",
sep="\t")
print(exchange_rates)
Year Average Min USD/EUR Max USD/EUR Working days
0 2016 0.901696 0.864379 0.959785 247
1 2015 0.901896 0.830358 0.947688 256
2 2014 0.753941 0.716692 0.823655 255
3 2013 0.753234 0.723903 0.783208 255
4 2012 0.778848 0.743273 0.827198 256
5 2011 0.719219 0.671953 0.775855 257
6 2010 0.755883 0.686672 0.837381 258
7 2009 0.718968 0.661376 0.796495 256
8 2008 0.683499 0.625391 0.802568 256
9 2007 0.730754 0.672314 0.775615 255
10 2006 0.797153 0.750131 0.845594 255
11 2005 0.805097 0.740357 0.857118 257
12 2004 0.804828 0.733514 0.847314 259
13 2003 0.885766 0.791766 0.963670 255
14 2002 1.060945 0.953562 1.165773 255
15 2001 1.117587 1.047669 1.192748 255
16 2000 1.085899 0.962649 1.211827 255
17 1999 0.939475 0.848176 0.998502 261
As we can see, read_csv automatically used the first line as the names for the columns. It is possible to give
other names to the columns. For this purpose, we have to skip the first line by setting the parameter "header"
to 0 and we have to assign a list with the column names to the parameter "names":
import pandas as pd
exchange_rates = pd.read_csv("data1/dollar_euro.txt",
sep="\t",
header=0,
names=["year", "min", "max", "days"])
print(exchange_rates)
EXERCISE
The file "countries_population.csv" is a csv file, containing the population numbers of all countries (July
2014). The delimiter of the file is a space and commas are used to separate groups of thousands in the
numbers. The method 'head(n)' of a DataFrame can be used to give out only the first n rows or lines. Read the
file into a DataFrame.
Solution:
pop = pd.read_csv("data1/countries_population.csv",
header=None,
names=["Country", "Population"],
index_col=0,
quotechar="'",
sep=" ",
thousands=",")
print(pop.head(5))
Population
Country
China 1355692576
India 1236344631
European Union 511434812
United States 318892103
Indonesia 253609643
We can create csv (or dsv) files with the method "to_csv". Before we do this, we will prepare some data to
output, which we will write to a file. We have two csv files with population data for various countries.
countries_male_population.csv contains the figures of the male populations and
countries_female_population.csv correspondingly the numbers for the female populations. We will create a
new csv file with the sum:
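The definition of column_names and the reading of the male population file are not shown in the text; a plausible completion (the concrete year labels are placeholders):

column_names = ["Country"] + list(range(2002, 2017))
male_pop = pd.read_csv("data1/countries_male_population.csv",
                       header=None,
                       index_col=0,
                       names=column_names)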
female_pop = pd.read_csv("data1/countries_female_population.csv",
header=None,
index_col=0,
names=column_names)
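# Assumption: the total population is the elementwise sum of both frames:
population = male_pop + female_pop
population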
population
Country
Czech Republic     10269726.0   10203269   10211455   10220577   10251079   10287189   10381130  104675…
New Zealand         3939130.0    4009200    4062500    4100570    4139470    4228280    4268880   43158…
Slovak Republic     5378951.0    5379161    5380053    5384822    5389180    5393637    5400998   54122…
United Kingdom     58706905.0   59262057   59699828   60059858   60412870   60781346   61179260  615950…
United States     277244916.0  288774226  290810719  294442683  297308143  300184434  304846731  3051275…
population.to_csv("data1/countries_total_population.csv")
We want to create a new DataFrame with all the information, i.e. female, male and complete population. This
means that we have to introduce a hierarchical index. Before we do it on our DataFrame, we will introduce
this problem in a simple example:
import pandas as pd
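# The shop dictionaries are not listed in the text; illustrative values
# that reproduce the shape of the output below (the column names
# 'sales' and 'profit' are assumptions):
shop1 = {"sales": {2010: 23, 2011: 25},
         "profit": {2010: 13, 2011: 29}}
shop2 = {"sales": {2010: 40, 2011: 41},
         "profit": {2010: 21, 2011: 22}}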
shop1 = pd.DataFrame(shop1)
shop2 = pd.DataFrame(shop2)
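# Assumption: the two frames were concatenated with hierarchical keys:
shops = pd.concat([shop1, shop2], keys=["one", "two"])
shops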
one 2010 23 13
2011 25 29
We want to swap the hierarchical indices. For this we will use 'swaplevel':
shops = shops.swaplevel()
shops.sort_index(inplace=True)
shops
2010 one 23 13
2011 one 25 29
We will go back to our initial problem with the population figures. We will apply the same steps to those
DataFrames:
pop_complete = pd.concat([population.T,
male_pop.T,
female_pop.T],
keys=["total", "male", "female"])
df = pop_complete.swaplevel()
df.sort_index(inplace=True)
df[["Austria", "Australia", "France"]]
df.to_csv("data1/countries_total_population.csv")
EXERCISE
• Read in the dsv file (csv) bundeslaender.txt. Create a new file with the columns 'land', 'area',
'female', 'male', 'population' and 'density' (inhabitants per square kilometres.
• print out the rows where the area is greater than 30000 and the population is greater than 10000
• Print the rows where the density is greater than 300
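The reading step is not shown in the text; a possible version (assuming the file is space-separated with a header line):

lands = pd.read_csv("data1/bundeslaender.txt", sep=" ")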
lands.insert(loc=len(lands.columns),
column='population',
value=lands['female'] + lands['male'])
lands[:3]
Output:
land area female male population
lands.insert(loc=len(lands.columns),
             column='density',
             value=(lands['population'] * 1000 / lands['area']).round(0))
lands[:4]
Output:
land area female male population density
We will use a simple Excel document to demonstrate the reading capabilities of Pandas. The document
sales.xls contains two sheets, one called 'week1' and the other one 'week2'.
An Excel file can be read in with the Pandas function "read_excel". This is demonstrated in the following
example Python code:
excel_file = pd.ExcelFile("data1/sales.xls")
sheet = pd.read_excel(excel_file)
sheet
Output:
Weekday Sales
0 Monday 123432.980000
1 Tuesday 122198.650200
2 Wednesday 134418.515220
3 Thursday 131730.144916
4 Friday 128173.431003
The document "sales.xls" contains two sheets, but we only have been able to read in the first one with
"read_excel". A complete Excel document, which can consist of an arbitrary number of sheets, can be
completely read in like this:
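This can be done by passing sheet_name=None to read_excel, which returns a dictionary of DataFrames; iterating over it prints each sheet (the variable docu is reused in the averaging code below):

docu = pd.read_excel(excel_file, sheet_name=None)
for name, frame in docu.items():
    print(name + ":")
    print(frame)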
week2:
Weekday Sales
0 Monday 223277.980000
1 Tuesday 234441.879000
2 Wednesday 246163.972950
3 Thursday 241240.693491
4 Friday 230143.621590
We will calculate now the average sales numbers of the two weeks:
average = docu["week1"].copy()
average["Sales"] = (docu["week1"]["Sales"] + docu["week2"]["Sales"]) / 2
print(average)
print(average)
Weekday Sales
0 Monday 173355.480000
1 Tuesday 178320.264600
2 Wednesday 190291.244085
3 Thursday 186485.419203
4 Friday 179158.526297
We will save the DataFrame 'average' in a new document with 'week1' and 'week2' as additional sheets as well:
writer = pd.ExcelWriter('data1/sales_average.xlsx')
docu['week1'].to_excel(writer, 'week1')
docu['week2'].to_excel(writer, 'week2')
average.to_excel(writer, 'average')
writer.save()
writer.close()
INTRODUCTION
NaN was introduced, at least officially, by the IEEE Standard for Floating-Point Arithmetic (IEEE 754). It is
a technical standard for floating-point computation established in 1985 - many years before Python was
invented, and even a longer time before Pandas was created - by the Institute of Electrical and Electronics
Engineers (IEEE). It was introduced to solve problems found in many floating point implementations that
made them difficult to use reliably and portably.
'NAN' IN PYTHON
Python knows NaN values as well. We can create it with "float":
n1 = float("nan")
n2 = float("Nan")
n3 = float("NaN")
n4 = float("NAN")
print(n1, n2, n3, n4)
nan nan nan nan
Warning: Do not compare "NaN" values with regular numbers or with each other. A simple
or simplified reasoning is this: two things are "not a number", so they can be anything, but most probably not
the same. Above all, there is no way of ordering NaNs:
print(n1 == n2)
print(n1 == 0)
print(n1 == 100)
print(n2 < 0)
False
False
False
False
NAN IN PANDAS
Before we will work with NaN data, we will process a file without any NaN values. The data file
temperatures.csv contains the temperature data of six sensors taken every 15 minutes between 6:00 and 19:15
o'clock.
Reading in the data file can be done with the read_csv function:
import pandas as pd
df = pd.read_csv("data1/temperatures.csv",
sep=";",
decimal=",")
df.loc[:3]
We want to calculate the average temperatures per measuring point over all the sensors. We can use the
DataFrame method 'mean'. If we use 'mean' without parameters, it will average each sensor column over all
measuring points, which isn't what we want, but it may be interesting as well:
df.mean()
Output: sensor1 19.775926
sensor2 19.757407
sensor3 19.840741
sensor4 20.187037
sensor5 19.181481
sensor6 19.437037
dtype: float64
average_temp_series = df.mean(axis=1)
print(average_temp_series[:8])
0 13.933333
1 14.533333
2 14.666667
3 14.900000
4 15.083333
5 15.116667
6 15.283333
7 15.116667
dtype: float64
sensors = df.columns.values[1:]
# all columns except the time column will be removed:
df = df.drop(sensors, axis=1)
print(df[:5])
We will assign now the average temperature values as a new column 'temperature':
# best practice:
df = df.assign(temperature=average_temp_series)  # inplace option not available
# alternatively:
# df.loc[:, "temperature"] = average_temp_series
df[:3]
Output:
time temperature
0 06:00:00 13.933333
1 06:15:00 14.533333
2 06:30:00 14.666667
We will use now a data file similar to the previous temperature csv, but this time we will have to cope with
NaN data, when the sensors malfunctioned.
We will create a temperature DataFrame, in which some data is not defined, i.e. NaN.
We will use and change the data from the temperatures.csv file:
import pandas as pd
temp_df = pd.read_csv("data1/temperatures.csv",
sep=";",
index_col=0,
decimal=",")
temp_df[:8]
time
We will randomly assign some NaN values into the data frame. For this purpose, we will use the where
method from DataFrame. If we apply where to a DataFrame object df, i.e. df.where(cond, other_df), it will
return an object of same shape as df and whose corresponding entries are from df where the corresponding
element of cond is True and otherwise are taken from other_df.
Before we continue with our task, we will demonstrate the way of working of where with some simple
examples:
import numpy as np
import pandas as pd

# a small DataFrame with a positive and a negative column (the column
# names A and B are assumptions; the values are from the original output):
df = pd.DataFrame({'A': [18, 22, 14, 16],
                   'B': [-5, -7, -3, -23]})
print(df)
    A   B
0  18  -5
1  22  -7
2  14  -3
3  16 -23
# keep the values where the condition holds, take -df elsewhere:
df.where(df > 0, -df)
For our task, we need to create a DataFrame 'nan_df', which consists purely of NaN values and has the same
shape as our temperature DataFrame 'temp_df'. We will use this DataFrame in 'where'. We also need a
DataFrame with the conditions "df_bool" as True values. For this purpose we will create a DataFrame with
random values between 0 and 1 and by applying 'random_df < 0.8' we get the df_bool DataFrame, in which
about 80 % of the values are True, so about 20 % of the temperature values will be turned into NaN:
random_df = pd.DataFrame(np.random.random(size=temp_df.shape),
columns=temp_df.columns.values,
index=temp_df.index)
nan_df = pd.DataFrame(np.nan,
columns=temp_df.columns.values,
index=temp_df.index)
df_bool = random_df<0.8
df_bool[:5]
Output:
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
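The code that actually disturbs the data is not shown above; it plausibly combined temp_df, df_bool and nan_df like this:

# keep the measured value where df_bool is True, NaN otherwise:
disturbed_data = temp_df.where(df_bool, nan_df)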
disturbed_data.to_csv("data1/temperatures_with_NaN.csv")
disturbed_data[:10]
Output:
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
df = disturbed_data.dropna()
df
time
'dropna' can also be used to drop all columns in which some values are NaN. This can be achieved by
assigning 1 to the axis parameter; the default value is 0 (rows), as we have seen in our previous example. As
every one of our sensor columns contains NaN values, they will all disappear:
df = disturbed_data.dropna(axis=1)
time
06:00:00
06:15:00
06:30:00
06:45:00
07:00:00
Let us change our task: We only want to get rid of all the rows, which contain more than one NaN value. The
parameter 'thresh' is ideal for this task. It can be set to the minimum number. 'thresh' is set to an integer value,
which defines the minimum number of non-NaN values. We have six temperature values in every row. Setting
'thresh' to 5 makes sure that we will have at least 5 valid floats in every remaining row:
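The corresponding call is not shown above; presumably:

cleansed_df = disturbed_data.dropna(thresh=5)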
average_temp_series = cleansed_df.mean(axis=1)
sensors = cleansed_df.columns.values
df = cleansed_df.drop(sensors, axis=1)
# best practice:
df = df.assign(temperature=average_temp_series)  # inplace option not available
df[:6]
Output:
temperature
time
06:00:00 13.933333
06:15:00 14.440000
06:30:00 14.760000
06:45:00 14.940000
07:00:00 15.083333
07:15:00 15.100000
INTRODUCTION
Bins do not necessarily have to be numerical; they can be categorical values of any kind, like "dogs", "cats",
"hamsters", and so on.
Binning is also used in image processing. There it can be used to reduce the amount of data by combining
neighboring pixels into single pixels: k x k binning reduces areas of k x k pixels into single pixels.
Pandas provides easy ways to create bins and to bin data. Before we describe these Pandas functionalities, we
will introduce basic Python functions, working on Python lists and tuples.
BINNING IN PYTHON
The following Python function can be used to create bins:
def create_bins(lower_bound, width, quantity):
    # returns an ascending list of equal-width intervals as tuples;
    # note: the range end includes one extra step, so quantity + 1
    # bins are produced, as the output below shows
    bins = []
    for low in range(lower_bound,
                     lower_bound + quantity*width + 1, width):
        bins.append((low, low+width))
    return bins
We will now call create_bins with a width of 10 (width=10) starting from 10 (lower_bound=10); note that with quantity=5 we get six intervals:
bins = create_bins(lower_bound=10,
width=10,
quantity=5)
bins
Output: [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
The next function 'find_bin' is called with a value and a list or tuple of bins 'bins', which have to be two-tuples
or lists of two elements. The function finds the index of the interval where the value 'value' is contained:
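The function itself is not printed in the text; a version matching this description:

def find_bin(value, bins):
    # bins is a list of (lower, upper) tuples; returns the index i
    # with bins[i][0] <= value < bins[i][1], or -1 if no bin matches:
    for i, (lower, upper) in enumerate(bins):
        if lower <= value < upper:
            return i
    return -1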
from collections import Counter

# the weights below are taken from the output that follows:
weights = [73.4, 69.3, 64.9, 75.6, 74.9, 80.3,
           78.6, 84.1, 88.9, 90.3, 83.4, 69.3,
           52.4, 58.3, 67.4, 74.0, 89.3, 63.4]
bins = create_bins(lower_bound=50, width=4, quantity=10)
print(bins)
binned_weights = []
for value in weights:
    bin_index = find_bin(value, bins)
    print(value, bin_index, bins[bin_index])
    binned_weights.append(bin_index)
frequencies = Counter(binned_weights)
print(frequencies)
[(50, 54), (54, 58), (58, 62), (62, 66), (66, 70), (70, 74), (74, 78), (78, 82), (82, 86), (86, 90), (90, 94)]
73.4 5 (70, 74)
69.3 4 (66, 70)
64.9 3 (62, 66)
75.6 6 (74, 78)
74.9 6 (74, 78)
80.3 7 (78, 82)
78.6 7 (78, 82)
84.1 8 (82, 86)
88.9 9 (86, 90)
90.3 10 (90, 94)
83.4 8 (82, 86)
69.3 4 (66, 70)
52.4 0 (50, 54)
58.3 2 (58, 62)
67.4 4 (66, 70)
74.0 6 (74, 78)
89.3 9 (86, 90)
63.4 3 (62, 66)
Counter({4: 3, 6: 3, 3: 2, 7: 2, 8: 2, 9: 2, 5: 1, 10: 1, 0: 1, 2: 1})
We used a list of tuples as bins in our previous example. We have to turn this list into a usable data structure
for the pandas function "cut". This data structure is an IntervalIndex. We can do this with
pd.IntervalIndex.from_tuples:
import pandas as pd
bins2 = pd.IntervalIndex.from_tuples(bins)
"cut" is the name of the Pandas function, which is needed to bin values into bins. "cut" takes many parameters
but the most important ones are "x" for the actual values und "bins", defining the IntervalIndex. "x" can be any
1-dimensional array-like structure, e.g. tuples, lists, nd-arrays and so on:
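Binning our weights with this IntervalIndex could then look like this:

categorical_object = pd.cut(weights, bins2)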
The result of the Pandas function "cut" is a so-called "Categorical object". Each bin is a category. The
categories are described in a mathematical notation. "(70, 74]" means that this bin contains values from 70 to
74, whereas 70 is not included but 74 is included. Mathematically, this is a half-open interval, i.e. an interval
in which one endpoint is included but not the other. Sometimes it is also called a half-closed interval.
We had also defined the bins in our previous chapter as half-open intervals, but the other way round, i.e. left
side closed and the right side open. When we used pd.IntervalIndex.from_tuples, we could have defined the
"openness" of these bins by setting the parameter "closed" to one of the values 'left', 'right', 'both' or 'neither'.
To have the same behaviour as in our previous chapter, we will set the parameter closed to "left":
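That is (consistent with the left-closed intervals shown in the value_counts output below):

bins2 = pd.IntervalIndex.from_tuples(bins, closed="left")
categorical_object = pd.cut(weights, bins2)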
We used an IntervalIndex as a bin for binning the weight data. The function "cut" can also cope with two other
kinds of bin representations:
• an integer: defines the number of equal-width bins in the range of the values "x". The range of "x" is
extended by 0.1 % on each side to include the minimum and maximum values.
• a sequence of scalars: defines the bin edges, allowing for non-uniform width.
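For example, cutting the weights into 18 equal-width bins (this call reassigns categorical_object and produces the output below):

categorical_object = pd.cut(weights, 18)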
print(categorical_object)
[(71.35, 73.456], (69.244, 71.35], (62.928, 65.033], (75.561, 77.667], (73.456, 75.561], ..., (56.611, 58.717], (67.139, 69.244], (73.456, 75.561], (88.194, 90.3], (62.928, 65.033]]
Length: 18
Categories (18, interval[float64]): [(52.362, 54.506] < (54.506, 56.611] < (56.611, 58.717] < (58.717, 60.822] ... (81.878, 83.983] < (83.983, 86.089] < (86.089, 88.194] < (88.194, 90.3]]
The next and most interesting question is now how we can see the actual bin counts. This can be accomplished
with the function "value_counts":
pd.value_counts(categorical_object)
Output: [74, 78) 3
[66, 70) 3
[86, 90) 2
[82, 86) 2
[78, 82) 2
[62, 66) 2
[90, 94) 1
[70, 74) 1
[58, 62) 1
[50, 54) 1
[54, 58) 0
dtype: int64
"categorical_object.codes" provides you with a labelling of the input values into the binning categories:
labels = categorical_object.codes
labels
Output: array([ 5,  4,  3,  6,  6,  7,  7,  8,  9, 10,  8,  4,  0,  2,  4,  6,  9,  3], dtype=int8)
categories = categorical_object.categories
categories
categorical_object.categories
Output: IntervalIndex([[50, 54), [54, 58), [58, 62), [62, 66), [66, 70) ... [74, 78), [78, 82), [82, 86), [86, 90), [90, 94)],
                      closed='left',
                      dtype='interval[int64]')
NAMING BINS
Let's imagine we have a university which confers three levels of Latin honors depending on the grade point
average (GPA):
degrees = ["none", "cum laude", "magna cum laude", "summa cum laud
e"]
student_results = [3.93, 3.24, 2.80, 2.83, 3.91, 3.698, 3.731, 3.2
5, 3.24, 3.82, 3.22]
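The actual binning call is not printed in the text; a version using pd.cut (the GPA cut-offs 3.6, 3.8 and 3.9 are assumptions):

student_results_degrees = pd.cut(student_results,
                                 [0, 3.6, 3.8, 3.9, 4.0],
                                 labels=degrees)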
labels = student_results_degrees.codes
categories = student_results_degrees.categories
INTRODUCTION
We learned the basic concepts of Pandas
in our previous chapter of our tutorial on
Pandas. We introduced the data structures
• Series and
• DataFrame
We also learned how to create and manipulate the Series and DataFrame objects in numerous Python
programs.
Now it is time to learn some further aspects of these data structures in this chapter of our tutorial.
import pandas as pd
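# city_series is not defined in the text above; a sketch with a
# MultiIndex of (city, attribute) pairs, using the figures from the
# outputs below (the Zürich population is an illustrative placeholder):
city_series = pd.Series({
    ("Vienna", "country"): "Austria",
    ("Vienna", "area"): 414.6,
    ("Vienna", "population"): 1805681,
    ("Hamburg", "country"): "Germany",
    ("Hamburg", "area"): 755,
    ("Hamburg", "population"): 1760433,
    ("Berlin", "country"): "Germany",
    ("Berlin", "area"): 891.85,
    ("Berlin", "population"): 3562166,
    ("Zürich", "country"): "Switzerland",
    ("Zürich", "area"): 87.88,
    ("Zürich", "population"): 400000})  # placeholder value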
print(city_series["Vienna"])
country Austria
area 414.6
population 1805681
dtype: object
We can also access the information about the country, area or population of a city. We can do this in two ways:
print(city_series["Vienna"]["area"])
414.6
print(city_series["Vienna", "area"])
414.6
We can also get the content of multiple cities at the same time by using a list of city names as the key:
city_series = city_series.sort_index()
print("city_series with sorted index:")
print(city_series)
print(city_series[:, "area"])
Berlin 891.85
Hamburg 755
Vienna 414.6
Zürich 87.88
dtype: object
Parameters
----------
i, j : int, string (can be mixed)
Level of index to be swapped. Can pass level name as string.
The indexes 'i' and 'j' are optional, and default to
the two innermost levels of the index
Returns
-------
swapped : Series
city_series = city_series.swaplevel()
city_series.sort_index(inplace=True)
city_series
INTRODUCTION
It is seldom a good idea to present your scientific or business data solely in rows and columns of numbers. We
rather use various kinds of diagrams to visualize our data. This makes the communication of information more
efficient and easy to grasp. In other words, it makes complex data more accessible and understandable. The
numerical data can be graphically encoded with line charts, bar charts, pie charts, histograms, scatterplots and
others.
SERIES
Both the Pandas Series and DataFrame objects support a plot method.
You can see a simple example of a line plot for a Series object. We use a simple Python list "data" as the data
for the Series. The index will be used for the x values, or the domain.
import pandas as pd
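# data is not given in the text; illustrative values:
data = [100, 120, 140, 180, 200, 210, 214]
s = pd.Series(data, index=range(len(data)))
s.plot()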
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe98be8a710>
s.plot(use_index=False)
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe98be8a710>
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe98be8a710>
We will introduce now the plot method of a DataFrame. We define a dictionary with the population and area
figures. This dictionary can be used to create the DataFrame, which we want to use for plotting:
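The cities dictionary itself is not listed in the text; it can be reconstructed from the printed DataFrame below:

cities = {"name": ["London", "Berlin", "Madrid", "Rome", "Paris",
                   "Vienna", "Bucharest", "Hamburg", "Budapest",
                   "Warsaw", "Barcelona", "Munich", "Milan"],
          "population": [8615246, 3562166, 3165235, 2874038, 2273305,
                         1805681, 1803425, 1760433, 1754000, 1740119,
                         1602386, 1493900, 1350680],
          "area": [1572.00, 891.85, 605.77, 1285.00, 105.40, 414.60,
                   228.00, 755.00, 525.20, 517.00, 101.90, 310.40,
                   181.80]}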
city_frame = pd.DataFrame(cities,
columns=["population", "area"],
index=cities["name"])
print(city_frame)
population area
London 8615246 1572.00
Berlin 3562166 891.85
Madrid 3165235 605.77
Rome 2874038 1285.00
Paris 2273305 105.40
Vienna 1805681 414.60
Bucharest 1803425 228.00
Hamburg 1760433 755.00
Budapest 1754000 525.20
Warsaw 1740119 517.00
Barcelona 1602386 101.90
Munich 1493900 310.40
Milan 1350680 181.80
The following code plots our DataFrame city_frame. We will multiply the area
column by 1000, because otherwise the "area" line would not be visible or in
other words would be overlapping with the x axis:
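The plotting code itself is not shown; presumably it looked like this:

city_frame["area"] *= 1000
city_frame.plot()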
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe9b13f2c50>
This plot is not coming up to our expectations, because not all the city names appear on the x axis. We can
change this by defining the xticks explicitly with "range(len(city_frame.index))". Furthermore, we have to set
use_index to True, so that we get city names and not numbers from 0 to len(city_frame.index):
city_frame.plot(xticks=range(len(city_frame.index)),
use_index=True)
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe983cd72e8>
Now, we have a new problem. The city names are overlapping. There is a remedy at hand for this problem as
well. We can rotate the strings by 90 degrees. The names will be printed vertically afterwards:
city_frame.plot(xticks=range(len(city_frame.index)),
use_index=True,
rot=90)
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe98bbf6c88>
We multiplied the area column by 1000 to get a proper output. Instead of this, we
could have used twin axes. We will demonstrate this in the following example. We
will recreate the city_frame DataFrame to get the original area column:
city_frame = pd.DataFrame(cities,
columns=["population", "area"],
index=cities["name"])
print(city_frame)
population area
London 8615246 1572.00
Berlin 3562166 891.85
Madrid 3165235 605.77
Rome 2874038 1285.00
Paris 2273305 105.40
Vienna 1805681 414.60
Bucharest 1803425 228.00
Hamburg 1760433 755.00
Budapest 1754000 525.20
Warsaw 1740119 517.00
Barcelona 1602386 101.90
Munich 1493900 310.40
Milan 1350680 181.80
To get a twin axes representation of our diagram, we need subplots from the module matplotlib and the
function "twinx":
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
fig.suptitle("City Statistics")
ax.set_ylabel("Population")
ax.set_xlabel("Cities")
ax2 = ax.twinx()
ax2.set_ylabel("Area")
# the population plot on ax is needed as well; it uses the same call
# as in the variant further below:
city_frame["population"].plot(ax=ax,
                              style="b-",
                              xticks=range(len(city_frame.index)),
                              use_index=True,
                              rot=90)
city_frame["area"].plot(ax=ax2,
                        style="g-",
                        use_index=True,
                        rot=90)
plt.show()
ax1= city_frame["population"].plot(style="b-",
xticks=range(len(city_frame.index)),
use_index=True,
rot=90)
ax2 = ax1.twinx()
city_frame["area"].plot(ax=ax2,
style="g-",
use_index=True,
#secondary_y=True,
rot=90)
plt.show()
Let's add another axes to our city_frame. We will add a column with the
population density, i.e. the number of people per square kilometre:
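The column is created as described (people per square kilometre):

city_frame["density"] = city_frame["population"] / city_frame["area"]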
city_frame
fig, ax = plt.subplots()
fig.suptitle("City Statistics")
ax.set_ylabel("Population")
ax.set_xlabel("Cities")
# the two additional y axes are not created in the text above;
# presumably via twinx:
ax_area = ax.twinx()
ax_density = ax.twinx()
rspine = ax_density.spines['right']
rspine.set_position(('axes', 1.25))
ax_density.set_frame_on(True)
ax_density.patch.set_visible(False)
fig.subplots_adjust(right=0.75)
city_frame["population"].plot(ax=ax,
style="b-",
xticks=range(len(city_frame.index)),
use_index=True,
rot=90)
city_frame["area"].plot(ax=ax_area,
style="g-",
use_index=True,
rot=90)
city_frame["density"].plot(ax=ax_density,
style="r-",
use_index=True,
rot=90)
plt.show()
...
import pandas as pd
data_path = "data1/"
data = pd.read_csv(data_path + "python_course_monthly_history.txt",
quotechar='"',
thousands=",",
delimiter=r"\s+")
def unit_convert(x):
value, unit = x
if unit == "MB":
value *= 1024
elif unit == "GB":
value *= 1048576 # i.e. 1024 **2
return value
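The application of unit_convert is not shown; a plausible row-wise version (the column name "Data" is a hypothetical placeholder for the numeric column of the file):

bandwidth = data[["Data", "Unit"]].apply(unit_convert, axis=1)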
del data["Unit"]
data["Bandwidth"] = bandwidth
data.set_index("Month", inplace=True)
del data["Bandwidth"]
"tiobe_programming_language_usage_nov2018.txt"
with a ranking of programming languages by usage. The data has been collected and
created by TIOBE in November 2018.
The percentage column contains strings with a percentage sign. We can get rid of this when we read in the
data with read_csv. All we have to do is define a converter function, which we pass to read_csv via the
converters dictionary, which contains column names as keys and references to functions as values.
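The converter itself is not printed in the text; it plausibly looked like this:

def strip_percentage_sign(x):
    return float(x.strip('%'))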
data_path = "data1/"
progs = pd.read_csv(data_path + "tiobe_programming_language_usage_nov2018.txt",
                    quotechar='"',
                    thousands=",",
                    index_col=1,
                    converters={'Percentage': strip_percentage_sign},
                    delimiter=r"\s+")
del progs["Position"]
print(progs.head(6))
progs.plot(xticks=range(1, len(progs.index)),
use_index=True, rot=90)
plt.show()
A SIMPLE EXAMPLE
import pandas as pd
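# s is not defined in the text; illustrative values:
s = pd.Series([10, 17, 8, 21], index=["a", "b", "c", "d"])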
s.plot(kind="bar")
Let's get back to our programming language ranking. We will now print out a bar plot of the six most used
programming languages:
progs[:6].plot(kind="bar")
progs.plot(kind="bar")
import pandas as pd
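# series is not defined in the text; assumption: the Percentage column
# of the progs ranking from above:
series = progs["Percentage"][:6]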
series.plot.pie(figsize=(6, 6))
It looks ugly that we see the y label "Percentage" inside our pie plot. We can
remove it by calling "plt.ylabel('')"
plt.ylabel('')
INTRODUCTION
Python provides rich functionalities for dealing with date and time data. The standard library contains the
modules
• time
• calendar
• datetime
PYTHON STANDARD MODULES FOR TIME DATA
The most important modules of Python dealing with time are the modules time , calendar and
datetime .
• The instances of the date class represent dates, whereas the year can
range between 1 and 9999.
• The instances of the datetime class are made up both by a date and a
time.
• The time class implements time objects.
• The timedelta class is used to hold the differences between two times
or two date objects.
• The tzinfo class is used to implement timezone support for time and
datetime objects.
from datetime import date
x = date(1993, 12, 14)
print(x)
1993-12-14
We can instantiate dates in the range from January 1, 1 to December 31, 9999.
This can be inquired from the attributes min and max :
print(date.min)
print(date.max)
0001-01-01
9999-12-31
x.toordinal()
Output: 727911
date.fromordinal(727911)
Output:
datetime.date(1993, 12, 14)
If you want to know the weekday of a certain date, you can calculate it by using the method weekday, which
returns 0 for Monday, 1 for Tuesday, and so on:
x.weekday()
Output:
1
date.today()
Output:
datetime.date(2017, 4, 12)
print(x.day)
print(x.month)
print(x.year)
14
12
1993
from datetime import time
t = time(15, 6, 23)
print(t)
15:06:23
print(time.min)
print(time.max)
00:00:00
23:59:59.999999
t.hour, t.minute, t.second
Output:
(15, 6, 23)
t = t.replace(hour=11, minute=59)
t
Output:
datetime.time(11, 59, 23)
x.ctime()
Output:
'Tue Dec 14 00:00:00 1993'
Date and datetime objects come in two kinds:
• naive
• aware
A naive object does not contain enough information to unambiguously locate itself relative to other date/time
objects. An aware object on the other hand possesses knowledge of the time zone it belongs
to or the daylight saving time information. This way it can locate itself
relative to other aware objects.
from datetime import datetime
t = datetime(2017, 4, 19, 16, 31)
t
Output:
datetime.datetime(2017, 4, 19, 16, 31)
t.tzinfo == None
Output: True
We will create an aware datetime object from the current date. For this purpose
we need the module pytz. pytz is a module, which brings the Olson tz database
into Python. The Olson timezones are nearly completely supported by this module.
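A sketch of creating such an aware object:

import pytz
t = datetime.now(pytz.utc)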
We can see that both t.tzinfo and t.tzinfo.utcoffset(t) are different from None,
so t is an aware object:
t.tzinfo, t.tzinfo.utcoffset(t)
Output:
(<UTC>, datetime.timedelta(0))
delta = datetime(2019, 8, 18) - datetime(2017, 1, 1)  # illustrative dates; not given in the text
delta, type(delta)
Output:
(datetime.timedelta(959), datetime.timedelta)
The result of the subtraction of the two datetime objects is a timedelta object,
as we can see from the example above.
We can get information about the number of days and the remaining seconds elapsed by using the attributes
'days' and 'seconds' (the delta queried here stems from a different pair of dates):
delta.days, delta.seconds
Output:
(412, 76680)
from datetime import datetime, timedelta
d1 = datetime(1991, 4, 30)  # d1 is not defined in the text; this value follows from the str(d1) output below
d3 = d1 - timedelta(100)
print(d3)
d4 = d1 - 2 * timedelta(50)
print(d4)
1991-01-20 00:00:00
1991-01-20 00:00:00
The easiest way to convert a datetime object into a string consists in using str.
s = str(d1)
s
Output:
'1991-04-30 00:00:00'
print(d1.strftime('%Y-%m-%d'))
print("weekday: " + d1.strftime('%a'))
print("weekday as a full name: " + d1.strftime('%A'))
print("weekday as a decimal number: " + d1.strftime('%w'))
1991-04-30
weekday: Tue
weekday as a full name: Tuesday
weekday as a decimal number: 2
Formatting months:
print(d1.strftime('%d'))  # day of the month
print(d1.strftime('%b'))  # abbreviated month name
print(d1.strftime('%B'))  # full month name
print(d1.strftime('%m'))  # month as a zero-padded number
30
Apr
April
04
dt = "2007-03-04T21:08:12"
datetime.strptime( dt, "%Y-%m-%dT%H:%M:%S" )
Output:
datetime.datetime(2007, 3, 4, 21, 8, 12)
We can create an English date string on a Linux machine with the Shell command
LC_ALL=en_EN.utf8 date
2017-04-12 20:29:53
from dateutil.parser import parse
parse('2011-01-03')
Output:
datetime.datetime(2011, 1, 3, 0, 0)
Output:
datetime.datetime(2017, 4, 12, 20, 29, 53, tzinfo=tzlocal())
INTRODUCTION
Our next chapter of our Pandas Tutorial deals with time series. A time series is a series of data points, which
are listed (or indexed) in time order. Usually, a time series is a sequence of values, which are equally spaced
points in time. Everything which consists of measured data connected with the corresponding time can be
seen as a time series. Measurements can be taken irregularly, but in most cases time series consist of fixed
frequencies. This means that data is measured or taken in a regular pattern, i.e. for example every 5
milliseconds, every 10 seconds, or every hour.
Often time series are plotted as line charts.
In this chapter of our tutorial on Python with Pandas, we will introduce the tools from Pandas dealing with
time series. You will learn how to cope with large time series and how to modify time series.
Before you continue reading it might be useful to go through our tutorial on the
standard Python modules dealing with time processing, i.e. datetime, time and
calendar:
import numpy as np
import pandas as pd
values = [25, 50, 15, 67, 70, 9, 28, 30, 32, 12]
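# dates is not defined in the text; assumption: the days run backwards
# from 2017-03-31, matching the output below:
from datetime import datetime, timedelta
start = datetime(2017, 3, 31)
dates = [start - timedelta(days=x) for x in range(10)]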
ts = pd.Series(values, index=dates)
ts
Output:
2017-03-31 25
2017-03-30 50
2017-03-29 15
2017-03-28 67
2017-03-27 70
2017-03-26 9
2017-03-25 28
2017-03-24 30
2017-03-23 32
2017-03-22 12
dtype: int64
type(ts)
Output:
pandas.core.series.Series
What does the index of a time series look like? Let's see:
ts.index
The index is a DatetimeIndex. Now we create a second time series with the same index:
values2 = [32, 54, 18, 61, 72, 19, 21, 33, 29, 17]
ts2 = pd.Series(values2, index=dates)
It is possible to use arithmetic operations on time series like we did with other
series. We can for example add the two previously created time series:
ts + ts2
Output:
2017-03-31 57
2017-03-30 104
2017-03-29 33
2017-03-28 128
2017-03-27 142
2017-03-26 28
2017-03-25 49
2017-03-24 63
2017-03-23 61
2017-03-22 29
dtype: int64
The arithmetic mean of both series, i.e. the elementwise average of the values:
(ts + ts2) / 2
Output:
2017-03-31 28.5
2017-03-30 52.0
2017-03-29 16.5
2017-03-28 64.0
2017-03-27 71.0
2017-03-26 14.0
2017-03-25 24.5
2017-03-24 31.5
2017-03-23 30.5
2017-03-22 14.5
dtype: float64
import pandas as pd
ndays = 10
values = [25, 50, 15, 67, 70, 9, 28, 30, 32, 12]
values2 = [32, 54, 18, 61, 72, 19, 21, 33, 29, 17]
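# dates and dates2 are not defined in the text; assumption: two
# overlapping but shifted daily ranges, so that the sum below yields
# NaN where the indices do not coincide:
dates = pd.date_range("2017-03-22", periods=ndays)
dates2 = pd.date_range("2017-03-26", periods=ndays)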
ts = pd.Series(values, index=dates)
ts2 = pd.Series(values2, index=dates2)
ts + ts2
import pandas as pd
index = pd.date_range('1970-12-24', '1971-01-03')
index
Output:
DatetimeIndex(['1970-12-24', '1970-12-25', '1970-12-26', '1970-12-27',
               '1970-12-28', '1970-12-29', '1970-12-30', '1970-12-31',
               '1971-01-01', '1971-01-02', '1971-01-03'],
              dtype='datetime64[ns]', freq='D')
We have passed a start and an end date to date_range in our previous example. It
is also possible to pass only a start or an end date to the function. In this
case, we have to determine the number of periods to generate by setting the
keyword parameter 'periods':
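For example (the start date is illustrative):

pd.date_range(start='2017-03-22', periods=10)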
We can also create time frequencies, which consists only of business days for
example by setting the keyword parameter 'freq' to the string 'B':
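For example:

pd.date_range(start='2017-03-22', periods=10, freq='B')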
In the following example, we create a time frequency which contains the month
ends between two dates. We can see that the year 2016 contained the 29th of
February, because it was a leap year:
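A call consistent with the output below (the exact start and end dates are assumptions):

pd.date_range('2016-02-25', '2016-07-10', freq='M')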
Output:
DatetimeIndex(['2016-02-29', '2016-03-31', '2016-04-30', '2016-05-31',
               '2016-06-30'],
              dtype='datetime64[ns]', freq='M')
Other aliases:
Alias  Description
W      weekly frequency
H      hourly frequency
T      minutely frequency
S      secondly frequency
L      milliseconds
U      microseconds
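The 'W' alias can be anchored to a weekday; the following call (the dates are assumptions consistent with the output) yields the Mondays below:

pd.date_range('2017-02-06', periods=10, freq='W-MON')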
Output:
DatetimeIndex(['2017-02-06', '2017-02-13', '2017-02-20', '2017-02-27',
               '2017-03-06', '2017-03-13', '2017-03-20', '2017-03-27',
               '2017-04-03', '2017-04-10'],
              dtype='datetime64[ns]', freq='W-MON')
Date;Description;Category;Out;In
2020-06-02;Salary Frank;Income;0;4896.44
2020-06-03;supermarket;food and beverages;132.40;0
2020-06-04;Salary Laura;Income;0;4910.14
2929-06-04;GreenEnergy Corp., (electricity);utility;87.34;0
2020-06-09;water and sewage;utility;60.56;0
2020-06-10;Fitness studio, Jane;health and sports;19.00;0
2020-06-11;payment to bank;monthly redemption payment;1287.43;0
2020-06-12;LeGourmet Restaurant;restaurants and hotels;145.00;0
2020-06-13;supermarket;food and beverages;197.42;0
2020-06-13;Pizzeria da Pulcinella;restaurants and hotels;60.00;0
2020-06-26;supermarket;food and beverages;155.42;0
2020-06-27;theatre tickets;education and culture;125;0
The above-mentioned csv file is saved under the name expenses_and_income.csv in the folder data . It is
easy to read it in with Pandas, as we can see in our chapter Pandas Data Files:
import pandas as pd
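# Assumption: only the semicolon separator is needed for reading:
exp_inc = pd.read_csv("data/expenses_and_income.csv", sep=";")
exp_inc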
Output:
          Date                       Description                    Category      Out    In
1   2020-06-03                       supermarket          food and beverages   132.40  0.00
3   2929-06-04  GreenEnergy Corp., (electricity)                     utility    87.34  0.00
6   2020-06-11                   payment to bank  monthly redemption payment  1287.43  0.00
7   2020-06-12              LeGourmet Restaurant      restaurants and hotels   145.00  0.00
8   2020-06-13                       supermarket          food and beverages   197.42  0.00
10  2020-06-26                       supermarket          food and beverages   155.42  0.00
11  2020-06-27                   theatre tickets       education and culture   125.00  0.00
13  2020-07-03                       supermarket          food and beverages   147.90  0.00
16  2020-07-09                   house insurance        insurances and taxes   167.89  0.00
18  2020-07-10                       supermarket          food and beverages   144.12  0.00
19  2020-07-11                   payment to bank  monthly redemption payment  1287.43  0.00
20  2020-07-18                       supermarket          food and beverages   211.24  0.00
22  2020-07-23                            Cinema       education and culture    19.00  0.00
23  2020-07-25                       supermarket          food and beverages   186.11  0.00
By reading the CSV file, we created a DataFrame object. What can we do with it,
or in other words: what information interests Frank and Laura? Of course they are
interested in the account balance. They want to know what the total income was
and they want to see the total of all expenses.
The balances of their expenses and incomes can be easily calculated by applying
the sum on the DataFrame exp_inc[['Out', 'In']] :
exp_inc[['Out', 'In']].sum()
Output:
Out 5097.44
In 19613.16
dtype: float64
What other information do they want to gain from the data? They might be
interested in seeing the expenses summed up according to the different
categories. This can be done using groupby and sum:
category_sums = exp_inc.groupby("Category").sum()
category_sums
Out In
Category
category_sums.index
import matplotlib.pyplot as plt
ax = category_sums.plot.bar(y="Out")
plt.xticks(rotation=45)
Output:
(array([0, 1, 2, 3, 4, 5, 6, 7]), <a list of 8 Text xticklabel objects>)
Alternatively, we can create the same pie plot with the following code:
ax = category_sums["Out"].plot.pie()
ax.legend(loc="upper left", bbox_to_anchor=(1.5, 1))
If you imagine that you will have to type in category names like "household goods and service" or "rent and
mortgage interest" all the time, you will agree that it is very likely to have typos in your journal of expenses
and income. So it will be a good idea to use numbers (account numbers) for your categories. The following
categories are available in our example:
transport 204
clothing 207
communications 208
income 400
The next step is to replace our "clumsy" category names with the account numbers.
The replace method of DataFrame is ideal for this purpose. We can replace all
the occurrences of the category names in our DataFrame by the corresponding
account names:
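The category2account dictionary is not listed in the text; only some of its entries are shown above, so the remaining account numbers here are hypothetical:

category2account = {"food and beverages": 200,
                    "utility": 201,
                    "health and sports": 202,
                    "transport": 204,
                    "clothing": 207,
                    "communications": 208,
                    "restaurants and hotels": 209,
                    "education and culture": 210,
                    "monthly redemption payment": 211,
                    "insurances and taxes": 212,
                    "Income": 400}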
exp_inc.replace(category2account, inplace=True)
exp_inc.rename(columns={"Category": "Accounts"}, inplace=True)
exp_inc[:5]
We will save this DataFrame object now in an excel file. This excel file will
have two sheets: One with the "expenses and income" journal and the other one
with the mapping of account numbers to category names.
We will turn the category2account dictionary into a Series object for this
purpose. The account numbers serve as the index:
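A possible way to build the Series and write both sheets (the output file name is an assumption):

# the account numbers become the index, the names the values:
account_numbers = pd.Series(list(category2account.keys()),
                            index=list(category2account.values()))
with pd.ExcelWriter("data1/expenses_and_income.xlsx") as writer:
    exp_inc.to_excel(writer, sheet_name="journal")
    account_numbers.to_excel(writer, sheet_name="account numbers")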
Very often a Microsoft Excel file is used for maintaining a journal file of the financial transactions.
In the data1 folder there is an Excel file net_income_method_2020.xlsx with the accounting data of a
fictitious company. We could have used simple text files like csv as well.
JOURNAL FILE
This excel document contains two data sheets: One with the actual data "journal"
and one with the name "account numbers", which contains the mapping from the
account numbers to the description.
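The reading code is not shown; plausibly:

journal = pd.read_excel("data1/net_income_method_2020.xlsx",
                        sheet_name="journal",
                        index_col=0)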
journal.index = pd.to_datetime(journal.index)
journal.index
The first one is the tab "account numbers" which contains the mapping from the
account numbers to the description of the accounts:
         description
account
2010     souvenirs
2020     clothes
2050     books
2100     insurances
2200     wages
2300     loans
2400     hotels
2500     petrol
2600     telecommunication
2610     internet
The second data sheet "journal" contains the actual journal entries:
journal[:10]
date
Birmann,
2020-04-02 2010 57550799 19 -1890.00
Souvenirs
Filling Station,
2020-04-02 2500 12766279 19 -89.40
Petrol
EnergyCom,
2020-04-02 4400 3733462359 19 4663.54
Hamburg
Enoigo,
2020-04-02 4402 7526058231 19 2412.82
Strasbourg
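The summing code is not shown; a plausible version:

account_sums = journal[["account number", "gross amount"]].groupby("account number").sum()
account_sums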
                gross amount
account number
2010                -4090.00
2020               -10500.80
2030                -1350.00
2050                 -900.00
2100                 -612.00
2200               -69912.92
2300               -18791.92
2400                -1597.10
2500                  -89.40
2600                 -492.48
2610                 -561.00
4400                37771.84
4401                69610.35
4402                61593.99
ACCOUNT CHARTS
What about showing a pie chart of these sums? We encounter one problem: Pie
charts cannot contain negative values. However this is not a real problem. We can
split the accounts into income and expense accounts. Of course, this corresponds
more to what we really want to see.
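A plausible way to split them:

income_accounts = account_sums[account_sums["gross amount"] > 0]
expenses_accounts = account_sums[account_sums["gross amount"] < 0]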
gross amount
account number
4400 37771.84
4401 69610.35
4402 61593.99
A bbox_to_anchor value of (1, 1), for example, corresponds to the upper right corner of the axes.
We use this to position the legend with its left upper corner positioned in the
middle of the plot:
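The plotting call is not shown; a sketch (the pie plot of the income accounts is an assumption):

plot = income_accounts.plot(kind="pie", y="gross amount", legend=False)
plot.legend(bbox_to_anchor=(0.5, 0.5), loc="upper left")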
Output:
<matplotlib.legend.Legend at 0x7f172ca03a90>
Now we position the lower right corner of the legend into the center of the plot:
There is another thing we can improve. We see the labels 4400, 4401, and 4402
beside of each pie segment. In addition, we see them in the legend. This is ugly
and redundant information. In the following we will turn the labels off, i.e. set
them to an empty string, in the plot and we explicitly set them in the legend
method:
Now, we are close to perfection. Just one more tiny thing: some might prefer to see the actual description text
rather than an account number. We will cut out this information from the DataFrame accounts2descr by
using loc and the list of desired numbers [4400, 4401, 4402] . The result of this operation will be the
argument of the set_index method. (Attention: reindex is not giving the wanted results!)
plot.legend(bbox_to_anchor=(0.5, 0.5),
loc="lower left",
labels=descriptions)
Would you prefer a bar chart? No problem, we just have to set the parameter
kind to bar instead of pie :
gross amount
account number
2010 -4090.00
2020 -10500.80
2030 -1350.00
2050 -900.00
2100 -612.00
2200 -69912.92
2300 -18791.92
2400 -1597.10
2500 -89.40
2600 -492.48
2610 -561.00
Output:
account number
2010 souvenirs
2020 clothes
2030 other articles
2050 books
2100 insurances
2200 wages
2300 loans
2400 hotels
2500 petrol
2600 telecommunication
2610 internet
Name: description, dtype: object
expenses_accounts.set_index(acc2descr_expenses.values, inplace=True)
expenses_accounts *= -1
TAX SUMS
We will sum up the amounts according to their tax rates.
journal.drop(columns=["account number"])
date        description                   document number  tax rate  gross amount
2020-07-25  BoKoData, Bodensee, Konstanz       5204418668        19       3678.38
In the following we will define a function tax_sums that calculates the VAT sums according to tax rates
from a journal DataFrame. A possible implementation (assuming the journal has the columns 'tax rate' and
'gross amount'):
def tax_sums(journal_df):
    # sum the gross amounts grouped by their tax rate:
    taxes = journal_df.groupby("tax rate")[["gross amount"]].sum()
    return taxes.fillna(0)
tax_sums(journal)
Output: a table with one row per tax rate and the summed gross amounts.
DEFINITIONS
A linear combination in mathematics is an expression constructed from a set of terms by multiplying each
term by a constant and adding the results. Generally:

p = λ1 · x1 + λ2 · x2 + … + λn · xn
import numpy as np
x = np.array([[0, 0, 1],
[0, 1, 0],
[1, 0, 0]])
y = ([3.65, 1.55, 3.42])
scalars = np.linalg.solve(x, y)
scalars
Output:
array([3.42, 1.55, 3.65])
The previous example was very easy, because we could work out the result in our head. What about writing
our vector y = (3.65, 1.55, 3.42) as a linear combination of the vectors (0, 1, 1), (1, 1, 0) and (1, 0, 1)? It looks
like this in Python:
import numpy as np
x = np.array([[0, 1, 1],
[1, 1, 0],
[1, 0, 1]])
y = ([3.65, 1.55, 3.42])
scalars = np.linalg.solve(x, y)
scalars
Output:
array([0.66, 0.89, 2.76])
ANOTHER EXAMPLE
Any integer between -40 and 40 can be written as a linear combination of 1, 3, 9 and 27.
For example:
7 = 1 · 1 + (-1) · 3 + 1 · 9 + 0 · 27
We can calculate these scalars with Python. First we need a generator generating all the possible scalar
combinations. If you have problems in understanding the concept of a generator, we recommend the chapter
"Iterators and Generators" of our tutorial.
def factors_set():
for i in [-1, 0, 1]:
for j in [-1,0,1]:
for k in [-1,0,1]:
for l in [-1,0,1]:
yield (i, j, k, l)
We will use the memoize() technique (see chapter "Memoization and Decorators" of
our tutorial) to memorize previous results:
def memoize(f):
results = {}
def helper(n):
if n not in results:
results[n] = f(n)
return results[n]
return helper
@memoize
def linear_combination(n):
    """ returns the tuple (i,j,k,l) satisfying
        n = i*1 + j*3 + k*9 + l*27 """
    weighs = (1, 3, 9, 27)
    for factors in factors_set():
        total = sum(f * w for f, w in zip(factors, weighs))
        if total == n:
            return factors
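The tuples below are the linear combinations for the numbers from 1 to 10:

for i in range(1, 11):
    print(linear_combination(i))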
(1, 0, 0, 0)
(-1, 1, 0, 0)
(0, 1, 0, 0)
(1, 1, 0, 0)
(-1, -1, 1, 0)
(0, -1, 1, 0)
(1, -1, 1, 0)
(-1, 0, 1, 0)
(0, 0, 1, 0)
(1, 0, 1, 0)