Learning the pandas Library
Matt Harrison
hairysun.com
COPYRIGHT © 2016
Contents

1 Introduction
    1.1 Who this book is for
    1.2 Data in this Book
    1.3 Hints, Tables, and Images
2 Installation
    2.1 Other Installation Options
    2.2 scipy.stats
    2.3 Summary
3 Data Structures
    3.1 Summary
4 Series
    4.1 The index abstraction
    4.2 The pandas Series
    4.3 The NaN value
    4.4 Similar to NumPy
    4.5 Categorical Data
    4.6 Summary
5 Series CRUD
    5.1 Creation
    5.2 Reading
    5.3 Updating
    5.4 Deletion
    5.5 Summary
6 Series Indexing
    6.1 .iloc and .loc
    6.2 .at and .iat
    6.3 .ix
    6.4 Index Operations
    6.5 Indexing Summary
    6.6 Slicing
    6.7 Boolean Arrays
    6.8 More Boolean Arrays
    6.9 Summary
7 Series Methods
    7.1 Iteration
    7.2 Overloaded operations
    7.3 Possible Alignment Issues
    7.4 Getting and Setting Values
    7.5 Reset Index
    7.6 Counts
    7.7 Statistics
    7.8 Convert Types
    7.9 More Converting
    7.10 Dealing with None
    7.11 Matrix Operations
    7.12 Append, combining, and joining two series
    7.13 Sorting
    7.14 Applying a function
    7.15 Serialization
    7.16 String operations
    7.17 Date Operations
    7.18 Summary
8 Series Plotting
    8.1 Other plot types
    8.2 Summary
9 Another Series Example
10 DataFrames
    10.1 DataFrames
    10.2 Construction
    10.3 Data Frame Axis
    10.4 Summary
20 Summary
Index
Foreword
Python is easy to learn. You can learn the basics in a day and be
productive with it. With only an understanding of Python, moving
to pandas can be difficult or confusing. This book is meant to aid
you in mastering pandas.
I have taught Python and pandas to many people over the years,
in large corporate environments, small startups, and at Python
and Data Science conferences. I have seen what hangs people up
and what confuses them. With the correct background, an attitude of
acceptance, and a deep breath, much of this confusion evaporates.
Having said this, pandas is an excellent tool. Many are using it
around the world to great success. I hope you do as well.
Cheers!
Matt
Chapter 1
Introduction
Chapter 2
Installation
Python 3 has been out for a while now, and people claim it is the
future. In an attempt to be modern, this book uses Python 3
throughout! Do not despair; the code will run in Python 2 as well.
In fact, review versions of the book neglected to list the Python
version, and there was only a single complaint about a superfluous
list(range(10)) call; only a single line of (Python 2) code is
needed for compatibility. Once pandas is installed, you can verify
the install by importing the library and checking the version:
$ python
>>> import pandas
>>> pandas.__version__
'0.22.0'
Pandas can also be installed from source. I feel the need to advise
you that you might spend a bit of time going down this rabbit hole
if you are not familiar with getting compiler toolchains installed on
your system.
It may be necessary to prep the operating system for building
pandas from source by installing dependencies and the proper
header files for Python. On Ubuntu this is straightforward; other
environments may be different.
After a while, pandas should be ready for use. Try to import the
library and check the version, as shown above.
3 https://www.continuum.io/downloads
4 http://pip-installer.org/
2.2 scipy.stats
Some nicer plotting features require scipy.stats. This library is not
required, but pandas will complain if the user tries to perform an
action that has this dependency. scipy.stats has many non-Python
dependencies and in practice turns out to be a little more involved
to install. For Ubuntu, a handful of system packages are required
before a pip install scipy will work.
2.3 Summary
Unlike "pure" Python modules, pandas is not just a pip install away
unless you have an environment configured to build it. The easiest
way to get going is to use the Anaconda Python distribution. It is
indeed possible to install pandas using other methods.
5 http://www.virtualenv.org
6 https://store.continuum.io/cshop/anaconda/
Chapter 3
Data Structures
The most widely used data structures are the Series and the
DataFrame that deal with array data and tabular data respectively.
An analogy with the spreadsheet world illustrates the basic
differences between these types. A DataFrame is similar to a sheet
with rows and columns, while a Series is similar to a single column
of data. A Panel is a group of sheets. Likewise, in pandas a
DataFrame holds multiple Series.
Diving into these core data structures a little more is useful
because a bit of understanding goes a long way towards better use
of the library. This book will ignore the Panel because I have yet to
see anyone use it in the real world (as of 0.20 it is deprecated). On
the other hand, we will spend a good portion of time discussing
the Series and DataFrame. Both the Series and DataFrame share
features. For example, they both have an index, which we will need
to examine to understand how pandas works.
Also, because the DataFrame can be thought of as a collection of
columns that are really Series objects, it is imperative that we have
a comprehensive study of the Series first. Additionally, we see this
when we iterate over rows, and the rows are represented as Series.
Figure 32: Figure showing the relation between the main data structures in
pandas. Namely, that a data frame can have multiple series, and a panel has
multiple data frames.
3.1 Summary
The pandas library includes three main data structures and
associated functions for manipulating them. This book will focus
on the Series and DataFrame. First, we will look at the Series as
the DataFrame can be thought of as a collection of Series.
Chapter 4
Series
Artist Data
0 145
1 142
2 38
3 13
>>> ser = {
...     'index': [0, 1, 2, 3],
...     'data': [145, 142, 38, 13],
...     'name': 'songs'
... }
The get function defined below can pull items out of this data
structure based on the index:
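A sketch of such a function, assuming the ser structure above:
>>> def get(ser, idx):
...     value_idx = ser['index'].index(idx)
...     return ser['data'][value_idx]
>>> get(ser, 1)
142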
Note
The code samples in this book are shown as if they were typed
directly into an interpreter. Lines starting with >>> and ...
are interpreter markers for the input prompt and continuation
prompt respectively. Lines that are not prefixed by one of those
sequences are the output from the interpreter after running
the code.
The Python interpreter will print the return value of
the last invocation (even if the print statement is missing)
automatically. If you desire to use the code samples found
in this book, leave the interpreter prompts out.
>>> songs = {
...     'index': ['Paul', 'John', 'George', 'Ringo'],
...     'data': [145, 142, 38, 13],
...     'name': 'counts'
... }
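The songs2 series displayed below is presumably built from the same
data; a sketch:
>>> import pandas as pd
>>> songs2 = pd.Series([145, 142, 38, 13],
...     name='counts')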
>>> songs2
0    145
1    142
2     38
3     13
Name: counts, dtype: int64
When the interpreter prints our series, pandas makes a best effort
to format it for the current terminal size. The leftmost column is the
index column which contains entries for the index. The generic name
for an index is an axis, and the values of the index—0, 1, 2, 3—are
called axis labels. The two-dimensional structure in pandas—a
DataFrame—has two axes, one for the rows and another for the
columns.
The rightmost column in the output contains the values of the
series. In this case, they are integers (the console representation says
dtype: int64, dtype meaning data type, and int64 meaning 64-bit
integer), but in general values of a Series can hold strings, floats,
booleans, or arbitrary Python objects. To get the best speed (such
as vectorized operations), the values should be of the same type,
though this is not required.
It is easy to inspect the index of a series (or data frame), as it is
an attribute of the object:
Note
The index can be string based as well, in which case pandas
indicates that the datatype for the index is object (not string):
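A sketch of constructing such a series, with the string index
assumed from the output below:
>>> songs3 = pd.Series([145, 142, 38, 13],
...     index=['Paul', 'John', 'George', 'Ringo'],
...     name='counts')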
>>> songs3
Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: int64
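The ringo series shown next mixes strings, a number, and an
arbitrary object; a sketch of its construction (the Foo class here
is a stand-in):
>>> class Foo:
...     pass
>>> ringo = pd.Series(
...     ['Richard', 'Starkey', 13, Foo()],
...     name='ringo')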
>>> ringo
0                                  Richard
1                                  Starkey
2                                       13
3    <__main__.Foo instance at 0x...>
Name: ringo, dtype: object
4.3 The NaN value
Note
One thing to note is that the type of this series is float64, not
int64! The type is a float because the only numeric type that
supports NaN is the float type. When pandas sees numeric data
(2) as well as the None, it coerced the 2 to a float value.
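A minimal sketch of this coercion:
>>> nan_ser = pd.Series([2, None])
>>> nan_ser
0    2.0
1    NaN
dtype: float64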
Note
If you load data from a CSV file, an empty value for an
otherwise numeric column will become NaN. Later, methods
such as .fillna and .dropna will explain how to deal with NaN.
None, NaN, nan, and null are synonyms in this book when referring
to empty or missing data found in a pandas series or data frame.
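The mask below is presumably built by broadcasting a comparison
against the median; a sketch:
>>> mask = songs3 > songs3.median()  # boolean array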
>>> mask
Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool
NumPy also has filtering by boolean arrays, but lacks the .median
method on an array. Instead, NumPy provides a median function in
the NumPy namespace:
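A sketch, using the same counts as a NumPy array:
>>> import numpy as np
>>> numpy_ser = np.array([145, 142, 38, 13])
>>> np.median(numpy_ser)
90.0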
Note
Both NumPy and pandas have adopted the convention of using
import statements in combination with an as statement to
rename their imports to two letter acronyms:
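The elided import lines presumably look like this:
>>> import numpy as np
>>> import pandas as pd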
4.5 Categorical Data
7 Type import this into an interpreter to see the Zen of Python. Or search for "PEP 20".
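A sketch of how the categorical series discussed below might have
been built (assuming the CategoricalDtype API available since
pandas 0.21):
>>> s3 = pd.Series(['m', 'l', 'xs', 's', 'xl'])
>>> size_type = pd.api.types.CategoricalDtype(
...     categories=['s', 'm', 'l'], ordered=True)
>>> s3 = s3.astype(size_type)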
In this case, we limited the categories to just ’s’, ’m’, and ’l’,
but the data had values that were not in those categories. These
extra values were replaced with NaN.
If we have ordered categories, we can do comparisons on them:
>>> s3 > 's'
0     True
1     True
2    False
3    False
4    False
dtype: bool
Note
String and datetime series have a str and dt attribute that allow
us to perform common operations specific to that type. If we
convert these types to categorical types, we can still use the str
or dt attributes on them.
4.6 Summary
The Series object is a one-dimensional data structure. It can hold
numerical data, time data, strings, or arbitrary Python objects. If you
are dealing with numeric data, using pandas rather than a Python
list will give you additional benefits. Pandas is faster, consumes less
memory, and comes with built-in methods that are very useful to
manipulate the data. Also, the index abstraction allows for accessing
values by position or label. A Series can also have empty values,
and has some similarities to NumPy arrays. It is the basic workhorse
of pandas; mastering it will pay dividends.
Chapter 5
Series CRUD
The pandas Series data structure provides support for the basic
CRUD operations—create, read, update, and delete. One thing to
be aware of is that in general pandas objects tend to behave in an
immutable manner. Although they are mutable, you don’t normally
update a series, but rather perform an operation that will return a
new Series. Exceptions to this are noted throughout the book.
5.1 Creation
It is easy to create a series from a Python list. Here we create a
series with the count of songs attributed to George Harrison during
the final years of The Beatles and the release of his 1970 album, All
Things Must Pass. The index is specified as the second parameter
using a list of string years as values. Note that 1970 is included once
for George’s work as a member of the Beatles and repeated for his
solo album:
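A sketch of that construction (the same series is created again in
a later chapter):
>>> george_dupe = pd.Series([10, 7, 1, 22],
...     index=['1968', '1969', '1970', '1970'],
...     name='George Songs')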
>>> george_dupe
1968    10
1969     7
1970     1
1970    22
Name: George Songs, dtype: int64
This series was created with a list and an explicit index. A series
can also be created with a dictionary that maps index entries to
values. If a dictionary is used, an additional sequence containing
the order of the index is mandatory. This last parameter is necessary
because a Python dictionary is not ordered.
For our current data, creating this series from a dictionary is less
powerful, because it cannot place different values in a series for the
same index label (a dictionary has unique keys, and we are using
the keys as index labels). One might attempt to get around this by
mapping the label to a list of values, but these attempts will fail. The
list of values will be interpreted as a Python list, not two separate
entries:
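A sketch of such an attempt:
>>> g2 = pd.Series({'1969': 7, '1970': [1, 22]},
...     index=['1969', '1970', '1970'])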
>>> g2
1969          7
1970    [1, 22]
1970    [1, 22]
dtype: object
Tip
If you need to have multiple values for an index entry, use a
list to specify both the index and values.
5.2 Reading
An index operation (in the Python sense, using the [] syntax) in
combination with the index value (in the pandas sense), will read
or select data from the series:
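A sketch of reading both a unique and a duplicated label:
>>> george_dupe['1968']
10
>>> george_dupe['1970']
1970     1
1970    22
Name: George Songs, dtype: int64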
Note
Care must be taken when working with data that has non-
unique index values. Scalar values and Series objects have a
different interface and trying to treat them the same will lead
to errors.
>>> 22 in george_dupe
False
If you want to test a series for membership, test against the set
of the series or the .values attribute:
>>> 22 in set(george_dupe)
True
We can iterate over tuples containing both the index label and
the value, with the .iteritems method:
>>> for item in george_dupe.iteritems():
...     print(item)
('1968', 10)
('1969', 7)
('1970', 1)
('1970', 22)
5.3 Updating
Updating values in a series can be a little tricky as well. To update
a value for a given index label, the standard index assignment
operation works and performs the update in-place (in effect
mutating the series):
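A sketch of the assignments that presumably produce the output
below:
>>> george_dupe['1969'] = 6
>>> george_dupe['1970'] = 2
>>> george_dupe['1973'] = 11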
>>> george_dupe
1968    10
1969     6
1970     2
1970     2
1973    11
Name: George Songs, dtype: int64
Both values for 1970 were set to 2. If you had to deal with data
such as this, it would probably be better to use a data frame with a
column for artist (i.e. Beatles, or George Harrison) or a multi-index
(described later). We can update values, based purely on position,
by performing an index assignment on the .iloc attribute:
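A sketch, restoring the second 1970 value:
>>> george_dupe.iloc[3] = 22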
1968    10
1969     6
1970     2
1970    22
1973    11
Name: George Songs, dtype: int64
Note
There is an .append method on the series object, but it does not
behave like the Python list's .append method. It is somewhat
analogous to the Python list's .extend method in that it expects
another series to append to it.
The series object has a .set_value method that will both add a new
item (here, a count for 1974) to the existing series and return the
updated series.
5.4 Deletion
Deletion is not common in the pandas world. It is more common
to use filters or masks to create a new series that has only the items
that you want. However, if you really want to remove entries, you
can delete based on index entries.
Recent versions of pandas support the del statement, which
deletes based on the index:
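A sketch of the deletion that presumably produced the output below
(the 1973 entry is gone):
>>> del george_dupe['1973']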
>>> george_dupe
1968    10
1969     6
1970     2
1970    22
1974     9
Name: George Songs, dtype: int64
Note
The del statement appears to have problems with duplicate
index values (as of version 0.14.1):
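A sketch of the example, assuming a small series with a repeated
index:
>>> s = pd.Series([2, 3, 4], index=[1, 2, 1])
>>> del s[1]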
>>> s
1    4
dtype: int64
One might assume that del would remove any entries with
that index value. For some reason, it also appears to have
removed index 2 but left the second index 1.
One way to delete values from a series is to filter the series to get a
new series. Here is a basic filter that returns all values less than or
equal to 2. The example below uses a boolean array inlined into the
index operation. This is common in NumPy but not supported in
normal Python lists or dictionaries:
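A sketch of such a filter:
>>> george_dupe[george_dupe <= 2]
1970    2
Name: George Songs, dtype: int64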
5.5 Summary
A Series doesn’t just hold data. It allows you to get at the data,
update it, or remove it. Often, we perform these operations through
the index. We have just scratched the surface in this chapter. In
future chapters, we will dive deeper into the Series.
Chapter 6
Series Indexing
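The examples in this chapter use the following series; a sketch of
its construction:
>>> george = pd.Series([10, 7],
...     index=['1968', '1969'],
...     name='George Songs')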
>>> george
1968    10
1969     7
Name: George Songs, dtype: int64
Much like numpy arrays, a Series object can be both indexed and
sliced along the axis. Indexing pulls out either a scalar or
multiple values (if there are non-unique index labels):
>>> george
1968    10
1969     7
Name: George Songs, dtype: int64
8 http://pandas.pydata.org/pandas-docs/stable/10min.html
6.1 .iloc and .loc
Note
As we have seen, the result of an index operation may not be a
scalar. If the index labels are not unique, it is possible that the
index operation returns a sub-series rather than a scalar value:
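A sketch of such a series and lookup:
>>> dupe = pd.Series([10, 2, 7],
...     index=['1968', '1968', '1969'],
...     name='George Songs')
>>> dupe['1968']
1968    10
1968     2
Name: George Songs, dtype: int64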
>>> dupe
1968    10
1968     2
1969     7
Name: George Songs, dtype: int64
Note
If the index is already using integer labels, then the fallback to
position based indexing does not work!:
6.2 .at and .iat
1969    7
Name: George Songs, dtype: int64
Because george has string index values, .loc will fail when
indexing with an integer.
1970    22
Name: George Songs, dtype: int64
One reason to use these accessors is that they are about three
times faster than the .loc and .iloc alternatives.
6.3 .ix
.ix is similar to [] indexing. Because it tries to support both
positional and label-based indexing, I advise against its use in
general. It tends to lead to confusing results and violates the notion
that “explicit is better than implicit”. (As of 0.20 it is deprecated):
The pandas documentation notes one case where .ix turns out to be
useful: if you are using pivot tables or stacking (as described
later).
6.6 Slicing
As mentioned, slicing can be performed on the index
attributes—.iloc and .loc. When we slice, we pull out a range of
index locations, and the result is a series. If we only pass in a single
index location to an index operation, the result is a scalar (assuming
unique index keys).
Slices take the form of [start]:[end][:stride] where start,
end, and stride are integers and the square brackets represent
optional values. The table below explains slicing possibilities for
.iloc:
Slice Result
0:1 First item
:1 First item (start default is 0)
:-2 Take from the start up to the second to last item
::2 Take from start to the end skipping every other item
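The mask below is presumably built with a broadcast comparison; a
sketch:
>>> mask = george > 7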
>>> mask
1968     True
1969    False
Name: George Songs, dtype: bool
Note
Boolean arrays might be confusing for programmers used to
Python, but not NumPy. Taking a series and applying an
operation to each value of the series is known as broadcasting.
The > operation is broadcasted, or applied, to every entry in
the series. This returns a new series with the result of each of
those operations. Because the result of applying the greater
than operator to each value returns a boolean, the final result
is a new series with the same index labels as the original, but
each value is True or False. This is referred to as a boolean
array.
There are other broadcasting operations. Here we increment
the numerical values by adding two to them:
>>> george + 2
1968    12
1969     9
Name: George Songs, dtype: int64
Operation Example
And ser[a & b]
Or ser[a | b]
Not ser[~a]
Tip
If you inline boolean array operations, make sure to surround
them with parentheses.
>>> mask
1968     True
1969    False
Name: George Songs, dtype: bool
If you use a boolean mask with .iloc, you will get an error.
6.9 Summary
In this chapter, we looked at the index. Through index operations,
we can pull values out of a series. Because you can pull out values
by both position and label, indexing can be a little complicated.
Using .loc and .iloc allows you to be more explicit about indexing
operations. We can also use slicing to pull out values.
Chapter 7
Series Methods
A Series object has many attributes and methods that are useful
for data analysis. This section will cover a few of them.
In general, methods don’t mutate their values but return a new
Series object. Most of the methods returning a new instance also
have an inplace or a copy parameter. This is because the default
behavior tends towards immutability, and these optional parameters
default to False and True respectively.
Note
The inplace and copy parameters are the logical complement
of each other. Luckily, a method will only take one of them.
This is one of those slight inconsistencies found in the library.
In practice, immutability works out well and both of these
parameters can be ignored.
The examples in this chapter will use the following series. They
contain the count of Beatles songs attributed to individual band
members in the years 1966 and 1969:
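A sketch of those two series, reconstructed from the outputs that
appear throughout this chapter:
>>> songs_66 = pd.Series([3, None, 11, 9],
...     index=['George', 'Ringo', 'John', 'Paul'],
...     name='Counts')
>>> songs_69 = pd.Series([18, 22, 7, 5],
...     index=['John', 'Paul', 'George', 'Ringo'],
...     name='Counts')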
7.1 Iteration
Iteration over a series iterates over the values:
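A minimal sketch:
>>> for value in songs_66:
...     print(value)
3.0
nan
11.0
9.0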
Note
Python supports unpacking or destructuring during variable
assignment, which includes iteration (as seen in the example
above). The .iteritems method returns a sequence of index,
value tuples. By using unpacking, we can immediately put
them each in their own variables.
If Python did not support this feature, we would have to
create an intermediate variable to hold the tuple (which works,
but adds a few more lines of code).
7.2 Overloaded operations
Operation  Result
+          Adds scalar (or series with matching index values), returns Series
-          Subtracts scalar (or series with matching index values), returns Series
/          Divides scalar (or series with matching index values), returns Series
//         "Floor" divides scalar (or series with matching index values), returns Series
*          Multiplies scalar (or series with matching index values), returns Series
%          Modulus scalar (or series with matching index values), returns Series
==, !=     Equality scalar (or series with matching index values), returns Series
>, <       Greater/less than scalar (or series with matching index values), returns Series
>=, <=     Greater/less than or equal scalar (or series with matching index values), returns Series
^          Binary XOR, returns Series
|          Binary OR, returns Series
&          Binary AND, returns Series
>>> songs_66 + 2
George     5.0
Ringo      NaN
John      13.0
Paul      11.0
Name: Counts, dtype: float64
Note
Broadcasting is a NumPy and pandas feature. A normal
Python list supports some of the operations listed in the prior
table, but not in the elementwise manner that NumPy and
pandas provide. Multiplying a list, for example, repeats its
contents rather than operating on each element:
>>> [1, 3, 4] * 2
[1, 3, 4, 1, 3, 4]
Note
There is a corresponding method for each operator. The .add
and .sub methods are thin wrappers around the operator.add
and operator.sub functions, which call out to + and -
respectively.
There is an .eq (==) method and an .equals method. They
have slightly different behavior: the latter treats NaN values as
equal, while the former does not. If you are writing unit tests
to compare dataframes, this distinction is important.
When we add two series, pandas will align the index values. Index
alignment means it only adds those items whose index occurs in
both series, otherwise, it inserts a NaN for index values found only in
one of the series. Note that though Ringo appears in both indices,
he has a value of NaN in songs_66 (leading to NaN as the result of
the addition operation):
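A sketch of the aligned addition:
>>> songs_66 + songs_69
George    10.0
John      29.0
Paul      31.0
Ringo      NaN
Name: Counts, dtype: float64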
Note
The above result might be problematic. Should the count of
Ringo songs really be unknown? In this case, we use the fillna
method to replace NaN with zero and give us a better answer:
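A sketch:
>>> songs_66.fillna(0) + songs_69.fillna(0)
George    10.0
John      29.0
Paul      31.0
Ringo      5.0
Name: Counts, dtype: float64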
Note
If you want to provide missing numbers, you can call .fillna
with 0 or use the .add method and set the fill_value
parameter. If the value is missing for the same index in both
series, fill_value has no effect:
>>> a = pd.Series([None, 1], index=[1, 2])
>>> a.add(a, fill_value=0)
7.4 Getting and Setting Values
Method                   Result
get(label, [default])    Returns a scalar (or Series if duplicate indexes) for label, or default on failed lookup
get_value(label)         Returns a scalar (or Series if duplicate indexes) for label
set_value(label, value)  Returns a new Series with label and value inserted (or updated)
>>> songs_66
George     3.0
Ringo      NaN
John      11.0
Paul       9.0
Name: Counts, dtype: float64
Note
Valid attribute names are names that begin with letters and
contain alphanumerics or underscores. If an index name
contains spaces, you couldn't use dotted attribute access to
read it, but index access would work fine.
Note
The Python language gives you great flexibility. But with
that flexibility comes responsibility (paraphrasing Spiderman
here). Because dotted attribute setting is possible, one can
overwrite some of the methods and attributes of a series.
Below is a series that has various index names. normal is a
perfectly valid index name. median is a fine name, but is also
the name of the method for calculating the median. class is
another name that would be fine if it wasn’t a reserved name
in Python. index is another name that conflicts with a pandas
attribute:
>>> ser = pd.Series([1, 2, 3, 4],
...     index=['normal', 'median', 'class', 'index'])
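Dotted assignment to a safe name like normal works; a sketch of
the presumably elided step that explains the first value in the
output below:
>>> ser.normal = 4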
>>> ser
normal    4
median    2
class     3
index     4
dtype: int64
Also, .set_value will update all the values for a given index. If
you have non-unique indexes and only want to update one of the
values for a repeated index, this cannot be done via .set_value.
Tip
One way to update only one value for a repeated index label is
to update by position. The following series repeats the index
label 1970:
>>> george = pd.Series([10, 7, 1, 22],
...     index=['1968', '1969', '1970', '1970'],
...     name='George Songs')
>>> george
1968    10
1969     7
1970     1
1970    22
Name: George Songs, dtype: int64
We can update just the first value for 1970, using the .iloc
index assignment:
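A sketch (the exact replacement value here is a guess):
>>> george.iloc[2] = 3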
7.5 Reset Index
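A minimal sketch of resetting the index on a series (the result is
a data frame, with the old index moved into a column):
>>> songs_66.reset_index()
    index  Counts
0  George     3.0
1   Ringo     NaN
2    John    11.0
3    Paul     9.0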
Note
Some examples in this section explicitly call the list function on
a range object because the author is using Python 3 in this book.
In Python 3, range is an iterable that does not materialize
the contents of the sequence until it is iterated over. It behaves
similarly to Python 2's xrange built-in.
This code (as with most of the code in this book) will still
work in Python 2.
7.6 Counts
This section will explore how to get an overview of the data found
in a series. For the following examples we will use two series. The
songs_66 series:
>>> songs_66
George     3.0
Ringo      NaN
John      11.0
Paul       9.0
Name: Counts, dtype: float64
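The scores2 series displayed below is presumably built like this
sketch:
>>> scores2 = pd.Series([67.3, 100, 96.7, None, 100],
...     index=['Ringo', 'Paul', 'George', 'Peter', 'Billy'],
...     name='test2')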
>>> scores2
Ringo      67.3
Paul      100.0
George     96.7
Peter       NaN
Billy     100.0
Name: test2, dtype: float64
A few methods are provided to get a feel for the counts of the
entries, how many are unique, and how many are duplicated. Given
a series, the .count method returns the number of non-null items.
The scores2 series has 5 entries but one of them is None, so .count
only returns 4:
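>>> scores2.count()
4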
7.7 Statistics
There are many basic statistical measures provided by the series
object. We will look at a few of them in this section.
One of the most basic measurements is the sum of the values in
a series:
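Assuming the songs_66 sketch from above:
>>> songs_66.sum()
23.0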
Note
Most of the methods that perform a calculation ignore NaN.
Some also provide an optional parameter—skipna—to change
that behavior. But in practice, if you do not ignore NaN, the
result is nan:
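>>> songs_66.sum(skipna=False)
nan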
The series also has methods to find the minimum and maximum
for the values, .min and .max. Also, there are methods to get the
index location of the minimum and maximum index labels, .idxmin
and .idxmax:
>>> songs_66.min()
3.0
In this case, the sample size is so low that it is hard to say much
about the data. Also, given the multimodal nature of all_songs,
the skew is hard to interpret. In a unimodal distribution, negative
skew has a long left tail:
Applied to the all_songs series, such a method tells us how many more songs they
wrote each year than the previous year:
Note that even though the value is rounded, the type is still a
float.
Numbers can be clipped between lower and upper thresholds
using the .clip method. This method does not change the type
either:
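A sketch:
>>> songs_66.clip(lower=5, upper=10)
George     5.0
Ringo      NaN
John      10.0
Paul       9.0
Name: Counts, dtype: float64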
7.8 Convert Types
Final Type Method
string use .astype(str)
numeric use pd.to_numeric
integer use .astype(int), note that this will fail with NaN
datetime use pd.to_datetime
category use .astype("category")
The types that we can use for numeric conversion are the NumPy
types, for example .astype(np.int8).
Note
If a numeric column is missing values, it may not be converted
to an integer. Only floats have support for NaN in pandas.
7.10 Dealing with None
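The masks below are presumably built with .notnull and .isnull;
first, the mask of valid values:
>>> val_mask = songs_66.notnull()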
>>> val_mask
George     True
Ringo     False
John       True
Paul       True
Name: Counts, dtype: bool
If we want the mask for the NaN positions, we can use .isnull:
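>>> nan_mask = songs_66.isnull()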
>>> nan_mask
George    False
Ringo      True
John      False
Paul      False
Name: Counts, dtype: bool
Note
We can flip a boolean mask by applying the not operator (~):
>>> ~nan_mask
George     True
Ringo     False
John       True
Paul       True
Name: Counts, dtype: bool
Locating the position of the first and last valid index values is simple
as well, using the .first_valid_index and .last_valid_index
methods respectively:
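A sketch, assuming the songs_66 series from above:
>>> songs_66.first_valid_index()
'George'
>>> songs_66.last_valid_index()
'Paul'
7.11 Matrix Operations
Transposing a series with the .T property simply returns the same
series: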
>>> songs_66.T
George     3.0
Ringo      NaN
John      11.0
Paul       9.0
Name: Counts, dtype: float64
7.12 Append, combining, and joining two series
The .update method sets values from another series. It accepts a
new series and replaces matching index values with the values from
the passed-in series:
>>> songs_66
George     7.0
Ringo      5.0
John      18.0
Paul      22.0
Name: Counts, dtype: float64
Note
.update is another method that is an anomaly from most other
pandas methods. It behaves similarly to the .update method
of a native Python dictionary—it does not return anything and
updates the values in place. Tread with caution.
7.13 Sorting
There used to be a .sort method in pandas, but it has since
been deprecated and removed. The suggested replacement is the
.sort_values method, which returns a new series.
Note
The .sort_values method exposes the kind parameter. The
default value is ’quicksort’, which is generally fast. Another
option to pass to kind is ’mergesort’. When this is passed,
.sort_values performs a stable sort (so that items that sort in
the same position will not move relative to one another) when
this method is invoked. Here’s a small example:
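A sketch of a stable sort over equal values:
>>> s = pd.Series([2, 2, 2], index=['a2', 'a1', 'a3'])
>>> s.sort_values(kind='mergesort')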
a2    2
a1    2
a3    2
dtype: int64
Note
The .order method in pandas is similar to .sort and
.sort_values. It is deprecated as of 0.18, so please use
.sort_values instead.
7.14 Applying a function
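A minimal sketch of applying a function to every value of a series
with .apply (the doubling function here is hypothetical):
>>> songs_66.apply(lambda count: count * 2)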
Note
0.20 also introduced the .transform method, which is a
thin wrapper around .agg. It will check if a reduction was
performed and raise a ValueError in that case.
7.15 Serialization
We have seen examples that create a Series object from a list, a
dictionary, or another series. Also, a series will serialize to and from
a CSV file.
We can save a series as a CSV file, by passing a file object to the
.to_csv method. The following example shows how this is done
with a StringIO object (it implements the file interface, but allows
us to easily inspect the results):
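A sketch:
>>> from io import StringIO
>>> fout = StringIO()
>>> songs_66.to_csv(fout, header=True)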
Note
Some of the intentions of Python 3 were to make things
consistent and clean up warts or annoyances in Python 2.
Python 3 created an io module to handle reading and writing
from streams. In Python 2 the import above should be:
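>>> from StringIO import StringIO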
The best practice in Python for dealing with real files is to use a
context manager. This will automatically close the file for you when
the indented block exits:
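A sketch, with a hypothetical path:
>>> with open('/tmp/songs_66.csv', 'w') as fout:
...     songs_66.to_csv(fout, header=True)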
Note
The name of the series must be specified for the header of the
values to appear. This can be passed in as a parameter during
creation. Alternatively, you can set the .name attribute of the
series.
In this case, the values of the series are strings (notice the dtype:
object). This is because the header was parsed as a value, and not as
the header. The pandas parsing code was not able to coerce Counts
into a numerical value, and assumed the column had string values.
Here is a second attempt that reads in the data as numerics and uses
line zero as the header:
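A sketch (squeeze=True asks read_csv for a Series rather than a
DataFrame):
>>> songs_again = pd.read_csv(StringIO(fout.getvalue()),
...     header=0, index_col=0, squeeze=True)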
Note
The .from_csv method is deprecated; use the read_csv function
instead.
Note
In practice, when dealing with data frames, the read_csv
function is used, rather than invoking the .from_csv class-
method on Series or DataFrame. The result of this function is
a DataFrame rather than a Series:
>>> df['Counts']
Name
George     7.0
Ringo      5.0
John      18.0
Paul      22.0
Name: Counts, dtype: float64
71
7. Series Methods
Method Result
cat Concatenate list of strings onto items
center Centers strings to width
contains Boolean for whether pattern matches
count Count pattern occurs in string
decode Decode a codec encoding
encode Encode a codec encoding
endswith Boolean if strings end with item
findall Find pattern in string
get Attribute access on items
join Join items with separator
len Return length of items
lower Lowercase the items
lstrip Remove whitespace on left of items
match Find groups in items from the pattern
pad Pad the items
repeat Repeat the string a certain number of times
replace Replace a pattern with a new value
rstrip Remove whitespace on the right of items
slice Pull out slices from strings (we can also slice directly on
the .str attribute)
split Split items by pattern
startswith Boolean if strings starts with item
strip Remove whitespace from the items
title Titlecase the items
upper Uppercase the items
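A sketch of using the .str attribute:
>>> names = pd.Series(['George', 'John', 'Paul', 'Ringo'])
>>> names.str.lower()
0    george
1      john
2      paul
3     ringo
dtype: object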
7.17 Date Operations
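The times series shown below holds date strings in various
formats; a sketch of its construction:
>>> times = pd.Series(['9 Oct 1940', '6/18/1942',
...     'Feb 25, 1943', '7-7-1940'])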
>>> times
0      9 Oct 1940
1       6/18/1942
2    Feb 25, 1943
3        7-7-1940
dtype: object
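Conversion to real datetime values presumably uses pd.to_datetime:
>>> times2 = pd.to_datetime(times)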
>>> times2
0   1940-10-09
1   1942-06-18
2   1943-02-25
3   1940-07-07
dtype: datetime64[ns]
When you have a series that has dates in it (rather than string
objects), you can take advantage of date manipulation and using the
.dt attribute. This attribute has many methods for interacting with
dates. A simple example is shown below that reformats the dates:
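A sketch using .dt.strftime:
>>> times2.dt.strftime('%d-%m-%Y')
0    09-10-1940
1    18-06-1942
2    25-02-1943
3    07-07-1940
dtype: object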
The following table lists the attributes found on the .dt attribute:
Attribute Result
date Date without timestamp
day Day of month
dayofweek Day number (Monday=0)
dayofyear Day of year
days_in_month Number of days in month
daysinmonth Number of days in month
hour Hours of timestamp
is_month_end Is last day of month
is_month_start Is first day of month
is_quarter_end Is last day of quarter
is_quarter_start Is first day of quarter
is_year_end Is last day of year
is_year_start Is first day of year
microsecond Microseconds of timestamp
minute Minutes of timestamp
month Month number (Jan=1)
nanosecond Nanoseconds of timestamp
quarter Quarter of date
second Seconds of timestamp
time Time without date
tz Timezone
week Week of year
weekday Day number (Monday=0)
weekofyear Week of year
year Year
7.18 Summary
This has been a long chapter. That is because there are a lot of
methods on the Series object. We have looked at looping over the
values, overloaded operations, accessing values, changing the index,
basic stats, coercion, dealing with missing values, and more. You
should have a basic understanding of the power of the Series. In
the next chapter, we will look at how to plot with a Series.
Chapter 8
Series Plotting
Note that the index values have some overlap and that there is a
NaN value as well.
The .plot method plots the index against the value. If you are
running from Jupyter or an interpreter, a matplotlib plot will appear
when calling that method. In the examples in the book, we save
the plot as a png file, which requires a bit more boilerplate. (The
matplotlib.pyplot library needs to be loaded and a Figure object
created to hold the plot.)
9 matplotlib (http://matplotlib.org/) also refers to itself in lowercase.
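A sketch of that boilerplate (assuming a series such as scores2
from the previous chapter, and a hypothetical output path):
>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> _ = scores2.plot(ax=ax)
>>> fig.savefig('/tmp/series-plot.png')  # hypothetical path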
Figure 81: Plotting two series that have string indexes. The default plot type
is a line plot.
By default, .plot creates line charts, but it can also create bar
charts by changing the kind parameter. The bar chart is not stacked
by default, so the bars will occlude one another. We address this in
the example below by setting color for scores2 to black ('k') and
lowering the transparency by setting the alpha parameter.
Figure 82: Plotting two series that have string indexes as bar plots.
Figure 85: pandas can generate nice KDE charts if scipy.stats is installed
A KDE plot is similar to a histogram, but rather than using bins to
represent areas where numbers fall, it plots a curved line:
>>> _ = data.plot(kind='kde', color='b', alpha=.6, ax=ax)  # requires scipy.stats
8.2 Summary
In this chapter, we examined plotting a Series. The pandas library
provides some hooks to the matplotlib library. These can be really
convenient. When you need more power, you can drop into raw
matplotlib commands. In the next chapter, we will wrap up our
coverage of the Series, by looking at an example of a basic analysis.
10 http://stanford.edu/~mwaskom/software/seaborn/
Chapter 9
Another Series Example
9.1 Standard Python
First, we will load the data and store it in a variable. Note that
we are using Python 3 here; in Python 2, we would have to call
.decode('utf-8') because the file contains UTF-8 encoded accented
characters:
>>> filename = '/usr/share/dict/american-english'
>>> data = open(filename).read()
Now, the newlines are removed and the results are flattened into
a single string:
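A sketch:
>>> data = ''.join(data.split())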
With a big string containing the letters of all the words, the
built-in collections.Counter class makes easy work of counting
letter frequency:
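>>> from collections import Counter
>>> counts = Counter(data)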
This is quick and dirty, though it has a few issues. Certainly the
built-in Python tools could handle dealing with this data. But this
book is discussing pandas, so let’s look at the pandas version.
A series that maps letters (as index values) to counts will probably
allow basic analysis similar to Wikipedia. The question is how to
get there?
One way is to create a new series, counts. This series will have
letters in the index, and counts of those letters as the values. We
can create it by iterating over the words using the .apply method to
add the count of every letter to counts. We will also lowercase the
letters to normalize them:
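A sketch of that approach (the names used here are assumptions):
>>> words = pd.Series(open(filename).read().split())
>>> counts = pd.Series(0, index=sorted(set(''.join(words).lower())))
>>> def count_letters(word):
...     for let in word.lower():
...         counts[let] += 1
>>> _ = words.apply(count_letters)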
Figure 92: Figure showing the default plot of letter counts. Note that the
default plot is a line plot.
I'll run it on the source of this book (which contains both the
code and the text) and see what happens.
A brief look at this indicates that the text of this book is abnormal
relative to normal English. Also, were I to customize my keyboard
based on this text, the non-alphabetic characters that I hit the
most—space, period, return, equals, and greater than—should be
pretty close to the home row. It seems that I need a larger corpus
to sample from and that my current keyboard layout is not optimal.
My most popular characters do not have keys on the home row.
Again, we can visualize this quickly using the .plot method:
Figure 95: Figure showing letter frequency of the text of this book
Note
I am currently typing with the Norman layout13 on my
ergonomic keyboard.
9.5 Summary
This chapter concludes our Series coverage. We examined loading
data into a Series, processing it, and plotting it. We also saw how we
could do similar processing with only the Python Standard Library.
While that code is straightforward, once we start tweaking the data
and plotting it, the pandas version becomes more concise and will
be faster.
13 https://normanlayout.info/
Chapter 10
DataFrames
Note
In practice, many highly optimized analytical databases (those
used for OLAP cubes) are also column-oriented. Laying out
the data in a columnar manner can improve performance and
require fewer resources. Columns of a single type can be
compressed easily. Performing analysis on a column requires
loading only that column whereas a row-oriented database
would require reading the complete database to access an entire
column.
>>> df = {
...     'index': [0, 1, 2],
...     'cols': [
...         {'name': 'growth',
...          'data': [.5, .7, 1.2]},
...         {'name': 'Name',
...          'data': ['Paul', 'George', 'Ringo']},
...     ]
... }
Rows are accessed via the index, and columns are accessible from
the column name. Below are simple functions for accessing rows
and columns:
>>> def get_row(df, idx):
...     results = []
...     value_idx = df['index'].index(idx)
...     for col in df['cols']:
...         results.append(col['data'][value_idx])
...     return results
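A matching sketch for column access, plus example calls:
>>> def get_col(df, name):
...     for col in df['cols']:
...         if col['name'] == name:
...             return col['data']
>>> get_col(df, 'Name')
['Paul', 'George', 'Ringo']
>>> get_row(df, 1)
[0.7, 'George']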
10.1 DataFrames
Using the pandas DataFrame object, the previous data structure
could be created like this:
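A sketch:
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'growth': [.5, .7, 1.2],
...     'Name': ['Paul', 'George', 'Ringo']})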
>>> df
     Name  growth
0    Paul     0.5
1  George     0.7
2   Ringo     1.2
Columns are accessible via indexing the column name off of the
object:
>>> df['Name']
0      Paul
1    George
2     Ringo
Name: Name, dtype: object
Note
The DataFrame overrides __getattr__ to allow access to
columns as attributes. This tends to work ok, but will fail if the
column name conflicts with an existing method or attribute.
It will also fail if the column has a non-valid attribute name
(such as a column name with a space):
>>> df.Name
0      Paul
1    George
2     Ringo
Name: Name, dtype: object
The above should provide clues as to why the Series was covered
in such detail. When column operations are required, a series
method is often involved. Also, the index behavior across both
data structures is the same.
10.2 Construction
Data frames can be created from many types of input:
>>> pd.DataFrame([
...     {'growth': .5, 'Name': 'Paul'},
...     {'growth': .7, 'Name': 'George'},
...     {'growth': 1.2, 'Name': 'Ringo'}])
     Name  growth
0    Paul     0.5
1  George     0.7
2   Ringo     1.2
Type                Description
float               Floating point; can be float32 or float64 (default)
int                 Integer number; int64 (default). Can put u in front for unsigned. Can have 8, 16, 32, or 64 bits
datetime64          Datetime number
datetime64[ns, tz]  Datetime number with timezone
timedelta[ns]       A difference between datetimes
category            Used to specify categorical columns
object              Used for other columns such as strings, or Python objects
str                 Used to ensure the column is converted to a string, though the dtype will be object
Figure 102: Data types in pandas
10.3 Data Frame Axis
>>> df.axes
[RangeIndex(start=0, stop=3, step=1),
 Index(['Name', 'growth'], dtype='object')]
Tip
In order to remember which axis is 0 and which is 1 it can be
handy to think back to a Series. It, like a DataFrame, also has
axis 0 along the index. A mnemonic to aid in remembering is
that the 1 looks like a column (axis 1 is across columns):
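A sketch of the mnemonic in action, summing the numeric column
down axis 0:
>>> df[['growth']].sum(axis=0)
growth    2.4
dtype: float64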
10.4 Summary
In this section, we introduced a Python data structure that is similar
to how the pandas data frame is implemented. It illustrated the
index and the columnar nature of the data frame. Then we looked at
the main components of the data frame, and how columns are really
Series objects.
Figure 103: Figure showing the relation between axis 0 and axis 1. Note that
when an operation is applied along axis 0, it is applied down the column.
Likewise, operations along axis 1 operate across the values in the row.
Chapter 11
Data Frame Example
We'll load this race course data into a data frame and use it to
show basic CRUD operations and how to plot.
Reading in CSV files is straightforward in pandas. Here we paste
the contents into a StringIO buffer to emulate a CSV file:
15 Data existed at one point at http://www.wasatch100.com/index.php?option=com_content&vie
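A sketch (the data variable is assumed to hold the CSV text):
>>> from io import StringIO
>>> df = pd.read_csv(StringIO(data))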
>>> df
                              LOCATION  MILES  ELEVATION    CUMUL % CUMUL GAIN
0        Big Mountain Pass Aid Station  39.07       7432  11579.0        43.8%
1                     Mules Ear Meadow  40.75       7478  12008.0        45.4%
2                        Bald Mountain  42.46       7869  12593.0        47.6%
3                          Pence Point  43.99       7521  12813.0        48.4%
4          Alexander Ridge Aid Station  46.90       6160  13169.0        49.8%
5                    Alexander Springs  47.97       5956  13319.0        50.3%
6                Rogers Trail junction  49.52       6698  13967.0        52.8%
7                        Rogers Saddle  49.77       6790  14073.0        53.2%
8                         Railroad Bed  50.15       6520      NaN          NaN
9   Lambs Canyon Underpass Aid Station  52.48       6111  14329.0        54.2%
If you want to change the output width for the duration of your
session, you can set the global pd.options.display.width attribute:
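A sketch (the width value here is arbitrary):
>>> pd.options.display.width = 60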
1 2 \
LOCATION Mules Ear Meadow Bald Mountain
MILES 40.75 42.46
ELEVATION 7478 7869
CUMUL 12008 12593
% CUMUL GAIN 45.4% 47.6%
3 4 \
LOCATION Pence Point Alexander Ridge Aid Station
MILES 43.99 46.9
ELEVATION 7521 6160
CUMUL 12813 13169
% CUMUL GAIN 48.4% 49.8%
5 6 \
LOCATION Alexander Springs Rogers Trail junction
MILES 47.97 49.52
ELEVATION 5956 6698
CUMUL 13319 13967
% CUMUL GAIN 50.3% 52.8%
7 8 \
LOCATION Rogers Saddle Railroad Bed
MILES 49.77 50.15
ELEVATION 6790 6520
CUMUL 14073 NaN
% CUMUL GAIN 53.2% NaN
9
LOCATION Lambs Canyon Underpass Aid Station
MILES 52.48
ELEVATION 6111
CUMUL 14329
% CUMUL GAIN 54.2%
>>> df.corr()
              MILES  ELEVATION     CUMUL
MILES      1.000000  -0.783780  0.986613
ELEVATION -0.783780   1.000000 -0.674333
CUMUL      0.986613  -0.674333  1.000000
The default saved plot is actually empty. (Note that if you are
using Jupyter, this is not the case and a plot will appear if you used
the %matplotlib inline directive). If we want to save a plot of a
data frame that has the image in it, the ax parameter needs to be
passed a matplotlib Axis. Calling fig.add_subplot(111) will give us
one:
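A sketch (the output path is hypothetical):
>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> _ = df.plot(ax=ax)
>>> fig.savefig('/tmp/df-plot.png')  # hypothetical path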
These plots are not perfect, yet they start to show the power that
pandas provides for visualizing data quickly.
The pandas library has some built-in support for the matplotlib
library. Though there are a few quirks, we can easily produce charts
and visualizations.
Figure 111: Default .plot of a data frame containing both numerical and
string data. Note that when we try to save this as a png file it is empty if
we forget the call to add a matplotlib axes to the figure (one way is to call
fig.add_subplot(111)). Within Jupyter notebook, we will see a real plot,
this is only an issue when using pandas to plot and then saving the plot.
Figure 113: Plot using secondary_y parameter to use different scales on the
left and right axis for elevation and distance.
This plot has the problem that the scale of the miles plot is blown
out due to the elevation numbers. pandas allows labelling the other
y-axis (the one on the right side), but to do so requires two calls to
.plot. For the first .plot call, pull out only the elevation columns
using an index operation with a list of (numerical) columns to pull
out. The second call will be made against the series with the mileage
data and a secondary_y parameter set to True. It also requires an
explicit call to plt.legend to place a legend for the mileage:
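A sketch of that two-call approach:
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> _ = df[['CUMUL', 'ELEVATION']].plot(ax=ax)
>>> _ = df['MILES'].plot(secondary_y=True, ax=ax)
>>> _ = plt.legend(loc='best')  # hypothetical placement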
Figure 114: Plot using LOCATION as the x-axis rather than the default (the
index values).
Figure 115: Plot using MILES as the x-axis rather than the default (the index
values).
11.5 Deleting Rows
Note
The .drop method does not work in place. It returns a new
data frame.
This method accepts index labels, which can be pulled out by slicing
the .index attribute as well. This is useful when using text indexes
or to delete large slices of rows. The previous example can be written
as:
>>> df.drop(df.index[5:10:4])
                        LOCATION  MILES  ELEVATION  \
0  Big Mountain Pass Aid Station  39.07       7432
1               Mules Ear Meadow  40.75       7478
2                  Bald Mountain  42.46       7869
3                    Pence Point  43.99       7521
4    Alexander Ridge Aid Station  46.90       6160
6          Rogers Trail junction  49.52       6698
7                  Rogers Saddle  49.77       6790
8                   Railroad Bed  50.15       6520
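The bogus series below is presumably created like this sketch:
>>> bogus = pd.Series(range(11), name='bogus')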
>>> bogus
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
Name: bogus, dtype: int64
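The STATION column that appears in the listing below was presumably
derived from LOCATION; a hypothetical reconstruction:
>>> df['STATION'] = df.LOCATION.str.endswith('Station')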
>>> df.columns
Index(['LOCATION', 'MILES', 'ELEVATION', 'CUMUL',
       '% CUMUL GAIN', 'STATION'], dtype='object')
Note
These operations operate on the data frame in place.
The .drop method accepts an axis parameter and does not work in
place—it returns a new data frame.
Note
It will be more consistent to use .drop with axis=1 than del
or .pop. You will have to get used to the meaning of axis=1,
which you can interpret as “apply this to the columns”.
Working with this data should give you a feeling for the kinds of
operations that are possible on DataFrame objects. This section has
only covered a small portion of them.
11.7 Summary
In this chapter, we saw a quick overview of the data frame. We
saw how to load data from a CSV file. We also looked at CRUD
operations and plotting data.
In the next chapter, we will examine the various members of the
DataFrame object.
Chapter 12
Data Frame Methods
Part of the power of pandas is due to the rich methods that are
built-in to the Series and DataFrame objects. This chapter will look
into many of the attributes of the DataFrame.
The data for this section is sample retail sales data:
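A sketch of that frame, reconstructed from the outputs later in the
chapter:
>>> sales = pd.DataFrame({
...     'UPC': [1234, 1234, 1234, 789, 789, 789, 789],
...     'Units': [5, 2, 3, 1, 2, None, 1],
...     'Sales': [20.2, 8, 13, 2, 3.8, None, 1.8],
...     'Date': ['1-1-2014', '1-2-2014', '1-3-2014', '1-1-2014',
...              '1-2-2014', '1-3-2014', '1-5-2014']})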
For basic information about the object, use the .info method.
Notice that the dtype for UPC is int64. Though UPC appears
number-like here, it is possible to have dashes or other non-numeric
values. It might be preferable to have it stored as a string. Also, the
dtype for Date is object; it would be nice if it were a date instead.
This may prove problematic when doing actual analysis. In later
sections, we will show how to change the type using the .astype
method and the to_datetime function.
The .info method summarizes the types and columns of a data
frame. It also provides insight into how much memory is being
consumed. When you have larger datasets, this information is useful
to see where memory is going. Converting string types to numeric
or date types can go far to help lower the memory usage:
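>>> sales.info()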
12.2 Iteration
Data frames include a variety of methods to iterate over the values.
By default, iteration occurs over the column names:
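A sketch:
>>> for col in sales:
...     print(col)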
Note
Unlike the Series object, which tests for membership against
the index, the DataFrame tests for membership against the
columns. The iteration behavior (__iter__) and membership
behavior (__contains__) is the same for the DataFrame:
>>> 'Units' in sales
True
>>> 0 in sales
False
2    1-3-2014
3    1-1-2014
4    1-2-2014
5    1-3-2014
6    1-5-2014
Name: Date, dtype: object
The .iterrows method returns a tuple for every row. The tuple
has two items. The first is the index value. The second is the row
converted into a Series object. This might be a little tricky in practice
because a row’s values might not be homogenous, whereas that is
usually the case in a column of data. Notice that the dtype for the
row series may change to object because the row has strings and
numeric values in it:
Note
If you aren’t familiar with NamedTuples in Python, check them
out in the collections module. They give you all the benefits
of a tuple: immutable, low memory requirements, and index
access. Also, the namedtuple allows you to access values by
attribute:
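A sketch of .itertuples with attribute access:
>>> for row in sales.itertuples():
...     print(row.Index, row.UPC)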
When you access values by position (row[0]), it isn't clear from
reading the code what 0 is. But .upc is very explicit and makes for
readable code.
We can ask a data frame how long it is with the len function. This
is not the number of columns (even though iteration is over the
columns), but the number of rows:
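>>> len(sales)
7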
Note
Operations performed during iteration are not vectorized in
pandas and have overhead. If you find yourself performing
operations in an iteration loop, there might be a vectorized way
to do the same thing.
For example, you would not want to iterate over the row
data to sum the column values. The .sum method is optimized
to perform this operation.
12.3 Arithmetic
Data frames support broadcasting of arithmetic operations. When
we add a number to a numeric series, it increments every cell by
that amount. But there is a caveat, to increment every numeric value
by ten, simply adding ten to the data frame will fail:
>>> sales + 10
Traceback (most recent call last):
  ...
TypeError: Could not operate 10 with block values must be str, not int
              5         6
UPC         789       789
Units       NaN         1
Sales       NaN       1.8
Date   1-3-2014  1-5-2014
Tip
The .T property of a data frame is a nice wrapper to the
.transpose method. It comes in handy when examining a
data frame in an iPython Notebook. It turns out that viewing
the column headers along the left-hand side often makes the
data more compact and easier to read.
The dot product can be called on a data frame if the contents are
numeric.
12.5 Serialization
Data frames can serialize to many forms. The most important
functionality is probably converting to and from a CSV file, as
this format is the lingua franca of data. We already saw that the
pd.read_csv function will create a DataFrame. Writing to CSV is
easy, we simply use the .to_csv method:
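A sketch, using a StringIO buffer again:
>>> from io import StringIO
>>> fout = StringIO()
>>> sales.to_csv(fout, index=False)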
Data frames can also be created from a serialized dict if needed.
The .to_json method will create JSON output from a data frame.
Given the exported JSON string, we can use the read_json function
to create a data frame. You will need to provide an orient parameter
that matches the parameter used to create the JSON (columns is the
default for .to_json, but you need to explicitly set it for read_json):
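A sketch of the round trip:
>>> js = sales.to_json()
>>> sales2 = pd.read_json(js, orient='columns')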
16 https://specs.frictionlessdata.io/table-schema/
Also, data frames can read and write Excel files. Use the
.to_excel method to dump the data out:
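A sketch (the path is hypothetical):
>>> sales.to_excel('/tmp/sales.xlsx', index=False)  # hypothetical path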
Note
You might need to install the openpyxl module to support
reading and writing xlsx Excel files. If you are dealing with xls
files, you will need xlrd and xlwt. pip makes installing these
easy:
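$ pip install openpyxl
$ pip install xlrd xlwt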
Note
The read_excel function has many options to help it divine
how to parse spreadsheets that aren’t simply CSV files that are
loaded into Excel. You might need to play around with them.
Often, it is easier (but perhaps not quite as satisfying) to open a
spreadsheet and simply export a new sheet with only the data
you need.
If you want to rename the index or column names, you can use the
.rename method. You can provide a dictionary or function mapping
old name to new name:
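A sketch:
>>> sales.rename(columns={'UPC': 'upc'})  # returns a new frame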
Note
If you are reading a CSV file with pd.read_csv, you can specify
the index_col parameter, to use a column as the index.
Note
Be careful, if you think of the index as analogous to a primary
key in database parlance. Because the index can contain
duplicate entries, this description is not quite accurate. Use the
verify_integrity parameter to ensure uniqueness:
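A sketch (UPC has duplicate values, so this raises an error):
>>> sales.set_index('UPC', verify_integrity=True)
Traceback (most recent call last):
  ...
ValueError: Index has duplicate keys: ...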
Be aware that calling .reset_index will insert the old index as the
first column.
Value        Description
'number'     Any number (float or int)
'floating'   Float
'int'        Int
'object'     Object (including string)
'datetime'   Datetime
'timedelta'  Timedelta
'category'   Categorical

Figure 121: Data types for select_dtypes
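A minimal sketch of select_dtypes against sales:

>>> sales.select_dtypes(include=['number']).columns
Index(['UPC', 'Units', 'Sales'], dtype='object')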
Another way to filter out data is to use the .filter method.
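A sketch of the common .filter variants:

>>> sales.filter(items=['UPC', 'Sales'])    # keep the named columns
>>> sales.filter(like='Dat')                # names containing the substring
>>> sales.filter(regex='^S')                # names matching a regular expression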
Note
The .clip, .clip_lower, and .clip_upper methods can be
used in a similar way to .where.
The .mask method also returns an object of the same shape as the
input. It accepts a boolean array or callable and filters out the values
where the array or callable evaluates to True. There is also an other
parameter, defaulting to NaN, that supplies the replacement for the
True positions:
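A sketch against the Units column (values above two are masked out; NaN compares False, so it survives):

>>> sales.Units.mask(sales.Units > 2)
0    NaN
1    2.0
2    NaN
3    1.0
4    2.0
5    NaN
6    1.0
Name: Units, dtype: float64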
There are two methods to pull out a single "cell" in the data frame.
One of them, .iat, uses the position of the index and column (0-based):
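A minimal sketch:

>>> sales.iat[0, 3]    # row 0, column 3 ('Sales')
20.2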
The keep parameter of .drop_duplicates determines which of the
duplicates we choose to drop ('first' is the default). Below, the first
row for each UPC value is kept:
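A sketch:

>>> sales.drop_duplicates(subset='UPC')
    UPC Category  Units  Sales      Date
0  1234     Food    5.0   20.2  1-1-2014
3   789     Food    1.0    2.0  1-1-2014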
Hint
Python has both statements and expressions. Most of the
time you don’t need to worry about the difference. For
example, creating a variable or defining a function are both
statements. The right-hand side of an assignment statement is
an expression.
# no return value!
>>> sales
    UPC Category  Units  Sales      Date
0  1234     Food    5.0   20.2  1-1-2014
1  1234     Food    2.0    8.0  1-2-2014
2  1234     Food    3.0   13.0  1-3-2014
3   789     Food    1.0    2.0  1-1-2014
4   789     Food    2.0    3.8  1-2-2014
5   789     Food    NaN    NaN  1-3-2014
6   789     Food    1.0  789.0  1-5-2014
Note
Column insertion is also available through index assignment
on the data frame. When new columns are added this way, they
are always appended to the end (the right-most column). If
you want to change the order of the columns, call the .reindex
method. Alternatively, you can pass a list of the columns in the
desired order to an index operation.
For example, creating a 'Category' column with a value of
"Food" could be done like this:
Because the Sales column for index 6 also has a value of 789, that
cell would be replaced as well. To fix this, instead of passing a scalar
as the to_replace parameter, use a dictionary mapping column
name to a dictionary of old value to new value. If the new Sales value
of 789.0 was also erroneous, it could be updated in the same call:
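A sketch, where the corrected values of 790 and 1.4 are hypothetical:

>>> sales.replace({'UPC': {789: 790}, 'Sales': {789.0: 1.4}})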
The .pop method takes the name of a column and removes it from
the data frame. It operates in-place. Rather than returning a data
frame, it returns the removed column. Below, the column subcat
will be added and then subsequently removed:
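A sketch (the subcat values are hypothetical):

>>> sales['subcat'] = 'Dairy'     # add a throwaway column
>>> col = sales.pop('subcat')     # removed in place; the column is returned
>>> col.head(2)
0    Dairy
1    Dairy
Name: subcat, dtype: object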
12.10 Slicing
The pandas library provides powerful methods for slicing a data
frame. The .head and .tail methods allow for pulling data off the
front and end of a data frame. They come in handy when exploring
data interactively:
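A quick sketch:

>>> sales.head(2)
    UPC Category  Units  Sales      Date
0  1234     Food    5.0   20.2  1-1-2014
1  1234     Food    2.0    8.0  1-2-2014
>>> sales.tail(2)
    UPC Category  Units  Sales      Date
5   789     Food    NaN    NaN  1-3-2014
6   789     Food    1.0  789.0  1-5-2014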
Figure 122: Figure showing how to slice by row or column. Note that
positional slicing uses the half-open interval, while label-based slicing is
inclusive (closed interval).
Using .iloc, we can take rows from position zero up to but not
including four. We also take columns from the first up to but not
including one (just the column in the zero index position):
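A sketch:

>>> sales.iloc[0:4, 0:1]
    UPC
0  1234
1  1234
2  1234
3   789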
There is also support for slicing out data by labels. Using the
.loc attribute, we can take index values a through d:
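A sketch, assuming a small frame indexed by the labels a through f (df_letters is hypothetical):

>>> df_letters = pd.DataFrame({'val': range(6)}, index=list('abcdef'))
>>> df_letters.loc['a':'d']    # closed interval: 'd' is included
   val
a    0
b    1
c    2
d    3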
Note
In contrast to normal Python slicing, which is half-open
(meaning take the start index and go up to, but not including,
the final index), indexing by label uses the closed interval. A
closed interval includes not only the initial location, but also
the final location:
Figure 123: Figure showing various ways to slice a data frame. Note that we
can slice by label or position.
Slice               Result
.iloc[i:j]          Rows position i up to but not including j (half-open)
.iloc[:, i:j]       Columns position i up to but not including j (half-open)
.iloc[[i, k, m]]    Rows at positions i, k, and m (not an interval)
.loc[a:b]           Rows from index label a through b (closed)
.loc[:, c:d]        Columns from column label c through d (closed)
.loc[:, [b, d, f]]  Columns at labels b, d, and f (not an interval)
Hint
If you want to slice out columns by label, but rows by position,
you can chain index operations on .iloc and .loc together.
And finally, you can use the .get_loc method of the index
to get the position of labels. With those positions, you can
index off of .iloc:
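A minimal sketch:

>>> i = sales.columns.get_loc('Sales')
>>> sales.iloc[0:2, i]
0    20.2
1     8.0
Name: Sales, dtype: float64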
12.11 Sorting
Sometimes, we need to sort a data frame by index, or the values in
the columns. The data frame operations are very similar to what we
saw with series.
Here is the sales data frame:
>>> sales
    UPC Category  Units  Sales      Date
0  1234     Food    5.0   20.2  1-1-2014
1  1234     Food    2.0    8.0  1-2-2014
2  1234     Food    3.0   13.0  1-3-2014
3   789     Food    1.0    2.0  1-1-2014
4   789     Food    2.0    3.8  1-2-2014
5   789     Food    NaN    NaN  1-3-2014
6   789     Food    1.0  789.0  1-5-2014
The .sort_values method will sort columns. Let’s sort the UPC
column:
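A sketch (789 sorts before 1234, so those rows come first):

>>> sales.sort_values(by='UPC').head(3)
    UPC Category  Units  Sales      Date
3   789     Food    1.0    2.0  1-1-2014
4   789     Food    2.0    3.8  1-2-2014
5   789     Food    NaN    NaN  1-3-2014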
Note
Avoid using the .sort method. It is now deprecated because
it does an in-place sort by default. Use .sort_values instead.
Note
Sorting the index has other benefits. Filtering rows off of the
index is typically faster than filtering off of columns. It can be
made faster still by sorting the index, which allows filtering
operations to do a binary search. If the index items are unique,
a hash lookup can be used to make filtering even faster!
Note
If the index is sorted, we can slice based on the index labels:
>>> sales.assign(UPC=lambda x: x.UPC.astype(str)) \
...     .set_index('UPC') \
...     .sort_index() \
...     .loc['1':'7']
     Category  Units  Sales      Date
UPC
1234     Food    5.0   20.2  1-1-2014
1234     Food    2.0    8.0  1-2-2014
1234     Food    3.0   13.0  1-3-2014
Note that neither '1' nor '7' is in the index, but we can
slice with those values, and pandas will bound the slice
lexicographically. Also note that because .loc is based on
labels, the interval is inclusive (closed). Because '789' sorts
after '7', it is not included in the result.
12.12 Summary
In this chapter, we examined many of the methods on the DataFrame
object. We saw how to examine the data, loop over it, broadcast
operations, and serialize it. We also looked at index operations that
were very similar to the Series index operations. We saw how to
do CRUD operations and ended with slicing and sorting data.
In the next chapter, we will explore some of the statistical
functionality found in the data frame.
Chapter 13
Data Frame Statistics
If you are doing data science or statistics with pandas, you are in
luck: the data frame comes with basic functionality built in.
In this section, we will examine snow totals from Alta for the
past few years. I scraped this data off the Utah Avalanche Center
website[17], but will use the read_table function of pandas to create
a data frame.
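A sketch of loading it (the file name is hypothetical):

>>> snow = pd.read_table('alta_snow.txt')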
>>> snow
year inches location
0 2006 633.5 utah
1 2007 356.0 utah
2 2008 654.0 utah
3 2009 578.0 utah
4 2010 430.0 utah
5 2011 553.0 utah
6 2012 329.5 utah
7 2013 382.5 utah
8 2014 357.5 utah
9 2015 267.5 utah
[17] https://utahavalanchecenter.org/alta-monthly-snowfall
13.1 describe and quantile
Note that the location column, which has a string type, is ignored
by the .describe method. If we set the include parameter to
'all', then we also get summary statistics for categorical and string
columns:
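A sketch, with the output truncated:

>>> snow.describe(include='all')
               year      inches location
count     10.000000   10.000000       10
unique          NaN         NaN        1
top             NaN         NaN     utah
freq            NaN         NaN       10
mean    2010.500000  454.150000      NaN
...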
Here we get the 10% and 90% percentile levels. We can see that
if 635 inches fall, we are at the 90% level:
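A sketch:

>>> snow.quantile(q=[.1, .9])
       year  inches
0.1  2006.9  323.30
0.9  2014.1  635.55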
Note
Changing the q parameter to a list, rather than a scalar, makes
the .quantile method return a data frame, rather than a
series.
If you have data and want to know whether any of the values
in the columns evaluate to True in a boolean context, use the .any
method:
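A minimal sketch:

>>> snow.any()
year        True
inches      True
location    True
dtype: bool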
Both .any and .all are pretty boring in this dataset because they
are all truthy (non-empty or not false).
13.2 rank
The .rank method goes through every column and assigns a number
to the rank of that cell within the column. Again, the year column
isn’t particularly useful here:
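A sketch of the first few rows:

>>> snow.rank().head(3)
   year  inches  location
0   1.0     9.0       5.5
1   2.0     3.0       5.5
2   3.0    10.0       5.5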
Note that because the location values are all the same, the rank
of that column is the average of the tied positions by default. To
change this behavior, we can pass the method parameter (for
example, method='first' ranks ties in the order they appear).
Note
Specifying method=’first’ fails with non-numeric data:
13.3 clip
Occasionally, there are outliers in the data. If this is problematic, the
.clip method trims a column (or row if axis=1) to certain values:
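A sketch, clipping the inches column to a floor of 400 (the bound is hypothetical):

>>> snow.inches.clip(lower=400)
0    633.5
1    400.0
2    654.0
3    578.0
4    430.0
5    553.0
6    400.0
7    400.0
8    400.0
9    400.0
Name: inches, dtype: float64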
If you have two data frames that you want to correlate, you can
use the .corrwith method to compute column-wise (the default) or
row-wise (when axis=1) Pearson correlations:
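A hedged sketch, using a hypothetical second frame derived from snow:

>>> snow2 = snow.assign(inches=snow.inches * 1.1)   # hypothetical comparison data
>>> snow.corrwith(snow2)
year      1.0
inches    1.0
dtype: float64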
13.5 Reductions
There are various reducing methods on the data frame that collapse
columns into a single value. An example is the .sum method, which
will apply the add operation to all members of a column. Note that,
by default, string columns are concatenated:
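A sketch:

>>> snow.sum()
year                                            20105
inches                                         4541.5
location    utahutahutahutahutahutahutahutahutahutah
dtype: object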
13.6 Summary
The pandas library provides basic statistical operations out of the
box. This chapter looked at the .describe method, which is one of
the first tools I reach for when looking at new data. We also saw
how to rank data, clip it to certain ranges, perform correlations, and
reduce columns.
In the next chapter, we will look at the more advanced topics of
changing the shape of the data.
Chapter 14
Grouping, Pivoting, and Reshaping
Note that Fred is missing a score from test1. That could represent
that he did not take the test, or that someone forgot to enter his
score.
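Here is a sketch of a scores frame consistent with the examples that follow (the teacher assignments are inferred from the group medians shown below and should be treated as assumptions):

>>> scores = pd.DataFrame({
...     'name': ['Adam', 'Bob', 'Dave', 'Fred'],
...     'age': [15, 16, 16, 15],
...     'test1': [95, 81, 89, None],
...     'test2': [80, 82, 84, 88],
...     'teacher': ['Ashby', 'Ashby', 'Jones', 'Jones']},
...     columns=['name', 'age', 'test1', 'test2', 'teacher'])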
This also included the age column; to ignore it, we can slice
out just the test columns:
>>> scores.groupby('teacher').median()[['test1', 'test2']]
         test1  test2
teacher
Ashby     88.0   81.0
Jones     89.0   86.0
Figure 141: Figure showing the split, apply, and combine steps on a groupby
object. Note that there are various built-in methods, and also the apply
method, which allows arbitrary operations.
Note
When you group by multiple columns, the result has a
hierarchical index or multi-level index.
The groupby object has many methods that reduce group values
to a single value. They are:
Method        Result
.all          Boolean if all cells in group are True
.any          Boolean if any cells in group are True
.count        Count of non-null values
.size         Size of group (includes null)
.nunique      Count of unique values
.idxmax       Index of maximum values
.idxmin       Index of minimum values
.quantile     Quantile (default of .5) of group
.agg(func)    Apply func to each group; if func returns a scalar, then reducing
.apply(func)  Use split-apply-combine rules
.first        First value
.last         Last value
.nth          Nth row from group
.max          Maximum value
.min          Minimum value
.mean         Mean value
.median       Median value
.mad          Mean absolute deviation
.skew         Skew of group
.sem          Standard error of mean of group
.std          Standard deviation
.var          Variance of group
.prod         Product of group
.sum          Sum of group
.dtypes       Data type of each group
We can see that the pivot table and group by behavior is very
similar. Many spreadsheet power users are more familiar with
the declarative style of .pivot_table, while programmers not
accustomed to pivot tables prefer using group by semantics.
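A sketch of the equivalence, using the scores frame from above:

>>> scores.pivot_table(index='teacher', values=['test1', 'test2'],
...     aggfunc='median')
         test1  test2
teacher
Ashby     88.0   81.0
Jones     89.0   86.0
>>> # the same reduction with group by semantics
>>> scores.groupby('teacher')[['test1', 'test2']].median()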
Figure 144: Figure showing columns that are preserved during melting
(id_vars) and column names that are pulled into a column (value_vars).
When melting, the test name is pulled out into its own column, and
the scores for the tests end up in a single column. To do this, we put
the list of fact columns in the value_vars parameter. Any dimensions
we want to keep should be listed in the id_vars parameter.
Here we keep name and age as dimensions, and pull out the test
scores as facts:
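A sketch:

>>> pd.melt(scores, id_vars=['name', 'age'],
...     value_vars=['test1', 'test2']).head(4)
   name  age variable  value
0  Adam   15    test1   95.0
1   Bob   16    test1   81.0
2  Dave   16    test1   89.0
3  Fred   15    test1    NaN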
Note
Long data is also referred to as tidy data. See the Tidy Data
paper[18] by Hadley Wickham.
>>> wide_df
score
test test1 test2
name age
Adam 15 95.0 80.0
Bob 16 81.0 82.0
Dave 16 89.0 84.0
Fred 15 NaN 88.0
[18] http://vita.had.co.nz/papers/tidy-data.html
Note that this creates hierarchical (or multi-level) column labels and a
hierarchical index. To flatten the index, use the .reset_index method.
It will take the existing index and make it a column (or columns if
the index is hierarchical):
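A sketch; after resetting, the former index levels become ordinary columns under the hierarchical column labels:

>>> flat = wide_df.reset_index()
>>> flat.columns.tolist()
[('name', ''), ('age', ''), ('score', 'test1'), ('score', 'test2')]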
14.9 Summary
This chapter covered some more advanced topics of pandas. We
saw how to group by columns and perform reductions. We also
saw how some of these group by operations can be done with the
.pivot_table method. Then we looked at melting data, creating
dummy variables, and stacking.
Often, you will find you need your data organized slightly
differently. You can use one of the methods discussed to munge
your data. It will be quicker and take less code than an imperative
solution that iterates over the values manually. But it might
require pondering for a while about how to transform the data. Play
around with these methods and check out other examples of how
people are using them in the wild for inspiration.
Figure 145: Figure showing how to stack and unstack data. Stack takes the
innermost column labels and places them in the index. Unstack takes the
innermost index labels and places them in the columns.
Chapter 15
Multidimensional Indexing
>>> url = 'http://qrc.depaul.edu/Excel_Files/Presidents.xls'
>>> df = pd.read_excel(url)
Note
In order to read from Excel files we need to have the xlrd
package installed:

    pip install xlrd
>>> df.columns
Index(['President', 'Years in office',
       'Year first inaugurated', 'Age at inauguration',
       'State elected from', '# of electoral votes',
       '# of popular votes', 'National total votes',
       'Total electoral votes', 'Rating points',
       'Political Party', 'Occupation', 'College',
       '% electoral', '% popular'],
      dtype='object')
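A sketch of setting a hierarchical index and selecting with .loc (the choice of College as the displayed column mirrors the outputs below):

>>> df2 = df.set_index(['Political Party', 'Occupation']).sort_index()
>>> df2.loc['Democrat', 'College'].head(3)
Occupation
Author                  Harvard
Businessman    US Naval Academy
Educator              Princeton
Name: College, dtype: object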
                                        College
Political Party        Occupation
None                   Planter          None
Federalist             Lawyer           Harvard
Democratic-Republican  Planter, Lawyer  William and Mary
                       Lawyer           Princeton
                       Lawyer           William and Mary
                              College
Political Party  Occupation
Republican       Lawyer       None
                 Lawyer       Kenyon
                 Lawyer       Williams
                 Lawyer       Miami
                 Lawyer       Allegheny College
                 Lawyer       Yale
                 Lawyer       Whittier

                              College
Political Party  Occupation
Democrat         Author       Harvard
                 Businessman  US Naval Academy
                 Educator     Princeton
                 Lawyer       None
                 Lawyer       None
Chapter 16
Dealing With Missing Data
More often than I would like, I spend time being a data janitor:
cleaning up, removing, updating, and tweaking the data I need to
deal with. This can be annoying, but luckily pandas has good support
for these actions. We’ve already seen much of this type of work. In
this section we will discuss dealing with missing data.
Let’s start by looking at a simple data frame with missing data.
I’ll use the StringIO class and the pandas read_table function to
simulate reading tabular data:
>>> import io
>>> data = '''Name|Age|Color
... Fred|22|Red
... Sally|29|Blue
... George|24|
... Fido||Black'''
>>> df = pd.read_table(io.StringIO(data), sep='|')
>>> df
     Name   Age  Color
0    Fred  22.0    Red
1   Sally  29.0   Blue
2  George  24.0    NaN
3    Fido   NaN  Black
Perhaps more insidious is when you are missing (a big chunk of)
data and don’t even notice it. I’ve found that plotting can be a useful
tool to visually spot holes in the data. Below we will discuss a few
more techniques.
In our df data, one might assume that there should be an age for
every row. Every living thing has an age, but Fido’s is missing. Is
that because he didn’t want anyone to know how old he was? Maybe
he doesn’t know his birthday? Maybe he isn’t a human, so giving
him an age doesn’t make sense. To effectively deal with missing
data, it is useful to determine which data is missing and why it
is missing. This will aid in deciding what to do with the missing
data. Unfortunately, this book cannot help with that. That requires
sleuthing and often non-programming related skills.
>>> df.isnull()
    Name    Age  Color
0  False  False  False
1  False  False  False
2  False  False   True
3  False   True  False
By applying .any to axis 1, we can get the rows that have null
values:
To get the index values where null values occur you can do the
following:
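A sketch of both steps:

>>> mask = df.isnull().any(axis=1)
>>> mask
0    False
1    False
2     True
3     True
dtype: bool
>>> df.index[mask]
Int64Index([2, 3], dtype='int64')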
>>> df.dropna()
    Name   Age Color
0   Fred  22.0   Red
1  Sally  29.0  Blue
What if you wanted to get the rows that were valid for both age
and color? You could combine the column masks using the boolean
AND operator (&):
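A sketch building the mask from per-column checks:

>>> valid = df.notnull()
>>> mask = valid.Age & valid.Color
>>> mask
0     True
1     True
2    False
3    False
dtype: bool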
>>> df[mask]
    Name   Age Color
0   Fred  22.0   Red
1  Sally  29.0  Blue
In this case, the result is the same as .dropna, but in other cases,
it might be ok to keep missing values around in certain columns.
When that need arises, .dropna is too heavy-handed, and you will
need to be a little more fine-grained with your mask.
Note
In pandas, there is often more than one way to do something.
Another option for combining the two column masks is to use
the .apply method on the columns with the Python built-in
function all. To collapse the boolean values along the rows,
make sure you pass the axis=1 parameter:

>>> mask = valid[['Age', 'Color']].apply(all, axis=1)
>>> mask
0     True
1     True
2    False
3    False
dtype: bool
>>> df
     Name   Age  Color
0    Fred  22.0    Red
1   Sally  29.0   Blue
2  George  24.0    NaN
3    Fido   NaN  Black
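A sketch of forward and backward filling, which the note below refers to:

>>> df.fillna(method='ffill')
     Name   Age  Color
0    Fred  22.0    Red
1   Sally  29.0   Blue
2  George  24.0   Blue
3    Fido  24.0  Black
>>> df.fillna(method='bfill')
     Name   Age  Color
0    Fred  22.0    Red
1   Sally  29.0   Blue
2  George  24.0  Black
3    Fido   NaN  Black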
Note
A ffill or bfill is not guaranteed to insert data if the first
or last value is missing. The .fillna call with bfill above
illustrates this.
This is a small example of an operation that you cannot
blindly apply to a dataset. Just because it worked on a past
dataset does not guarantee that it will work on a future one.
If you have numeric data that has an order, then another option
is the .interpolate method. This will fill in values based on the
method parameter provided:
>>> df.interpolate()
     Name   Age  Color
0    Fred  22.0    Red
1   Sally  29.0   Blue
2  George  24.0    NaN
3    Fido  24.0  Black
Method        Effect
linear        Treat values as evenly spaced (default)
time          Fill in values based on the time index
values/index  Use the index to fill in blanks
If you have scipy installed you can use the following additional
options:
Method                Effect
nearest               Use nearest data point
zero                  Zero order spline (use last value seen)
slinear               Spline interpolation of first order
quadratic             Spline interpolation of second order
cubic                 Spline interpolation of third order
polynomial            Polynomial interpolation (pass order param)
spline                Spline interpolation (pass order param)
barycentric           Use Barycentric Lagrange Interpolation
krogh                 Use Krogh Interpolation
piecewise_polynomial  Use Piecewise Polynomial Interpolation
pchip                 Use Piecewise Cubic Hermite Interpolating Polynomial
Finally, you can use the .replace method to fill in missing values:
Note that if you try to replace None, pandas will throw an error,
as this is the default value for the value parameter:
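A sketch (the replacement value of -1 is hypothetical):

>>> import numpy as np
>>> df.replace(np.nan, -1)
     Name   Age  Color
0    Fred  22.0    Red
1   Sally  29.0   Blue
2  George  24.0     -1
3    Fido  -1.0  Black
>>> df.replace(None)
Traceback (most recent call last):
  ...
TypeError: ...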
16.4 Summary
In the real world, data is messy. Sometimes you have to tweak it
slightly or filter it. And sometimes, it is just missing. In these cases,
having insight into your data and where it came from is invaluable.
In this chapter, we saw how to find missing data. You can simply
drop the data that is incomplete. You can also fill in the missing
data. These operations are straightforward in pandas.
Chapter 17
Joining Data Frames
Data frames hold tabular data. Databases hold tabular data. You
can perform many of the same operations on data frames that you
do to database tables. In this section, we will look at joining data
frames.
Here are the two tables we will be using for examples:
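A sketch of two frames consistent with the join results shown below (the Paul row and its car color are assumptions):

>>> df1 = pd.DataFrame({'name': ['John', 'George', 'Ringo'],
...     'color': ['Blue', 'Blue', 'Purple']})
>>> df2 = pd.DataFrame({'name': ['Paul', 'George'],
...     'carcolor': ['Red', 'Blue']})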
Note that this repeats the name column. Using SQL, we can join
two database tables together based on common columns. If we want
that behavior with data frames, pandas supports it as well.
17.3 Joins
Databases have different types of joins. The four common ones
include inner, outer, left, and right. The data frame has a method
to support these operations. Sadly, it is not the .join method, but
rather the .merge method.
Note
The .join method is meant for joining based on index, rather
than columns. In practice, I find myself joining based on
columns instead of index values.
If you want the .join method to join based on column
values, you need to set that column as the index first:
>>> df1.set_index('name').join(df2.set_index('name'))
         color carcolor
name
John      Blue      NaN
George    Blue     Blue
Ringo   Purple      NaN
The default join type for the .merge method is an inner join. The
.merge method looks for common column names. It then aligns the
values in those columns. If both data frames have values that are the
same, they are kept along with the remaining columns from both
data frames. Rows with values in the aligned columns that only
appear in one data frame are discarded:
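Continuing the sketch from above:

>>> df1.merge(df2)    # inner join on the common 'name' column
     name color carcolor
0  George  Blue     Blue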
Figure 171: Figure showing the result of four different joins: inner, outer,
left, and right.
Finally, there is support for a right join as well. A right join keeps
the values from the overlapping columns in the data frame that is
passed in as the first parameter of the .merge method. If the data
frame that .merge was called on has aligned values, they are kept,
otherwise NaN is used to fill in the missing values:
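Continuing the sketch, the right join keeps Paul from df2, filling his color with NaN:

>>> df1.merge(df2, how='right')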
The .merge method has a few other parameters that turn out to
be useful in practice. The table below lists them:
Parameter    Meaning
on           Column names to join on. String or list. (Default is the intersection of the names.)
left_on      Column names for the left data frame. String or list. Used when names don’t overlap.
right_on     Column names for the right data frame. String or list. Used when names don’t overlap.
left_index   Join based on the left data frame’s index. Boolean.
right_index  Join based on the right data frame’s index. Boolean.
17.4 Summary
Data can often have more utility if we combine it with other data.
In the 1970s, relational algebra was invented to describe various joins
among tabular data. The .merge method of the DataFrame lets us
apply these operations to tabular data in the pandas world. This
chapter described concatenation and the four basic joins that are
possible via .merge.
Chapter 18
Jupyter Styles
The DataFrame has a .style attribute that can affect how the data is
displayed in Jupyter.
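A minimal sketch (highlight_null is one of the built-in style methods):

>>> df.style.highlight_null()    # shade NaN cells when rendered in Jupyter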
Chapter 19
Avalanche Analysis and Plotting
19.1 Getting Data
<div class="field field-name-field-observation-date
    field-type-datetime field-label-above">
  <div class="field-label">Observation Date</div>
  <div class="field-items">
    <div class="field-item even">
      Thursday, March 5, 2015
    </div>
  </div>
</div>
The interesting data resides in <div> tags that have class set to
field. The name is found in a <div> with class set to field-label
and the value in a <div> with class set to field-item.
Here is some code that takes the base url and the dictionary
containing the overview for that avalanche. It iterates over every
class set to field and updates the dictionary with the detailed data:
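This is a hedged sketch of the idea, assuming requests and BeautifulSoup are used for fetching and parsing (the function name and the overview dictionary's url key are hypothetical):

>>> import requests
>>> from bs4 import BeautifulSoup
>>> def update_details(base_url, overview):
...     # fetch the detail page linked from the overview record
...     resp = requests.get(base_url + overview['url'])
...     soup = BeautifulSoup(resp.text, 'html.parser')
...     # every interesting <div> has class "field"; pair up label and value
...     for field in soup.find_all('div', class_='field'):
...         label = field.find('div', class_='field-label')
...         item = field.find('div', class_='field-item')
...         if label and item:
...             overview[label.get_text(strip=True)] = item.get_text(strip=True)
...     return overview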
With this code in hand, we can create a data frame with the
data by running the following code. Note that this takes about two
minutes to scrape the data:
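A hedged sketch (the overviews list and the file name are assumptions):

>>> base_url = 'https://utahavalanchecenter.org'
>>> records = [update_details(base_url, o) for o in overviews]   # ~2 minutes
>>> pd.DataFrame(records).to_csv('avalanches.csv', index=False)  # cache the scrape
>>> df = pd.read_csv('avalanches.csv')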
19.2 Munging Data
It looks like some of the values are numeric, though the type
of Occurrence Date is object, which means it is a string and not a
datetime object. We will address that later.
Note
Because I read this data from the CSV file, pandas tried its
hardest to coerce numeric values. Had I simply converted the
list of dictionaries from the crawled data, the type of all of the
columns would have been object, the string data type (because
the scraping returned strings).
>>> df.shape
(92, 38)
19.3 Describing Data
We could do this for each of the numeric columns and decide
whether we need to change them. If we had access to someone
who knows the data a little better, we could ask them how to resolve
such issues.
On an aesthetic note, there are a bunch of columns with colons
on the end. Let’s clean that up, by replacing colons with an empty
string using the .rename method:
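A sketch using a dictionary comprehension (explained in the note below):

>>> df2 = df.rename(columns={x: x.replace(':', '') for x in df.columns})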
Note
The above uses a dictionary comprehension to create a dictionary
from the columns. The comprehension is equivalent to the
following loop:

new_cols = {}
for x in df.columns:
    new_cols[x] = x.replace(':', '')
>>> len(df[(df.Aspect >= 'North') & (df.Aspect <= 'East')])
47
This tells us that slopes facing north through east are more prone
to slide. Or does it? Skiers tend to ski the north and east aspects.
Because those aspects stay out of the sun, the snow stays softer. One
should be careful about drawing the conclusion that skiing
south-facing aspects is safer.
19.5 Converting Column Types
>>> import re
>>> def to_inches(orig):
...     txt = str(orig)
...     if txt == 'nan':
...         return orig
...     reg = r'''(((\d*\.)?\d*)')?(((\d*\.)?\d*)")?'''
...     mo = re.search(reg, txt)
...     feet = mo.group(2) or 0
...     inches = mo.group(5) or 0
...     return float(feet) * 12 + float(inches)
Note
Regular expressions could fill a book of their own. A few
things to note: we use raw strings to specify them (they have
an r at the front), as raw strings don’t interpret backslash as
an escape character. This is important because the backslash
has special meaning in regular expressions. \d means match a
digit.
The parentheses are used to specify groups. After invoking
the search function, we get a match object as the result (mo in the
code above). The .group method pulls out the match inside of
a group. mo.group(2) looks for the second left parenthesis
and returns the match inside those parentheses. mo.group(5)
looks for the fifth left parenthesis, and the match inside it.
Normally Python is zero-based, where we start counting from
zero, but in the case of regular expression groups, we start
counting at one. The first left parenthesis indicates where the
first group starts: group one, not zero.
Let’s add a new column to store the depth of the avalanche in inches:
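A sketch, assuming the raw column is named 'Depth' and the new column 'depth_inches' (both names are assumptions):

>>> df['depth_inches'] = df['Depth'].apply(to_inches)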
Let’s look at what day of the week avalanches occur on. The
dt attribute has the weekday and dayofweek attributes (both are the
same):
>>> dates = pd.to_datetime(df['Occurrence Date'])
>>> dates.dt.dayofweek.value_counts()
5    29
6    14
4    14
2    10
0    10
3     9
1     6
Name: Occurrence Date, dtype: int64
"Occurrence Date" has the day of week and there are no missing
values:
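A hedged sketch of splitting the day name off of the date string:

>>> dow = df['Occurrence Date'].str.split(',').str.get(0)     # e.g. 'Thursday'
>>> rest = df['Occurrence Date'].str.split(', ', 1).str.get(1)  # the date itself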
19.8 Analysis
The final product of my analysis was an infographic containing
various chunks of information derived from the data. The first part
was the number of fatal avalanches since 1995:

>>> ava95 = df[df.year >= 1995]
>>> len(ava95)
61
Figure 194: A figure illustrating plotting deaths over time, with a regression
line compliments of the seaborn library. Note that Seaborn changes the
default aesthetics of matplotlib.
We pass options to scatter_kws to make the marker size larger and
set the color to a shade of red:
>>> import seaborn as sns
>>> ax = fig.add_subplot(111)
>>> summed = ava95.groupby('year').sum().reset_index()
>>> _ = sns.regplot(x='year', y='killed', data=summed,
...     lowess=0, marker='x',
...     scatter_kws={'s': 100, 'color': '#a40000'})
>>> fig.savefig('/tmp/pd-ava-2.png')
19.9 Plotting on Maps
import folium
from IPython.display import HTML
[25] https://folium.readthedocs.org/en/latest/
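Here is a hedged sketch of the inline_map helper as it was commonly written for the folium of that era (the _build_map call and HTML attribute reflect the old folium API and are assumptions):

def inline_map(m):
    # render the map's HTML into an iframe so Jupyter displays it inline
    m._build_map()
    srcdoc = m.HTML.replace('"', '&quot;')
    return HTML('<iframe srcdoc="{}" '
                'style="width: 100%; height: 400px; border: none">'
                '</iframe>'.format(srcdoc))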
Figure 196: A figure illustrating a portion of the Folium map used in the
infographic.
inline_map(map)
The tail end of the Trigger counts looks like this:

Unknown      3
Hiker        2
Snowshoer    1
Name: Trigger, dtype: int64
19.11 Assorted Plots
Figure 1911: Figure illustrating plot of avalanche slopes. Note that the default
ratio of the plot is not square, hence the call to ax.set_aspect(’equal’,
adjustable=’box’).
...     _ = plt.plot([0, 1], [0, math.tan(to_rad(row['Slope Angle'] +
...         jitter))], alpha=.3, color='b', linewidth=1)
>>> _ = ax.set_xlim(0, 1)
>>> _ = ax.set_ylim(0, 1)
>>> ax.set_aspect('equal', adjustable='box')
>>> fig.savefig('/tmp/pd-ava-5.png')
19.12 Summary
In this chapter, we looked at a sample project. Even without a
database or CSV file floating around, we were able to scrape the
data from a website. Then, using pandas, we did some pretty heavy
janitorial work on the data. Finally, we were able to do some analysis
and generate some plots of the data. Since matplotlib has the ability
to save as SVG, we were able to import these plots into a vector editor
and create a fancy infographic from them.
This should give you a feel for the kind of work that pandas will
enable. Combined with the power of Python, you are only limited
by your imagination. (And your free time.)
Chapter 20
Summary
Thanks for learning about the pandas library. Hopefully, as you have
read through this book, you have begun to appreciate the power of
this library. You might be wondering what to do now that you have
finished this book.
I’ve taught many people Python and pandas over the years, and
they typically ask what to do to continue learning. My answer
is pretty simple: find a project that you would like to work on and
find an excuse to use Python or pandas. If you are in a business
setting and use Excel, see if you can replicate what you do
in Jupyter and pandas. If you are interested in Machine Learning,
check out Kaggle for projects to try out your new skills. Or simply
find some data about something you are interested in and start
playing around.
For those who like videos and screencasts, I offer a screencast
service called PyCast[26] which has many examples of using Python
and pandas in various projects.
As pandas is an open source project, you can contribute and
improve the library. The library is still in active development.
[26] https://pycast.io
About the Author
Also Available
Reviews
Matt Harrison gets it, admits there are undeniable flaws
and schisms in Python, and guides you through it in
short and to the point examples. I bought both Kindle
and paperback editions to always have at the ready for
continuing to learn to code in Python.
S. Oakland
[27] http://hairysun.com/books/tread/
Treading on Python: Vol 2: Intermediate Python
• Functional Programming
• Lambda Expressions
• List Comprehensions
• Generator Comprehensions
• Iterators
• Generators
• Closures
• Decorators
• And more ...
[28] http://hairysun.com/books/treadvol2/
Reviews
Complete! All you must know about Python Decorators:
theory, practice, standard decorators.
All written in a clear and direct way and very affordable
price.
Nice to read in Kindle.
F. De Arruda (Brazil)