Python Data Science Essentials - Sample Chapter
Alberto Boschetti
Luca Massaron
First Steps
Whether you are an eager learner of data science or a well-grounded data science
practitioner, you can take advantage of this essential introduction to Python for
data science. You can use it to the fullest if you already have at least some previous
experience in basic coding, writing general-purpose computer programs in Python,
or some other data analysis-specific language, such as MATLAB or R.
The book will delve directly into Python for data science, providing you with a
straight and fast route to solve various data science problems using Python and its
powerful data analysis and machine learning packages. The code examples that
are provided in this book don't require you to master Python. However, they will
assume that you at least know the basics of Python scripting, data structures such
as lists and dictionaries, and the working of class objects. If you don't feel confident
about this subject or have minimal knowledge of the Python language, we suggest
that, before you read this book, you take an online tutorial, such as the
Codecademy course at http://www.codecademy.com/en/tracks/python or Google's
Python class at https://developers.google.com/edu/python/. Both courses
are free, and in a matter of a few hours of study, they should provide you with all the
building blocks that will ensure that you enjoy this book to the fullest. We have also
prepared a tutorial of our own, which you can download from the Packt Publishing
website, to complement the two aforementioned free courses.
In any case, don't be intimidated by our starting requirements; mastering Python
for data science applications isn't as arduous as you may think. It's just that we have
to assume some basic knowledge on the reader's part because our intention is to go
straight to the point of using data science without having to explain too much about
the general aspects of the language that we will be using.
Are you ready, then? Let's start!
In this short introductory chapter, we will work out the basics to set off in full swing
and go through the following topics:

• Installing Python and the packages you need for data science, either step by step or by means of a scientific distribution
• Using IPython and the IPython Notebook
• Loading toy datasets and real data to experiment with
Python offers several advantages to the data scientist:

• It can easily integrate different tools and offer a truly unifying ground for different languages (Java, C, Fortran, and even language primitives), data strategies, and learning algorithms that can be easily fitted together and which can concretely help data scientists forge new and powerful solutions.
• It offers a large, mature system of packages for data analysis and machine learning. It guarantees that you will get all that you may need in the course of a data analysis, and sometimes even more.
• It can work with in-memory big data because of its minimal memory footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save, or discard data during the various iterations and reiterations of data wrangling.
• It is very simple to learn and use. After you grasp the basics, there is no better way to learn more than by immediately starting to code.
Installing Python
First of all, let's proceed to introduce all the settings you need in order to create a
fully working data science environment to test the examples and experiment with
the code that we are going to provide you with.
Python is an open source, object-oriented, cross-platform programming language
that, compared to its direct competitors (for instance, C++ and Java), is very concise.
It allows you to build a working software prototype in a very short time. Did Python
become the most used language in the data scientist's toolbox just because of this?
Well, no. It's also a general-purpose language, and it is very flexible thanks to the
large variety of available packages that solve a wide spectrum of problems and needs.
Python 2 or Python 3?
There are two main branches of Python: 2 and 3. Although the third version is the
newest, the older one is still the most used version in the scientific area, since a few
libraries (see http://py3readiness.org for a compatibility overview) won't run
on Python 3 yet. In fact, if you try to run code developed for Python 2 with a Python
3 interpreter, it may not work. Major changes have been made in the newest version,
and this has broken past compatibility. So, please remember that Python 3 is not
backward compatible with Python 2.
In this book, in order to address a larger audience of readers and practitioners, we're
going to adopt the Python 2 syntax for all our examples (at the time of writing this
book, the latest release is 2.7.8). Since the differences amount to really minor changes,
advanced users of Python 3 are encouraged to adapt and optimize the code to suit
their favored version.
Step-by-step installation
Novice data scientists who have never used Python (and who, therefore, probably
don't have it readily installed on their machines) need to first download the installer
from the main website of the project, https://www.python.org/downloads/, and
then install it on their local machine.
This section provides you with full control over what is installed
on your machine. This is very useful when you have to set up single
machines to deal with different tasks in data science. Anyway, please
be warned that a step-by-step installation really takes time and effort.
Instead, installing a ready-made scientific distribution will lessen the
burden of the installation procedures, and it may be well suited when
you are first starting out and learning, because it saves you time and
sometimes even trouble, though it will put a large number of packages
(most of which we won't use) on your computer all at once. Therefore,
if you want to start immediately with an easy installation procedure,
just skip this part and proceed to the next section, Scientific distributions.
3. If a syntax error is raised, it means that you are running Python 3 instead of
Python 2. Otherwise, if you don't experience an error and you can read that
your Python version has the attribute major=2, then congratulations for
running the right version of Python. You're now ready to move forward.
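A minimal sketch of this check, assuming the python executable is on your path (the micro version printed on your machine may of course differ), is the following:

$> python
>>> import sys
>>> print sys.version_info
sys.version_info(major=2, minor=7, micro=8, releaselevel='final', serial=0)

In Python 3, the same print statement raises a SyntaxError, which is exactly the signal mentioned in the preceding step.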
To clarify, when a command is given in the terminal command line, we prefix the
command with $>. Otherwise, if it's for the Python REPL, it's preceded by >>>.
NumPy
NumPy, which is Travis Oliphant's creation, is the true analytical workhorse of the
Python language. It provides the user with multidimensional arrays, along with
a large set of functions to perform a multiplicity of mathematical operations on
these arrays. Arrays are blocks of data arranged along multiple dimensions, which
implement mathematical vectors and matrices. Arrays are useful not just for storing
data, but also for fast matrix operations (vectorization), which are indispensable
when you wish to solve ad hoc data science problems.
Website: http://www.numpy.org/
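As a minimal taste of what vectorization means, here is a short console sketch (the arrays used here are just an illustration):

>>> import numpy as np
>>> A = np.array([[1, 2, 3], [4, 5, 6]])  # a 2x3 two-dimensional array
>>> A * 2                                 # vectorized: every element is doubled at once
array([[ 2,  4,  6],
       [ 8, 10, 12]])
>>> np.dot(A, A.T)                        # matrix product of A with its transpose
array([[14, 32],
       [32, 77]])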
SciPy
An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy
completes NumPy's functionalities, offering a larger variety of scientific algorithms
for linear algebra, sparse matrices, signal and image processing, optimization, fast
Fourier transformation, and much more.
Website: http://www.scipy.org/
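For instance, a minimal sketch of solving a small linear system with SciPy's linear algebra module is the following:

>>> import numpy as np
>>> from scipy import linalg
>>> A = np.array([[3.0, 1.0], [1.0, 2.0]])
>>> b = np.array([9.0, 8.0])
>>> linalg.solve(A, b)   # solves the linear system A x = b
array([ 2.,  3.])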
pandas
The pandas package deals with everything that NumPy and SciPy cannot do. Thanks
to its specific object data structures, DataFrames and Series, pandas allows you to
handle complex tables of data of different types (which is something that NumPy's
arrays cannot do) and time series. Thanks to Wes McKinney's creation, you will be
able to easily and smoothly load data from a variety of sources. You can then slice,
dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize
this data at your will.
Website: http://pandas.pydata.org/
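As a minimal sketch, here is a DataFrame that holds columns of different types and deals gracefully with a missing value (something a plain NumPy array cannot do):

>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['a', 'b', 'c'],
...                    'value': [1.5, 2.0, None]})
>>> df['value'].mean()   # the missing element is skipped automatically
1.75
>>> df.dtypes            # each column keeps its own type
name      object
value    float64
dtype: object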
Scikit-learn
Started as part of the SciKits (SciPy Toolkits), Scikit-learn is the core of data science
operations on Python. It offers all that you may need in terms of data preprocessing,
supervised and unsupervised learning, model selection, validation, and error
metrics. Expect us to talk at length about this package throughout this book.
Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau.
Since 2013, it has been taken over by the researchers at INRIA (the French Institute
for Research in Computer Science and Automation).
Website: http://scikit-learn.org/stable/
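Just to give you a taste of the package's uniform fit/predict interface, here is a tiny sketch run on the Iris dataset (which we will describe later in this chapter):

>>> from sklearn import datasets
>>> from sklearn.linear_model import LogisticRegression
>>> iris = datasets.load_iris()
>>> classifier = LogisticRegression()        # every estimator exposes fit and predict
>>> classifier = classifier.fit(iris.data, iris.target)
>>> print classifier.predict(iris.data[:3])  # the first three flowers are all setosa (class 0)
[0 0 0]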
IPython
A scientific approach requires the fast experimentation of different hypotheses in a
reproducible fashion. IPython was created by Fernando Perez in order to address the
need for an interactive Python command shell (which is based on shell, web browser,
and the application interface), with graphical integration, customizable commands,
rich history (in the JSON format), and computational parallelism for an enhanced
performance. IPython is our favored choice throughout this book, and it is used
to clearly and effectively illustrate operations with scripts and data and the
consequent results.
Website: http://ipython.org/
Matplotlib
Originally developed by John Hunter, matplotlib is the library that contains all the
building blocks that are required to create quality plots from arrays and to visualize
them interactively.
You can find all the MATLAB-like plotting frameworks inside the pylab module.
Website: http://matplotlib.org/
You can simply import what you need for your visualization purposes with the
following command:
import matplotlib.pyplot as plt
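For instance, a minimal sketch that draws a sine wave (assuming NumPy is also available) is the following:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(0, 2 * np.pi, 100)   # 100 evenly spaced points
>>> lines = plt.plot(x, np.sin(x))       # build the line plot
>>> plt.show()                           # open the interactive plotting window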
Statsmodels
Previously part of SciKits, statsmodels was conceived as a complement to SciPy's
statistical functions. It features generalized linear models, discrete choice models,
time series analysis, and a series of descriptive statistics as well as parametric and
nonparametric tests.
Website: http://statsmodels.sourceforge.net/
Version at the time of print: 0.6.0
Suggested install command: pip install statsmodels
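As a minimal sketch, an ordinary least squares fit on a toy linear response looks like this (the numbers are made up for illustration):

>>> import numpy as np
>>> import statsmodels.api as sm
>>> x = np.arange(10, dtype=float)
>>> X = sm.add_constant(x)        # add the intercept column
>>> y = 3.0 * x + 2.0             # a perfectly linear toy response
>>> results = sm.OLS(y, X).fit()
>>> print results.params          # intercept and slope are recovered
[ 2.  3.]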
Beautiful Soup
Beautiful Soup, a creation of Leonard Richardson, is a great tool to scrape data out
of HTML and XML files retrieved from the Internet. It works incredibly well,
even in the case of tag soups (hence the name), which are collections of malformed,
contradictory, and incorrect tags. After choosing your parser (basically, the HTML
parser included in Python's standard library works fine), thanks to Beautiful Soup,
you can navigate through the objects in the page and extract text, tables, and any
other information that you may find useful.
Website: http://www.crummy.com/software/BeautifulSoup/
Version at the time of print: 4.3.2
Suggested install command: pip install beautifulsoup4
Note that the imported module is named bs4.
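As a minimal sketch, here is how you would extract the text of all the paragraphs from a small chunk of HTML (the markup is made up for illustration, and the default parser is used):

>>> from bs4 import BeautifulSoup
>>> html = "<html><body><p>First paragraph</p><p>Second paragraph</body></html>"
>>> soup = BeautifulSoup(html)
>>> [p.get_text() for p in soup.find_all('p')]
[u'First paragraph', u'Second paragraph']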
NetworkX
Developed by the Los Alamos National Laboratory, NetworkX is a package
specialized in the creation, manipulation, analysis, and graphical representation
of real-life network data (it can easily operate with graphs made up of a million
nodes and edges). Besides specialized data structures for graphs and fine
visualization methods (2D and 3D), it provides the user with many standard
graph measures and algorithms, such as the shortest path, centrality, components,
communities, clustering, and PageRank. We will frequently use this package in
Chapter 5, Social Network Analysis.
Website: https://networkx.github.io/
Version at the time of print: 1.9.1
Suggested install command: pip install networkx
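Here is a minimal sketch that builds a tiny graph and queries it:

>>> import networkx as nx
>>> G = nx.Graph()                      # an undirected graph
>>> G.add_edges_from([(1, 2), (2, 3), (3, 4)])
>>> G.number_of_nodes(), G.number_of_edges()
(4, 3)
>>> nx.shortest_path(G, 1, 4)           # the only shortest path between nodes 1 and 4
[1, 2, 3, 4]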
NLTK
The Natural Language Toolkit (NLTK) provides access to corpora and lexical
resources and to a complete suite of functions for statistical Natural Language
Processing (NLP), ranging from tokenizers to part-of-speech taggers and from
tree models to named-entity recognition. Initially, the package was created by
Steven Bird and Edward Loper as an NLP teaching infrastructure for CIS-530 at
the University of Pennsylvania. It is a fantastic tool that you can use to prototype
and build NLP systems.
Website: http://www.nltk.org/
Version at the time of print: 3.0
Suggested install command: pip install nltk
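As a minimal sketch, here is how you would tokenize a sentence (the punkt tokenizer models have to be downloaded once with nltk.download):

>>> import nltk
>>> nltk.download('punkt')   # needed only the first time, to fetch the tokenizer models
>>> nltk.word_tokenize("Text mining with Python is fun!")
['Text', 'mining', 'with', 'Python', 'is', 'fun', '!']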
Gensim
Gensim, programmed by Radim Řehůřek, is an open source package that is suitable
for the analysis of large textual collections with the help of parallel distributable
online algorithms. Among advanced functionalities, it implements Latent Semantic
Analysis (LSA), topic modeling by Latent Dirichlet Allocation (LDA), and Google's
word2vec, a powerful algorithm that transforms text into vector features that can be
used in supervised and unsupervised machine learning.
Website: http://radimrehurek.com/gensim/
Version at the time of print: 0.10.3
Suggested install command: pip install gensim
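As a minimal sketch, here is how a tiny corpus of tokenized documents is turned into the bag-of-words representation that Gensim's models (LSA, LDA, and so on) expect:

>>> from gensim import corpora
>>> texts = [['human', 'machine', 'interface'],
...          ['graph', 'of', 'trees'],
...          ['graph', 'minors', 'survey']]
>>> dictionary = corpora.Dictionary(texts)            # maps every token to an integer id
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> print corpus[0]                                   # (token id, count) pairs of the first document
[(0, 1), (1, 1), (2, 1)]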
PyPy
PyPy is not a package; it is an alternative implementation of Python 2.7.8 that
supports most of the commonly used Python standard packages (unfortunately,
NumPy is currently not fully supported). As an advantage, it offers enhanced speed
and memory handling. Thus, it is very useful for heavy duty operations on large
chunks of data and it should be part of your big data handling strategies.
Website: http://pypy.org/
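The two checks referred to below simply consist of invoking the two installers from the command line and seeing whether they answer; a minimal sketch (the version numbers printed will depend on your system) is the following:

$> pip --version
$> easy_install --version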
If both these commands end with an error, you need to install either one of them.
We recommend that you use pip because it is considered an improvement over
easy_install. Moreover, packages installed by pip can be uninstalled and, if by
chance your package installation fails, pip will leave your system clean.

To install pip, follow the instructions given at https://pip.pypa.io/en/latest/installing.html.

The most recent versions of Python should already have pip installed by default,
so you may already have it on your system. If not, the safest way is to download
the get-pip.py script from https://bootstrap.pypa.io/get-pip.py and then run it
using the following:

$> python get-pip.py

The script will also install the setuptools package from https://pypi.python.org/pypi/setuptools, which contains easy_install.
You're now ready to install the packages you need in order to run the examples
provided in this book. To install the generic package <pk>, you just need to run
the following command:
$> pip install <pk>
After this, the package <pk> and all its dependencies will be downloaded and
installed. If you're not sure whether a library has been installed or not, just try to
import a module from it. If the Python interpreter raises an ImportError, you can
conclude that the package has not been installed.
This is what happens when the NumPy library has been installed:
>>> import numpy
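If, instead, you try to import a module that is not present on your system (missing_package below is just a placeholder name), the interpreter will complain as follows:

>>> import missing_package
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named missing_package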
In the latter case, you'll need to first install it through pip or easy_install.
Take care that you don't confuse packages with modules. With pip,
you install a package; in Python, you import a module. Sometimes,
the package and the module have the same name, but in many cases,
they don't match. For example, the sklearn module is included
in the package named Scikit-learn.
Finally, to search and browse the packages available for Python, take a look
at https://pypi.python.org.
Package upgrades
More often than not, you will find yourself in a situation where you have to
upgrade a package because the new version is either required by a dependency
or has additional features that you would like to use. First, check the version of
the library you have installed by glancing at the __version__ attribute, as shown
in the following example, numpy:
>>> import numpy
>>> numpy.__version__ # 2 underscores before and after
'1.9.0'
Now, if you want to update it to a newer release, say the 1.9.1 version, you can
run the following command from the command line:
$> pip install -U numpy==1.9.1
Scientific distributions
As you've read so far, creating a working environment is a time-consuming
operation for a data scientist. You first need to install Python and then, one by
one, you can install all the libraries that you will need (sometimes, the installation
procedures may not go as smoothly as you'd hoped for earlier).
If you want to save time and effort and want to ensure that you have a fully working
Python environment that is ready to use, you can just download, install, and use
the scientific Python distribution. Apart from Python, they also include a variety
of preinstalled packages, and sometimes, they even have additional tools and an
IDE. A few of them are very well known among data scientists, and in the sections
that follow, you will find some of the key features of each of these distributions.
We suggest that you first promptly download and install a scientific distribution,
such as Anaconda (which is the most complete one), and after practicing the
examples in the book, decide to fully uninstall the distribution and set up Python
alone, which can be accompanied by just the packages you need for your projects.
Anaconda
Anaconda (https://store.continuum.io/cshop/anaconda) is a Python
distribution offered by Continuum Analytics that includes nearly 200 packages,
which include NumPy, SciPy, pandas, IPython, Matplotlib, Scikit-learn, and NLTK.
It's a cross-platform distribution that can be installed on machines with other existing
Python distributions and versions, and its base version is free. Additional add-ons
that contain advanced features are charged separately. Anaconda introduces
conda, a binary package manager, as a command-line tool to manage your package
installations. As stated on the website, Anaconda's goal is to provide an enterprise-ready
Python distribution for large-scale processing, predictive analytics, and
scientific computing.
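For instance, once Anaconda is in place, installing or updating a package becomes a matter of a single conda command, along these lines:

$> conda install scikit-learn
$> conda update scikit-learn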
Enthought Canopy
Enthought Canopy (https://www.enthought.com/products/canopy/) is a Python
distribution by Enthought, Inc. It includes more than 70 preinstalled packages, which
include NumPy, SciPy, Matplotlib, IPython, and pandas. This distribution is targeted
at engineers, data scientists, quantitative and data analysts, and enterprises. Its base
version is free (it is named Canopy Express), but if you need advanced features,
you have to buy a full version. It's a multiplatform distribution, and its command-line
install tool is canopy_cli.
PythonXY
PythonXY (https://code.google.com/p/pythonxy/) is a free, open source
Python distribution maintained by the community. It includes a number of packages,
which include NumPy, SciPy, NetworkX, IPython, and Scikit-learn. It also includes
Spyder, an interactive development environment inspired by the MATLAB IDE.
The distribution is free. It works only on Microsoft Windows, and its command-line
installation tool is pip.
WinPython
WinPython (http://winpython.sourceforge.net) is also a free, open-source
Python distribution maintained by the community. It is designed for scientists,
and includes many packages such as NumPy, SciPy, Matplotlib, and IPython. It
also includes Spyder as an IDE. It is free and portable (you can put it in any
directory, or even in a USB flash drive). It works only on Microsoft Windows,
and its command-line tool is the WinPython Package Manager (WPPM).
Introducing IPython
IPython is a special tool for interactive tasks, which contains special commands
that help the developer better understand the code that they are currently writing.
These are the commands:

• <object>? and <object>??: this prints a detailed (and even more detailed, with ??) description of the object
• %<function>: this calls one of the so-called magic functions, such as %timeit or %quickref
Let's demonstrate the usage of these commands with an example. We first start
the interactive console with the ipython command, as shown here:

$> ipython
Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[...]
object?   -> Details about 'object', use 'object??' for extra details.
Then, in the first line of code, which is marked by IPython as [1], we create a list
of 10 numbers (from 0 to 9), assigning the output to an object named obj1:

In [1]: obj1 = range(10)

In [2]: obj1?
Type:        list
String Form: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Length:      10
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items

In [3]: %timeit x=100
10000000 loops, best of 3: 23.4 ns per loop

In [4]: %quickref
In the next line of code, which is numbered [2], we inspect the obj1 object using
the IPython command ?. IPython introspects the object, prints its details (obj1 is a
list that contains the values [0, 1, 2, ..., 9] and has 10 elements), and finally prints
some general documentation on lists. That is not particularly revealing for such a
simple object; however, for complex objects, the usage of ?? instead of ? gives a more
verbose output.

In line [3], we apply the magic function %timeit to a Python assignment (x=100).
The %timeit function runs this instruction many times and stores the computational
time needed to execute it. Finally, it prints the average time that was taken to run
the Python instruction.

We complete the overview with a list of all the possible IPython special functions
by running the helper function %quickref, as shown in line [4].
As you noticed, each time we use IPython, we have an input cell and optionally,
an output cell, if there is something that has to be printed on stdout. Each input
is numbered, so it can be referenced inside the IPython environment itself. For
our purposes, we don't need to provide such references in the code of the book.
Therefore, we will just report inputs and outputs without their numbers. However,
we'll use the generic In: and Out: notations to point out the input and output cells.
Just copy the commands after In: to your own IPython cell and expect an output
that will be reported on the following Out:.
Therefore, the basic notations will be:

In: command
Out: output
Otherwise, if we expect you to operate directly on the Python console, we will use
the following form:
>>> command
Wherever necessary, the command-line input and output will be written as follows:
$> command
Moreover, to run the bash command in the IPython console, prefix it with a "!"
(an exclamation mark):
In: !ls
Applications
Develop
Google Drive
Public
Pictures
env
temp
Desktop
...
In: !pwd
/Users/mycomputer
The IPython Notebook, among other things, allows you to:

• Present your work (this will be a combination of text, code, and images)
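The first steps of the procedure simply consist of launching the Notebook server from the command line and waiting for your browser to open the Notebook dashboard (by default, it is served at http://localhost:8888):

$> ipython notebook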
3. Then, click on New Notebook. A new window will open, as shown in the
following screenshot:
This is the web app that you'll use to compose your story. It's very similar to a Python
IDE, with the bottom section (where you can write the code) composed of cells.
A cell can be either a piece of text (optionally formatted with a markup language)
or a piece of code. In the second case, you have the ability to run the code, and any
output (the standard output) will be placed under the cell. The following
is a very simple example:
In: import random
a = random.randint(0, 100)
a
Out: 16
In: a*2
Out: 32
In the first cell, which is denoted by In:, we import the random module, assign a
random value between 0 and 100 to the variable a, and print the value. When this
cell is run, the output, which is denoted as Out:, is the random number. Then, in the
next cell, we will just print the double of the value of the variable a.
As you can see, it's a great tool to debug and decide which parameter is best for a
given operation. Now, what happens if we run the code in the first cell again? Will the
output of the second cell be modified, since a is different? Actually, no. Each cell is
independent and autonomous. In fact, after we rerun the code in the first cell, we end
up in this inconsistent state:
In: import random
a = random.randint(0, 100)
a
Out: 56
In: a*2
Out: 32
Also note that the number in the square brackets has changed
(from 1 to 3) since it's the third command executed (along with its output)
since the notebook started. Since each cell is autonomous, by
looking at these numbers, you can understand their order of execution.
IPython is a simple, flexible, and powerful tool. However, as seen in the preceding
example, you must note that when you update a variable that is going to be used
later on in your Notebook, remember to run all the cells following the updated code
so that you have a consistent state.
When you save an IPython notebook, the resulting .ipynb file is JSON formatted,
and it contains all the cells and their content, plus the output. This makes things easier
because you don't need to run the code to see the notebook (actually, you also don't
need to have Python and its set of toolkits installed). This is very handy, especially
when you have pictures featured in the output and some very time-consuming
routines in the code. A downside of using the IPython Notebook is that its file format,
which is JSON structured, cannot be easily read by humans. In fact, it contains images,
code, text, and so on.
Now, let's discuss a data science related example (don't worry about understanding
it completely):
In:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
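The second cell loads the Boston housing dataset bundled with Scikit-learn; a sketch that matches the variable names used in the rest of the example is the following:

In:
boston_dataset = datasets.load_boston()
X_full = boston_dataset.data
Y = boston_dataset.target
print X_full.shape
print Y.shape
Out:
(506, 13)
(506,)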
Then, in cell [2], the dataset is loaded and an indication of its shape is shown.
The dataset contains 506 house values that were sold in the suburbs of Boston, along
with their respective data arranged in columns. Each column of the data represents
a feature. A feature is a characteristic property of the observation. Machine learning
uses features to establish models that can turn them into predictions. If you come from
a statistical background, you can think of features as variables
(values that vary with respect to the observations).
To see a complete description of the dataset, print boston_dataset.DESCR.
After loading the observations and their features, in order to provide a demonstration
of how IPython can effectively support the development of data science solutions, we
will perform some transformations and analyses on the dataset. We will use classes,
such as SelectKBest, and methods, such as .get_support() or .fit(). Don't worry
if these are not clear to you now; they will all be covered extensively later in this book.
Try to run the following code:
In:
selector = SelectKBest(f_regression, k=1)
selector.fit(X_full, Y)
X = X_full[:, selector.get_support()]
print X.shape
Out:
(506, 1)
In the preceding cell, we select a feature (the most discriminative one) with the
SelectKBest class, which is fitted to the data by using the .fit() method. Then, we
reduce the dataset to a vector by indexing on all the rows and on the selected feature,
which can be retrieved by the .get_support() method.
Since the target value is a vector, we can, therefore, try to see whether there is a linear
relation between the input (the feature) and the output (the house value). When there
is a linear relationship between two variables, the output will constantly react to
changes in the input by the same proportional amount and direction.
In the fourth cell, we scatter the input and output values for this problem:

In:
plt.scatter(X, Y, color='black')
plt.show()
In:
regressor = LinearRegression(normalize=True)
regressor.fit(X, Y)
plt.scatter(X, Y, color='black')
plt.plot(X, regressor.predict(X), color='blue', linewidth=3)
plt.show()
In the next cell, we create a regressor (a simple linear regression with feature
normalization), train the regressor, and finally plot the best linear relation (that's
the linear model of the regressor) between the input and output. Clearly, the linear
model is an approximation that is not working well. We have two possible roads
that we can follow at this point. We can transform the variables in order to make
their relationship linear, or we can use a nonlinear model. Support Vector Machine
(SVM) is a class of models that can easily solve nonlinearities. Random Forests
is another model that can automatically solve similar problems. Let's see them both in
action in IPython:
In:
regressor = SVR()
regressor.fit(X, Y)
plt.scatter(X, Y, color='black')
plt.scatter(X, regressor.predict(X), color='blue', linewidth=3)
plt.show()
In:
regressor = RandomForestRegressor()
regressor.fit(X, Y)
plt.scatter(X, Y, color='black');
plt.scatter(X, regressor.predict(X), color='blue', linewidth=3)
plt.show()
Finally, in the last two cells, we will repeat the same procedure. This time we will
use two nonlinear approaches: an SVM and a Random Forest based regressor.
Written directly in the IPython interface, this demonstrative code solves
the nonlinearity problem. At this point, it is very easy to change the selected feature,
the regressor, the number of features we use to train the model, and so on, by simply
modifying the cells where the script is. Everything can be done interactively, and
according to the results we see, we can decide both what should be kept or changed
and what is to be done next.
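The loading cell relies on the datasets module bundled with Scikit-learn; a minimal sketch is the following:

In: from sklearn import datasets
In: iris = datasets.load_iris()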
After loading, we can explore the data description and understand how the
features and targets are stored. Basically, all Scikit-learn datasets present the
following attributes and methods:

• .DESCR: this provides a general description of the dataset
• .data: this contains the observations, arranged as a matrix whose columns are the features
• .feature_names: this reports the names of the features
• .target: this contains the target values, expressed either as values or as numbered classes
• .target_names: this reports the names of the classes in the target
• .shape: this can be applied to both .data and .target; it reports the number of observations (and, for .data, of features)
Now, let's just try to implement them (no output is reported, but the print commands
will provide you with plenty of information):
In: print iris.DESCR
In: print iris.data
In: print iris.data.shape
In: print iris.feature_names
In: print iris.target
In: print iris.target.shape
In: print iris.target_names
Now, you should know something more about the dataset: how many
examples and variables are present and what their names are.
Notice that the main data structures that are enclosed in the iris object are the
two arrays, data and target:
In: print type(iris.data)
Out: <type 'numpy.ndarray'>
iris.data offers the numeric values of the variables named sepal length, sepal
width, petal length, and petal width, arranged in a matrix form (150, 4), where
150 is the number of observations and 4 is the number of features. The order of the
variables is the order presented in iris.feature_names. iris.target is a vector
of integer values, where each value represents the observation's
class (refer to the content of target_names; each class name is related to its index
number, and setosa, which is the zero element of the list, is represented as 0 in the
target vector).
The Iris flower dataset was first used in 1936 by Ronald Fisher, who was one of
the fathers of modern statistical analysis, in order to demonstrate the functionality
of linear discriminant analysis on a small set of empirically verifiable examples (each
of the 150 data points represented an iris flower). These examples were arranged into
three balanced species classes (each class consisted of one-third of the examples) and
were provided with four metric descriptive variables that, when combined, were
able to separate the classes.
The advantage of using such a dataset is that it is very easy to load, handle, and
explore for different purposes, from supervised learning to graphical representation.
Modeling activities take almost no time on any computer, no matter what its
specifications are. Moreover, the relationship between the classes and the role of the
explicative variables are well known. So, the task is challenging, but it is not arduous.
For example, let's just observe how easily the classes can be separated when you
combine at least two of the four available variables by using a scatterplot matrix.
Scatterplot matrices are arranged in a matrix format, whose columns and rows are
the dataset variables. The elements of the matrix contain single scatterplots whose x
values are determined by the row variable of the matrix and y values by the column
variable. The diagonal elements of the matrix may contain a distribution histogram
or some other univariate representation of the variable at the same time in its row
and column.
The pandas library offers an off-the-shelf function to quickly build scatterplot
matrices and start exploring the relationships and distributions between the
quantitative variables in a dataset.
In:
import pandas as pd
import numpy as np
In: colors = list()
In: palette = {0: "red", 1: "green", 2: "blue"}
In:
for c in np.nditer(iris.target): colors.append(palette[int(c)])
# using the palette dictionary, we convert
# each numeric class into a color string
In: dataframe = pd.DataFrame(iris.data,
columns=iris.feature_names)
In: scatterplot = pd.scatter_matrix(dataframe, alpha=0.3,
figsize=(10, 10), diagonal='hist', color=colors, marker='o',
grid=True)
We encourage you to experiment a lot with this dataset and with similar ones before
you work on other complex real data, because the advantage of focusing on an
accessible, non-trivial data problem is that it can help you to quickly build your
foundations in data science.
After a while, however, though useful and interesting for your learning activities,
toy datasets will start limiting the variety of experiments that you can run. In spite
of the insight they provide, in order to progress, you'll need to gain access to complex
and realistic data science problems. We will, therefore, have to resort to some
external data.
As in the case of the Scikit-learn toy datasets, the obtained object is a complex
dictionary-like structure, where your predictive variables are earthquakes.data
and your target to be predicted is earthquakes.target. This being real data,
in this case you will have quite a lot of examples and just a few variables available.
In: import urllib2
In: target_page = 'http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a'
In: a2a = urllib2.urlopen(target_page)
In: from sklearn.datasets import load_svmlight_file
In: X_train, y_train = load_svmlight_file(a2a)
In: print X_train.shape, y_train.shape
Out: (2265, 119) (2265L,)
In return, you will get two single objects: a set of training examples in a sparse
matrix format and an array of responses.
>>> import numpy as np
>>> type(np.loadtxt)
<type 'function'>
>>> help(np.loadtxt)
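The cell that loads the housing data with the default floating-point type goes along these lines, assuming the CSV file sits in your working directory:

In: housing = np.loadtxt('regression-datasets-housing.csv',
delimiter=',')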
If you need a different type (for example, an int), you have to
declare it beforehand.
For instance, if you want to convert the numeric data to int, use the following code:

In: housing_int = np.loadtxt('regression-datasets-housing.csv',
delimiter=',', dtype=int)
Printing the first three elements of the first row of the housing and housing_int
arrays can help you understand the difference:

In: print housing[0,:3], '\n', housing_int[0,:3]
Out:
[  6.32000000e-03   1.80000000e+01   2.31000000e+00]
[ 0 18  2]
Frequently, though this is not the case in our example, data files feature a textual
header in the first line that contains the names of the variables. In this situation,
you have to point out to loadtxt the row from which it should start reading the data
by using the skiprows parameter. The header being on row 0 (in Python, counting
always starts from 0), the parameter skiprows=1 will save the day and allow you
to avoid the error and load your data correctly.
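As a sketch of that situation, assuming a hypothetical housing_with_header.csv file whose first row contains the column names:

In: housing = np.loadtxt('housing_with_header.csv',
delimiter=',', skiprows=1)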
The situation would be slightly different if you were to download the Iris dataset,
which is present at http://mldata.org/repository/data/viewslug/datasets-uci-iris/.
In fact, this dataset presents a qualitative target variable, class, which
is a string that expresses the iris species. Specifically, it's a categorical variable with
three levels.
Therefore, if you were to use the loadtxt function, you would get a value error,
due to the fact that an array must have all its elements of the same type. The variable
class is a string, whereas the other variables are made up of floating-point values.
How do we proceed? The pandas library offers the solution, thanks to its DataFrame
data structure, which can easily handle datasets in a matrix form (rows by columns)
made up of different types of variables.
First of all, just download the datasets-uci-iris.csv file and have it saved in
your local directory.
At this point, using pandas' read_csv is quite straightforward:
In: iris_filename = 'datasets-uci-iris.csv'
In: iris = pd.read_csv(iris_filename, sep=',', decimal='.',
header=None, names= ['sepal_length', 'sepal_width', 'petal_length',
'petal_width', 'target'])
In: print type(iris)
Out: <class 'pandas.core.frame.DataFrame'>
Apart from the filename, you can specify the separator (sep), the way the decimal
points are expressed (decimal), whether there is a header (in this case, header=None;
normally, if you have a header, then header=0), and the names of the variables, where
there are any (you can use a list; otherwise, pandas will provide some automatic
naming).
Also, we have defined names that use single words (instead of spaces,
we used underscores). Thus, we can later directly extract single
variables by calling them as we do with methods; for instance,
iris.sepal_length will extract the sepal length data.
If, at this point, you need to convert the pandas DataFrame into a couple of
NumPy arrays that contain the data and target values, this can be easily done
in a couple of commands:
In: iris_data = iris.values[:,:4]
In: iris_target, iris_target_labels = pd.factorize(iris.target)
In: print iris_data.shape, iris_target.shape
Out: (150L, 4L) (150L,)
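A sketch of the corresponding cell, matching the description that follows (one million examples, ten features, and random_state=101), is the following:

In:
from sklearn import datasets
X, y = datasets.make_classification(n_samples=10**6,
n_features=10, random_state=101)
print X.shape, y.shape
Out:
(1000000, 10) (1000000,)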
After importing just the datasets module, we ask, using the make_classification
command, for 1 million examples (the n_samples parameter) and 10 useful features
(n_features). The random_state should be 101, so we can be assured that we
can replicate the same dataset at a different time and on a different machine.
For instance, you can type the following command:

In: datasets.make_classification(1, n_features=4, random_state=101)
Out: (array([[..., 1.40145585]]), ...)
No matter what the computer and the specific situation is, random_state assures
deterministic results that make your experimentations perfectly replicable.
Defining the random_state parameter using a specific integer number (in this
case 101, but it may be any number that you prefer or find useful) allows the easy
replication of the same dataset on your machine, the way it is set up, on different
operating systems, and on different machines.
By the way, did it take too long?
If it didn't seem to take too long on your machine, and if you are ready, having set up
and tested everything up to this point, we can start our data science journey.
Summary
In this short introductory chapter, we installed everything that we will be using
throughout this book, including the packages needed for the examples, which were
installed either directly or by using a scientific distribution. We also introduced you
to IPython and demonstrated how you can have access to the data used in the examples.
In the next chapter, Data Munging, we will have an overview of the data science
pipeline and explore all the key tools to handle and prepare data before you apply
any learning algorithm and set up your hypothesis experimentation schedule.