Python Business Intelligence Cookbook - Sample Chapter
Rather than spending day after day scouring Internet forums for "how-to" information, here you'll find more
than 60 recipes that take you through the entire process of creating actionable intelligence from your raw
data, no matter what shape or form it's in. Within the first 30 minutes of opening this book, you'll learn how
to use the latest in Python and NoSQL databases to glean insights from data just waiting to be exploited.
Robert Dempsey
The amount of data produced by businesses and devices keeps growing. Python is an
excellent tool for specialized analysis tasks, and it is backed by libraries to process data
streams, to visualize datasets, and to carry out scientific calculations.
$ 39.99 US
25.99 UK
Python Business
Intelligence Cookbook
Leverage the computational power of Python with more than 60
recipes that arm you with the required skills to make informed
business decisions
Robert Dempsey
Preface
Data! Everyone is surrounded by it, but few know how to truly exploit it. For those who do,
glory awaits!
Okay, so that's a little dramatic; however, being able to turn raw data into actionable
information is a goal that every organization is working to achieve. This book helps you
achieve it.
Making sense of data isn't some esoteric art requiring multiple degrees; it's a matter of
knowing the recipes to take your data through each stage of the process. It all starts with
asking an interesting question.
My mission is that, by the end of this book, you will be equipped to apply Python to business
intelligence tasks (preparing, exploring, analyzing, visualizing, and reporting) in order to make
more informed business decisions using the data at hand.
Prepare for an awesome read, my friend!
A little context first. The code in this book is developed on Mac OS X 10.11.1, using
Python 3.4.3, IPython 4.0.0, matplotlib 1.4.3, NumPy 1.9.1, scikit-learn 0.16.1, and
Pandas 0.16.2; in other words, the latest or near-latest versions at the time of publishing.
Chapter 4, Performing Data Analysis for Non Data Analysts, provides recipes to perform
statistical and predictive analysis on your data.
Chapter 5, Building a Business Intelligence Dashboard Quickly, builds on everything that
you've learned and shows you how to generate reports in Excel, and build web-based business
intelligence dashboards.
Installing Anaconda
Installing Rodeo
Starting Rodeo
Installing Robomongo
Introduction
In this chapter, you'll get fully set up to perform business intelligence tasks with Python. We'll
start by installing a distribution of Python called Anaconda. Next, we'll get MongoDB up and
running for storing data. After that, we'll install additional Python libraries, install a GUI tool for
MongoDB, and finally take a look at the dataset that we'll be using throughout this book.
Without further ado, let's get started!
Installing Anaconda
Throughout this book, we'll be using Python as the main tool for performing business
intelligence tasks. This recipe shows you how to install a specific Python distribution,
Anaconda.
Getting ready
Regardless of which operating system you use, open a web browser and browse to the
Anaconda download page at http://continuum.io/downloads.
The download page will automatically detect your operating system.
How to do it
In this section, we have listed the steps to install Anaconda for all the major operating
systems: Mac OS X, Windows, and Linux.
Mac OS X 10.10.4
1. Click on the I WANT PYTHON 3.4 link. We'll be using Python 3.4 throughout this book.
2. Next, click on the Mac OS X 64-Bit Python 3.4 Graphical Installer button to
download Anaconda.
3. Once the download completes, browse your computer to find the downloaded
Anaconda, and double-click on the Anaconda installer file (a .pkg file) to begin the
installation.
4. Walk through the installer steps to complete the installation. I recommend keeping
the default settings.
5. To verify that Anaconda is installed correctly, open a terminal and type the following
command:
python
6. If the installer was successful, you should see something like this:
Windows 8.1
1. Click on the I WANT PYTHON 3.4 link. We'll be using Python 3.4 throughout this book.
2. Next, click on the Windows 64-Bit Python 3.4 Graphical Installer button to download
Anaconda.
3. Once the download completes, browse your computer to find the downloaded
Anaconda, and double-click on the Anaconda3-2.3.0-Windows-x86_64.exe file
to begin the installation.
4. Walk through the installer steps to complete the installation. I recommend keeping
the default settings.
5. To verify that Anaconda has installed correctly, open a command prompt and type the
following command:
python
5. I've created a special shortcut on my website that is a bit easier to type at the
command line: http://robertwdempsey.com/anaconda3-linux.
6. Once Anaconda downloads, use the following command to start the installer:
bash Anaconda3-2.3.0-Linux-x86_64.sh
7.
8. When asked if you would like Anaconda to prepend the Anaconda3 install location to
the PATH variable, type yes.
To have the PATH update take effect immediately after the installation
completes, type the following command in the command line:
source ~/.bashrc
9. Once the installation is complete, verify the installation by typing python in the
command line. If everything worked correctly, you should see something like this:
How it works
Anaconda has several advantages over downloading Python from http://www.python.org
or using the Python distribution included with your computer, some of which are as follows:
Almost 90 percent of what you'll use on a day-to-day basis is already included. In fact,
it contains over 330 of the most popular Python packages.
Using Anaconda on both the computer you use for development and the server where
your solutions will be deployed helps ensure that you are using the same version of
the Python packages that your applications require.
It's updated frequently, so you can stay current with recent versions of Python and
the Python packages.
At the time of writing this, the current version of Anaconda for Python 3 is 2.3.0.
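The check performed at the command line can also be done from inside Python. The following is a minimal sketch rather than part of the original recipe: the helper name describe_interpreter is hypothetical, and looking for the word "Anaconda" in the version banner is an assumption that holds for Anaconda builds of this era but may not hold for newer ones.

```python
import sys


def describe_interpreter():
    """Collect basic facts about the Python interpreter that is running."""
    return {
        # Full path of the running interpreter; for an Anaconda install this
        # normally points inside the Anaconda directory.
        "executable": sys.executable,
        # (major, minor) tuple, for example (3, 4) for Python 3.4.
        "version": tuple(sys.version_info[:2]),
        # Assumption: older Anaconda builds include their name in the banner.
        "banner_mentions_anaconda": "anaconda" in sys.version.lower(),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(key, "=", value)
```

If the executable path does not point at your Anaconda directory, the PATH change described in the installation steps has probably not taken effect in the current shell.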
Scikit-learn: Gives us simple and efficient tools for data mining and data analysis
including classification, regression, clustering, dimensionality reduction, model
selection, and preprocessing. This will be the workhorse library for our analysis.
Matplotlib: A 2D plotting library. We'll use this to generate all our charts.
PyMongo: Allows us to connect to and use MongoDB. We'll use this to insert and
retrieve data from MongoDB.
XlsxWriter: This allows us to create Microsoft Excel (.xlsx) files. This library will
be used to generate reports in the Excel format.
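Before moving on, it can help to confirm that the libraries listed above are importable. This is a small sketch under stated assumptions: the function name missing_modules is hypothetical, and the list contains only the four libraries named in the list above, using their import names (scikit-learn is imported as sklearn).

```python
import importlib.util

# Import names of the libraries described above. scikit-learn is imported
# as "sklearn"; the others use their lowercase package names.
REQUIRED = ["sklearn", "matplotlib", "pymongo", "xlsxwriter"]


def missing_modules(names):
    """Return the names from the list that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed libraries are available.")
```

The check only locates each module without importing it, so it runs quickly and has no side effects.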
Getting ready
Open a web browser and visit https://www.mongodb.org/downloads.
How to do it
Mac OS X
The following steps explain how to install, configure, and run MongoDB on Mac OS X:
1. On the download page, click on the Mac OS X tab, and select the version you want.
2. Click on the Download (TGZ) button to download MongoDB.
3. Unpack the downloaded file and copy to any directory that you like. I typically create
an Applications folder in my home directory where I install apps like this.
4. For our purpose, we're going to set up a single instance of MongoDB. This means
there is literally nothing to configure. To run MongoDB, open a command prompt and
do the following:
Make your user the owner of the directory using the chown command:
./mongod
Windows
The following steps explain how to install, configure, and run MongoDB on Windows:
1. Click on the Windows tab, and select the version you want.
2. Click on the Download (MSI) button to download MongoDB.
3. Once downloaded, browse to the folder where Mongo was downloaded, and double-click on the installer file.
When asked which setup type you want, select Complete.
4. Follow the instructions to complete the installation.
5. Create a data folder at C:\data\db. MongoDB needs this directory in order to run.
This is where, by default, Mongo is going to store all its database files.
6. Next, at the command prompt, navigate to the directory where Mongo was installed
and run Mongo:
cd C:\Program Files\MongoDB\Server\3.0\bin
mongod.exe
7.
8. You should see an output like the following screenshot from Mongo, letting you know
it's working:
Linux
The easiest way to install MongoDB in Linux is by using apt. At the time of writing, there are
apt packages for 64-bit long-term support Ubuntu releases, specifically 12.04 LTS and 14.04
LTS. Since the URL for the public key can change, please visit the Mongo Installation Tutorial
to ensure that you have the most recent one:
https://docs.mongodb.org/manual/tutorial/install-mongodb-on-ubuntu/.
Install Mongo as follows:
1. Log in to your Linux box
2. Import the public key:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
4. Update apt:
sudo apt-get update
7. Verify that MongoDB is running by checking the contents of the log file at
/var/log/mongodb/mongod.log for a line that looks like this:
[initandlisten] waiting for connections on port 27017
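The log check in the last step can be scripted. The following is a minimal sketch, assuming the log path given above; the function name mongod_is_ready is hypothetical, and the marker text is taken from the log line quoted in the step, without the port number so that a non-default port still matches.

```python
from pathlib import Path

# Text that mongod writes to its log once it is ready to accept connections.
READY_MARKER = "waiting for connections on port"


def mongod_is_ready(log_path, marker=READY_MARKER):
    """Return True if any line of the log file contains the marker text."""
    path = Path(log_path)
    if not path.is_file():
        return False
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with path.open(encoding="utf-8", errors="replace") as log:
        return any(marker in line for line in log)


if __name__ == "__main__":
    print(mongod_is_ready("/var/log/mongodb/mongod.log"))
```

Reading the log may require elevated permissions on some systems, in which case the script needs to be run with the same privileges you would use to view the file by hand.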
How it works
MongoDB's document data model makes it easy for you to store data of any structure and
to dynamically modify the schema. In layman's terms, MongoDB provides a vast amount of
flexibility when it comes to storing your data. This comes in very handy when we import our
data. Unlike with an SQL database, we won't have to create a table or set up a schema
beforehand; MongoDB creates the collection (and its default _id index) automatically when we import the data.
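To make the flexibility concrete, here is a hedged sketch using PyMongo. The two sample documents are invented for illustration and are not taken from the book's dataset; the database name demo_db, the collection name demo_records, and both function names are placeholders. Running insert_samples requires the pymongo package and a mongod instance listening on the default port.

```python
# Two made-up records with different fields; MongoDB accepts both in the
# same collection without any schema being declared first.
SAMPLE_DOCUMENTS = [
    {"accident_id": 1, "road_type": "single carriageway"},
    {"accident_id": 2, "speed_limit": 30, "casualties": [{"age": 41}]},
]


def field_names(documents):
    """Return the sorted union of top-level keys across all documents."""
    return sorted({key for doc in documents for key in doc})


def insert_samples(uri="mongodb://localhost:27017",
                   db_name="demo_db", collection_name="demo_records"):
    """Insert the sample documents and return how many were stored.

    The database and collection are placeholders; MongoDB creates both
    the first time a document is inserted.
    """
    from pymongo import MongoClient  # lazy import: only needed when called

    client = MongoClient(uri)
    try:
        collection = client[db_name][collection_name]
        # Copy each dict because insert_many adds an "_id" key in place.
        result = collection.insert_many([dict(doc) for doc in SAMPLE_DOCUMENTS])
        return len(result.inserted_ids)
    finally:
        client.close()


if __name__ == "__main__":
    print(field_names(SAMPLE_DOCUMENTS))
```

The helper field_names shows that the documents do not share a fixed set of columns, which is exactly the situation a relational table would not accept without a schema change.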
Installing Rodeo
IPython Notebook, an interactive, browser-based tool for developing in Python, has become
the de facto standard for creating and sharing code. We'll be using it throughout this book.
The Python library that we're about to install, Rodeo, is an alternative you can use. The
difference between IPython Notebook and Rodeo is that Rodeo has built-in functionality
to view data in a Pandas data frame, which can come in handy when you want
to view, in real time, the changes that you are making to your data. Having said that, IPython
Notebook is the current standard.
Getting ready
To use this recipe, you need a working installation of Python.
How to do it
Regardless of the operating system, you install Rodeo with the following command:
pip install rodeo
How it works
The pitch for Rodeo is that it's a data-centric IDE for Python. I use it as an alternative to
IPython Notebook when I want to be able to view the contents of my Pandas data frames while
working with my data. If you've ever used a tool like RStudio, Rodeo will feel very familiar.
Starting Rodeo
This recipe shows you how to start Rodeo.
Getting ready
To use this recipe, you need to have Rodeo installed.
How to do it
To start an instance of Rodeo, change to the directory where you want to run it, and type the
following command:
rodeo .
Once Rodeo is up and running, open a browser and enter the following URL:
http://localhost:5000
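If the page does not load, it is worth checking whether anything is listening on that port at all. The sketch below is an addition to the recipe, not something Rodeo provides: the function name port_is_open is hypothetical, and the default port of 5000 is taken from the URL above.

```python
import socket


def port_is_open(host="localhost", port=5000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError when nothing is listening or
        # the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Rodeo reachable:", port_is_open())
```

The same helper can be pointed at port 27017 to check whether a local MongoDB instance is accepting connections.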
Installing Robomongo
Robomongo is a GUI tool for managing MongoDB that runs on Mac OS X, Windows, and Linux.
It allows you to create new databases and collections and to run queries. It gives you the full
power of the MongoDB shell in a GUI application, and has features including multiple shells,
multiple results, and autocompletion. And to top it all, it's free.
Getting ready
Open a web browser, and browse to http://robomongo.org/.
How to do it
Mac OS X
The following steps explain how to install Robomongo on Mac OS X:
1. Click on the Download for Mac OS X button.
2. Click on the Mac OS X Installer (.dmg) link to download the file.
3. Once downloaded, double-click on the installer file.
4. Drag the Robomongo application to the Applications folder.
5. Open the Applications folder, and double-click on Robomongo to start it up.
6. In the MongoDB Connections window, create a new connection:
7. Click on Save.
Windows
The following steps explain how to install Robomongo on Windows:
1. Click on the Download for Windows button.
2. Click on the Windows Installer (.exe) link to download the file.
3. Once downloaded, double-click on the installer file, and follow the install instructions,
accepting all the defaults.
4. Finally, run Robomongo.
5. In the MongoDB Connections window, create a new connection:
6. Click on Save.
7.
8. In the View menu, select Explorer to start browsing the existing MongoDB databases.
As this is a brand new instance, you will only have the system collection.
Getting ready
To use this recipe, you need to have a working installation of MongoDB and have Robomongo
installed.
How to do it
You can use Robomongo to run any query against MongoDB that you would run at the
command line. Use the following command to retrieve a single record:
db.getCollection('accidents').findOne()
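The same query can be run from Python with PyMongo, whose find_one method corresponds to the shell's findOne. This is a sketch under assumptions: the database name your_database is a placeholder because the name is not given here, both function names are hypothetical, and open_accidents_collection needs the pymongo package and a running mongod.

```python
def first_record(collection):
    """Return one document from the collection, or None if it is empty."""
    return collection.find_one()


def open_accidents_collection(uri="mongodb://localhost:27017",
                              db_name="your_database"):
    """Return (client, collection) for the 'accidents' collection.

    db_name is a placeholder; replace it with the database that holds
    your imported data. Close the client when you are finished.
    """
    from pymongo import MongoClient  # lazy import: only needed when called

    client = MongoClient(uri)
    return client, client[db_name]["accidents"]


if __name__ == "__main__":
    client, accidents = open_accidents_collection()
    try:
        print(first_record(accidents))
    finally:
        client.close()
```

Keeping first_record separate from the connection code means it works with any object that offers a find_one method, which makes it easy to try out without a database.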
Robomongo can display query results in three modes: tree mode, table mode, and text mode.
By default, Robomongo will show you the results in tree mode as shown in the following
screenshot:
How to do it
1. Visit the following URL: http://data.gov.uk/dataset/road-accidents-safety-data/resource/80b76aec-a0a1-4e14-8235-09cc6b92574a.
2. Click on the red Download button on the right side of the page. I suggest creating a
data directory to hold the data files.
3. Unpack the provided zip files in the directory you created.
4. You should see the following four files included in the expanded directory:
Accidents7904.csv
Casualty7904.csv
Road-Accident-Safety-Data-Guide-1979-2004.xls
Vehicles7904.csv
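A short script can confirm that all four files were unpacked. This is a minimal sketch: the function name missing_files is hypothetical, and the directory name data in the usage lines is only an assumption based on the suggestion to create a data directory.

```python
from pathlib import Path

# The four files expected after unpacking the download.
EXPECTED_FILES = [
    "Accidents7904.csv",
    "Casualty7904.csv",
    "Road-Accident-Safety-Data-Guide-1979-2004.xls",
    "Vehicles7904.csv",
]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    folder = Path(data_dir)
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("data")  # "data" is an assumed directory name
    print("Missing:", missing if missing else "none")
```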
How it works
The CSV files contain the data that we are going to use in the recipes throughout this book.
The Excel file is pure magic, though. It contains a reference for all the data, including a list of
the fields in each dataset as well as the coding used.
Coding data is a very important preprocessing step. Most analysis tools that you will use
expect to see numbers rather than labels such as city or road type. The reason for this is
that computers don't understand context like we humans do. Is Paris a city or a person? It
depends. Computers can't make that judgment call. To get around this, we assign numbers to
each text value. That's been done with this dataset.
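The idea of coding can be sketched in a few lines of Python. The labels and the resulting numbers below are invented for illustration and are not the codes used in the dataset's guide; the function names build_codebook and encode are hypothetical.

```python
def build_codebook(labels):
    """Map each distinct label to an integer, in order of first appearance."""
    codebook = {}
    for label in labels:
        if label not in codebook:
            codebook[label] = len(codebook) + 1  # codes start at 1
    return codebook


def encode(labels, codebook):
    """Replace every label with its numeric code."""
    return [codebook[label] for label in labels]


if __name__ == "__main__":
    # Illustrative labels only; the real dataset ships already coded.
    road_types = ["Roundabout", "Dual carriageway", "Roundabout"]
    codebook = build_codebook(road_types)
    print(codebook)
    print(encode(road_types, codebook))
```

Keeping the codebook alongside the coded values is what lets you translate numbers back into readable labels when you report results, which is the role the Excel guide plays for this dataset.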