Root A. Python For Data Analytics. A Beginners Guide For Learning 2019
Root A. Python For Data Analytics. A Beginners Guide For Learning 2019
What happened?
How or why did it happen?
What’s happening now?
What is likely to happen next?
It has a direct influence on the decisions a business makes and on its
outcome.
Statements in Python
Every statement is isolated with End of Line (Press Enter in the
script). A statement can be anything from assigning a value to
reading the input or writing the output on the screen.
We need to keep in mind that Python is a case sensitive language.
(e.g. ABC is not similar to abc). Therefore, if we declare a variable
with capital letters, then it will not identify it a later stage in the code
with small case letters.
For example:
x=1
y=2
z=3
Comments in Python
Comments are used to write the meaning or purpose of the statement.
Comments can be ignored by the python interpreter during the
runtime. They are very handy to give details about the logic
performed in your code for other readers. It ensures that other
programmers that are reading your code will understand it easily.
# is used to comment a single statement
""" """ is used to comment multiple statements
For example:
count = 0 # initializing count with zero
"""
i=1
j=2
k=3
"""
Command Line Arguments
The script name and additional arguments are passed to the script in
the variable “sys.argv” (sys is a module and argv is a list in python).
They are very important to perform the system configuration tasks.
Single space ( ) isolates the arguments in the command line.
"print" is used to print statements in the console.
For example:
import sys
print "Argument number zero: ", sys.argv[ 0 ]
print "Argument number One: ", sys.argv[ 1 ]
print "Argument number Two: ", sys.argv[ 2 ]
print "Argument number Three: ", sys.argv[ 3 ]
Variables
Variables are used to store values inside them. We can directly
define any variable in python and there is no declaration required for
variables. In other words, python does not allow declaring variables
with data types. There are many types of variables, like Boolean,
Character, Integer, Float, String, etc.
For example:
b = False # Boolean type variable
c = ‘d' # character type variable
i = 20 # integer type variable
s = “Hi" # string type variable
f = 9.99992 # float type
variable
int j = 10 # Syntax
Error and not allowed in python
Mathematical Functions
We can perform all mathematical functions in Python easily. Let us
see the different type of mathematical functions available in Python.
Numbers:
The operators +, -, * and / work just like in most other languages.
>>> 3 + 6
9
Here the sum of 3 and 9 is asked and the result 9 is returned. It is just
like using the Python interpreter as a calculator.
Just like the addition, as shown above, the other operations
subtraction, multiplication and division can be performed. The
examples are shown below.
>>> 9 - 2
7
>>> 4 * 4
16
>>> 4 / 2
2
>>> 3 % 2
1
>>> c = 10.5
>>> d = 4
>>> c + d
14.5
>>> c - d
6.5
>>> c * d
42.0
Data Structures
Sequence
Sequence is a very basic term in python that is used to denote the
ordered set of values. There are many sequence data types in python:
str, unicode, list, tuple, buffer and xrange.
Tuples
A tuple consists of a number of values separated by commas. Tuples
are also a sequence data type in Python, like strings and lists. We
need to keep in mind that tuples are immutable. It means that they
can’t be changed.
The tuples consist of the number of values separated by a comma.
The tuples are enclosed in parentheses, while the lists are enclosed in
brackets.
Now let us see an example:
>>> m = (14, 34, 56)
>>> m
(14, 34, 56)
>>> m[0]
14
>>> m[ 0:2 ]
(14, 34)
Tuples also have the properties like indexing and slicing. Tuples can
be nested. Elements in a tuple can be grouped with ()
Now let us see an example:
i=1
j=2
t1 = i, j # is a tuple consists to elements i and j
t2 = (3, 4, 5) # is a tuple consists to elements 3,4 and 5
t3 = 0, t1, t2 # is a tuple consists to elements 0, t1 and t2
print t3 # result is (0, (1, 2), (3, 4, 5))
Lists
A list consists of a number of heterogeneous values separated by
commas enclosed by [and] and started from index 0. Lists can be
used to group together other values. Unlike Tuples, Lists are mutable
in nature. In other words, they can be changed by removing or
reassigning existing values. Also, new elements can be inserted to
the existing ones.
Now let us see an example:
>>> a = [1, 2, 3, 4, 5]
>>> a
[1, 2, 3, 4, 5]
As strings, lists can also be indexed and sliced.
>>> a = [1, 2, 3, 4, 5]
>>> a
[1, 2, 3, 4, 5]
>>> a[0]
1
>>> a[4]
5
>>> a[ 0:2 ]
[1, 2]
>>> a[ 3:5 ]
[4, 5]
Unlike strings, lists are mutable (i.e. the values can be changed)
>>> b = [1, 2, 4, 7, 9]
>>> b
[1, 2, 4, 7, 9]
>>> b[2] = 6
>>> b
[1, 2, 6, 7, 9] # Here the index [2] is changed to 6 (the initial value
is 4)
>>> b[0] = 9
>>> b
[9, 2, 6, 7, 9] # Here the index [0] is changed to 9 (the initial value is
1)
The values in the list can be separated by using comma (,) between
the square bracket. Lists can be nested. List can be used as a Stack or
a Queue.
For example:
list1 = [ 1, 2, 3, 4]
print len (list1) # returns 4 - which is the length
of the list
list1[2] # returns 3 - which is third
element in the list Starts
list1[-1] # returns 4 - which is extreme last
element in the list
list1[-2] # returns 3 - which is extreme
last but one element
list1[ 0:2 ] = [ 11, 22] # replacing first two
elements 1 and 2 with 11 and 22
stackList = [ 1, 2, 3, 4]
stackList.append(5) # inserting 5 from the last in
the stack
print stackList # result is: [1, 2, 3, 4, 5]
stackList.pop() # removing 5 from the stack Last In
First Out
print stackList # result is: [1, 2, 3, 4]
queueList = [ 1, 2, 3, 4]
queueList.append(5) # inserting 5 from the last in the queue
print queueList # result is: [1, 2, 3, 4, 5]
del(queueList[0] ) # removing 1 from the queue First In
First Out
print queueList # result is: [2, 3, 4, 5]
Sets
A set doesn’t have any duplicate elements present in it and it is an
unordered collection type. It means it will have all distinct elements
in it with no repetition.
Now let us seen an example:
fruits = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
basket = set (fruits) # removed the duplicate
element apple
print 'orange' in basket # checking orange in basket, result is
True
print 'pine apple' in basket # checking pine apple in basket,
result is False
Dictionaries
Dictionaries are the data structures in Python that are indexed by
keys.
Key and values separated by : and pairs of keys separated by a
comma and enclosed by { and }
Lists cannot be used as keys.
Now let us see an example:
capitals = { 'AP' : 'Hyderabad', 'MH' : 'Mumbai' }
capitals[ 'TN' ] = 'Chennai'
print capitals[ 'AP' ] # returns value of AP in the
dictionary
del capitals[ 'TN' ] # deletes TN from the dictionary
capitals[ 'UP' ] = 'Luck now' # adding UP to the dictionary
print 'AP' in capitals # checks where AP key exist in dictionary
print 'TN' in capitals
Numbers = {'1': 'One', '2': 'Two'}
for key, value in Numbers.iteritems() :
print key, value
Strings
In Python, a string is identified by the characters in quotes, such as
single (‘’) and double (””). They can only store character values and
are a primitive datatype. Please note that strings are altogether
different from integers or numbers. Therefore, if you declare a string
“111”, then it has no relation with the number 111.
>>> print "hello"
hello
>>> print 'good'
good
Please note that the starting position is always included and the
ending position is always excluded.
Develop
0 1 2 3 4 5 6 ---- Index value
In the above example, the word is assigned a value develop.
Considering the first statement word [0:2], the output is ‘de’. Here
the starting position ‘d’ (0th index) is included and the ending
position ‘v’ (2nd index) is excluded. Similarly, in the second
statement word [2:4], the starting position ‘v’ (2nd index) is included
and the ending position ‘l’ (4th index) is excluded.
The important point to be noted in strings that is Python strings are
immutable (i.e. Strings cannot be changed).
There are many in-built functions available with a String. They are
used for various purposes. Let’s see some of the basic ones that are
most commonly used.
While Loop
The while loop will execute until the expression is true and it stops
once it is false.
The syntax of while loop is:
While expression:
statement
For example:
>>> a = 1
>>> while(a < 10 ):
... print "The number is:" , a
... a = a + 1
...
The number is: 1
The number is: 2
The number is: 3
The number is: 4
The number is: 5
The number is: 6
The number is: 7
The number is: 8
The number is: 9
The number is: 10
Control Statements
Break
The break statement breaks out of the smallest enclosing for or while
loop.
Now let us see an example:
def primeNumberValidation ( input ) :
for x in range( 2, input ) :
if input % x == 0:
print input, 'is not a prime number and equals', x,
'*', input/x
break
else:
print input, 'is a prime number'
primeNumberValidation( 3 )
primeNumberValidation( 14 )
Continue
The continue statement continues with the next iteration of the loop.
Now let us see an example:
def evenNumbers( start, end ) :
print "\n\nEven numbers in between ", start , " and ",
end
for n in range( start + 1, end ) :
if n % 2 != 0:
continue
print n
evenNumbers( 1, 11 ) # result is 14 is 2 4 6 8 10
evenNumbers( 10, 30 ) # result is 12 14 16 18 20 22 24 26 28
Pass
The pass is a valid statement and can be used when there is a
statement required syntactically, but the program requires no action.
Now let us see an example:
while True :
pass # In condition loop press (Ctrl + c) for the
keyboard interrupt
In this example, while followed by pass does not execute any
statement.
There is a necessity to include at least one statement in a block (e.g.
function, while, for loop etc.) in these cases, use pass as one
statement, which does nothing but includes one statement under ‘:’
Now let us see an example:
def x() :
pass # one valid statement that does not do any
action
Here pass is considered a statement for the declaration of
function x.
String Manipulation
We can use built in functions to manipulate strings in python. The
package “string” provides more functions on strings.
For example:
print name = "ABCD XYZ xyz"
print len(name) # It will return the length
of the string name
print list(name) # It will return the list of
characters in name print
name.startswith( 'A' ) # It will return True if
name starts with A else returns False
print name.endswith( 'Z' ) # It will return True if name ends
with Z else returns False
print name.index( 'CD' ) # It will return the index of CD in
name
print 'C'.isalpha( ) # It will return True if C is alpha
or returns False
print '1'.isdigit( ) # It will return True if 1 is digit or
returns False
print name.lower( ) # It will return a string with
lowercase characters in name
print name.upper( ) # It will return a string with
uppercase characters in name
Exception Handling
Exceptions are the errors detected during execution and these are not
unconditionally fatal.
Exception blocks will be enclosed with try and except statements.
try :
<statements>
except <exception type > :
<statements>
IPython Basics
To launch IPython, just type in the ipython command on your
command line:
$ ipython
You can also run random Python expressions and statements by
simply typing them in, as you normally would in a text editor, and
then pressing Enter. Note than when you type in a variable in
IPython, it will return your object in a string format.
Now let us see an example:
In [12]: data = { j : randn() for j in range(9) }
In [13]: data
Out[13]:
{0: 1.3400015224687594,
1: 0.36578355627737346,
2: -1.8315467916607481,
3: 0.24569328634683402,
4: -1.4894682507426382,
5: -1.7920860835730431,
6: 0.5570148722483058,
7: 1.2914683929487693,
8: -0.287602058693052}
Also note that some objects in Python come with an easy reading
format at the printing moment that differs from the normal print
command. IPython formats (or pretty-prints) Python objects to be
more readable. But it could be the case where the standard Python
interpreter seems much less readable. You can observe this
difference by printing the same dictionary as above in a traditional
Python shell.
Now let us see an example:
>>> from numpy.random import randn
>>> values = { j : randn() for j in range(9) }
>>> print values
{0: 1.3400015224687594, 1: 0.36578355627737346, 2:
-1.8315467916607481,
3: 0.24569328634683402, 4: -1.4894682507426382, 5:
-1.7920860835730431,
6: 0.5570148722483058, 7: 1.2914683929487693, 8:
-0.287602058693052}
Tab Completion
A significant improvement IPython has over the traditional Python
shell is the tab completion feature - a handy feature found in many
programming environments for data analysis. While typing
expressions in the console, pressing tab after a typed prefix will
search the respective namespace for any variables (like functions,
objects, etc.) matching your prefix.
Now let us see an example:
In [ 1 ]: inter_competitions = 15
In [ 2 ]: intra_competitions = 23
In [ 3 ]: total_competitions = inter_competitions +
intra_competitions
In [ 4 ]: in <tab>
in input int inter_competitions
intra_competitions
If you are a new IPython user, you will find some methods and
attributes hidden, such as underscores so, your display does not get
disarranged. In addition, there are other magic and internal “private”
methods and attributes that will also be hidden, but by typing an
underscore you will be able to unfold them and use the tab-
completion tool. The IPython configuration enables this setting so
you can see such methods in tab completion and modify the setting
any time you want.
Tab completion also facilitates completing a function’s argument list.
Explore this yourself to find out how!
Introspection
Introspection refers to the general information about an object, such
as its type, whether it is a function or instance method - if it is one,
string form, length and many other relevant details. As such,
preceding or following an object with a question mark will enable
you to introspect it.
Now let us see an example:
In [ 12 ]: fruits = [‘apple’, ‘orange’, ‘papaya’, ‘tangerine’]
In [ 13 ]: fruits?
Type: list
String Form: [‘apple’, ‘orange’, ‘papaya’, ‘tangerine’]
Length: 4
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
In [ 14 ]: my_multiplication?
Type: function
String Form:<function my_multiplication at 0x5fad359>
File: python_myscripts/<ipython-input-16-3474012eca43>
Definition: my_multiplication(a, b)
Docstring:
Multiply two numbers.
-- Returns --
the_product : type of arguments
Lastly, the ? can also search the entire IPython namespace. If you
enclose some characters in a wildcard (*), followed by a ?, it will
yield all results matching your expression:
In [ 16 ]: np.*nanm*?
np.nanmax
np.nanmean
np.nanmedian
np.nanmin
Now you can easily run this function by passing the script’s name in
%run command:
In [ 17 ]: %run ipython_test.py
The script will be an empty namespace run, which means no other
imports or variables are defined. As a result of this, the < python
script.py > command line and the running of the program will be
identical. Therefore, all the variables will be set up in the file
(globals, imports and functions) and also available in IPython shell
as well.
Now let us see an example:
In [ 18 ]: z
Out [ 18 ]: 20
In [ 19 ]: sum
Out [ 19 ]: 45
In some cases, the script running in Python demands certain
command line arguments. They are located in sys.argv and can be
executed on the command line by following the same procedure after
the file path.
The usual %run tool will give you access to different variables in the
interactive Python namespace regarding to script access, but if you
want to define these variables beforehand, use %run -1.
By using %cpaste block, you will be able to paste all the code you
wish before running it. This function also gives you the chance to
view all the pasted code before the run instance. In case you have
accidentally pasted any incorrect code, by pressing < Ctrl + C > you
can leave out from the %cpaste prompt.
In next pages, we’ll talk about IPython HTML Notebook and its new
sharp level to develop block by block analyses, which comes in a
browser based format along with executable code cells.
How IPython interacts with various IDEs and
editors
Text editors often have third-party extensions that send blocks of
code from the editor to a running IPython shell. This is a direct and
very common function that the editors like Emacs and vim do. You
can visit the IPython website or search in the Internet for more
information.
IDEs have integration with the IPython terminal application. Some of
them are the PyDev plugin for Eclipse and Python Tools for Visual
Studio from Microsoft, among others. This merging process gives
you the chance to work both with the IPython console features and
the IDE itself.
Keyboard Shortcuts
IPython keyboard shortcuts will be familiar for existing users using
the Emacs text editor as well as UNIX bash shell. The shortcuts can
be used for navigating the prompt and for interacting with the shell’s
command history (described in later sections). In the below table,
you will find some of the most regular shortcuts used frequently be
the programmers.
Shortcut Command
Description
Ctrl + P or Upward Arrow
To search in backward direction in the command history beginning
with current position of text
Ctrl + N or Downward Arrow
To search in forward direction in the command history beginning
with current position of text
Ctrl + Shift + V
To paste existing text from the clipboard
Ctrl + C
To interrupt the code that is currently running
Ctrl + R
To read the back search history with partial matching
Ctrl + E
Shift the cursor towards end of line
Ctrl + A
Shift the cursor towards start of line
Ctrl + K
Delete text till end of line
Ctrl + F
Shift the cursor forward by 1 character
Ctrl + B
Shift the cursor backward by 1 character
Ctrl + L
To clear the screen
In [ 16 ]: %run Chapter2/ipython_error.py
AssertionError Traceback (most recent call last)
/home/project/code/ipython/utils/ch2.pyc in execfile(fname, *where)
251 else:
252 filename = fname
-- > 253 __builtin__.execfile(filename, *where)
home/Chapter2/ipython_error.py in <module>()
25 throws_an_exception()
26
-- > 27 method_called()
AssertionError:
In [ 7 ]: 'x' in _ip.user_ns
Out [ 7 ]: True
In [ 8 ]: %reset –f
In [ 9 ]: 'x' in _ip.user_ns
Out [ 9 ]: False
Then by pressing < Ctrl + R > you can jump to every line that
matches with the exact characters typed by you.
You can find the stored input variables as _iX, meaning X as the
input line number. There is an output variable _X corresponding to
each input variable. So let’s say that after input line 15, you’ll see
that there are two new variables; for the output named _15 and _i15
for the input.
In [ 14 ]: fun = 'car'
In [ 15 ]: fun
Out [ 15 ]: 'car'
In [ 16 ]: _i15
Out [ 16 ]: u'fun'
In [ 17 ]: _15
Out [ 17 ]: 'car'
You can execute the input variables again (as they are strings) with
the Python exec keyword:
In [ 18 ]: exec _i15
There are a lot of magic functions that help you in using the input, as
well as output, history. A function like %reset will help to delete all
the variable names present in the namespace and is optional to
remove stored input/output cache. You can also print the command
history using %hist. And you can remove all references to a
particular object from IPython using the %xdel function.
Interacting with OS
IPython also provides a tight integration with the OS shell. Thanks to
this useful feature, you are allowed to perform basic command line
actions just as you would in other shells, such as the ones from
Windows or UNIX without any need to exit from IPython. This
means you can do things as change directories, execute shell
commands and store the command's results in an IPython object such
as a string or a list.
In below table you can find more information on magic functions.
Command
Description
%bookmark
To take advantage of bookmarking system of IPython’s directory
!cmd
T execute cmd in shell
%pwd
To show the current system directory
%popd
To change directory to the start of the stack
%cd directory
To change the current working directory in system
Output = ! cmd args
To run cmd and store output of stdout
%dhist
To look up the history of directories
Once this is done, you can use the %cd magic to use the bookmarks
that you have already defined.
In [ 2 ]: cd data
(bookmark:data) -> /project//GoogleMailbox/
/project//GoogleMailbox/
Interactive Debugger
The enhanced IPython’s debugger provides more advantages to pdb
with syntax highlighter, automatic tab completion, and line context
in exception tracebacks. Code debugging comes in handy after an
error has appeared in the program. Right after an exception, you can
use the %debug command to call the debugger directly and open the
full stack frame from the position where the exception was found:
In [ 3 ]: run chapter02/ipython_error.py
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
/home/project/chapter02/ipython_error.py in <module>()
17 throws_an_exception()
18
---> 19 method_called()
/home/project/chapter02/ipython_error.py in method_called()
15 def method_called():
16 correct_working()
---> 17 throws_an_exception()
18
19 method_called()
/home/project/chapter02/ipython_error.py in throws_an_exception()
11 x = 2
12 y = 4
----> 13 assert(x + y = = 8)
14
15 def method_called():
AssertionError:
In [ 4 ]: %debug
> /home/project/chapter02/ipython_error.py(13)
throws_an_exception()
12 y = 4
----> 13 assert(x + y = = 8)
14
ipdb>
In order to look all the objects and data that are currently active in
the interpreter inside the stack frame, you can arbitrarily execute
IPython code from the debugger. It starts you in the lowest level by
default right where the error occurred. In order to switch among
various levels of the stack trace, you can do it by either pressing the
u key(upwards) or the d key(downwards):
ipdb> u
> /home/project/chapter02/ipython_error.py(13) method_called()
16 correct_working()
---> 17 throws_an_exception()
18
Many users find this really useful because of the fact that you can
automatically make the debugger window start by using the %pdb
command after any exception has occurred in the program.
When you wish to set the breakpoints in your code to check the
execution of any particular function so that you can examine the state
of every variable at any stage, the debugger tool will help you a lot.
You can do it in many ways. One way is to use the %run command
suffixed with the -d character. This will invoke the debugger at the
initial stage itself before running the code of the script. Once it is
done, you need to use the ( s ) key to start the execution of your
script:
In [ 6 ]: run -d chapter02/ipython_error.py
Breakpoint 1 at /home/project/chapter02/ipython_error.py:1
Note: Press 'c' at the ipdb> prompt to run the script.
> <string>(1)<module>()
ipdb> s
-- Call --
> /home/project/chapter02/ipython_error.py(1)<module>()
1---> 1 def correct_working():
2x=2
3y=4
From this point, you decide the way you want to work over the file.
For instance, in the above mentioned example, you can set a
breakpoint before calling the correct_working method. Then
afterwards you can run the complete script by using the ( c ) key
until it reaches the set breakpoint.
Now let us see an example:
ipdb> y 12
ipdb> c
> /home/project/chapter02/ipython_error.py(15) method_called()
14 def method_called():
2 --> 15 correct_working()
16 throws_an_exception()
You can advance to the line containing the error and check out the
variables that are in the internal scope of the program by using
throws_an_exception. Remember to use the prefix ! before the
variables in order to look out their content. This prefix is required
because the debugger command can automatically take precedence
over those variable names.
Now let us see an example:
ipdb> s
-- Call --
> /home/project/chapter02/ipython_error.py(5)
throws_an_exception()
4
---- > 5 def throws_an_exception():
6x=2
ipdb> n
> //home/project/chapter02/ipython_error.py(6)
throws_an_exception()
5 def throws_an_exception():
----> 6 x = 2
7y=4
ipdb> n
> //home/project/chapter02/ipython_error.py(7)
throws_an_exception()
6x=2
---- > 7 y = 4
8 assert( x + y = = 8)
ipdb> n
> /home/project/chapter02/ipython_error.py(8)
throws_an_exception()
7y=4
---- > 8 assert( x + y = = 8)
9
ipdb> !x
2
ipdb> !y
4
Practice and experience is the most reliable way to become efficient
when using the interactive debugger. You can also find a full list of
the debugger commands in the below table. To anyone who is using
the IDE for the first time, this debugger will look like a little bit
difficult to start off with but along with time it will become easier.
Command
Description
H
It is used to show the command list
Help
It is used to provide the documentation for commands
C
It is used to resume the current program execution
B
It is used to set a particular breakpoint in the current program file
Q
It is used to exit from the debugger without further executing the
program code
S
It is used to step inside a function call
N
It is used to run the current line and then move towards the next
line in the program level
u or d
It is used to move upwards or downwards when inside a function
call stack
A
It is used to display the arguments present inside the currently
running function
debug
It is used to invoke the statement in an all new debugger
W
It is used to show the complete stack trace along with the context
present at the current position
L
It is used to display the current position and the context that is at
the current stack level
It is very simple to use the set_trace() function. You can put the
function anywhere in your program’s code where you want it to get
stopped. Then you can easily look around it; like for example,
putting it before an exception has occured:
In [ 8 ]: run Chapter02/ipython_error.py
> /home/project/chapter02/ipython_error.py(15)method_called()
14 set_trace()
--- > 15 throws_an_exception()
16
Now you can easily use the ( c ) key to resume the code from where
it stopped without any issues. You can also use the debug function
defined above to invoke the debugger on any random function call.
def val(a, b, c = 1 ):
tempVal = a + b
return tempVal / c
Now you want to step into it to check the logic you have used in the
code. If you use the function val() ordinarily it will look like val(1, 2,
c=3). Therefore, if you wish to instead step into val directly, you
need to pass val as the first argument to the debug function. After
that pass the positional and the keyword arguments.
Now let us see an example:
In [ 9 ]: debug(val, 1, 2, c = 3)
> <ipython-input>(2)val()
2 def val(a, b, c):
---- > 3 tempVal = a + b
4 return tempVal / c
ipdb>
You can save a lot of time by using both the above code snippets on
a regular basis.
You can also use the command %run in concurrence with the
debugger. To do so, you have to execute the script suffixed by –d.
That way you will get into the debugger console directly, set any
breakpoints wherever you want in the code and run the script.
In [ 2 ]: %run -d chapter02/ipython_error.py
Breakpoint 1 at /home/project/chapter02/ipython_error.py:1
Note: Enter 'c' at the ipdb > prompt to run your script.
> <string>(1)<module>()
ipdb>
You will think that both will have the same execution time. We can
check it using the %time command:
In [ 243 ]: %time firstMethod = [ i for i in strings if
i.startswith('book') ]
CPU times: user 0.15 s, sys: 0.00 s, total: 0.15 s
Wall time: 0.15 s
In [ 244 ]: %time secondMethod = [ i for i in strings if i[:4] == 'book'
]
CPU times: user 0.07 s, sys: 0.00 s, total: 0.07 s
Wall time: 0.07 s
Profiling Code
Profiling code is also related to the code’s execution timing except
for the fact that it determines where the time is spent. The main
profiling tool used in Python is called cProfile. Please note that it is
not related to IPython. cProfile helps to execute blocks of code or the
complete program and keeps track of time spent on every function
present in the program.
You can use cProfile on command line by running the complete
program and then getting the combined time per function as output.
You can run the below script via cProfile in command line:
python -m cProfile own_example.py
Once you run that, you will see the execution time for all functions
sorted by their names in ascending order. It can be pretty difficult to
check where the most time has been actually spent. To overcome
this, you can use sort order by using –s suffux after cProfile
command.
For example:
$ python -m cProfile -s cumulative own_example.py
One with the most time: 9.12572475
21376 function calls (17534 primitive calls) in 1.32 seconds
One thing to note here is that if a function calls another function, its
timing will not stop there. It will calculate the total time required by
the function until it exits from it. It will record the start time and the
end time to give you the results.
cProfile not only provides the above command line function, but it
can also be used to profile any code blocks without the need to start a
new process. IPython provides a suitable interface with %prun
command.
In [ 251 ]: %prun -l 8 -s cumulative own_example()
6932 function calls in 0.738 seconds
Introduction to Pandas
Pandas is a package used in python for analyzing data. Pandas is one
of the most widely used tool in data munging/wrangling (process of
transforming and mapping data from raw data form into another
more appropriate and valuable form, which can be used for
downstream purpose such as analytics). Pandas is open source, free
to use (under a BSD license) and was originally written by Wes
McKinney. Pandas takes data (from CSV file or SQL database) and
convert it into python objects in the form of rows and columns (here
you can imagine it as a table).
In Python, Pandas is an open source library that is responsible for
delivering high performance as well as usable data structures. It is a
BSD licensed library that provides the required data analysis tools
within the platform of Python. Primarily, the Pandas library enables
faster data analysis in Python. Pandas library is appropriate to be
used in the applications that are built on NumPy because Pandas
itself is built on the NumPy library.
Pandas is a package used in Python to make analysis of relational
database (such as SQL) and labeled data (excel, csv) easy, fast and
intuitive. Examples where Pandas can be Used for data analytics are:
Generally, you would add “as pd” (or “as np”) alias, so you can
access different commands as “pd.command” instead of
“pandas.command” saving you from typing more characters.
From the above code, we can see that Pandas library is being
imported as “pd”. So, whenever, the word “pd” is seen within a code,
one needs to understand that it is a direct reference to the Pandas
library. Since we use DataFrame and Series quite often, it will be a
good idea to import them to the local namespace. The above code
does exactly the same.
Let’s Start Working!
We can code our program in different IDEs like Jupyter, Spyder,
which come along with the Anaconda package.
We can use pandas with other libraries like NumPy, MatPlotLib,
SKlearn for data analysis and visualization purpose. Pandas is a
powerful and simplicity data analysis framework for Python. It’s
tightly incorporated with NumPy and matplotlib packages.
The Pandas library has evolved incrementally in last four years and
has developed itself into a much broader and a larger library. The
well-developed Pandas library can now handle complex problems
related to the data analysis, with the simplicity it is known for. That
has been the beauty of Pandas; it still exists with the ease-of-use
even though as a library it has grown over these years. For a
programmer, it means a lot since he desires for a platform that is
simple as well as easy to use, with lots of utilities. At the end of this
book, one can easily understand the reason why Pandas is considered
as one of the important components of Python. In the rest of this
book, we will use the common conventions for import in Pandas.
The import conventions are mentioned below.
Note: All the coding is performed in Jupyter Notebook using
python 3 kernel.
Explanation:
import is required since pandas is a package.
Import pandas as p is an alias.
df(dataframes) is a data structure of pandas used to store the data of
the file.
Read_csv(path+filename): to read csv file (other methods:
read_excel, read_sql, etc.)
Df1 is used to display all the values stored in dataframe
Other Related Commands:
df.head()-To view top 5 values of a file
df.tail()- To view bottom 5 values of file
Explanation:
QI is a variable name where we are storing the value
df[‘QID’]-> ‘QID’ is name of column header
Case 1:
If you want to retrieve two or more columns and save it in
dataframe structure for further use you can add the following:
import pandas as p
df1 =
p.read_csv('C:\\Users\\akanksha.amarendra\\Desktop\\new\\test.csv',
index_col=False)
df1 = df1[['QID','Category']]
df1
Case 2:
In case you want conditions on rows and well as columns:
a) all rows and selected columns
import pandas as p
df1 =
p.read_csv(''D:\\Python\\marysoloman\\new\\test.csv',index_col
= False)
df1 = df1.loc[:,['QID','Category']]
df1
Explanation:
-Qi is a variable
-df1.[‘QID’].values[0]-> used to retrieve data from QID column at
0th index
-“index_col = False” -> in path is to remove the serial number
column that would have otherwise been displayed with the “Qi”
Explanation:
The value can be stored in text file using the highlighted part of
code.
Explanation:
Name in for loop will be the QID received from the file
df.groupby(‘column_name’) used to group the data by QID.
group.to_csv(path) used to save file at particular path with name in
csv format.
name.astype(str) is used because value of name i.e. QID is Integer.
Python Pandas DataFrame
It is basically a 2-D data structure having columns of different types.
It acts as like a Microsoft Excel spreadsheet as well as a SQL
database table. It has the capability to accept various types of data
input like the below ones:
Structured ndarray
DataFrame
Dict of 1D ndarrays, dicts, Series, or lists
A Series
2-D numpy.ndarray
Addition:
We can add specific columns to DataFrame, as shown below.
Example - Addition:
import pandas as pd
df = pd.DataFrame( { 'A' : [21, 22, 23, 42],
'B' : [24, 23, 22, 13],
'C' : [67, 14, 12, 18],
'D' : [24, 53, 52, 41] } )
df[ 'E' ] = df[ 'A' ] * df[ 'B' ]
print df
Output:
A B C D E
0 21 24 67 24 504
1 22 23 14 53 506
2 23 22 12 52 506
3 42 13 18 41 546
Deletion
Columns can be deleted or popped from DataFrame.
Example:
import pandas as pd
df = pd.DataFrame( { 'A' : [21, 22, 23, 42],
'B' : [24, 23, 22, 13],
'C' : [67, 14, 12, 18],
'D' : [24, 53, 52, 41] } )
# Delete the column
del df[ 'A' ]
# Pop the column
df.pop( 'B' )
print df
Output:
C D
0 67 24
1 14 53
2 12 52
3 18 41
DataFrame - Transpose
We use T attribute when we need to transform data from row-level
to columnar data.
Now let us see an example:
import pandas as pd
df = pd.DataFrame( { 'A' : [21, 22, 23, 42],
'B' : [24, 23, 22, 13],
'C' : [67, 14, 12, 18],
'D' : [24, 53, 52, 41] } )
print "Actual DataFrame : \n", df
print "\nAfter Transpose : \n", df.T
Output:
Actual DataFrame:
A B C D
0 21 24 67 24
1 22 23 14 53
2 23 22 12 52
3 42 13 18 41
After Transpose:
0 1 2 3
A 21 22 23 42
B 24 23 22 13
C 67 14 12 18
D 24 53 52 41
DataFrame - Groupby
It is used to arrange identical data into groups. We could naturally
group by either single column or list of columns.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : ['fo', 'ba', 'fo', 'ba','fo', 'ba', 'fo', 'fo'],
'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
grouped = df.groupby('A')
print grouped.groups
Output:
{'fo': [0L, 2L, 4L, 6L, 7L], 'ba': [1L, 3L, 5L]}
Output:
ba A B C D
1 ba one 0.286725 0.735988
3 ba three 0.031028 1.368286
5 ba two -0.095358 -0.466581
fo A B C D
0 fo one -0.168500 0.155344
2 fo two 0.234364 1.336054
4 fo two -0.397370 0.722332
6 fo one 0.714563 -1.437803
7 fo three -1.307215 -0.118485
DataFrame – Aggregation
Once the Groupby is complete, various methods are available to do a
computation on the grouped data. The method name as below.
aggregate or agg
Now let us see an example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : ['fo', 'ba', 'fo', 'ba','fo', 'ba', 'fo', 'fo'],
'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
grouped = df.groupby('A')
print grouped.aggregate(np.sum)
Output:
A C D
ba 0.282556 -0.368244
fo -1.434853 -4.828347
A C D
ba 0.282556 -0.368244
fo -1.434853 -4.828347
Important Functionalities
We will see the basic and important mechanisms of data interaction
in DataFrame and series in this section. We will not focus on all the
documentation present in the pandas library; but instead we will see
the most important and frequently used functionalities.
Re-indexing
An important function to perform on pandas objects is to re-index. It
is a process in which we create a new object and the data follows to
the new index. Let us see an example:
In [ 1 ]: first_obj = Series( [2.4, 5.1, 3.7, 7.6], index = ['v', 'u', 'w', 'x']
)
In [ 2 ]: first_obj
Out [ 2 ]:
v 2.4
u 5.1
w 3.7
x 7.6
Now we will call the reindex function on the first object. It will
rearrange all the data as per the new index.
In [ 3 ]: new_obj = obj.reindex( ['u', 'v', 'w', 'x', 'y'] )
In [ 4 ]: new_obj
Out [ 4 ]:
u 5.1
v 2.4
w 3.7
x 7.6
y NaN
We can also use the arguments ffill and bfill to carry values forward
or carry values backwards, respectively.
Here are some of the arguments that can be used with the reindex
function.
Argument name
Description of the argument
Index
It is used to provide a new index sequence to the object. It will be
used exactly the same in the new object.
Fill_value
It is used to replace the missing values in the newly re-indexed
object with your own defined values. For example, you can replace
it with 0.
Limit
It is used to fill the maximum size gap while filling it forward or
backward.
Copy
It is used to command not to copy the underlying data in case both
the new index and the old index are identical. Its default value is
True, which means it will copy the data at all times.
Level
It is used to match the simple index if there are multiple index. If
not, then it will select its subset.
Function Mapping
All the ufuncs in NumPy (array methods) can easily work with
objects created in pandas.
In [ 12 ]: obj_frame = DataFrame(np.random.randn(4, 3), columns =
list( 'xyz' ),
.....: index = [ 'USA', 'UK', 'Tokyo',
'Canada' ] )
In [ 13 ]: obj_frame
Out [ 13 ]:
x y z
USA 0.1235 0.5248
0.5823
UK 1.7383 0.4629
0.2346
Tokyo 0.5683 0.3146
1.9424
Canada 1.626 -0.4245
0.9124
In [ 14 ]: np.abs(obj_frame)
Out [ 14 ]:
x y z
USA 0.1235 0.5248
0.5823
UK 1.7383 0.4629
0.2346
Tokyo 0.5683 0.3146
1.9424
Canada 1.626 -0.4245
0.9124
Out[ 16 ]:
x y z
minimum -0.1395 0.1396
-0.3561
maximum 1.1435 0.9587
1.9465
Element wise Python functions can also be used here. We can use the
applymap method. It has been named so because a map function is
already there in series, which performs the same task.
Sorting in Pandas
Sorting is a process of arranging the row or column values
systematically using some pre0defined criteria. It can be done on the
row or the column index with the help of sort_index function. It will
return a new object with the sorted values.
Let us see an example.
In [ 17 ]: first_obj = Series(range(4), index = [ 'x', 'u', 'v', 'w' ] )
In [ 18 ]: first_obj.sort_index()
Out [ 18 ]:
u1
v2
w3
x0
We can sort the DataFrame by using both axis index.
In [ 19 ]: obj_frame = DataFrame(np.arange(8).reshape((2, 4)), index
= [ 'third', 'first' ],
.....: columns = [ 'x', 'u', 'v', 'w' ] )
In [ 20 ]: obj_frame.sort_index()
Out [ 20 ]:
x u v w
first 4 5 6 7
third 0 1 2 3
By default, the data is sorted in ascending value. If we want to sort it
in the descending order, then we need to pass the argument
ascending = False in the sort function.
In order to sort the Series, we can use the order function. If there are
any missing values present in the Series, then by default it will be
sorted at the end.
In [ 21 ]: first_obj = Series( [ 3, 6, -2, 1 ] )
In [ 22 ]: first_obj.order()
Out [ 22 ]:
2 -2
31
03
16
Chapter 4: NumPy Basics
Introduction
The NumPy, which is the short form of Numerical Python, is the
core library for handling scientific computation in python related to
arrays and vectorized computation. In this book, the base of all high-
level tools is this library. In data analysis, data cleaning NumPy is
most widely used library.
NumPy provides these features:
In high level data analysis pandas are widely used, where NumPy is
used only in mathematical computational level. Time series
manipulation and many other features are only included in pandas.
That’s why in basic operations pandas are used.
Short Notes:
In most of the book, numpy is imported like this: import numpy as
np. It can also be imported like, from numpy import *. But the first
one is my recommended convention of importing array.
Short Notes:
With little bit of exceptions, the ndarray, numpy array and array are
similar as objects of numpy array.
Creating ndarrays
To create arrays in numpy we have to use .array functions. It works
like following.
>>> x = np.array( [10 ,20, 30] )
>>> y = np.array([ (10, 20, 30), ( 40, 50, 60) ], dtype = int)
>>> z = np.array( [[( 10, 20, 30), ( 40, 50, 60) ], [ (30, 20, 10), ( 40,
50, 60) ]], dtype = int)
A list of equal-length lists, which is a Nested list, will get converted
to a multidimensional array:
In [ 20 ]: values = [[ 10, 20, 30, 40], [50, 60, 70, 80]]
In [ 21 ]: array1 = np.array(values)
In [ 22 ]: array1
Out [ 22 ]: array ([[ 10, 20, 30, 40] , [50, 60, 70, 80 ]] )
In [ 23 ]: array1.ndim
Out [ 23 ]: 2
In [ 24 ]: array1.shape
Out[ 24 ]: (2, 4)
In [ 25 ]: array1.dtype
Out [ 25 ]: dtype(‘int32’)
There are many functions for creating arrays like ones and zeros and
empty. For example, arrange is an in-built method used in the array
value present in the range function.
In [ 27 ]: np.arange(20)
Out [ 27 ]: array ([ 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40] )
Below table has the list of some common functions used for array
creation. As NumPy’s main focus is on numerical computing, if the
data type is not specified explicitly, it will be found as float in many
cases. They work like below:
>>> np.zeros((2,3)) # It helps in
creating an array of all zeros
>>> np.ones((1,2,3), dtype = np.int16) # It helps in creating an
array of all ones
>>> x = np.arange(5,10,18, 7) # It helps in creating an array
of evenly spaced values (which
is step value)
>>> np.linspace(0,1,8) # It helps in
creating an array of evenly spaced values (which
is number of
samples)
>>> y = np.full( (1, 1), 6) # It helps in creating
a constant array
>>> z = np.eye(3) # It helps in
creating a 3 X 2 identity matrix
>>> np.random.random( (3, 3) ) # It helps in creating an array
with all the random values
>>> np.empty( (2,3) ) # It helps in
creating an empty array
Short Notes: np.empty can give zeros or pre-initialized garbage
value. So, use it with caution.
This is a short list of function that creates array, and the datatypes
will be float64 if not pre-specified.
Function Name
Description of the function
array
It converts the input data like array, tuple, list into a ndarray. It is
done either by specifying a dtype explicitly or by inferring a dtype.
It will also by default copy the input data entered by you.
empty
It will create new arrays without any initial values and only
allocates new memory to the array.
as array
It will convert the input data to a ndarray, but it will disregard it if
the input provided by you is already a ndarray.
arrange
It is an in-built method used in the array value present in the range
function. It will return the ndarray and not the list.
ones
It will create an array comprising of all 1’s. It will take input for
the shape and dtype of the array.
eye
It will produce a N x N matrix, which has all 1’s in the diagonal
and all 0’s in the remaining matrix.
Fancy Indexing
If the indexing is performed by integers array, then it is called fancy
indexing.
In [ 76 ]: x = np.empty( ( 8, 4 ) )
In [ 77 ]: for j in range(8): .....: x[i] = i
In [ 78 ]: x
Out [ 78 ]: array( [ [ 10, 10, 10, 10 ] , [ 20, 20, 20, 20 ] ,
[ 30, 30, 30, 30 ] , [ 40, 40, 40, 40 ] ,
[ 50, 50, 50, 50 ] , [ 60, 60, 60, 60 ] ,
[ 70, 70, 70, 70] , [ 80, 80, 80, 80 ] ] )
If you want to select particular rows or subsets of array you just have
to pass a ndarray of data.
In [ 79 ]: x[ [5, 8, 1, 0 ] ]
Out [ 103 ]:
array( [ [ 5, 5, 5, 5 ] ,
[ 8, 8, 8, 8 ] ,
[ 0, 0, 0, 0 ] ,
[ 1, 1, 1, 1 ] ]
When we want to pass multiple ndarray then it takes a 1 dimensional
array of elements corresponding to each indices:
In [ 80 ]: x = np.arange(25).reshape( (5, 5) )
In [ 81 ]: x
Out [ 81 ]: array( [ [ 0, 1, 2, 3, 4 ] ,
[ 5, 6, 7, 8, 9 ] ,
[ 10, 11, 12, 13, 14 ]
,
[15, 16, 17, 18, 19 ] ,
[20, 21, 22, 23, 24 ] ] )
In [ 82 ]: x[ [ 0, 1, 2, 3, 4 ] , [ 0, 1, 2, 3, 4 ] ]
Out [ 82 ]: array( [ 0, 6, 12, 18, 24 ] )
In the above example, we selected the elements (0,0), (1,1), (2,2),
(3,3), (4,4). Therefore, it showed the value present at those locations
from the array.
Fancy indexing always copy the elements into array.
modf It is used to
return integer and fraction part of the array
elements in a separate array.
isnan It is used to return
the boolean array, which indicates if every
element present in the array is of NaN type or
not. NaN means Not a Number.
Isfinite and isinf It is used to return the
boolean array, which indicates if every element
present in the array is finite or infinite.
Sorting
NumPy can obviously do array sorting. With sort method, you can
sort the elements of the array easily in ascending or descending order
as per your requirement.
In [ 113 ]: x = randn(10)
In [ 114 ]: x
Out [ 114 ]: array( [1.7903, 1.5678, 1.1968, -1.2349, 1.9979,
1.1185, -2.4147, -1.6425, 2.3476, 0.3243])
In [ 115 ]: arr.sort()
The np.sort method returns copy of the input array without changing
the actual one. Therefore, you can sort the array first and then select
the particular rank’s value.
In [ 116 ]: sorted_arr = randn(1000)
In [ 117 ]: sorted_arr.sort()
In [ 118 ]: sorted_arr[ int( 0.02 * len (sorted_arr) ) ]
Out [ 118 ]: -2.6802134371907113
in1d(a, b) It is used to
compute the boolean array, which
indicates if every element of array a is
present in array y or not
setxor1d(a, b) It is used to set the symmetric differences
between the arrays a and b i.e. elements that are
present in only one of the arrays, but not in
both.
File Input and Output with Arrays
NumPy can save and load data in txt or binary format. The np.save
and np.load functions are used to load and save data in disk. The.
npy extension is the file format in which array data are stored.
In [ 123 ]: x = np.arange(10)
In [ 124 ]: np.save(random_array', x)
In [ 125 ]: np.load('random_array.npy')
Out [ 125 ]: array( [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ] )
With np.savez you can save multiple file as zip extension.
In [ 126 ]: np.savez('array_archive.npz', a=x, b=x)
You can also work with txt file and csv file where data are comma
separated. You can read and save the data as you want. Here is a
comma separated file names arr_example:
In [ 127 ]: !cat arr_example.txt
1.246783, 2.346630, 0.060817, 2.336771, 1.395163, -1.133918,
-1.536658, 1.236714
1.756410, 1.764607, -1.585624, 1.737556, -2.584313, 1.794476,
-1.844203, 1.255487
0.572769, 2.730531, 2.532438, 0.836707, -1.477528, 2.425233,
1.742803, 1.593734
In [ 128 ]: arr = np.loadtxt( 'arr_example.txt', delimiter = ',' )
In [ 129 ]: arr
Out[129]: array([ [ 1.246783, 2.346630, 0.060817, 2.336771,
1.395163, -1.133918, -1.536658, 1.236714], [
1.756410, 1.764607, -1.585624, 1.737556, -2.584313, 1.794476,
-1.844203, 1.255487],
[ 0.572769, 2.730531, 2.532438, 0.836707, -1.477528,
2.425233, 1.742803, 1.593734] ] )
The np.savetxt will do the reverse task that writing an array and save
as txt file.
Let us see an overview about how to use the above functions for
converting the text file’s data into a DataFrame. Here are the options
under which the above functions fall:
It’s not necessary that every file will have a header row. Let’s see the
below example:
In [ 158 ]: !cat chapter05/csvExample2.csv
10, 20, 30, 40, this
50, 60, 70, 80, is
90, 100, 110, 120, fun
We have two options to read this. One is to permit the pandas library
to automatically assign the default column names. The other option
is to specify the column names on your own.
In [ 159 ]: pd.read_csv('chapter05/csvExample2.csv', header = None)
Out [ 159 ]:
X.1 X.2 X.3 X.4 X.5
0 10 20 30 40 this
1 50 60 70 80 is
2 90 100 110 120 fun
Argument name
Description of the argument
Path
It is a string value that is used to indicate the location of file, file-
like objects, or the URL.
Delimiter or sep
It is used to split the columns present in each row by either using a
regular expression or a character sequence.
Names
It is used to show the result in form of column names, combined
with the header value as None.
na_values
It is used to provide the sequence of values, which needs to be
replaced by NA
Header
It is used to provide the row number that will be used as column
names. The default value is 0 and is None in case there is no
header row present in the file.
comment
It is used to split the character(s) comments after the end of lines.
date_parser
It is used for parsing dates from a text file.
index_col
It is used to provide the column names or numbers that need to be
used as row index in result.
Nrows
It I used to pass the number of rows, which need to be read from
the starting point of the file.
Verbose
It is used to print different output information, such as total
number of missing values that are present in non-numeric
columns.
Iterator
It is used to return the TextParser object to read the file in steps.
convertors
It is used to perform name mapping on functions.
skip_footer
It is used to provide the number of lines, which needs to be
ignored at the end of the file.
Squeeze
It is used to return a series if the parsed data has only one column
present in it.
chunksize
It is used to iterate and provide size of the file chunks.
encoding
It is used to provide the text encoding for Unicode system.
Dayfirst
It is used to deal with international format when ambiguous dates
are getting parsed. Its default value is false.
We can see that this is a very large file having 50,000 rows in it.
Therefore, we will read the file in small pieces of rows instead of
reading the complete file. It can be done using “nrows”.
In [ 176 ]: pd.read_csv('chapter05/example6.csv', nrows = 10)
Out [ 176 ]:
w x y z word
0 10 20 30 40 A
1 50 60 70 80 B
2 90 100 110 120 C
3 130 140 150 160 D
4 170 180 190 200 E
5 10 20 30 40 F
6 50 60 70 80 G
7 90 100 110 120 H
8 130 140 150 160 I
9 170 180 190 200 J
In order to read the file, we can also use chunksize for number of
rows.
In [ 177 ]: file_chunker = pd.read_csv('chapter05/example6.csv',
chunksize = 500)
In [ 178 ]: file_chunker
Out [ 178 ]: <pandas.io.parsers.TextParser at 0x8398150>
# TextParser object gets returned from read_csv will be iterating the
file as per the chunksize fixed above i.e. 500. Therefore, we can
aggregate the total value counts in “key” column by iterating the
example6 file.
file_chunker = pd.read_csv('chapter05/example6.csv', chunksize =
500)
total = Series( [] )
for data_piece in file_chunker:
total = total.add( data_piece[ 'key' ].value_counts(), fill_value = 0 )
total = total.order(ascending = False)
# Now we will get the below result:
In [ 179 ]: total[:10]
Out [ 179 ]:
F 460
Y 452
J 445
P 438
R 429
N 417
K 407
V 400
B 391
D 387
TextParser also has a get_chunk method. It helps us to read the
pieces of random size from the file.
How to Write Data Out to Text Format?
We can export data in a delimited format as well by reading it from
the csv file. Now let us see an example:
In [ 180 ]: final_data = pd.read_csv('chapter05/example5.csv')
In [ 181 ]: final_data
Out [ 181 ]:
anyValue w x y z
word
0 one 10 20 30 40
NaN
1 two 50 60 NaN 80 is
2 three NaN 100 110 120
fun
Now by using the to_csv method found in DataFrame, we can write
the resultant data in a comma separated file easily.
In [ 182 ]: final_data.to_csv('chapter05/csv_out.csv')
In [ 183 ]: !cat chapter05/csv_out.csv
, anyValue, w, x, y, z, word
0, one, 10, 20, 30, 40,
1, two, 50, 60, , 80, is
2, three, 90, 100, 110, 120, fun
You are free to use any other delimiter as well and is not limited to
comma only.
In [ 184 ]: final_data.to_csv(sys.stdout, sep = '|')
| anyValue| w| x| y| z| word
0| one| 10| 20| 30| 40|
1| two| 50| 60| | 80| is
2| three| 90| 100| 110| 120| fun
All the missing values will show up as empty in the output. You can
replace them with any sentinel of your choice.
In [ 185 ]: final_data.to_csv(sys.stdout, na_rep = 'NA')
, anyValue, w, x, y, z, word
0, one, 10, 20, 30, 40, NA
1, two, 50, 60, NA, 80, is
2, three, 90, 100, 110, 120, fun
Comma separated values i.e. cs files have various formats. You can
define your own new format having a specific delimiter, line
terminator. You simply need to define a subclass of csv.Dialect to
achieve this. Here is an example:
class own_dialect(csv.Dialect):
line_terminator = '\n'
value_delimiter = ';'
quote_char = '"'
obj_reader = csv.reader(foo, dialect = own_dialect)
HDF5 Format
Various tools are present in order to read and write a large amount of
data efficiently in binary format. One such widely used library is
HDF5, which has an interface for many programming languages,
such as Python, MATLAB, and Java. Full form of HDF is
hierarchical data format. Every HDF5 has a file system like node
structure. It helps you in storing more than one datasets as well as
supports metadata (data about data). As compared to simple formats,
it supports fast compression, which in turn helps in storing repeated
pattern data efficiently. HDF5 is an excellent choice for reading and
writing large datasets that do not fit in the memory, as it will cover
tiny sections of huge arrays.
HDF5 library has 2 interfaces in Python: h5py and PyTables. Both of
them follow a different approach to resolve a problem. PyTables
abstracts the details of HDF5 in order to come up with querying
ability, efficient data containers, and table indexing. Whereas h5py
offers a simple and direct high level interface for the HDF5 API.
For storing pandas object, HDFStore class is used. Let’s see it in the
below example.
In [ 196 ]: obj_store = pd.HDFStore(owndata.h5')
In [ 197 ]: obj_store['obj1'] = obj_frame
In [ 198 ]: obj_store['obj1_col'] = obj_frame['a']
In [ 199 ]: obj_store
Out [ 199 ]:
<class 'pandas.io.pytables.HDFStore'>
File path: owndata.h5
obj1 DataFrame
obj1_col Series
Objects present in the HDF5 file are retireived back in a dict style.
In [ 200 ]: obj_store['obj1']
Out [ 200 ]:
w, x, y, z, word
10, 20, 30, 40, this
50, 60, 70, 80, is
90, 100, 110, 120, fun
Therefore, if you usually work with huge data files, you can explore
more about h5py and PyTables as per your needs and requirements.
As most of the data analysis approaches are IO- bound and not CPU-
bound, using HDF5 will speed up your application.
When selecting the data from a table, SQL Drivers in Python (such
as MySQLdb, PyODBC, pymssql, psycopg2, etc.) will return the list
of tuples.
In [ 208 ]: obj_cursor = obj_con.execute('select * from example')
In [ 209 ]: obj_rows = obj_cursor.fetchall()
In [ 210 ]: obj_rows
Out [ 210 ]:
[(u'Mary', 'USA', 2.5, 26),
(u'John', 'Canada', 3.5, 30),
(u'David', 'UK', 3.8, 32),
(u'Alex', 'Germany', 4.6, 24) ]
Now we can pass this list directly to the DataFrame’s constructor
along with the column names present in the description attribute of
the cursor.
Now let us see an example:
In [ 211 ]: obj_cursor.description
Out [ 211 ]:
( ('w', None, None, None, None, None, None ),
( 'x', None, None, None, None, None, None ),
( 'y', None, None, None, None, None, None ),
( 'z', None, None, None, None, None, None ) )
In [ 212 ]: DataFrame(rows, columns=zip(*obj_cursor.description)
[0])
Out [ 212 ]:
w x y z
0 Mary USA 2.5 26
1 John Canada 3.5 30
2 David UK 3.8 32
3 Alex Germany 4.6 24
It is like munging so that you won’t need to repeat it every time
when you are querying the database. There is a read_frame function
in pandas, which simplifies the entire process. You only need to pass
the select query statement and the connection object created by you.
In [ 213 ]: import pandas.io.sql as obj_sql
In [ 213 ]: obj_sql.read_frame('select * from example', obj_con)
Out [ 213 ]:
w x y z
0 Mary USA 2.5 26
1 John Canada 3.5 30
2 David UK 3.8 32
3 Alex Germany 4.6 24
Thus, we can easily have a list of all the tweets posted on Twitter
using the collection. It can be done using the below syntax.
obj_cursor = obj_tweets.find( {'from_user': 'Mary'} )
An iterator will be returned back by the cursor, which yield all the
documents as a dict. Thus, we can easily convert it into a Python
DataFrame and extract the subset of all the fields of the tweet.
all_fields = [ 'from_user', 'created_at', 'text', 'id' ]
obj_value = DataFrame( list(obj_cursor), columns = all_fields)
Chapter 6: Plotting Data and
Visualization in Python
Data analysis is all about presenting the information in a creative and
interactive way. It is undoubtedly the most significant and essential
process in data analysis. It is mostly used to get details regarding pre
experimentation or post experimentation results using data plots. It
helps in identifying the outliers, coming up with exciting ideas to
transform business, and for data transformations. Python offers
numerous data visualization tools, but we will focus on matplotlib. It
can be found at http://matplotlib.sourceforge.net
Matplotlib is a 2-Dimensional plotting package, which come on the
desktop. It was created in 2002 by John Hunter so that we can use
MATLAB type of quality plots in Python. Today, matplotlib is fully
compatible with Python to provide a productive and functional
environment for computing services. It offers high end features such
as panning and zooming when we use it in Graphical User Interface
toolkit such as IPython. It offers support for numerous GUI backend
applications and is independent of the Operating system. In addition
to that, it has the capability to export the graphics into the raster
formats such as PDF, JPG, GIF, SVG, BMP, PNG, etc.
It also offers various add-on toolkits for creating 3 Dimensional
plots, projections, and mapping the relevant data. Some of the add-on
toolkits are basemap and mplot3d. We will see how basemap is used
to plot data into a map in this chapter. In order to understand the
examples used in this chapter, you need to start IPython in Pylab
mode. You can do this by using the pylab command. Otherwise, you
can also enable the GUI event loop integration. You can do this by
using the %gui command.
Matplotlib API Primer
We can interact with matplotlib in numerous ways. The very basic
methd is to use pylab mode. It will automatically change the
configuration of IPython to support the backend GUI of your
preferred choice. You can use from a variety of options such as
PyQt, Tk, GTK, wxPython, and Mac OS X native. You can also
continue with the default GUI backend if you don’t want to use our
own. While using the Pylab mode, it will automatically import all the
required functions and modules in IPython library so that you get a
similar look and feel of MATLAB. To check if everything has been
set up correctly, you can create a simple plot and check it. You can
do it by using the arange magic command.
Plot(np.arange(20))
A new pop up window will open up with a simple line plot on your
screen if the configuration is set up correctly. You can check it and
then close it by entering close() or using your mouse/touchpad.
Functions such as plot and close present in matplotlib can be easily
imported using the below syntax:
import matplotlib.pyplot as obj_plt
For example, let’s give a command to plot like this one: obj_plt.plot(
[4, 2.5, -3, 2.3] ).
Now the matplotlib will draw on the last figure.
In [ 5 ]: from numpy.random import obj_rand
In [ 6 ]: obj_plt.plot(obj_rand(30).cumsum(), 'k--')
Here the k—refers to a styling option that commands matplotlib for
plotting a black dash line. The Axes Subplot objects that are returned
by obj_fig.add_subplot function can be plotted directly on other
subplots that are empty. This is done by calling their instance
methods.
In [ 7 ]: _ = axis1.hist(obj_rand(50), bins = 10, color = 'k', alpha =
0.25)
In [ 8 ]: axis2.scatter(np.arange(15), np.arange(15) + 3 *
obj_rand(30))
It comes out to be very handy as the axes array works like 2-D array.
We can also point out the subplots that are having similar X or Y
axis by the help of sharex & sharey. This proves to be important
when we want to compare the data on a similar scale, or else it will
auto scale the limits on its own.
Here is the list of options available with the subplots function.
Argument name
Description of the argument
Nrows
It is used to return the number of rows present in the subplots
Ncols
It is used to return the number of columns present in the subplots
Sharex
It is used to make sure that all the subplots are using same X axis
ticks
Sharey
It is used to make sure that all the subplots are using same Y axis
ticks
subplot_kw
It is used to provide the dict of all the keywords to create the
subplots
Plotting Lines
Both DataFrame and Series offers a plotting method of their own in
order to create various type of plots. The default type of plot for both
of them is the line plot. We can create it using the below example:
In [ 14 ]: plot_series = Series(np.random.randn(15).cumsum(), index
= np.arange(0, 50, 5))
In [ 15 ]: plot_series.plot()
Here we are passing the index of series object to the matplotlib to
plot the values on the X axis. We can disable it as well by passing
the argument use_index = False. There are 2 options for adjusting the
X axis and Y axis ticks and limits. They are: xticks and xlim.
Here is a simple series plot indicating the X axis and Y axis plotting.
Almost all the plotting functions present in pandas will accept the
optional parameter ax as the subplot object. It provides much needed
flexibility in placing the subplots in gird layout.
Plot method present in DataFrame will plot every column present in
it at a different line in the same subplot. Thus, a legend gets created
automatically. Here is an example of the same.
In [ 16 ]: plot_df = DataFrame(np.random.randn(20, 5).cumsum(0),
....: columns = [ 'W', 'X', 'Y', 'Z' ],
....: index = np.arange( 0, 50, 50 ) )
In [ 17 ]: plot_df.plot()
Plotting Bars
We can easily create bar plots by passing the value kind = 'bar' in
case of vertical bars. If you want to use horizontal bars, then you
need to use kind = 'barh'. Here is an example of the same:
In [ 17 ]: obj_fig, obj_axes = obj_plt.subplots(2, 1)
In [ 18 ]: obj_data = Series(np.random.rand(12), index = list('a b c d
e f g h I j k l'))
In [ 19 ]: obj_data.plot(kind = 'bar', ax = axes[0], color = 'k', alpha =
0.4)
Out [ 19 ]: <matplotlib.axes.AxesSubplot at 0x3de6851>
In [ 20 ]: obj_data.plot(kind = 'barh', ax = axes[1], color = 'k', alpha =
0.4)
We can also create stacked bar plots using DataFrame. For that, we
need to pass the argument value stacked = True. This will stack
together all the values in every row.
In [ 32 ]: obj_df.plot(kind = 'barh', stacked = True, alpha = 0.4)
Scatter Plots
It is very important to use scatter plots in order to examine the
relationship between 2 data series of 1-Dimensional values. Here,
each point’s co-ordinates are specified using to different DataFrame
columns. These are filled circles that represent those points. The
points can be anything from a pair of metrics to longitude & latitude
values in a map.
Now let us see how we can draw a scatter plot by using the co-
ordinate values present in the columns of DataFrame.
>>> obj_df = pd.DataFrame( [ [2.5, 1.5, 1], [3.7, 2.03 0], [6.5, 3.6,
1],
... [4.4, 1.2, 1], [6.3, 2.0, 1] ],
... columns = [ 'length', 'width', 'species' ] )
>>> axis1 = obj_df.plot.scatter( x = 'length',
... y = 'width',
... c = 'DarkBlue')
After executing the above code, we will get the plot as shown in the
below image.
In the above example, we have passed parameter values of x, y, s
(scalar or array), and c (color).
Conclusion – Python Tools
Since it is an open source, there are numerous options to create data
visualizations in Python. We can choose from the free ones to the
commercial paid libraries with Python. We mainly discussed
matplotlib with Python, as it is easy to implement and understand.
But at the same time, it also has some disadvantages in creating
visually appealing graphics to the end users.
Here are a couple of other tools you can consider for data
visualization in Python.
Chaco
It is best suited for interactive visualization and static plotting. It has
excellent features to express complex data visualizations having
many inter connections. Its rendering is pretty faster as compared to
matplotlib.
You can find more details at this link:
http://code.enthought.com/chaco/
Mayavi
It was created by P. Ramachandran and is basically a 3-Dimensional
toolkit. It integrates easily with Python without any issues. We can
easily rotate, pan, or zoom the plots with the help of keyboard or
mouse.
You can find more details at this link:
https://docs.enthought.com/mayavi/mayavi/
You now have the beginner tools to implement the features of
Python. Be patient with yourself and enjoy your journey in the
amazing world of Python.
References
www.google.com
http://pandas.pydata.org
http://www.codebasicshub.com/
Source of screenshot: Jupyter Notebook (kernel: Python 3)