Working With Pandas Notes
Mutability - All Pandas data structures are value mutable (can be changed).
Note − DataFrame is widely used and one of the most important data structures. Panel is used much less, and was removed from pandas in version 0.25.
Series - Series is a one-dimensional array like structure with homogeneous data. For example, the following
series is a collection of integers 10, 23, 56 …
10 23 56 17 52 61 73 90 26 72
Key Points:
Homogeneous data
Size Immutable
Values of Data Mutable
A pandas Series can be created using the following constructor –
pandas.Series(data, index, dtype, copy)
data
data takes various forms like ndarray, list, constants
index
Index values must be hashable and the same length as data; they are usually unique, although duplicate
labels are allowed. Default np.arange(n) if no index is passed.
dtype
dtype is for data type. If None, data type will be inferred
copy
Copy data. Default False
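The four constructor parameters above can be sketched together as follows. The values and labels are made up for illustration only.

```python
import numpy as np
import pandas as pd

# data: a list; index: one label per value; dtype: force float64 storage
s = pd.Series(data=[10, 23, 56], index=["x", "y", "z"], dtype="float64")
print(s)

# copy=True gives the Series its own copy of the data,
# so changing the Series leaves the source array untouched
arr = np.array([1, 2, 3])
s_copy = pd.Series(arr, copy=True)
s_copy.iloc[0] = 99
print(arr[0])  # still 1
```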
Creating Pandas Series – A series can be created using various inputs like −
Array, Dict, Scalar value or constant, Mathematical Operations etc.
a) Create an Empty Series: An empty series can be created using Series() function of pandas library.
Example - #import the pandas library as pd
import pandas as pd
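# pass dtype explicitly: the default dtype of an empty Series differs between pandas versions
s = pd.Series(dtype='float64')
print(s)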
Output - Series([], dtype: float64)
b) Creating a Series from a List: To create a series from a list, first create the list and then pass it to
the Series constructor.
Example 1 - # import the pandas library as pd
import pandas as pd
simple_list = ['g', 'e', 'e', 'k', 's']
s = pd.Series(simple_list)
print(s)
Output -
c) Create a Series from ndarray - If data is an ndarray, then index passed must be of the same length. If
no index is passed, then by default index will be range(n) where n is array length, i.e., [0, 1, 2, ...,
len(array)-1].
Example 1 - #import the NUMPY & PANDAS as np & pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
Output –
We did not pass any index, so by default, it assigned the indexes ranging from 0 to len(data)-
1, i.e., 0 to 3.
Example 2 - #import the NUMPY & PANDAS as np & pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
Output -
We passed the index values here. Now we can see the customized indexed values in the
output.
d) Create a Series from dictionaries - A dict can be passed as input and if no index is specified, then the
dictionary keys are used to construct the index (in insertion order on current pandas versions; older
versions sorted them). If index is passed, the values in data corresponding to the labels in the index will
be pulled out.
Example 1 - #import the NUMPY & PANDAS as np & pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Output -
Observe − When an index is passed explicitly, the index order is persisted and any label missing from the
dictionary is filled with NaN (Not a Number).
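A small sketch of a dictionary combined with an explicit index. The index list here is an assumption chosen so that one label ('d') has no matching key.

```python
import pandas as pd

data = {"a": 0.0, "b": 1.0, "c": 2.0}

# 'd' is not a key of data, so its value becomes NaN;
# the remaining labels keep the order given in index
s = pd.Series(data, index=["b", "c", "d", "a"])
print(s)
```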
e) Create a Series from Scalar (Single Item) - If data is a scalar value, an index must be provided. The
value will be repeated to match the length of index
Example - #import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
Output -
f) Create a Series from Mathematical Operations – Different types of mathematical operators (+, -, *, /
etc.) can be applied on pandas series to generate another series.
Example - # import the pandas library as pd
import pandas as pd
s = pd.Series([1,2,3])
t = pd.Series([1,2,4])
u = s + t #addition operation
print(u)
u = s * t # multiplication operation
print(u)
Output -
Series Basic Functionality:
axes
Returns a list of the row axis labels
dtype
Returns the dtype of the object.
empty
Returns True if series is empty.
ndim
Returns the number of dimensions of the underlying data, by definition 1.
size
Returns the number of elements in the underlying data.
values
Returns the Series as ndarray.
head()
Returns the first n rows. The default value of n is 5.
tail()
Returns the last n rows. The default value of n is 5.
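The attributes and methods listed above can be tried on the integer series from the start of these notes. This is only an illustrative sketch.

```python
import pandas as pd

s = pd.Series([10, 23, 56, 17, 52, 61, 73, 90, 26, 72])

print(s.axes)     # list holding the row index
print(s.dtype)    # dtype of the values
print(s.empty)    # False, the series has elements
print(s.ndim)     # 1
print(s.size)     # 10
print(s.values)   # the data as a NumPy ndarray
print(s.head())   # first 5 rows, since n defaults to 5
print(s.tail(3))  # last 3 rows
```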
Accessing Data from Series with Position - Data in the series can be accessed similar to that in an
ndarray.
Example 1 – Retrieve the first element. As we already know, the counting starts from zero for the array,
which means the first element is stored at zeroth position and so on.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
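#retrieve the first element by position
#(.iloc makes the positional lookup explicit; plain s[0] on a labelled index is deprecated in recent pandas)
print(s.iloc[0])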
Output –
Example 2 – Retrieve the first three elements in the Series. If a : is placed before a position (as in [:3]),
all items up to, but not including, that position are extracted. If a : is placed after a position (as in [3:]),
all items from that position onwards are extracted. If two positions are given with a : between them, the
items between the two positions (not including the stop position) are extracted.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve the first three elements
print(s.iloc[:3])
Output –
Retrieve Data Using Label (Index) – A Series is like a fixed-size dict in that you can get and set values by
index label.
Example 1 – Retrieve a single element using index label value.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve a single element
print(s['a'])
Output –
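The dict-like behaviour described above (getting and setting by label) can be sketched as follows. The label list and the new value are chosen only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# get several values at once by passing a list of labels
subset = s[["a", "c", "d"]]
print(subset)

# set a value through its label, as with a dict
s["a"] = 100
print(s["a"])

# membership tests work on the labels, again like a dict
print("f" in s)  # False: there is no label 'f'
```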
DataFrame - A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in
rows and columns.
A pandas DataFrame can be created using the following constructor –
pandas.DataFrame(data, index, columns, dtype, copy)
data
data takes various forms like ndarray, series, map, lists, dict, constants and also another
DataFrame.
index
Row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no index is
passed.
columns
Column labels to use for the resulting frame. Optional; defaults to np.arange(n) if no column
labels are passed.
dtype
Data type to force on the data; only a single dtype is allowed. If None, it is inferred.
copy
Whether to copy the input data. Default False for DataFrame and 2-D ndarray input.
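A minimal sketch of the constructor parameters in use. The array contents and the labels r1..r3 and A, B are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# explicit row labels, column labels and dtype
df = pd.DataFrame(
    data=np.array([[1, 2], [3, 4], [5, 6]]),
    index=["r1", "r2", "r3"],
    columns=["A", "B"],
    dtype="float64",
)
print(df)

# without index/columns, both default to 0, 1, 2, ...
default_df = pd.DataFrame(np.array([[1, 2], [3, 4]]))
print(default_df.index.tolist(), default_df.columns.tolist())
```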
Creating DataFrames – A pandas DataFrame can be created using various inputs like –
Lists, dict, Series, Numpy ndarrays, Another DataFrame
a) Create an Empty DataFrame - The most basic DataFrame that can be created is an empty DataFrame.
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)
Output –
b) Create a DataFrame from Lists - The DataFrame can be created using a single list or a list of lists.
Example 1 –
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
Output –
Example 2 –
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Output -
Example 3 –
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
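df = pd.DataFrame(data,columns=['Name','Age'])
df = df.astype({'Age': float})  # convert only the Age column
print(df)
Output -
Observe, the Age column is converted to floating point. Passing dtype=float to the constructor instead
applies to every column, and recent pandas versions may raise an error on the Name strings.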
c) Create a DataFrame from Dict of ndarrays / Lists - All the ndarrays must be of same length. If index
is passed, then the length of the index should be equal to the length of the arrays.
If no index is passed, then by default, index will be range(n), where n is the array length.
Example 1 –
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
        'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
Output –
Observe the values 0,1,2,3. They are the default index assigned to the rows using the function range(n).
Example 2 – Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
        'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
Output –
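d) Create a DataFrame from List of Dicts – A list of dictionaries can be passed as input data. The
dictionary keys are taken as column names by default.
Example 2 – The following example shows how to create a DataFrame by passing a list of dictionaries
and the row indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)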
Output –
Example 3 – The following example shows how to create a DataFrame with a list of dictionaries, row
indices, and column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'],
                   columns=['a', 'b'])
#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'],
                   columns=['a', 'b1'])
print(df1)
print(df2)
Output –
Observe, df2 DataFrame is created with a column index other than the dictionary key; thus, NaN's are
filled in for that column. Whereas, df1 is created with column indices same as dictionary keys, so no NaN's
appear.
e) Create a DataFrame from Dictionary of Series – Dictionary of Series can be passed to form a
DataFrame. The resultant index is the union of all the series indexes passed.
Example –
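import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)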
Output –
Now, the above table will look as follows if we represent it in CSV format:
Observe, a comma separates all the values in columns within each row. However, you can use other
symbols such as a semicolon (;) as a separator as well.
The two workhorse functions for reading text files (or the flat files) are read_csv() and
read_table(). They both use the same parsing code to intelligently convert tabular data into a
DataFrame object –
read_csv – This function reads a CSV file and saves its contents into a DataFrame object. Below
are examples demonstrating the different parameters of the read_csv function.
Here is the temp.csv file data used in all the following examples –
S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
Example – Reading temp.csv file using Pandas
import pandas as pd
df=pd.read_csv("temp.csv")
print(df)
Output -
sep - If the separator between each field of your data is not a comma, use the sep argument.
Example – Consider the data is separated in the above given temp.csv file by dollar ($) symbol
instead of comma (,) as below:
S.No$Name$Age$City$Salary
1$Tom$28$Toronto$20000
2$Lee$32$HongKong$3000
3$Steven$43$Bay Area$8300
4$Ram$38$Hyderabad$3900
Program -
import pandas as pd
data=pd.read_csv("temp.csv", sep='$')
print(data)
Output –
Note - If your csv file does not have header, then you need to set header = None while
reading it.
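names - Use the names argument if you want to specify column names for the
dataframe explicitly.
Program –
import pandas as pd
col_names = ['Serial No', 'Person Name', 'Age', 'Lives in', 'Earns']
# header=None treats every line as data, so the header line of temp.csv shows up
# as the first row; pass header=0 instead to replace it with col_names
data = pd.read_csv("temp.csv", names=col_names, header=None)
print(data)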
Output –
index_col - Use this argument to specify the row labels to use. Either a column name or a column
number can be given as the index column.
Program –
import pandas as pd
df=pd.read_csv("temp.csv",index_col=['S.No'])
print(df)
Output –
prefix - When a data set doesn't have any header, and you read it into a dataframe with
(header = None), pandas read_csv generates dataframe column names automatically with
integer values 0,1,2,…
If we want to prefix each column's name with a string, say, "COLUMN", such that dataframe
column names will now become COLUMN0, COLUMN1, COLUMN2 etc. we use the prefix
argument. Note that prefix was deprecated in pandas 1.4 and removed in pandas 2.0; on newer
versions the columns have to be renamed after reading.
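A sketch of producing COLUMN0, COLUMN1, ... names with DataFrame.add_prefix, which does not depend on whether read_csv accepts a prefix argument in the installed version. The inline headerless data is made up, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# headerless data; StringIO lets read_csv treat the string like a file
csv_text = "1,Tom,28\n2,Lee,32\n"

df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.columns.tolist())  # integer names generated by read_csv

# prepend the string to every column name
df = df.add_prefix("COLUMN")
print(df.columns.tolist())
```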
pandas read_csv dtype - Use dtype to set the datatype for the data or dataframe columns.
Program –
import pandas as pd
import numpy as np
df = pd.read_csv("temp.csv", dtype={'Salary': np.float64})
print(df.dtypes)
Output –
true_values, false_values – Suppose your dataset contains Yes and No strings which you
want to interpret as True and False. Pass the strings as lists to true_values and false_values.
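A small sketch of true_values and false_values. The Member column and its data are hypothetical, and io.StringIO is used in place of a real file.

```python
import io
import pandas as pd

csv_text = "Name,Member\nTom,Yes\nLee,No\nRam,Yes\n"

# every value in Member is listed as true or false, so the column is read as bool
df = pd.read_csv(
    io.StringIO(csv_text),
    true_values=["Yes"],
    false_values=["No"],
)
print(df)
print(df.dtypes)
```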
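skiprows - skiprows skips the specified number of lines at the start of the file. The header line counts,
so with skiprows=2 the first remaining line (2,Lee,32,HongKong,3000) is used as the header.
import pandas as pd
df=pd.read_csv("temp.csv", skiprows=2)
print(df)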
Output -
skip_blank_lines - If skip_blank_lines option is set to False, then wherever blank lines are
present, NaN values will be inserted into the dataframe. If set to True, then the entire blank
line will be skipped.
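The effect of skip_blank_lines can be seen on a tiny made-up input with one empty line. This is a sketch using io.StringIO instead of a file.

```python
import io
import pandas as pd

# the third line of this input is blank
csv_text = "A,B\n1,2\n\n3,4\n"

# blank line kept: it becomes a row of NaN values
with_blank = pd.read_csv(io.StringIO(csv_text), skip_blank_lines=False)
print(with_blank)

# default behaviour (skip_blank_lines=True): the blank line is dropped
without_blank = pd.read_csv(io.StringIO(csv_text))
print(without_blank)
```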
parse_dates - We can use pandas parse_dates to parse columns as datetime. You can
either use parse_dates = True (which tries to parse the index) or parse_dates = ['column name']
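A sketch of parse_dates with a list of column names. The Joined column and its dates are invented for illustration.

```python
import io
import pandas as pd

csv_text = "Name,Joined\nTom,2020-01-15\nLee,2021-06-30\n"

# the Joined column is parsed into datetime values instead of plain strings
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Joined"])
print(df.dtypes)

# datetime columns expose date parts through the .dt accessor
print(df["Joined"].dt.year.tolist())
```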
iterator - Setting iterator to True returns a reader object instead of a DataFrame; you can then define
how much data to read in each iteration by calling get_chunk().
Program –
import pandas as pd
data=pd.read_csv("temp.csv", iterator = True)
print("first", data.get_chunk(2))
print("second", data.get_chunk(2))
Output –
g) Create DataFrame from Text files – The pandas built-in function pandas.read_table() is
used to read text files and store them in a DataFrame object.
Difference between CSV file and Text file:
TEXT FILE DATA – Separated by Tab (\t); use read_table() to read the data
S.No    Name      Age    City         Salary
1       Tom       28     Toronto      20000
2       Lee       32     HongKong     3000
3       Steven    43     Bay Area     8300
4       Ram       38     Hyderabad    3900
CSV FILE DATA – Separated by Comma (,); use read_csv() to read the data
S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
Important Note – read_table() function supports all parameters which are supported
by read_csv() function.
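To illustrate how close the two functions are, the sketch below reads the same tab-separated text once with read_table() and once with read_csv(sep='\t'). The shortened data is taken from the table above, and io.StringIO replaces a file on disk.

```python
import io
import pandas as pd

text = "S.No\tName\tAge\n1\tTom\t28\n2\tLee\t32\n"

# read_table uses tab as its default separator
from_table = pd.read_table(io.StringIO(text))

# read_csv gives the same result once the separator is set to tab
from_csv = pd.read_csv(io.StringIO(text), sep="\t")

print(from_table)
print(from_table.equals(from_csv))
```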
Operations on DataFrame columns –
a) Column Selection – To select a column from DataFrame, we can simply pass column label into the
DataFrame object as given below:
Example –
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])
Output –
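b) Column Addition – To add a column to an existing DataFrame, we can simply assign the elements to a
new column label of the DataFrame object as given below:
Example -
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(df)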
Output –
c) Column Deletion – Columns can be deleted with the del statement.
Example –
# Using the previous DataFrame, we will delete a column
# using the del statement
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
del df['one']  # removes column 'one' in place
print("After deleting column 'one':")
print(df)
Output –
Selection by label – Rows can be selected by passing a row label to the loc indexer.
Example –
df = pd.DataFrame(d)
print(df.loc['b'])
Output –
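Selection by integer location – Rows can be selected by passing an integer location to the iloc indexer.
Example –
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])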
Output –
Slicing rows – Multiple rows can be selected using the : operator.
df = pd.DataFrame(d)
print(df[2:4])
Output –
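b) Row Addition – New rows can be added to an existing DataFrame using the concat function (the
DataFrame.append method used in older code was removed in pandas 2.0). The new rows are appended
at the end.
Example –
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = pd.concat([df, df2])
print(df)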
Output –
c) Row Deletion – We can use index label to delete or drop rows from a DataFrame. If label is duplicated,
then multiple rows will be dropped.
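Observe, in the above example, the labels 0 and 1 are duplicated.
Example –
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = pd.concat([df, df2])
df = df.drop(0)  # drop every row whose label is 0
print(df)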
Output –
Observe, two rows were dropped because those two contain the same label 0.
Renaming DataFrame Rows/Columns index labels – Pandas rename() method is used to rename any
index, column or row.
Syntax - DataFrame.rename(mapper=None, index=None, columns=None,
axis=None, inplace=False)
Parameters:
mapper, index and columns: Dictionary value, key refers to the old name and value refers to
new name. Use either mapper together with axis, or index and/or columns.
axis: int or string value, 0/'index' for Rows and 1/'columns' for Columns. Default is 0.
inplace: Makes changes in original Data Frame if True. Default is False.
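The parameters above can be sketched on a small made-up DataFrame. The old and new names are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4]}, index=["a", "b"])

# index and columns can be renamed in one call; a new DataFrame is returned
renamed = df.rename(index={"a": "row_a"}, columns={"one": "first"})
print(renamed)

# mapper + axis is the alternative spelling for renaming along one axis
same = df.rename({"two": "second"}, axis="columns")
print(same)

# inplace=True changes df itself and returns None
df.rename(columns={"one": "uno"}, inplace=True)
print(df)
```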
g) Boolean Indexing – In Boolean indexing, we select subsets of data based on the actual values of
the data in the DataFrame and not on their row/column labels or integer locations, using a Boolean
vector to filter the data. We can filter data in four ways –
Accessing a DataFrame with a Boolean index - In order to access a dataframe with a Boolean
index, we have to create a dataframe whose index contains the Boolean values
True or False.
For example –
# importing pandas as pd
import pandas as pd
# dictionary of lists (named data to avoid shadowing the built-in dict)
data = { 'name' : ["aparna", "pankaj", "sudhir", "Geeku"],
         'degree' : ["MBA", "BCA", "M.Tech", "MBA"],
         'score' : [90, 40, 80, 98]}
df = pd.DataFrame(data, index = [True, False, True, False])
print(df)
Output –
Now that we have created a dataframe with a Boolean index, we can access the dataframe with
the help of that index.
Accessing a Dataframe with a Boolean index using .loc[ ]
In order to access a dataframe with a Boolean index using .loc[ ], we simply pass a Boolean value
(True or False) to the .loc[ ] indexer.
Example – print(df.loc[True])
Output -
Applying a Boolean mask to a dataframe - We can apply a Boolean mask by giving a list of True and
False values of the same length as the dataframe. When we apply a Boolean mask, only the rows
for which the mask holds True are returned.
Example –
# importing pandas as pd
import pandas as pd
# dictionary of lists (named data to avoid shadowing the built-in dict)
data = { 'name' : ["aparna", "pankaj", "sudhir", "Geeku"],
         'degree' : ["MBA", "BCA", "M.Tech", "MBA"],
         'score' : [90, 40, 80, 98]}
df = pd.DataFrame(data)
print(df[[True, False, True, False]])
Output –
Masking data based on column value - In a dataframe we can filter data based on a column value.
To do so, we apply a condition on a column using operators like ==,
>, <, <=, >=. Applying such an operator to a column produces a Series of True and False values.
Example 1 – print(df[df['score']>90])
Output –
Example 2 –
masking_condition =(df['score']>80) & (df['degree']=='MBA')
print(df[masking_condition])
Output –
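Conditions can also be combined with | (or) and inverted with ~ (not), in the same way as the & used above. These two operators are not covered in the notes themselves, so the following is an additional sketch on the same data.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["aparna", "pankaj", "sudhir", "Geeku"],
    "degree": ["MBA", "BCA", "M.Tech", "MBA"],
    "score": [90, 40, 80, 98],
})

# rows where either condition holds; each condition needs its own parentheses
either = df[(df["score"] > 90) | (df["degree"] == "BCA")]
print(either)

# rows where the condition does NOT hold
not_mba = df[~(df["degree"] == "MBA")]
print(not_mba)
```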
Masking data based on index value - In a dataframe we can filter data based on the index value.
To do so, we create a mask from the index values using operators like
==, >, <, etc.
Example – print(df[df.index>1])
Output –