12 Useful Pandas Techniques in Python For Data Manipulation

The document discusses 12 techniques for data manipulation in Pandas including boolean indexing, apply function, imputing missing values, pivot tables, multi-indexing, crosstabs, merging dataframes, sorting, plotting, cut function for binning, and string operations. The techniques are demonstrated on a loan prediction dataset.

Uploaded by

xwpom2

Introduction
Python is fast becoming the preferred language for data scientists, and for
good reasons. It provides the larger ecosystem of a programming language and
the depth of good scientific computation libraries. If you are starting to learn
Python, have a look at the learning path on Python.
Among its scientific computation libraries, I found Pandas to be the most useful
for data science operations. Pandas, along with Scikit-learn, provides almost the
entire stack needed by a data scientist. This article focuses on providing 12
ways for data manipulation in Python. I've also shared some tips and
tricks which will allow you to work faster.
I would recommend that you look at the code for data exploration before going
ahead. To help you understand better, I've taken a data set to perform these
operations and manipulations.
Data Set: I've used the data set of the Loan Prediction problem. Download the data
set and get started.

Let's get started

I'll start by importing modules and loading the data set into the Python environment:

import pandas as pd
import numpy as np

data = pd.read_csv("train.csv", index_col="Loan_ID")

#1 Boolean Indexing
What do you do if you want to filter values of a column based on conditions
from another set of columns? For instance, we want a list of all females who
are not graduates and got a loan. Boolean indexing can help here. You can use
the following code:

data.loc[(data["Gender"] == "Female")
         & (data["Education"] == "Not Graduate")
         & (data["Loan_Status"] == "Y"),
         ["Gender", "Education", "Loan_Status"]]
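The parentheses around each condition matter because & binds more tightly than == in Python. As a rough sketch of how the masks combine, here is the same filter on a tiny invented table (the rows and the LP1..LP4 labels are made up for illustration, not taken from the loan data):

```python
import pandas as pd

# Toy stand-in for the loan data; these rows are invented for illustration.
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Female"],
        "Education": ["Not Graduate", "Not Graduate", "Graduate", "Not Graduate"],
        "Loan_Status": ["Y", "Y", "Y", "N"],
    },
    index=["LP1", "LP2", "LP3", "LP4"],
)

# Each comparison yields a boolean Series aligned on the index.
is_female = toy["Gender"] == "Female"
not_graduate = toy["Education"] == "Not Graduate"
got_loan = toy["Loan_Status"] == "Y"

# & combines the masks element-wise; a row is kept only if all three are True.
mask = is_female & not_graduate & got_loan
selected = toy.loc[mask, ["Gender", "Education", "Loan_Status"]]
print(selected)
```

Naming the masks first is only a readability choice; it selects the same rows as writing the conditions inline.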

#2 Apply Function
The apply function runs a given function on every row or column of a
dataframe. For instance, it can be used to count the missing values in each
column and in each row:

#Create a new function:
def num_missing(x):
    return sum(x.isnull())

#Applying per column:
print("Missing values per column:")
print(data.apply(num_missing, axis=0))  #axis=0: the function is applied to each column

#Applying per row:
print("\nMissing values per row:")
print(data.apply(num_missing, axis=1).head())  #axis=1: the function is applied to each row

Thus we get the desired result.

Note: head() is used for the second output because it contains many rows.
Read More: Pandas Reference (apply)
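For this particular count, apply is not strictly needed. A minimal sketch on an invented three-row table, comparing the apply version with the built-in isnull().sum() (the toy columns a and b are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Invented data with a few gaps in it.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, "x"]})

def num_missing(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series
    return sum(x.isnull())

per_column_apply = toy.apply(num_missing, axis=0)

# Same counts without a custom function: isnull() gives a boolean table,
# and sum() adds up the True values along the chosen axis.
per_column_builtin = toy.isnull().sum()
per_row_builtin = toy.isnull().sum(axis=1)

print(per_column_apply)
print(per_row_builtin)
```

apply remains the general tool when no built-in reduction exists for what you want to compute.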

#3 Imputing missing values

fillna() does it in one go. It is used for updating missing values with the overall
mean/mode/median of the column. Let's impute the Gender, Married and
Self_Employed columns with their respective modes.

#First we determine the mode with the Series.mode() method
#(newer SciPy releases reject non-numeric input to scipy.stats.mode)
data['Gender'].mode()

The original output reported Male as the mode, occurring 489 times.

Remember that mode() returns a Series rather than a single value, as there can
be multiple values with the same highest frequency. We will always take the
first one by default using:

data['Gender'].mode()[0]

Now we can fill the missing values and check using technique #2.

#Impute the values (assign the result back rather than using inplace=True on a column):
data['Gender'] = data['Gender'].fillna(data['Gender'].mode()[0])
data['Married'] = data['Married'].fillna(data['Married'].mode()[0])
data['Self_Employed'] = data['Self_Employed'].fillna(data['Self_Employed'].mode()[0])

#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))

Hence, it is confirmed that missing values are imputed. Please note that this is
the most primitive form of imputation. More sophisticated techniques include
modeling the missing values or using grouped averages (mean/mode/median). I'll
cover that part in my next articles.
Read More: Pandas Reference (fillna)
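To make the point about multiple modes concrete, here is a small sketch using the pandas Series.mode() method on an invented column with a tie (the values are made up; only the column name is borrowed from the data set):

```python
import pandas as pd

# Invented column: "Yes" and "No" both appear twice, plus one missing value.
s = pd.Series(["Yes", "No", "Yes", "No", None], name="Married")

# mode() ignores missing values by default and returns every tied value,
# sorted, as a Series.
modes = s.mode()

# Taking the first entry is an arbitrary but repeatable tie-break.
fill_value = modes.iloc[0]
filled = s.fillna(fill_value)

print(modes.tolist())
print(filled.tolist())
```

Whether the first of several tied modes is a sensible fill value depends on the column; the sketch only shows the mechanics.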

#4 Pivot Table
Pandas can be used to create MS Excel style pivot tables. For instance, in this
case, a key column is LoanAmount, which has missing values. We can impute
it using the mean amount of each Gender, Married and Self_Employed group.
The mean LoanAmount of each group can be determined as:

#Determine pivot table
impute_grps = data.pivot_table(values=["LoanAmount"],
                               index=["Gender","Married","Self_Employed"],
                               aggfunc="mean")
print(impute_grps)

Read More: Pandas Reference (Pivot Table)
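As a rough illustration of what such a pivot table contains, the sketch below builds one on a five-row invented table and compares it with the equivalent groupby call (the numbers are made up, and only two grouping columns are used to keep it short):

```python
import pandas as pd

# Invented rows; only the column names follow the loan data set.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female", "Male"],
        "Married": ["Yes", "Yes", "No", "No", "No"],
        "LoanAmount": [100.0, 200.0, 80.0, 120.0, 150.0],
    }
)

# One row per (Gender, Married) combination, holding the group mean.
pivot = toy.pivot_table(values=["LoanAmount"], index=["Gender", "Married"], aggfunc="mean")

# The same numbers via groupby; the result is a Series instead of a dataframe.
grouped = toy.groupby(["Gender", "Married"])["LoanAmount"].mean()

print(pivot)
print(pivot.loc[("Male", "Yes"), "LoanAmount"])
```

Only combinations that actually occur in the data get a row, so a lookup for an unseen combination would raise a KeyError.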

#5 Multi-Indexing

If you notice the output of step #4, it has a strange property. Each index is made
up of a combination of 3 values. This is called Multi-Indexing. It helps in
performing operations really fast.
Continuing the example from #4, we have the values for each group but they
have not been imputed. This can be done using the various techniques learned till now.

#iterate only through rows with missing LoanAmount
for i, row in data.loc[data['LoanAmount'].isnull(), :].iterrows():
    ind = tuple([row['Gender'], row['Married'], row['Self_Employed']])
    data.loc[i, 'LoanAmount'] = impute_grps.loc[ind].values[0]

#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))

Note:
1. Multi-index requires a tuple for defining groups of indices in a loc statement.
This is why a tuple is built inside the loop.
2. The .values[0] suffix is required because, by default, a series element is
returned which has an index not matching that of the dataframe. In this
case, a direct assignment gives an error.
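The row-by-row loop can also be written without iteration. The following is a sketch of one alternative, not the method used above: groupby().transform("mean") returns the group mean aligned to every row, which fillna can then use directly. The table is invented, and the sketch assumes the grouping columns have no missing values (as is the case after technique #3).

```python
import numpy as np
import pandas as pd

# Invented rows with two missing loan amounts.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Male", "Female", "Female"],
        "Married": ["Yes", "Yes", "Yes", "No", "No"],
        "LoanAmount": [100.0, 200.0, np.nan, 80.0, np.nan],
    }
)

# transform("mean") broadcasts each group's mean back to the rows of that group,
# so the result has the same index as the original dataframe.
group_mean = toy.groupby(["Gender", "Married"])["LoanAmount"].transform("mean")

# fillna with a Series fills each gap with the value at the same index label.
toy["LoanAmount"] = toy["LoanAmount"].fillna(group_mean)

print(toy)
```

Because the filled values stay aligned by index, no tuple lookup or .values[0] is needed in this variant.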

#6 Crosstab
This function is used to get an initial feel (view) of the data. Here, we can
validate some basic hypotheses. For instance, in this case, Credit_History is
expected to affect the loan status significantly. This can be tested using
cross-tabulation as shown below:

pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True)

These are absolute numbers, but percentages can be more intuitive for gaining
quick insights. We can do this using the apply function:

def percConvert(ser):
    #divide every value in the row by the row total (the last element, "All")
    return ser / float(ser.iloc[-1])

pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True).apply(percConvert, axis=1)

Now, it is evident that people with a credit history have much higher chances of
getting a loan: 80% of people with a credit history got a loan, as compared to only
9% without one.
But that's not it. It tells an interesting story. Since I know that having a credit
history is super important, what if I predict loan status to be Y for the ones with a
credit history and N otherwise? Surprisingly, we'll be right 82+378=460 times out
of 614, which is a whopping 75%!
I won't blame you if you're wondering why the hell we need statistical
models. But trust me, increasing the accuracy by even 0.001% beyond this
mark is a challenging task. Would you take this challenge?
Note: 75% is on the train set. The test set will be slightly different but close. Also, I
hope this gives some intuition into why even a 0.05% increase in accuracy can
result in a jump of 500 ranks on the Kaggle leaderboard.
Read More: Pandas Reference (crosstab)
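Row percentages can also be requested from crosstab itself through its normalize argument. A small sketch on six invented rows, which also scores the "predict Y when there is a credit history" rule described above (the toy numbers are made up and do not reproduce the 460 out of 614 figure):

```python
import pandas as pd

# Invented rows; only the column names follow the loan data set.
toy = pd.DataFrame(
    {
        "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
        "Loan_Status": ["Y", "Y", "Y", "N", "N", "Y"],
    }
)

# Absolute counts per (Credit_History, Loan_Status) pair.
counts = pd.crosstab(toy["Credit_History"], toy["Loan_Status"])

# normalize="index" divides each row by its own total, giving row shares.
row_share = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")

# Baseline rule: predict Y with a credit history, N without.
predicted = toy["Credit_History"].map({1.0: "Y", 0.0: "N"})
accuracy = (predicted == toy["Loan_Status"]).mean()

print(row_share)
print(accuracy)
```

The accuracy is simply the share of rows on the "matching" diagonal of the count table, which is the same arithmetic as 82+378 over 614 in the text.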

#7 Merge DataFrames
Merging dataframes becomes essential when we have information coming from
different sources to be collated. Consider a hypothetical case where the
average property rates (INR per sq meter) are available for different property
types. Let's define a dataframe as:

prop_rates = pd.DataFrame([1000, 5000, 12000],
                          index=['Rural','Semiurban','Urban'], columns=['rates'])
prop_rates

Now we can merge this information with the original dataframe as:

data_merged = data.merge(right=prop_rates, how='inner',
                         left_on='Property_Area', right_index=True, sort=False)
data_merged.pivot_table(values='Credit_History', index=['Property_Area','rates'],
                        aggfunc=len)

The pivot table validates a successful merge operation. Note that the 'values'
argument is irrelevant here because we are simply counting the values.
Read More: Pandas Reference (merge)
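One thing worth checking after an inner merge is whether any rows were silently dropped. The sketch below, on invented rows that include a hypothetical "Suburb" area with no rate, contrasts how='inner' with how='left':

```python
import pandas as pd

# Invented loans; "Suburb" is a made-up area that has no entry in prop_rates.
loans = pd.DataFrame(
    {"Property_Area": ["Urban", "Rural", "Semiurban", "Urban", "Suburb"]},
    index=["LP1", "LP2", "LP3", "LP4", "LP5"],
)
prop_rates = pd.DataFrame(
    [1000, 5000, 12000], index=["Rural", "Semiurban", "Urban"], columns=["rates"]
)

# Inner merge: rows without a matching rate disappear.
inner = loans.merge(prop_rates, how="inner", left_on="Property_Area", right_index=True)

# Left merge: every loan is kept, and unmatched rows get NaN in 'rates'.
left = loans.merge(prop_rates, how="left", left_on="Property_Area", right_index=True)
unmatched = left[left["rates"].isnull()]

print(len(loans), len(inner), len(left))
print(unmatched)
```

Comparing row counts before and after the merge is a cheap check that complements the pivot table validation above.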

#8 Sorting DataFrames
Pandas allows easy sorting based on multiple columns. This can be done as:

data_sorted = data.sort_values(['ApplicantIncome','CoapplicantIncome'],
                               ascending=False)
data_sorted[['ApplicantIncome','CoapplicantIncome']].head(10)

Note: The older Pandas sort function has been removed from current releases. We
should use sort_values instead.
Read More: Pandas Reference (sort_values)
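The ascending argument also accepts one flag per column, which is handy when ties in the first column should be broken in the opposite direction. A minimal sketch on four invented rows:

```python
import pandas as pd

# Invented incomes; rows a and b tie on ApplicantIncome.
toy = pd.DataFrame(
    {
        "ApplicantIncome": [5000, 5000, 3000, 8000],
        "CoapplicantIncome": [0, 1500, 2000, 0],
    },
    index=["a", "b", "c", "d"],
)

# Both columns descending, as in the article's call.
both_desc = toy.sort_values(["ApplicantIncome", "CoapplicantIncome"], ascending=False)

# First column descending, ties broken by the second column ascending.
mixed = toy.sort_values(["ApplicantIncome", "CoapplicantIncome"], ascending=[False, True])

print(both_desc)
print(mixed)
```

The only difference between the two results is the order of the tied rows a and b.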

#9 Plotting (Boxplot & Histogram)

Many of you might be unaware that boxplots and histograms can be plotted
directly from Pandas, and calling matplotlib separately is not necessary. It's just a
one-line command. For instance, if we want to compare the distribution of
ApplicantIncome by Loan_Status:

import matplotlib.pyplot as plt
#The next line is a notebook-only magic; leave it out of a plain Python script
%matplotlib inline

data.boxplot(column="ApplicantIncome", by="Loan_Status")
data.hist(column="ApplicantIncome", by="Loan_Status", bins=30)

This shows that income is not a big deciding factor on its own as there is no
appreciable difference between the people who received and were denied the
loan.
Read More: Pandas Reference (hist) | Pandas Reference (boxplot)
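The quartiles that a boxplot draws can also be read off as numbers, which helps when judging whether two boxes really differ. As a sketch on six invented incomes (not the real distribution), groupby().describe() gives the same five-number summary per loan status:

```python
import pandas as pd

# Invented incomes for three approved and three rejected applicants.
toy = pd.DataFrame(
    {
        "ApplicantIncome": [3000, 4000, 5000, 3500, 4500, 5500],
        "Loan_Status": ["Y", "Y", "Y", "N", "N", "N"],
    }
)

# describe() reports count, mean, std, min, quartiles and max for each group;
# the 25%, 50% and 75% columns correspond to the edges and centre line of a box.
summary = toy.groupby("Loan_Status")["ApplicantIncome"].describe()
medians = toy.groupby("Loan_Status")["ApplicantIncome"].median()

print(summary)
print(medians)
```

This is a numeric companion to the plots, not a replacement; the histogram still shows shape that a handful of quantiles cannot.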

#10 Cut function for binning

Sometimes numerical values make more sense if clustered together. For
example, if we're trying to model traffic (#cars on road) with time of the day
(minutes), the exact minute of an hour might not be that relevant for predicting
traffic as compared to the actual period of the day like "Morning", "Afternoon",
"Evening", "Night", "Late Night". Modeling traffic this way will be more intuitive
and will avoid overfitting.
Here we define a simple function which can be re-used for binning any variable
fairly easily.

#Binning:
def binning(col, cut_points, labels=None):
    #Define min and max values:
    minval = col.min()
    maxval = col.max()

    #create list by adding min and max to cut_points
    break_points = [minval] + cut_points + [maxval]

    #if no labels provided, use default labels 0 ... (n-1)
    if not labels:
        labels = list(range(len(cut_points)+1))

    #Binning using cut function of pandas
    #(right-closed bins with include_lowest=True keep both minval and maxval
    # inside a bin; with right=False the maximum fell outside the last bin)
    colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
    return colBin

#Binning LoanAmount:
cut_points = [90,140,190]
labels = ["low","medium","high","very high"]
data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
print(data["LoanAmount_Bin"].value_counts(sort=False))
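How pd.cut treats values that sit exactly on a bin edge depends on its right and include_lowest arguments. The sketch below uses four invented amounts chosen to land on the edges; the open-ended variant with infinite outer edges is an alternative I am assuming for illustration, not something the article uses:

```python
import numpy as np
import pandas as pd

# Invented amounts: 50 is the minimum edge, 90 an inner edge, 200 the maximum edge.
amounts = pd.Series([50, 90, 150, 200])
labels = ["low", "medium", "high", "very high"]
edges = [50, 90, 140, 190, 200]

# Bins closed on the right, (a, b]; include_lowest=True also admits the minimum.
right_closed = pd.cut(amounts, bins=edges, labels=labels, include_lowest=True)

# Bins closed on the left, [a, b); the maximum edge value is left unbinned (NaN).
left_closed = pd.cut(amounts, bins=edges, labels=labels, right=False)

# Infinite outer edges avoid the problem regardless of which side is closed.
open_ended = pd.cut(
    amounts, bins=[-np.inf, 90, 140, 190, np.inf], labels=labels, right=False
)

print(right_closed.tolist())
print(left_closed.tolist())
print(open_ended.tolist())
```

Infinite outer edges also mean the same bins can be reused on new data whose range differs from the training data.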

#11 Coding nominal data

Often, we find a case where we've to modify the categories of a nominal
variable. This can be due to various reasons:
1. Some algorithms (like Logistic Regression) require all inputs to be numeric.
So nominal variables are mostly coded as 0, 1, ..., (n-1).
2. Sometimes a category might be represented in 2 ways. For example, temperature
might be recorded as "High", "Medium", "Low", "H", "low". Here, both "High"
and "H" refer to the same category. Similarly, "Low" and "low" differ only in
case. But Python would read them as different levels.
3. Some categories might have very low frequencies and it's generally a good
idea to combine them.

Here I've defined a generic function which takes a dictionary as input and
codes the values using the replace function in Pandas.

#Define a generic function using Pandas replace function
def coding(col, codeDict):
    colCoded = pd.Series(col, copy=True)
    for key, value in codeDict.items():
        colCoded = colCoded.replace(key, value)
    return colCoded

#Coding LoanStatus as Y=1, N=0:
print('Before Coding:')
print(data["Loan_Status"].value_counts())

data["Loan_Status_Coded"] = coding(data["Loan_Status"], {'N':0,'Y':1})

print('\nAfter Coding:')
print(data["Loan_Status_Coded"].value_counts())

Identical counts before and after confirm the coding.
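The temperature example from reason 2 can be handled with the same idea. A rough sketch, using the five labels mentioned above as invented data: case differences are removed with the string methods, the abbreviation is fixed with replace, and the cleaned labels are then coded with map (the 0/1/2 codes are an arbitrary choice for illustration):

```python
import pandas as pd

# The inconsistent labels from the example above, as an invented column.
temperature = pd.Series(["High", "Medium", "Low", "H", "low"])

# Lower-casing merges "Low"/"low"; replace() matches whole values,
# so only the stand-alone "h" is turned into "high".
cleaned = temperature.str.strip().str.lower().replace({"h": "high"})

# map() codes the cleaned labels; any label missing from the dict would become NaN,
# whereas replace() would leave it unchanged.
codes = cleaned.map({"low": 0, "medium": 1, "high": 2})

print(cleaned.tolist())
print(codes.tolist())
```

That difference between map and replace is useful in practice: a NaN after map flags a category you forgot to list.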

#12 Iterating over rows of a dataframe

This is not a frequently used operation. Still, you don't want to get stuck, right?
At times you may need to iterate through all rows using a for loop. For instance,
one common problem we face is the incorrect treatment of variables in Python.
This generally happens when:
1. Nominal variables with numeric categories are treated as numerical.
2. Numeric variables with characters entered in one of the rows (due to a data
error) are considered categorical.

So it's generally a good idea to manually define the column types. If we check
the data types of all columns:

#Check current type:

data.dtypes

Here we see that Credit_History is a nominal variable but appears as float. A
good way to tackle such issues is to create a csv file with column names and
types. This way, we can make a generic function to read the file and assign
column data types. For instance, here I have created a csv file, datatypes.csv.

#Load the file:
colTypes = pd.read_csv('datatypes.csv')
print(colTypes)

After loading this file, we can iterate through each row and assign the data type
given in the type column to the variable named in the feature column.

#Iterate through each row and assign variable type.
#Note: astype is used to assign types
#(plain object and float replace np.object and np.float, which newer NumPy no longer provides)
for i, row in colTypes.iterrows():  #i: dataframe index; row: each row in series format
    if row['type'] == "categorical":
        data[row['feature']] = data[row['feature']].astype(object)
    elif row['type'] == "continuous":
        data[row['feature']] = data[row['feature']].astype(float)

print(data.dtypes)

Now the credit history column is modified to object type, which is used for
representing nominal variables in Pandas.

Read More: Pandas Reference (iterrows)
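The same assignment can be done in a single astype call by first turning the type table into a dictionary. This is a sketch under assumptions: the two-row colTypes table and the loans table are built in memory with invented values, and the column names feature and type follow the description above rather than a real datatypes.csv.

```python
import pandas as pd

# Invented stand-ins for the loan data and for the contents of the type table.
loans = pd.DataFrame({"Credit_History": [1.0, 0.0, 1.0], "LoanAmount": [120, 100, 140]})
colTypes = pd.DataFrame(
    {
        "feature": ["Credit_History", "LoanAmount"],
        "type": ["categorical", "continuous"],
    }
)

# Translate the type names used in the table into actual dtypes.
type_map = {"categorical": object, "continuous": float}
dtype_by_column = {
    feature: type_map[kind]
    for feature, kind in zip(colTypes["feature"], colTypes["type"])
}

# astype accepts a {column: dtype} mapping and converts all listed columns at once.
converted = loans.astype(dtype_by_column)

print(converted.dtypes)
```

Columns that are not listed in the mapping keep their existing type, so the table only needs to name the columns you want to change.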

End Notes
In this article, we covered various functions of Pandas which can make our life
easy while performing data exploration and feature engineering. Also, we
defined some generic functions which can be reused for achieving similar
objectives on different datasets.
Also See: If you have any doubts pertaining to Pandas or Python in general,
feel free to discuss with us.
Did you find the article useful? Do you use some better (easier/faster)
techniques for performing the tasks discussed above? Do you think there are
better alternatives to Pandas in Python? We'll be glad if you share your
thoughts as comments below.
