12 Useful Pandas Techniques in Python For Data Manipulation
import pandas as pd
import numpy as np
#1 Boolean Indexing
What do you do if you want to filter the values of one column based on
conditions on other columns? For instance, we want a list of all females who
are not graduates and got a loan. Boolean indexing can help here. You can use
the following code:
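A minimal sketch of this filter on a toy frame. The rows are made up, and the Education column name and its "Not Graduate" label are assumptions about how the loan data is coded:

```python
import pandas as pd

# Toy stand-in for the loan data; the values are invented for illustration.
data = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female"],
    "Education": ["Not Graduate", "Graduate", "Not Graduate", "Graduate"],
    "Loan_Status": ["Y", "Y", "N", "Y"],
})

# Each condition yields a boolean Series; combine them with & and
# wrap every condition in parentheses.
mask = (
    (data["Gender"] == "Female")
    & (data["Education"] == "Not Graduate")
    & (data["Loan_Status"] == "Y")
)

# .loc takes the mask for the rows and a list of columns to return.
selected = data.loc[mask, ["Gender", "Education", "Loan_Status"]]
print(selected)
```

Only the rows where all three conditions hold are kept.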
#2 Apply Function
It is one of the most commonly used functions for playing with data and creating
new variables. Apply returns some value after passing each row/column of a data
frame through a function. The function can be either built-in or user-defined.
For instance, here it can be used to find the number of missing values in each
row and column.
def num_missing(x):
    return sum(x.isnull())
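As a rough illustration of how such a function is passed to apply, here is a sketch on a small made-up frame. With axis=0 the function receives one column at a time, with axis=1 one row at a time:

```python
import numpy as np
import pandas as pd

def num_missing(x):
    # x is a single column (axis=0) or a single row (axis=1), as a Series.
    return x.isnull().sum()

# Toy data with a few gaps; the values are invented for illustration.
data = pd.DataFrame({
    "Gender": ["Male", None, "Female"],
    "LoanAmount": [120.0, np.nan, np.nan],
})

missing_per_column = data.apply(num_missing, axis=0)
missing_per_row = data.apply(num_missing, axis=1)

print(missing_per_column)
print(missing_per_row)
```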
# Series.mode() returns all modes; recent SciPy releases no longer accept text columns in scipy.stats.mode
data['Gender'].mode()
data['Gender'].mode()[0]
Now we can fill the missing values and check using technique #2.
data['Gender'] = data['Gender'].fillna(data['Gender'].mode()[0])
data['Married'] = data['Married'].fillna(data['Married'].mode()[0])
data['Self_Employed'] = data['Self_Employed'].fillna(data['Self_Employed'].mode()[0])
Hence, it is confirmed that the missing values are imputed. Please note that this
is the most primitive form of imputation. More sophisticated techniques include
modeling the missing values or using grouped averages (mean/mode/median). I'll
cover that part in my next articles.
Read More: Pandas Reference (fillna)
#4 Pivot Table
Pandas can be used to create MS Excel style pivot tables. For instance, in this
case, a key column is LoanAmount, which has missing values. We can impute
it using the mean amount of each Gender, Married and Self_Employed group.
The mean LoanAmount of each group can be determined as:
impute_grps = data.pivot_table(values=["LoanAmount"],
                               index=["Gender","Married","Self_Employed"],
                               aggfunc="mean")
print(impute_grps)
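To see what such a pivot table holds, here is a small self-contained sketch. The four rows are invented; only the column names come from the text above:

```python
import pandas as pd

# Toy rows; one LoanAmount is missing on purpose.
data = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "Married": ["No", "No", "Yes", "Yes"],
    "Self_Employed": ["No", "No", "No", "No"],
    "LoanAmount": [100.0, 120.0, 150.0, None],
})

# One row per (Gender, Married, Self_Employed) combination.
# Missing values are skipped when the mean is computed.
impute_grps = data.pivot_table(
    values=["LoanAmount"],
    index=["Gender", "Married", "Self_Employed"],
    aggfunc="mean",
)
print(impute_grps)
```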
#5 Multi-Indexing
If you look at the output of step #4, it has a strange property. Each index is
made up of a combination of 3 values. This is called Multi-Indexing. It helps in
performing operations really fast.
Continuing the example from #4, we have the mean values for each group but they
have not been imputed yet. This can be done using the techniques learned so far.
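# iterate only over the rows with a missing LoanAmount
for i, row in data.loc[data['LoanAmount'].isnull(), :].iterrows():
    ind = tuple([row['Gender'], row['Married'], row['Self_Employed']])
    data.loc[i, 'LoanAmount'] = impute_grps.loc[ind].values[0]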
Note:
1. A multi-index requires a tuple to define a group of indices in a loc
statement. This is why a tuple is used here.
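Putting the pieces together, a sketch of the whole imputation on toy data might look like this. The rows are invented, and the loop follows the tuple-based lookup described in the note:

```python
import pandas as pd

# Toy rows with two missing loan amounts.
data = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "Married": ["No", "No", "Yes", "Yes"],
    "Self_Employed": ["No", "No", "No", "No"],
    "LoanAmount": [100.0, None, 150.0, None],
})

# Group means, indexed by a three-level MultiIndex.
impute_grps = data.pivot_table(
    values=["LoanAmount"],
    index=["Gender", "Married", "Self_Employed"],
    aggfunc="mean",
)

# Visit only the rows where LoanAmount is missing.
for i, row in data.loc[data["LoanAmount"].isnull(), :].iterrows():
    # The tuple selects one entry of the MultiIndex.
    ind = (row["Gender"], row["Married"], row["Self_Employed"])
    data.loc[i, "LoanAmount"] = impute_grps.loc[ind].values[0]

print(data)
```

On larger data, a groupby followed by transform("mean") is a possible alternative that avoids the explicit loop.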
#6 Crosstab
This function is used to get an initial feel (view) of the data. Here, we can
validate some basic hypotheses. For instance, in this case, Credit_History is
expected to affect the loan status significantly. This can be tested using
cross-tabulation as shown below:
pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True)
These are absolute numbers. Percentages, however, can be more intuitive for
drawing quick insights. We can get them using the apply function:
def percConvert(ser):
    # divide each value by the row total (the last entry, added by margins=True)
    return ser / float(ser.iloc[-1])
pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True).apply(percConvert, axis=1)
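As a side note, pd.crosstab also accepts a normalize argument, which gives row shares directly. This is a sketch of that alternative on invented data, not the approach used in this article:

```python
import pandas as pd

# Toy data: four applicants with a credit history, two without.
data = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "N"],
})

# normalize="index" divides every row by its own total.
row_share = pd.crosstab(
    data["Credit_History"], data["Loan_Status"], normalize="index"
)
print(row_share)
```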
Now, it is evident that people with a credit history have a much higher chance of
getting a loan: 80% of people with a credit history got a loan, compared to only
9% of those without one.
But that's not it. It tells an interesting story. Since I know that having a
credit history is super important, what if I predict the loan status to be Y for
applicants with a credit history and N otherwise? Surprisingly, we'll be right
82+378=460 times out of 614, which is a whopping 75%!
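The arithmetic behind that figure, using only the counts quoted above, can be checked in a couple of lines:

```python
# Counts taken from the text: correct predictions in each group, and the total.
correct = 82 + 378
total = 614

accuracy = correct / total
print(f"{correct} of {total} correct: {accuracy:.1%}")
```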
I won't blame you if you're wondering why the hell we need statistical
models. But trust me, increasing the accuracy by even 0.001% beyond this
mark is a challenging task. Would you take this challenge?
Note: 75% is on the train set. The test set will be slightly different but close.
Also, I hope this gives some intuition into why even a 0.05% increase in accuracy
can result in a jump of 500 ranks on the Kaggle leaderboard.
Read More: Pandas Reference (crosstab)
#7 Merge DataFrames
Merging dataframes becomes essential when we have information coming from
different sources to be collated. Consider a hypothetical case where the
average property rates (INR per sq meter) are available for different property
types. Let's define a dataframe as:
prop_rates
Now we can merge this information with the original dataframe as:
data_merged = data.merge(right=prop_rates, how='inner',
                         left_on='Property_Area', right_index=True, sort=False)
data_merged.pivot_table(values='Credit_History',
                        index=['Property_Area','rates'], aggfunc=len)
The pivot table validates that the merge operation succeeded. Note that the values
argument is irrelevant here because we are simply counting the values.
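Here is a self-contained sketch of the same kind of merge. The rates and the four property rows are hypothetical placeholders, chosen only to make the example run:

```python
import pandas as pd

# Toy main table with a Property_Area column.
data = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
})

# Hypothetical rates, indexed by property type.
prop_rates = pd.DataFrame(
    {"rates": [1000, 5000, 12000]},
    index=["Rural", "Semiurban", "Urban"],
)

# Match the Property_Area column on the left against the index on the right.
data_merged = data.merge(
    prop_rates,
    how="inner",
    left_on="Property_Area",
    right_index=True,
    sort=False,
)

# Count the rows per (area, rate) pair to check the merge.
counts = data_merged.groupby(["Property_Area", "rates"]).size()
print(counts)
```

Each property type should pair with exactly one rate, and the counts should add up to the number of rows in the original frame.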
Read More: Pandas Reference (merge)
#8 Sorting DataFrames
Pandas allows easy sorting based on multiple columns. This can be done as:
data_sorted = data.sort_values(['ApplicantIncome','CoapplicantIncome'],
ascending=False)
data_sorted[['ApplicantIncome','CoapplicantIncome']].head(10)
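As a small variation, ascending also accepts one flag per column. The sketch below uses invented incomes and sorts the first column in descending order and the second in ascending order:

```python
import pandas as pd

# Toy incomes; two applicants tie on ApplicantIncome.
data = pd.DataFrame({
    "ApplicantIncome": [5000, 8000, 5000, 3000],
    "CoapplicantIncome": [0, 1500, 2000, 0],
})

# The second column only decides the order when the first one ties.
data_sorted = data.sort_values(
    ["ApplicantIncome", "CoapplicantIncome"],
    ascending=[False, True],
)
print(data_sorted)
```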
%matplotlib inline
data.boxplot(column="ApplicantIncome",by="Loan_Status")
data.hist(column="ApplicantIncome",by="Loan_Status",bins=30)
This shows that income is not a big deciding factor on its own, as there is no
appreciable difference between the incomes of people who received the loan and
those who were denied it.
Read More: Pandas Reference (hist) | Pandas Reference (boxplot)
(minutes). The exact minute of an hour might not be as relevant for predicting
traffic as the actual period of the day, such as Morning, Afternoon,
Evening, Night, Late Night. Modeling traffic this way will be more intuitive
and will avoid overfitting.
Here we define a simple function which can be re-used for binning any variable
fairly easily.
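#Binning: (function name and signature reconstructed from the body below)
def binning(col, cut_points, labels=None):
    #Define min and max values:
    minval = col.min()
    maxval = col.max()
    #Create break points by adding min and max to cut_points
    break_points = [minval] + cut_points + [maxval]
    #If no labels are provided, use default labels 0 ... n
    if not labels:
        labels = list(range(len(cut_points)+1))
    #right=False would leave the maximum value outside the last bin, so it is dropped
    colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
    return colBin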
#Binning age:
cut_points = [90,140,190]
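To make the binning idea concrete, here is a runnable sketch on a made-up series. The function name binning, the label names and the sample values are assumptions for illustration; the cut points are the ones shown above:

```python
import pandas as pd

def binning(col, cut_points, labels=None):
    # Add the observed min and max as the outer edges of the bins.
    break_points = [col.min()] + cut_points + [col.max()]
    # Fall back to integer labels 0 ... n when none are given.
    if labels is None:
        labels = list(range(len(cut_points) + 1))
    # include_lowest=True keeps the minimum value inside the first bin.
    return pd.cut(col, bins=break_points, labels=labels, include_lowest=True)

# Invented values standing in for a numeric column.
amounts = pd.Series([50, 90, 120, 140, 160, 200, 700])

cut_points = [90, 140, 190]
labels = ["low", "medium", "high", "very high"]
binned = binning(amounts, cut_points, labels)

print(binned.value_counts(sort=False))
```

Note that pd.cut expects strictly increasing bin edges, so this sketch assumes the cut points lie strictly between the minimum and maximum of the column.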
Here I've defined a generic function which takes a dictionary as input and
codes the values using the replace function in Pandas.
return colCoded
print(data["Loan_Status"].value_counts())
print(data["Loan_Status_Coded"].value_counts())
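A minimal sketch of such a coding function, assuming a helper named coding and a Y/N to 1/0 mapping (both are illustrative choices, as are the sample values):

```python
import pandas as pd

def coding(col, code_dict):
    # replace swaps every key found in the column for its value and
    # leaves any value that is not a key untouched.
    return col.replace(code_dict)

# Invented sample of a Y/N column.
data = pd.DataFrame({"Loan_Status": ["Y", "N", "Y", "Y"]})

data["Loan_Status_Coded"] = coding(data["Loan_Status"], {"N": 0, "Y": 1})

print(data["Loan_Status"].value_counts())
print(data["Loan_Status_Coded"].value_counts())
```

The two counts should match category by category, which is a quick check that the coding did what was intended.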
So it's generally a good idea to manually define the column types. If we check
the data types of all columns:
data.dtypes
colTypes = pd.read_csv('datatypes.csv')
print(colTypes)
After loading this file, we can iterate through each row and assign the datatype
using column type to the variable name defined in the feature column.
for i, row in colTypes.iterrows():  #i: dataframe index; row: each row in series format
    if row['type']=="categorical":
        data[row['feature']]=data[row['feature']].astype(object)
    elif row['type']=="continuous":
        data[row['feature']]=data[row['feature']].astype(float)
print(data.dtypes)
Now the credit history column is modified to object type, which is used for
representing nominal variables in Pandas.
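For readers without the CSV at hand, this sketch builds a stand-in for it in memory. The two-column layout (feature, type) and the sample rows are assumptions about what datatypes.csv contains:

```python
import pandas as pd

# Toy data; both columns start out as integers.
data = pd.DataFrame({
    "Credit_History": [1, 0, 1],
    "LoanAmount": [120, 90, 200],
})

# Stand-in for the contents of datatypes.csv (assumed layout).
colTypes = pd.DataFrame({
    "feature": ["Credit_History", "LoanAmount"],
    "type": ["categorical", "continuous"],
})

# Walk the rows and cast each named column to the requested type.
for _, row in colTypes.iterrows():
    if row["type"] == "categorical":
        data[row["feature"]] = data[row["feature"]].astype(object)
    elif row["type"] == "continuous":
        data[row["feature"]] = data[row["feature"]].astype(float)

print(data.dtypes)
```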
End Notes
In this article, we covered various functions of Pandas which can make our life
easy while performing data exploration and feature engineering. Also, we
defined some generic functions which can be reused for achieving similar
objectives on different datasets.
Also See: If you have any doubts pertaining to Pandas or Python in general,
feel free to discuss with us.
Did you find the article useful? Do you use some better (easier/faster)
techniques for performing the tasks discussed above? Do you think there are
better alternatives to Pandas in Python? We'll be glad if you share your
thoughts as comments below.