Pandas
Introduction
● Pandas is a Python library and a powerful tool for data science.
● Widely used package for data manipulation and analysis.
● Used in both the commercial and academic sectors,
in fields such as economics, finance, and analytics.
● Developed by Wes McKinney in 2008.
Compare Pandas and NumPy
Pandas                                  NumPy
Works on tabular data                   Works on numerical data
Powerful tools: Series and DataFrame    Powerful tool: Array
The pandas Data Structures
• Series
• DataFrame
Series
● A Series is a one-dimensional labeled object, similar to an array.
● A Series is essentially a column in a table.
● A DataFrame is a two-dimensional table made up of a
collection of Series.
Creating a Series
A Series is a one-dimensional labeled array capable of
holding data of any type (integer, string,
float, Python object).
How to import pandas in Python
import pandas as pd
Creating Series
Random Numbers
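import numpy as np
import pandas as pd
# Series of four random floats with the default integer index 0..3
a1 = pd.Series(np.random.rand(4))
print(a1)
Output (the values are random and differ on each run):
0    0.6122
1    0.9809
2    0.3350
3    0.7221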
Series
# Same kind of Series, but with custom index labels instead of 0..3
a2 = pd.Series(np.random.rand(4), index=['a', 'b', 'c', 'd'])
print(a2)
Output:
a    0.6122
b    0.9809
c    0.3350
d    0.7221
Access Series values by index
• >>> a2['c']
0.3350
• >>> a2['c'] = 3.14
• >>> a2[['c', 'a', 'b']]
c    3.14
a    0.6122
b    0.9809
Creating a Series from array
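A minimal sketch of building a Series from a NumPy array. The values and index labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: a small NumPy array of integers
arr = np.array([10, 20, 30, 40])

# Without an index, pandas assigns the default labels 0..3
s_default = pd.Series(arr)

# With an explicit index, each value gets its own label
s_labeled = pd.Series(arr, index=['a', 'b', 'c', 'd'])

print(s_default)
print(s_labeled)
```

The index list must have the same length as the array.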
Create a Series from Dictionary
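A minimal sketch of building a Series from a dictionary. The subject names and marks are made-up sample data.

```python
import pandas as pd

# Illustrative data: the dictionary keys become the index labels
marks = {'maths': 90, 'physics': 85, 'chemistry': 78}
s_marks = pd.Series(marks)
print(s_marks)

# Passing an index selects and orders the keys; a label that is not
# in the dictionary ('biology') gets NaN
s_subset = pd.Series(marks, index=['physics', 'maths', 'biology'])
print(s_subset)
```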
DataFrame
Tabular data structure comprising a set of
ordered columns and rows.
Pandas DataFrame
A DataFrame is a two-dimensional data structure.
Data is aligned in rows and columns.
Retrieve data from a DataFrame (index)
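A short sketch of common ways to retrieve data by index. The DataFrame, its row labels ('r1' to 'r3') and its column names are assumptions made up for this example.

```python
import pandas as pd

# Illustrative DataFrame with row labels 'r1'..'r3'
df = pd.DataFrame(
    {'name': ['Asha', 'Ben', 'Chen'], 'age': [23, 31, 27]},
    index=['r1', 'r2', 'r3'],
)

ages = df['age']                  # one column, returned as a Series
row_by_label = df.loc['r2']       # one row, selected by index label
row_by_pos = df.iloc[0]           # one row, selected by integer position
one_value = df.loc['r3', 'name']  # a single cell: row label, column name

print(ages)
print(row_by_label)
print(one_value)
```

loc works with labels and iloc works with integer positions.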
Creating a DataFrame using a
dictionary
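A minimal sketch, assuming a dictionary whose keys are column names and whose values are equal-length lists. The names and years are sample data.

```python
import pandas as pd

# Each key becomes a column; each list holds that column's values
data = {
    'name': ['Asha', 'Ben', 'Chen'],
    'born': [1998, 1990, 1994],
}
df = pd.DataFrame(data)

print(df)
print(df.shape)  # (number of rows, number of columns)
```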
Pandas- Cleaning Data
Data cleaning means fixing bad data in the
dataset. Bad data can be:
1. Empty Cells
2. Data in Wrong Format
3. Wrong data
4. Duplicates
1. Empty Cells
Empty cells can give wrong results when you
analyze the data.
Solution:
Remove the rows that contain empty cells.
Dropping Rows/Columns with Null Values
dropna()
Lets you drop the rows or columns that
contain null values.
Syntax
DataFrameName.dropna(axis=0, how='any',
thresh=None, subset=None, inplace=False)
Example
Drop Row
Drop the rows where at least one element is missing:
DataFrameName.dropna()
Drop the columns where at least one element is missing
DataFrameName.dropna(axis='columns')
Drop the rows where all the elements are
missing
DataFrameName.dropna(how='all')
Keep only the rows with at least two non-NA values
DataFrameName.dropna(thresh=2)
Define the columns in which to look for missing
values
DataFrameName.dropna(subset=['name', 'born'])
Keep the DataFrame with valid entries in
the same variable (modify it in place)
DataFrameName.dropna(inplace=True)
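The dropna() options can be tried together on one small DataFrame. This is a sketch on made-up data; the 'toy' column and all values are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with missing values (NaN) in every column
df = pd.DataFrame({
    'name': ['Asha', 'Ben', np.nan, np.nan],
    'toy':  [np.nan, 'ball', 'kite', np.nan],
    'born': [1998, 1990, 1994, np.nan],
})

any_missing = df.dropna()                       # keeps only row 1 (no NaN at all)
all_missing = df.dropna(how='all')              # drops only row 3 (every cell is NaN)
at_least_two = df.dropna(thresh=2)              # keeps rows with >= 2 non-NaN values
by_subset = df.dropna(subset=['name', 'born'])  # checks only these two columns

# In the first two rows only 'toy' has a NaN, so only that column is dropped
cols_dropped = df.head(2).dropna(axis='columns')

# inplace=True changes the object itself and returns None
cleaned = df.copy()
cleaned.dropna(inplace=True)

print(any_missing)
print(by_subset)
```

Without inplace=True, dropna() returns a new DataFrame and leaves the original unchanged.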
Replace Empty Values
fillna()
Lets you replace empty cells with a value.
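A minimal sketch of fillna() on a whole DataFrame. The column names and the replacement value 0 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with one empty cell in each column
df = pd.DataFrame({
    'calories': [420.0, np.nan, 380.0],
    'pulse': [110.0, 117.0, np.nan],
})

# fillna returns a new DataFrame in which every empty cell becomes 0
filled = df.fillna(0)

print(filled)
```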
Replace Only for a Specified
Column
To replace empty values in a single column, call fillna() on that column.
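A sketch of filling only one column, assuming the same kind of made-up data as before. The replacement value 130 is arbitrary.

```python
import numpy as np
import pandas as pd

# Illustrative data with one empty cell in each column
df = pd.DataFrame({
    'calories': [420.0, np.nan, 380.0],
    'pulse': [110.0, 117.0, np.nan],
})

# Fill only 'calories' and assign the result back; 'pulse' is untouched
df['calories'] = df['calories'].fillna(130)

print(df)
```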
Replace Using Mean, Median or
Mode
Calculate the mean, and replace any empty values with it
Mean = the average value (the sum of all values divided by the number of values).
Calculate the median, and replace any empty values with it
Median = the value in the middle, after you have sorted all values in ascending order.
Calculate the mode, and replace any empty values with it
Mode = the value that appears most frequently.
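The three options can be compared on one small Series. This is a sketch with made-up numbers; pandas skips the empty value when it computes each statistic.

```python
import numpy as np
import pandas as pd

# Illustrative data: four known values and one empty cell
s = pd.Series([10.0, 20.0, 20.0, np.nan, 50.0])

mean_value = s.mean()      # (10 + 20 + 20 + 50) / 4 = 25.0
median_value = s.median()  # middle of 10, 20, 20, 50 = 20.0
mode_value = s.mode()[0]   # mode() returns a Series; take its first entry

filled_mean = s.fillna(mean_value)
filled_median = s.fillna(median_value)
filled_mode = s.fillna(mode_value)

print(filled_mean)
```

mode() returns a Series because several values can tie for most frequent, so the example takes the first one.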
2. Data with wrong format
Cells with data in the wrong format can make it difficult or
impossible to analyze the data.
to_datetime() converts values to dates.
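A minimal sketch of converting a text column to dates. The column name and values are made up, and errors='coerce' is one possible choice: it turns values that cannot be parsed into NaT (the empty value for dates) instead of raising an error.

```python
import pandas as pd

# Illustrative data: two valid date strings and one bad entry
df = pd.DataFrame({'date': ['2020-12-01', '2020-12-02', 'not a date']})

# Unparseable entries become NaT because of errors='coerce'
converted = pd.to_datetime(df['date'], errors='coerce')

# The bad row can then be removed like any other empty cell
df['date'] = converted
df_clean = df.dropna(subset=['date'])

print(df_clean)
```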
3. Fixing Wrong Data
One way to fix wrong values is to replace them with something else.
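Two ways to replace wrong values, sketched on made-up data. The 'duration' column, the suspicious value 450 and the limit of 120 are all assumptions for illustration.

```python
import pandas as pd

# Illustrative data: 450 looks like a typing mistake
df = pd.DataFrame({'duration': [60, 45, 450, 30]})

# Option 1: overwrite one known wrong cell (row label 2)
fixed_cell = df.copy()
fixed_cell.loc[2, 'duration'] = 45

# Option 2: apply a rule to every row; values above the assumed
# limit of 120 are set to 120
capped = df.copy()
capped.loc[capped['duration'] > 120, 'duration'] = 120

print(fixed_cell)
print(capped)
```

Option 1 suits small datasets. A rule like Option 2 scales better, but the limit has to come from knowledge of the data.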
4. Removing Duplicates
duplicated() returns a Boolean value for each
row: True for every row that is a duplicate, otherwise False.
drop_duplicates() removes the duplicate rows.
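A short sketch of both methods on made-up data in which the third row repeats the first.

```python
import pandas as pd

# Illustrative data: row 2 is an exact copy of row 0
df = pd.DataFrame({
    'name': ['Asha', 'Ben', 'Asha'],
    'born': [1998, 1990, 1998],
})

# By default the first occurrence is not flagged, only the later repeats
dup_mask = df.duplicated()

# Returns a new DataFrame without the repeated rows
deduped = df.drop_duplicates()

print(dup_mask)
print(deduped)
```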