Python Pandas ch-2
Its output is as follows −
a b
First 1 2
Second 5 7
print(df2)
a b1
First 1 NaN
Second 5 NaN
Data Series
• A Series is a one-dimensional labeled array capable of
holding data of any type (integer, string, float, Python
objects, etc.). The axis labels are collectively called the index.
• pandas.Series
• A pandas Series can be created using the following
constructor −
pandas.Series( data, index, dtype, copy)
A series can be created using various inputs like −
• Array
• Dict
• Scalar value or constant
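The array and dict inputs are covered by the examples in this chapter; the scalar case can be sketched as follows. This is a minimal illustration, and the value 5 and the four index labels are arbitrary assumptions chosen for the example.

```python
import pandas as pd

# A scalar is repeated once for every label in the index.
# The value 5 and the labels 0..3 are illustrative only.
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
```

Without an index, a scalar produces a Series of length one, so an index is normally passed in this case.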
The parameters of the constructor are as follows −
Sr.No  Parameter  Description
1      data       data takes various forms like ndarray, list, constants
2      index      Index values must be hashable and the same length as data
                  (they need not be unique). Defaults to 0, 1, ..., n-1 if
                  no index is passed.
3      dtype      dtype is for data type. If None, the data type will be inferred
4      copy       Copy data. Default False
If data is an ndarray, then the index passed must be of the same length. If no index is passed, then
by default the index will be range(n), where n is the array length, i.e., [0, 1, 2, 3, …, len(array)-1].
Example 1
We did not pass any index, so by default, it assigned the indexes ranging
from 0 to len(data)-1, i.e., 0 to 3.
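The code for this example does not appear in the text. A minimal sketch of what it plausibly looked like, assuming the same four-element array that Example 2 uses:

```python
import numpy as np
import pandas as pd

# Assumed data: the same array as in Example 2
data = np.array(['a', 'b', 'c', 'd'])

# No index argument, so pandas assigns the labels 0..3
s = pd.Series(data)
print(s)
```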
Example 2
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
Its output is as follows −
100 a
101 b
102 c
103 d
dtype: object
We passed the index values here. Now we can see the customized index
values in the output.
Create a Series from dict
A dict can be passed as input. If no index is specified, the dictionary
keys are used to construct the index (older pandas versions took them in
sorted order; current versions keep the dict's insertion order).
If index is passed, the values in data corresponding to the labels
in the index will be pulled out.
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Its output is as follows −
a 0.0
b 1.0
c 2.0
dtype: float64
Observe − Dictionary keys are used to construct index.
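The second case described above, where an index is passed along with the dict, can be sketched as follows. The index list, including the extra label 'd', is an assumption chosen to show what happens when a label has no matching key.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}

# Values are pulled out in the order of the index labels;
# 'd' has no matching key, so its value is NaN.
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
```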
Create a DataFrame from Dict of Series
Dictionary of Series can be passed to form a
DataFrame. The resultant index is the union of
all the series indexes passed.
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Selecting/Accessing a Subset from a Dataframe using Row/Column
Names
df = pd.DataFrame(data,index=['Delhi','Mumbai','Kolkata','Chennai'])
print(df)
Population Average_income
Delhi 10927986 72167810876544
Mumbai 12691836 85007812691836
Kolkata 4631392 422678431392
Chennai 4328063 5261782328063
To access multiple rows, make sure not to
miss the COLON after the COMMA:
<DF object>.loc[<start row> : <end row>, :]
• To access selective columns, use:
<DF object>.loc[ : , <start column> : <end column>]
Make sure not to miss the COLON BEFORE the
COMMA. Like rows, all columns falling between the
start and end columns will also be listed.
• To access a range of columns from a range of
rows, use:
<DF object>.loc[<start row> : <end row>, <start column> : <end column>]
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
Its output is as follows −
one 2.0
two 2.0
Name: b, dtype: float64
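The three loc slicing forms can be tried on the same small DataFrame. This is a minimal sketch; the variable names rows, cols and block are chosen here for illustration. Note that loc slices by label and includes both the start and the end label.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Range of rows, all columns (colon after the comma)
rows = df.loc['a':'c', :]

# All rows, range of columns (colon before the comma)
cols = df.loc[:, 'one':'two']

# Range of rows and range of columns together
block = df.loc['b':'c', 'one':'two']

print(rows)
print(cols)
print(block)
```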
Obtaining a Subset from a DataFrame using Row/Column
Numeric Index Positions
df.iloc[<start row index> : <end row index>, <start column index> : <end column index>]
• Selection by integer location
• Rows can be selected by passing an integer location to
the iloc indexer.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])
Its output is as follows −
one 3.0
two 3.0
Name: c, dtype: float64
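The slice form of iloc can be sketched on the same DataFrame. The variable names first_two and block are illustrative. Unlike loc, iloc follows ordinary Python slicing, so the end position is excluded.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Row positions 0 and 1 (position 2 is excluded), all columns
first_two = df.iloc[0:2]

# Row positions 1 and 2, column position 0 only
block = df.iloc[1:3, 0:1]

print(first_two)
print(block)
```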
Selecting/Accessing individual Values
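import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
# items() yields (column name, column Series) pairs;
# it replaces iteritems(), which was removed in pandas 2.0
for key,value in df.items():
    print(key,value)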
iterrows()
iterrows() returns an iterator yielding each index value along with a
Series containing the data in each row.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
    print(row_index,row)
Descriptive Statistics with Pandas
import pandas as pd
dfSales = {2016: {'Qtr1':34500, 'Qtr2':56000, 'Qtr3':47000, 'Qtr4':49000},
           2017: {'Qtr1':44900, 'Qtr2':46100, 'Qtr3':57000, 'Qtr4':59000},
           2018: {'Qtr1':54500, 'Qtr2':51000, 'Qtr3':57000, 'Qtr4':58900},
           2019: {'Qtr1':34500}}
sal_df = pd.DataFrame(dfSales)
Functions min() and max()
The min() and max() functions find the minimum or maximum value, respectively, from a given set of data.
Syntax
<dataFrame>.min(axis=None,skipna=None,numeric_only=None)
<dataFrame>.max(axis=None,skipna=None,numeric_only=None)
axis (0 or 1): by default, min and max are calculated along axis 0.
skipna (True or False): exclude NA/null values when computing the result.
numeric_only (True or False): include only float, int, boolean columns.
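As a sketch, min() and max() can be applied to the sales data defined earlier. The variable names col_min, row_max and strict_min are illustrative. The 2019 column has only a Qtr1 entry, so its other quarters are NaN and are skipped by default.

```python
import pandas as pd

dfSales = {2016: {'Qtr1': 34500, 'Qtr2': 56000, 'Qtr3': 47000, 'Qtr4': 49000},
           2017: {'Qtr1': 44900, 'Qtr2': 46100, 'Qtr3': 57000, 'Qtr4': 59000},
           2018: {'Qtr1': 54500, 'Qtr2': 51000, 'Qtr3': 57000, 'Qtr4': 58900},
           2019: {'Qtr1': 34500}}
sal_df = pd.DataFrame(dfSales)

# Default axis 0: one minimum per column (per year)
col_min = sal_df.min()

# axis=1: one maximum per row (per quarter)
row_max = sal_df.max(axis=1)

# skipna=False: a column containing NaN yields NaN
strict_min = sal_df.min(skipna=False)

print(col_min)
print(row_max)
print(strict_min)
```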
Functions count() and sum()
The function count() counts the non-NA entries for each row or column. The values None, NaN, NaT, etc.
are considered NA in pandas. The function sum() returns the sum of the values for the requested axis.
Syntax
<dataFrame>.count(axis=0,numeric_only=False)
<dataFrame>.sum(axis=None,skipna=None,numeric_only=None)
axis (0 or 1): by default, count and sum are calculated along axis 0.
skipna (True or False): exclude NA/null values when computing the result (sum() only; count() has no skipna parameter).
numeric_only (True or False): include only float, int, boolean columns.
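A minimal sketch of count() and sum() on the same sales data; the result variable names are chosen here for illustration.

```python
import pandas as pd

dfSales = {2016: {'Qtr1': 34500, 'Qtr2': 56000, 'Qtr3': 47000, 'Qtr4': 49000},
           2017: {'Qtr1': 44900, 'Qtr2': 46100, 'Qtr3': 57000, 'Qtr4': 59000},
           2018: {'Qtr1': 54500, 'Qtr2': 51000, 'Qtr3': 57000, 'Qtr4': 58900},
           2019: {'Qtr1': 34500}}
sal_df = pd.DataFrame(dfSales)

# Non-NA entries per column: 2019 has a single value
col_count = sal_df.count()

# Non-NA entries per row: only Qtr1 has all four years
row_count = sal_df.count(axis=1)

# Totals per column and per row; NaN entries are skipped
col_sum = sal_df.sum()
row_sum = sal_df.sum(axis=1)

print(col_count)
print(row_count)
print(col_sum)
print(row_sum)
```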
What are Quantiles?
• Quantiles are points in a distribution that relate to the rank order
of values in that distribution. The quantile of a value is the
fraction of observations less than or equal to that value. The
quantile of the median is 0.5 by definition.
• The quantile() function
<dataframe>.quantile(q=0.5,axis=0,numeric_only=True)
q: float or array-like, default 0.5 (50% quantile), 0 <= q <= 1; the quantile(s) to
compute.
axis: 0 or 'index', 1 or 'columns' (default 0). 0 or 'index' is for row-wise, 1 or 'columns'
for column-wise.
numeric_only: if False, the quantiles of datetime and timedelta data will be
computed as well.
• The median splits the data set in half, i.e. 50% on
the left / below the median and 50% on the right
/ above the median. That means if we divide a
distribution into 2 exact halves, we call the dividing
marker the median.
• The dividing markers are called quantiles, which
means the median is the quantile that divides a
distribution into 2 equal parts (50% each).
• If we divide a distribution into 4 exact quarters (25%
each), we call each dividing marker a QUARTILE.
• That is, the quartiles divide a distribution into 4
equal parts (25% each).
sal_df.quantile(q=[0.25,0.5,0.75,1.0])
sal_df.quantile(q=[0.25,0.5,0.75,1.0], axis=1)
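The two calls above can be run end to end as in the sketch below. It assumes the default linear interpolation between neighbouring values; the variable names are illustrative.

```python
import pandas as pd

dfSales = {2016: {'Qtr1': 34500, 'Qtr2': 56000, 'Qtr3': 47000, 'Qtr4': 49000},
           2017: {'Qtr1': 44900, 'Qtr2': 46100, 'Qtr3': 57000, 'Qtr4': 59000},
           2018: {'Qtr1': 54500, 'Qtr2': 51000, 'Qtr3': 57000, 'Qtr4': 58900},
           2019: {'Qtr1': 34500}}
sal_df = pd.DataFrame(dfSales)

# One row per requested quantile, one column per year
by_year = sal_df.quantile(q=[0.25, 0.5, 0.75, 1.0])

# axis=1: one row per requested quantile, one column per quarter
by_quarter = sal_df.quantile(q=[0.25, 0.5, 0.75, 1.0], axis=1)

# The 0.5 quantile of 2016 is the median of its four values:
# sorted they are 34500, 47000, 49000, 56000, so the median is 48000
median_2016 = by_year.loc[0.5, 2016]

print(by_year)
print(by_quarter)
print(median_2016)
```

The q=1.0 row always equals the maximum of each column, which is a quick sanity check on the result.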
The var() Function
• The var() function computes the variance and returns the unbiased
variance over the requested axis.
• The syntax of the var() function is:
<dataFrame>.var(axis=None,skipna=None,numeric_only=None)
axis: index (0) or columns (1); default 0.
skipna (True or False): exclude NA/null values when
computing the result.
numeric_only (True or False): include only float, int,
boolean columns.
sal_df.std()
sal_df.var(axis=1)
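"Unbiased" here means the sum of squared deviations is divided by n - 1 rather than n. The sketch below checks this against a manual calculation for the 2016 column; the names values and manual are illustrative.

```python
import numpy as np
import pandas as pd

dfSales = {2016: {'Qtr1': 34500, 'Qtr2': 56000, 'Qtr3': 47000, 'Qtr4': 49000},
           2017: {'Qtr1': 44900, 'Qtr2': 46100, 'Qtr3': 57000, 'Qtr4': 59000},
           2018: {'Qtr1': 54500, 'Qtr2': 51000, 'Qtr3': 57000, 'Qtr4': 58900},
           2019: {'Qtr1': 34500}}
sal_df = pd.DataFrame(dfSales)

# Variance of each column (year)
col_var = sal_df.var()

# Manual unbiased variance for 2016: divide by n - 1
values = sal_df[2016]
manual = ((values - values.mean()) ** 2).sum() / (len(values) - 1)

print(col_var)
print(manual)
```

The 2019 column holds a single value, so n - 1 is zero and its variance comes out as NaN.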
The std() Function
• The std() function computes the standard deviation
over the requested axis.
• Syntax
<dataFrame>.std(axis=None,skipna=None,numeric_only=None)
axis: index (0) or columns (1); default 0.
skipna (True or False): exclude NA/null values when
computing the result.
numeric_only (True or False): include only float, int,
boolean columns.
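The standard deviation is the square root of the variance, which can be checked directly. This is a minimal sketch on the same sales data, with illustrative variable names.

```python
import numpy as np
import pandas as pd

dfSales = {2016: {'Qtr1': 34500, 'Qtr2': 56000, 'Qtr3': 47000, 'Qtr4': 49000},
           2017: {'Qtr1': 44900, 'Qtr2': 46100, 'Qtr3': 57000, 'Qtr4': 59000},
           2018: {'Qtr1': 54500, 'Qtr2': 51000, 'Qtr3': 57000, 'Qtr4': 58900},
           2019: {'Qtr1': 34500}}
sal_df = pd.DataFrame(dfSales)

# Standard deviation per column (year) and per row (quarter)
col_std = sal_df.std()
row_std = sal_df.std(axis=1)

# Square root of the variance, for comparison with col_std
sqrt_var = np.sqrt(sal_df.var())

print(col_std)
print(row_std)
```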
The describe() Function
• The describe() function computes a summary of statistics
pertaining to the DataFrame columns.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve',
     'Smith','Jack', 'Lee','David','Gasper','Betina','Andres']),
     'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
     'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,
     4.10,3.65]) }
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())
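By default describe() summarises only the numeric columns, so Name is left out of the result. The sketch below uses the first four rows of the data above and also shows the include='all' option, which the text does not cover and which is added here only as an illustration.

```python
import pandas as pd

# First four rows of the example data, to keep the output short
d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin']),
     'Age': pd.Series([25, 26, 25, 23]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56])}
df = pd.DataFrame(d)

# Default: count, mean, std, min, quartiles and max for numeric columns
numeric_summary = df.describe()

# include='all' adds non-numeric columns (count, unique, top, freq)
full_summary = df.describe(include='all')

print(numeric_summary)
print(full_summary)
```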
Aggregation/Descriptive statistics - Dataframe
Data aggregation –
Aggregation is the process of turning the values of a dataset (or a
subset of it) into one single value. In other words, a data aggregation
is a multivalued function that requires multiple values and returns a
single value as a result. There are a number of aggregations possible,
like count, sum, min, max, median, quartile, etc. These (count, sum, etc.)
are descriptive statistics and other related operations on a
DataFrame. Let us make this clear! If we have a DataFrame like…