Unit IV:
Pandas
Pranav Gupta
INTRODUCTION
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating
data.
• Pandas allows us to analyze large data sets and draw conclusions based on
statistical theories.
• Pandas can clean messy data sets, and make them readable and
relevant.
• Pandas can also delete rows that are not relevant or that contain
wrong values, such as empty or NULL values. This is called cleaning the
data.
SERIES
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
• EXAMPLE:
import pandas as pd
a = [2,3,4]
var = pd.Series(a)
print(var)
LABELS
• If nothing else is specified, the values are labeled with their index
number.
• The first value has index 0, the second value has index 1, and so on.
• This label can be used to access a specified value.
• EXAMPLE: print(var[0])
• Using the index argument, the user can name their own labels.
EXAMPLE
#CODE:
import pandas as pd
a = [2,3,4]
var = pd.Series(a, index = ["x", "y", "z"])
print(var)
#Output
# x    2
# y    3
# z    4
# dtype: int64
KEY/VALUE OBJECTS AS SERIES
#CODE:
import pandas as pd
num = {"day1": 420, "day2": 380, "day3": 390}
var = pd.Series(num)
print(var)
DATAFRAMES
• A Pandas DataFrame is a two-dimensional data
structure, like a two-dimensional array or a table with
rows and columns.
EXAMPLE:
import pandas as pd
data = {"S_No": [420, 380, 390],
        "Name": ['Dean', 'Jane', 'Shaw']}
df = pd.DataFrame(data, index = ["Section E",
                                 "Section F", "Section G"])
print(df)
LOCATE ROW
• Data frames store data in the form of rows and columns.
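• To access one or more rows by label, the loc attribute is used in
pandas.
• EXAMPLE (default integer labels): print(df.loc[0]) OR print(df.loc[[0, 1]])
• With named labels, pass the label instead, e.g. print(df.loc["Section E"])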
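A minimal sketch of row lookup with loc, assuming a small made-up DataFrame that keeps the default integer labels (the column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical sample data with the default labels 0, 1, 2
df = pd.DataFrame({"S_No": [420, 380, 390],
                   "Name": ["Dean", "Jane", "Shaw"]})

first_row = df.loc[0]        # one label: the row comes back as a Series
two_rows = df.loc[[0, 1]]    # a list of labels: the result is a DataFrame

print(first_row)
print(two_rows)
```

A single label gives a Series, while a list of labels (double brackets) keeps the table form.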
READ CSV IN PANDAS
• A simple way to store big data sets is to use CSV files
(comma-separated values files).
• CSV files contain plain text and are a well-known format
that can be read by most tools, including Pandas.
• The to_string() method is used to print the entire DataFrame.
EXAMPLE:
import pandas as pd
df = pd.read_csv('[Link]')
print(df.to_string())
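The same steps can be tried without a file on disk. This is only a sketch: the CSV text below is made up, and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real script would pass a file path instead
csv_text = "S_No,Name\n420,Dean\n380,Jane\n390,Shaw\n"

df = pd.read_csv(io.StringIO(csv_text))   # first line becomes the column names
text = df.to_string()                     # the whole DataFrame as one string

print(text)
```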
DATAFRAME OPERATIONS
• Pandas contains numerous methods and functions that
can be used to manipulate data.
• Data analysts use these operations to find the correlation,
maximum, minimum, mean, and null values in a dataset.
• Some of the DataFrame operations are head, tail, info, shape,
columns, isnull, dropna, mean, sum, describe, value_counts,
corr, loc, iloc, and apply.
1. head()
• This method returns a specified number of rows from
the top of the DataFrame.
• It returns the first 5 rows if a number is not specified.
• The column names will also be returned, in addition to
the specified rows.
• SYNTAX: dataframe.head(n)
• Here, n is optional: the number of rows to return.
The default value is 5.
[Figures: head() output WITHOUT PARAMETER n and WITH PARAMETER n]
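Both cases can be sketched on a small made-up DataFrame (the column names and the ten rows are assumptions chosen so that the default of 5 is visible):

```python
import pandas as pd

# Hypothetical data: ten rows
df = pd.DataFrame({"id": range(10),
                   "score": [x * 1.5 for x in range(10)]})

top_default = df.head()    # no argument: the first 5 rows
top_three = df.head(3)     # n = 3: the first 3 rows

print(top_default)
print(top_three)
```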
2. tail()
• This method returns a specified number of rows from the
bottom of the DataFrame.
• It returns the last 5 rows if a number is not
specified.
• The column names will also be returned, in addition to the
specified rows.
• SYNTAX: dataframe.tail(n)
• Here, n is optional: the number of rows to return.
The default value is 5.
[Figures: tail() output WITHOUT PARAMETER n and WITH PARAMETER n]
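A matching sketch for tail(), again on made-up data with ten rows:

```python
import pandas as pd

# Hypothetical data: ten rows
df = pd.DataFrame({"id": range(10),
                   "score": [x * 1.5 for x in range(10)]})

bottom_default = df.tail()   # no argument: the last 5 rows
bottom_two = df.tail(2)      # n = 2: the last 2 rows

print(bottom_default)
print(bottom_two)
```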
3. info()
• This method prints information about the DataFrame.
• The information contains the number of columns, column
labels, column data types, memory usage, range index, and
the number of non-null values in each column.
• This method actually prints the info. There is no need to
use the print() function to print the info.
• SYNTAX:
dataframe.info(verbose, buf, max_cols, memory_usage, show_counts)
(null_counts is the older name for show_counts and was removed in pandas 2.0.)
[Figure: info() output WITHOUT ANY PARAMETERS]
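A small sketch of info() on made-up data. The second call uses the buf parameter to capture the text in a buffer instead of printing it, which is one possible use and not something the slides prescribe.

```python
import io
import pandas as pd

# Hypothetical data with one missing name
df = pd.DataFrame({"name": ["Dean", "Jane", None],
                   "age": [21, 22, 23]})

df.info()                  # prints directly; the return value is None

buffer = io.StringIO()
df.info(buf=buffer)        # write the same report into a text buffer
summary = buffer.getvalue()
```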
4. shape
• This property returns a tuple containing the shape of the
DataFrame.
• The shape is the number of rows and columns of the
DataFrame.
• SYNTAX: dataframe.shape
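For instance, on a made-up DataFrame with three rows and two columns (illustrative values only):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"S_No": [420, 380, 390],
                   "Name": ["Dean", "Jane", "Shaw"]})

rows, cols = df.shape    # shape is a property, so no parentheses
print(rows, cols)
```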
5. columns
• This property returns the label of each column in the
DataFrame.
• SYNTAX: dataframe.columns
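A short sketch on made-up data. Converting the result to a list is an optional extra step shown here for convenience.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"S_No": [420, 380, 390],
                   "Name": ["Dean", "Jane", "Shaw"]})

print(df.columns)                  # an Index object holding the column labels
column_names = list(df.columns)    # plain Python list of the labels
```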
6. isnull()
• This method returns a DataFrame object where all the
values are replaced with a Boolean value True for NULL
values, and otherwise False.
• This method takes no parameters.
• SYNTAX: dataframe.isnull()
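A minimal sketch, assuming a made-up DataFrame with two missing values. Summing the Boolean result is a common follow-up to count the NULL values per column.

```python
import pandas as pd

# Hypothetical data: None marks the missing values
df = pd.DataFrame({"name": ["Dean", None, "Shaw"],
                   "score": [90.0, 85.0, None]})

mask = df.isnull()          # same shape as df, True where a value is NULL
null_counts = mask.sum()    # True counts as 1, so this counts NULLs per column

print(mask)
print(null_counts)
```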
7. dropna()
• This method removes the rows that contain NULL values.
• It returns a new DataFrame object unless
the inplace parameter is set to True; in that case
the dropna() method does the removing in the original
DataFrame instead.
• SYNTAX: dataframe.dropna(axis, how, thresh, subset,
inplace)
Parameter | Value | Description
axis | 0, 1, 'index', 'columns' | Optional, default 0. 0 and 'index' remove ROWS that contain NULL values; 1 and 'columns' remove COLUMNS that contain NULL values.
how | 'all', 'any' | Optional, default 'any'. Specifies whether to remove the row or column when ALL values are NULL, or if ANY value is NULL.
thresh | Number | Optional. Specifies the number of NOT NULL values required to keep the row.
subset | List | Optional. Specifies where to look for NULL values.
inplace | True, False | Optional, default False. If True, the removing is done on the current DataFrame. If False, a copy is returned where the removing is done.
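The parameters can be compared side by side on a small made-up DataFrame (names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1 and 2 have one NULL,
# row 3 is entirely NULL
df = pd.DataFrame({"name": ["Dean", None, "Shaw", None],
                   "score": [90.0, 85.0, None, None]})

any_null = df.dropna()                   # drop rows with at least one NULL
all_null = df.dropna(how="all")          # drop rows only if every value is NULL
by_name = df.dropna(subset=["name"])     # look for NULLs in "name" only
keep_one = df.dropna(thresh=1)           # keep rows with at least 1 non-NULL value

print(len(any_null), len(all_null), len(by_name), len(keep_one))
```

Because inplace is left at False, df itself still has all four rows afterwards.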
8. mean()
• The mean() method returns a Series with the mean value
of each column.
• By specifying the column axis (axis='columns'),
the mean() method searches column-wise and returns the
mean value for each row.
• SYNTAX: dataframe.mean(axis, skipna, level,
numeric_only, kwargs)
Parameter | Value | Description
axis | 0, 1, 'index', 'columns' | Optional. Which axis to check, default 0.
skipna | True, False | Optional, default True. Set to False if the result should NOT skip NULL values.
level | Number, level name | Optional, default None. Specifies which level (in a hierarchical multi-index) to check along. Removed in pandas 2.0.
numeric_only | None, True, False | Optional. Specifies whether to only check numeric values. Default None (False since pandas 2.0).
kwargs | | Optional, keyword arguments. These arguments have no effect, but could be accepted by a NumPy function.
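A short sketch of both axes, assuming a made-up table of marks (the subject names and values are illustrative):

```python
import pandas as pd

# Hypothetical marks for three students
df = pd.DataFrame({"math": [80, 90, 70],
                   "physics": [60, 70, 80]})

col_means = df.mean()                  # default axis=0: one mean per column
row_means = df.mean(axis="columns")    # one mean per row

print(col_means)
print(row_means)
```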
9. sum()
• This method adds all values in each column and returns the
sum for each column.
• By specifying the column axis (axis='columns'),
the sum() method searches column-wise and returns the
sum of each row.
• SYNTAX: dataframe.sum(axis, skipna, level, numeric_only,
min_count, kwargs)
Parameter | Value | Description
axis | 0, 1, 'index', 'columns' | Optional. Which axis to check, default 0.
skipna | True, False | Optional, default True. Set to False if the result should NOT skip NULL values.
level | Number, level name | Optional, default None. Specifies which level (in a hierarchical multi-index) to check along. Removed in pandas 2.0.
numeric_only | None, True, False | Optional. Specifies whether to only check numeric values. Default None (False since pandas 2.0).
min_count | Number | Optional. Specifies the minimum number of values that need to be present to perform the action. Default 0.
kwargs | | Optional, keyword arguments. These arguments have no effect, but could be accepted by a NumPy function.
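The effect of axis, skipna, and min_count can be sketched on made-up marks with one missing value (all names and numbers are assumptions):

```python
import pandas as pd

# Hypothetical marks; the third math mark is missing
df = pd.DataFrame({"math": [80, 90, None],
                   "physics": [60, 70, 80]})

col_sums = df.sum()                    # NULL is skipped by default
row_sums = df.sum(axis="columns")      # one sum per row
strict = df.sum(skipna=False)          # a NULL makes that column's sum NULL
needs_three = df.sum(min_count=3)      # require 3 non-NULL values per column

print(col_sums)
print(row_sums)
print(strict)
print(needs_three)
```

With min_count=3, the math column has only two usable values, so its sum is reported as NULL.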
10. describe()
• This method returns a description of the data in the
DataFrame.
• SYNTAX: dataframe.describe(percentiles, include, exclude,
datetime_is_numeric)
Parameter | Value | Description
percentiles | Numbers between 0 and 1 | Optional, a list of percentiles to include in the result. Default is [.25, .50, .75].
include | None, 'all', datatypes | Optional, a list of the data types to allow in the result.
exclude | None, 'all', datatypes | Optional, a list of the data types to disallow in the result.
datetime_is_numeric | True, False | Optional, default False. Set to True to treat datetime data as numeric. Removed in pandas 2.0.
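A minimal sketch on a single made-up numeric column, first with the defaults and then with custom percentiles:

```python
import pandas as pd

# Hypothetical scores
df = pd.DataFrame({"score": [10, 20, 30, 40, 50]})

summary = df.describe()                          # count, mean, std, min, quartiles, max
custom = df.describe(percentiles=[0.1, 0.9])     # ask for the 10% and 90% percentiles

print(summary)
print(custom)
```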
11. corr()
• This method finds the correlation (relationship) between
each pair of columns in a DataFrame.
• SYNTAX: dataframe.corr(method, min_periods)
Parameter | Value | Description
method | 'kendall', 'pearson', 'spearman', func | Optional, default 'pearson'. Specifies which method to use, or a callable function.
min_periods | Number | Optional. Specifies the minimum number of observations required to return a good enough result.
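A sketch with made-up columns chosen so the result is easy to predict: score rises exactly with hours, and errors falls exactly as hours rise.

```python
import pandas as pd

# Hypothetical data with perfectly linear relationships
df = pd.DataFrame({"hours": [1, 2, 3, 4, 5],
                   "score": [2, 4, 6, 8, 10],
                   "errors": [10, 8, 6, 4, 2]})

pearson = df.corr()                       # default method is 'pearson'
spearman = df.corr(method="spearman")     # rank-based alternative

print(pearson)
print(spearman)
```

The result is a square table with one row and one column per input column. Values near 1 mean the columns rise together, and values near -1 mean one falls as the other rises.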
12. iloc
• The iloc property gets, or sets, the value(s) at the specified
integer positions.
• Specify both row and column with an index.
• To access more than one row, use double brackets and specify
the indexes, separated by commas.
• To slice the DataFrame, specify the from and to indexes, separated
by a colon. The to index is not included.
• SYNTAX: dataframe.iloc[[0, 2]] OR dataframe.iloc[0:2]
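A short sketch of the three access patterns, assuming a made-up DataFrame with named row labels to show that iloc works by position, not by label:

```python
import pandas as pd

# Hypothetical sample data with named row labels
df = pd.DataFrame({"S_No": [420, 380, 390],
                   "Name": ["Dean", "Jane", "Shaw"]},
                  index=["Section E", "Section F", "Section G"])

cell = df.iloc[0, 1]        # row position 0, column position 1
picked = df.iloc[[0, 2]]    # row positions 0 and 2
sliced = df.iloc[0:2]       # row positions 0 and 1 (position 2 is excluded)

print(cell)
print(picked)
print(sliced)
```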
13. apply()
• The apply() method allows you
to apply a function along one of
the axes of the DataFrame. The
default is 0, the index (row) axis,
which passes each column to the function.
• SYNTAX: dataframe.apply(func,
axis, raw, result_type,
args, kwds)
Parameter | Value | Description
func | | Required. A function to apply to the DataFrame.
axis | 0, 1, 'index', 'columns' | Optional. Which axis to apply the function to, default 0.
raw | True, False | Optional, default False. Set to True if the row/column should be passed as an ndarray object.
result_type | 'expand', 'reduce', 'broadcast', None | Optional, default None. Specifies how the result will be returned.
args | A tuple | Optional, arguments to send into the function.
kwds | Keyword arguments | Optional, keyword arguments to send into the function.
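A minimal sketch of apply() along both axes. The helper value_range is a hypothetical function written for this example, and the marks are made up.

```python
import pandas as pd

# Hypothetical marks for three students
df = pd.DataFrame({"math": [80, 90, 70],
                   "physics": [60, 70, 80]})

def value_range(values):
    """Hypothetical helper: largest value minus smallest value."""
    return values.max() - values.min()

per_column = df.apply(value_range)            # axis=0: each column is passed in
per_row = df.apply(value_range, axis=1)       # axis=1: each row is passed in

print(per_column)
print(per_row)
```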
14. value_counts()
• This method returns a Series containing the frequency of
each distinct row in the DataFrame.
• SYNTAX: dataframe.value_counts(subset=None,
normalize=False, sort=True, ascending=False,
dropna=True)
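A short sketch on made-up data with repeated rows (the city and grade values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: the row ("Pune", "A") appears twice
df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune", "Pune"],
                   "grade": ["A", "B", "A", "B"]})

counts = df.value_counts()                       # frequency of each distinct row
shares = df.value_counts(normalize=True)         # proportions instead of counts
city_counts = df.value_counts(subset=["city"])   # count by the "city" column only

print(counts)
print(shares)
print(city_counts)
```

With sort=True (the default), the most frequent combination comes first.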