EDA All Functions

The document discusses various exploratory data analysis functions in pandas such as df.head(), df.tail(), df.info(), df.describe(), df.shape, df.columns, and df.dtypes. It provides examples of using each function on a sample DataFrame and describes what each function shows.


3/25/24, 10:42 PM EDA All Functions

Notebook by Tariq Ahmed (WP: +923070996076)

1. df.head(n): Returns the first n rows of the DataFrame.
2. df.tail(n): Returns the last n rows of the DataFrame.
3. df.info(): Provides information about the DataFrame, including column names, data types,
and non-null value counts.
4. df.describe(): Computes various descriptive statistics for numerical columns in the
DataFrame, such as count, mean, standard deviation, and percentiles.
5. df.shape: Returns the dimensions (number of rows and columns) of the DataFrame.
6. df.columns: Returns the column names of the DataFrame.
7. df.dtypes: Returns the data types of each column in the DataFrame.
8. df.isnull(): Checks for missing values and returns a DataFrame of the same shape with
True/False values indicating the presence of missing values.
9. df.dropna(): Removes rows with missing values from the DataFrame.
10. df.fillna(value): Fills missing values in the DataFrame with a specified value.
11. df.groupby(by): Groups the DataFrame by one or more columns and returns a GroupBy
object for further aggregation and analysis.
12. df.sort_values(by): Sorts the DataFrame by one or more columns.
13. df.merge(df2): Merges two DataFrames based on common columns or indices.
14. df.pivot_table(values, index, columns): Creates a pivot table from the DataFrame,
aggregating values based on specified columns.
15. df.apply(func): Applies a function along an axis of the DataFrame, that is, to each column (the default) or to each row.
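
Items 13 to 15 are not demonstrated later in this notebook, so here is a minimal sketch of them. The two small DataFrames (orders, users) and their values are hypothetical and are not taken from the sales file used below.

```python
import pandas as pd

# Hypothetical data, not taken from the sales file used in this notebook.
orders = pd.DataFrame({
    "User_ID": [1, 2, 3, 1],
    "Zone": ["Western", "Southern", "Western", "Western"],
    "Gender": ["F", "M", "F", "F"],
    "Amount": [100.0, 250.0, 80.0, 40.0],
})
users = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Cust_name": ["Asha", "Ravi", "Meera"],
})

# merge: join the two frames on their shared User_ID column.
merged = orders.merge(users, on="User_ID", how="left")

# pivot_table: total Amount per Zone (rows) and Gender (columns).
pivot = orders.pivot_table(values="Amount", index="Zone",
                           columns="Gender", aggfunc="sum")

# apply: call a function once per column (axis=0 is the default).
ranges = orders[["Amount"]].apply(lambda col: col.max() - col.min())

print(merged)
print(pivot)
print(ranges)
```

Combinations that never occur in the data (here Southern and F) appear as NaN in the pivot table.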

Import libraries
In [2]: import pandas as pd

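Check which encoding open() uses for the CSV file

open() does not inspect the file's contents. When no encoding argument is given, the file object reports the default text encoding of the platform (cp1252 here), which also happens to decode this file correctly.

In [5]: with open('Diwali Sales Data.csv') as f:
            print(f)

<_io.TextIOWrapper name='Diwali Sales Data.csv' mode='r' encoding='cp1252'>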
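
If the platform default does not fit the file, one option is to try a few candidate encodings and keep the first that decodes without error. The sketch below is only an illustration: the helper name first_working_encoding, the candidate list, and the temporary sample file are assumptions, and a successful decode does not prove the encoding is the right one (latin-1, for example, accepts any byte sequence).

```python
from pathlib import Path
import tempfile


def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file, else None."""
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


# Demo on a temporary file with hypothetical content, written as cp1252.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes("name,price\ncafé,10\n".encode("cp1252"))
    detected = first_working_encoding(sample)

print(detected)
```

The returned name can then be passed to pd.read_csv through its encoding parameter.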

Import the CSV file

In [6]: df = pd.read_csv('Diwali Sales Data.csv', encoding='cp1252')

localhost:8888/nbconvert/html/Class EDA/EDA All Functions.ipynb?download=false 1/9



df.head() displays the first few rows (five by default) of a DataFrame, the table object of pandas, a popular data manipulation and analysis library.
In [7]: df.head()

Out[7]:
   User_ID  Cust_name Product_ID Gender Age Group  Age  Marital_Status           State      Zone  O
0  1002903  Sanskriti  P00125942      F     26-35   28               0     Maharashtra   Western
1  1000732     Kartik  P00110942      F     26-35   35               1  Andhra Pradesh  Southern
2  1001990      Bindu  P00118542      F     26-35   35               1   Uttar Pradesh   Central  A
3  1001425     Sudevi  P00237842      M      0-17   16               0       Karnataka  Southern  C
4  1000588       Joni  P00057942      M     26-35   28               1         Gujarat   Western

df.tail() returns the last five rows of the DataFrame by default.
In [10]: df.tail()

Out[10]:
       User_ID    Cust_name Product_ID Gender Age Group  Age  Marital_Status           State      Zone
11246  1000695      Manning  P00296942      M     18-25   19               1     Maharashtra   Western
11247  1004089  Reichenbach  P00171342      M     26-35   33               0         Haryana  Northern
11248  1001209        Oshin  P00201342      F     36-45   40               0  Madhya Pradesh   Central
11249  1004023       Noonan  P00059442      M     36-45   37               0       Karnataka  Southern
11250  1002744      Brumley  P00281742      F     18-25   19               0     Maharashtra   Western
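
Both head() and tail() accept the row count n listed in items 1 and 2. A small sketch on a hypothetical ten-row DataFrame (not the sales data):

```python
import pandas as pd

# Hypothetical ten-row frame, used only to show the n argument.
demo = pd.DataFrame({"Orders": range(10)})

first_three = demo.head(3)        # rows 0, 1, 2
last_two = demo.tail(2)           # rows 8, 9
all_but_last_two = demo.head(-2)  # negative n: everything except the last 2 rows

print(first_three)
print(last_two)
print(len(all_but_last_two))
```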

df.info() provides a summary of the DataFrame, including the following information:
The total number of rows and columns in the DataFrame.
The column names and their corresponding data types.
The count of non-null values in each column.
The memory usage of the DataFrame.

In [8]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11251 entries, 0 to 11250
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 11251 non-null int64
1 Cust_name 11251 non-null object
2 Product_ID 11251 non-null object
3 Gender 11251 non-null object
4 Age Group 11251 non-null object
5 Age 11251 non-null int64
6 Marital_Status 11251 non-null int64
7 State 11251 non-null object
8 Zone 11251 non-null object
9 Occupation 11251 non-null object
10 Product_Category 11251 non-null object
11 Orders 11251 non-null int64
12 Amount 11239 non-null float64
13 Status 0 non-null float64
14 unnamed1 0 non-null float64
dtypes: float64(3), int64(4), object(8)
memory usage: 1.3+ MB

df.describe() generates descriptive statistics for the numeric columns of a DataFrame, such as count, mean, standard deviation, minimum, quartiles, and maximum.
In [9]: df.describe()

Out[9]:
            User_ID           Age  Marital_Status        Orders        Amount  Status  unnamed1
count  1.125100e+04  11251.000000    11251.000000  11251.000000  11239.000000     0.0       0.0
mean   1.003004e+06     35.421207        0.420318      2.489290   9453.610858     NaN       NaN
std    1.716125e+03     12.754122        0.493632      1.115047   5222.355869     NaN       NaN
min    1.000001e+06     12.000000        0.000000      1.000000    188.000000     NaN       NaN
25%    1.001492e+06     27.000000        0.000000      1.500000   5443.000000     NaN       NaN
50%    1.003065e+06     33.000000        0.000000      2.000000   8109.000000     NaN       NaN
75%    1.004430e+06     43.000000        1.000000      3.000000  12675.000000     NaN       NaN
max    1.006040e+06     92.000000        1.000000      4.000000  23952.000000     NaN       NaN
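
By default describe() skips text columns such as Gender or Zone. A minimal sketch of summarising them too, using the exclude parameter on a small hypothetical DataFrame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical data; column names mirror the sales file, values do not.
demo = pd.DataFrame({
    "Gender": ["F", "F", "M", "F"],
    "Zone": ["Western", "Southern", "Western", "Western"],
    "Orders": [1, 3, 2, 4],
})

# Exclude numeric columns, so only the text columns are summarised.
# For text columns the rows are count, unique, top (most frequent) and freq.
text_summary = demo.describe(exclude="number")

print(text_summary)
```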


df.shape returns the dimensions of the DataFrame as a tuple (number of rows, number of columns).
In [11]: df.shape

Out[11]: (11251, 15)

df.columns shows the column names of the DataFrame.
In [13]: df.columns

Out[13]: Index(['User_ID', 'Cust_name', 'Product_ID', 'Gender', 'Age Group', 'Age',
                'Marital_Status', 'State', 'Zone', 'Occupation', 'Product_Category',
                'Orders', 'Amount', 'Status', 'unnamed1'],
               dtype='object')

df.dtypes shows the data type of each column in the DataFrame.
In [14]: df.dtypes

Out[14]:
User_ID int64
Cust_name object
Product_ID object
Gender object
Age Group object
Age int64
Marital_Status int64
State object
Zone object
Occupation object
Product_Category object
Orders int64
Amount float64
Status float64
unnamed1 float64
dtype: object
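
Once the dtypes are known, they can be used to pick columns or be changed. A small sketch with hypothetical values, using select_dtypes and astype:

```python
import pandas as pd

# Hypothetical data; only the column names echo the sales file.
demo = pd.DataFrame({
    "Age": [28, 35],
    "State": ["Gujarat", "Delhi"],
    "Amount": [10.5, 20.0],
})

# Names of all numeric columns (integers and floats).
numeric_cols = demo.select_dtypes(include="number").columns.tolist()

# Convert one column to another dtype; astype returns a new Series.
demo["Age"] = demo["Age"].astype("float64")

print(numeric_cols)
print(demo.dtypes)
```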

df.isnull(): Checks for missing values and returns a DataFrame of the same shape with True/False values indicating the presence of missing values.
In [15]: df.isnull()


Out[15]:
       User_ID  Cust_name  Product_ID  Gender  Age Group    Age  Marital_Status  State   Zone  Occupatio
0        False      False       False   False      False  False           False  False  False  Fal
1        False      False       False   False      False  False           False  False  False  Fal
2        False      False       False   False      False  False           False  False  False  Fal
3        False      False       False   False      False  False           False  False  False  Fal
4        False      False       False   False      False  False           False  False  False  Fal
...        ...        ...         ...     ...        ...    ...             ...    ...    ...
11246    False      False       False   False      False  False           False  False  False  Fal
11247    False      False       False   False      False  False           False  False  False  Fal
11248    False      False       False   False      False  False           False  False  False  Fal
11249    False      False       False   False      False  False           False  False  False  Fal
11250    False      False       False   False      False  False           False  False  False  Fal

11251 rows × 15 columns

df.isnull().sum() checks for missing values and counts how many nulls each column contains.
In [16]: df.isnull().sum()

Out[16]:
User_ID 0
Cust_name 0
Product_ID 0
Gender 0
Age Group 0
Age 0
Marital_Status 0
State 0
Zone 0
Occupation 0
Product_Category 0
Orders 0
Amount 12
Status 11251
unnamed1 11251
dtype: int64
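
Raw counts are easier to judge next to the share of rows they represent. Because True counts as 1, the mean of isnull() gives the fraction missing per column. A sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one partly missing column and one fully missing column.
demo = pd.DataFrame({
    "Amount": [100.0, np.nan, 80.0, 40.0],
    "Status": [np.nan, np.nan, np.nan, np.nan],
})

missing_count = demo.isnull().sum()       # number of nulls per column
missing_pct = demo.isnull().mean() * 100  # percentage of nulls per column

print(missing_count)
print(missing_pct)
```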

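df.dropna(): Removes rows with missing values from the DataFrame. Here every row has a missing value in the entirely empty Status and unnamed1 columns, so the result contains no rows, only the column headers.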
In [17]: df.dropna()


Out[17]:
User_ID  Cust_name  Product_ID  Gender  Age Group  Age  Marital_Status  State  Zone  Occupation  Pro
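
A plain dropna() is often too aggressive when some columns are entirely empty. The sketch below, on a small hypothetical DataFrame, shows the how, axis, and subset parameters that narrow what gets dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Status is entirely missing, Amount is partly missing.
demo = pd.DataFrame({
    "Orders": [1, 2, 3],
    "Amount": [100.0, np.nan, 80.0],
    "Status": [np.nan, np.nan, np.nan],
})

dropped_any = demo.dropna()                     # every row has a NaN in Status, so nothing is left
no_empty_cols = demo.dropna(axis=1, how="all")  # drop only columns that are entirely NaN
amount_known = demo.dropna(subset=["Amount"])   # drop rows only when Amount is missing

print(len(dropped_any))
print(no_empty_cols)
print(amount_known)
```

None of these calls modify demo; each returns a new DataFrame.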

df.drop('column name', axis=1, inplace=True) removes the named column itself from the DataFrame, in place. Here it is used to drop unnamed1, which contains only missing values.
In [23]: df.drop('unnamed1',axis=1,inplace=True)
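
An alternative spelling, sketched on hypothetical data: the columns parameter avoids having to remember axis=1, and leaving out inplace returns a new DataFrame so the original stays available.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with an empty column to remove.
demo = pd.DataFrame({"Orders": [1, 2], "unnamed1": [np.nan, np.nan]})

trimmed = demo.drop(columns=["unnamed1"])  # demo itself is unchanged

# errors="ignore" skips labels that are not present instead of raising KeyError.
still_ok = trimmed.drop(columns=["unnamed1"], errors="ignore")

print(trimmed.columns.tolist())
```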

df.fillna(value): Fills missing values in the DataFrame with a specified value.
In [27]: # Fill missing values with a constant value
df_filled = df.fillna(0)
df_filled

Out[27]:
       User_ID    Cust_name Product_ID Gender Age Group  Age  Marital_Status           State  Zo
0      1002903    Sanskriti  P00125942      F     26-35   28               0     Maharashtra  Weste
1      1000732       Kartik  P00110942      F     26-35   35               1  Andhra Pradesh  Southe
2      1001990        Bindu  P00118542      F     26-35   35               1   Uttar Pradesh  Cent
3      1001425       Sudevi  P00237842      M      0-17   16               0       Karnataka  Southe
4      1000588         Joni  P00057942      M     26-35   28               1         Gujarat  Weste
...        ...          ...        ...    ...       ...  ...             ...             ...
11246  1000695      Manning  P00296942      M     18-25   19               1     Maharashtra  Weste
11247  1004089  Reichenbach  P00171342      M     26-35   33               0         Haryana  Northe
11248  1001209        Oshin  P00201342      F     36-45   40               0  Madhya Pradesh  Cent
11249  1004023       Noonan  P00059442      M     36-45   37               0       Karnataka  Southe
11250  1002744      Brumley  P00281742      F     18-25   19               0     Maharashtra  Weste

11251 rows × 14 columns

Fill missing values with the mean of the column


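df['column name'] = df['column name'].fillna(df['column name'].mean())
In [33]: # Assign the result back: fillna(..., inplace=True) on a selected column may not update df in newer pandas versions
df['Amount'] = df['Amount'].fillna(df['Amount'].mean())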

In [34]: # Check that the missing values were filled
df.isnull().sum()

Out[34]:
User_ID 0
Cust_name 0
Product_ID 0
Gender 0
Age Group 0
Age 0
Marital_Status 0
State 0
Zone 0
Occupation 0
Product_Category 0
Orders 0
Amount 0
Status 11251
dtype: int64
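
fillna() also accepts a dictionary, which lets each column get its own replacement in one call. In the sketch below the data, the choice of the median, and the "Unknown" label are all illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a text column.
demo = pd.DataFrame({
    "Amount": [100.0, np.nan, 80.0, 40.0],
    "Occupation": ["IT Sector", None, "Banking", None],
})

# Per-column replacements: median for Amount, a placeholder label for Occupation.
filled = demo.fillna({
    "Amount": demo["Amount"].median(),
    "Occupation": "Unknown",
})

print(filled)
```

The median is less sensitive to extreme values than the mean, which can matter for skewed amounts.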

df.groupby(by) groups a DataFrame by one or more columns. It splits the DataFrame into groups based on the unique values in the specified column(s) so that operations can be performed on each group independently.
In [53]: grouped = df.groupby(['Product_ID', 'Cust_name'])
mean_age = grouped['Age'].mean()
print(mean_age)

Product_ID  Cust_name
P00000142   Adrian       19.0
            Akshat       27.0
            Armstrong    34.0
            Arun         33.0
            Atkinson     46.0
                         ...
P0099442    Amol         26.0
            Astrea       35.0
            Grant        32.0
            Siddharth    36.0
P0099742    Shatayu      13.0
Name: Age, Length: 10948, dtype: float64

In [54]: # The same in one line
mean_values = df.groupby(['Product_ID', 'Cust_name'])['Age'].mean()
mean_values

Out[54]:
Product_ID  Cust_name
P00000142   Adrian       19.0
            Akshat       27.0
            Armstrong    34.0
            Arun         33.0
            Atkinson     46.0
                         ...
P0099442    Amol         26.0
            Astrea       35.0
            Grant        32.0
            Siddharth    36.0
P0099742    Shatayu      13.0
Name: Age, Length: 10948, dtype: float64
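
A grouped object can also compute several statistics at once with agg(). The sketch below uses a small hypothetical DataFrame and named aggregation; the output column names (mean_age, total_amount, n_rows) are chosen only for illustration.

```python
import pandas as pd

# Hypothetical data; grouping by Zone keeps the example small.
demo = pd.DataFrame({
    "Zone": ["Western", "Southern", "Western", "Southern"],
    "Age": [28, 35, 19, 40],
    "Amount": [100.0, 250.0, 80.0, 40.0],
})

# Named aggregation: new_column=(source_column, function).
summary = demo.groupby("Zone").agg(
    mean_age=("Age", "mean"),
    total_amount=("Amount", "sum"),
    n_rows=("Amount", "size"),
)

print(summary)
```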

df.sort_values(by): Sorts the DataFrame by one or more columns.
In [59]: # Sort by a single column: df.sort_values(by='Column1')
df.sort_values(by='Amount')

Out[59]:
       User_ID    Cust_name Product_ID Gender Age Group  Age  Marital_Status           State  Zo
11250  1002744      Brumley  P00281742      F     18-25   19               0     Maharashtra  Weste
11249  1004023       Noonan  P00059442      M     36-45   37               0       Karnataka  Southe
11248  1001209        Oshin  P00201342      F     36-45   40               0  Madhya Pradesh  Cent
11247  1004089  Reichenbach  P00171342      M     26-35   33               0         Haryana  Northe
11246  1000695      Manning  P00296942      M     18-25   19               1     Maharashtra  Weste
...        ...          ...        ...    ...       ...  ...             ...             ...
4      1000588         Joni  P00057942      M     26-35   28               1         Gujarat  Weste
3      1001425       Sudevi  P00237842      M      0-17   16               0       Karnataka  Southe
2      1001990        Bindu  P00118542      F     26-35   35               1   Uttar Pradesh  Cent
1      1000732       Kartik  P00110942      F     26-35   35               1  Andhra Pradesh  Southe
0      1002903    Sanskriti  P00125942      F     26-35   28               0     Maharashtra  Weste

11251 rows × 14 columns

In [56]: # Sort by multiple columns: df.sort_values(by=['Column1', 'Column2'])
df.sort_values(by=['Age', 'Amount'])


Out[56]:
       User_ID Cust_name Product_ID Gender Age Group  Age  Marital_Status           State      Zone
11240  1001425    Sudevi  P00044742      F      0-17   12               0           Delhi   Central
11109  1004135   Jayanti  P00229742      F      0-17   12               0           Delhi   Central
10804  1001673   Lampkin  P00277442      F      0-17   12               0         Gujarat   Western
10774  1001926    Barton  P00157542      M      0-17   12               1  Madhya Pradesh   Central
9505   1005403  Caroline  P00195742      M      0-17   12               1         Haryana  Northern
...        ...       ...        ...    ...       ...  ...             ...             ...       ...
2951   1002204   Dilbeck  P00246642      M       55+   92               0  Madhya Pradesh   Central
2698   1005658   Poirier  P00227942      M       55+   92               0       Karnataka  Southern
1106   1001176     Alice  P00128942      M       55+   92               0   Uttar Pradesh   Central
612    1002526    Shreya  P00271142      M       55+   92               1   Uttar Pradesh   Central
359    1003036  Prescott  P00255842      F       55+   92               0     Uttarakhand   Central

11251 rows × 14 columns
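
sort_values() sorts in ascending order unless told otherwise. A short sketch on hypothetical data showing descending order and a different direction per column through the ascending parameter:

```python
import pandas as pd

# Hypothetical data with repeated ages so the second sort key matters.
demo = pd.DataFrame({
    "Age": [35, 19, 35, 19],
    "Amount": [250.0, 80.0, 100.0, 40.0],
})

# Largest Amount first.
by_amount_desc = demo.sort_values(by="Amount", ascending=False)

# Age ascending, and within each age the largest Amount first.
mixed = demo.sort_values(by=["Age", "Amount"], ascending=[True, False])

print(by_amount_desc)
print(mixed)
```

Like most pandas methods, sort_values() returns a new DataFrame and keeps the original index labels, which is why the index column looks shuffled after sorting.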
