Python Pandas

Pandas is an open-source Python library designed for data analysis and manipulation, featuring data structures like Series and DataFrame for efficient handling of structured data. It offers key functionalities such as data manipulation, integration with other libraries, and support for various data formats, making it widely used in data science and analytics. The document also provides practical examples of applications in data cleaning, financial analysis, exploratory data analysis, and machine learning preprocessing.

Uploaded by

Abhishek Dutta
© All Rights Reserved

Introduction to Python Pandas

Pandas is a powerful open-source Python library for data analysis and manipulation. It provides
data structures like DataFrame and Series that make handling structured data (like tables and
time-series) easy and efficient. Pandas is widely used in data science, machine learning, and
analytics due to its versatility and high-level abstractions for managing datasets.

Key Features of Pandas

1. Data Structures:
    Series: One-dimensional, similar to a column in Excel or a 1D NumPy array.
    DataFrame: Two-dimensional, like a table with rows and columns.
2. Data Manipulation:
    Filtering, sorting, grouping, and aggregation.
3. Integration:
    Works seamlessly with other libraries like NumPy, Matplotlib, and Scikit-learn.
4. Data I/O:
    Read and write data in various formats such as CSV, Excel, SQL, JSON, etc.
5. Time-Series Support:
    Provides functionality for analyzing and processing time-series data.
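The two core structures can be sketched in a few lines (the values here are illustrative):

```python
import pandas as pd

# A Series: one-dimensional labelled data
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# A DataFrame: two-dimensional table of rows and columns
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

print(s)
print(df)
```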

Applications of Pandas

1. Data Cleaning

Real-time Example:

 Task: Cleaning customer data by handling missing values.

import pandas as pd

# Sample data
data = {'Name': ['Alice', 'Bob', None, 'Eve'], 'Age': [25, 30, None, 40]}
df = pd.DataFrame(data)

# Handling missing values (assign back rather than using the
# deprecated chained inplace=True pattern)
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())

print(df)
2. Financial Analysis

Real-time Example:

 Task: Analyzing stock market data.

import pandas as pd

# Loading sample stock data


df = pd.read_csv('https://example.com/stock_prices.csv', parse_dates=['Date'])

# Filtering for a specific company; copy() avoids a
# SettingWithCopyWarning when assigning the new column below
apple_stock = df[df['Company'] == 'Apple'].copy()

# Calculating a 20-day moving average
apple_stock['Moving_Avg'] = apple_stock['Close'].rolling(window=20).mean()

print(apple_stock.head())

3. Exploratory Data Analysis (EDA)

Real-time Example:

 Task: Analyzing a dataset of sales.

# Loading sales data


sales_data = pd.read_csv('https://example.com/sales_data.csv')

# Grouping sales by region


region_sales = sales_data.groupby('Region')['Sales'].sum()

# Plotting the data


region_sales.plot(kind='bar', title='Sales by Region')

4. Time Series Analysis

Real-time Example:

 Task: Forecasting electricity demand based on past data.

# Loading time series data


df = pd.read_csv('https://example.com/electricity_demand.csv',
parse_dates=['Timestamp'])

# Resampling data to hourly averages


hourly_demand = df.resample('h', on='Timestamp')['Demand'].mean()  # 'H' is deprecated in pandas 2.2+

print(hourly_demand.head())
5. Machine Learning Preprocessing

Real-time Example:

 Task: Preparing data for a machine learning model.

# Loading data
data = pd.read_csv('https://example.com/housing_data.csv')

# Dropping irrelevant columns


data.drop(['ID'], axis=1, inplace=True)

# Encoding categorical features


data = pd.get_dummies(data, columns=['City'], drop_first=True)

# Normalizing numerical features


data['Price'] = (data['Price'] - data['Price'].mean()) / data['Price'].std()

print(data.head())

6. Web Scraping and Analysis

Real-time Example:

 Task: Scraping live product prices and analyzing them.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Scraping data from a website


response = requests.get('https://example.com/products')
soup = BeautifulSoup(response.text, 'html.parser')

# Extracting product names and prices


products = {'Name': [], 'Price': []}
for product in soup.select('.product-item'):
    products['Name'].append(product.select_one('.name').text)
    products['Price'].append(float(product.select_one('.price').text.strip('$')))

df = pd.DataFrame(products)

# Analyzing product prices


print(df.describe())

Why Use Pandas?


 Handles large datasets efficiently.
 Provides intuitive data manipulation tools.
 Simplifies working with different data formats.
 Integrates well with visualization and machine learning tools.

What Can Pandas Do?

Pandas helps you answer questions about your data, such as:

 Is there a correlation between two or more columns?
 What is the average value?
 What is the maximum value?
 What is the minimum value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
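A short sketch of these questions in code, on a small hypothetical dataset:

```python
import pandas as pd

# Hypothetical sales data containing a NULL row
df = pd.DataFrame({'Units': [10, 20, 30, None],
                   'Revenue': [100.0, 210.0, 290.0, None]})

# Correlation between two columns
corr = df['Units'].corr(df['Revenue'])

# Average, max, and min values (NaN is skipped automatically)
avg = df['Units'].mean()
mx = df['Units'].max()
mn = df['Units'].min()

# "Cleaning": drop rows with empty/NULL values
clean = df.dropna()
print(corr, avg, mx, mn, len(clean))
```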

Pandas Series:
A Pandas Series is a one-dimensional array capable of storing various data types. We can easily convert a list, tuple, or dictionary into a Series using the Series() constructor. The row labels of a Series are called the index. A Series cannot contain multiple columns. It takes the following parameters:

 data: Any list, dictionary, or scalar value.
 index: Index values should be unique and hashable, and of the same length as data. If no index is passed, a default index of 0 to n-1 (like np.arange(n)) is used.
 dtype: The data type of the Series.
 copy: Whether to copy the input data.
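A brief illustration of these parameters (values are illustrative):

```python
import pandas as pd

# data, index, and dtype in action
s = pd.Series(data=[1, 2, 3], index=['a', 'b', 'c'], dtype='float64')
print(s)

# Without an explicit index, a default RangeIndex 0..n-1 is used
t = pd.Series([10, 20, 30])
print(t.index)
```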

Creating a Series:
We can create a Series in two ways:

1. Create an empty Series


2. Create a Series using inputs.

Create an Empty Series:


We can easily create an empty Series in Pandas, meaning it holds no values.

The syntax used for creating an empty Series:

<series object> = pandas.Series()

The example below creates an empty Series object. Note that recent pandas versions default an empty Series to dtype object, so we pass dtype='float64' explicitly to keep the classic float64 default:

Example

import pandas as pd
x = pd.Series(dtype='float64')
print(x)
Output

Series([], dtype: float64)

Creating a Series using inputs:


We can create Series by using various inputs:

o Array
o Dict
o Scalar value
Creating Series from Array:

Before creating a Series from an array, we first have to import the numpy module and use its array() function. If the data is an ndarray, any index passed must be of the same length.

If we do not pass an index, a default index of range(n) is used, where n is the length of the array, i.e., [0, 1, 2, ..., len(array)-1].

Example

import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output

0    P
1    a
2    n
3    d
4    a
5    s
dtype: object
Create a Series from dict
We can also create a Series from a dict. If a dictionary is passed as input and no index is specified, the dictionary keys are used, in insertion order, to construct the index (older pandas versions sorted the keys).

If an index is passed, the values corresponding to each label in the index will be extracted from the dictionary.

# import the pandas library
import pandas as pd
info = {'x' : 0., 'y' : 1., 'z' : 2.}
a = pd.Series(info)
print(a)
Output

x 0.0
y 1.0
z 2.0
dtype: float64
Create a Series using Scalar:

If we pass a scalar value, an index must be provided. The scalar value is repeated to match the length of the index.

# import pandas library
import pandas as pd
x = pd.Series(4, index=[0, 1, 2, 3])
print(x)
Output

0    4
1    4
2    4
3    4
dtype: int64
Accessing data from series with Position:
Once you create a Series object, you can access its index, data, and individual elements.

Data in a Series can be accessed much like in an ndarray. Positional access with plain [] is deprecated for labelled Series in recent pandas, so iloc is preferred:

import pandas as pd
x = pd.Series([1,2,3], index = ['a','b','c'])
# retrieve the first element by position
print (x.iloc[0])
Output
1
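Label-based access complements positional access; a short sketch (values illustrative):

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Label-based access with loc
print(x.loc['b'])

# Positional slicing with iloc returns a sub-Series
print(x.iloc[0:2])

# A boolean mask selects matching elements
print(x[x > 1])
```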

Series object attributes

A Series attribute is any information related to the Series object, such as its size, datatype, etc. Below are some of the attributes that you can use to get information about a Series object:

Attribute             Description
Series.index          Defines the index of the Series.
Series.shape          Returns a tuple with the shape of the data.
Series.dtype          Returns the data type of the data.
Series.size           Returns the size of the data.
Series.empty          Returns True if the Series object is empty, otherwise False.
Series.hasnans        Returns True if there are any NaN values, otherwise False.
Series.nbytes         Returns the number of bytes in the data.
Series.ndim           Returns the number of dimensions in the data.
Series.dtype.itemsize Returns the size of the datatype of an item (the old Series.itemsize attribute was removed in pandas 1.0).

Retrieving Index array and data array of a series object


We can retrieve the index array and data array of an existing Series object by using the index and values attributes.

import numpy as np
import pandas as pd
x = pd.Series(data=[2,4,6,8])
y = pd.Series(data=[11.2,18.6,22.5], index=['a','b','c'])
print(x.index)
print(x.values)
print(y.index)
print(y.values)
Output

RangeIndex(start=0, stop=4, step=1)


[2 4 6 8]
Index(['a', 'b', 'c'], dtype='object')
[11.2 18.6 22.5]

Retrieving Types (dtype) and Size of Type (itemsize)

You can use the dtype attribute of a Series object (<objectname>.dtype) to retrieve the data type of its elements. To see the number of bytes allocated to each data item, use <objectname>.dtype.itemsize (the old Series.itemsize attribute was removed in pandas 1.0).

import numpy as np
import pandas as pd
a = pd.Series(data=[1,2,3,4])
b = pd.Series(data=[4.9,8.2,5.6], index=['x','y','z'])
print(a.dtype)
print(a.dtype.itemsize)
print(b.dtype)
print(b.dtype.itemsize)
Output

int64
8
float64
8

Retrieving Shape
The shape of a Series object gives the total number of elements, including missing or empty values (NaN).

import numpy as np
import pandas as pd
a = pd.Series(data=[1,2,3,4])
b = pd.Series(data=[4.9,8.2,5.6], index=['x','y','z'])
print(a.shape)
print(b.shape)
Output
(4,)
(3,)

Retrieving Dimension, Size and Number of bytes:

import numpy as np
import pandas as pd
a = pd.Series(data=[1,2,3,4])
b = pd.Series(data=[4.9,8.2,5.6], index=['x','y','z'])
print(a.ndim, b.ndim)
print(a.size, b.size)
print(a.nbytes, b.nbytes)
Output

1 1
4 3
32 24

Checking Emptiness and Presence of NaNs

To check whether a Series object is empty, you can use the empty attribute. Similarly, to check whether a Series object contains any NaN values, you can use the hasnans attribute.

Example

import numpy as np
import pandas as pd
a = pd.Series(data=[1,2,3,np.nan])
b = pd.Series(data=[4.9,8.2,5.6], index=['x','y','z'])
c = pd.Series(dtype='float64')
print(a.empty, b.empty, c.empty)
print(a.hasnans, b.hasnans, c.hasnans)
print(len(a), len(b))
print(a.count(), b.count())
Output

False False True
True False False
4 3
3 3
Series Functions
There are some functions used in Series which are as follows:

Function                      Description
Pandas Series.map()           Maps values using a function or from another Series with a common key.
Pandas Series.std()           Calculates the standard deviation of the given set of numbers, DataFrame, column, or rows.
Pandas Series.to_frame()      Converts the Series object to a DataFrame.
Pandas Series.value_counts()  Returns a Series that contains counts of unique values.
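A quick sketch exercising these four functions on a small Series (values illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3])

# map(): transform each value via a dict or function
mapped = s.map({1: 'one', 2: 'two', 3: 'three'})

# std(): sample standard deviation (ddof=1 by default)
sd = s.std()

# to_frame(): convert the Series into a one-column DataFrame
frame = s.to_frame(name='value')

# value_counts(): counts of unique values
counts = s.value_counts()

print(mapped.tolist(), round(sd, 3), frame.shape, counts.loc[2])
```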
Python DataFrame: Reading CSV and JSON, and Performing Analysis
Functions
Python's pandas library provides powerful tools for handling, manipulating, and analyzing
structured data.

1. Python DataFrame: Reading CSV


Definition
pd.read_csv(): Reads a comma-separated values (CSV) file into a DataFrame.
CSV files are widely used for storing tabular data in various fields such as finance, healthcare,
and e-commerce.
Real-Time Scenario
Finance: Reading a CSV containing stock market data to analyze trends.
E-commerce: Reading product sales data for generating reports.
Example: Reading CSV and Basic Operations
import pandas as pd

# Reading a CSV file


df = pd.read_csv("sales_data.csv")

# Displaying first 5 rows


print(df.head())

# Scenario: Calculate total sales


total_sales = df["sales_amount"].sum()
print(f"Total Sales: {total_sales}")
2. Python DataFrame: Reading JSON
Definition
pd.read_json(): Reads a JSON file into a DataFrame.
JSON is a popular format for transmitting data in web applications and APIs.
Real-Time Scenario
Web Development: Reading user details from a JSON API response.
Social Media Analysis: Reading JSON containing user activity for engagement reports.
Example: Reading JSON and Basic Operations
import pandas as pd

# Reading a JSON file


df = pd.read_json("user_data.json")

# Displaying first 5 rows


print(df.head())

# Scenario: Filter users above age 30


filtered_users = df[df["age"] > 30]
print(filtered_users)

3. Python DataFrame: Analysis Functions


Definition
Pandas provides a wide range of functions to analyze and manipulate data, such as
summarization, filtering, grouping, and visualization.
Real-Time Scenario
Healthcare: Summarizing patient data for trend analysis.
Marketing: Grouping customer purchases by region for targeted campaigns.
Analysis Functions
Summarization Functions

df.describe(): Provides a statistical summary of numerical columns.
df.mean(), df.sum(), df.count(): Calculate the mean, sum, or count of values (pass numeric_only=True on DataFrames with mixed column types).
Example: Statistical Summary

df = pd.read_csv("employee_data.csv")
print(df.describe()) # Summary of numerical columns

# Scenario: Calculate average salary


avg_salary = df["salary"].mean()
print(f"Average Salary: {avg_salary}")
Filtering and Querying

df.loc[]: Filter rows by label.


df[df["column_name"] > value]: Conditional filtering.
Example: Filter Data

# Scenario: Employees with salary > 50000


high_salary = df[df["salary"] > 50000]
print(high_salary)
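df.loc[] itself can be sketched on a tiny hypothetical frame, independent of the employee CSV above:

```python
import pandas as pd

df = pd.DataFrame({'salary': [40000, 60000]}, index=['alice', 'bob'])

# loc selects rows by label
print(df.loc['bob'])

# loc also accepts a boolean mask for conditional filtering
print(df.loc[df['salary'] > 50000])
```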
Grouping and Aggregation

df.groupby(): Groups data by specified columns and applies aggregation functions.


Example: Group Sales by Region
# Scenario: Total sales by region
grouped_sales = df.groupby("region")["sales_amount"].sum()
print(grouped_sales)
Sorting

df.sort_values(): Sorts the DataFrame by specified columns.


Example: Sort Employees by Salary

sorted_employees = df.sort_values(by="salary", ascending=False)


print(sorted_employees)
Handling Missing Data

df.isnull(): Checks for missing values.


df.fillna(): Fills missing values with a specified value.
df.dropna(): Drops rows/columns with missing values.
Example: Handle Missing Values

# Scenario: Replace missing salaries with 30000


df["salary"] = df["salary"].fillna(30000)
print(df)
Merging and Joining

pd.merge(): Merges two DataFrames.


df.join(): Joins DataFrames on indices.
Example: Merge Employee and Department Data

departments = pd.DataFrame({"dept_id": [1, 2], "dept_name": ["HR", "Finance"]})


merged_df = pd.merge(df, departments, left_on="dept_id", right_on="dept_id")
print(merged_df)
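Since the example above only shows pd.merge(), here is a minimal sketch of df.join(), which aligns on the index (the frames are hypothetical):

```python
import pandas as pd

employees = pd.DataFrame({'name': ['Alice', 'Bob']}, index=[1, 2])
salaries = pd.DataFrame({'salary': [50000, 60000]}, index=[1, 2])

# join() aligns the two frames on their indices by default
joined = employees.join(salaries)
print(joined)
```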
Visualization

df.plot(): Generates basic plots.


df.hist(): Creates histograms.

Example: Plot Sales Data


import matplotlib.pyplot as plt
# Scenario: Sales Trend
df["sales_amount"].plot(kind="line")
plt.title("Sales Trend")
plt.show()
Functions Summary

Function          Purpose                                       Example Use Case
pd.read_csv()     Load data from a CSV file into a DataFrame    Load sales or employee data.
pd.read_json()    Load data from a JSON file into a DataFrame   Load API response for user activity.
df.describe()     Statistical summary of numerical columns      Summarize patient statistics.
df.groupby()      Group data and apply aggregation functions    Calculate total sales per region.
df.sort_values()  Sort data by specified columns                Rank employees by salary.
df.fillna()       Fill missing values                           Replace missing product prices.
df.plot()         Visualize data using basic plots              Analyze sales trends over months.
Data Cleaning Functions in Python DataFrames

Data cleaning is a crucial step in preparing datasets for analysis. Pandas provides several
functions to clean and preprocess data. Below is a detailed explanation of key data-cleaning
techniques, real-time scenarios, and example codes.

Common Data Issues and Pandas Cleaning Functions

Issue                    Function/Technique                        Description
Missing Values           isnull(), notnull(), fillna(), dropna()   Identify, fill, or remove missing data.
Duplicate Rows           duplicated(), drop_duplicates()           Detect and remove duplicate rows.
Incorrect Data Types     astype(), to_datetime()                   Convert data to appropriate types.
Outliers                 clip(), replace(), filtering              Handle extreme values.
Invalid Values           Filtering, apply(), replace()             Replace or correct invalid entries.
Inconsistent Formatting  str.lower(), str.strip(), str.replace()   Standardize text data for consistency.
Removing Unwanted Data   Filtering, drop()                         Drop irrelevant rows or columns.
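These techniques can be chained into one small cleaning pass; a sketch on a hypothetical messy frame:

```python
import pandas as pd

# A messy hypothetical dataset touching several of the issues above
df = pd.DataFrame({
    'Product': [' apple', 'Banana', 'Banana', None],
    'Price': ['10', '20', '20', '30'],
})

df['Product'] = df['Product'].str.strip().str.lower()  # inconsistent formatting
df['Price'] = df['Price'].astype(int)                  # incorrect data type
df = df.dropna(subset=['Product'])                     # missing values
df = df.drop_duplicates()                              # duplicate rows
print(df)
```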

1. Handling Missing Data

Scenario: A sales dataset has missing values for revenue.

import pandas as pd

# Sample DataFrame with missing values


data = {
"Product": ["A", "B", "C", None],
"Sales": [100, 200, None, 150],
"Revenue": [500, None, 300, 400],
}
df = pd.DataFrame(data)

# Identify missing values


print(df.isnull())

# Fill missing values


df["Revenue"] = df["Revenue"].fillna(df["Revenue"].mean())

# Drop rows with missing Product


df = df.dropna(subset=["Product"])
print(df)

Functions

 isnull(): Checks for missing values.


 fillna(value): Replaces missing values with a specified value.
 dropna(): Removes rows or columns with missing values.

2. Removing Duplicates

Scenario: A customer dataset has duplicate entries.

# Sample DataFrame with duplicates


data = {"Customer": ["Alice", "Bob", "Alice"], "Purchase": [200, 300, 200]}
df = pd.DataFrame(data)

# Detect duplicates
print(df.duplicated())

# Remove duplicates
df = df.drop_duplicates()
print(df)

Functions

 duplicated(): Identifies duplicate rows.


 drop_duplicates(): Removes duplicate rows.

3. Converting Data Types

Scenario: Date data is in string format and needs conversion.

data = {"Date": ["2024-12-01", "2024-12-02", "2024-12-03"],
        "Sales": ["100", "200", "300"]}
df = pd.DataFrame(data)

# Convert Sales to numeric


df["Sales"] = pd.to_numeric(df["Sales"])

# Convert Date to datetime


df["Date"] = pd.to_datetime(df["Date"])

print(df.dtypes)

Functions
 astype(type): Converts a column to the specified type.
 pd.to_datetime(): Converts a column to datetime format.
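astype() is listed above but not used in the example; a minimal sketch (columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Sales': ['100', '200'], 'Flag': [1, 0]})

# astype(): cast columns to explicit types
df['Sales'] = df['Sales'].astype(int)
df['Flag'] = df['Flag'].astype(bool)

print(df.dtypes)
```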

4. Handling Outliers

Scenario: Sales data contains extreme outliers.

data = {"Sales": [100, 200, 300, 10000]}


df = pd.DataFrame(data)

# Cap sales at 500


df["Sales"] = df["Sales"].clip(upper=500)

print(df)

Functions

 clip(lower, upper): Limits values within a specified range.


 replace(): Replaces specified values.
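Besides clip(), the table above also lists filtering as a technique for outliers; a common sketch uses the 1.5×IQR rule (the thresholds are a convention, not part of the original example):

```python
import pandas as pd

df = pd.DataFrame({'Sales': [100, 200, 300, 10000]})

# Keep only rows within 1.5 * IQR of the quartiles
q1, q3 = df['Sales'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['Sales'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = df[mask]
print(filtered)
```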

5. Replacing Invalid or Incorrect Values

Scenario: Age column has invalid negative values.

data = {"Name": ["Alice", "Bob"], "Age": [25, -5]}


df = pd.DataFrame(data)

# Clamp negative ages to 0 (a mean-based replacement would also work)
df["Age"] = df["Age"].apply(lambda x: max(x, 0))

print(df)

Functions

 replace(to_replace, value): Replaces values based on conditions.


 apply(func): Applies a custom function to transform data.
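replace(to_replace, value) is listed above but only apply() is shown; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, -5]})

# replace(): swap a specific invalid value for a corrected one
df['Age'] = df['Age'].replace(-5, 0)
print(df)
```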

6. Standardizing Text
Scenario: Product names have inconsistent capitalization and whitespace.

data = {"Product": [" apple", "Orange ", "BANANA"]}


df = pd.DataFrame(data)

# Standardize text
df["Product"] = df["Product"].str.strip().str.lower()

print(df)

Functions

 str.strip(): Removes leading/trailing whitespace.


 str.lower(): Converts text to lowercase.
 str.replace(pattern, replacement): Replaces text based on a pattern.
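str.replace() with a regex pattern is often the workhorse here; a small sketch on hypothetical price strings:

```python
import pandas as pd

df = pd.DataFrame({'Price': ['$1,000', '$2,500']})

# str.replace() with regex=True strips currency symbols and separators
df['Price'] = df['Price'].str.replace(r'[$,]', '', regex=True).astype(int)
print(df)
```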

7. Dropping Irrelevant Data

Scenario: Drop unnecessary columns like "ID".

data = {"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"], "Score": [85,
90, 95]}
df = pd.DataFrame(data)

# Drop ID column
df = df.drop(columns=["ID"])

print(df)

Functions

 drop(columns): Removes specified columns.


 drop(indexes): Removes specified rows.

8. Applying Filters

Scenario: Retain rows where revenue > 300.

data = {"Product": ["A", "B", "C"], "Revenue": [200, 400, 300]}


df = pd.DataFrame(data)

# Filter rows
filtered_df = df[df["Revenue"] > 300]

print(filtered_df)
9. Handling Categorical Data

Scenario: Replace categorical values with labels.

data = {"Gender": ["Male", "Female", "Male"]}


df = pd.DataFrame(data)

# Replace categories with numeric labels


df["Gender"] = df["Gender"].replace({"Male": 0, "Female": 1})

print(df)

Summary of Functions

Function           Use Case                               Real-Time Scenario
isnull()           Identify missing values                Detect missing survey responses.
fillna()           Fill missing data                      Replace missing prices with the average.
dropna()           Remove rows/columns with missing data  Drop incomplete customer records.
duplicated()       Detect duplicate rows                  Find duplicate orders in e-commerce data.
drop_duplicates()  Remove duplicate rows                  Clean duplicate customer entries.
astype()           Convert column data types              Convert numeric strings to integers.
replace()          Replace specific values                Replace "NA" with a default value in a column.
clip()             Cap outliers                           Limit revenue to a specific range.
str.strip()        Remove extra spaces                    Clean up messy product names.
drop()             Drop irrelevant data                   Remove ID columns for analysis.
