Python Pandas

Pandas is an open-source Python library designed for data analysis and manipulation, featuring data structures like Series and DataFrame for efficient handling of structured data. It offers key functionalities such as data manipulation, integration with other libraries, and support for various data formats, making it widely used in data science and analytics. The document also provides practical examples of applications in data cleaning, financial analysis, exploratory data analysis, and machine learning preprocessing.

Uploaded by

Abhishek Dutta
© All Rights Reserved

Introduction to Python Pandas

Pandas is a powerful open-source Python library for data analysis and manipulation. It provides
data structures like DataFrame and Series that make handling structured data (like tables and
time-series) easy and efficient. Pandas is widely used in data science, machine learning, and
analytics due to its versatility and high-level abstractions for managing datasets.

Key Features of Pandas

1. Data Structures:
    Series: One-dimensional, similar to a column in Excel or a 1D NumPy array.
    DataFrame: Two-dimensional, like a table with rows and columns.
2. Data Manipulation:
    Filtering, sorting, grouping, and aggregation.
3. Integration:
    Works seamlessly with other libraries like NumPy, Matplotlib, and Scikit-learn.
4. Data I/O:
    Read and write data in various formats such as CSV, Excel, SQL, JSON, etc.
5. Time-Series Support:
    Provides functionality for analyzing and processing time-series data.
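The two core structures can be sketched in a few lines (the values here are illustrative):

```python
import pandas as pd

# A Series: one-dimensional labelled data
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# A DataFrame: two-dimensional table of rows and columns
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

print(s)
print(df)
```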

Applications of Pandas

1. Data Cleaning

Real-time Example:

 Task: Cleaning customer data by handling missing values.

import pandas as pd

# Sample data
data = {'Name': ['Alice', 'Bob', None, 'Eve'], 'Age': [25, 30, None, 40]}
df = pd.DataFrame(data)

# Handling missing values (assign back rather than using the
# deprecated chained inplace=True pattern)
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())

print(df)
2. Financial Analysis

Real-time Example:

 Task: Analyzing stock market data.

import pandas as pd

# Loading sample stock data


df = pd.read_csv('https://example.com/stock_prices.csv', parse_dates=['Date'])

# Filtering for a specific company; copy() avoids a
# SettingWithCopyWarning when assigning the new column below
apple_stock = df[df['Company'] == 'Apple'].copy()

# Calculating a 20-day moving average
apple_stock['Moving_Avg'] = apple_stock['Close'].rolling(window=20).mean()

print(apple_stock.head())

3. Exploratory Data Analysis (EDA)

Real-time Example:

 Task: Analyzing a dataset of sales.

# Loading sales data


sales_data = pd.read_csv('https://example.com/sales_data.csv')

# Grouping sales by region


region_sales = sales_data.groupby('Region')['Sales'].sum()

# Plotting the data


region_sales.plot(kind='bar', title='Sales by Region')

4. Time Series Analysis

Real-time Example:

 Task: Forecasting electricity demand based on past data.

# Loading time series data


df = pd.read_csv('https://example.com/electricity_demand.csv',
parse_dates=['Timestamp'])

# Resampling data to hourly averages


hourly_demand = df.resample('h', on='Timestamp')['Demand'].mean()  # 'H' is deprecated in pandas 2.2+

print(hourly_demand.head())
5. Machine Learning Preprocessing

Real-time Example:

 Task: Preparing data for a machine learning model.

# Loading data
data = pd.read_csv('https://example.com/housing_data.csv')

# Dropping irrelevant columns


data.drop(['ID'], axis=1, inplace=True)

# Encoding categorical features


data = pd.get_dummies(data, columns=['City'], drop_first=True)

# Normalizing numerical features


data['Price'] = (data['Price'] - data['Price'].mean()) / data['Price'].std()

print(data.head())

6. Web Scraping and Analysis

Real-time Example:

 Task: Scraping live product prices and analyzing them.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Scraping data from a website


response = requests.get('https://example.com/products')
soup = BeautifulSoup(response.text, 'html.parser')

# Extracting product names and prices


products = {'Name': [], 'Price': []}
for product in soup.select('.product-item'):
    products['Name'].append(product.select_one('.name').text)
    products['Price'].append(float(product.select_one('.price').text.strip('$')))

df = pd.DataFrame(products)

# Analyzing product prices


print(df.describe())

Why Use Pandas?


 Handles large datasets efficiently.
 Provides intuitive data manipulation tools.
 Simplifies working with different data formats.
 Integrates well with visualization and machine learning tools.

What Can Pandas Do?

Pandas helps you answer questions about your data, such as:

 Is there a correlation between two or more columns?
 What is the average value?
 What is the maximum value?
 What is the minimum value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
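A short sketch of these questions in code, on a small hypothetical dataset:

```python
import pandas as pd

# Hypothetical sales data containing a NULL row
df = pd.DataFrame({'Units': [10, 20, 30, None],
                   'Revenue': [100.0, 210.0, 290.0, None]})

# Correlation between two columns
corr = df['Units'].corr(df['Revenue'])

# Average, max, and min values (NaN is skipped automatically)
avg = df['Units'].mean()
mx = df['Units'].max()
mn = df['Units'].min()

# "Cleaning": drop rows with empty/NULL values
clean = df.dropna()
print(corr, avg, mx, mn, len(clean))
```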

Pandas Series:
A Pandas Series is a one-dimensional array capable of storing various data types. We can easily convert a list, tuple, or dictionary into a Series using the Series() constructor. The row labels of a Series are called the index. A Series cannot contain multiple columns. It takes the following parameters:

 data: Any list, dictionary, or scalar value.
 index: Index values should be unique and hashable, and of the same length as data. If no index is passed, a default index of 0 to n-1 (like np.arange(n)) is used.
 dtype: The data type of the Series.
 copy: Whether to copy the input data.
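A brief illustration of these parameters (values are illustrative):

```python
import pandas as pd

# data, index, and dtype in action
s = pd.Series(data=[1, 2, 3], index=['a', 'b', 'c'], dtype='float64')
print(s)

# Without an explicit index, a default RangeIndex 0..n-1 is used
t = pd.Series([10, 20, 30])
print(t.index)
```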

Creating a Series:
We can create a Series in two ways:

1. Create an empty Series


2. Create a Series using inputs.

Create an Empty Series:


We can easily create an empty Series in Pandas, meaning it holds no values.

The syntax used for creating an empty Series:

<series object> = pandas.Series()

The example below creates an empty Series object. Note that recent pandas versions default an empty Series to dtype object, so we pass dtype='float64' explicitly to keep the classic float64 default:

Example

import pandas as pd
x = pd.Series(dtype='float64')
print(x)
Output

Series([], dtype: float64)

Creating a Series using inputs:


We can create Series by using various inputs:

o Array
o Dict
o Scalar value
Creating Series from Array:

Before creating a Series from an array, we first have to import the numpy module and use its array() function. If the data is an ndarray, any index passed must be of the same length.

If we do not pass an index, a default index of range(n) is used, where n is the length of the array, i.e., [0, 1, 2, ..., len(array)-1].

Example

import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output

0    P
1    a
2    n
3    d
4    a
5    s
dtype: object
Create a Series from dict
We can also create a Series from a dict. If a dictionary is passed as input and no index is specified, the dictionary keys are used, in insertion order, to construct the index (older pandas versions sorted the keys).

If an index is passed, the values corresponding to each label in the index will be extracted from the dictionary.

# import the pandas library
import pandas as pd
info = {'x' : 0., 'y' : 1., 'z' : 2.}
a = pd.Series(info)
print(a)
Output

x 0.0
y 1.0
z 2.0
dtype: float64
Create a Series using Scalar:

If we pass a scalar value, an index must be provided. The scalar value is repeated to match the length of the index.

# import pandas library
import pandas as pd
x = pd.Series(4, index=[0, 1, 2, 3])
print(x)
Output

0    4
1    4
2    4
3    4
dtype: int64
Accessing data from series with Position:
Once you create a Series object, you can access its index, data, and individual elements.

Data in a Series can be accessed much like in an ndarray. Positional access with plain [] is deprecated for labelled Series in recent pandas, so iloc is preferred:

import pandas as pd
x = pd.Series([1,2,3], index = ['a','b','c'])
# retrieve the first element by position
print (x.iloc[0])
Output
1
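Label-based access complements positional access; a short sketch (values illustrative):

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Label-based access with loc
print(x.loc['b'])

# Positional slicing with iloc returns a sub-Series
print(x.iloc[0:2])

# A boolean mask selects matching elements
print(x[x > 1])
```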

Series object attributes

A Series attribute is any information related to the Series object, such as its size, datatype, etc. Below are some of the attributes that you can use to get information about a Series object:

Attribute             Description
Series.index          Defines the index of the Series.
Series.shape          Returns a tuple with the shape of the data.
Series.dtype          Returns the data type of the data.
Series.size           Returns the size of the data.
Series.empty          Returns True if the Series object is empty, otherwise False.
Series.hasnans        Returns True if there are any NaN values, otherwise False.
Series.nbytes         Returns the number of bytes in the data.
Series.ndim           Returns the number of dimensions in the data.
Series.dtype.itemsize Returns the size of the datatype of an item (the old Series.itemsize attribute was removed in pandas 1.0).

Retrieving Index array and data array of a series object


We can retrieve the index array and data array of an existing Series object by using the index and values attributes.

import numpy as np
import pandas as pd
x = pd.Series(data=[2,4,6,8])
y = pd.Series(data=[11.2,18.6,22.5], index=['a','b','c'])
print(x.index)
print(x.values)
print(y.index)
print(y.values)
Output

RangeIndex(start=0, stop=4, step=1)


[2 4 6 8]
Index(['a', 'b', 'c'], dtype='object')
[11.2 18.6 22.5]

Retrieving Types (dtype) and Size of Type (itemsize)

You can use the dtype attribute of a Series object (<objectname>.dtype) to retrieve the data type of its elements. To see the number of bytes allocated to each data item, use <objectname>.dtype.itemsize (the old Series.itemsize attribute was removed in pandas 1.0).

import numpy as np
import pandas as pd
a = pd.Series(data=[1,2,3,4])
b = pd.Series(data=[4.9,8.2,5.6], index=['x','y','z'])
print(a.dtype)
print(a.dtype.itemsize)
print(b.dtype)
print(b.dtype.itemsize)
Output

int64
8
float64
8

Retrieving Shape
The shape of a Series object gives the total number of elements, including missing or empty values (NaN).

import numpy as np
import pandas as pd
a = pd.Series(data=[1,2,3,4])
b = pd.Series(data=[4.9,8.2,5.6], index=['x','y','z'])
print(a.shape)
print(b.shape)
Output
(4,)
(3,)

Retrieving Dimension, Size and Number of bytes:

import numpy as np
import pandas as pd
a = pd.Series(data=[1,2,3,4])
b = pd.Series(data=[4.9,8.2,5.6], index=['x','y','z'])
print(a.ndim, b.ndim)
print(a.size, b.size)
print(a.nbytes, b.nbytes)
Output

1 1
4 3
32 24

Checking Emptiness and Presence of NaNs

To check whether a Series object is empty, you can use the empty attribute. Similarly, to check whether a Series object contains any NaN values, you can use the hasnans attribute.

Example

import numpy as np
import pandas as pd
a = pd.Series(data=[1,2,3,np.nan])
b = pd.Series(data=[4.9,8.2,5.6], index=['x','y','z'])
c = pd.Series(dtype='float64')
print(a.empty, b.empty, c.empty)
print(a.hasnans, b.hasnans, c.hasnans)
print(len(a), len(b))
print(a.count(), b.count())
Output

False False True
True False False
4 3
3 3
Series Functions
There are some functions used in Series which are as follows:

Function                      Description
Pandas Series.map()           Maps values using a function or from another Series with a common key.
Pandas Series.std()           Calculates the standard deviation of the given set of numbers, DataFrame, column, or rows.
Pandas Series.to_frame()      Converts the Series object to a DataFrame.
Pandas Series.value_counts()  Returns a Series that contains counts of unique values.
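A quick sketch exercising these four functions on a small Series (values illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3])

# map(): transform each value via a dict or function
mapped = s.map({1: 'one', 2: 'two', 3: 'three'})

# std(): sample standard deviation (ddof=1 by default)
sd = s.std()

# to_frame(): convert the Series into a one-column DataFrame
frame = s.to_frame(name='value')

# value_counts(): counts of unique values
counts = s.value_counts()

print(mapped.tolist(), round(sd, 3), frame.shape, counts.loc[2])
```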
Python DataFrame: Reading CSV and JSON, and Performing Analysis
Functions
Python's pandas library provides powerful tools for handling, manipulating, and analyzing
structured data.

1. Python DataFrame: Reading CSV


Definition
pd.read_csv(): Reads a comma-separated values (CSV) file into a DataFrame.
CSV files are widely used for storing tabular data in various fields such as finance, healthcare,
and e-commerce.
Real-Time Scenario
Finance: Reading a CSV containing stock market data to analyze trends.
E-commerce: Reading product sales data for generating reports.
Example: Reading CSV and Basic Operations
import pandas as pd

# Reading a CSV file


df = pd.read_csv("sales_data.csv")

# Displaying first 5 rows


print(df.head())

# Scenario: Calculate total sales


total_sales = df["sales_amount"].sum()
print(f"Total Sales: {total_sales}")
2. Python DataFrame: Reading JSON
Definition
pd.read_json(): Reads a JSON file into a DataFrame.
JSON is a popular format for transmitting data in web applications and APIs.
Real-Time Scenario
Web Development: Reading user details from a JSON API response.
Social Media Analysis: Reading JSON containing user activity for engagement reports.
Example: Reading JSON and Basic Operations
import pandas as pd

# Reading a JSON file


df = pd.read_json("user_data.json")

# Displaying first 5 rows


print(df.head())

# Scenario: Filter users above age 30


filtered_users = df[df["age"] > 30]
print(filtered_users)

3. Python DataFrame: Analysis Functions


Definition
Pandas provides a wide range of functions to analyze and manipulate data, such as
summarization, filtering, grouping, and visualization.
Real-Time Scenario
Healthcare: Summarizing patient data for trend analysis.
Marketing: Grouping customer purchases by region for targeted campaigns.
Analysis Functions
Summarization Functions

df.describe(): Provides a statistical summary of numerical columns.
df.mean(), df.sum(), df.count(): Calculate the mean, sum, or count of values (pass numeric_only=True on DataFrames with mixed column types).
Example: Statistical Summary

df = pd.read_csv("employee_data.csv")
print(df.describe()) # Summary of numerical columns

# Scenario: Calculate average salary


avg_salary = df["salary"].mean()
print(f"Average Salary: {avg_salary}")
Filtering and Querying

df.loc[]: Filter rows by label.


df[df["column_name"] > value]: Conditional filtering.
Example: Filter Data

# Scenario: Employees with salary > 50000


high_salary = df[df["salary"] > 50000]
print(high_salary)
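df.loc[] itself can be sketched on a tiny hypothetical frame, independent of the employee CSV above:

```python
import pandas as pd

df = pd.DataFrame({'salary': [40000, 60000]}, index=['alice', 'bob'])

# loc selects rows by label
print(df.loc['bob'])

# loc also accepts a boolean mask for conditional filtering
print(df.loc[df['salary'] > 50000])
```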
Grouping and Aggregation

df.groupby(): Groups data by specified columns and applies aggregation functions.


Example: Group Sales by Region
# Scenario: Total sales by region
grouped_sales = df.groupby("region")["sales_amount"].sum()
print(grouped_sales)
Sorting

df.sort_values(): Sorts the DataFrame by specified columns.


Example: Sort Employees by Salary

sorted_employees = df.sort_values(by="salary", ascending=False)


print(sorted_employees)
Handling Missing Data

df.isnull(): Checks for missing values.


df.fillna(): Fills missing values with a specified value.
df.dropna(): Drops rows/columns with missing values.
Example: Handle Missing Values

# Scenario: Replace missing salaries with 30000


df["salary"] = df["salary"].fillna(30000)
print(df)
Merging and Joining

pd.merge(): Merges two DataFrames.


df.join(): Joins DataFrames on indices.
Example: Merge Employee and Department Data

departments = pd.DataFrame({"dept_id": [1, 2], "dept_name": ["HR", "Finance"]})


merged_df = pd.merge(df, departments, left_on="dept_id", right_on="dept_id")
print(merged_df)
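Since the example above only shows pd.merge(), here is a minimal sketch of df.join(), which aligns on the index (the frames are hypothetical):

```python
import pandas as pd

employees = pd.DataFrame({'name': ['Alice', 'Bob']}, index=[1, 2])
salaries = pd.DataFrame({'salary': [50000, 60000]}, index=[1, 2])

# join() aligns the two frames on their indices by default
joined = employees.join(salaries)
print(joined)
```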
Visualization

df.plot(): Generates basic plots.


df.hist(): Creates histograms.

Example: Plot Sales Data


import matplotlib.pyplot as plt
# Scenario: Sales Trend
df["sales_amount"].plot(kind="line")
plt.title("Sales Trend")
plt.show()
Functions Summary

Function          Purpose                                       Example Use Case
pd.read_csv()     Load data from a CSV file into a DataFrame    Load sales or employee data.
pd.read_json()    Load data from a JSON file into a DataFrame   Load API response for user activity.
df.describe()     Statistical summary of numerical columns      Summarize patient statistics.
df.groupby()      Group data and apply aggregation functions    Calculate total sales per region.
df.sort_values()  Sort data by specified columns                Rank employees by salary.
df.fillna()       Fill missing values                           Replace missing product prices.
df.plot()         Visualize data using basic plots              Analyze sales trends over months.
Data Cleaning Functions in Python DataFrames

Data cleaning is a crucial step in preparing datasets for analysis. Pandas provides several
functions to clean and preprocess data. Below is a detailed explanation of key data-cleaning
techniques, real-time scenarios, and example codes.

Common Data Issues and Pandas Cleaning Functions

Issue                    Function/Technique                        Description
Missing Values           isnull(), notnull(), fillna(), dropna()   Identify, fill, or remove missing data.
Duplicate Rows           duplicated(), drop_duplicates()           Detect and remove duplicate rows.
Incorrect Data Types     astype(), to_datetime()                   Convert data to appropriate types.
Outliers                 clip(), replace(), filtering              Handle extreme values.
Invalid Values           Filtering, apply(), replace()             Replace or correct invalid entries.
Inconsistent Formatting  str.lower(), str.strip(), str.replace()   Standardize text data for consistency.
Removing Unwanted Data   Filtering, drop()                         Drop irrelevant rows or columns.
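These techniques can be chained into one small cleaning pass; a sketch on a hypothetical messy frame:

```python
import pandas as pd

# A messy hypothetical dataset touching several of the issues above
df = pd.DataFrame({
    'Product': [' apple', 'Banana', 'Banana', None],
    'Price': ['10', '20', '20', '30'],
})

df['Product'] = df['Product'].str.strip().str.lower()  # inconsistent formatting
df['Price'] = df['Price'].astype(int)                  # incorrect data type
df = df.dropna(subset=['Product'])                     # missing values
df = df.drop_duplicates()                              # duplicate rows
print(df)
```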

1. Handling Missing Data

Scenario: A sales dataset has missing values for revenue.

import pandas as pd

# Sample DataFrame with missing values


data = {
"Product": ["A", "B", "C", None],
"Sales": [100, 200, None, 150],
"Revenue": [500, None, 300, 400],
}
df = pd.DataFrame(data)

# Identify missing values


print(df.isnull())

# Fill missing values


df["Revenue"] = df["Revenue"].fillna(df["Revenue"].mean())

# Drop rows with missing Product


df = df.dropna(subset=["Product"])
print(df)

Functions

 isnull(): Checks for missing values.


 fillna(value): Replaces missing values with a specified value.
 dropna(): Removes rows or columns with missing values.

2. Removing Duplicates

Scenario: A customer dataset has duplicate entries.

# Sample DataFrame with duplicates


data = {"Customer": ["Alice", "Bob", "Alice"], "Purchase": [200, 300, 200]}
df = pd.DataFrame(data)

# Detect duplicates
print(df.duplicated())

# Remove duplicates
df = df.drop_duplicates()
print(df)

Functions

 duplicated(): Identifies duplicate rows.


 drop_duplicates(): Removes duplicate rows.

3. Converting Data Types

Scenario: Date data is in string format and needs conversion.

data = {"Date": ["2024-12-01", "2024-12-02", "2024-12-03"],
        "Sales": ["100", "200", "300"]}
df = pd.DataFrame(data)

# Convert Sales to numeric


df["Sales"] = pd.to_numeric(df["Sales"])

# Convert Date to datetime


df["Date"] = pd.to_datetime(df["Date"])

print(df.dtypes)

Functions
 astype(type): Converts a column to the specified type.
 pd.to_datetime(): Converts a column to datetime format.
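astype() is listed above but not used in the example; a minimal sketch (columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Sales': ['100', '200'], 'Flag': [1, 0]})

# astype(): cast columns to explicit types
df['Sales'] = df['Sales'].astype(int)
df['Flag'] = df['Flag'].astype(bool)

print(df.dtypes)
```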

4. Handling Outliers

Scenario: Sales data contains extreme outliers.

data = {"Sales": [100, 200, 300, 10000]}


df = pd.DataFrame(data)

# Cap sales at 500


df["Sales"] = df["Sales"].clip(upper=500)

print(df)

Functions

 clip(lower, upper): Limits values within a specified range.


 replace(): Replaces specified values.
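Besides clip(), the table above also lists filtering as a technique for outliers; a common sketch uses the 1.5×IQR rule (the thresholds are a convention, not part of the original example):

```python
import pandas as pd

df = pd.DataFrame({'Sales': [100, 200, 300, 10000]})

# Keep only rows within 1.5 * IQR of the quartiles
q1, q3 = df['Sales'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['Sales'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = df[mask]
print(filtered)
```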

5. Replacing Invalid or Incorrect Values

Scenario: Age column has invalid negative values.

data = {"Name": ["Alice", "Bob"], "Age": [25, -5]}


df = pd.DataFrame(data)

# Clamp negative ages to 0 (a mean-based replacement would also work)
df["Age"] = df["Age"].apply(lambda x: max(x, 0))

print(df)

Functions

 replace(to_replace, value): Replaces values based on conditions.


 apply(func): Applies a custom function to transform data.
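replace(to_replace, value) is listed above but only apply() is shown; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, -5]})

# replace(): swap a specific invalid value for a corrected one
df['Age'] = df['Age'].replace(-5, 0)
print(df)
```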

6. Standardizing Text
Scenario: Product names have inconsistent capitalization and whitespace.

data = {"Product": [" apple", "Orange ", "BANANA"]}


df = pd.DataFrame(data)

# Standardize text
df["Product"] = df["Product"].str.strip().str.lower()

print(df)

Functions

 str.strip(): Removes leading/trailing whitespace.


 str.lower(): Converts text to lowercase.
 str.replace(pattern, replacement): Replaces text based on a pattern.
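str.replace() with a regex pattern is often the workhorse here; a small sketch on hypothetical price strings:

```python
import pandas as pd

df = pd.DataFrame({'Price': ['$1,000', '$2,500']})

# str.replace() with regex=True strips currency symbols and separators
df['Price'] = df['Price'].str.replace(r'[$,]', '', regex=True).astype(int)
print(df)
```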

7. Dropping Irrelevant Data

Scenario: Drop unnecessary columns like "ID".

data = {"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"], "Score": [85,
90, 95]}
df = pd.DataFrame(data)

# Drop ID column
df = df.drop(columns=["ID"])

print(df)

Functions

 drop(columns): Removes specified columns.


 drop(indexes): Removes specified rows.

8. Applying Filters

Scenario: Retain rows where revenue > 300.

data = {"Product": ["A", "B", "C"], "Revenue": [200, 400, 300]}


df = pd.DataFrame(data)

# Filter rows
filtered_df = df[df["Revenue"] > 300]

print(filtered_df)
9. Handling Categorical Data

Scenario: Replace categorical values with labels.

data = {"Gender": ["Male", "Female", "Male"]}


df = pd.DataFrame(data)

# Replace categories with numeric labels


df["Gender"] = df["Gender"].replace({"Male": 0, "Female": 1})

print(df)

Summary of Functions

Function           Use Case                               Real-Time Scenario
isnull()           Identify missing values                Detect missing survey responses.
fillna()           Fill missing data                      Replace missing prices with the average.
dropna()           Remove rows/columns with missing data  Drop incomplete customer records.
duplicated()       Detect duplicate rows                  Find duplicate orders in e-commerce data.
drop_duplicates()  Remove duplicate rows                  Clean duplicate customer entries.
astype()           Convert column data types              Convert numeric strings to integers.
replace()          Replace specific values                Replace "NA" with a default value in a column.
clip()             Cap outliers                           Limit revenue to a specific range.
str.strip()        Remove extra spaces                    Clean up messy product names.
drop()             Drop irrelevant data                   Remove ID columns for analysis.
