Python Pandas
Pandas is a powerful open-source Python library for data analysis and manipulation. It provides
data structures like DataFrame and Series that make handling structured data (like tables and
time-series) easy and efficient. Pandas is widely used in data science, machine learning, and
analytics due to its versatility and high-level abstractions for managing datasets.
1. Data Structures:
o Series: One-dimensional, similar to a column in Excel or a 1D NumPy array.
o DataFrame: Two-dimensional, like a table with rows and columns.
2. Data Manipulation:
o Filtering, sorting, grouping, and aggregation.
3. Integration:
o Works seamlessly with other libraries like NumPy, Matplotlib, and Scikit-learn.
4. Data I/O:
o Read and write data from various formats like CSV, Excel, SQL, JSON, etc.
5. Time-Series Support:
o Provides functionality for analyzing and processing time-series data.
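As a rough illustration of the features listed above, the sketch below builds a small DataFrame and applies filtering, sorting, and grouping to it. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical sales records; names and numbers are illustrative only
sales = pd.DataFrame({
    "Region": ["North", "South", "North", "South"],
    "Units": [10, 4, 7, 12],
})

# Filtering: keep rows with more than 5 units
large_orders = sales[sales["Units"] > 5]

# Sorting: order rows by units, largest first
sorted_sales = sales.sort_values("Units", ascending=False)

# Grouping and aggregation: total units per region
units_by_region = sales.groupby("Region")["Units"].sum()

print(large_orders)
print(sorted_sales)
print(units_by_region)
```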
Applications of Pandas
1. Data Cleaning
Real-time Example:
import pandas as pd
# Sample data
data = {'Name': ['Alice', 'Bob', None, 'Eve'], 'Age': [25, 30, None, 40]}
df = pd.DataFrame(data)
print(df)
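The sample data above contains missing entries in both columns. One possible way to clean it, sketched here with the same data, is to count the gaps and then either drop or fill them. The placeholder name 'Unknown' and the use of the mean age are arbitrary choices for illustration.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', None, 'Eve'], 'Age': [25, 30, None, 40]}
df = pd.DataFrame(data)

# Count missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()
print(dropped)

# Option 2: fill the gaps instead (placeholder name, mean age)
filled = df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean()})
print(filled)
```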
2. Financial Analysis
Real-time Example:
import pandas as pd
print(apple_stock.head())
Real-time Example:
Real-time Example:
print(hourly_demand.head())
5. Machine Learning Preprocessing
Real-time Example:
# Loading data
data = pd.read_csv('https://example.com/housing_data.csv')
print(data.head())
Real-time Example:
import pandas as pd
import requests
from bs4 import BeautifulSoup
products['Price'].append(float(product.select_one('.price').text.strip('$')))
df = pd.DataFrame(products)
Pandas Series:
A Pandas Series is a one-dimensional labeled array that can store values of various data
types. A list, tuple, or dictionary can be converted into a Series with the pd.Series()
constructor. The row labels of a Series are called the index. A Series cannot contain multiple
columns. It has the following parameter:
Creating a Series:
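We can create a Series in two ways: as an empty Series, or from input data.
Example
import pandas as pd
# dtype is given explicitly; recent pandas versions otherwise default an empty Series to object
x = pd.Series(dtype="float64")
print(x)
Output
Series([], dtype: float64)
A Series can be created from the following inputs:
o Array
o Dict
o Scalar value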
Creating Series from Array:
Before creating a Series from an array, we first import the numpy module and build the
array with its array() function. If the data is an ndarray, then the passed index must be of
the same length as the array.
If we do not pass an index, a default index of range(n) is used, where n is the length of
the array, i.e., [0, 1, 2, ..., len(array) - 1].
Example
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output
0    P
1    a
2    n
3    d
4    a
5    s
dtype: object
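The length rule for the index can be checked with a short sketch. The custom labels 10 to 15 are arbitrary, and the second call is expected to fail because its index is shorter than the array.

```python
import numpy as np
import pandas as pd

info = np.array(['P', 'a', 'n', 'd', 'a', 's'])

# Custom index with the same length as the array (labels are arbitrary)
labeled = pd.Series(info, index=[10, 11, 12, 13, 14, 15])
print(labeled)

# An index of a different length is rejected with a ValueError
try:
    pd.Series(info, index=[0, 1, 2])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```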
Create a Series from dict
We can also create a Series from a dict. If a dictionary is passed as input and no index is
specified, the dictionary keys are used as the index. Older pandas releases sorted the
keys; current releases keep the dictionary's insertion order.
If an index is passed, the values corresponding to each label in the index are extracted
from the dictionary; labels that are missing from the dictionary get NaN.
x    0.0
y    1.0
z    2.0
dtype: float64
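The output above is consistent with a dictionary such as the one in this sketch; the exact input is not shown in these notes, so the dictionary here is an assumption. The second Series passes an explicit index to show how labels are looked up.

```python
import pandas as pd

# Assumed input that would produce the x/y/z output shown above
info = {'x': 0., 'y': 1., 'z': 2.}
a = pd.Series(info)
print(a)

# With an explicit index, values are looked up by label;
# 'w' is not a key in the dictionary, so its value is NaN
b = pd.Series(info, index=['y', 'z', 'w'])
print(b)
```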
Create a Series using Scalar:
If the data is a scalar value, an index should be provided. The scalar value is
repeated to match the length of the index.
0    4
1    4
2    4
3    4
dtype: int64
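A call along the following lines would produce the output above. The code is not shown in these notes, so the scalar 4 and the index 0 to 3 are inferred from the output.

```python
import pandas as pd

# The scalar 4 is repeated once for every label in the index
x = pd.Series(4, index=[0, 1, 2, 3])
print(x)
```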
Accessing data from series with Position:
Once you create the Series type object, you can access its indexes, data, and even
individual elements.
The data in a Series can be accessed by position, much like an ndarray.
import pandas as pd
x = pd.Series([1,2,3],index = ['a','b','c'])
# retrieve the first element by position; x[0] on a label-indexed Series is deprecated in recent pandas
print(x.iloc[0])
Output
1
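Beyond a single element, the index, the underlying data, and slices can be read in a similar way. The sketch below reuses the same Series and is meant only as an illustration of position-based versus label-based access.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

print(x.index)            # the index labels
print(x.values)           # the underlying NumPy array
print(x.iloc[0])          # first element, by position
print(x['a'])             # the same element, by label
print(x.iloc[:2])         # first two elements, by position
print(x[['a', 'c']])      # several elements, by label
```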
Attributes Description
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],
            index=['x','y','z'])
print(a.dtype)
# Series.itemsize was removed in pandas 1.0; read the size from the dtype instead
print(a.dtype.itemsize)
print(b.dtype)
print(b.dtype.itemsize)
Output
int64
8
float64
8
Retrieving Shape
The shape of a Series is a one-element tuple giving the total number of elements, including
missing or empty values (NaN).
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.shape)
print(b.shape)
Output
(4,)
(3,)
1 1
4 3
32 24
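The stray values above match what the ndim, size, and nbytes attributes report for the a and b Series used in the shape example. The code that produced them is not shown in these notes, so the following is a reconstruction under that assumption.

```python
import pandas as pd

a = pd.Series(data=[1, 2, 3, 4])
b = pd.Series(data=[4.9, 8.2, 5.6], index=['x', 'y', 'z'])

print(a.ndim, b.ndim)      # number of dimensions (always 1 for a Series)
print(a.size, b.size)      # number of elements
print(a.nbytes, b.nbytes)  # bytes used by the underlying data
```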
Example
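import numpy as np
import pandas as pd
# np.NaN was removed in NumPy 2.0; np.nan works in all versions
a=pd.Series(data=[1,2,3,np.nan])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
# an explicit dtype avoids the ambiguous default for an empty Series
c=pd.Series(dtype="float64")
print(a.empty,b.empty,c.empty)
print(a.hasnans,b.hasnans,c.hasnans)
print(len(a),len(b))
print(a.count(),b.count())
Output
False False True
True False False
4 3
3 3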
df = pd.read_csv("employee_data.csv")
print(df.describe()) # Summary of numerical columns
pd.read_json(): Loads data from a JSON file into a DataFrame. Scenario: Load API response for user activity.
df.describe(): Statistical summary of numerical columns. Scenario: Summarize patient statistics.
Data cleaning is a crucial step in preparing datasets for analysis. Pandas provides several
functions to clean and preprocess data. Below is a detailed explanation of key data-cleaning
techniques, real-time scenarios, and example codes.
import pandas as pd
Functions
2. Removing Duplicates
# Detect duplicates
print(df.duplicated())
# Remove duplicates
df = df.drop_duplicates()
print(df)
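The snippet above assumes a DataFrame named df already exists. A self-contained sketch with made-up records, one of which is repeated, could look like this:

```python
import pandas as pd

# Hypothetical records; the third row repeats the first
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [25, 30, 25],
})

# duplicated() marks rows that repeat an earlier row
dup_mask = df.duplicated()
print(dup_mask)

# drop_duplicates() keeps the first occurrence of each row
deduped = df.drop_duplicates()
print(deduped)
```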
Functions
df = pd.DataFrame(data)
print(df.dtypes)
Functions
astype(type): Converts a column to the specified type.
pd.to_datetime(): Converts a column to datetime format.
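The two functions listed above might be used as in the following sketch. The column names and values are hypothetical, chosen so that both columns start out as text.

```python
import pandas as pd

# Hypothetical data in which numbers and dates were read in as strings
df = pd.DataFrame({
    "Age": ["25", "30", "40"],
    "Joined": ["2021-01-15", "2022-06-01", "2023-03-20"],
})

# astype converts the text column to 64-bit integers
df["Age"] = df["Age"].astype("int64")

# pd.to_datetime parses the date strings into datetime values
df["Joined"] = pd.to_datetime(df["Joined"])

print(df.dtypes)
```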
4. Handling Outliers
print(df)
Functions
print(df)
Functions
6. Standardizing Text
Scenario: Product names have inconsistent capitalization and whitespace.
# Standardize text
df["Product"] = df["Product"].str.strip().str.lower()
print(df)
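To try the standardization step on its own, a small DataFrame is needed first. The product names below are invented to show mixed capitalization and stray spaces.

```python
import pandas as pd

# Hypothetical product names with inconsistent case and whitespace
df = pd.DataFrame({"Product": ["  Laptop", "PHONE ", " tablet "]})

# Strip surrounding whitespace, then convert to lowercase
df["Product"] = df["Product"].str.strip().str.lower()
print(df)
```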
Functions
data = {"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"], "Score": [85, 90, 95]}
df = pd.DataFrame(data)
# Drop ID column
df = df.drop(columns=["ID"])
print(df)
Functions
8. Applying Filters
# Filter rows
filtered_df = df[df["Revenue"] > 300]
print(filtered_df)
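The filter above relies on a DataFrame with a Revenue column. A minimal sketch with invented store data follows; it also shows one way to combine two conditions.

```python
import pandas as pd

# Hypothetical revenue figures per store
df = pd.DataFrame({
    "Store": ["A", "B", "C", "D"],
    "Revenue": [250, 320, 300, 410],
})

# Keep rows where Revenue is strictly greater than 300
filtered_df = df[df["Revenue"] > 300]
print(filtered_df)

# Conditions are combined with & or |, each wrapped in parentheses
combined = df[(df["Revenue"] > 300) & (df["Store"] != "D")]
print(combined)
```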
9. Handling Categorical Data
print(df)