Data Wrangling with Python Lab Manual

Module-I Data Wrangling with Python

What Is Data Wrangling?


Data Wrangling is the process of cleaning, organizing, and transforming raw data into a
more usable format to prepare it for analysis. This is crucial because raw data often contains
errors, inconsistencies, or irrelevant information that can skew the results of any analysis.
Data wrangling ensures that the data is in the right structure and format to derive
meaningful insights.

Importance of Data Wrangling:


1. Improves Data Quality: It ensures that data is accurate, complete, and relevant,
minimizing errors in analysis.
2. Increases Efficiency: Properly wrangled data makes analysis faster, leading to
quicker decision-making.
3. Ensures Consistency: Cleaning the data eliminates duplicate or redundant
information.
4. Enhances Data Usability: Raw data is often unstructured and requires wrangling to
make it suitable for processing by algorithms and machine learning models.

How Is Data Wrangling Performed?


1. Data Collection: Gather raw data from various sources such as databases, APIs, or
files.
2. Data Exploration: Understand the structure, format, and potential issues in the data.
3. Data Cleaning: Remove duplicates, handle missing values, correct inconsistencies, and
deal with outliers.
4. Data Transformation: Convert the data into a format suitable for analysis, such as
normalizing, aggregating, or converting data types.
5. Data Integration: Combine multiple data sources into a cohesive dataset.
6. Data Validation: Ensure the final dataset meets the requirements and is error-free.

Tasks of Data Wrangling:


- Handling missing data (filling, interpolation, or deletion).
- Correcting data inconsistencies (format errors, incorrect data types).
- Filtering and selecting relevant data.
- Removing duplicates and irrelevant data.
- Aggregating data (grouping or summarizing).
- Transforming and normalizing data (scaling, converting categorical to numerical).

Data Wrangling Tools:


1. Python Libraries: Pandas, NumPy, Matplotlib, Seaborn.
2. R Programming: dplyr, tidyr.
3. SQL: For database querying and manipulation.
4. Excel/Google Sheets: For smaller data wrangling tasks.
5. OpenRefine: A powerful tool for data cleaning and transformation.

Introduction to Python
Python is a versatile, high-level programming language widely used for data science,
machine learning, web development, and automation. Its simplicity, readability, and rich
library ecosystem make it a popular choice for beginners and professionals alike.

Python Basics:
- Variables: Used to store data values.
```python
x=5
name = 'John'
```
- Data Types: int, float, str, list, dict, etc.
- Control Structures: if, for, while, etc.
- Functions: Used to modularize code.
```python
def greet():
    print('Hello, World!')
```
- Libraries: Python has extensive libraries for data wrangling, including pandas, numpy,
and csv.

Data Meant to Be Read by Machines


Machine-readable data refers to structured data that computers can process directly. Some
common formats include:
1. CSV (Comma-Separated Values): A simple file format for tabular data.
2. JSON (JavaScript Object Notation): A lightweight format for storing and exchanging
data, commonly used in APIs.
3. XML (eXtensible Markup Language): A markup language that defines rules for
encoding documents in a format that is both human-readable and machine-readable.

CSV Data
CSV (Comma-Separated Values) files are used to store tabular data, with each line in the
file representing a row, and each field separated by a comma.
Example:
```csv
name,age,city
John,25,New York
Jane,30,Los Angeles
```

JSON Data
JSON is used to represent structured data in a readable text format, often used in web
applications for transmitting data.
Example:
```json
{
    "name": "John",
    "age": 25,
    "city": "New York"
}
```

XML Data
XML is used to describe data in a hierarchical structure, making it useful for representing
complex data models.
Example:
```xml
<person>
  <name>John</name>
  <age>25</age>
  <city>New York</city>
</person>
```
Experiment – 1: Develop a Python Program for Reading and Writing CSV Files
Here’s a basic Python program to read and write CSV files using the csv module.
Program:

import csv

# Writing to a CSV file
data = [['Name', 'Age', 'City'],
        ['John', '25', 'New York'],
        ['Jane', '30', 'Los Angeles']]

with open('people.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

# Reading from a CSV file
with open('people.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

CSV File (people.csv):

Name,Age,City
John,25,New York
Jane,30,Los Angeles

Output:
Experiment – 2: Develop a Python Program for Reading XML Files
This Python program reads XML data using the xml.etree.ElementTree module.

Program:
import xml.etree.ElementTree as ET

# Parsing an XML file
tree = ET.parse('data.xml')
root = tree.getroot()

# Iterating through the XML
for person in root.findall('person'):
    name = person.find('name').text
    age = person.find('age').text
    city = person.find('city').text
    print(f'Name: {name}, Age: {age}, City: {city}')

XML File (data.xml):

<?xml version="1.0"?>
<people>
  <person>
    <name>John</name>
    <age>25</age>
    <city>New York</city>
  </person>
  <person>
    <name>Jane</name>
    <age>30</age>
    <city>Los Angeles</city>
  </person>
</people>
Output:

Experiment – 3: Develop a Python Program for Reading and Writing JSON to a File
Here’s how to read and write JSON using Python’s json module.
Program:
import json

# Writing multiple rows of JSON data to a file
data = [
    {"name": "John", "age": 25, "city": "New York"},
    {"name": "Jane", "age": 30, "city": "Los Angeles"},
    {"name": "Doe", "age": 40, "city": "Chicago"}
]

with open('data.json', 'w') as json_file:
    json.dump(data, json_file, indent=4)

# Reading multiple rows of JSON data from a file
with open('data.json', 'r') as json_file:
    data = json.load(json_file)

for person in data:
    print(person)

Output:
Module-II: Working with Excel Files and PDFs
This module focuses on working with Excel files and PDFs using Python, key tasks for
automating and processing data efficiently. We’ll cover how to parse and manipulate these
files, how to install the required Python packages, and introduce some basic database
concepts that provide alternative data storage options. The hands-on experiments will give
you experience with real-world scenarios, such as converting files between formats and
parsing data.

1. Installing Python Packages


To work with Excel and PDF files in Python, you’ll need to install specific libraries.
Some common libraries include:

• pandas: For handling data in various formats (CSV, Excel, TSV, etc.).
• openpyxl: For reading and writing Excel files.
• pdfminer.six: For extracting text from PDFs.

Example: Installing necessary packages

pip install pandas openpyxl pdfminer.six

This command installs all the required libraries. Once installed, you can start using them
in your Python scripts.

2. Parsing Excel Files


Excel files (.xlsx) are commonly used for storing tabular data, and Python offers several
packages for reading and writing Excel files.

2.1 Reading Excel Files

• Library: pandas or openpyxl
• Task: Read an Excel file and manipulate it as a DataFrame using the pandas library.

2.2 Writing to an Excel File

• Task: Write a DataFrame back to an Excel file using pandas (see the sketch below, which covers both reading and writing).
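A minimal sketch of both operations, assuming a workbook named 'employees.xlsx' exists in the working directory (the file and column names are placeholders) and that openpyxl is installed as the .xlsx engine:

```python
import pandas as pd

# Reading: load the first sheet of the workbook into a DataFrame
df = pd.read_excel('employees.xlsx')
print(df.head())

# Writing: save a DataFrame to a new Excel file without the index column
out = pd.DataFrame({'Name': ['John', 'Jane'], 'Age': [25, 30]})
out.to_excel('employees_out.xlsx', index=False)
```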


3. Parsing PDFs
PDF parsing can be more complex than Excel, as PDFs do not have a structured tabular
format. However, Python libraries like pdfminer.six allow for extracting text from PDFs,
which can then be further processed.

3.1 Extracting Text from PDFs

• Library: pdfminer.six
• Task: Extract raw text from a PDF document.

3.2 Converting PDF to Text and Processing
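As a minimal sketch of both steps (the file name 'report.pdf' is a placeholder), pdfminer.six provides a high-level helper that returns the document's text as one string, which can then be split into lines for further processing:

```python
from pdfminer.high_level import extract_text

# Extract all text from a PDF into a single string
text = extract_text('report.pdf')

# Basic processing: keep only non-empty lines for further parsing
lines = [line for line in text.splitlines() if line.strip()]
print(lines[:5])
```

Experiment 6 below builds on the same idea to turn extracted PDF text into an Excel file.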

4. Converting Between File Formats


4.1 Converting a TSV File to Excel

A Tab-Separated Values (TSV) file is similar to a CSV file, but columns are separated by tabs instead of commas. Converting TSV to Excel is straightforward using pandas.

5. Databases: A Brief Introduction


Relational Databases:

• Relational databases, like MySQL and PostgreSQL, store data in structured tables with rows and columns. They are suitable when you need complex queries, relationships between datasets, and strong consistency.

Non-Relational Databases (NoSQL):

• Non-relational databases, such as MongoDB, store data in a flexible, document-oriented format (e.g., JSON). They are preferred when scalability and flexibility are needed, such as in large-scale web apps.

When to Use: Relational databases for structured data and complex queries; NoSQL databases for flexibility and scalability.
Experiments:

Experiment 4: Develop a Python Program for Reading an Excel File

Program:

import pandas as pd

# Load the Excel file
df = pd.read_excel('sample_data.xlsx')

# Display the first 5 rows
print(df.head())

Excel File (sample_data.xlsx):

Name         Age  City
John Doe     28   New York
Jane Smith   34   Los Angeles
Emily Davis  22   Chicago
Output:
Experiment 5: Develop a Python Program for Converting a TSV File into Excel

Program:
import pandas as pd

# Read the TSV file
df = pd.read_csv('data.tsv', sep='\t')

# Convert to Excel and save
df.to_excel('output_data.xlsx', index=False)

print('TSV file successfully converted to Excel!')

TSV File (data.tsv):

Name     Age  City
Alice    24   Seattle
Bob      30   Portland
Charlie  29   San Francisco
David    35   New York
Output:

Experiment 6: Develop a Python Program for Converting a PDF File into Excel

Program:

from pdfminer.high_level import extract_text
import pandas as pd

# Step 1: Extract text from the PDF
text = extract_text('sample.pdf')

# Step 2: Replace (cid:9) with actual tab characters
text = text.replace('(cid:9)', '\t')

# Step 3: Split the text by lines
lines = text.strip().split('\n')

# Step 4: Inspect and parse the lines into structured data
data = []
for line in lines[1:]:
    columns = line.split('\t')
    print(f"Parsed Line: {columns}")
    data.append(columns)

# Step 5: Ensure all rows have 3 columns before proceeding
clean_data = [row for row in data if len(row) == 3]

# Step 6: Create a DataFrame with appropriate column names
df = pd.DataFrame(clean_data, columns=['Name', 'Age', 'City'])

# Step 7: Save the DataFrame as an Excel file (use a new file name to avoid permission issues)
output_excel_path = 'output_data.xlsx'
df.to_excel(output_excel_path, index=False)

print('PDF data has been successfully converted to Excel!')


Pdf File:

Excel File (output_data.xlsx):

Name   Age  City
John   28   New York
Jane   34   Los Angeles
Emily  22   Chicago
Output:
Module-III Data Cleanup
Why Clean Data?
Data cleanup ensures that the dataset is accurate, consistent, and usable for analysis. Dirty
data can cause incorrect models, misleading results, or failed applications. Data cleaning
involves the removal or rectification of missing values, duplicates, formatting errors, and
inconsistencies.

Data Cleanup Basics


Data cleanup involves tasks such as:

• Handling missing values: Removing or imputing empty cells.
• Correcting wrong formats: Ensuring consistency in date formats, string cases, numerical types, etc.
• Removing outliers: Identifying and addressing extreme data points that can skew analysis results.

Identifying Values for Data Cleanup


Before cleaning the data, it is essential to identify issues like:

• Missing or null values.
• Incorrect data types or formats.
• Outliers or erroneous data points.
• Duplicated data entries.

Formatting Data
Formatting data involves ensuring consistency in date formats, numerical types (floats or
integers), string casing (lowercase/uppercase), and handling categorical variables.
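A short sketch of typical formatting fixes in pandas; the column names and values here are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'join_date': ['2024-01-05', '2024-02-17'],   # dates stored as strings
    'city': ['new york', 'CHICAGO'],             # inconsistent casing
    'salary': ['50000', '60000'],                # numbers stored as strings
})

# Parse date strings into a proper datetime dtype (unparseable values become NaT)
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')

# Standardize string casing
df['city'] = df['city'].str.title()

# Convert numeric strings to a numeric dtype (bad values become NaN)
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')

print(df.dtypes)
```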

Finding Outliers and Bad Data


Outliers are extreme values that differ significantly from the majority of the dataset and
can negatively affect analysis. Common methods to detect outliers include:

• Using Z-scores or the Interquartile Range (IQR).
• Visualizations such as box plots.

Finding Duplicates
Duplicates can bias your analysis, and removing them ensures the integrity of the dataset.
Python's pandas library provides methods to identify and remove duplicates.

Fuzzy Matching and RegEx Matching


• Fuzzy Matching: Useful for finding strings that are similar but not exact matches, often
applied when merging datasets.

• Regular Expressions (RegEx): Useful for pattern matching, such as identifying specific
string formats like email addresses or phone numbers.
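A minimal sketch using only the standard library: difflib for approximate matches and re for pattern matching (the sample city names and the email pattern are illustrative, not production-grade):

```python
import difflib
import re

# Fuzzy matching: map a misspelled value to the closest known city name
known_cities = ['New York', 'Los Angeles', 'Chicago']
match = difflib.get_close_matches('Los Angelos', known_cities, n=1, cutoff=0.6)
print(match)  # ['Los Angeles']

# RegEx matching: keep only strings that look like email addresses
emails = ['john@example.com', 'not-an-email', 'jane.doe@mail.org']
pattern = re.compile(r'^[\w.+-]+@[\w-]+\.[\w.]+$')
print([e for e in emails if pattern.match(e)])
```

Libraries such as rapidfuzz offer more refined similarity scores, but difflib is enough to show the idea.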

Normalizing and Standardizing Data


• Normalization: Scales data to a specific range, typically [0, 1].

• Standardization: Centers data around the mean and scales it by standard deviation. This
is useful in machine learning algorithms.
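Normalization is demonstrated in Experiment 9; as a complementary sketch, standardization can be done with scikit-learn's StandardScaler (the feature values here are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Feature1': [10, 20, 30, 40, 50]})

# Standardization: subtract the mean and divide by the standard deviation
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_standardized)
```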

Saving the Data


Once data is cleaned, saving the cleaned dataset is essential for further analysis and
modeling without repeating the process.

Scripting the Cleanup


Automating the cleanup process with Python scripts ensures the procedure is repeatable,
especially when new data is added to the dataset.

Experiment 7: Develop a Python Program for cleaning empty cells and cleaning wrong format

Program:

import pandas as pd

# Sample DataFrame with missing values and wrong format
data = {
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 30, 'Twenty'],
    'Salary': [50000, 60000, None, 80000]
}

df = pd.DataFrame(data)

# Cleaning empty cells by filling with default values or dropping rows
df['Name'] = df['Name'].fillna('Unknown')  # No inplace=True, just assignment

df['Age'] = pd.to_numeric(df['Age'], errors='coerce')  # Convert 'Age' to numeric, NaN for errors
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Assign back after filling missing values

df = df.dropna(subset=['Salary'])  # No inplace, assign back to df

print("Cleaned DataFrame:")
print(df)

Output:
Experiment 8: Develop a Python Program for finding duplicates in a data frame

Program:

import pandas as pd

# Sample DataFrame with duplicates
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Alice'],
    'Age': [25, 30, 25, 40, 25]
}

df = pd.DataFrame(data)

# Finding duplicates
duplicates = df.duplicated()

print("Duplicated Rows:")
print(df[duplicates])

# Removing duplicates
df_no_duplicates = df.drop_duplicates()

print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)
Output:

Experiment 9: Develop a Python Program for normalizing data

Program:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample DataFrame for normalization
data = {
    'Feature1': [10, 20, 30, 40, 50],
    'Feature2': [1, 2, 3, 4, 5]
}

df = pd.DataFrame(data)

# Normalizing data using MinMaxScaler
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print("Normalized DataFrame:")
print(df_normalized)
Output:
Module IV: Data Exploration and Analysis
1. Exploring Data
Exploring data involves a preliminary examination of the data to understand its characteristics.
This is where you look at basic statistics, identify data types, check for missing values, and get an
overall sense of the dataset. Key Steps include: overview of the dataset, summary statistics,
distribution checks, and identifying missing values.

2. Importing Data
Data import is the process of bringing external data into your Python environment for analysis.
Data can be imported from various file types like CSV, Excel, SQL databases, and web APIs. In
Python, methods to import data include:
• CSV Files: pandas.read_csv("filename.csv")
• Excel Files: pandas.read_excel("filename.xlsx")
• SQL Databases: Using sqlalchemy or sqlite3 for connecting and querying databases (see the sqlite3 sketch after this list).
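A minimal sketch of the SQL route, using an in-memory sqlite3 database so it runs without any setup (the table and column names are invented for illustration):

```python
import sqlite3
import pandas as pd

# Create an in-memory database and insert a couple of rows
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT, age INTEGER)')
conn.executemany('INSERT INTO people VALUES (?, ?)', [('John', 25), ('Jane', 30)])
conn.commit()

# Import the query result directly into a pandas DataFrame
df = pd.read_sql_query('SELECT * FROM people', conn)
print(df)
conn.close()
```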

3. Exploring Table Functions


Table functions allow you to interact with and manipulate datasets for better understanding.
Functions include head() and tail() to inspect rows, info() for data types, and describe() for
summary statistics.

4. Joining Numerous Datasets


Joining datasets is combining multiple datasets to create a unified dataset for analysis. This
includes Inner Join, Outer Join, Left Join, and Right Join. In Python, tools like merge() and
concat() from pandas are used.
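A small, hypothetical example of joining two DataFrames on a shared key with merge(), plus concat() for stacking tables with the same columns:

```python
import pandas as pd

customers = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
orders = pd.DataFrame({'id': [1, 2, 2], 'amount': [250, 100, 75]})

# Inner join: keep only ids that appear in both DataFrames
print(pd.merge(customers, orders, on='id', how='inner'))

# Outer join: keep every id, filling missing values with NaN
print(pd.merge(customers, orders, on='id', how='outer'))

# concat(): stack DataFrames with the same columns on top of each other
print(pd.concat([customers, customers], ignore_index=True))
```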

5. Identifying Correlations
Correlation measures the statistical relationship between two variables. It can show whether
changes in one variable predict changes in another. A correlation matrix shows the relationship
between all variables. Visualizing correlations using a heatmap is common.
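A brief sketch with pandas' corr() and a seaborn heatmap (the columns and numbers are invented for illustration):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Hours': [1, 2, 3, 4, 5],
    'Score': [52, 58, 65, 71, 80],
    'Absences': [8, 6, 5, 3, 1],
})

# Correlation matrix of all numeric columns
corr = df.corr()
print(corr)

# Heatmap visualization of the correlation matrix
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
```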

6. Identifying Outliers
Outliers are data points that significantly differ from the rest of the dataset. Methods for
identifying outliers include the Standard Deviation method and Interquartile Range (IQR).
Visualization tools like box plots and histograms help in identifying outliers.
7. Creating Groupings
Grouping data involves categorizing data into different segments for analysis. The groupby()
function in pandas allows grouping data based on specific categories, and applying aggregation
methods like mean or sum.
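A short example of grouping invented sales records by city and aggregating with mean and sum:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],
    'Sales': [250, 100, 300, 150],
})

# Group rows by city and compute the average and total sales per group
summary = df.groupby('City')['Sales'].agg(['mean', 'sum'])
print(summary)
```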

8. Analyzing Data - Separating and Focusing the Data


Analyzing data involves separating relevant features for focused exploration. Methods include
filtering data using conditions or selecting specific columns for subsetting.

9. Presenting Data
After analysis, presenting data involves summarizing insights and using visuals like charts and
graphs to communicate findings clearly.

10. Visualizing the Data


Visualizations make data easier to interpret by presenting it in graphical formats. Common
visualizations include bar charts, histograms, pie charts, line charts, and scatter plots.

11. Time-Related Data


Time-related data involves handling datasets with temporal components, such as stock prices or
weather trends. Techniques like time series visualization and rolling averages are used for
analysis.
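Experiment 12 below plots a monthly series; as a complementary sketch, a rolling average smooths short-term fluctuations (the three-month window is an arbitrary choice for illustration):

```python
import pandas as pd

dates = pd.date_range(start='2024-01-01', periods=12, freq='MS')
sales = pd.Series([200, 210, 215, 220, 230, 250, 245, 260,
                   270, 275, 290, 300], index=dates)

# 3-month rolling average; the first two values are NaN until the window fills
print(sales.rolling(window=3).mean())
```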

12. Maps, Interactives, Words, Images, Video, and Illustrations


Advanced visualizations include geographic maps, word clouds, and interactive charts. Python
libraries like folium, geopandas, plotly, and bokeh are used to create these visualizations.

13. Presentation Tools


Presentation tools like Tableau, Power BI, and Google Data Studio allow creating interactive
reports and dashboards for sharing results. Python libraries like matplotlib and seaborn are also
used for building visuals.

14. Publishing the Data - Open-Source Platforms


Publishing data involves sharing your analysis on platforms like GitHub or Kaggle. Tools like
Tableau Public can be used to share interactive dashboards.
Experiments
Experiment 10: Python Program for Detecting and Removing Outliers

Program:

import pandas as pd
import numpy as np

# Sample data
data = {
    'Value': [10, 12, 14, 18, 90, 13, 15, 14, 300, 17, 13, 12, 16, 10]
}

df = pd.DataFrame(data)

# Detecting outliers using IQR
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detecting outliers
outliers = df[(df['Value'] < lower_bound) | (df['Value'] > upper_bound)]
print("Outliers detected:\n", outliers)

# Removing outliers
df_no_outliers = df[(df['Value'] >= lower_bound) & (df['Value'] <= upper_bound)]
print("Data after removing outliers:\n", df_no_outliers)


Output:

Experiment 11: Python Program for Drawing Bar Chart, Histogram, and Pie Chart

Program:

import matplotlib.pyplot as plt

# Sample data for visualizations
categories = ['Category A', 'Category B', 'Category C', 'Category D']
values = [23, 45, 56, 78]

# Bar chart
plt.figure(figsize=(6,4))
plt.bar(categories, values, color='blue')
plt.title('Bar Chart')
plt.show()

# Histogram
data = [10, 12, 13, 15, 18, 18, 19, 20, 23, 25, 29, 30, 31, 32, 35]
plt.figure(figsize=(6,4))
plt.hist(data, bins=5, color='green')
plt.title('Histogram')
plt.show()

# Pie chart
plt.figure(figsize=(6,4))
plt.pie(values, labels=categories, autopct='%1.1f%%', startangle=140,
        colors=['blue', 'orange', 'green', 'red'])
plt.title('Pie Chart')
plt.show()

Output:
Experiment 12: Python Program for Time Series Visualization
Program:

import pandas as pd
import matplotlib.pyplot as plt

# Creating a sample time series DataFrame with 'MS' for month start
date_rng = pd.date_range(start='2024-01-01', end='2024-12-31', freq='MS')
data = {'Sales': [200, 210, 215, 220, 230, 250, 245, 260, 270, 275, 290, 300]}
df = pd.DataFrame(data, index=date_rng)

# Time series plot
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['Sales'], marker='o', linestyle='-', color='blue')
plt.title('Monthly Sales Time Series')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.grid(True)
plt.show()

Output:
Module V: Web Scraping
Web scraping is the process of extracting data from websites. It is a critical skill in data analysis
and machine learning, especially when the required data isn't available in structured formats like
CSV or databases. This module covers various aspects of web scraping, including techniques and
tools to interact with web pages and extract meaningful information.

1. What to Scrape and How


What to Scrape: Identify the information you need from a website, such as product details, news
articles, or user reviews. Not all content on a webpage is relevant, so knowing what to scrape helps
to target the exact data needed.

How to Scrape: There are different methods like using requests for simple pages, or more advanced
tools like Selenium or Scrapy for pages that require interaction or load dynamically.

2. Analyzing a Web Page


This involves understanding the structure of a webpage by inspecting its HTML elements.
Tools like Chrome DevTools help in finding the tags (e.g., <div>, <p>, <span>) that contain the
required data. It is important to locate elements correctly before writing a scraper.

3. Network/Timeline
Network Analysis: Using browser dev tools, you can inspect network requests to understand how
data is loaded. This is especially useful for scraping dynamically loaded content, such as AJAX
calls.

Timeline: Helps to track when different elements load on a page, useful when dealing with
JavaScript-heavy websites.

4. Interacting with JavaScript


Some websites load content dynamically using JavaScript. Scrapers like Selenium can automate
interactions with JavaScript-based content, allowing you to scrape data that would otherwise be
invisible with static HTML scraping methods.

5. In-Depth Analysis of a Page


Understanding how different elements are nested and structured is crucial for writing effective
scraping scripts. This involves deep inspection of HTML tags, attributes, and JavaScript functions
that might load data dynamically.

6. Getting Pages
This involves sending HTTP requests to a URL to fetch the HTML content. Libraries like requests
in Python are commonly used for this. The response can then be parsed to extract the needed data.
7. Reading a Web Page with LXML and XPath
LXML: A powerful library for parsing XML and HTML documents. It is faster than BeautifulSoup
and allows more complex parsing.

XPath: A language used for navigating through elements and attributes in an XML/HTML
document. It allows precise selection of elements, making it effective for scraping.
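A minimal sketch of lxml with XPath, parsing an invented HTML snippet rather than a live page so it runs offline:

```python
from lxml import html

page = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <p class="intro">First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

tree = html.fromstring(page)

# XPath expressions select the title text and every paragraph's text
print(tree.xpath('//title/text()')[0])
print(tree.xpath('//p/text()'))
print(tree.xpath('//p[@class="intro"]/text()'))
```

For a live page, the same tree can be built from a requests response with html.fromstring(response.content).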

8. Advanced Web Scraping - Browser-Based Parsing


Selenium: A tool that automates browsers, useful for scraping pages that require user interactions
like clicking buttons or filling forms. Selenium can simulate real user actions.

Ghost.py: A headless browser scraping tool, suitable for web scraping without opening a browser
window.

9. Screen Reading with Selenium


Screen Reading: Automating the extraction of visible elements using Selenium.
This is especially useful when you need to scrape data that requires scrolling or clicking to appear.
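A minimal Selenium sketch (Selenium 4 syntax; it assumes a compatible browser driver such as chromedriver is installed, and https://example.com is a placeholder URL):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a browser session (requires a matching driver on the PATH)
driver = webdriver.Chrome()
driver.get('https://example.com')

# Read elements from the fully rendered page, including JavaScript-generated content
print(driver.title)
for p in driver.find_elements(By.TAG_NAME, 'p'):
    print(p.text)

driver.quit()
```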

10. Spidering the Web - Building a Spider with Scrapy


Scrapy: A powerful and fast web crawling framework in Python that automates the extraction and
storage of data from websites. It is ideal for large-scale scraping projects.

Building a Spider: A spider is a class in Scrapy that defines how to follow links and extract content
from the target website.

11. Crawling Whole Websites with Scrapy


Scrapy spiders can be set to follow links and scrape multiple pages on a website. This process is
called crawling. Scrapy manages the crawling efficiently and allows data storage in formats like
JSON or CSV.

Experiments:
Experiment 13: Develop a Python Program for Reading an HTML Page
Program:

import requests
from bs4 import BeautifulSoup

# URL of the webpage to be scraped
url = 'https://example.com'  # Replace with the URL you want to scrape

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract and print the title of the page
    title = soup.title.string
    print("Title of the page:", title)

    # Extract all paragraphs and print their text content
    paragraphs = soup.find_all('p')
    print("\nParagraphs:")
    for i, paragraph in enumerate(paragraphs, start=1):
        print(f"Paragraph {i}: {paragraph.get_text()}")
else:
    print("Failed to retrieve the web page. Status code:", response.status_code)

Web Page:
Output:

Experiment 14: Develop a Python Program for Building a Spider Using Scrapy

Program:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']  # Replace with the target URL

    def parse(self, response):
        # Extracting the page title
        title = response.xpath('//title/text()').get()
        yield {'Title': title}

        # Extracting all paragraphs' text
        paragraphs = response.xpath('//p/text()').getall()
        for i, paragraph in enumerate(paragraphs, start=1):
            yield {f'Paragraph {i}': paragraph}


Web Page:

Output:
