Data Wrangling With Python Lab Manual
Introduction to Python
Python is a versatile, high-level programming language widely used for data science,
machine learning, web development, and automation. Its simplicity, readability, and rich
library ecosystem make it a popular choice for beginners and professionals alike.
Python Basics:
- Variables: Used to store data values.
```python
x = 5
name = 'John'
```
- Data Types: int, float, str, list, dict, etc.
- Control Structures: if, for, while, etc. (see the short sketch after this list)
- Functions: Used to modularize code.
```python
def greet():
    print('Hello, World!')
```
- Libraries: Python has extensive libraries for data wrangling, including pandas, numpy,
and csv.
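A short sketch combining these basics (the variable names and values are illustrative):
```python
# Common data types
age = 25                                # int
height = 5.9                            # float
name = 'John'                           # str
cities = ['New York', 'Los Angeles']    # list
person = {'name': 'John', 'age': 25}    # dict

# Control structures: for, if, while
for city in cities:
    if city == 'New York':
        print('Found:', city)

count = 0
while count < 3:
    count += 1
print('Count is', count)
```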
CSV Data
CSV (Comma-Separated Values) files are used to store tabular data, with each line in the
file representing a row, and each field separated by a comma.
Example:
```csv
name,age,city
John,25,New York
Jane,30,Los Angeles
```
JSON Data
JSON is used to represent structured data in a readable text format, often used in web
applications for transmitting data.
Example:
```json
{
  "name": "John",
  "age": 25,
  "city": "New York"
}
```
XML Data
XML is used to describe data in a hierarchical structure, making it useful for representing
complex data models.
Example:
```xml
<person>
  <name>John</name>
  <age>25</age>
  <city>New York</city>
</person>
```
Experiment – 1: Develop a Python Program for Reading and Writing CSV Files
Here’s a basic Python program to read and write CSV files using the csv module.
Program:
import csv
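A minimal sketch of such a program, assuming the data is stored in a file named people.csv:
```python
import csv

# Write rows (a header plus two records) to a CSV file
rows = [
    ['name', 'age', 'city'],
    ['John', '25', 'New York'],
    ['Jane', '30', 'Los Angeles']
]
with open('people.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)

# Read the rows back and print them
with open('people.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
```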
CSV File:
Output:
Experiment – 2: Develop a Python Program for Reading XML Files
This Python program reads XML data using the xml.etree.ElementTree module.
Program:
import xml.etree.ElementTree as ET
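A minimal sketch of such a reader, assuming the XML below is saved as people.xml:
```python
import xml.etree.ElementTree as ET

# Parse the XML file and get the root <people> element
tree = ET.parse('people.xml')
root = tree.getroot()

# Iterate over each <person> element and print its fields
for person in root.findall('person'):
    name = person.find('name').text
    age = person.find('age').text
    city = person.find('city').text
    print(name, age, city)
```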
XML File:
<?xml version="1.0"?>
<people>
  <person>
    <name>John</name>
    <age>25</age>
    <city>New York</city>
  </person>
  <person>
    <name>Jane</name>
    <age>30</age>
    <city>Los Angeles</city>
  </person>
</people>
Output:
Experiment – 3: Develop a Python Program for Reading and Writing JSON Files
This Python program writes a list of records to a JSON file and reads it back using the json module.
Program:
import json
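A minimal sketch of such a program, assuming the records are stored in a file named people.json:
```python
import json

# Write a list of records to a JSON file
data = [
    {'name': 'John', 'age': 25, 'city': 'New York'},
    {'name': 'Jane', 'age': 30, 'city': 'Los Angeles'}
]
with open('people.json', 'w') as json_file:
    json.dump(data, json_file, indent=4)

# Read the records back and print each person
with open('people.json') as json_file:
    data = json.load(json_file)
for person in data:
    print(person)
```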
Output:
Module-II: Working with Excel Files and PDFs
This module focuses on working with Excel files and PDFs using Python, key tasks for
automating and processing data efficiently. We’ll cover how to parse and manipulate these
files, how to install the required Python packages, and introduce some basic database
concepts that provide alternative data storage options. The hands-on experiments will give
you experience with real-world scenarios, such as converting files between formats and
parsing data.
• pandas: For handling data in various formats (CSV, Excel, TSV, etc.).
• openpyxl: For reading and writing Excel files.
• pdfminer.six: For extracting text from PDFs.
These packages can be installed with pip:
pip install pandas openpyxl pdfminer.six
This command installs all the required libraries. Once installed, you can start using them in your Python scripts.
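For example, a minimal sketch of reading an Excel file with pandas (which relies on openpyxl for .xlsx files; the file name sample.xlsx is an assumption):
```python
import pandas as pd

# Read the first sheet of an Excel workbook into a DataFrame
df = pd.read_excel('sample.xlsx')
print(df.head())
```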
Library: pdfminer.six
Task: Extract raw text from a PDF document.
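A minimal sketch of this task using pdfminer.six's high-level API (the file name sample.pdf is an assumption):
```python
from pdfminer.high_level import extract_text

# Extract all text from the PDF as a single string
text = extract_text('sample.pdf')
print(text)
```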
A Tab-Separated Values (TSV) file is similar to a CSV file, but columns are
separated by tabs instead of commas. Converting TSV to Excel is straightforward
using pandas.
Program:
import pandas as pd
Excel file:
Output:
Experiment 5: Develop a Python Program for Converting a TSV File into Excel
Program:
import pandas as pd
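A minimal sketch of the conversion (the file names data.tsv and data.xlsx are assumptions):
```python
import pandas as pd

# Read the tab-separated file into a DataFrame
df = pd.read_csv('data.tsv', sep='\t')

# Write the DataFrame out as an Excel workbook (requires openpyxl)
df.to_excel('data.xlsx', index=False)
print(df)
```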
TSV File:
Output:
Program:
import pandas as pd
from pdfminer.high_level import extract_text

# Step 1: Extract the raw text from the PDF
text = extract_text('sample.pdf')
# Step 2: Replace (cid:9) with actual tab characters
text = text.replace('(cid:9)', '\t')
# Step 3: Split the text into lines
lines = text.strip().split('\n')
# Step 4: Split each line into columns on the tab character
data = []
for line in lines:
    columns = line.split('\t')
    data.append(columns)
# Step 5: Treat the first extracted row as the header (assumption: the PDF table has a header row)
header, rows = data[0], data[1:]
# Step 6: Build the DataFrame
df = pd.DataFrame(rows, columns=header)
# Step 7: Save the DataFrame as an Excel file (use a new file name to avoid permission issues)
output_excel_path = 'output_data.xlsx'
df.to_excel(output_excel_path, index=False)
Excel File:
Output:
Module-III: Data Cleanup
Why Clean Data?
Data cleanup ensures that the dataset is accurate, consistent, and usable for analysis. Dirty
data can cause incorrect models, misleading results, or failed applications. Data cleaning
involves the removal or rectification of missing values, duplicates, formatting errors, and
inconsistencies.
Formatting Data
Formatting data involves ensuring consistency in date formats, numerical types (floats or
integers), string casing (lowercase/uppercase), and handling categorical variables.
• Regular Expressions (RegEx): Useful for pattern matching, such as identifying specific
string formats like email addresses or phone numbers.
• Standardization: Centers data around the mean and scales it by the standard deviation, which is useful for many machine learning algorithms. Both techniques are illustrated in the sketch below.
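A minimal sketch of both techniques, using illustrative strings and values:
```python
import re
import pandas as pd

# Regular expression check: does a string look like an email address?
pattern = r'^[\w.+-]+@[\w-]+\.[\w.]+$'
for text in ['john@example.com', 'not-an-email']:
    print(text, bool(re.match(pattern, text)))

# Standardization: subtract the mean and divide by the standard deviation
values = pd.Series([10.0, 12.0, 14.0, 18.0, 16.0])
standardized = (values - values.mean()) / values.std()
print(standardized)
```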
Program:
import pandas as pd
data = {
df = pd.DataFrame(data)
print("Cleaned DataFrame:")
print(df)
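A minimal, runnable sketch of such a cleanup, assuming a small illustrative dataset with a missing value and inconsistent string casing:
```python
import pandas as pd
import numpy as np

# Illustrative data: one missing age and mixed city casing
data = {
    'Name': ['John', 'Jane', 'Mike'],
    'Age': [25, np.nan, 30],
    'City': ['new york', 'Los Angeles', 'CHICAGO']
}
df = pd.DataFrame(data)

# Fill the missing age with the column mean and standardize the city casing
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['City'] = df['City'].str.title()

print("Cleaned DataFrame:")
print(df)
```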
Output:
Experiment 8: Develop a Python Program for Finding Duplicates in a DataFrame
Program:
import pandas as pd
data = {
df = pd.DataFrame(data)
# Finding duplicates
duplicates = df.duplicated()
print("Duplicated Rows:")
print(df[duplicates])
# Removing duplicates
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
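A runnable sketch of this experiment with an illustrative dataset:
```python
import pandas as pd

# Illustrative data containing one repeated row
data = {
    'Name': ['John', 'Jane', 'John', 'Mike'],
    'Age': [25, 30, 25, 35]
}
df = pd.DataFrame(data)

# duplicated() marks rows that repeat an earlier row
duplicates = df.duplicated()
print("Duplicated Rows:")
print(df[duplicates])

# drop_duplicates() keeps only the first occurrence of each row
df_no_duplicates = df.drop_duplicates()
print("DataFrame without duplicates:")
print(df_no_duplicates)
```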
Output:
Program:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
data = {
'Feature2': [1, 2, 3, 4, 5]
df = pd.DataFrame(data)
scaler = MinMaxScaler()
print("Normalized DataFrame:")
print(df_normalized)
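A runnable sketch of min-max normalization with scikit-learn, using an illustrative dataset (the Feature1 values are assumed; only Feature2 appears in the listing above):
```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative data with two numeric features
data = {
    'Feature1': [10, 20, 30, 40, 50],
    'Feature2': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# MinMaxScaler rescales each column to the range [0, 1]
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print("Normalized DataFrame:")
print(df_normalized)
```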
Output:
Module IV: Data Exploration and Analysis
1. Exploring Data
Exploring data involves a preliminary examination of the data to understand its characteristics.
This is where you look at basic statistics, identify data types, check for missing values, and get an
overall sense of the dataset. Key Steps include: overview of the dataset, summary statistics,
distribution checks, and identifying missing values.
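A minimal sketch of these first checks on an illustrative dataset:
```python
import pandas as pd

# Illustrative dataset with one missing value per column
df = pd.DataFrame({
    'Age': [25, 30, None, 40],
    'Salary': [50000, 60000, 65000, None]
})

# Overview of columns, dtypes, and non-null counts
df.info()
# Summary statistics
print(df.describe())
# Missing values per column
print(df.isnull().sum())
```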
2. Importing Data
Data import is the process of bringing external data into your Python environment for analysis.
Data can be imported from various file types like CSV, Excel, SQL databases, and web APIs. In
Python, methods to import data include:
• CSV Files: pandas.read_csv("filename.csv")
• Excel Files: pandas.read_excel("filename.xlsx")
• SQL Databases: Using sqlalchemy or sqlite3 for connecting and querying databases.
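A minimal sketch combining these three import paths (the file, database, and table names are hypothetical):
```python
import sqlite3
import pandas as pd

# CSV and Excel files
df_csv = pd.read_csv('data.csv')
df_xlsx = pd.read_excel('data.xlsx')

# SQL database via sqlite3
conn = sqlite3.connect('data.db')
df_sql = pd.read_sql_query('SELECT * FROM sales', conn)
conn.close()
```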
5. Identifying Correlations
Correlation measures the statistical relationship between two variables. It can show whether
changes in one variable predict changes in another. A correlation matrix shows the relationship
between all variables. Visualizing correlations using a heatmap is common.
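A minimal sketch of a correlation matrix and a simple matplotlib heatmap, using an illustrative dataset:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative numeric dataset
df = pd.DataFrame({
    'Height': [150, 160, 165, 170, 180],
    'Weight': [50, 58, 63, 70, 82],
    'Age': [21, 25, 30, 35, 40]
})

# Correlation matrix between all numeric columns
corr = df.corr()
print(corr)

# Heatmap of the correlation matrix
plt.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()
```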
6. Identifying Outliers
Outliers are data points that significantly differ from the rest of the dataset. Methods for
identifying outliers include the Standard Deviation method and Interquartile Range (IQR).
Visualization tools like box plots and histograms help in identifying outliers.
7. Creating Groupings
Grouping data involves categorizing data into different segments for analysis. The groupby()
function in pandas allows grouping data based on specific categories, and applying aggregation
methods like mean or sum.
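A minimal sketch of grouping and aggregating with an illustrative sales dataset:
```python
import pandas as pd

# Illustrative sales data
df = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'West', 'East'],
    'Sales': [100, 150, 200, 130, 170]
})

# Group by region and apply mean and sum aggregations
grouped = df.groupby('Region')['Sales'].agg(['mean', 'sum'])
print(grouped)
```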
9. Presenting Data
After analysis, presenting data involves summarizing insights and using visuals like charts and
graphs to communicate findings clearly.
Program:
import pandas as pd
import numpy as np
# Sample data
data = {
    'Value': [10, 12, 14, 18, 90, 13, 15, 14, 300, 17, 13, 12, 16, 10]
}
df = pd.DataFrame(data)
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
# Detecting outliers (values outside 1.5 * IQR of the quartiles, the standard IQR rule)
outliers = df[(df['Value'] < Q1 - 1.5 * IQR) | (df['Value'] > Q3 + 1.5 * IQR)]
print("Outliers:")
print(outliers)
# Removing outliers
df_clean = df[(df['Value'] >= Q1 - 1.5 * IQR) & (df['Value'] <= Q3 + 1.5 * IQR)]
print("Data without outliers:")
print(df_clean)
Experiment 11: Python Program for Drawing Bar Chart, Histogram, and Pie Chart
Program:
import matplotlib.pyplot as plt

# Bar chart (the categories and values here are assumed for illustration)
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 12, 30]
plt.figure(figsize=(6,4))
plt.bar(categories, values)
plt.title('Bar Chart')
plt.show()

# Histogram
data = [10, 12, 13, 15, 18, 18, 19, 20, 23, 25, 29, 30, 31, 32, 35]
plt.figure(figsize=(6,4))
plt.hist(data, bins=5)
plt.show()

# Pie chart (the slice sizes and labels here are assumed for illustration)
sizes = [35, 25, 25, 15]
labels = ['A', 'B', 'C', 'D']
plt.figure(figsize=(6,4))
plt.pie(sizes, labels=labels)
plt.title('Pie Chart')
plt.show()
Output:
Experiment 12: Python Program for Time Series Visualization
Program:
import pandas as pd
import matplotlib.pyplot as plt

# Creating a sample time series DataFrame with 'MS' for month start
date_rng = pd.date_range(start='2023-01-01', periods=12, freq='MS')  # start date assumed
data = {'Sales': [200, 210, 215, 220, 230, 250, 245, 260, 270, 275, 290, 300]}
df = pd.DataFrame(data, index=date_rng)

# Plotting the monthly sales as a line chart
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['Sales'])
plt.xlabel('Month')
plt.ylabel('Sales')
plt.grid(True)
plt.show()
Output:
Module V: Web Scraping
Web scraping is the process of extracting data from websites. It is a critical skill in data analysis
and machine learning, especially when the required data isn't available in structured formats like
CSV or databases. This module covers various aspects of web scraping, including techniques and
tools to interact with web pages and extract meaningful information.
How to Scrape: There are different methods like using requests for simple pages, or more advanced
tools like Selenium or Scrapy for pages that require interaction or load dynamically.
3. Network/Timeline
Network Analysis: Using browser dev tools, you can inspect network requests to understand how
data is loaded. This is especially useful for scraping dynamically loaded content, such as AJAX
calls.
Timeline: Helps to track when different elements load on a page, useful when dealing with
JavaScript-heavy websites.
6. Getting Pages
This involves sending HTTP requests to a URL to fetch the HTML content. Libraries like requests
in Python are commonly used for this. The response can then be parsed to extract the needed data.
7. Reading a Web Page with LXML and XPath
LXML: A powerful library for parsing XML and HTML documents. It is faster than BeautifulSoup
and allows more complex parsing.
XPath: A language used for navigating through elements and attributes in an XML/HTML
document. It allows precise selection of elements, making it effective for scraping.
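A minimal sketch of parsing a page with lxml and selecting elements with XPath (the HTML string here is illustrative):
```python
from lxml import html

# Illustrative HTML content
page = """
<html>
  <head><title>Sample Page</title></head>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</html>
"""

# Parse the document and query it with XPath expressions
tree = html.fromstring(page)
title = tree.xpath('//title/text()')[0]
paragraphs = tree.xpath('//p/text()')
print(title)
print(paragraphs)
```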
Ghost.py: A headless browser scraping tool, suitable for web scraping without opening a browser
window.
Building a Spider: A spider is a class in Scrapy that defines how to follow links and extract content
from the target website.
Experiments:
Experiment 13: Develop a Python Program for Reading an HTML Page
Program:
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'  # URL assumed for illustration
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.title.string
    print("Title:", title)
    paragraphs = soup.find_all('p')
    print("\nParagraphs:")
    for p in paragraphs:
        print(p.get_text())
else:
    print("Failed to retrieve the page:", response.status_code)
Web Page:
Output:
Program:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']  # URL assumed for illustration

    def parse(self, response):
        title = response.xpath('//title/text()').get()
        paragraphs = response.xpath('//p/text()').getall()
        yield {'title': title, 'paragraphs': paragraphs}
Output: