Week 2: Data Collection & Cleaning
It is often said that data scientists spend around 80% of their time
cleaning and preparing data.
This week you move from simple spreadsheets to professional data handling.
You will learn to clean and transform messy data into a format ready for
analysis using both Excel and the industry-standard Python library,
Pandas.
1. Data Types and Structures
Before importing data, you must understand what you are looking
at.
Data in Excel and Python generally falls into these categories:
Qualitative (Categorical): Descriptive data.
o Nominal: No inherent order (e.g., Eye color, Country).
o Ordinal: Has a specific order (e.g., Satisfaction ratings: Low,
Medium, High).
Quantitative (Numerical): Measured data.
o Discrete: Whole numbers/counts (e.g., Number of children).
o Continuous: Any value in a range (e.g., Weight, Temperature).
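The four categories above can be made explicit in Pandas. The sketch below is illustrative only: the column names and values are made up, and an ordered categorical type is one reasonable way (not the only one) to represent ordinal data.

```python
import pandas as pd

# Hypothetical survey data: one column per data type discussed above
df = pd.DataFrame({
    "eye_color": ["brown", "blue", "green"],     # nominal
    "satisfaction": ["Low", "High", "Medium"],   # ordinal
    "children": [0, 2, 1],                       # discrete
    "weight_kg": [61.5, 82.3, 70.0],             # continuous
})

# Nominal: a category with no order
df["eye_color"] = df["eye_color"].astype("category")

# Ordinal: a category with an explicit order, so comparisons make sense
levels = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)
df["satisfaction"] = df["satisfaction"].astype(levels)

print(df.dtypes)
print(df["satisfaction"].min())  # smallest level according to the declared order
```

Declaring the order matters because plain text would sort alphabetically (High, Low, Medium) rather than by meaning.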
Data Structure: Good data follows the Tidy Data principle:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
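As a small illustration of these rules, the sketch below reshapes a "wide" table, where the year is hidden in the column headers, into a tidy one. The table and its column names are invented for the example.

```python
import pandas as pd

# Hypothetical wide table: the variable "year" is stored in the headers
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2022": [10, 20],
    "2023": [12, 23],
})

# melt() turns the year columns into rows: one observation per row
tidy = wide.melt(id_vars="country", var_name="year", value_name="sales")
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

After reshaping, every variable (country, year, sales) has its own column, which is the layout most analysis tools expect.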
2. Importing Data into Excel
Data rarely starts inside an Excel file. You will often need to pull it from
external sources.
Importing CSV and Text Files
1. Go to the Data tab.
2. Click Get Data > From File > From Text/CSV.
3. Key Concept: Pay attention to the Delimiter (the character
separating the data, usually a comma or tab).
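The same delimiter question comes up when reading text files with Pandas. In this sketch, io.StringIO stands in for a file on disk and the contents are made up; it shows what happens when the delimiter is right and when it is wrong.

```python
import io
import pandas as pd

# Hypothetical file contents (io.StringIO stands in for a real file)
comma_text = "name,city\nAna,Lima\nBen,Oslo\n"
tab_text = "name\tcity\nAna\tLima\nBen\tOslo\n"

comma_df = pd.read_csv(io.StringIO(comma_text))        # comma is the default
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")  # tab must be stated

# Wrong delimiter: the tab file read as comma-separated collapses
# into a single column
wrong_df = pd.read_csv(io.StringIO(tab_text))

print(tab_df.shape)
print(wrong_df.shape)
```

A table that arrives as one wide column is usually the first sign that the delimiter was guessed incorrectly.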
Importing from the Web
Excel can scrape tables directly from URLs.
1. Data > From Web.
2. Paste the URL. Excel will scan the page for HTML tables and let you
preview them before importing.
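To make "scanning a page for HTML tables" concrete, here is a minimal sketch using only Python's standard library parser plus Pandas. The TableFinder class and the HTML snippet are hypothetical, and nested tables are not handled. Pandas also provides pd.read_html for this job, but it needs an extra parser package such as lxml installed.

```python
from html.parser import HTMLParser
import pandas as pd

class TableFinder(HTMLParser):
    """Collects the cell text of every <table> in an HTML string."""

    def __init__(self):
        super().__init__()
        self.tables = []        # each table is a list of rows
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._in_cell = True
            self.tables[-1][-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1][-1] += data.strip()

# Hypothetical page content with two tables
html = """
<table><tr><th>State</th><th>Capital</th></tr>
<tr><td>Ohio</td><td>Columbus</td></tr></table>
<table><tr><th>Year</th><th>Value</th></tr>
<tr><td>2023</td><td>7</td></tr></table>
"""

finder = TableFinder()
finder.feed(html)

# First row of each table becomes the header, like Excel's preview
frames = [pd.DataFrame(rows[1:], columns=rows[0]) for rows in finder.tables]
for i, frame in enumerate(frames):
    print(f"Table {i}:", list(frame.columns))
```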
Example: Importing U.S. Population Data from Wikipedia
Step-by-Step Walkthrough:
1. Go to Data Tab
Click on Data → Get Data → From Other Sources → From Web
In newer Excel versions (Microsoft 365, 2021+), the path is: Data → Get &
Transform Data → From Web
2. Paste the URL
[Link]
List_of_states_and_territories_of_the_United_States
3. Click OK
4. Navigator Window Opens
Excel scans the webpage and shows:
o Left pane: List of tables found (usually named "Table 0",
"Table 1", etc.)
o Right pane: Preview of selected table
For Wikipedia's U.S. states page:
o Table 1 = Basic state info (Name, Capital, Population, etc.)
o You can click through tables to find the one you want
5. Select the Population Table
Click on the table with population data (usually Table 1 or 2)
Click Load to import directly
OR click Transform Data to clean it first
Real Example URL for Practice:
Try this URL with economic data:
[Link]
What happens:
1. Excel finds 3-4 tables on that page
2. You'll see tables like:
o "Table 1" = IMF estimates
o "Table 2" = United Nations estimates
o "Table 3" = World Bank estimates
3. Click "Table 1" and see IMF GDP data preview
4. Click Load → Data appears in Excel!
Cleaning Tips (Transform Data):
If the table needs cleaning before importing:
1. Click Transform Data instead of Load
2. Power Query Editor opens where you can:
o Remove unnecessary columns
o Filter rows
o Change data types
o Clean text (remove footnotes like "[a]", "[b]")
3. Click Close & Load when done
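The footnote cleanup in particular is easy to express in code. The sketch below assumes footnote markers are a single lowercase letter in square brackets, as in the "[a]", "[b]" examples above; the table values are made up.

```python
import pandas as pd

# Hypothetical scraped table with footnote markers left in the text
scraped = pd.DataFrame({
    "State": ["Ohio[a]", "Texas", "Utah[b]"],
    "Value": ["1,200[c]", "3,400", "560"],
})

footnote = r"\[[a-z]\]"  # assumption: markers look like [a], [b], ...

cleaned = scraped.copy()
for col in cleaned.columns:
    cleaned[col] = cleaned[col].str.replace(footnote, "", regex=True)

# With the markers gone, the numeric column can change data type
cleaned["Value"] = cleaned["Value"].str.replace(",", "", regex=False).astype(int)

print(cleaned)
```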
Important Notes:
✅ Works with: Public web pages with standard HTML tables
❌ May not work with:
Password-protected sites
JavaScript-heavy pages (React/Angular apps)
Data behind login walls
Interactive "infinite scroll" tables
Pro Tip: For complex sites, it is sometimes better to:
1. Save page as .html file
2. Use Data → From File → From HTML
3. Data Cleaning (Transforming) in Excel
Messy data leads to "Garbage In, Garbage Out" (GIGO).
Use these essential tools to fix it:
Remove Duplicates: Highlight your range → Data tab → Remove
Duplicates.
This ensures each record is unique.
Text-to-Columns: Used to split one column into many (e.g.,
splitting "Full Name" into "First Name" and "Last Name" using a
space delimiter).
Data Validation: Prevents bad data entry. Highlight a cell → Data
tab → Data Validation.
You can restrict entry to a "List," "Date," or "Whole Number."
Find and Replace (Ctrl + H): Quickly swap out errors (e.g.,
replacing "USA" with "United States" for consistency).
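For comparison, three of these tools have close Pandas counterparts. This is a rough sketch with invented data; Data Validation has no direct equivalent here because it guards data entry rather than transforming existing data.

```python
import pandas as pd

# Hypothetical contact list with a duplicate and inconsistent country names
df = pd.DataFrame({
    "Full Name": ["Ana Diaz", "Ben Olsen", "Ana Diaz"],
    "Country": ["USA", "United States", "USA"],
})

# Remove Duplicates
df = df.drop_duplicates().reset_index(drop=True)

# Text-to-Columns: split on the first space
df[["First Name", "Last Name"]] = df["Full Name"].str.split(" ", n=1, expand=True)

# Find and Replace (matches whole cell values only)
df["Country"] = df["Country"].replace("USA", "United States")

print(df)
```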
Practical Example: Cleaning Sales Data from a Messy CSV File
This walkthrough cleans a real-world messy dataset step by step:
Scenario: E-commerce Sales Data Cleanup
We have a CSV file with these common data problems:
Inconsistent date formats
Mixed text in number columns
Missing values
Extra spaces and symbols
Inconsistent categories
Download the messy_sales.csv file and import it into Excel.
This is the original raw data file:
Order_ID,Customer_Name,Order_Date,Product,Quantity,Price,Region
1001, John Smith ,01/15/2023,Laptop,1,"$1,200.00",North
1002,Jane Doe,2023-02-01, "Mouse" ,3,25.99,SOUTH
1003,Bob Johnson,15-03-2023,Keyboard,2,45.50,West
1004,Alice Brown,,Monitor,1,300,North East
1005, "Charlie Davis" ,04/30/2023,Headphones,five,79.99,SOUTH
1006,Eva Green,05/15/2023,Laptop,2,"1200",West
1007,,06/01/2023,Mouse,1,$25,North
Step-by-Step Cleaning Process
Step 1: Import Data
1. Data Tab → Get Data → From File → From Text/CSV
2. Select messy_sales.csv
3. Click Transform Data (NOT Load) - This opens Power Query Editor
Step 2: Power Query Editor Interface
You'll see:
Left: Queries pane
Center: Data preview
Right: Query Settings (applied steps)
Top: Transform ribbon
Step 3: Data Cleaning Steps
1. Remove Extra Spaces from Text Columns
1. Select "Customer_Name" column
2. Transform Tab → Format → Trim
3. Select "Product" column → Transform → Format → Trim
4. Select "Region" column → Transform → Format → Trim
2. Fix Inconsistent Date Format
1. Select "Order_Date" column
2. Data Type button → Change to Date
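3. If errors appear, right-click the column header → Change Type → Using Locale
4. Choose: Data Type → Date, Locale → English (United States)
5. Values in other layouts (e.g., the day-first 15-03-2023) may still show as
errors and need to be corrected separately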
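The same date problem can be sketched in Pandas. This example assumes the three layouts seen in the raw file (month/day/year, ISO year-month-day, and day-month-year) and tries each in turn; it is one possible approach, not the only one.

```python
import pandas as pd

# Order_Date values from the first rows of the raw file (None = missing)
raw = pd.Series(["01/15/2023", "2023-02-01", "15-03-2023", None, "04/30/2023"])

# Assumed layouts, tried in order; values that do not match become NaT
formats = ["%m/%d/%Y", "%Y-%m-%d", "%d-%m-%Y"]

parsed = pd.to_datetime(raw, format=formats[0], errors="coerce")
for fmt in formats[1:]:
    # Only fill the gaps left by the earlier formats
    parsed = parsed.fillna(pd.to_datetime(raw, format=fmt, errors="coerce"))

print(parsed)
```

Stating the formats explicitly avoids silent misreads of ambiguous dates such as 03/04/2023, where guessing could swap day and month.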
3. Clean Price Column (Remove $ and commas)
1. Select "Price" column
2. Transform Tab → Replace Values
- Value to Find: `$`
- Replace With: (leave empty)
3. Replace Values again:
- Value to Find: `,`
- Replace With: (leave empty)
4. Change Data Type to Decimal Number
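A Pandas sketch of the same price cleanup, using the Price values from the raw file: remove the symbols as plain text, then convert.

```python
import pandas as pd

# Price values as they appear in the raw file
price = pd.Series(["$1,200.00", "25.99", "45.50", "300", "79.99", "1200", "$25"])

clean_price = (
    price.str.replace("$", "", regex=False)  # regex=False: "$" is literal text
         .str.replace(",", "", regex=False)
         .astype(float)
)

print(clean_price)
```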
4. Fix Quantity Column (Convert "five" to 5)
1. Select "Quantity" column
2. Transform Tab → Replace Values
- Value to Find: `five`
- Replace With: `5`
3. Change Data Type to Whole Number
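In Pandas, it helps to locate the non-numeric entries before replacing them. The sketch below uses the Quantity values from the raw file; the replacement mapping is written by hand, just as in the Power Query step.

```python
import pandas as pd

# Quantity values as they appear in the raw file
quantity = pd.Series(["1", "3", "2", "1", "five", "2", "1"])

# Find entries that cannot be read as numbers
bad = quantity[pd.to_numeric(quantity, errors="coerce").isna()]
print(bad.tolist())

# Replace the known bad value, then convert to whole numbers
fixed = quantity.replace({"five": "5"}).astype(int)
print(fixed.tolist())
```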
5. Standardize Region Names
1. Select "Region" column
2. Transform Tab → Format → UPPERCASE
3. Replace Values (tick "Match entire cell contents" under Advanced options):
- "SOUTH" → "South"
- "NORTH EAST" → "Northeast"
- "NORTH" → "North"
- "WEST" → "West"
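A Pandas sketch of the region cleanup follows. Note that it uses title case plus one replacement instead of the uppercase-then-replace route above; the end result is the same for these values.

```python
import pandas as pd

# Region values as they appear in the raw file
region = pd.Series(["North", "SOUTH", "West", "North East", "SOUTH", "West", "North"])

standard = (
    region.str.strip()                         # remove stray spaces
          .str.title()                         # SOUTH -> South
          .replace({"North East": "Northeast"})  # whole-value replacement
)

print(standard.tolist())
```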
6. Handle Missing Values
1. Find Order_ID 1004 with missing date
Option A: Fill Down from previous row
- Right-click Order_Date header → Fill → Down
Option B: Add placeholder
- Replace null values with "Date Missing"
2. Find Customer_Name missing for Order_ID 1007
- Replace null values with "Unknown Customer"
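The placeholder option can be sketched in Pandas as follows, using the two affected rows from the raw file. Keep in mind that a text placeholder such as "Date Missing" turns the date column into text, so this suits reporting more than date arithmetic.

```python
import pandas as pd

# The two rows with gaps in the raw file (None = missing)
df = pd.DataFrame({
    "Order_ID": [1004, 1007],
    "Customer_Name": ["Alice Brown", None],
    "Order_Date": [None, "06/01/2023"],
})

print(df.isna().sum())  # missing values per column before filling

df["Customer_Name"] = df["Customer_Name"].fillna("Unknown Customer")
df["Order_Date"] = df["Order_Date"].fillna("Date Missing")

print(df)
```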
7. Remove Duplicate Rows
1. Home Tab → Remove Rows → Remove Duplicates
2. Power Query removes exact duplicate rows
8. Add Calculated Column (Total Sales)
1. Add Column Tab → Custom Column
2. New Column Name: `Total_Sales`
3. Custom Column Formula: `[Quantity] * [Price]`
4. Click OK
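The calculated column is a one-line operation in Pandas. This sketch uses the cleaned Quantity and Price values from the earlier steps; rounding to two decimals is an added assumption to keep currency values tidy.

```python
import pandas as pd

# Cleaned Quantity and Price values from the earlier steps
orders = pd.DataFrame({
    "Quantity": [1, 3, 2, 1, 5, 2, 1],
    "Price": [1200.0, 25.99, 45.50, 300.0, 79.99, 1200.0, 25.0],
})

# Row-by-row multiplication, rounded to cents
orders["Total_Sales"] = (orders["Quantity"] * orders["Price"]).round(2)

print(orders)
```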
9. Filter Out Problem Rows (Optional)
1. Click drop-down on "Quantity" column
2. Uncheck "Error" (if any conversion errors)
3. Do same for "Price" and "Order_Date" columns
10. Reorder Columns
1. Click and drag column headers to reorder:
- Order_ID
- Order_Date
- Customer_Name
- Region
- Product
- Quantity
- Price
- Total_Sales
Step 4: Review Applied Steps
Look at Query Settings → Applied Steps:
1. Source
2. Changed Type
3. Trimmed Text
4. Replaced Value (multiple)
5. Filled Down
6. Added Custom
7. Filtered Rows
8. Reordered Columns
You can:
Click any step to see intermediate result
Click the X next to a step to delete it
Drag steps to reorder
Step 5: Load Clean Data
1. Home Tab → Close & Load
2. Choose where to load:
o Close & Load: Creates new worksheet with cleaned data
o Close & Load To: Choose specific location
Step 6: Final Cleaned Data Output
Order_ID  Order_Date    Customer_Name     Region     Product     Quantity  Price  Total_Sales
1001      1/15/2023     John Smith        North      Laptop      1         1200   1200
1002      2/1/2023      Jane Doe          South      Mouse       3         25.99  77.97
1003      3/15/2023     Bob Johnson       West       Keyboard    2         45.50  91.00
1004      Date Missing  Alice Brown       Northeast  Monitor     1         300    300
1005      4/30/2023     Charlie Davis     South      Headphones  5         79.99  399.95
1006      5/15/2023     Eva Green         West       Laptop      2         1200   2400
1007      6/1/2023      Unknown Customer  North      Mouse       1         25     25
Pro Tips for Data Cleaning:
1. Create Reusable Templates:
- After cleaning, right-click the query → Duplicate
- Open the copy in Home → Advanced Editor
- Change the source file path; all transformations then apply to the new file
2. Parameterize Queries:
1. Home → Manage Parameters → New Parameter
2. Name: FilePath
3. Set to your CSV path
4. In Source step, replace hardcoded path with parameter
3. Error Handling:
- Right-click column header → Replace Errors
- Choose value to insert (0, blank, "Error")
4. Data Profiling:
- View Tab → check "Column Distribution" and "Column Profile"
- See data quality at a glance
5. Refresh Automation:
- Data Tab → Refresh All (updates when source changes)
- Data Tab → Properties → set refresh frequency
Common Cleanup Formulas in Excel (Alternative Approach):
If not using Power Query, use these formulas:
1. TRIM(A2) # Remove extra spaces
2. DATEVALUE(A2) # Convert text to date
3. VALUE(SUBSTITUTE(A2,"$","")) # Remove $ and convert
4. PROPER(A2) # Capitalize first letters
5. IFERROR(formula, "N/A") # Handle errors
6. XLOOKUP() # Standardize categories
Key Benefits of Power Query Cleaning:
✅ Non-destructive: Original data unchanged
✅ Repeatable: Click Refresh to clean new data
✅ Audit trail: Every step recorded
✅ Scalable: Can process datasets larger than a worksheet's 1,048,576-row
limit (load them to the Data Model)
Try this with your own messy data! Just remember: Transform first,
Load last to avoid manual cleanup.
4. Introduction to Python Pandas
Excel is great for small tasks, but Pandas is the powerhouse for large-
scale data manipulation.
Loading Data
To use Pandas, we first import the library and then load a dataset (usually
a CSV).
import pandas as pd
# Load a CSV file
df = pd.read_csv('[Link]')
# Inspect the first 5 rows
print(df.head())
# Get summary info about columns and data types
# (info() prints its report directly, so no print() is needed)
df.info()
Basic Inspection Commands:
df.describe(): Provides statistical summaries (mean, min, max).
df.shape: Shows the number of rows and columns.
df.isnull().sum(): Counts missing values in each column.
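To see what these commands report, here is a small self-contained run on an invented three-row table with one missing value.

```python
import pandas as pd

# Hypothetical mini dataset with one missing Amount
df = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Mouse"],
    "Amount": [1200.0, None, 25.0],
})

summary = df.describe()      # numeric columns only by default
print(summary)
print(df.shape)              # (rows, columns)
print(df.isnull().sum())     # missing values per column
```

Missing values are skipped when describe() computes the mean, so it is worth checking isnull().sum() alongside the summary.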
Lab 2: Clean a Messy Dataset
Objective: Take a raw dataset containing duplicate entries, inconsistent
formatting, and missing values, and clean it using both tools.
Part A: The Excel Challenge
1. Download: (Assume a file messy_sales.csv with columns: Date,
Customer_Name, Product, Amount).
2. Task 1: Use Remove Duplicates based on the Customer_Name
and Date.
3. Task 2: Use Text-to-Columns to split Customer_Name into
First_Name and Last_Name.
4. Task 3: Set a Data Validation rule on the Amount column to
ensure no values are less than 0.
Part B: The Python Challenge
Open a Jupyter Notebook or Google Colab and run the following:
import pandas as pd
# 1. Load the data
df = pd.read_csv('messy_sales.csv')
# 2. Drop duplicate rows
df = df.drop_duplicates()
# 3. Handle missing values (fill with 0 or drop)
df['Amount'] = df['Amount'].fillna(0)
# 4. Save the cleaned file
df.to_csv('cleaned_sales.csv', index=False)
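If you do not have the file at hand, the sketch below runs the same steps on made-up in-memory data with the assumed columns (Date, Customer_Name, Product, Amount). It also mirrors Part A by removing duplicates based on Customer_Name and Date only, and checks the "no negative amounts" rule.

```python
import io
import pandas as pd

# Hypothetical stand-in for messy_sales.csv
raw = io.StringIO(
    "Date,Customer_Name,Product,Amount\n"
    "2023-01-15,John Smith,Laptop,1200\n"
    "2023-01-15,John Smith,Laptop,1200\n"
    "2023-02-01,Jane Doe,Mouse,\n"
)
df = pd.read_csv(raw)

# Duplicates judged on Customer_Name and Date, as in Part A, Task 1
df = df.drop_duplicates(subset=["Customer_Name", "Date"])

# Missing amounts become 0
df["Amount"] = df["Amount"].fillna(0)

# Same rule as the Data Validation task: no negative amounts
no_negatives = bool((df["Amount"] >= 0).all())

print(df)
print(no_negatives)
```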
Exercises & Answers
Exercise 1: Knowledge Check
1. Which Excel tool would you use to separate "City, State" into two
different columns?
2. What is the Python command to see the last 10 rows of a dataset?
3. In data types, is "Temperature" discrete or continuous?
Exercise 2: Logic Challenge
You import a dataset and notice the "Date" column is being treated as
"General" text, preventing you from sorting chronologically. How do you
fix this in Excel?
Answer Key
Ex 1, Q1: Text-to-Columns.
Ex 1, Q2: df.tail(10).
Ex 1, Q3: Continuous (it can have decimals).
Ex 2: Select the column, go to the Home tab, and change the
Number Format dropdown from "General" to "Short Date." If that
fails, use the DATEVALUE function.
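The same issue appears in Pandas when dates are stored as text. A brief sketch with invented dates, assuming a month/day/year layout, shows why the conversion matters for sorting.

```python
import pandas as pd

# Hypothetical dates stored as text (month/day/year)
df = pd.DataFrame({"Date": ["10/01/2023", "02/15/2023", "01/20/2024"]})

# Sorting text compares characters, not calendar order
text_sorted = df.sort_values("Date")["Date"].tolist()

# Convert to real dates, then sort chronologically
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
date_sorted = df.sort_values("Date")["Date"].dt.strftime("%Y-%m-%d").tolist()

print(text_sorted)
print(date_sorted)
```

As text, the January 2024 date sorts first because it starts with "01"; as a date, it correctly sorts last.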