Topics for Lab:
1. Data Cleaning Tools and Techniques
2. Data Cleaning Features in Spreadsheet
3. Sorting and Filtering
4. Data Cleaning, Verifying and Reporting Results
5. Capturing Cleaning Changes
1) Data Cleaning Tools and Techniques
All collected data must be cleaned as part of data
preprocessing.
Data cleaning involves identifying and correcting or
removing inaccurate, incomplete, or irrelevant data from
a dataset.
Techniques in Cleaning Data:
1) Removing duplicates: This technique involves identifying
and removing identical records from a dataset.
2) Removing irrelevant data: This involves identifying and
removing data that is not relevant to the analysis.
3) Standardizing capitalization: This is converting all text to
a consistent case format.
4) Converting data types: converting data from one type to
another, such as converting text to numbers.
5) Clearing formatting: removing any formatting from the
data, such as bold or italicized text.
6) Fixing errors: identifying and correcting errors in the data.
7) Language translation: translating data from one language
to another.
8) Handling missing values: identifying missing data in the
dataset and either filling it in or removing the affected records.
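Several of the techniques above can be sketched in a few lines of plain Python. This is only an illustration, not a prescribed method: the records, the field names (name, city, age) and the helper names to_int and clean are invented for the example, and treating blank or unparseable ages as None is just one possible way to handle missing values.

```python
from typing import Optional

# Hypothetical raw records; field names and values are invented for illustration.
raw_rows = [
    {"name": "alice SMITH", "city": "manila", "age": "34"},
    {"name": "alice SMITH", "city": "manila", "age": "34"},  # exact duplicate
    {"name": "Bob jones", "city": "CEBU", "age": ""},         # missing age
    {"name": "carla reyes", "city": "Davao", "age": "n/a"},   # unparseable age
]


def to_int(text: str) -> Optional[int]:
    """Convert text to an integer, returning None when it cannot be parsed."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def clean(rows):
    """Return a de-duplicated, standardized copy of the input rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # Removing duplicates: skip a record identical to one already seen.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            # Standardizing capitalization: title case for names and cities.
            "name": row["name"].strip().title(),
            "city": row["city"].strip().title(),
            # Converting data types and handling missing values:
            # text becomes int; blank or invalid values become None.
            "age": to_int(row["age"]),
        })
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Whether a missing value should be dropped, filled in, or kept as None depends on the analysis, so that choice is left visible in to_int rather than hidden.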
Popular Tools for Cleaning Data
(open-source and SaaS tools):
1) OpenRefine: A free, open-source tool for working with
messy data.
2) RapidMiner: A data science platform that includes data
cleaning and preparation tools.
3) Talend Data Preparation: A cloud-based data preparation
tool that allows users to clean and prepare data for analysis.
4) Data Ladder Cleansing Tool: A data cleaning tool that
uses machine learning algorithms to identify and correct
errors in data.
5) Rattle: A free, open-source data mining tool that includes
data cleaning and preparation tools.
2) Data Cleaning Features in Spreadsheet
Microsoft Excel provides several features to help clean data, such
as:
1) Fill data automatically in worksheet cells: Filling
worksheet cells automatically based on patterns in the data.
2) Create and format tables: Creating and formatting tables
in your spreadsheet, which can make it easier to work with
large datasets.
3) Create a macro: Automating repetitive tasks in your
spreadsheet.
4) Check spelling and grammar: Checking the spelling and
grammar of your data.
5) Filter for unique values or remove duplicate values:
Filtering hides duplicate rows temporarily, while removing
duplicates deletes them permanently.
6) Find and replace text: Finding specific text in your data
and replacing it with corrected text.
7) Change the case of text: Converting text to upper, lower, or
proper case, for example with the UPPER, LOWER, and PROPER
functions.
8) Remove spaces and nonprinting characters from text:
Removing extra spaces and nonprinting characters, for example
with the TRIM and CLEAN functions.
9) Fix numbers and number signs: Converting numbers stored as
text into numeric values and correcting misplaced minus signs.
10) Fix dates and times: Converting dates and times stored as
text into real date and time values.
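For data that has been exported from a spreadsheet, features 8 to 10 can be approximated in Python. The sketch below is a rough analogue, not Excel's own behavior: the function names (remove_nonprinting, trim_spaces, fix_number, fix_date) are hypothetical, and the list of accepted date formats is an assumption that would need to match the real data.

```python
import re
from datetime import datetime


def remove_nonprinting(text):
    """Drop ASCII control characters (codes 0-31), roughly like Excel's CLEAN."""
    return "".join(ch for ch in text if ord(ch) >= 32)


def trim_spaces(text):
    """Strip outer spaces and collapse inner runs of spaces, roughly like TRIM."""
    return re.sub(r" +", " ", text.strip(" "))


def fix_number(text):
    """Parse a number stored as text, e.g. '1,234.50' or a trailing minus '42-'."""
    s = text.strip().replace(",", "")
    if s.endswith("-"):
        s = "-" + s[:-1]  # move a trailing minus sign to the front
    try:
        return float(s)
    except ValueError:
        return None


def fix_date(text, formats=("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")):
    """Try a few assumed input formats; return an ISO date string or None."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


sample = "  Data   cleaning\x07 lab \n"
tidy = trim_spaces(remove_nonprinting(sample))
print(repr(tidy))
print(fix_number("1,234.50"), fix_number("42-"))
print(fix_date("25/12/2023"), fix_date("not a date"))
```

Returning None for values that cannot be parsed keeps the bad cells easy to find afterwards instead of silently guessing a value.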
3) Sorting and Filtering
Sorting and filtering are powerful techniques to manage and
analyze data in spreadsheets.
Sorting allows arranging data in a specific order, revealing
patterns and trends.
Filtering helps in focusing on specific subsets of data.
o Advanced filtering techniques provide even greater
control over the data analysis.
o These include custom number and text filters,
wildcards, date filters, and filtering by color or icon.
Combining sorting and filtering helps draw insights and
supports quick, data-driven decisions, making these skills
essential for anyone working with spreadsheets.
Other Special Features of Sorting and Filtering:
Sorting displays data in a specific order, often to reveal
patterns, trends, or relationships.
Filtering displays only the rows that meet your specific
criteria, effectively hiding the rows that do not match
your conditions.
Both sorting and filtering allow you to focus on specific
subsets of your data, making it easier to analyze and
draw insights.
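The same ideas carry over to scripted analysis. The following is a minimal sketch, assuming a small invented list of orders: it applies a custom number filter and a text wildcard, then sorts the rows that remain. The data and the variable names are hypothetical. The pattern is matched against lowercased text so that matching ignores case, which is a choice made here for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical order records, invented for illustration.
orders = [
    {"customer": "Santos", "region": "North", "amount": 250},
    {"customer": "Sanchez", "region": "South", "amount": 90},
    {"customer": "Lim", "region": "North", "amount": 410},
    {"customer": "Salazar", "region": "North", "amount": 130},
]

# Filtering: keep rows where amount > 100 (number filter)
# and the customer name matches the wildcard pattern "sa*" (text filter).
matches = [
    row for row in orders
    if row["amount"] > 100 and fnmatchcase(row["customer"].lower(), "sa*")
]

# Sorting: order the filtered rows from largest to smallest amount.
ranked = sorted(matches, key=lambda row: row["amount"], reverse=True)

for row in ranked:
    print(row["customer"], row["amount"])
```

Filtering first and sorting second means only the rows of interest are ordered, which mirrors applying a filter in a spreadsheet and then sorting the visible rows.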
4) Data Cleaning, Verifying and Reporting Results
Data cleaning is the process of identifying and resolving
potential data inconsistencies or errors to improve the
quality of your data.
It involves reviewing, analyzing, detecting,
modifying, or removing ‘dirty’ data to make your
dataset ‘clean’ [1].
Data validation at the time of data entry or collection
helps you minimize the amount of data cleaning you’ll
need to do.
After data collection, you can use data
standardization and data transformation to clean
your data [1].
Data verification is the process of ensuring that the data
is accurate, complete, and consistent. It involves checking
the data for errors, inconsistencies, and missing values.
Data verification is an essential step in ensuring that
the data is reliable and can be used for analysis and
decision-making [2].
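The checks described above (errors, inconsistencies, and missing values) can be sketched as a small verification function. This is one possible shape under stated assumptions, not a standard routine: the function name verify, the sample records, the "id" uniqueness rule and the allowed status values are all hypothetical.

```python
def verify(rows, required, allowed):
    """Return (row_index, problem) pairs; an empty list means no issues found."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Completeness: every required field must have a value.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append((i, f"missing {field}"))
        # Consistency: values must come from the allowed set for that field.
        for field, values in allowed.items():
            value = row.get(field)
            if value not in (None, "") and value not in values:
                problems.append((i, f"unexpected {field}: {value!r}"))
        # Accuracy: the assumed key field "id" must not repeat.
        if row.get("id") in seen_ids:
            problems.append((i, f"duplicate id: {row['id']}"))
        seen_ids.add(row.get("id"))
    return problems


# Hypothetical records, invented for illustration.
records = [
    {"id": 1, "status": "active", "email": "a@example.com"},
    {"id": 2, "status": "Active", "email": ""},
    {"id": 2, "status": "inactive", "email": "c@example.com"},
]

issues = verify(
    records,
    required=("id", "status", "email"),
    allowed={"status": {"active", "inactive"}},
)
for index, problem in issues:
    print(f"row {index}: {problem}")
```

The list of issues can also serve as the starting point for a report of what was checked and what still needs to be fixed.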
Reporting results is the process of presenting the
findings of your data analysis. It involves summarizing the
data, identifying patterns and trends, and drawing
conclusions.
The goal of reporting results is to communicate the
insights gained from the data analysis to
stakeholders in a clear and concise manner.
5) Capturing Cleaning Changes
The Benefits of Effective Data Cleansing
https://www.techtarget.com/searchdatamanagement/definition/data-scrubbing
When done well, data cleansing benefits data management
and the business or organization as a whole:
1) Improved decision-making.
With more accurate data, analytics applications can
produce better results.
That enables organizations to make more informed
decisions on business strategies and operations, as
well as things like patient care and government
programs.
2) More effective marketing and sales.
Customer data is often wrong, inconsistent or out of
date. Many customers are not concerned with data
integrity or quality and simply provide whatever comes
to mind, and many dislike being asked for information.
Cleaning up the data in customer relationship
management and sales systems helps improve the
effectiveness of marketing campaigns and sales efforts.
3) Better operational performance.
Clean, high-quality data helps organizations avoid
inventory shortages, delivery problems and other
business issues that can result in higher costs,
lower revenues and damaged relationships with
customers.
4) Increased use of data.
Data has become a key corporate asset, but it can't
generate business value if it isn't used.
By making data more trustworthy, data cleansing helps
convince business managers and workers to rely on it
as part of their jobs.
5) Reduced data costs.
Data cleansing stops data errors and issues from
further propagating in systems and analytics
applications.
In the long term, that saves time and money, because
IT (Information Technology) and data management
teams do not have to continue fixing the same errors
in data sets.