DATA CLEANING
Data cleaning, also called data cleansing, is the process of
detecting and correcting (or removing) corrupt or inaccurate
records from a record set, table, or database. It involves
identifying incomplete, incorrect, inaccurate, or irrelevant parts
of the data and then replacing, modifying, or deleting this
dirty data.
STEPS FOR DATA CLEANING
1. IMPORTING THE DATA
2. EXPLORING THE RAW DATA
3. REMOVAL OF UNWANTED OBSERVATIONS
4. FIXING STRUCTURAL ERRORS
5. MANAGING UNWANTED DATA
6. HANDLING MISSING DATA
7. EXPORTING THE DATASET
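The seven steps above can be sketched end to end in base R. This is a minimal illustration, not a prescribed method: the data frame `raw`, its columns, the 0 to 100 score range, and the median imputation are all hypothetical choices made for the example, and step 5 is interpreted here as dropping out-of-range values, which is an assumption.

```r
# Hypothetical raw data standing in for a real file (an assumption for this sketch)
raw <- data.frame(
  id    = c(1, 2, 2, 3, 4, 5),
  city  = c("Delhi", "delhi ", "delhi ", "Mumbai", NA, "Pune"),
  score = c(10, 12, 12, NA, 15, 999),
  stringsAsFactors = FALSE
)
in_path <- tempfile(fileext = ".csv")
write.csv(raw, in_path, row.names = FALSE)

# 1. Import the data
dat <- read.csv(in_path, stringsAsFactors = FALSE)

# 2. Explore the raw data
str(dat)
summary(dat)

# 3. Remove unwanted observations (here: exact duplicate rows)
dat <- dat[!duplicated(dat), ]

# 4. Fix structural errors (stray whitespace, inconsistent capitalisation)
dat$city <- tolower(trimws(dat$city))

# 5. Manage unwanted data (assumed rule: scores must lie in 0..100; NAs are kept for step 6)
keep <- is.na(dat$score) | (dat$score >= 0 & dat$score <= 100)
dat <- dat[keep, ]

# 6. Handle missing data: impute score with the median, drop rows with no city
dat$score[is.na(dat$score)] <- median(dat$score, na.rm = TRUE)
clean <- dat[!is.na(dat$city), ]

# 7. Export the cleaned dataset
out_path <- tempfile(fileext = ".csv")
write.csv(clean, out_path, row.names = FALSE)
```

Whether to impute or drop missing values in step 6 depends on the analysis; both are shown only to illustrate the options.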
DATA CLEANING WITH R
• FOR UNDERSTANDING OF DATA
    • Import the data: Launch <- [Link](dataset)
    • View its class: class(abc)
    • View its dimensions: dim(abc)
    • For column names: names(abc)
    • For the structure of the data: str(abc)
• WE LOAD THE DPLYR LIBRARY FOR THE FOLLOWING FUNCTIONS
    • library(dplyr)
    • glimpse(abc)   # same as str()
    • summary(abc)   # base R, also works without dplyr
    • head(abc)      # base R, also works without dplyr
    • tail(abc)      # base R, also works without dplyr
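The inspection functions above can be tried on any data frame. The sketch below assumes R's built-in `airquality` dataset as a stand-in for `abc`; that choice is only for illustration. The `glimpse()` call is guarded so the example still runs when dplyr is not installed.

```r
# Stand-in for the imported data (assumption: any data frame would do)
abc <- airquality

class(abc)     # type of the object
dim(abc)       # number of rows and columns
names(abc)     # column names
str(abc)       # compact structure: column types and first values
head(abc)      # first six rows
tail(abc, 3)   # last three rows
summary(abc)   # per-column summary, including NA counts

# glimpse() comes from dplyr; run it only if the package is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(abc)
}
```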
• FOR VISUALIZING
    • hist(abc$xy)           # single variable
    • plot(abc$xy, abc$ty)   # between two variables
• FOR MISSING VALUES
  Checking for NAs, we use:
    • is.na(abc)
    • which(is.na(x))        # positions of NAs in a particular row/column
    • any(is.na(abc))
    • sum(is.na(abc))
    • summary(abc)
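The plotting and NA checks above can be sketched as follows. The built-in `airquality` dataset and its `Ozone` and `Wind` columns are stand-ins chosen for this example, and `colSums()` is an extra base R helper not mentioned in the list.

```r
abc <- airquality   # assumption: built-in data used as a stand-in

# Missing-value checks
na_mask <- is.na(abc)                 # logical matrix, TRUE where a value is NA
any(na_mask)                          # is there at least one NA?
sum(na_mask)                          # total number of NAs
colSums(na_mask)                      # NAs per column
ozone_na <- which(is.na(abc$Ozone))   # row positions of NAs in one column
summary(abc)                          # also reports NA counts per column

# Plots: pdf(NULL) sends output to a null device so the sketch runs
# non-interactively; drop pdf(NULL) and dev.off() to see the plots on screen
pdf(NULL)
hist(abc$Ozone)              # distribution of a single variable
plot(abc$Wind, abc$Ozone)    # relationship between two variables
invisible(dev.off())
```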
• TO REMOVE ROWS WITH NAs
    • na.omit(abc)
• FOR TIDY DATA
    • Observations as rows and variables as columns
    • One type of observational unit per table
  We use (from the tidyr library):
    • gather(data, key, value)
    • spread(data, key, value)
    • separate(data, col, into)
    • unite(data, col, …)
• TO DEAL WITH DATES AND TIMES
  We use the lubridate library. Example:
    • library(lubridate)
    • weather$day <- ymd(weather$date)
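A small sketch of the reshaping and date helpers listed above, assuming the tidyr and lubridate packages are installed. The data frames `wide`, `dates`, and `weather` and their columns are hypothetical and exist only for this example.

```r
library(tidyr)
library(lubridate)

# Hypothetical wide table: one column per subject
wide <- data.frame(id = c(1, 2), math = c(90, 75), art = c(60, 88))

# gather(): wide -> long, one row per (id, subject) pair
long <- gather(wide, key = "subject", value = "marks", math, art)

# spread(): long -> wide again, one column per subject
back <- spread(long, key = subject, value = marks)

# separate(): split one column into several
dates <- data.frame(when = c("2020-01-15", "2021-12-03"),
                    stringsAsFactors = FALSE)
parts <- separate(dates, col = when, into = c("year", "month", "day"), sep = "-")

# unite(): paste several columns back into one
joined <- unite(parts, col = "when", year, month, day, sep = "-")

# ymd(): parse year-month-day strings into Date objects
weather <- data.frame(date = c("2020-01-15", "2021-12-03"),
                      stringsAsFactors = FALSE)
weather$day <- ymd(weather$date)
```

In current tidyr releases `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`; the older names still work and are used here to match the list above.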
• DEALING WITH MISSING VALUES
    • Rows with no missing values: complete.cases(abc)
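As a brief illustration of keeping only rows without missing values, the sketch below uses base R's `complete.cases()` and, for comparison, `na.omit()`. The built-in `airquality` dataset is again only a stand-in for `abc`.

```r
abc <- airquality   # assumption: built-in data used as a stand-in

# complete.cases() returns one logical per row: TRUE if the row has no NA
ok <- complete.cases(abc)
clean1 <- abc[ok, ]

# na.omit() drops the same rows in one call; it also records which rows
# were removed in an "na.action" attribute
clean2 <- na.omit(abc)

nrow(abc)      # rows before
nrow(clean1)   # rows after
```

Indexing with `complete.cases()` is handy when the logical vector is needed for something else, such as counting or inspecting the incomplete rows with `abc[!ok, ]`.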