Project

Data cleaning is the process of detecting and correcting inaccurate or incomplete records in a dataset. It involves importing data, exploring for errors, removing unwanted observations, fixing structural errors, managing unwanted data, handling missing values, and exporting the cleaned dataset. When cleaning data in R, common steps include loading packages like dplyr, examining the structure and dimensions of the data, checking for and dealing with missing values, separating and uniting columns, and visualizing relationships between variables.

Uploaded by

satyam upadhayay

DATA CLEANING

Data cleaning, also called data cleansing, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty data.

STEPS FOR DATA CLEANING

1. IMPORTING THE DATA
2. EXPLORING THE RAW DATA
3. REMOVAL OF UNWANTED OBSERVATIONS
4. FIXING STRUCTURAL ERRORS
5. MANAGING UNWANTED DATA
6. HANDLING MISSING DATA
7. EXPORTING THE DATASET
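
The seven steps above can be sketched end to end in base R. This is a minimal illustration, not a fixed recipe: the toy data, the column names (id, city, temp), the use of read.csv, and the choice to treat -999 as a placeholder for "unwanted data" are all assumptions made for the example.

```r
# Hypothetical toy data written to a temporary CSV so the sketch is self-contained
raw_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,city,temp",
  "1,Delhi,31",
  "2, delhi ,29",
  "2, delhi ,29",
  "3,Mumbai,NA",
  "4,Pune,-999"
), raw_path)

# 1. Import the data
raw <- read.csv(raw_path, stringsAsFactors = FALSE)

# 2. Explore the raw data
str(raw)
print(summary(raw))

# 3. Remove unwanted observations (here: exact duplicate rows)
clean <- raw[!duplicated(raw), ]

# 4. Fix structural errors (stray whitespace, inconsistent capitalisation)
clean$city <- trimws(clean$city)
clean$city <- paste0(toupper(substring(clean$city, 1, 1)),
                     tolower(substring(clean$city, 2)))

# 5. Manage unwanted data (assumption: -999 is a sentinel that means "no reading")
clean$temp[which(clean$temp == -999)] <- NA

# 6. Handle missing data (here: drop every row that contains an NA)
clean <- clean[complete.cases(clean), ]

# 7. Export the cleaned dataset
clean_path <- tempfile(fileext = ".csv")
write.csv(clean, clean_path, row.names = FALSE)
```

Dropping incomplete rows in step 6 is only one option; depending on the analysis, imputing or flagging missing values may be more appropriate.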
DATA CLEANING WITH R
• FOR UNDERSTANDING THE DATA, WE LOAD THE DPLYR LIBRARY AND USE THE FOLLOWING FUNCTIONS

library(dplyr)
abc <- [Link](dataset)

• View its class: class(abc)
• View its dimensions: dim(abc)
• For the column names: names(abc)
• For the structure of the data: str(abc)
• glimpse(abc) # similar to str(), from dplyr
• summary(abc)
• First rows: head(abc)
• Last rows: tail(abc)
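
As a rough illustration of the inspection functions listed above, the sketch below runs them on a small made-up data frame. The data frame abc and its columns (id, xy, ty) are hypothetical stand-ins for an imported dataset.

```r
library(dplyr)  # provides glimpse()

# Hypothetical stand-in for the imported dataset
abc <- data.frame(
  id = 1:6,
  xy = c(2.5, 3.1, NA, 4.8, 5.0, 3.3),
  ty = c("a", "b", "b", NA, "a", "c")
)

print(class(abc))    # "data.frame"
print(dim(abc))      # number of rows, then number of columns
print(names(abc))    # column names
str(abc)             # compact view of column types and first values
glimpse(abc)         # dplyr's version of str(), one column per line
print(summary(abc))  # per-column summaries, including NA counts
print(head(abc, 3))  # first three rows
print(tail(abc, 3))  # last three rows
```

Note that R is case-sensitive, so these functions must be written in lower case.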
• FOR VISUALIZING
We use
hist(abc$xy) for a single variable
plot(abc$xy, abc$ty) for the relationship between two variables

• FOR MISSING VALUES
Checking for NAs
• is.na(abc)
• which(is.na(x)) for the particular row/column positions
• any(is.na(abc))
• sum(is.na(abc))
• summary(abc)
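
A short sketch of how these checks might fit together on a small example. The data frame abc and its columns xy and ty are made up for illustration; colSums() and the arr.ind argument of which() are additions not mentioned on the slide.

```r
# Hypothetical data with a few missing values
abc <- data.frame(
  xy = c(2.5, 3.1, NA, 4.8, 5.0),
  ty = c(10, NA, 12, 15, NA)
)

na_mask <- is.na(abc)                   # logical matrix, TRUE where a value is missing
print(any(na_mask))                     # is anything missing at all?
print(sum(na_mask))                     # total number of missing values (3 here)
print(colSums(na_mask))                 # missing values per column
print(which(is.na(abc$xy)))             # row positions of NAs in one column
print(which(na_mask, arr.ind = TRUE))   # row/column positions across the data frame

# Avoid writing Rplots.pdf when the sketch is run as a script
if (!interactive()) pdf(NULL)

hist(abc$xy)           # distribution of a single variable (NAs are dropped)
plot(abc$xy, abc$ty)   # relationship between two variables
```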

Removing rows with NAs
• na.omit(abc)

For tidy data
Observations as rows and variables as columns
One type of observational unit per table
We use
gather(data, key, value)
spread(data, key, value)
separate(data, col, into)
unite(data, col, …)

To deal with dates and times
We use the lubridate library
Ex- library(lubridate)
Weather$day <- ymd(weather2date)
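
The tidyr verbs and the lubridate parser named above could be combined roughly as follows. The weather table, its station column, and the month-named columns are hypothetical. gather() and spread() still work, but current tidyr documentation marks them as superseded by pivot_longer() and pivot_wider().

```r
library(tidyr)
library(lubridate)

# Hypothetical wide table: the year and month are stored in the column headers
weather <- data.frame(
  station = c("S1", "S2"),
  `2020_01` = c(12.1, 15.3),
  `2020_02` = c(13.4, 16.0),
  check.names = FALSE
)

# gather(): wide to long; old column names become the key, cell contents the value
long <- gather(weather, key = "year_month", value = "temp", -station)

# separate(): split one column into several
parts <- separate(long, col = year_month, into = c("year", "month"), sep = "_")

# unite(): paste several columns back into one
parts$day <- "01"
united <- unite(parts, col = "date", year, month, day, sep = "-")

# ymd(): parse year-month-day text into a Date
united$date <- ymd(united$date)

# spread(): long to wide again, here one column per station
wide <- spread(united, key = station, value = temp)
print(wide)
```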

Dealing with missing values
Rows with no missing values
• complete.cases(abc)
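
A minimal sketch of keeping only complete rows, using a made-up data frame abc. It also shows na.omit(), which drops the same rows in a single call.

```r
# Hypothetical data with missing values in two columns
abc <- data.frame(
  id = 1:5,
  xy = c(2.5, NA, 4.8, 5.0, 3.3),
  ty = c("a", "b", NA, "a", "c")
)

keep <- complete.cases(abc)   # TRUE for rows with no missing value
abc_complete <- abc[keep, ]   # keep only the complete rows

abc_omitted <- na.omit(abc)   # drops the same rows in one call

print(abc_complete)
print(abc_omitted)
```

complete.cases() returns a logical vector, which makes it easy to count or inspect the incomplete rows before dropping them. na.omit() returns the filtered data directly and records the dropped rows in an "na.action" attribute.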
