Data Cleaning R

Data cleaning is the process of transforming raw data into a suitable format for analysis, primarily by addressing missing values and duplicates. The document provides examples using R's dplyr library to remove rows with missing values, replace them with median values, and eliminate duplicate rows. Each example includes code snippets demonstrating the cleaning techniques.

Uploaded by getu zerga

Data cleaning refers to the process of transforming raw data into data that is suitable for analysis or model-building.

In most cases, “cleaning” a dataset involves dealing with missing values and duplicated data.
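Before cleaning, it can help to check how many missing values and duplicated rows a data frame actually contains. The following is a minimal sketch in base R; the data frame df and its columns (team, points, assists) are hypothetical sample data, not part of the original examples.

```r
# Hypothetical sample data: row 3 repeats row 2, and row 4 has two missing values
df <- data.frame(
  team    = c("A", "B", "B", "C"),
  points  = c(10, 8, 8, NA),
  assists = c(4, 7, 7, NA)
)

# Count the missing values in each column
missing_per_column <- colSums(is.na(df))

# Count the rows that are exact copies of an earlier row
n_duplicates <- sum(duplicated(df))

missing_per_column
n_duplicates
```

Here the points and assists columns each contain one missing value, and one row is an exact duplicate of an earlier row.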

1. Example 1: Remove Rows with Missing Values

We can use the following syntax to remove every row that has a missing value in any column:

library(dplyr)

# remove rows with missing values
new_df <- df %>% na.omit()

# view new data frame
new_df
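To see the effect of this syntax, here is a small worked sketch. It assumes a hypothetical data frame with columns team, points and assists; none of these values come from the original document.

```r
library(dplyr)

# Hypothetical sample data: rows 2 and 3 each contain one missing value
df <- data.frame(
  team    = c("A", "B", "C", "D"),
  points  = c(10, NA, 8, 14),
  assists = c(4, 7, NA, 9)
)

# Drop every row that has at least one missing value
new_df <- df %>% na.omit()

# Compare row counts to confirm how many rows were dropped
rows_removed <- nrow(df) - nrow(new_df)

new_df
rows_removed
```

Only the rows for teams A and D remain, because na.omit() drops a row as soon as any of its columns is missing.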

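2. Example 2: Replace Missing Values with Another Value

We can use the following syntax to replace any missing values with the median value of each column: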

library(dplyr)
library(tidyr)

# replace missing values in each numeric column with median value of column
new_df <- df %>%
  mutate(across(where(is.numeric), ~ replace_na(., median(., na.rm = TRUE))))

# view new data frame
new_df
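A short worked sketch of the median replacement follows. The data frame and its values are hypothetical and chosen so the medians are easy to verify by hand. The numeric columns are stored as doubles here, since a fractional median may not fit back into an integer column.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data with one missing value per column
df <- data.frame(
  team    = c("A", "B", "C", "D", NA),
  points  = c(10, NA, 8, 14, 12),
  assists = c(4, 7, NA, 9, 5)
)

# Fill each numeric column's missing values with that column's median;
# where(is.numeric) skips the character column team, so its NA is kept
new_df <- df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, median(.x, na.rm = TRUE))))

new_df
```

The missing points value becomes 11 (the median of 8, 10, 12, 14) and the missing assists value becomes 6 (the median of 4, 5, 7, 9), while the missing team name is left as NA.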

3. Example 3: Remove Duplicate Rows

We can use the following syntax to remove duplicate rows from a data frame:

library(dplyr)

# remove duplicate rows
new_df <- df %>% distinct(.keep_all = TRUE)

# view new data frame
new_df
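The sketch below applies distinct() to a hypothetical data frame. It also shows a variation that is not in the original example: passing a column name, in which case .keep_all = TRUE keeps the first row found for each value of that column. When no columns are named, every column is already compared and .keep_all has no effect.

```r
library(dplyr)

# Hypothetical sample data: row 2 is an exact copy of row 1
df <- data.frame(
  team   = c("A", "A", "B", "B", "C"),
  points = c(10, 10, 8, 9, 14)
)

# Remove rows that are identical across all columns
new_df <- df %>% distinct()

# Variation: keep only the first row for each team, with all columns retained
first_per_team <- df %>% distinct(team, .keep_all = TRUE)

new_df
first_per_team
```

The first call drops only the repeated (A, 10) row, leaving four rows. The second keeps one row per team, so team B retains its first points value of 8.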
