Data cleaning refers to the process of transforming raw data into data that is suitable for analysis or
model-building.
In most cases, “cleaning” a dataset means handling missing values and removing duplicated rows. The following examples show how to do both in R with the dplyr and tidyr packages.
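Before cleaning anything, it can help to measure how much missing and duplicated data there actually is. The sketch below uses base R only. The data frame df and its columns (team, points, assists) are hypothetical sample data, not something defined in the original text.

```r
# Hypothetical sample data (assumption: the original never defines df)
df <- data.frame(
  team    = c("A", "B", "B", "C", "D"),
  points  = c(10, 12, 12, NA, 15),
  assists = c(4, 5, 5, 7, NA)
)

# Count missing values in each column
na_counts <- colSums(is.na(df))

# Count rows that exactly repeat an earlier row
dup_count <- sum(duplicated(df))

na_counts
dup_count
```

If both counts are zero, the cleaning steps in the examples would leave the data frame unchanged.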
Example 1: Remove Rows with Missing Values
library(dplyr)
#remove rows with missing values
new_df <- df %>% na.omit()
#view new data frame
new_df
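The snippet above assumes a data frame named df already exists. As a minimal self-contained sketch, the same call can be run on a small made-up data frame; the column names and values here are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical sample data (assumption: the original never defines df)
df <- data.frame(
  team    = c("A", "B", "C", "D"),
  points  = c(10, NA, 12, 15),
  assists = c(4, 5, NA, 9)
)

# Drop every row that contains at least one NA in any column
new_df <- df %>% na.omit()

# Only the complete rows (teams A and D) remain
new_df
```

Note that na.omit() drops a row if any column is missing. When only certain columns matter, tidyr's drop_na() accepts column names and is one possible alternative.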
Example 2: Replace Missing Values with Another Value
library(dplyr)
library(tidyr)
#replace missing values in each numeric column with median value of column
new_df <- df %>% mutate(across(where(is.numeric), ~ replace_na(., median(., na.rm = TRUE))))
#view new data frame
new_df
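To make the median replacement concrete, here is a small worked sketch on made-up data. The data frame and its values are hypothetical; the point is only to show which cells change and what they change to.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data (assumption: the original never defines df)
df <- data.frame(
  team    = c("A", "B", "C", "D", "E"),
  points  = c(10, NA, 12, 15, 20),
  assists = c(4, 5, NA, 9, 7)
)

# For each numeric column, compute the median of the non-missing values
# and use it to fill that column's NAs; non-numeric columns are untouched
new_df <- df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, median(.x, na.rm = TRUE))))

# points: median of 10, 12, 15, 20 is 13.5, so row 2 becomes 13.5
# assists: median of 4, 5, 7, 9 is 6, so row 3 becomes 6
new_df
```

One caveat worth checking on real data: if a column is stored as integer and its median is fractional, replace_na() may refuse the replacement because of the type mismatch. Converting such columns to double first is one way around this.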
Example 3: Remove Duplicate Rows
We can use the following syntax to remove duplicate rows from a data frame, keeping the first occurrence of each:
library(dplyr)
#remove duplicate rows
new_df <- df %>% distinct()
#view new data frame
new_df
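The .keep_all argument only makes a difference when distinct() is given specific columns; with no columns listed, every column is compared and every column is kept. The sketch below contrasts the two cases on hypothetical sample data (the names df, new_df, and by_team are illustrative).

```r
library(dplyr)

# Hypothetical sample data (assumption: the original never defines df)
df <- data.frame(
  team   = c("A", "A", "B", "B", "C"),
  points = c(10, 10, 12, 14, 15)
)

# Remove rows that are identical across all columns
# (only the second "A, 10" row is dropped)
new_df <- df %>% distinct()

# Keep the first row for each team, retaining the other columns
by_team <- df %>% distinct(team, .keep_all = TRUE)

new_df
by_team
```

Deduplicating on a subset of columns discards information (here, the second "B" row with 14 points), so it is worth confirming that the first occurrence is really the one to keep.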