Statistical Modeling Using R - Lab Manual

Uploaded by

Gagana Reddy
Copyright
© All Rights Reserved

Statistical Modelling using R

Lab Manual

4th Semester MBA


Code: 22MBADS452
Experiment Number    Particulars    Module
1 Importing data from a text file 1
2 Importing Data from an Excel File 1
3 Variable Assignment and Basic Arithmetic 1
4 Creating and Manipulating Vectors 1
5 Creating and Accessing Elements of a Data Frame 1
6 Working with Factors and Discretizing Variables 1
7 Scatterplot Between mpg and wt Using ggplot2 2
8 Prepare a Dataset with Missing/Special Values/Outliers 2
9 Select Specific Columns from starwars Dataset 2
10 Filter Rows Based on eye_color in starwars Dataset 2
11 Arrange mtcars Dataset by mpg in Ascending Order 2
12 Filter starwars Dataset by Height and Hair Color 2
13 Factor and Discretize Values in starwars Dataset 2
14 Basic Scatter Plot with ggplot2 3
15 Customizing Histograms with ggplot2 3
16 Creating a Density Plot with ggplot2 3
17 Faceting with ggplot2 3
18 Creating a Boxplot with ggplot2 3
19 Advanced Customization in ggplot2 3
20 Creating Multiple Plots in a Grid with ggplot2 3
21 Customizing Axes and Legends in ggplot2 3
22 Overlaying Multiple Geometries in ggplot2 3
23 Using Themes and Color Schemes in ggplot2 3
24 Exploring Categorical Data and Descriptive Statistics 4
25 Exploring Numerical Data with Boxplots 4
26 Comparing Numerical Variables with Scatter Plots 4
27 Histograms and Density Plots for Numerical Variables 4
28 Exploratory Data Analysis with Boxplots and Grouped Barplots 4
29 Hypothesis Testing: T-Test 5
30 Chi-Square Test of Independence 5
31 Simple Linear Regression 5
32 Multiple Regression: Assumption Checking and Model Validation 5
33 ANOVA (Analysis of Variance) 5
1. How do you import a CSV file named data.csv into R and view the first six rows of the
data?
# Import the CSV file
data <- read.csv("data.csv")

# View the first six rows of the data


head(data)
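
The experiment title mentions a text file, while the answer above reads a CSV. A minimal sketch of the same idea for a tab-delimited text file follows. The file contents, the column names, and the object names txt_path and scores are made up for illustration; the file is written to a temporary location so that the example runs without any external data.

```r
# Hypothetical data: write a small tab-delimited file to a temporary path
txt_path <- tempfile(fileext = ".txt")
writeLines(c("name\tscore", "Asha\t82", "Ravi\t75", "Meera\t91"), txt_path)

# header = TRUE treats the first line as column names;
# sep = "\t" declares the tab character as the field delimiter
scores <- read.table(txt_path, header = TRUE, sep = "\t",
                     stringsAsFactors = FALSE)

# View the first rows and check the column types
head(scores)
str(scores)
```

For comma-separated text, read.csv() is the same function with sep = "," and header = TRUE as defaults.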

2. How do you import the first sheet of an Excel file named data.xlsx into R?
# Load the readxl package
library(readxl)

# Import the first sheet of the Excel file


data <- read_excel("data.xlsx")

# View the first six rows of the data


head(data)

3. Assign the value 10 to a variable a and the value 20 to a variable b. Calculate their sum,
difference, product, and quotient.
# Variable assignment
a <- 10
b <- 20

# Arithmetic operations
sum <- a + b
difference <- a - b
product <- a * b
quotient <- a / b

# Print the results


print(sum)
print(difference)
print(product)
print(quotient)

4. Create a numeric vector v with values from 1 to 5. Calculate the mean and standard
deviation of v.
# Create a numeric vector
v <- c(1, 2, 3, 4, 5)

# Calculate the mean


mean_v <- mean(v)
print(mean_v)

# Calculate the standard deviation


sd_v <- sd(v)
print(sd_v)
5. Create a data frame with two columns: name (character) and age (numeric), with at
least three rows of data. Access the second row of the data frame.
# Create a data frame
data <- data.frame(
name = c("Alice", "Bob", "Charlie"),
age = c(25, 30, 35)
)

# Access the second row


second_row <- data[2, ]

# Print the second row


second_row

6. Convert the numeric vector v (from Question 4) into a factor with three levels: "Low"
(1, 2), "Medium" (3), and "High" (4, 5).
# Create a numeric vector
v <- c(1, 2, 3, 4, 5)

# Define the breaks for discretizing the variable


breaks <- c(0, 2, 3, 5)

# Define the labels for the factors


labels <- c("Low", "Medium", "High")

# Convert the numeric vector into a factor


v_factor <- cut(v, breaks = breaks, labels = labels)

# Print the factor


v_factor
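
To see why the breaks c(0, 2, 3, 5) give this grouping, it may help to look at the intervals that cut() builds before the labels are applied. The sketch below is illustrative only; raw_bins and left_closed are names introduced here.

```r
v <- c(1, 2, 3, 4, 5)

# Without labels, cut() names each level after its interval.
# By default the intervals are closed on the right: (0,2], (2,3], (3,5]
raw_bins <- cut(v, breaks = c(0, 2, 3, 5))
levels(raw_bins)

# With labels, the same intervals are renamed Low / Medium / High
v_factor <- cut(v, breaks = c(0, 2, 3, 5), labels = c("Low", "Medium", "High"))
table(v_factor)

# With right = FALSE the intervals become [0,2), [2,3), [3,5),
# so the value 5 falls outside every interval and becomes NA
left_closed <- cut(v, breaks = c(0, 2, 3, 5), right = FALSE)
left_closed
```

Because the lowest break (0) is excluded from the first interval, a value equal to the lowest break would also become NA unless include.lowest = TRUE is set.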
7. Using the mtcars dataset, provide a scatterplot between mpg (miles per gallon) and wt
(weight) using ggplot2.
# Load the ggplot2 package
library(ggplot2)

# Create a scatterplot
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
ggtitle("Scatterplot of MPG vs Weight") +
xlab("Weight (1000 lbs)") +
ylab("Miles Per Gallon (mpg)")

8. Prepare a dataset of your own with missing values, special values, and outliers.
# Create a dataset
my_data <- data.frame(
id = 1:10,
# NA is a missing value, Inf is a special value, and 1000 is an outlier
value = c(10, 20, 30, NA, 50, Inf, 70, 80, 1000, 90)
)

# View the dataset


my_data
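
Once such a dataset exists, the three kinds of problem values can be located programmatically. The following is a minimal sketch rather than a prescribed method: it assumes the 1.5 * IQR rule as the outlier criterion, and the names missing_rows, special_rows, and outlier_rows are introduced here for illustration.

```r
my_data <- data.frame(
  id = 1:10,
  value = c(10, 20, 30, NA, 50, Inf, 70, 80, 1000, 90)
)

# Missing values: is.na() is TRUE for NA (and NaN) but not for Inf
missing_rows <- which(is.na(my_data$value))

# Special values: not finite, yet not missing (Inf or -Inf)
special_rows <- which(!is.finite(my_data$value) & !is.na(my_data$value))

# Outliers: 1.5 * IQR rule, computed on the finite values only
finite_vals <- my_data$value[is.finite(my_data$value)]
q <- unname(quantile(finite_vals, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
outlier_rows <- which(is.finite(my_data$value) &
                        (my_data$value < q[1] - fence |
                           my_data$value > q[2] + fence))

# Show the affected rows
my_data[c(missing_rows, special_rows, outlier_rows), ]
```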

9. From the starwars dataset, select the columns name, height, gender, and species.
# Load the dplyr package and starwars dataset
library(dplyr)
data(starwars)

# Select specific columns


selected_columns <- starwars %>%
select(name, height, gender, species)

# View the selected columns


head(selected_columns)

10. From the starwars dataset, print the details of all those who have "yellow" or "blue"
as their eye_color.
# Filter rows based on eye_color
eyes <- starwars %>%
  filter(eye_color %in% c("yellow", "blue"))

# View the details
eyes

11. For all the cars in the mtcars dataset, arrange the dataset in ascending order based on
the mpg value.
# Arrange the dataset by mpg in ascending order
mtcars_arranged <- mtcars %>%
arrange(mpg)
# View the arranged dataset
head(mtcars_arranged)
12. From the starwars dataset, print the name, height, and hair_color of all those who have
height > 175 and their hair_color is "black".
# Filter by height and hair_color
filtered_starwars <- starwars %>%
filter(height > 175, hair_color == "black") %>%
select(name, height, hair_color)

# View the filtered data


filtered_starwars

13. From the starwars dataset, create a factor variable for the species column and then
discretize the height column into three categories: "Short", "Average", and "Tall".
# Load the dplyr package and starwars dataset
library(dplyr)
data(starwars)

# Convert the species column to a factor


starwars <- starwars %>%
mutate(species_factor = as.factor(species))

# Define the breaks for discretizing the height column


height_breaks <- c(-Inf, 160, 180, Inf)
height_labels <- c("Short", "Average", "Tall")

# Discretize the height column into categories


starwars <- starwars %>%
mutate(height_category = cut(height, breaks = height_breaks, labels =
height_labels))

# View the updated dataset with the new factor and discretized columns
head(starwars %>% select(name, species, species_factor, height,
height_category))

Explanation:

1. Convert the species column to a factor:
   o Use mutate() and as.factor() to create a new column, species_factor, that
     holds the factor version of the species column.
2. Discretize the height column into categories:
   o Define height_breaks so that heights up to 160 are "Short", heights above
     160 and up to 180 are "Average", and heights above 180 are "Tall".
   o Use cut() to create a new column, height_category, that assigns each
     height to a category based on these breaks and labels. Rows with a missing
     height get NA as their category.
14. Use ggplot2 to create a scatter plot of mpg (miles per gallon) vs wt (weight) from the
mtcars dataset. Customize the plot by adding a title, labels for the x and y axes, and
changing the point shapes.
# Load the ggplot2 package
library(ggplot2)

# Create a scatter plot


ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(cyl))) +
geom_point(size = 3) +
labs(title = "Scatter Plot of MPG vs Weight",
x = "Weight",
y = "Miles Per Gallon") +
scale_shape_manual(values = c(16, 17, 18)) +
theme_minimal()
15. Plot a histogram of mpg from mtcars. Customize the histogram by changing the fill
color, adjusting bin width, and adding labels to axes.
# Create a histogram
ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
geom_histogram(binwidth = 2, color = "black") +
labs(title = "Histogram of MPG",
x = "Miles Per Gallon",
y = "Frequency") +
scale_fill_manual(values = c("#FF9999", "#66CCFF", "#99FF99")) +
theme_minimal()
16. Generate a density plot of mpg from mtcars. Customize the plot by changing line types,
colors, and adding a legend.
# Create a density plot
ggplot(mtcars, aes(x = mpg, color = factor(cyl), linetype = factor(cyl))) +
geom_density(size = 1.2) +
labs(title = "Density Plot of MPG",
x = "Miles Per Gallon",
y = "Density") +
scale_color_manual(values = c("#FF9999", "#66CCFF", "#99FF99")) +
scale_linetype_manual(values = c("solid", "dashed", "dotted")) +
theme_minimal()
17. Create a scatter plot of mpg vs wt from mtcars dataset. Facet the plot by cyl (number
of cylinders) and customize facet labels and overall plot appearance.
# Create a faceted scatter plot
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
facet_wrap(~ cyl, labeller = labeller(cyl = c("4" = "Four Cylinders",
                                              "6" = "Six Cylinders",
                                              "8" = "Eight Cylinders"))) +
labs(title = "Scatter Plot of MPG vs Weight (Faceted by Cylinder)",
x = "Weight",
y = "Miles Per Gallon") +
theme_minimal()
18. Generate a boxplot of mpg grouped by cyl from mtcars. Customize the appearance by
adjusting fill colors, adding a title, and changing the order of boxes.
# Create a boxplot
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
geom_boxplot() +
labs(title = "Boxplot of MPG by Cylinder",
x = "Number of Cylinders",
y = "Miles Per Gallon") +
scale_fill_manual(values = c("#FF9999", "#66CCFF", "#99FF99")) +
theme_minimal()
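
The question also asks for the order of the boxes to be changed, which the code above does not do. One possible approach, sketched here under the assumption that the boxes should run from most to fewest cylinders, is to set the factor levels explicitly. The names cars and cyl_ordered are introduced for illustration.

```r
library(ggplot2)

# The order of a discrete axis follows the factor levels,
# so list the levels in the order the boxes should appear
cars <- mtcars
cars$cyl_ordered <- factor(cars$cyl, levels = c(8, 6, 4))

p <- ggplot(cars, aes(x = cyl_ordered, y = mpg, fill = cyl_ordered)) +
  geom_boxplot() +
  labs(title = "Boxplot of MPG by Cylinder",
       x = "Number of Cylinders",
       y = "Miles Per Gallon",
       fill = "Cylinders") +
  # Named values keep each colour tied to its cylinder count
  # regardless of the level order
  scale_fill_manual(values = c("4" = "#FF9999", "6" = "#66CCFF", "8" = "#99FF99")) +
  theme_minimal()

print(p)
```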
19. Create a scatter plot of mpg vs hp from mtcars. Customize the plot by adjusting point
size based on wt, adding a smooth trend line, and annotating points with labels for car
names.
# Create an advanced scatter plot
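# size and label are mapped inside the layers that use them, so the
# text labels are not resized by wt
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point(aes(size = wt)) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
geom_text(aes(label = rownames(mtcars)), check_overlap = TRUE, vjust = 1.5) +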
labs(title = "Scatter Plot of MPG vs Horsepower",
x = "Horsepower",
y = "Miles Per Gallon") +
scale_size(range = c(3, 10)) +
theme_minimal()
20. Generate a 2x2 grid of plots: scatter plot, histogram, density plot, and boxplot for mpg
from mtcars. Customize each plot and arrange them in a grid layout.
# Load necessary packages
library(ggplot2)
library(gridExtra) # Load gridExtra for grid.arrange

# Load the mtcars dataset


data(mtcars)

# Create individual plots


plot1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
labs(title = "Scatter Plot of MPG vs Weight",
x = "Weight",
y = "Miles Per Gallon") +
theme_minimal()

plot2 <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +


geom_histogram(binwidth = 2, color = "black") +
labs(title = "Histogram of MPG",
x = "Miles Per Gallon",
y = "Frequency") +
scale_fill_manual(values = c("#FF9999", "#66CCFF", "#99FF99")) +
theme_minimal()

# The + must end the line; a + at the start of a new line breaks the chain
plot3 <- ggplot(mtcars, aes(x = mpg, color = factor(cyl), linetype = factor(cyl))) +
geom_density(size = 1.2) +
labs(title = "Density Plot of MPG",
x = "Miles Per Gallon",
y = "Density") +
scale_color_manual(values = c("#FF9999", "#66CCFF", "#99FF99")) +
scale_linetype_manual(values = c("solid", "dashed", "dotted")) +
theme_minimal()

plot4 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +


geom_boxplot() +
labs(title = "Boxplot of MPG by Cylinder",
x = "Number of Cylinders",
y = "Miles Per Gallon") +
scale_fill_manual(values = c("#FF9999", "#66CCFF", "#99FF99")) +
theme_minimal()

# Arrange plots in a grid layout


grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)
21. Create a scatter plot of mpg vs disp from mtcars. Customize the plot by changing axis
labels, adjusting axis limits, and modifying the legend.
# Create a scatter plot with customized axes and legend
ggplot(mtcars, aes(x = disp, y = mpg, color = factor(cyl))) +
geom_point() +
labs(title = "Scatter Plot of MPG vs Displacement",
x = "Displacement (cu.in.)",
y = "Miles Per Gallon") +
scale_color_manual(values = c("#FF9999", "#66CCFF", "#99FF99"),
labels = c("Four Cylinders", "Six Cylinders", "Eight Cylinders"),
name = "Number of Cylinders") +
theme_minimal()
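
The code above changes the labels and the legend but leaves the axis limits at their defaults. A minimal sketch of one way to set them is given below. The limits and break spacing are illustrative choices that cover the full range of the data (disp runs from 71.1 to 472 and mpg from 10.4 to 33.9), and the legend position is likewise an assumption.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = disp, y = mpg, color = factor(cyl))) +
  geom_point() +
  # Explicit axis limits and tick positions (illustrative values)
  scale_x_continuous(limits = c(50, 500), breaks = seq(50, 500, by = 50)) +
  scale_y_continuous(limits = c(10, 35)) +
  labs(title = "Scatter Plot of MPG vs Displacement",
       x = "Displacement (cu.in.)",
       y = "Miles Per Gallon") +
  scale_color_manual(values = c("#FF9999", "#66CCFF", "#99FF99"),
                     labels = c("Four Cylinders", "Six Cylinders", "Eight Cylinders"),
                     name = "Number of Cylinders") +
  theme_minimal() +
  # Move the legend below the plot
  theme(legend.position = "bottom")

print(p)
```

Limits set through a scale drop any observation that falls outside them before the plot is drawn. To zoom without discarding data, coord_cartesian(xlim = ..., ylim = ...) can be used instead.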
22. Overlay a scatter plot and a line plot of mpg vs wt from mtcars. Customize the plot by
changing point shapes, line color, and adding annotations.
# Create a scatter plot with overlaid line plot
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(shape = 16, size = 4) +
geom_line(stat = "smooth", method = "lm", color = "blue") +
labs(title = "Scatter Plot with Overlaid Line of MPG vs Weight",
x = "Weight",
y = "Miles Per Gallon") +
theme_minimal()
23. Generate a scatter plot of mpg vs hp from mtcars. Customize the plot by applying
different themes (e.g., theme_bw, theme_classic) and adjusting color schemes.
# Create a scatter plot with different themes and color schemes
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
labs(title = "Scatter Plot of MPG vs Horsepower",
x = "Horsepower",
y = "Miles Per Gallon") +
scale_color_manual(values = c("#FF9999", "#66CCFF", "#99FF99")) +
theme_bw()
24. Using the iris dataset, explore the categorical variable Species. Compute the frequency
table and bar plot to visualize the distribution of species. Calculate the mean, median,
and standard deviation of Sepal.Length for each species.
# Load the dataset
data(iris)

# Frequency table and bar plot of Species


species_freq <- table(iris$Species)
barplot(species_freq, main = "Frequency of Species in Iris Dataset", xlab =
"Species", ylab = "Frequency")

# Descriptive statistics of Sepal.Length by Species


summary_stats <- tapply(iris$Sepal.Length, iris$Species, function(x) c(mean =
mean(x), median = median(x), sd = sd(x)))
summary_stats

25. Use the mtcars dataset to explore numerical variables. Create a boxplot to visualize the
distribution of mpg (miles per gallon), hp (horsepower), and qsec (quarter mile time).
Interpret the variability and central tendency of each variable based on the boxplot.
# Load the dataset
data(mtcars)

# Create boxplots for mpg, hp, and qsec


par(mfrow = c(1, 3)) # Set up a 1x3 grid for plotting

# Boxplot for mpg


boxplot(mtcars$mpg, main = "Miles Per Gallon (mpg)")

# Boxplot for hp
boxplot(mtcars$hp, main = "Horsepower (hp)")

# Boxplot for qsec


boxplot(mtcars$qsec, main = "Quarter Mile Time (qsec)")
Interpretation:

Each boxplot is drawn on its own scale and in its own units, so spread should
be judged within each variable rather than compared directly across panels.

1. Miles Per Gallon (mpg):
   o Variability: The mpg values are spread fairly widely, from about 10 to
     34 mpg, with no points beyond the whiskers.
   o Central Tendency: The median (the horizontal line inside the box) is
     about 19 mpg, so half of the cars fall below this value and half above it.
2. Horsepower (hp):
   o Variability: The hp values span a wide range, from 52 to 335 hp, with
     one outlier (335 hp) above the upper whisker.
   o Central Tendency: The median is about 123 hp, so half of the cars have
     less horsepower than this and half have more.
3. Quarter Mile Time (qsec):
   o Variability: The qsec values are tightly clustered, from 14.5 to 22.9
     seconds, with one outlier (22.9 seconds) above the upper whisker.
   o Central Tendency: The median is about 17.7 seconds, the middle value of
     the quarter mile time for the cars in the dataset.
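
The values read off the boxplots can be checked numerically. The sketch below uses only base R; vars, medians, iqrs, and outliers are names introduced here for illustration.

```r
vars <- mtcars[, c("mpg", "hp", "qsec")]

# Central tendency: the median of each variable
medians <- sapply(vars, median)

# Variability: interquartile range and full range of each variable
iqrs   <- sapply(vars, IQR)
ranges <- sapply(vars, function(x) diff(range(x)))

round(rbind(median = medians, IQR = iqrs, range = ranges), 2)

# Points that boxplot() draws beyond the whiskers
outliers <- lapply(vars, function(x) boxplot.stats(x)$out)
outliers
```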

26. Use the mtcars dataset to compare mpg vs wt (weight) using a scatter plot. Add a trend
line and interpret the relationship between mpg and wt.
# Scatter plot of mpg vs wt with trend line
plot(mtcars$wt, mtcars$mpg, main = "Scatter Plot of MPG vs Weight", xlab =
"Weight", ylab = "Miles Per Gallon")
abline(lm(mpg ~ wt, data = mtcars), col = "blue")
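
The plot shows a downward trend: heavier cars tend to have lower mpg. To put numbers on that interpretation, the correlation and the fitted slope can be printed. This is a small sketch; fit and r are names introduced here.

```r
# Fit the same line that abline() draws
fit <- lm(mpg ~ wt, data = mtcars)

# Pearson correlation between weight and mpg (negative: mpg falls as weight rises)
r <- cor(mtcars$wt, mtcars$mpg)
r

# Intercept and slope; the slope is the expected change in mpg
# for each additional 1000 lbs of weight
coef(fit)
```

The correlation is about -0.87 and the slope is about -5.34, so each additional 1000 lbs of weight is associated with roughly 5.3 fewer miles per gallon.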
27. Generate histograms and density plots for Sepal.Length and Petal.Length from the iris
dataset. Compare the distributions and comment on any differences or similarities
between these two variables.
# Load the dataset
data(iris)

# Create histograms and density plots


par(mfrow = c(2, 2))
hist(iris$Sepal.Length, main = "Histogram of Sepal Length")
hist(iris$Petal.Length, main = "Histogram of Petal Length")
plot(density(iris$Sepal.Length), main = "Density Plot of Sepal Length")
plot(density(iris$Petal.Length), main = "Density Plot of Petal Length")
28. Using the mtcars dataset, create a grouped barplot to visualize the average mpg by cyl
(number of cylinders). Additionally, generate side-by-side boxplots of mpg grouped by
am (transmission type: 0 = automatic, 1 = manual) to compare the distributions.
# Load the dataset
data(mtcars)

# Grouped barplot of average mpg by cyl


avg_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)
barplot(avg_mpg, main = "Average MPG by Number of Cylinders", xlab =
"Number of Cylinders", ylab = "Average MPG")

# Side-by-side boxplots of mpg by am


boxplot(mpg ~ am, data = mtcars, main = "MPG by Transmission Type",
xlab = "Transmission Type", ylab = "Miles Per Gallon", col = c("red",
"blue"))
legend("topright", legend = c("Automatic", "Manual"), fill = c("red", "blue"))
29. Use the mtcars dataset to test if there is a significant difference in mpg (miles per gallon)
between cars with 4 cylinders (cyl == 4) and cars with 6 cylinders (cyl == 6). Perform
a two-sample t-test and interpret the results.
# Load the dataset
data(mtcars)

# Perform t-test
t_test <- t.test(mpg ~ cyl, data = mtcars, subset = cyl %in% c(4, 6))
t_test

# Interpret the results


cat("P-value:", t_test$p.value, "\n")
if (t_test$p.value < 0.05) {
  cat("Conclusion: There is a significant difference in mpg between cars with 4 cylinders and 6 cylinders.\n")
} else {
  cat("Conclusion: There is no significant difference in mpg between cars with 4 cylinders and 6 cylinders.\n")
}
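
By default t.test() runs Welch's test, which does not assume equal variances in the two groups. The sketch below compares it with the pooled-variance (Student) version and prints the group means that the test compares. The names four_six, welch, pooled, and group_means are introduced here for illustration.

```r
# Keep only the 4- and 6-cylinder cars
four_six <- subset(mtcars, cyl %in% c(4, 6))

# Mean mpg in each group
group_means <- tapply(four_six$mpg, four_six$cyl, mean)
group_means

# Welch's t-test (the default, var.equal = FALSE)
welch <- t.test(mpg ~ cyl, data = four_six)

# Student's t-test with a pooled variance estimate
pooled <- t.test(mpg ~ cyl, data = four_six, var.equal = TRUE)

# Compare p-values and degrees of freedom
c(welch = welch$p.value, pooled = pooled$p.value)
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))
```

The pooled test always has n1 + n2 - 2 degrees of freedom (16 here), while Welch's test adjusts the degrees of freedom downward when the group variances differ.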

30. Use the ChickWeight dataset in R to perform a chi-square test of independence to
determine if there is a relationship between Chick weight (weight) and Diet type (Diet).
Interpret the chi-square test results.
# Load the dataset
data(ChickWeight)

# Perform chi-square test


chi_square_test <- chisq.test(ChickWeight$weight, ChickWeight$Diet)
chi_square_test

# Interpret the results


cat("P-value:", chi_square_test$p.value, "\n")
if (chi_square_test$p.value < 0.05) {
  cat("Conclusion: There is a significant relationship between Chick weight and Diet type.\n")
} else {
  cat("Conclusion: There is no significant relationship between Chick weight and Diet type.\n")
}
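
A chi-square test of independence is meant for two categorical variables. weight is numeric, so passing it directly treats every distinct weight as its own category, which leaves many cells with very small expected counts. One possible remedy, sketched here, is to discretize weight first. The three classes split at the tertiles are an illustrative choice, and cutpoints, weight_class, tab, and result are names introduced here.

```r
data(ChickWeight)

# Split weight into three classes at the tertiles (illustrative choice)
cutpoints <- quantile(ChickWeight$weight, probs = c(0, 1/3, 2/3, 1))
weight_class <- cut(ChickWeight$weight, breaks = cutpoints,
                    labels = c("Light", "Medium", "Heavy"),
                    include.lowest = TRUE)

# Contingency table of weight class by diet
tab <- table(weight_class, ChickWeight$Diet)
tab

# Chi-square test of independence on the 3 x 4 table
result <- chisq.test(tab)
result

# Expected counts; the chi-square approximation is usually considered
# adequate when these are all at least about 5
result$expected
```

Because each chick is weighed repeatedly, the rows of ChickWeight are not independent observations, so the p-value from this test should be read with caution.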
31. Use the mtcars dataset to perform a simple linear regression to predict mpg (miles per
gallon) based on wt (weight in 1000 lbs). Check the assumptions of linear regression
(linearity, normality of residuals, homoscedasticity) and interpret the regression results.
# Load the dataset
data(mtcars)

# Fit linear regression model


model <- lm(mpg ~ wt, data = mtcars)

# Check assumptions
plot(model)

# Summary of regression model


summary(model)
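
The diagnostic plots can be shown on one page and supplemented with a formal normality test and the key numbers from the fit. This is a minimal sketch using base R only; the Shapiro-Wilk test is one common choice for checking normality of residuals, not the only one, and slope_ci and r2 are names introduced here.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Show the four diagnostic plots in a 2 x 2 grid, then restore the layout.
# Residuals vs Fitted: linearity; Q-Q: normality of residuals;
# Scale-Location: homoscedasticity; Residuals vs Leverage: influential points
op <- par(mfrow = c(2, 2))
plot(model)
par(op)

# Shapiro-Wilk test on the residuals (a small p-value suggests non-normality)
shapiro.test(residuals(model))

# 95% confidence interval for the slope and the share of variance explained
slope_ci <- confint(model)["wt", ]
r2 <- summary(model)$r.squared
slope_ci
r2
```

An R-squared of about 0.75 means that weight alone accounts for roughly three quarters of the variation in mpg in this dataset.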

32. Use the iris dataset to perform a multiple linear regression to predict Petal.Length based
on Sepal.Length, Sepal.Width, and Petal.Width. Check the assumptions of multiple
regression (linearity, normality of residuals, multicollinearity) and validate the model.

# Load the dataset


data(iris)

# Fit multiple linear regression model


model <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width, data =
iris)

# Check assumptions
par(mfrow = c(2, 2))
plot(model)

# Validate the model


anova(model)
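
The question lists multicollinearity among the assumptions, which the diagnostic plots do not address. A common measure is the variance inflation factor (VIF). The sketch below computes it by hand in base R as 1 / (1 - R^2), where R^2 comes from regressing each predictor on the other predictors. The names predictors and vif are introduced here, and the usual rule of thumb that values above 5 or 10 signal a problem is a convention rather than a strict threshold.

```r
data(iris)

predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Width")

# For each predictor, regress it on the remaining predictors
# and convert the resulting R-squared into a VIF
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = iris))$r.squared
  1 / (1 - r2)
})
round(vif, 2)

# Pairwise correlations between the predictors give a quick first look
round(cor(iris[, predictors]), 2)

# Overall fit of the model from the question
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width, data = iris)
summary(model)$adj.r.squared
```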
33. Use the PlantGrowth dataset to perform an ANOVA to determine if there is a significant
difference in weight of plants among different group types (ctrl, trt1, trt2). Interpret the
ANOVA results.
# Load the dataset
data(PlantGrowth)

# Perform ANOVA
anova_result <- aov(weight ~ group, data = PlantGrowth)
summary(anova_result)

# Interpret the results


anova_p <- summary(anova_result)[[1]]$"Pr(>F)"[1]
cat("P-value:", anova_p, "\n")
if (anova_p < 0.05) {
  cat("Conclusion: There is a significant difference in weight of plants among different group types.\n")
} else {
  cat("Conclusion: There is no significant difference in weight of plants among different group types.\n")
}
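
A significant ANOVA result says that at least one group mean differs, but not which ones. A common follow-up, sketched here as one option, is Tukey's Honest Significant Difference test, which compares every pair of groups while adjusting for multiple comparisons. The names tukey and p_adj are introduced for illustration.

```r
data(PlantGrowth)

anova_result <- aov(weight ~ group, data = PlantGrowth)

# Group means for reference
tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# Pairwise comparisons with adjusted p-values
tukey <- TukeyHSD(anova_result)
tukey

# Extract the adjusted p-values for each pair
p_adj <- tukey$group[, "p adj"]
p_adj[p_adj < 0.05]
```

In this dataset only the trt2 versus trt1 comparison is significant at the 5% level; neither treatment differs significantly from the control.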
