2023 Tutorial 12
PAS 2023 – Introductory Lab
1. Objectives
- Reading a data file into R and introducing the R data environment
- Describing categorical and numerical variables with descriptive statistics
- Understanding basic inferential statistical methods
2. Import data set into R
Before reading a data file into R, you should create your own folder to save the
necessary materials and output of your working sessions. In order to check the current
working directory, we use:
o getwd()
The output of this command is the current folder, which R chooses automatically. If
you wish to change to your own folder, use:
o setwd("mydirectory")
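As a minimal sketch of the two commands working together, the snippet below creates a folder and switches into it. The location (a temporary directory) and the name lab_dir are assumptions made for illustration; in practice you would use a folder of your own choosing.

```r
# Remember the directory R started in so it can be restored later
old_dir <- getwd()

# Hypothetical location: a "Learning R" folder inside R's temporary directory
lab_dir <- file.path(tempdir(), "Learning R")
dir.create(lab_dir, showWarnings = FALSE)

# Switch to the new folder and confirm the change
setwd(lab_dir)
print(getwd())

# Return to the original working directory
setwd(old_dir)
```

Using file.path() rather than pasting strings keeps the path separators correct on every operating system.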
Exercise 1: Create a folder in a particular directory and label it Learning R. Copy and
paste all the files to be used for this lab into that directory.
There are three main types of data files: comma-separated values files (.csv), text
files (.txt), and Excel files (.xls or .xlsx). We will focus on the first type, using
the built-in function read.csv().
Now, let’s read in the mtcars.csv dataset from your working directory.
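A minimal sketch of this step follows. So that it runs anywhere, it first writes R's built-in mtcars data to a temporary CSV file as a stand-in for the lab's mtcars.csv (an assumption; the lab file may be laid out slightly differently). The object name data1 matches the name used in the rest of this lab.

```r
# Stand-in for the lab file: export the built-in mtcars data to a CSV file
csv_path <- file.path(tempdir(), "mtcars.csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the CSV file into a data frame; the first row holds the column names
data1 <- read.csv(csv_path, header = TRUE)

# Inspect the structure and the first few rows
str(data1)
head(data1)
```

If mtcars.csv is already in your working directory, read.csv("mtcars.csv") alone is enough.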
3. Descriptive statistics
3.1. Categorical variables
o data1$am <- factor(data1$am, ordered = FALSE, levels = c(0, 1),
    labels = c("automatic", "manual"))
Exercise 3: Convert the gear variable into a factor, using the labels "3 gears",
"4 gears", "5 gears".
The descriptive methods for categorical variables include (relative) frequency tables
and bar charts. They can be produced in R as follows:
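One possible way to carry out this procedure is sketched below. It uses the built-in mtcars data as a stand-in for data1 (an assumption) and the am factor defined earlier; the names freq and rel_freq are illustrative.

```r
# Stand-in for the lab data (assumption): the built-in mtcars data set
data1 <- mtcars
data1$am <- factor(data1$am, levels = c(0, 1),
                   labels = c("automatic", "manual"))

# Frequency table: number of cars in each category
freq <- table(data1$am)
print(freq)

# Relative frequency table: proportion of cars in each category
rel_freq <- prop.table(freq)
print(rel_freq)

# Bar chart of the frequencies
barplot(freq)
```

Passing rel_freq instead of freq to barplot() draws the same chart on a proportion scale.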
We can use a clustered or stacked bar graph to describe two categorical variables
simultaneously. First, we need to form the cross-tabulation table:
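A minimal sketch of the cross-tabulation and both kinds of bar graph, again assuming the built-in mtcars data as a stand-in for data1; the name tab is illustrative.

```r
# Stand-in for the lab data (assumption): the built-in mtcars data set
data1 <- mtcars
data1$am <- factor(data1$am, levels = c(0, 1),
                   labels = c("automatic", "manual"))

# Cross-tabulation: rows are transmission types, columns are numbers of gears
tab <- table(data1$am, data1$gear)
print(tab)

# Clustered bar graph: one group of bars per column of the table
barplot(tab, beside = TRUE, legend.text = TRUE)

# Stacked bar graph: one bar per column, split by row category
barplot(tab, legend.text = TRUE)
```

The beside argument is the only difference between the two graphs: TRUE places the bars side by side, while the default FALSE stacks them.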
a. Add a title for the graph and labels for the two axes.
b. Use different colors for the bars. You can use the colors() function to list the
names of the available colors in R.
3.2. Numerical variables
Measures of central tendency and relative standing: the mean, median, Q1, Q3, minimum,
and maximum are all computed by the summary() function.
o summary(data1$mpg)
o var(data1$mpg)
o sd(data1$mpg)
o IQR(data1$mpg)
o stem(data1$mpg)
Histogram:
o hist(data1$mpg)
Box plot: The following commands draw box plots for numerical data:
o boxplot(data1$mpg)
o boxplot.stats(data1$mpg)  # obtain the statistics used to construct the box plot
o boxplot(data1$mpg ~ data1$gear)  # draw box plots of mpg for each number of gears
When the dependent and independent variables are both numerical, we can use a scatter
plot. At a glance, this plot shows the overall pattern and any possible groups or
outliers in the data set.
Let’s consider the variables mpg and wt (i.e., weight). We produce the scatter plot
with the plot() function:
o plot(data1$mpg~data1$wt)
Exercise 6: Add a title to the plot and comment on the trend in this data set. Are
there any outliers or groups of data points?
4. Inferential statistics
4.1. Hypothesis testing of mean(s)
a. One mean:
o t.test(x, alternative = "less", mu = 100)
The above function performs a one-sample t-test on the data contained in x, using a
left-tailed test with H0: μ = 100 against Ha: μ < 100.
Note: The conf.level argument allows us to specify the confidence level of the
reported Confidence Interval (CI) for the relevant parameter in each t-test.
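The sketch below shows the alternative, mu, and conf.level arguments used together. The data (mpg from the built-in mtcars data set) and the hypothesized mean of 25 are assumptions chosen purely for illustration.

```r
# Illustrative data (assumption): fuel consumption from the built-in mtcars data
x <- mtcars$mpg

# Left-tailed one-sample t-test of H0: mu = 25 against Ha: mu < 25,
# reporting a 99% confidence interval
res <- t.test(x, alternative = "less", mu = 25, conf.level = 0.99)
print(res)

# Individual results can be extracted from the returned object
res$statistic   # t statistic
res$p.value     # p-value
res$conf.int    # one-sided interval: the lower bound is -Inf for "less"
```

Because the test is one-sided, the reported interval is also one-sided; use alternative = "two.sided" to obtain the usual two-sided confidence interval.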
Exercise 7: An investor with a stock portfolio worth several hundred thousand dollars
sued his broker and brokerage firm because lack of diversification in his portfolio led to
poor performance. The conflict was settled by an arbitration panel that gave
“substantial damages” to the investor. File diversify.csv gives the rates of return for the
39 months that the account was managed by the broker. The arbitration panel compared
these returns with the average of the Standard & Poor’s 500-stock index for the same
period. Consider the 39 monthly returns as a random sample from the population of
monthly returns that the brokerage would generate if it managed the account forever.
Are these returns compatible with a population mean of μ = 0.95%, the S&P 500
average? Perform the test at the 0.01 level of significance.
b. Two means:
o t.test(y ~ x, data = dataframe, alternative = "two.sided")  # if y is a numeric variable and x is a dichotomous variable
o t.test(y1, y2, paired = FALSE, alternative = "two.sided")  # if y1 and y2 are numeric and the samples are independent
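To illustrate that the two calling forms are equivalent, the sketch below runs both on the built-in mtcars data, comparing mpg between the two transmission types. The choice of data and the names res_formula, y1, y2, and res_vectors are assumptions for illustration.

```r
# Form 1: numeric response (mpg) and dichotomous grouping variable (am)
res_formula <- t.test(mpg ~ am, data = mtcars, alternative = "two.sided")

# Form 2: two numeric vectors from independent samples
y1 <- mtcars$mpg[mtcars$am == 0]
y2 <- mtcars$mpg[mtcars$am == 1]
res_vectors <- t.test(y1, y2, paired = FALSE, alternative = "two.sided")

# Both forms give the same test statistic and p-value
print(res_formula)
print(res_vectors$p.value)
```

By default t.test() does not assume equal variances (var.equal = FALSE), so both calls perform Welch's t-test; set var.equal = TRUE for the pooled-variance version.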
Exercise 8: Load the HomePrices.csv dataset. Home values tend to increase over time
under normal conditions, but the recession of 2008 and 2009 has reportedly caused the
sales price of existing homes to fall nationwide (BusinessWeek, March 9, 2009). You
would like to see if the data support this conclusion. The file HomePrices contains data
on 30 existing home sales in 2006 and 40 existing home sales in 2009. Would you feel
justified in concluding that resale prices of existing homes have declined from 2006 to
2009? Why or why not? Use the 0.01 level of significance.
Exercise 9: The data come from W. S. Gosset's 1908 paper and are available in R as the
built-in dataset sleep. Two different sleeping drugs were taken by two groups of
patients (10 patients each). The variable "extra" is the increase in hours of sleep,
and the variable "group" indicates which drug each patient took. Does the data
indicate that the first drug is less effective than the second? Use the 0.05 level of
significance.
4.2. Hypothesis testing of proportion(s)
Base R does not supply a function that reports the Z statistic for tests of
proportion(s); the built-in prop.test() reports a chi-squared statistic instead. Use
the functions written by the FMT teachers to perform the one-proportion Z test and the
two-proportion Z test. You are permitted to use these functions in your project report.
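To show what such functions compute, here is a minimal sketch of both Z tests. These are hypothetical helpers written for illustration, not the functions supplied by the FMT teachers, whose names and arguments may differ. The counts in the usage lines are made up.

```r
# Hypothetical helpers for illustration only (not the FMT teachers' functions)

# p-value of a Z statistic for the chosen alternative hypothesis
z_p_value <- function(z, alternative) {
  switch(alternative,
         two.sided = 2 * pnorm(-abs(z)),
         less      = pnorm(z),
         greater   = pnorm(z, lower.tail = FALSE))
}

# One-proportion Z test of H0: p = p0, given x successes in n trials
one_prop_z <- function(x, n, p0,
                       alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  p_hat <- x / n
  z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
  list(p_hat = p_hat, z = z, p_value = z_p_value(z, alternative))
}

# Two-proportion Z test of H0: p1 = p2, using the pooled sample proportion
two_prop_z <- function(x1, n1, x2, n2,
                       alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)
  z <- (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  list(p1 = p1, p2 = p2, z = z, p_value = z_p_value(z, alternative))
}

# Made-up counts: 60 successes in 100 trials, testing Ha: p > 0.5
res1 <- one_prop_z(60, 100, p0 = 0.5, alternative = "greater")

# Made-up counts: 30 of 50 versus 20 of 50, testing Ha: p1 > p2
res2 <- two_prop_z(30, 50, 20, 50, alternative = "greater")

print(res1)
print(res2)
```

As a cross-check, prop.test(60, 100, p = 0.5, correct = FALSE) reports a chi-squared statistic equal to the square of the one-proportion Z statistic.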
a. One proportion
Exercise 10: In a cover story, BusinessWeek published information about sleep habits of
Americans (BusinessWeek, January 26, 2004). The article noted that sleep deprivation
causes a number of problems, including highway deaths. Fifty-one percent of adult
drivers admit to driving while drowsy. A researcher hypothesized that this issue was
an even bigger problem for night shift workers.
i. Formulate the hypotheses that can be used to help determine whether more than
51% of the population of night shift workers admit to driving while drowsy.
ii. A sample of 500 night shift workers identified those who admitted to driving while
drowsy. See the Drowsy.csv file. What is the sample proportion? What is the
p-value? At α = 0.01, what is your conclusion?
b. Two proportions
Exercise 11: JupiterResearch estimated that the U.S. online dating market reached $932
million in 2011 and that the European online dating sites doubled revenues from 243
million euros in 2006 to 549 million euros in 2011. When trying to start a new
relationship, people want to make a favorable impression. Sometimes they will even
stretch the truth a bit when disclosing information about themselves. A study of
deception in online dating examined the accuracy of the information that 80 online
daters gave in their profiles. The study found that 22 of 40 men lied about their
height, while 17 of 40 women were deceptive in this way. A difference between the
person’s actual height and that reported in the online dating profile was classified as a
lie if it was greater than 0.5 inches.
i. Find the sample proportion of men who lied about their height. Do the same for
the women.
ii. Do men lie more often about their height than women? State the hypotheses that
can be used to test the assumption. What is the p-value? Use α = 0.05. What is
your conclusion?
To check the normality assumption of the t-test, we can use the following methods: a
histogram, a Q-Q plot, and the Shapiro-Wilk test (whose null hypothesis is that the
sample comes from a normally distributed population).
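The three checks can be sketched as follows, using mpg from the built-in mtcars data as illustrative data (an assumption; substitute the variable you are testing).

```r
# Illustrative data (assumption): fuel consumption from the built-in mtcars data
x <- mtcars$mpg

# Histogram: look for a roughly symmetric, bell-shaped distribution
hist(x)

# Normal Q-Q plot: points close to the reference line suggest normality
qqnorm(x)
qqline(x)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(x)
print(sw)
```

A p-value above the chosen significance level means the test does not reject normality; it does not prove that the data are normal, so the plots should be read alongside it.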