2023 Tutorial 12

PAS 2023 – Introductory Lab
Descriptive and Inferential statistics

With R
Contents
1. Objectives
2. Import data set into R
3. Descriptive statistics
3.1. Categorical variables
a. One categorical variable
b. Two categorical variables
3.2. Numerical variables
a. One numerical variable
b. Two numerical variables
4. Inferential statistics
4.1. Hypothesis testing of mean(s)
a. One mean:
b. Two means:
4.2. Hypothesis testing of proportion(s)
a. One proportion
b. Two proportions
4.3. Checking assumptions (optional)
1. Objectives
- Reading a data file into R and introducing the data environment in R
- Providing descriptive statistics methods for categorical and numerical variables
- Understanding basic inferential statistical methods
2. Import data set into R
Before reading a data file into R, you should create your own folder to save the
necessary materials and output of your working sessions. In order to check the current
working directory, we use:
o getwd()
The output of this command is the current folder, which R chooses automatically. If
you wish to change to your own folder, the following code will help:
o setwd("mydirectory")
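The two commands above can be tried safely with the sketch below. It assumes a hypothetical folder named Learning R inside R's temporary directory, so that nothing in your own files is changed; in practice you would use the path of your own folder.

```r
# Hypothetical folder name; a temporary location is used so the sketch
# does not touch your real files.
lab_dir <- file.path(tempdir(), "Learning R")
dir.create(lab_dir, showWarnings = FALSE)  # create the folder if it is missing

old_dir <- setwd(lab_dir)    # setwd() invisibly returns the previous directory
current_dir <- getwd()       # now points at the new folder
print(basename(current_dir))

setwd(old_dir)               # restore the original working directory
```

Keeping the value returned by setwd() makes it easy to switch back to the previous folder afterwards.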
Exercise 1: Create a folder in a particular directory and label it Learning R. Copy and
paste all the files to be used for this lab into that directory.
There are three main types of data files: comma-separated values files (.csv), text files
(.txt), and Excel files (.xls or .xlsx). We will focus on the first one, using the built-in
function read.csv():
o ourDataframe <- read.csv("filename", stringsAsFactors = FALSE)
filename: name of the delimited text file
stringsAsFactors: a logical value that tells R whether to convert character data into
factors. If you set stringsAsFactors = FALSE (the default since R 4.0.0), R will not
automatically convert character variables into factors (you need to convert them yourself
if needed). Factors can be understood simply as categorical variables.
read.csv() returns a data frame and we assign it to ourDataframe so we can refer
to it again and again.
Now, let’s read in the mtcars.csv dataset from your working directory.
o data1 <- read.csv("mtcars.csv",stringsAsFactors = FALSE)

o head(data1) #to see the first 6 rows of your data frame
o str(data1) #to check the structure of variables
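If mtcars.csv is not at hand, the reading step can still be rehearsed with the sketch below. It assumes that the course file resembles R's built-in mtcars data set, which may not be exactly true; the temporary file path and the name data_demo are made up for illustration.

```r
# Assumption: the built-in mtcars data stands in for the course file.
csv_path <- file.path(tempdir(), "mtcars.csv")   # hypothetical location
write.csv(mtcars, csv_path, row.names = FALSE)   # save without the row names

# Read the file back exactly as described above.
data_demo <- read.csv(csv_path, stringsAsFactors = FALSE)

head(data_demo)          # first 6 rows (the default)
head(data_demo, n = 3)   # the n argument changes how many rows are shown
dim(data_demo)           # number of rows and columns
str(data_demo)           # type of each variable
```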
Exercise 2:
a. What does the following code do?

o extracted1 <- data1[1:10,]
o extracted2 <- data1[,1:3]
b. Select all the rows of the data1 data frame that have mpg of at least 20.0.
c. (Optional) Change the name of variables.
3. Descriptive statistics
3.1. Categorical variables
R does not treat a variable as categorical unless we convert it into a
factor with the function factor(). For example, the variable am has only two values, so
we will convert it into a factor to use it as a categorical variable (when we draw a
frequency table or a bar chart, for example).
o data1$am<-factor(data1$am,ordered=FALSE,levels=c(0,1),
labels=c("automatic","manual"))
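The same conversion can be sketched on another variable. The example below uses the cyl variable of the built-in mtcars data (an assumption: your data1 has the same columns); the copy named cars and the label texts are chosen only for illustration.

```r
cars <- mtcars   # work on a copy of the built-in data

# Convert the number of cylinders into an unordered factor with readable labels.
cars$cyl <- factor(cars$cyl, ordered = FALSE, levels = c(4, 6, 8),
                   labels = c("4 cylinders", "6 cylinders", "8 cylinders"))

is.factor(cars$cyl)   # check that the conversion worked
levels(cars$cyl)      # the labels become the factor levels
table(cars$cyl)       # counts are now reported per label
```

The order of labels must match the order of levels, because R pairs them position by position.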
Exercise 3: Convert the gear variable into factor, using the labels “3 gears”, “4
gears”, “5 gears”.
a. One categorical variable
The descriptive methods for categorical variables include (relative) frequency tables and
bar charts. The procedure can be conducted in R as follows:
o am.table<-table(data1$am) #create a frequency table

o prop.table(am.table) #relative frequency table
o prop.table(am.table)*100 #percent frequency table
o barplot(am.table, col = "skyblue", main = "Barplot of Types of Transmission",
xlab = "Types of Transmission", ylab = "Frequency") #create a bar chart
Exercise 4: Adjust the y-axis limit if it cannot cover the length of bars.
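The frequency-table steps of part a can also be sketched on the cyl variable of the built-in mtcars data (an assumption made so that the code runs without the course file). The object names are illustrative.

```r
cyl_table <- table(mtcars$cyl)        # frequency table
cyl_prop  <- prop.table(cyl_table)    # relative frequencies
cyl_pct   <- round(100 * cyl_prop, 1) # percentages, rounded to one decimal

print(cyl_table)
print(cyl_pct)

# Relative frequencies always add up to 1.
sum(cyl_prop)
```

Rounding the percentages only for display keeps the unrounded proportions available for later calculations.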
b. Two categorical variables
We can use a clustered or stacked bar graph to describe two categorical variables
simultaneously. First, we need to form the cross-tabulation table:
o trans.vs.gear <- table(data1$am,data1$gear)
o barplot(trans.vs.gear, col=c("red", "yellow"), beside=TRUE,
ylim=c(0,20)) #clustered bar graph
o barplot(trans.vs.gear,col=c("red","yellow")) #stacked bar graph
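A cross-tabulation can be inspected before it is plotted. The sketch below uses the cyl and vs variables of the built-in mtcars data (an assumption, so that it runs without the course file); addmargins() and the margin argument of prop.table() are standard R, while the object names are illustrative.

```r
# Cross-tabulation of number of cylinders (rows) by engine shape (columns).
tab <- table(cyl = mtcars$cyl, vs = mtcars$vs)

with_totals <- addmargins(tab)             # add row and column totals
row_prop    <- prop.table(tab, margin = 1) # proportions within each row

print(with_totals)
print(round(row_prop, 2))

# In a clustered bar graph, each row of the table gets its own bar colour;
# legend.text = TRUE adds a legend built from the row names.
barplot(tab, beside = TRUE, legend.text = TRUE)
```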
Exercise 5: For the above bar graphs:
a. Add a title for the graph and labels for the two axes.
b. Use different colors for the bars. You can use the colors() function to list the
names of the available colors in R.
3.2. Numerical variables

a. One numerical variable
Measures of central tendency and relative standing: the mean, median, Q1, Q3, minimum
and maximum are computed by the summary() function.
o summary(data1$mpg)
Measures of variability: the variance and standard deviation of a variable are computed
in R by the functions var() and sd(). The interquartile range is computed by
the function IQR().
o var(data1$mpg)
o sd(data1$mpg)
o IQR(data1$mpg)
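To see what these three functions compute, the sketch below reproduces them by hand on the mpg variable of the built-in mtcars data (an assumption, so that it runs without the course file). The object names are illustrative.

```r
x <- mtcars$mpg

# Sample variance: squared deviations from the mean, divided by n - 1.
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

# Standard deviation: square root of the variance.
manual_sd <- sqrt(manual_var)

# Interquartile range: Q3 minus Q1, using R's default quantile method.
q <- quantile(x, c(0.25, 0.75), names = FALSE)
manual_iqr <- q[2] - q[1]

all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
all.equal(manual_iqr, IQR(x))
```

Note that var() and sd() use the divisor n - 1, i.e. they return the sample (not population) variance and standard deviation.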
Stem and leaf display:
o stem(data1$mpg)
Histogram:
o hist(data1$mpg)
Box plot: The following commands draw and summarize box plots (for numerical data):
o boxplot(data1$mpg)
o boxplot.stats(data1$mpg) #to obtain the statistics used to construct the boxplot
o boxplot(data1$mpg ~ data1$gear) #to draw boxplots of mpg for each number of gears
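The result of boxplot.stats() is a list, and its parts can be read individually. The sketch below uses the mpg variable of the built-in mtcars data (an assumption, so that it runs without the course file).

```r
bs <- boxplot.stats(mtcars$mpg)

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$n      # number of non-missing observations
bs$out    # observations drawn as individual points beyond the whiskers

# The third element of stats is the median.
bs$stats[3] == median(mtcars$mpg)
```

The hinges are close to Q1 and Q3 but are computed slightly differently from the quartiles reported by summary(), so small discrepancies between the two are not an error.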
b. Two numerical variables
When the dependent and independent variables are both numerical, we can use a scatter
plot. At first glance, this plot shows the overall pattern and any possible groups
or outliers in our data set.
Let's consider the variables mpg and wt (i.e., weight). We produce the scatter plot with
the plot() function:
o plot(data1$mpg~data1$wt)
Exercise 6: Add a title to the plot and comment on the trend of this data set. Are there
any outliers or groups of data points?
4. Inferential statistics
4.1. Hypothesis testing of mean(s)
a. One mean:
o t.test(x, alternative = "less", mu = 100)
The above function performs a one-sample t-test on the data contained in x, using a
left-tailed test with H0: μ = 100.
Other options for the alternative hypothesis are "greater" and "two.sided".
Note: The conf.level argument allows us to specify the confidence level of the
reported Confidence Interval (CI) for the relevant parameter in each t-test.
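The sketch below runs a left-tailed one-sample t-test and checks the result by hand. It uses the mpg variable of the built-in mtcars data and a hypothesized mean of 22; both are assumptions chosen only for illustration.

```r
x   <- mtcars$mpg
mu0 <- 22   # hypothetical value for the null hypothesis H0: mu = 22

res <- t.test(x, alternative = "less", mu = mu0, conf.level = 0.99)

res$statistic  # t statistic
res$parameter  # degrees of freedom (n - 1)
res$p.value    # p-value of the left-tailed test
res$conf.int   # 99% confidence interval (one-sided, matching the alternative)

# Manual check: t = (sample mean - mu0) / (s / sqrt(n)).
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
p_manual <- pt(t_manual, df = length(x) - 1)   # left-tail probability
```

We reject H0 at significance level α when the p-value is smaller than α.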
Exercise 7: An investor with a stock portfolio worth several hundred thousand dollars
sued his broker and brokerage firm because lack of diversification in his portfolio led to
poor performance. The conflict was settled by an arbitration panel that gave
“substantial damages” to the investor. File diversify.csv gives the rates of return for the
39 months that the account was managed by the broker. The arbitration panel compared
these returns with the average of the Standard & Poor’s 500-stock index for the same
period. Consider the 39 monthly returns as a random sample from the population of
monthly returns that the brokerage would generate if it managed the account forever.
Are these returns compatible with a population mean of μ = 0.95%, the S&P 500
average? Perform that test at 0.01 level of significance.
b. Two means:
o t.test(y~x,data=dataframe,alternative="two.sided") #if y is a
#numeric variable and x is a dichotomous variable
o t.test(y1,y2,paired=FALSE,alternative="two.sided") #if y1 and
#y2 are numeric, and the samples are independent
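Both calling styles give the same test. The sketch below compares mpg between the two transmission types in the built-in mtcars data (an assumption, so that it runs without the course files); the object names are illustrative.

```r
# Formula style: numeric response ~ two-level grouping variable.
res_formula <- t.test(mpg ~ am, data = mtcars, alternative = "two.sided")

# Two-vector style: split the response by group first.
mpg_auto   <- mtcars$mpg[mtcars$am == 0]
mpg_manual <- mtcars$mpg[mtcars$am == 1]
res_vectors <- t.test(mpg_auto, mpg_manual, paired = FALSE,
                      alternative = "two.sided")

res_formula$p.value
res_vectors$p.value   # same value as above
```

By default t.test() does not assume equal variances (Welch's test); adding var.equal = TRUE gives the pooled-variance t-test instead. For one-sided alternatives in the formula style, the difference is taken as first group level minus second group level.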
Exercise 8: Load the HomePrices.csv dataset. Home values tend to increase over time
under normal conditions, but the recession of 2008 and 2009 has reportedly caused the
sales price of existing homes to fall nationwide (BusinessWeek, March 9, 2009). You
would like to see if the data support this conclusion. The file HomePrices contains data
on 30 existing home sales in 2006 and 40 existing home sales in 2009. Would you feel
justified in concluding that resale prices of existing homes have declined from 2006 to
2009? Why or why not? Use 0.01 level of significance.
Exercise 9: The data come from W.S. Gosset's 1908 paper and are available in R as the
built-in dataset sleep. Two different sleeping drugs were taken by two groups of patients.
The variable "extra" is the increase in hours of sleep for each group (consisting of 10
patients each). The variable "group" gives the labels for which drug each patient took.
Does the information indicate that the first drug is less effective than the second? Use
the 0.05 level of significance.
4.2. Hypothesis testing of proportion(s)
R does not supply a dedicated function for Z tests of proportion(s). Use the functions
written by the FMT teachers to perform the one-proportion Z test and the
two-proportion Z test. You are permitted to use these functions in your project report.
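To show what such functions compute, here is a minimal sketch of both Z tests. These are NOT the functions written by the FMT teachers: the names z_p_value, z_test_one_prop and z_test_two_prop, their arguments, and the example counts are all hypothetical.

```r
# Hypothetical helper: turn a z statistic into a p-value.
z_p_value <- function(z, alternative) {
  switch(alternative,
         two.sided = 2 * pnorm(-abs(z)),
         less      = pnorm(z),
         greater   = pnorm(z, lower.tail = FALSE))
}

# One-proportion Z test of H0: p = p0, with x successes in n trials.
z_test_one_prop <- function(x, n, p0,
                            alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  p_hat <- x / n
  z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
  list(p_hat = p_hat, z = z, p_value = z_p_value(z, alternative))
}

# Two-proportion Z test of H0: p1 = p2, using the pooled proportion.
z_test_two_prop <- function(x1, n1, x2, n2,
                            alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)
  z <- (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  list(p1 = p1, p2 = p2, z = z, p_value = z_p_value(z, alternative))
}

# Made-up counts, for illustration only.
one <- z_test_one_prop(60, 100, p0 = 0.5, alternative = "greater")
two <- z_test_two_prop(45, 100, 30, 100, alternative = "two.sided")
one$z
two$p_value
```

For a two-sided test, the built-in prop.test() with correct = FALSE reports a chi-squared statistic equal to the square of this z, so the two approaches give the same p-value.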
a. One proportion
Exercise 10: In a cover story, BusinessWeek published information about sleep habits of
Americans (BusinessWeek, January 26, 2004). The article noted that sleep deprivation
causes a number of problems, including highway deaths. Fifty-one percent of adult
drivers admit to driving while drowsy. A researcher hypothesized that this issue was
an even bigger problem for night shift workers.
i. Formulate the hypotheses that can be used to help determine whether more than
51% of the population of night shift workers admit to driving while drowsy.
ii. A sample of 500 night shift workers identified those who admitted to driving while
drowsy. See the Drowsy.csv file. What is the sample proportion? What is the p-
value? At α = .01, what is your conclusion?
b. Two proportions
Exercise 11: JupiterResearch estimated that the U.S. online dating market reached $932
million in 2011 and that the European online dating sites doubled revenues from 243
million euros in 2006 to 549 million euros in 2011. When trying to start a new
relationship, people want to make a favorable impression. Sometimes they will even
stretch the truth a bit when disclosing information about themselves. A study of
deception in online dating examined the accuracy of the information that 80 online
daters gave in their online dating profiles. The study found that 22 of 40 men lied about
their height, while 17 of 40 women were deceptive in this way. A difference between the
person's actual height and that reported in the online dating profile was classified as a
lie if it was greater than 0.5 inches.
i. Find the sample proportion of men who lied about their height. Do the same for
the women.
ii. Do men lie more often about their height than women? State the hypotheses that
can be used to test the assumption. What is the p-value? Use α = 0.05. What is
your conclusion?
4.3. Checking assumptions (optional)
To check the normality assumption of the t-test, we can use the following
methods: a histogram, a QQ-plot, or the Shapiro-Wilk test (whose null hypothesis is that
the sample comes from a normally distributed population).
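The three checks can be sketched as follows, using the mpg variable of the built-in mtcars data (an assumption, so that the code runs without the course files).

```r
x <- mtcars$mpg

# Histogram: look for a roughly symmetric, bell-shaped distribution.
hist(x, main = "Histogram of mpg", xlab = "mpg")

# QQ-plot: points close to the reference line suggest normality.
qqnorm(x)
qqline(x)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(x)
sw$statistic  # W statistic
sw$p.value
```

A large p-value does not prove normality; it only means the test found no strong evidence against it, so the plots should be read alongside the test.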