Lab 2021-Probability and Statistics
Descriptive and Inferential statistics
With R
Contents
1. Objectives
2. Import data set into R
3. Descriptive statistics
3.1. Categorical variables
a. One categorical variable
b. Two categorical variables
3.2. Numerical variables
a. One numerical variable
b. Two numerical variables
4. Inferential statistics
4.1. Hypothesis testing of mean(s)
a. One mean
b. Two means
4.2. Hypothesis testing of proportion(s)
a. One proportion
b. Two proportions
4.3. Checking assumptions (optional)
1. Objectives
- Read data file into R, introduction to data environment in R
- Descriptive statistics methods for categorical and numerical variables
- Basic Inferential statistics methods
2. Import data set into R
Before reading a data file into R, create your own folder to store the
necessary material and the output of your working sessions. To check the
current working directory, use:
o getwd()
The output of this command is the current folder, which R chooses automatically. If
you wish to change to your own folder, use:
o setwd("mydirectory")
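The two commands can be tried safely as follows. This is a minimal sketch: the temporary folder returned by tempdir() is only a stand-in for your own folder, and old_dir is a name chosen for illustration.

```r
# Remember the current working directory so it can be restored later
old_dir <- getwd()
print(old_dir)

# tempdir() is a stand-in for your own folder (assumption for this sketch)
setwd(tempdir())
print(getwd())

# Go back to where we started
setwd(old_dir)
```

Restoring the original directory at the end keeps the rest of the session unaffected.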
Exercise 1: Create a folder in your E directory and label it Learning R. Copy and paste
all the files to be used for this lab into that directory.
There are three main types of data file: comma-separated values files (.csv), text files
(.txt), and Excel files (.xls or .xlsx). We will focus on the first one, using the function
read.csv():
o ourDataframe <- read.csv("filename", stringsAsFactors = FALSE)
filename: name of the delimited text file
stringsAsFactors: a logical value that tells R whether to convert character data into
factors. If you set stringsAsFactors = FALSE, R will not automatically
convert character variables into factors (if you need factors later on, you must
convert them yourself). Factors can be understood simply as categorical variables.
read.csv() returns a data frame, which we assign to ourDataframe so that we can refer
to it again later.
Now, let's read in the [Link] dataset from your working directory.
o data1 <- read.csv("[Link]", stringsAsFactors = FALSE)
o head(data1) # to see the first 6 rows of your data frame
o str(data1) # to check the structure of the variables
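The reading step can be sketched end to end without the lab file. The example below writes R's built-in mtcars data to a temporary CSV file and reads it back; the file name example_cars.csv and the object names are hypothetical, and mtcars is only assumed to resemble the lab data because it has columns with the same names (mpg, am, gear, wt).

```r
# Write a built-in data set to a temporary CSV file so the example is self-contained
csv_path <- file.path(tempdir(), "example_cars.csv")  # hypothetical file name
write.csv(mtcars, csv_path, row.names = FALSE)

# Read it back; character columns stay character instead of becoming factors
cars <- read.csv(csv_path, stringsAsFactors = FALSE)

head(cars)       # first 6 rows by default
head(cars, 10)   # first 10 rows
str(cars)        # type and first values of every column
dim(cars)        # number of rows and number of columns
```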
Exercise 2.
a. What does the following code do?
o extracted1 <- data1[1:10,]
o extracted2 <- data1[,1:3]
b. Select all the rows of the data1 data frame that have mpg of at least 20.0.
c. (Optional) Change the name of variables.
3. Descriptive statistics
3.1. Categorical variables
R does not treat a variable as categorical unless we convert it into a
factor with the function factor(). For example, the variable am has only two values, so
we convert it into a factor in order to use it as a categorical variable (for example, when
we draw a frequency table or a bar chart).
o data1$am<-factor(data1$am,ordered=FALSE,levels=c(0,1),
labels=c("automatic","manual"))
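The effect of factor() is easiest to see on a small vector. In this sketch the vector codes is made up and simply stands in for a 0/1 column such as data1$am.

```r
# Hypothetical 0/1 codes, standing in for a column such as data1$am
codes <- c(0, 1, 1, 0, 0, 1, 0)

# levels lists the raw values, labels gives the name shown for each value
transmission <- factor(codes, levels = c(0, 1),
                       labels = c("automatic", "manual"))

levels(transmission)     # the category labels
table(transmission)      # counts per category
is.factor(transmission)  # TRUE once the conversion has been made
```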
Exercise 3.
Convert the gear variable into a factor, using the labels "3 gears", "4 gears", ...
a. One categorical variable
The descriptive methods for categorical variables include (relative) frequency tables and
bar charts. The procedure can be carried out in R as follows:
o #create frequency table
o [Link] <- table(data1$am)
o prop.table([Link]) # relative frequency table
o prop.table([Link]) * 100 # percent frequency table
o #create bar chart
o barplot([Link], col="skyblue",
main="Barplot of Types of Transmission",
xlab="Types of Transmission", ylab="Frequency")
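A runnable version of these steps is sketched below. It uses the built-in mtcars data as a stand-in for data1, and the object names am_counts and am_props are hypothetical choices.

```r
cars <- mtcars   # built-in data set used as a stand-in for data1
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

am_counts <- table(cars$am)         # frequency table
am_props  <- prop.table(am_counts)  # relative frequencies, which sum to 1
am_counts
am_props
round(am_props * 100, 1)            # percent frequencies

# Bar chart of the counts
barplot(am_counts, col = "skyblue",
        main = "Barplot of Types of Transmission",
        xlab = "Types of Transmission", ylab = "Frequency")
```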
Exercise 4.
Adjust the y-axis limit if it does not cover the full height of the bars.
b. Two categorical variables
We can use a clustered or stacked bar graph to describe two categorical variables
simultaneously. First, we need to form the cross-tabulation table:
o [Link] <- table(data1$am, data1$gear)
o barplot([Link], col=c("red", "yellow"), beside=TRUE,
ylim=c(0,20)) # clustered bar graph
o barplot([Link], col=c("red", "yellow")) # stacked bar graph
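The cross-tabulation can be inspected before plotting. The sketch below again uses mtcars as a stand-in for data1; am_by_gear is a hypothetical name.

```r
cars <- mtcars   # built-in data set used as a stand-in for data1
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Rows are transmission types, columns are numbers of gears
am_by_gear <- table(cars$am, cars$gear)
am_by_gear
addmargins(am_by_gear)   # same table with row and column totals added

# legend.text = TRUE labels the bar colours with the row names of the table
barplot(am_by_gear, beside = TRUE, col = c("red", "yellow"),
        ylim = c(0, 20), legend.text = TRUE)
```

With beside = TRUE each column of the table becomes a cluster of bars; without it the rows are stacked within one bar per column.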
Exercise 5. For the above bar graphs:
a. Add a title for the graph and labels for the two axes.
b. Use different colors for the bars. You can use the colors() function to list the
names of the available colors in R.
3.2. Numerical variables
a. One numerical variable
Measures of central tendency and relative standing: the mean, median, Q1, Q3, minimum
and maximum are computed by the summary() function.
o summary(data1$mpg)
Measures of variability: the variance and standard deviation of a variable are computed
in R by the functions var() and sd(). The interquartile range is computed by the function
IQR().
o var(data1$mpg)
o sd(data1$mpg)
o IQR(data1$mpg)
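The relationships between these measures can be checked directly. This is a small sketch using the mpg column of the built-in mtcars data as a stand-in for data1$mpg.

```r
x <- mtcars$mpg   # stand-in for data1$mpg

s <- summary(x)   # Min, Q1, Median, Mean, Q3, Max
s

# The standard deviation is the square root of the (sample) variance
all.equal(sd(x), sqrt(var(x)))

# The IQR is the distance between the third and first quartiles
q <- quantile(x, c(0.25, 0.75))
all.equal(IQR(x), unname(q[2] - q[1]))
```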
Stem and leaf display:
o stem(data1$mpg)
Histogram:
o hist(data1$mpg)
Box plot:
The following commands produce box plots (for numerical data):
o boxplot(data1$mpg)
o boxplot.stats(data1$mpg) # to obtain the statistics used to
construct the boxplot
o boxplot(data1$mpg ~ data1$gear) # to draw one boxplot of mpg for
each value of the Number of Gears variable
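The list returned by boxplot.stats() can be read component by component. The sketch below uses mtcars as a stand-in for data1.

```r
x <- mtcars$mpg   # stand-in for data1$mpg
b <- boxplot.stats(x)

b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$n       # number of non-missing observations
b$out     # points beyond the whiskers (drawn individually in the box plot)

# The formula form with a data argument avoids repeating the data frame name
boxplot(mpg ~ gear, data = mtcars,
        xlab = "Number of gears", ylab = "Miles per gallon")
```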
b. Two numerical variables
When the dependent and independent variables are both numerical, we use a scatter
plot. From this plot we can read, at first glance, the overall pattern and any possible
groups or outliers in our data set.
Let's consider the variables mpg and wt (i.e., weight). We produce a scatter plot with the
plot() function:
o plot(data1$mpg~data1$wt)
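The same plot can be written with a data argument and axis labels. This sketch uses mtcars as a stand-in for data1 (the axis units are those documented for mtcars) and adds the correlation coefficient as a numeric companion to the picture; the name r is a hypothetical choice.

```r
# Scatter plot of mpg (vertical axis) against wt (horizontal axis)
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     pch = 19)

# Correlation coefficient: sign gives the direction, size the strength
# of the linear association
r <- cor(mtcars$wt, mtcars$mpg)
r
```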
Exercise 6. Add a title to the plot and comments on the trend of this data set. Are there
any outliers or groups of data points?
4. Inferential statistics
4.1. Hypothesis testing of mean(s)
a. One mean:
o t.test(x, alternative = "less", mu = 100)
The above function performs a one-sample t-test on the data contained in x, using a
left-tailed test with H0: μ = 100.
Other options for alternative are "greater" and "two.sided".
Note: The conf.level argument allows us to specify the confidence level of the
reported Confidence Interval (CI) for the relevant parameter in each t-test.
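The pieces of the test result can be extracted by name. The sketch below runs the left-tailed test on simulated data; the sample x, its size and the seed are made up for illustration.

```r
set.seed(1)                          # reproducible simulated data
x <- rnorm(25, mean = 97, sd = 8)    # hypothetical sample

# Left-tailed test of H0: mu = 100 against Ha: mu < 100
res <- t.test(x, alternative = "less", mu = 100, conf.level = 0.99)
res

res$statistic   # t statistic
res$p.value     # p-value
res$conf.int    # one-sided 99% confidence interval for mu

# The t statistic is (sample mean - mu0) / (s / sqrt(n))
(mean(x) - 100) / (sd(x) / sqrt(length(x)))

# Reject H0 at the 0.01 level only if the p-value is below 0.01
res$p.value < 0.01
```

For a one-sided alternative the reported interval is one-sided as well, which is why its lower limit is -Inf here.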
Exercise 7: An investor with a stock portfolio worth several hundred thousand dollars
sued his broker and brokerage firm because lack of diversification in his portfolio led to
poor performance. The conflict was settled by an arbitration panel that gave
“substantial damages” to the investor. File [Link] gives the rates of return for the
39 months that the account was managed by the broker. The arbitration panel compared
these returns with the average of the Standard & Poor’s 500-stock index for the same
period. Consider the 39 monthly returns as a random sample from the population of
monthly returns that the brokerage would generate if it managed the account forever.
Are these returns compatible with a population mean of μ = 0.95%, the S&P 500
average? Perform that test at 0.01 level of significance.
b. Two means:
o t.test(y~x, data=dataframe, alternative="two.sided") # if y is a
numeric variable and x is a dichotomous variable
o t.test(y1, y2, paired=FALSE, alternative="two.sided") # if y1 and
y2 are numeric, and the samples are independent
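The two calling styles give the same result when they describe the same data. The sketch below checks this on simulated data; the data frame df, its columns and the group labels are hypothetical.

```r
set.seed(2)   # reproducible simulated data
# Hypothetical long-format data: numeric response y, two-level group x
df <- data.frame(
  y = c(rnorm(12, mean = 10, sd = 2), rnorm(15, mean = 12, sd = 2)),
  x = rep(c("A", "B"), times = c(12, 15))
)

# Formula interface: compares the mean of y between the two levels of x
res_formula <- t.test(y ~ x, data = df, alternative = "two.sided")

# Vector interface: the same test on two separate numeric vectors
y1 <- df$y[df$x == "A"]
y2 <- df$y[df$x == "B"]
res_vectors <- t.test(y1, y2, paired = FALSE, alternative = "two.sided")

res_formula$method   # Welch test, because var.equal = FALSE is the default
all.equal(res_formula$p.value, res_vectors$p.value)
```

With a one-sided alternative, the direction refers to the first group minus the second, so the order of the factor levels (or of y1 and y2) matters.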
Exercise 8: Load the [Link] dataset. Home values tend to increase over time
under normal conditions, but the recession of 2008 and 2009 has reportedly caused the
sales price of existing homes to fall nationwide (BusinessWeek, March 9, 2009). You
would like to see if the data support this conclusion. The file HomePrices contains data
on 30 existing home sales in 2006 and 40 existing home sales in 2009. Would you feel
justified in concluding that resale prices of existing homes have declined from 2006 to
2009? Why or why not? Use 0.01 level of significance.
Exercise 9: The data come from W. S. Gosset's 1908 paper and are available in R as the
built-in dataset sleep. Two different sleeping drugs were taken by two groups of patients.
The variable "extra" is the increase in hours of sleep in each group (10 patients each).
The variable "group" gives the labels for which drug each patient took. Does the
information indicate that the first drug is less effective than the second type? Use 0.05
level of significance.
4.2. Hypothesis testing of proportion(s)
Base R has no function that reports a Z test of proportion(s) directly (prop.test()
performs the equivalent chi-squared test). Use the functions that have been written by
FMT teachers to perform the one-proportion Z test and the two-proportion Z test. You are
permitted to use these functions in your project report.
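To show what such functions compute, here is a minimal sketch of both Z tests. These helpers are hypothetical illustrations and are not the functions written by the FMT teachers; their names, arguments and return values are assumptions, and the counts in the usage lines are made up.

```r
# p-value from the standard normal distribution for a given alternative
z_p_value <- function(z, alternative) {
  switch(alternative,
         less      = pnorm(z),
         greater   = pnorm(z, lower.tail = FALSE),
         two.sided = 2 * pnorm(abs(z), lower.tail = FALSE),
         stop("alternative must be 'less', 'greater' or 'two.sided'"))
}

# Hypothetical helper: one-proportion Z test of H0: p = p0
z_test_one_prop <- function(x, n, p0, alternative = "two.sided") {
  p_hat <- x / n
  z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # standard error under H0
  list(p_hat = p_hat, z = z, p_value = z_p_value(z, alternative))
}

# Hypothetical helper: two-proportion Z test of H0: p1 = p2 (pooled proportion)
z_test_two_prop <- function(x1, n1, x2, n2, alternative = "two.sided") {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)
  z <- (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  list(p1 = p1, p2 = p2, z = z, p_value = z_p_value(z, alternative))
}

# Made-up counts: 60 successes out of 100, tested against p0 = 0.5
one <- z_test_one_prop(60, 100, p0 = 0.5, alternative = "greater")
# Made-up counts: 30 of 50 in group 1 versus 20 of 50 in group 2
two <- z_test_two_prop(30, 50, 20, 50, alternative = "greater")
one
two

# Cross-check against prop.test() without continuity correction
check_one <- prop.test(60, 100, p = 0.5, alternative = "greater", correct = FALSE)
check_two <- prop.test(c(30, 20), c(50, 50), alternative = "greater", correct = FALSE)
all.equal(one$p_value, check_one$p.value)
all.equal(two$p_value, check_two$p.value)
```

The cross-check works because, without the continuity correction, the chi-squared statistic reported by prop.test() is the square of the Z statistic.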
a. One proportion
Exercise 10: In a cover story, BusinessWeek published information about sleep habits of
Americans (BusinessWeek, January 26, 2004). The article noted that sleep deprivation
causes a number of problems, including highway deaths. Fifty-one percent of adult
drivers admit to driving while drowsy. A researcher hypothesized that this issue was
an even bigger problem for night shift workers.
i. Formulate the hypotheses that can be used to help determine whether more than
51% of the population of night shift workers admit to driving while drowsy.
ii. A sample of 500 night shift workers identified those who admitted to driving while
drowsy. See the [Link] file. What is the sample proportion? What is the p-
value? At α = .01, what is your conclusion?
b. Two proportions
Exercise 11: JupiterResearch estimated that the U.S. online dating market reached $932
million in 2011 and that the European online dating sites doubled revenues from 243
million euros in 2006 to 549 million euros in 2011. When trying to start a new
relationship, people want to make a favorable impression. Sometimes they will even
stretch the truth a bit when disclosing information about themselves. A study of
deception in online dating studied the accuracy of the information given in their online
dating profiles by 80 online daters. The study found that 22 of 40 men lied about their
height, while 17 of 40 women were deceptive in this way. A difference between the
person’s actual height and that reported in the online dating profile was classified as a
lie if it was greater than 0.5 inches.
i. Find the sample proportion of men who lied about their height. Do the same for
the women.
ii. Do men lie more often about their height than women? State the hypotheses that
can be used to test the assumption. What is the p-value? Use α = 0.05. What is
your conclusion?
4.3. Checking assumptions (optional)
To check the normality assumption of the t-test, we can use the following
methods: a histogram, a QQ-plot, or the Shapiro-Wilk test (whose null hypothesis is that
the sample comes from a normal distribution).
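The three checks can be sketched as follows, using the mpg column of the built-in mtcars data as a stand-in for the variable being tested; the name sw is a hypothetical choice.

```r
x <- mtcars$mpg   # stand-in for the variable used in the t-test

# Histogram: look for a roughly symmetric, bell-shaped distribution
hist(x, main = "Histogram of mpg", xlab = "mpg")

# QQ-plot: points close to the reference line suggest normality
qqnorm(x)
qqline(x)

# Shapiro-Wilk test, H0: the data come from a normal distribution
sw <- shapiro.test(x)
sw$statistic   # W statistic
sw$p.value     # a small p-value is evidence against normality
```

A large p-value does not prove normality; it only means the test found no evidence against it, so the plots are worth reading alongside the test.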