Using R For Basic Statistical Analysis
arthritis.csv – Data from Koch & Edwards (1988) from a double-blind clinical trial investigating a
new treatment for rheumatoid arthritis. The data contains 84 observations for the following 5
variables:
Variable    Description
ID          Patient ID
treatment   Treatment (Placebo or treated)
sex         Sex (Female, Male)
age         Age of patient
improved    Treatment outcome (None, Some, Marked)
2. Start RStudio, change the working directory to ‘c:\B350F’, and then import the datasets
“mtcars” and “arthritis” from their csv files into R, assigning them to “dat” and “dat2”, respectively.
(Hint: Use “read.csv” to import the files as explained in Task 1-8. Remember to set header = TRUE.)
2. The default output of summary() includes only the minimum, 1st quartile, median, mean, 3rd
quartile and maximum. To obtain additional statistical measures, you need to run other R code.
However, this code does NOT work on non-numerical variables (i.e. factors and characters).
BIA B350F – Using R for basic statistical analysis
Therefore, you need to make sure your dataset has only numerical variables. Here are the first few
lines of the “mtcars” dataset.
Note that the variable “model” is a character variable. We need to remove it before computing
additional statistical measures. You can apply the technique from Task 1-11 to exclude “model” from
the dataset. Here is an example:
Or
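Two equivalent ways to drop the column can be sketched as follows. This is only an illustration: it assumes that “dat” looks like the built-in mtcars data with the car names stored in a character column called “model”, and the name “dat_new2” is made up for the comparison.

```r
# Stand-in for the imported csv (assumption): built-in mtcars with the car
# names moved into a character column called "model".
dat <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

# Option 1: keep every column whose name is not "model".
dat_new <- dat[, names(dat) != "model"]

# Option 2: subset() with a negative column selection.
dat_new2 <- subset(dat, select = -model)

# Every remaining column should now be numeric.
str(dat_new)
```

Both options leave the same 11 numeric columns, so either result can be passed to the functions used below.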
The R package “e1071” has been pre-installed on your computer. Now, load this package in R, and
then run the following R program. What statistical measures are given in the output? Which
variable has the largest variation?
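library(e1071)                # provides skewness() and kurtosis()
apply(dat_new, 2, mean)       # mean of each column
apply(dat_new, 2, sd)         # standard deviation of each column
apply(dat_new, 2, mode)       # storage mode of each column (not the statistical mode)
apply(dat_new, 2, range)      # minimum and maximum of each column
apply(dat_new, 2, skewness)   # skewness of each column
apply(dat_new, 2, kurtosis)   # excess kurtosis of each column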
The apply() call tells R to compute a statistical measure for each variable in the dataset “dat_new”.
apply() is an R function for quick operations over the rows or columns of a matrix or array (a data
frame is converted to a matrix first). It is called as apply(variable, margin, function), where:
- variable is the object you want to apply the function to.
- margin specifies whether to apply the function by row (margin = 1), by column (margin = 2), or to
each element (margin = 1:2). The margin can be greater than 2 when working with arrays of
more than two dimensions.
- function is the function you want to apply to the elements of your variable. It can be any R
function, including a User Defined Function (UDF).
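The effect of the margin argument is easiest to see on a tiny matrix. The sketch below uses a made-up 2 x 3 matrix “m” (not part of the lab data) and a small user-defined function.

```r
# A 2 x 3 matrix filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums   <- apply(m, 1, sum)    # margin = 1: one result per row
col_means  <- apply(m, 2, mean)   # margin = 2: one result per column

# A user-defined function works the same way: width of each column's range.
col_widths <- apply(m, 2, function(x) max(x) - min(x))

row_sums
col_means
col_widths
```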
3. Another method is to use the describe() function in the R package “psych”. This function has a major
advantage over apply(): it displays the descriptive statistical measures all at once. Now, load the R
package “psych”, and then run the following R code. After that, type “?describe” to open the help
file and read the “Value” section to see what statistical measures it produces.
describe(dat_new)
Note:
(a) Skewness is a measure of the asymmetry of the probability distribution of a real-valued random
variable about its mean. Skewness can be positive or negative. The data are left-skewed when the
skewness statistic is negative, and right-skewed when the skewness statistic is positive. The
closer the skewness statistic is to 0, the more symmetric the data.
[Figure: two example distributions, one left-skewed (skewness = -0.5370) and one right-skewed (skewness = +0.5370)]
(b) Kurtosis measures the tendency of the data to be concentrated toward the center
or toward the tails of the distribution. A normal distribution has a kurtosis of 3. The kurtosis
functions used here report excess kurtosis, i.e. kurtosis minus 3, so a normal distribution gives a
value close to 0.
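To make the two definitions concrete, here is a minimal base-R sketch of the plain moment-based formulas. The function names and the small vectors are made up for illustration; the package functions apply slightly different small-sample variants, so their values may differ a little from these.

```r
# Moment-based skewness: third central moment over the variance^(3/2).
skew_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3.
excess_kurt_moment <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

right_tail <- c(1, 2, 2, 3, 3, 3, 4, 10)   # one large value pulls the tail right
left_tail  <- -right_tail                  # mirror image, tail on the left
symmetric  <- c(1, 2, 3, 4, 5)

skew_moment(right_tail)        # positive: right-skewed
skew_moment(left_tail)         # negative: left-skewed
skew_moment(symmetric)         # 0: symmetric
excess_kurt_moment(symmetric)  # negative: flatter than a normal distribution
```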
4. By default, R keeps the variable names found in the csv / txt files. If you print
“dat_new” from part 2, you can see that the variable “am” is stored as a numeric code, which is not
meaningful on its own. To give it a meaningful label, you can use ifelse() to decode the variable as follows:
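A minimal sketch of such a recode, assuming the built-in mtcars data stands in for “dat_new” and that am = 0 means automatic and am = 1 means manual (as documented in ?mtcars). The label strings are illustrative choices.

```r
dat_new <- mtcars   # stand-in for the data frame built in part 2 (assumption)

# ifelse(test, value_if_true, value_if_false) works element by element.
dat_new$Type <- ifelse(dat_new$am == 1, "manual", "automatic")

# Cross-check the new labels against the original codes.
table(code = dat_new$am, label = dat_new$Type)
```

The original “am” column is left untouched; “Type” is added as an extra character column.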
Run the program and then study the output. You have just created a new variable in “dat_new” called
“Type”. What are the changes made in the variable “Type”?
[...]lot, regression plots, etc. It can create one or
more plots and overlay them on a single set of axes.
Try the following R programs to learn more about different types of statistical graphics and interpret the
output.
Let’s break down the R program above to understand how this function works.
- mpg ~ type. You need to provide a formula, in y ~ x format. You can only input one y variable
and one x variable.
- main = "mile per gallon by transmission type". It is recommended to provide a title for
your plot. This is crucial if you make multiple boxplots at once.
- By default, R marks the outliers and draws the boxplot vertically. You can generate a
horizontal boxplot by setting horizontal = TRUE.
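The arguments described above can be combined as in the following sketch. It assumes the built-in mtcars data and rebuilds a lowercase “type” column from “am”; the data frame name “d” is made up for the example.

```r
d <- mtcars
d$type <- ifelse(d$am == 1, "manual", "automatic")   # grouping variable (assumed labels)

# Formula interface: numeric response on the left, grouping variable on the right.
b <- boxplot(mpg ~ type, data = d,
             main = "mile per gallon by transmission type",
             horizontal = TRUE)

# boxplot() also returns the numbers it drew, one column per group:
# lower whisker, lower hinge (about Q1), median, upper hinge (about Q3), upper whisker.
b$stats
b$out     # any points drawn individually as outliers
```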
Note:
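(a) A boxplot summarizes the distribution based on the following statistics:
Median – Q2/50th Percentile
First quartile – Q1/25th Percentile
Third quartile – Q3/75th Percentile
Interquartile range (IQR) – the distance from the 25th to the 75th percentile (Q3 - Q1)
“Maximum” – Q3 + 1.5*IQR, the upper limit for the whisker
“Minimum” – Q1 - 1.5*IQR, the lower limit for the whisker
Points beyond these two limits are drawn individually as outliers.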
(Source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)
2. Histogram plot
The “breaks” argument tells R roughly how many bars to fit in the histogram. R treats a single
number as a suggestion and may adjust it to get round break points. If you want more bars,
increase the value of breaks. Now, plot a histogram for the variable
“qsec” in dataset “dat_new”, and set breaks = 20. What do you see in this histogram?
Now, plot a histogram to display the distribution of time required to travel 1/4 mile.
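To see how breaks changes the picture, the sketch below draws the same variable twice. It uses “mpg” from the built-in mtcars data purely as an illustration; the object names “h5” and “h10” are made up.

```r
# Same data, two different suggestions for the number of bars.
h5  <- hist(mtcars$mpg, breaks = 5,
            main = "mpg, breaks = 5", xlab = "Miles per gallon")
h10 <- hist(mtcars$mpg, breaks = 10,
            main = "mpg, breaks = 10", xlab = "Miles per gallon")

# hist() returns the break points and the count in each bar.
h5$breaks
h5$counts
length(h10$counts)   # more bars than with breaks = 5
```

Whatever the number of bars, the counts always add up to the number of observations.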
3. Scatter plot
The R code above shows the relationship between gross horsepower and the time required to
complete a 1/4 mile trip. What pattern do you see between these two variables? Does this pattern
make sense to you?
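A scatter plot of this kind can be sketched as follows, assuming the built-in mtcars data, where “hp” is gross horsepower and “qsec” is the 1/4 mile time. The axis labels and title are illustrative choices.

```r
plot(mtcars$hp, mtcars$qsec,
     xlab = "Gross horsepower",
     ylab = "1/4 mile time (seconds)",
     main = "Horsepower vs 1/4 mile time")

# A single number summarizing the direction and strength of the linear pattern.
r <- cor(mtcars$hp, mtcars$qsec)
r
```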
probplot(dat_new$mpg)
If your data are normally distributed, the data points should lie close to the reference line, which
is drawn in red. What do you observe in the following plot?
[Figure: example probability plots for light-tailed and heavy-tailed data. Source: SAS]
qqnorm() draws a normal Q-Q plot of your data, and qqline() adds the theoretical reference line
for a normal distribution. If your data are normally distributed, most of your data points should
lie close to this line.
(Note: A Q-Q plot compares the quantiles of a data distribution with the quantiles of a standardized
theoretical distribution from a specified family of distributions, whereas a P-P plot compares the
empirical cumulative distribution function of a data set with a specified theoretical cumulative
distribution function. So the y-axis of a Q-Q plot shows quantiles, while the y-axis of a P-P plot shows probabilities.)
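The two functions are normally used together, as in this sketch. It assumes the built-in mtcars data stands in for “dat_new”; the red line colour is an illustrative choice.

```r
x <- mtcars$mpg

# qqnorm() draws the points and returns their coordinates:
# $x holds the theoretical normal quantiles, $y the observed values.
qq <- qqnorm(x, main = "Normal Q-Q plot of mpg")

# qqline() adds the reference line through the first and third quartiles.
qqline(x, col = "red")
```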
2. Shapiro-Wilk’s test
The Shapiro-Wilk test addresses the same question as the K-S test: whether the data are consistent
with a normal distribution. It is often preferred because it has fewer restrictions than the K-S
test. Run the following code and examine the result.
shapiro.test(dat_new$mpg)
If the p-value from either of the tests above falls below 0.05, we reject H0 (that the data are
normally distributed) and conclude that the data are not normally distributed. Data transformation
may be required at this point.
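The decision rule can be written out in code. This is a sketch that assumes the built-in mtcars data; the names “res”, “alpha” and “decision” are made up for the example.

```r
res   <- shapiro.test(mtcars$mpg)   # H0: the data come from a normal distribution
alpha <- 0.05                       # significance level used in this lab

res$statistic   # the W statistic
res$p.value     # the p-value to compare with alpha

decision <- if (res$p.value < alpha) {
  "reject H0: data do not look normally distributed"
} else {
  "do not reject H0: no evidence against normality"
}
decision
```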
(Note: Normality tests are unreliable with small sample sizes. A data set with sample size = 15 could give
a test p-value above 0.05 even when it is not normally distributed. Therefore, it’s crucial to use both a
normality test and a plot to judge whether your data are normally distributed. Check out the following
links for more information about normality tests:
https://www.researchgate.net/post/Kolmogorov-Smirnov_test_or_Shapiro-Wilk_test_which_is_more_preferred_for_normality_of_data_according_to_sample_size
https://www.tandfonline.com/doi/pdf/10.1080/00949655.2010.520163)
1. Correlation matrix
A correlation matrix is a table showing correlation coefficients between sets of variables. The
following R program can produce a correlation analysis with descriptive statistics and correlation
matrix for four measures of association for the variables ‘mpg’, ‘drat’, ‘hp’, and ‘wt’ of ‘mtcars’ data
set.
(Note: the round() function is used to format the output. Read its help page to see what it
can do.)
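A minimal sketch of a Pearson correlation matrix for these four variables, assuming the built-in mtcars data. The helper name “vars” is made up; “corr.matrix” follows the name used in this task.

```r
vars <- mtcars[, c("mpg", "drat", "hp", "wt")]

summary(vars)                         # descriptive statistics for each variable

# cor() on a data frame returns the full matrix of pairwise correlations.
corr.matrix <- round(cor(vars), 2)    # round() keeps two decimal places
corr.matrix
```

The matrix is symmetric with 1s on the diagonal, so only the entries above (or below) the diagonal need to be read.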
Run the R program above and then comment on the relationships between pairs of variables.
By default, the Pearson correlation coefficients are computed. To change the method of
correlation computation, specify it with the ‘method=’ argument of cor().
Now, run the following R code to compute the Spearman correlation coefficients, and then
compare the results between corr.matrix and corr.matrix2.
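The Spearman version differs only in the method argument. The sketch below assumes the built-in mtcars data and recomputes the Pearson matrix locally (as “pearson”, a made-up name) so that the comparison is self-contained.

```r
vars <- mtcars[, c("mpg", "drat", "hp", "wt")]

# Spearman correlation is the Pearson correlation of the ranks.
corr.matrix2 <- round(cor(vars, method = "spearman"), 2)
corr.matrix2

# Element-wise difference from the Pearson coefficients.
pearson <- round(cor(vars, method = "pearson"), 2)
corr.matrix2 - pearson
```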
Run the R program and comment on the relationship between pairs of variables based on the
scatter plot matrix.
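One base-R way to draw a scatter plot matrix is pairs(), sketched here for the same four variables under the assumption that the built-in mtcars data is used. The program intended by this task may use a different function.

```r
vars <- mtcars[, c("mpg", "drat", "hp", "wt")]

# One panel for every pair of variables; the variable names appear on the diagonal.
pairs(vars, main = "Scatter plot matrix: mpg, drat, hp, wt")
```

Each panel in row i and column j plots variable i on the vertical axis against variable j on the horizontal axis.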
Run the program and find out which pairs of variables have a correlation that is not statistically
significant at the 5% level of significance.
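One way to obtain a p-value for every pair is to loop over the pairs with cor.test() from base R. This is a sketch under the assumption that the same four mtcars variables are tested; the matrix name “p.matrix” is made up, and the program intended by this task may use a different function.

```r
vars <- mtcars[, c("mpg", "drat", "hp", "wt")]
k <- ncol(vars)

# Empty matrix to hold the p-value for each pair of variables.
p.matrix <- matrix(NA_real_, k, k,
                   dimnames = list(names(vars), names(vars)))

for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    # H0 for each test: the true correlation between the two variables is 0.
    p <- cor.test(vars[[i]], vars[[j]])$p.value
    p.matrix[i, j] <- p
    p.matrix[j, i] <- p
  }
}

round(p.matrix, 4)

# Pairs whose correlation is not significant at the 5% level (if any).
which(p.matrix > 0.05, arr.ind = TRUE)
```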
In R, the CrossTable() function produces one-way frequency tables and two-way contingency
(cross-tabulation) tables. CrossTable() can work with both string (character) and numeric categorical
variables. To run CrossTable(), you need to install and load the R package “gmodels”.
library(gmodels)
dat2 <- read.csv("Arthritis.csv", header = TRUE)
CrossTable(dat2$Treatment)
CrossTable(dat2$Improved)
Run the program and study the output. What summary statistics are included in the one-way
table?
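The core of a one-way table, counts and proportions, can also be reproduced in base R. The sketch below uses a small made-up vector (hypothetical values, not the arthritis data) so that the numbers are easy to check by hand.

```r
# Hypothetical treatment labels for 8 patients (illustration only).
treatment <- c("Placebo", "Treated", "Treated", "Placebo",
               "Treated", "Placebo", "Treated", "Treated")

counts <- table(treatment)       # frequency of each category
props  <- prop.table(counts)     # share of each category in the total

cbind(count = counts, proportion = round(props, 3))
```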
2. Contingency table
(a) In addition to one-way frequency tables, the CrossTable() function can create cross-tabulation
tables in which the frequencies are determined for multiple variables at a time. Try the following R
program to produce a contingency table.
(b) In the contingency table, each cell of the table by default contains a number of summary
statistics. The following arguments can be used in the CrossTable() call to limit the
statistics in each cell.
Modify the program in (a) to display only the frequency in each cell.
(c) Instead of suppressing output, the following options can provide additional analysis in each cell
of the contingency table.
expected = TRUE – print the expected cell frequencies under the null hypothesis of
independence
prop.chisq = TRUE – print each cell's contribution to the total chi-squared statistic
Modify the program in (b) to add the expected cell frequencies and each cell’s contribution to
the total chi-squared statistic.
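What these two options report can be computed by hand in base R. The sketch below uses a made-up 2 x 2 table (hypothetical counts, not the arthritis data) and chisq.test() without continuity correction, so that the cell contributions add up exactly to the test statistic.

```r
# Hypothetical data: 10 patients per group (illustration only).
treatment <- rep(c("Placebo", "Treated"), each = 10)
improved  <- c(rep("None", 7), rep("Some", 3),    # Placebo group
               rep("None", 3), rep("Some", 7))    # Treated group

tab  <- table(treatment, improved)
test <- chisq.test(tab, correct = FALSE)   # H0: treatment and outcome are independent

# Expected count in each cell under independence: row total * column total / n.
test$expected

# Each cell's contribution to the chi-squared statistic: (observed - expected)^2 / expected.
contrib <- (tab - test$expected)^2 / test$expected
contrib

sum(contrib)      # equals the chi-squared statistic
test$statistic
test$p.value      # compare with 0.05 to decide whether to reject H0
```

A cell with a large contribution is one where the observed count is far from what independence would predict.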
Run the R program and complete the hypothesis test based on the output. Can the hypothesis be
rejected? What is the conclusion of the test?