Using R For Basic Statistical Analysis


BIA B350F – Using R for basic statistical analysis

Task 2-1 Import the data


1. Create a folder ‘B350F’ on the hard disk and then copy the following files to the folder.
- mtcars.csv – The data were extracted from the 1974 Motor Trend US magazine and comprise
fuel consumption and 10 aspects of automobile design and performance for 32 automobiles
(1973–74 models). The variables include:

Variable   Description
model      Automobile model
mpg        Miles per (US) gallon
cyl        Number of cylinders
disp       Displacement (cu.in.)
hp         Gross horsepower
drat       Rear axle ratio
wt         Weight (1000 lbs)
qsec       ¼ mile time (in seconds)
vs         Engine shape (0 = V-shaped, 1 = straight)
am         Transmission (0 = automatic, 1 = manual)
gear       Number of forward gears
carb       Number of carburetors

- arthritis.csv – Data from Koch & Edwards (1988) from a double-blind clinical trial investigating a
new treatment for rheumatoid arthritis. The data contain 84 observations for the following 5
variables:

Variable    Description
ID          Patient ID
treatment   Treatment (Placebo or treated)
sex         Sex (Female, Male)
age         Age of patient
improved    Treatment outcome (None, Some, Marked)

2. Start RStudio, change the working directory to ‘c:\B350F’, and then import the datasets
“mtcars” and “arthritis” from the csv files into R, assigning them to “dat” and “dat2”, respectively.

(Hint: Use read.csv() to import the files as explained in Task 1-8. Remember to set header = TRUE.)
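
The import step can be sketched as follows. This is only an illustration: because the course file may not be at hand, it first builds a stand-in mtcars.csv from R's built-in mtcars data in a temporary folder (an assumption about the file's layout, with "model" as the first column). In class you would point setwd() at the B350F folder instead.

```r
# Stand-in for the course file: build a CSV from R's built-in mtcars data
# (assumption: the course file has the same columns, with "model" first).
csv_dir  <- tempdir()
csv_path <- file.path(csv_dir, "mtcars.csv")
write.csv(data.frame(model = rownames(mtcars), mtcars, row.names = NULL),
          csv_path, row.names = FALSE)

# In class you would instead run: setwd("c:/B350F")
old_wd <- setwd(csv_dir)                      # change the working directory
dat <- read.csv("mtcars.csv", header = TRUE)  # first line holds the names
setwd(old_wd)                                 # restore the previous directory

str(dat)    # "model" is text; the other 11 columns are numeric
head(dat)   # first six rows
```

Note that R paths use forward slashes (or doubled backslashes) on Windows, so the folder is written as "c:/B350F" inside setwd().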

Task 2-2 Descriptive statistics


The summary() function computes descriptive statistics for every variable in a data set.
1. Run the following R program to see the default output of summary(). It assumes that you have
stored “mtcars” and “arthritis” as dat and dat2, respectively.
summary(dat)
summary(dat2)

2. For numeric variables, the default output of summary() only includes the minimum, 1st quartile,
median, mean, 3rd quartile and maximum. To obtain additional statistical measures, you need to run
other R code. However, this code does NOT work on non-numerical variables (i.e. factors and
characters). Therefore, you need to make sure your dataset has only numerical variables. Here are
the first few lines of the “mtcars” dataset.

model mpg cyl disp hp drat wt qsec vs am gear carb


1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

Note that the variable “model” is a character variable, so we need to remove it before computing
additional statistical measures. You can apply the technique from Task 1-11 to exclude “model” from
the dataset. Here is an example:

## remove "model" in our dataset


dat_new <- dat[, -1]

Or

dat_new <- dat[,names(dat)!="model"]
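
When a dataset has several non-numeric columns, they can also be dropped in one step. The following is a minimal sketch, not part of the original task; it rebuilds a data frame shaped like “dat” from R's built-in mtcars data (an assumption) and keeps only the numeric columns.

```r
# Rebuild a data frame shaped like the handout's "dat" from built-in mtcars
dat <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

# Keep only the columns for which is.numeric() returns TRUE
is_num  <- sapply(dat, is.numeric)
dat_new <- dat[, is_num]

names(dat)[!is_num]   # columns that were dropped
names(dat_new)        # columns that were kept
```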

The R package “e1071” has been pre-installed on your computer. Load this package in R and then
run the following R program. What statistical measures are given in the output? Which
variable has the largest variation?

library(e1071)   # provides skewness() and kurtosis()
apply(dat_new, 2, mean)
apply(dat_new, 2, sd)
apply(dat_new, 2, mode)   # note: mode() returns the storage mode ("numeric"), not the most frequent value
apply(dat_new, 2, range)
apply(dat_new, 2, skewness)
apply(dat_new, 2, kurtosis)

The apply() call tells R to compute a statistical measure for each variable in the dataset “dat_new”.
apply() is an R function for quickly applying an operation over the rows or columns of a matrix or
array (a data frame is first converted to a matrix). It is called as apply(X, MARGIN, FUN), where:
- X is the object you want to apply the function to.
- MARGIN specifies whether to apply by row (MARGIN = 1), by column (MARGIN = 2), or to
each element (MARGIN = 1:2). MARGIN can be greater than 2 if X has more than two
dimensions.
- FUN is the function you want to apply to the elements of X. It can be any R
function, including a User Defined Function (UDF).
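
As a small illustration of the margin argument and of a user-defined function, consider the sketch below. The matrix m and the helper stat_mode are made up for this example; stat_mode is a hypothetical helper that returns the most frequent value, which base R's mode() does not do.

```r
# A small 2 x 3 matrix to show how the margin works
m <- matrix(1:6, nrow = 2)   # columns are (1,2), (3,4), (5,6)

row_sums <- apply(m, 1, sum)   # margin 1: one result per row
col_sums <- apply(m, 2, sum)   # margin 2: one result per column

# User-defined function: most frequent value (the statistical mode).
# If several values tie, the smallest one is returned.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
cyl_gear_mode <- apply(mtcars[, c("cyl", "gear")], 2, stat_mode)

row_sums
col_sums
cyl_gear_mode
```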

3. Another method is to use the describe() function in the R package “psych”. Its main
advantage over apply() is that it displays the descriptive statistical measures all at once. Load the R
package “psych” and then run the following R code. After that, type “?describe” to open the help
file and read the “Value” section to see what statistical measures it produces.

library(psych)
describe(dat_new)


Note:
(a) Skewness is a measure of the asymmetry of the probability distribution of a real-valued random
variable about its mean. Skewness can be positive or negative. The data are left-skewed when the
skewness statistic is negative and right-skewed when it is positive. The closer the skewness
statistic is to 0, the more symmetric the data.

[Figure: two example distributions, left-skewed (skewness = -0.5370) and right-skewed (skewness = +0.5370)]
(b) Kurtosis measures the tendency of the data to be concentrated toward the center
or toward the tails of the distribution. A normal distribution has a kurtosis of 3. The kurtosis()
function in “e1071” subtracts 3 from this value, so it reports the excess kurtosis, which is 0 for a
normal distribution.

[Figure: example distributions with their kurtosis]
Uniform: kurtosis = 1.8, excess = −1.2 (light-tailed)
Normal: kurtosis = 3, excess = 0
Logistic: kurtosis = 4.2, excess = 1.2 (heavy-tailed)

Maximum and minimum kurtosis:
Discrete: equally likely values: kurtosis = 1, excess = −2
Student’s t (df=4): kurtosis = ∞, excess = ∞
(Source: https://brownmath.com/stat/shape.htm)
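
To make the two definitions concrete, the moment-based formulas can be written out in base R. This is a sketch only: shape_stats is a hypothetical helper, and packages such as “e1071” use slightly different sample formulas by default, so their results may not match these exactly.

```r
# Moment-based skewness and (excess) kurtosis, written out in base R
shape_stats <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m3 <- mean(d^3)   # third central moment
  m4 <- mean(d^4)   # fourth central moment
  c(skewness = m3 / m2^1.5,
    kurtosis = m4 / m2^2,
    excess   = m4 / m2^2 - 3)
}

shape_stats(c(0, 1))          # two equally likely values: kurtosis 1, excess -2
shape_stats(c(1, 2, 3, 10))   # long right tail: positive skewness
shape_stats(mtcars$mpg)
```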


4. By default, R keeps the variable names and values exactly as they appear in the csv / txt file. If
you print out “dat_new” from part 2, you can see that the variable “am” is coded as 0 and 1, which
is not meaningful on its own. To give it a meaningful label, you may use ifelse() to decode the
variable as follows:

dat_new$type <- ifelse(dat_new$am == 0, "Automatic", "Manual")


dat_new

Run the program and then study the output. You have just created a new variable in “dat_new” called
“type”. What changes do you see in the variable “type”?

[...] plot, regression plots, etc. It can create one or more plots and overlay them on a single set of
axes.
Try the following R programs to learn more about different types of statistical graphics and interpret the
output.

Task 2-3 Graphical analysis in R


1. Boxplots
boxplot(mpg ~ type, data = dat_new, main = "Miles per gallon by transmission type",
        outline = TRUE, horizontal = FALSE)

Let’s break down the R program above to understand how this function works.
- mpg ~ type. You need to provide a formula in y ~ x format. Here there is one y variable (mpg)
and one grouping variable x (type).

- data = dat_new. Indicates which dataset you are using.

- main = "Miles per gallon by transmission type". It is recommended to provide a title for
your plot. This is crucial if you make multiple boxplots at once.

- outline = TRUE and horizontal = FALSE. These are the defaults: R marks the outliers and draws
the boxplot vertically. You can generate a horizontal boxplot by setting horizontal = TRUE.
Note:
(a) A boxplot summarizes the distribution using the following statistics:
- Median – Q2 / 50th percentile
- First quartile – Q1 / 25th percentile
- Third quartile – Q3 / 75th percentile
- Interquartile range (IQR) – the distance from the 25th to the 75th percentile (Q3 - Q1)
- “Maximum” – Q3 + 1.5*IQR
- “Minimum” – Q1 - 1.5*IQR

The horizontal/vertical line that extends from the box toward the “minimum” and the “maximum”
is called a whisker. In R, each whisker ends at the most extreme data point that lies within these
limits. Any data point outside the whiskers is plotted as an outlier with a dot.

(Source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)
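
The 1.5*IQR rule can be worked through by hand. The sketch below uses a made-up vector x with one extreme value; it is an illustration of the rule rather than the exact calculation boxplot() performs.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up values with one extreme point

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1                            # same as IQR(x)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers    <- x[x < lower_fence | x > upper_fence]

c(q1 = q1, q3 = q3, iqr = iqr, lower = lower_fence, upper = upper_fence)
outliers

# boxplot() itself relies on boxplot.stats(), which uses hinges rather than
# quantile(), so its quartiles can differ slightly on small samples.
boxplot.stats(x)$out
```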


(b) The boxplot of a nearly normal distribution, compared with the probability density function
(pdf) of a normal distribution:

[Figure: boxplot aligned with the normal pdf]

(Source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)

2. Histogram plot

hist(dat_new$mpg, breaks = 10, main = "histogram plot of mile per gallon")

The “breaks” argument tells R roughly how many bars to fit in the histogram. R treats a single
number as a suggestion and may choose a nearby number that gives round bin boundaries. If you
want more bars, increase the value of breaks. Now, plot a histogram of the variable “qsec” (the time
required to travel 1/4 mile) in dataset “dat_new” and set breaks = 20. What do you see in this
histogram?
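
To see which bins R actually uses, the histogram object can be inspected without drawing it. This sketch uses the built-in mtcars data in place of “dat_new” (an assumption), and the bin width of 5 in the second call is chosen purely for illustration.

```r
# plot = FALSE returns the bin information without drawing anything
h <- hist(mtcars$mpg, breaks = 10, plot = FALSE)

h$breaks           # bin boundaries that R actually chose
h$counts           # number of cars in each bin
length(h$counts)   # number of bars; need not equal the requested 10

# For exact control, pass the boundaries yourself (here: bins of width 5)
h5 <- hist(mtcars$mpg, breaks = seq(10, 35, by = 5), plot = FALSE)
h5$counts
```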

3. Scatter plot

plot(dat_new$hp, dat_new$qsec, type = "p", pch = 16,
     main = "scatterplot between horsepower and 1/4 mile time")

The R code above shows the relationship between gross horsepower and the time required to
complete a 1/4 mile trip. What pattern do you see between these two variables? Does this pattern
make sense to you?

4. Probplot (p-p plot)

Regression analysis assumes, among other things, that the errors are normally distributed. As a
first check, we often look at whether the dependent variable itself is approximately normal. You can
use probplot() from the R package “e1071” to visualize the normality of your data. Type the
following R code and view the plot.

probplot(dat_new$mpg)

If your data are normally distributed, the data points should line up along the reference line, which
is drawn in red. What do you observe in the plot?


(Note: A normal probability plot can be interpreted as below.)

Vertical axis – ordered response values
Horizontal axis – normal order statistic medians

[Figure: example normal probability plots for approximately normal, left-skewed, right-skewed,
light-tailed and heavy-tailed data]
(Source: SAS)


5. qqnorm and qqline


Another way to examine normality is to draw a Q-Q plot. Run the following R code and compare
the output against the plot from probplot() above. What are the similarities and differences between
them?
qqnorm(dat_new$mpg)
qqline(dat_new$mpg, col = 4, lwd = 2)

qqnorm() plots the quantiles of your data against the theoretical quantiles of a normal distribution,
and qqline() adds a reference line, which by default passes through the first and third quartiles. If
your data are normally distributed, most of the data points should lie close to this line.

(Note: A Q-Q plot compares the quantiles of a data distribution with the quantiles of a standardized
theoretical distribution from a specified family of distributions, whereas a P-P plot compares the
empirical cumulative distribution function of a data set with a specified theoretical cumulative
distribution function. So the y-axis of a Q-Q plot shows quantiles, while the y-axis of a P-P plot
shows probabilities.)
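
The difference between the two plots can be seen by computing their coordinates directly. The sketch below uses the built-in mtcars data (an assumption) and base R only; the variable names are made up for this example.

```r
x <- mtcars$mpg
n <- length(x)

# Q-Q plot coordinates: theoretical normal quantiles vs. sorted data
qq_theoretical <- qnorm(ppoints(n))   # standard normal quantiles
qq_sample      <- sort(x)             # sample quantiles

# P-P plot coordinates: both axes are cumulative probabilities
pp_theoretical <- pnorm(sort(x), mean = mean(x), sd = sd(x))
pp_empirical   <- ppoints(n)

# qqnorm() computes the same Q-Q coordinates
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), qq_theoretical)
```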

Task 2-4 Tests for normality


Sometimes Q-Q plots are too vague to tell whether our data are normally distributed. We therefore
introduce two tests for normality: the Kolmogorov-Smirnov (K-S) normality test and the
Shapiro-Wilk test.

1. Kolmogorov-Smirnov normality test (K-S test).

The K-S test computes a test statistic that measures how far the data depart from a normal
distribution. This test has several limitations; for example, its p-value is only approximate when the
mean and standard deviation are estimated from the same data, as they are here. Run the following
code and examine the result. Note that you need to provide the mean and standard deviation of
your data in order to complete the test.

ks.test(dat_new$mpg, "pnorm", mean(dat_new$mpg), sd(dat_new$mpg))

2. Shapiro-Wilk test
The Shapiro-Wilk test examines the same null hypothesis as the K-S test. It is generally preferred
because it has fewer restrictions than the K-S test. Run the following code and examine the
result.

shapiro.test(dat_new$mpg)


Both tests have the following hypotheses:

H0: the data are normally distributed
H1: the data are NOT normally distributed

If the p-value from either test falls below 0.05, we reject H0 and conclude that the data are not
normally distributed. Data transformation may be required at this point.
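
The decision rule can be applied in code by extracting the p-values from the test objects. This is a sketch using the built-in mtcars data (an assumption); the 0.05 threshold is the one stated above.

```r
x     <- mtcars$mpg
alpha <- 0.05

# mpg contains tied values, so ks.test() warns that the p-value is
# approximate; the warning is silenced here to keep the output short.
ks <- suppressWarnings(ks.test(x, "pnorm", mean(x), sd(x)))
sw <- shapiro.test(x)

p_values <- c(ks = ks$p.value, shapiro = sw$p.value)
p_values
ifelse(p_values < alpha, "reject H0", "do not reject H0")
```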

(Note: Normality tests have little power when the sample size is small. A non-normal data set with
sample size = 15 could still give a p-value above 0.05. Therefore, it is crucial to use both a
normality test and a plot to judge whether your data are normally distributed. Check out the
following links for more information about normality tests:
- https://www.researchgate.net/post/Kolmogorov-Smirnov_test_or_Shapiro-Wilk_test_which_is_more_preferred_for_normality_of_data_according_to_sample_size
- https://www.tandfonline.com/doi/pdf/10.1080/00949655.2010.520163)

Task 2-5 Correlation analysis using cor()


The cor() function can be used to perform correlation analysis to evaluate the strength of the
relationship between two continuous variables. It offers three correlation measures: the Pearson
correlation coefficient and two nonparametric measures of association, namely the Spearman
rank-order correlation and Kendall’s tau-b coefficient.

1. Correlation matrix
A correlation matrix is a table showing the correlation coefficients between sets of variables. The
following R program computes the correlation matrix for the variables ‘mpg’, ‘disp’, ‘hp’, ‘drat’
and ‘wt’ of the ‘mtcars’ data set.

dat <- read.csv("mtcars.csv", header = TRUE)

## extract numerical variables in our dataset
## (columns 2, 4, 5, 6, 7 are mpg, disp, hp, drat, wt)
dat_new <- dat[, c(2,4,5,6,7)]
corr.matrix <- round(cor(dat_new), 3)
corr.matrix

(Note: the round() function is used to format the output. Please read its help page to see what it
can do.)

Run the R program above and then comment on the relationships between pairs of variables.

By default, the Pearson correlation coefficients are computed. To change the method of
correlation computation, specify it with the ‘method =’ argument of cor().
Now, run the following R code to compute the Spearman correlation coefficients, and then
compare the results between corr.matrix and corr.matrix2.

corr.matrix2 <- round(cor(dat_new, method = "spearman"), 3)
corr.matrix2
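
The link between the two methods can be checked directly: Spearman's coefficient is Pearson's coefficient computed on the ranks of the data. The sketch below uses the built-in mtcars data (an assumption) and selects the columns by name.

```r
vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]

pearson  <- round(cor(vars), 3)                        # default method
spearman <- round(cor(vars, method = "spearman"), 3)

pearson
spearman

# Spearman's coefficient is Pearson's coefficient computed on the ranks
all.equal(cor(vars$mpg, vars$wt, method = "spearman"),
          cor(rank(vars$mpg), rank(vars$wt)))
```

Selecting columns by name rather than by position keeps the code correct even if the column order of the file changes.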


2. Graphical correlation analysis


(a) The corrplot() function helps us visualize the correlations between numerical variables. Using
the “mtcars” data above, we can reproduce the correlation matrix in graphical form. To use
corrplot(), you need to install and load the R package “corrplot”.

library(corrplot)
corrplot(corr.matrix, method = "color", addCoef.col = "white", type = "upper")

Run the R program and comment on the relationship between pairs of variables based on the
correlation plot.

(b) Using pairs() for scatter plots for pairs of variables


The pairs() function displays a scatter plot for each pair of variables. You can add a main title that
helps readers understand your plot.

pairs(dat_new, pch = 16,
      main = "Scatter plots between numerical variables in 'mtcars'")

3. Variance and covariance matrices


In R, var() computes the variance of a single variable, and cov() computes the covariance matrix of
a set of variables. When var() is given a data frame, it returns the same covariance matrix as cov(),
with the variances on the diagonal. By default, cov() uses the Pearson method. Run the following R
program and compare the covariance matrix with the correlation matrix.

### compute the variances (diagonal) and covariances
var(dat_new)

### compute the covariance matrix
cov(dat_new)
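
How the covariance and correlation matrices relate can be verified in a few lines. This sketch uses the built-in mtcars data (an assumption) and base R functions only.

```r
vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]

cov_mat <- cov(vars)
all.equal(var(vars), cov_mat)   # var() on a data frame gives the same matrix

# The diagonal of the covariance matrix holds the variances
all.equal(unname(diag(cov_mat)), unname(sapply(vars, var)))

# Rescaling the covariance matrix by the standard deviations gives cor()
all.equal(cov2cor(cov_mat), cor(vars))
```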

4. Pearson correlation test


Use cor.test() to perform the Pearson correlation test in R.

cor.test(x = dat_new$mpg, y = dat_new$drat)

This test evaluates the following hypotheses, where ρ is the population correlation coefficient:

H0: ρ = 0
H1: ρ ≠ 0
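
The pieces of the test output can be extracted from the object that cor.test() returns. The sketch below uses the built-in mtcars data (an assumption) and also reproduces the t statistic from the sample correlation r and the sample size n.

```r
ct <- cor.test(x = mtcars$mpg, y = mtcars$drat)   # Pearson by default

ct$estimate    # sample correlation r
ct$statistic   # t statistic with n - 2 degrees of freedom
ct$p.value     # two-sided p-value for H0: rho = 0
ct$conf.int    # 95% confidence interval for rho

# The t statistic can be reproduced from r and n
r <- unname(ct$estimate)
n <- nrow(mtcars)
r * sqrt((n - 2) / (1 - r^2))
```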
5. Pairwise Pearson’s correlation test
Instead of performing the correlation test separately for each pair of variables, you can use the
function rcor.test() from the R package “ltm”, which performs the Pearson correlation test for every
pair of variables in the given data frame:

### perform pairwise correlation test
library(ltm)
rcor.test(dat_new)

Run the program and find out which pairs of variables have a correlation that is statistically
insignificant at the 5% level of significance.
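
If the “ltm” package is not available, a similar table of p-values can be assembled with cor.test() in base R. This is an alternative sketch, not the method used by rcor.test(); it uses the built-in mtcars data (an assumption), and p_mat is a name made up for this example.

```r
vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]
k    <- ncol(vars)

# Matrix of two-sided p-values; the diagonal is left as NA
p_mat <- matrix(NA_real_, k, k, dimnames = list(names(vars), names(vars)))
for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    p <- cor.test(vars[[i]], vars[[j]])$p.value
    p_mat[i, j] <- p
    p_mat[j, i] <- p
  }
}

round(p_mat, 4)
which(p_mat > 0.05, arr.ind = TRUE)   # pairs not significant at the 5% level
```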


Self-study Task 2-6: Analyzing categorical data using CrossTable()

In R, the CrossTable() function produces one-way frequency tables and two-way contingency
(cross-tabulation) tables. CrossTable() can work with both string (character) and numeric
categorical variables. To run CrossTable(), you need to install and load the R package “gmodels”.

1. One-way frequency table

(a) When it is given a single variable, CrossTable() creates a one-way frequency table for that
variable:

library(gmodels)
dat2 <- read.csv("arthritis.csv", header = TRUE)
## column names are case-sensitive; run names(dat2) to check how they are spelled in your file
CrossTable(dat2$Treatment)
CrossTable(dat2$Improved)

Run the program and study the output. What summary statistics are included in the one-way
table?
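
The counts and proportions in a one-way table can also be obtained with base R. The sketch below is an illustration only: it uses the transmission variable of the built-in mtcars data instead of the arthritis file (an assumption, since that file may not be at hand), and table() and prop.table() rather than CrossTable().

```r
am_label <- ifelse(mtcars$am == 0, "Automatic", "Manual")

counts <- table(am_label)      # frequency of each category
props  <- prop.table(counts)   # share of the total

counts
round(props, 3)
```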

2. Contingency table
(a) In addition to one-way frequency tables, CrossTable() can create cross-tabulation tables, in
which the frequencies are counted for two variables at a time. Try the following R
program to produce a contingency table.

CrossTable(dat2$Treatment, dat2$Improved, prop.r = TRUE, prop.c = TRUE,
           prop.t = TRUE)

What summary statistics are included in this contingency table?

(b) By default, each cell of the contingency table contains a number of summary
statistics. The following arguments of CrossTable() can be used to limit the
statistics in each cell.

- prop.r = FALSE – suppress the printing of the row percentages
- prop.c = FALSE – suppress the printing of the column percentages, and
- prop.t = FALSE – suppress the printing of the (joint) cell percentages

Modify the program in (a) to display only the frequency in each cell.

(c) Instead of suppressing output, the following options provide additional analysis in each cell
of the contingency table.
- expected = TRUE – print the expected cell frequencies under the null hypothesis of
independence
- prop.chisq = TRUE – print each cell's contribution to the total chi-squared statistic

Modify the program in (b) to add the expected cell frequencies and each cell’s contribution to
the total chi-squared statistic.
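
The expected frequencies and the cell contributions can be reproduced by hand, which helps when reading the CrossTable() output. This sketch uses two variables of the built-in mtcars data (an assumption) in place of the arthritis data.

```r
tab <- table(am = mtcars$am, cyl = mtcars$cyl)   # observed frequencies

n        <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n   # under independence
contrib  <- (tab - expected)^2 / expected           # per-cell chi-square part

addmargins(tab)
round(expected, 2)
round(contrib, 3)
sum(contrib)   # total Pearson chi-squared statistic
```

Each expected frequency is (row total * column total) / grand total, and the contributions add up to the chi-squared statistic.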


3. Test statistic for variable independence


You can obtain test statistics for variable independence using CrossTable(). For a two-way table,
CrossTable() can perform three different tests: the Pearson chi-square test, Fisher’s exact test, and
the McNemar test. The following program asks R to conduct the chi-square test.

CrossTable(dat2$Treatment, dat2$Improved, expected = TRUE, chisq = TRUE)

The null and alternative hypotheses tested in the program are:

H0: Treatment and improved are independent (there is no relationship between them).
H1: There is a relationship between treatment and improved.

Run the R program and complete the hypothesis test based on the output. Can the null hypothesis
be rejected? What is the conclusion of the test?
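
The same chi-square test is available in base R through chisq.test(), which makes the decision step easy to script. The sketch below again uses two variables of the built-in mtcars data (an assumption) rather than the arthritis data.

```r
tab <- table(am = mtcars$am, cyl = mtcars$cyl)

# Some expected counts are below 5, so chisq.test() warns that the
# approximation may be poor; the warning is silenced for brevity.
res <- suppressWarnings(chisq.test(tab))

res$statistic   # Pearson chi-squared
res$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value

decision <- if (res$p.value < 0.05) {
  "Reject H0: the two variables appear to be related"
} else {
  "Do not reject H0: no evidence against independence"
}
decision
```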
