Statistical Computing by Using R
Examples from the textbook of the Biometrics Lab (2011), Department of Life Science, Tunghai University.
Chen-Pan Liao
PhD student, Department of Life Science, Tunghai University, Taiwan; e-mail: andrew.43@gmail.com
March 10, 2013
License
Statistical Computing by Using R by Chen-Pan Liao is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License (http://creativecommons.org/licenses/by-sa/3.0/). You are free to share or remix this article under several conditions; read the license to learn more.
Contents
1 Introduction
  1.1 How to apply the example in R?
2 Descriptive statistics
  2.1 Normal distribution and z-value
  2.2 Normality, skewness and kurtosis
3 One-/two-sample test
  3.1 One-sample t-test
  3.2 Two-sample t-test, F-test or Bartlett's test
  3.3 Paired t-test
  3.4 Two-sample Mann-Whitney U test
4 One-factor experimental design
  4.1 Balanced/unbalanced one-way ANOVA
  4.2 Kruskal-Wallis test
5 Two-way ANOVA & experimental designs
  5.1 Completely randomized design (CRD)
  5.2 Randomized complete block design (RCBD)
  5.3 RCBD (unbalanced design)
  5.4 Latin square design
  5.5 Nested design
6 Regression & correlation
  6.1 Simple linear regression
  6.2 Replicated simple linear regression
  6.3 Simple linear correlation
  6.4 Multiple regression
7 Count data
  7.1 Goodness of fit test
  7.2 Test of independence
  7.3 Test of homogeneity
A Free resources for learning R
1 Introduction
R is a free computer language and environment for statistical computing and graphics. Anyone can download and install R. This article demonstrates several common statistical tests computed in R. Most of the questions given in these demos are from the textbook of the Biometrics Lab (2011), Department of Life Science, Tunghai University.
1.1 How to apply the example in R?
All of the R code in this article was tested under R version 2.13. In general, a reader can simply copy and paste the R code to reproduce the results of the statistical computations, with two kinds of exceptions.

The first case happens when the R code tries to load a package by calling library(package_name), where package_name is the name of the package to be loaded. If you have not installed that package in your R system, R returns an error indicating that the package is unavailable. The remedy is to install the package before loading it:

install.packages("package_name")
library(package_name)

It is quite usual to read data from a plain text file (usually a CSV file) by calling read.csv("filename.csv"). The second case is that R cannot find or load that data file, meaning the file may not exist or you may have specified an incorrect path. You should therefore place the CSV file somewhere and load it by giving its full path and filename (including the filename extension), e.g., D:\somewhere\mydata.csv. Another common technique is to change the working directory to the one containing the data file by calling setwd("dir")
where dir is the path of the directory that contains the CSV file. For example, if a data file called data.csv is located in D:/your/own/path/, the R code to load the file could be

read.csv("D:/your/own/path/data.csv")

to give the full path, or

setwd("D:/your/own/path")
read.csv("data.csv")

to change the working directory and then read the file.
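As a small convenience (the helper below is hypothetical and not part of the original examples), the install-then-load pattern above can be wrapped in one function; require() returns FALSE instead of raising an error when a package is absent:

```r
## Hypothetical helper: load a package, installing it from CRAN first
## if it is not yet available in the R system.
load_or_install <- function(pkg) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}

load_or_install("stats")  # "stats" ships with R, so nothing is installed here
```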
2 Descriptive statistics

2.1 Normal distribution and z-value
The mean and standard deviation of the intelligence quotient (a.k.a. IQ) are 100 and 15, respectively. Find 1) the z-value if someone's IQ = 147, 2) the p-value Pr(IQ > 147), and 3) the value IQy such that Pr(IQ > IQy) = 0.01.

(147-100)/15             # (1) returns 3.133333
1 - pnorm(147, 100, 15)  # (2) returns 0.0008641652
qnorm(1-0.01, 100, 15)   # (3) returns 134.8952
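Standardizing first and then using the default standard normal distribution gives exactly the same tail probability as calling pnorm() with the mean and standard deviation directly; a quick base-R sanity check:

```r
z <- (147 - 100) / 15                         # z-value, about 3.1333
p.direct   <- pnorm(147, mean = 100, sd = 15, lower.tail = FALSE)
p.standard <- pnorm(z, lower.tail = FALSE)    # same tail area via the z-value
stopifnot(all.equal(p.direct, p.standard))

## qnorm() inverts pnorm(): the 99th percentile maps back to probability 0.99
stopifnot(all.equal(pnorm(qnorm(0.99, 100, 15), 100, 15), 0.99))
```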
2.2 Normality, skewness and kurtosis
The following table shows ranges of IQ and the number of persons in each range. Test the normality of the IQ distribution and find its skewness and kurtosis.

IQ range   Frequency
160-169    3
150-159    16
140-149    55
130-139    120
120-129    330
110-119    610
100-109    719
90-99      592
80-89      338
70-79      130
60-69      48
## input frequency & range
iq <- seq(165, 65, -10)
freq <- c(3, 16, 55, 120, 330, 610, 719, 592, 338, 130, 48)
## build raw data
for (i in 1:length(iq)) {
  yy <- rep(iq[i], freq[i])
  if (i==1) {y <- yy} else {y <- c(y, yy)}
}
## several estimations
length(y)   # number of values
mean(y)     # mean
median(y)   # median
library(modeest); mlv(y, method = "mfv")  # mode
names(sort(-table(y)))[1]                 # mode; not always correct
min(y)      # minimal value
max(y)      # maximal value
range(y)    # range
summary(y)  # combination of estimations
var(y)      # sample variance
sd(y)       # sample standard deviation
sd(y)/length(y)^0.5  # standard error of the sample mean
## skewness & kurtosis
library(psych); skew(y); kurtosi(y)
## normality
library(nortest); lillie.test(y)  # Lilliefors test
shapiro.test(y)                   # Shapiro-Wilk test
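As an aside (not in the original text), the for-loop that expands the frequency table into raw data can be replaced by a single vectorized rep() call, which repeats each class midpoint by its frequency:

```r
iq   <- seq(165, 65, -10)
freq <- c(3, 16, 55, 120, 330, 610, 719, 592, 338, 130, 48)
y <- rep(iq, times = freq)  # repeat each midpoint freq[i] times
length(y)                   # 2961 observations, same as the loop version
```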
3 One-/two-sample test

3.1 One-sample t-test
Test H0: the mean of Y is equal to 24.3. Y = {25.8, 24.6, 26.1, 22.9, 25.1, 27.3, 24.0, 24.5, 23.9, 26.2, 24.3, 24.6, 23.3, 25.5, 28.1, 24.8, 23.5, 26.3, 25.4, 25.5, 23.9, 27.0, 24.8, 22.9, 25.4}.

## input data
y <- c(25.8, 24.6, 26.1, 22.9, 25.1, 27.3, 24.0, 24.5, 23.9, 26.2,
       24.3, 24.6, 23.3, 25.5, 28.1, 24.8, 23.5, 26.3, 25.4, 25.5,
       23.9, 27.0, 24.8, 22.9, 25.4)
## normality test
shapiro.test(y)
## t-test
t.test(y, mu=24.3)                   # for HA: mu != 24.3
t.test(y, mu=24.3, alternative="g")  # for HA: mu > 24.3
t.test(y, mu=24.3, alternative="l")  # for HA: mu < 24.3
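The statistic reported by t.test() can be verified by hand as t = (mean(y) - mu0) / (sd(y) / sqrt(n)), with the two-sided p-value from the t distribution on n - 1 degrees of freedom:

```r
y <- c(25.8, 24.6, 26.1, 22.9, 25.1, 27.3, 24.0, 24.5, 23.9, 26.2,
       24.3, 24.6, 23.3, 25.5, 28.1, 24.8, 23.5, 26.3, 25.4, 25.5,
       23.9, 27.0, 24.8, 22.9, 25.4)
## manual one-sample t statistic and two-sided p-value
t.manual <- (mean(y) - 24.3) / (sd(y) / sqrt(length(y)))
p.manual <- 2 * pt(abs(t.manual), df = length(y) - 1, lower.tail = FALSE)
## agrees with t.test() to machine precision
stopifnot(all.equal(t.manual, unname(t.test(y, mu = 24.3)$statistic)))
```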
3.2 Two-sample t-test, F-test or Bartlett's test
Test H0: the means of Y1 and Y2 are equal. Y1 = {8.8, 8.4, 7.9, 8.7, 9.1, 9.6}; Y2 = {9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5}.

## input data
y.1 <- c(8.8, 8.4, 7.9, 8.7, 9.1, 9.6)
y.2 <- c(9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5)
## normality test
shapiro.test(y.1)
shapiro.test(y.2)
## F-test
var.test(y.1, y.2)
## Bartlett's test
y <- c(y.1, y.2)
group <- c(rep("a", length(y.1)), rep("b", length(y.2)))
bartlett.test(y ~ group)
## t-test
t.test(y.1, y.2, var.equal=T)  # if equal variances
t.test(y.1, y.2, var.equal=F)  # if unequal variances
3.3 Paired t-test
Test H0: the means of Y1i and Y2i are equal. Note that the observations are paired. Y1i = {142, 140, 144, 144, 142, 146, 149, 150, 142, 148}; Y2i = {138, 136, 147, 139, 143, 141, 143, 145, 136, 146}.

## input data
y.1 <- c(142, 140, 144, 144, 142, 146, 149, 150, 142, 148)
y.2 <- c(138, 136, 147, 139, 143, 141, 143, 145, 136, 146)
## normality test
shapiro.test(y.1 - y.2)
## t-test
t.test(y.1, y.2, paired=T)                   # HA: mu.1 != mu.2
t.test(y.1, y.2, paired=T, alternative="g")  # HA: mu.1 > mu.2
t.test(y.1, y.2, paired=T, alternative="l")  # HA: mu.1 < mu.2
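A paired t-test is mathematically identical to a one-sample t-test applied to the within-pair differences, which is also why the normality check above is done on y.1 - y.2; a quick base-R check:

```r
y.1 <- c(142, 140, 144, 144, 142, 146, 149, 150, 142, 148)
y.2 <- c(138, 136, 147, 139, 143, 141, 143, 145, 136, 146)
paired  <- t.test(y.1, y.2, paired = TRUE)
on.diff <- t.test(y.1 - y.2, mu = 0)   # one-sample test on the differences
## identical statistic and p-value
stopifnot(all.equal(unname(paired$statistic), unname(on.diff$statistic)))
```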
3.4 Two-sample Mann-Whitney U test
Test H0: the distributions of Y1 and Y2 are equal, using the Mann-Whitney U test. Y1 = {3, 3, 3, 6, 10, 10, 13.5, 13.5, 16.5, 16.5, 19.5}; Y2 = {3, 3, 7.5, 7.5, 10, 12, 16.5, 16.5, 19.5, 22.5, 22.5, 22.5, 22.5, 25}.

## input data
y.1 <- c(3, 3, 3, 6, 10, 10, 13.5, 13.5, 16.5, 16.5, 19.5)
y.2 <- c(3, 3, 7.5, 7.5, 10, 12, 16.5, 16.5, 19.5,
         22.5, 22.5, 22.5, 22.5, 25)
## Mann-Whitney U test
wilcox.test(y.1, y.2)                   # HA: y.1 != y.2
wilcox.test(y.1, y.2, alternative="g")  # HA: y.1 > y.2
wilcox.test(y.1, y.2, alternative="l")  # HA: y.1 < y.2
4 One-factor experimental design

4.1 Balanced/unbalanced one-way ANOVA
Test H0: the means of Y1, Y2, Y3 and Y4 are all equal, where Y1 = {60.8, 57, 65, 58.6, 61.7}, Y2 = {68.7, 67.7, 74, 66.3, 69.8}, Y3 = {102.6, 102.1, 100.2, 96.5}, and Y4 = {87.9, 84.2, 83.1, 85.7, 90.3}.

## input data
weight <- c(60.8, 57.0, 65.0, 58.6, 61.7,
            68.7, 67.7, 74.0, 66.3, 69.8,
            102.6, 102.1, 100.2, 96.5,
            87.9, 84.2, 83.1, 85.7, 90.3)
groups <- as.factor(c(rep(1,5), rep(2,5), rep(3,4), rep(4,5)))
## description
tapply(weight, groups, mean)  # mean of each group
tapply(weight, groups, sd)    # sd of each group
boxplot(weight ~ groups)      # box plot
## normality tests for each group
tapply(weight, groups, shapiro.test)
## test equal variances
bartlett.test(weight ~ groups)
## ANOVA
m <- aov(weight ~ groups); summary(m)
## post-hoc
TukeyHSD(m)                                         # Tukey's
library(laercio); LDuncan(m, "groups")              # Duncan's
pairwise.t.test(weight, groups, p.adj="none")       # LSD
library(agricolae); LSD.test(m, "groups", group=F)  # LSD
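As a cross-check (not in the original text), base R's oneway.test() with var.equal=TRUE reproduces the F statistic of the one-way ANOVA fitted by aov():

```r
weight <- c(60.8, 57.0, 65.0, 58.6, 61.7,
            68.7, 67.7, 74.0, 66.3, 69.8,
            102.6, 102.1, 100.2, 96.5,
            87.9, 84.2, 83.1, 85.7, 90.3)
groups <- as.factor(c(rep(1, 5), rep(2, 5), rep(3, 4), rep(4, 5)))
## F statistic from the aov() table and from oneway.test()
f.aov <- summary(aov(weight ~ groups))[[1]][["F value"]][1]
f.one <- oneway.test(weight ~ groups, var.equal = TRUE)$statistic
stopifnot(all.equal(f.aov, unname(f.one)))
```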
4.2 Kruskal-Wallis test
Test H0: the medians of Y1, Y2, Y3 and Y4 are all equal, where Y1 = {2, 2, 3.5, 3.5, 8, 10, 10, 17}, Y2 = {6, 10, 13.5, 13.5, 20, 23.5, 23.5, 26}, Y3 = {13.5, 16, 18, 20, 23.5, 26, 28}, and Y4 = {6, 6, 13.5, 22, 26, 29, 30, 31}.

## input data
y <- c(2, 2, 3.5, 3.5, 8, 10, 10, 17,
       6, 10, 13.5, 13.5, 20, 23.5, 23.5, 26,
       13.5, 16, 18, 20, 23.5, 26, 28,
       6, 6, 13.5, 22, 26, 29, 30, 31)
groups <- as.factor(c(rep(1,8), rep(2,8), rep(3,7), rep(4,8)))
## Kruskal-Wallis test
kruskal.test(y ~ groups)
## post-hoc
library(pgirmess); kruskalmc(y, groups)
5 Two-way ANOVA & experimental designs

5.1 Completely randomized design (CRD)
Determine the fixed effects of hormone (horm), sex, and their interaction on calcium concentration (plconc).

## load data
rawdata <- read.csv("data.csv")
## model I ANOVA
mod <- aov(plconc ~ factor(horm) * factor(sex), data=rawdata)
summary(mod)        # type I
library(car)
Anova(mod, type=2)  # type II
Anova(mod, type=3)  # type III
5.2 Randomized complete block design (RCBD)
Determine the fixed effect of 4 different diets (diet) on pigs' weight (wt). Five different places where the pigs were fed are set as blocks (block).

## load data
rawdata <- read.csv("data.csv")
## model I ANOVA
mod <- aov(wt ~ factor(diet) + factor(block), data=rawdata)
summary(mod)
## post-hoc
TukeyHSD(mod)                                            # Tukey's
library(laercio); LDuncan(mod, "diet")                   # Duncan's
pairwise.t.test(rawdata$wt, rawdata$diet, p.adj="none")  # LSD
5.3 RCBD (unbalanced design)
Determine the effect of treat on variable obs with a block block.

## load data
rawdata <- read.csv("data.csv")
## model I ANOVA
mod <- aov(obs ~ treat + factor(block), data=rawdata)
summary(mod)                      # type I
library(car); Anova(mod, type=3)  # type III
## post-hoc
TukeyHSD(mod)                                              # Tukey's
library(laercio); LDuncan(mod, "treat")                    # Duncan's
pairwise.t.test(rawdata$obs, rawdata$treat, p.adj="none")  # LSD
library(agricolae); LSD.test(mod, "treat", group=F)        # LSD
5.4 Latin square design
Determine the effects of factor row, factor col, and factor treat on variable obs, where the factors are arranged as a Latin square.

## load data
rawdata <- read.csv("data.csv")
## model I ANOVA
mod <- aov(obs ~ treat + factor(row) + factor(col), data=rawdata)
summary(mod)                      # type I (fixed model)
library(car); Anova(mod, type=3)  # type III
## post-hoc
TukeyHSD(mod)                            # Tukey's
library(laercio); LDuncan(mod, "treat")  # Duncan's
## post-hoc (LSD)
library(agricolae); LSD.test(mod, "treat", group=F)
5.5 Nested design
data.csv
A , B , obs
1 , 1 , 11
1 , 1 , 9
1 , 1 , 10
1 , 2 , 8
1 , 2 , 7
1 , 2 , 6
1 , 3 , 8
1 , 3 , 10
1 , 3 , 11
1 , 4 , 11
1 , 4 , 14
1 , 4 , 10
2 , 1 , 11
2 , 1 , 8
2 , 1 , 7
2 , 2 , 10
2 , 2 , 14
2 , 2 , 12
2 , 3 , 9
2 , 3 , 10
2 , 3 , 8
2 , 4 , 10
2 , 4 , 13
2 , 4 , 12
3 , 1 , 12
3 , 1 , 14
3 , 1 , 10
3 , 2 , 8
3 , 2 , 10
3 , 2 , 12
3 , 3 , 11
3 , 3 , 9
3 , 3 , 12
3 , 4 , 13
3 , 4 , 12
3 , 4 , 11
Determine the effects of factor A and factor B on variable obs, where factor B is nested within factor A.

## load data
rawdata <- read.csv("data.csv")
## fixed B
mod.1 <- aov(obs ~ factor(A) + factor(B) %in% factor(A), data=rawdata)
mod.1 <- aov(obs ~ factor(A) / factor(B), data=rawdata)  # alternative
summary(mod.1)
## random B (usual case)
mod.2 <- aov(obs ~ factor(A) + Error(factor(A)/factor(B)), data=rawdata)
summary(mod.2)
ms.A   <- summary(mod.2)[[1]][[1]][[3]]  # get 7.527778
df.A   <- summary(mod.2)[[1]][[1]][[1]]  # get 2
ms.res <- summary(mod.2)[[2]][[1]][[3]]  # get 7.768519
df.res <- summary(mod.2)[[2]][[1]][[1]]  # get 9
f.A <- ms.A / ms.res
p.A <- pf(f.A, df.A, df.res, lower.tail=F); p.A
6 Regression & correlation

6.1 Simple linear regression
Test H0: β = 0 in the model Yi = α + βXi + εi, from the following sampling data.

data.csv
X , Y
22 , 16
26 , 17
45 , 26
37 , 24
28 , 22
50 , 21
56 , 32
34 , 18
60 , 30
40 , 20

## load data
rawdata <- read.csv("data.csv")
## linear regression
mod <- lm(Y ~ X, data=rawdata)
summary(mod)
## prediction for X = c(14, 22.6)
newdata <- data.frame(X=c(14, 22.6))
predict(mod, newdata, interval="prediction")
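The fitted coefficients can be verified by hand: the least-squares slope is cov(X, Y)/var(X) and the intercept is mean(Y) minus the slope times mean(X). A base-R check using the data above, entered directly:

```r
X <- c(22, 26, 45, 37, 28, 50, 56, 34, 60, 40)
Y <- c(16, 17, 26, 24, 22, 21, 32, 18, 30, 20)
b <- cov(X, Y) / var(X)     # least-squares slope
a <- mean(Y) - b * mean(X)  # least-squares intercept
## identical to the coefficients reported by lm()
stopifnot(all.equal(unname(coef(lm(Y ~ X))), c(a, b)))
```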
6.2 Replicated simple linear regression
Test H0: β = 0 in the model Yi = α + βXi + εi, and test the lack of fit, using the following sampling data.

data.csv
Y , X
108 , 30
110 , 30
106 , 30
125 , 40
120 , 40
118 , 40
119 , 40
132 , 50
137 , 50
134 , 50
148 , 60
151 , 60
146 , 60
147 , 60
144 , 60
162 , 70
156 , 70
164 , 70
158 , 70
159 , 70

## load data
rawdata <- read.csv("data.csv")
## linear regression
mod <- lm(Y ~ X, data=rawdata); summary(mod)
## lack of fit in a regression with replicated data
mod.aov <- lm(Y ~ factor(X), data=rawdata)
anova(mod, mod.aov)
6.3 Simple linear correlation
Determine the correlation coefficient and test H0: the coefficient equals 0, using Pearson's, Kendall's, and Spearman's correlations on the following data.

## load data
rawdata <- read.csv("data.csv")
## Pearson's
attach(rawdata)
cor.test(wingL, tailL)                   # HA: rho != 0
cor.test(wingL, tailL, alternative="g")  # HA: rho > 0
cor.test(wingL, tailL, alternative="l")  # HA: rho < 0
## Kendall's
cor.test(wingL, tailL, method="kendall")
## Spearman's
cor.test(wingL, tailL, method="spearman")
6.4 Multiple regression
data.csv
J,A,B,C,D,E
1,6,9.9,5.7,1.6,2.12
2,1,9.3,6.4,3.0,3.39
3,-2,9.4,5.7,3.4,3.61
4,11,9.1,6.1,3.4,1.72
5,-1,6.9,6.0,3.0,1.80
6,2,9.3,5.7,4.4,3.21
7,5,7.9,5.9,2.2,2.59
8,1,7.4,6.2,2.2,3.25
9,1,7.3,5.5,1.9,2.86
10,3,8.8,5.2,0.2,2.32
11,11,9.8,5.7,4.2,1.57
12,9,10.5,6.1,2.4,1.50
13,5,9.1,6.4,3.4,2.69
14,-3,10.1,5.5,3.0,4.06
15,1,7.2,5.5,0.2,1.98
16,8,11.7,6.0,3.9,2.29
17,-2,8.7,5.5,2.2,3.55
18,3,7.6,6.2,4.4,3.31
19,6,8.6,5.9,0.2,1.83
20,10,10.9,5.6,2.4,1.69
Apply a multiple linear regression and subsequent model selection to the preceding data.

## load data
rawdata <- read.csv("data.csv")
## full model
mod.full <- lm(E ~ A + B + C + D, data=rawdata)
summary(mod.full)
## stepwise
mod.step <- step(mod.full, direction="both")
summary(mod.step)
## backward
mod.backward <- step(mod.full, direction="backward")
summary(mod.backward)
## forward (note: starting from the full model, forward selection cannot add terms)
mod.forward <- step(mod.full, direction="forward")
summary(mod.forward)
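As a sanity check (the data frame below is transcribed from the data.csv listing above, so no file is needed), the model chosen by step() can be compared against the full model with AIC() directly; the selected model never has a higher AIC than the model it started from:

```r
rawdata <- data.frame(
  A = c(6, 1, -2, 11, -1, 2, 5, 1, 1, 3, 11, 9, 5, -3, 1, 8, -2, 3, 6, 10),
  B = c(9.9, 9.3, 9.4, 9.1, 6.9, 9.3, 7.9, 7.4, 7.3, 8.8,
        9.8, 10.5, 9.1, 10.1, 7.2, 11.7, 8.7, 7.6, 8.6, 10.9),
  C = c(5.7, 6.4, 5.7, 6.1, 6.0, 5.7, 5.9, 6.2, 5.5, 5.2,
        5.7, 6.1, 6.4, 5.5, 5.5, 6.0, 5.5, 6.2, 5.9, 5.6),
  D = c(1.6, 3.0, 3.4, 3.4, 3.0, 4.4, 2.2, 2.2, 1.9, 0.2,
        4.2, 2.4, 3.4, 3.0, 0.2, 3.9, 2.2, 4.4, 0.2, 2.4),
  E = c(2.12, 3.39, 3.61, 1.72, 1.80, 3.21, 2.59, 3.25, 2.86, 2.32,
        1.57, 1.50, 2.69, 4.06, 1.98, 2.29, 3.55, 3.31, 1.83, 1.69)
)
mod.full <- lm(E ~ A + B + C + D, data = rawdata)
mod.step <- step(mod.full, direction = "both", trace = 0)  # silent stepwise
AIC(mod.full); AIC(mod.step)
stopifnot(AIC(mod.step) <= AIC(mod.full) + 1e-8)
```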
7 Count data

7.1 Goodness of fit test
Test whether the observed frequencies 4 : 4 conform to an expected ratio of 3 : 1.

## input observed frequencies and expected proportions
o <- c(4, 4)
e <- c(3, 1) / (3+1)
chisq.test(o, p=e)                              # Pearson's
chisq.test(o, p=c(3,1), rescale.p=T)            # Pearson's, alternative form
chisq.test(o, p=e, simulate.p.value=T, B=4000)  # Monte Carlo with 4000 replicates
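The Pearson statistic can also be computed by hand as the sum of (O - E)^2 / E over the categories, where E is the expected count under the 3 : 1 ratio; a base-R cross-check:

```r
o <- c(4, 4)                # observed counts
E <- sum(o) * c(3, 1) / 4   # expected counts under 3:1, i.e. 6 and 2
chi2 <- sum((o - E)^2 / E)  # (4-6)^2/6 + (4-2)^2/2 = 8/3
p <- pchisq(chi2, df = length(o) - 1, lower.tail = FALSE)
## matches the statistic returned by chisq.test()
stopifnot(all.equal(chi2, unname(chisq.test(o, p = c(3, 1)/4)$statistic)))
```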
7.2 Test of independence
data.csv
sex , hair , freq
M , Black , 32
M , Blond , 16
M , Brown , 43
M , Red , 9
F , Black , 55
F , Blond , 64
F , Brown , 65
F , Red , 16
Test the independence between factor sex and factor hair.

## load data
rawdata <- read.csv("data.csv")
## Pearson's
crosstable <- xtabs(freq ~ sex + hair, data=rawdata)
summary(crosstable)
chisq.test(crosstable)  # alternative
## G-test
library(MASS); loglm(freq ~ sex + hair, rawdata)
source("http://www.psych.ualberta.ca/~phurd/cruft/g.test.r")
g.test(crosstable)      # alternative
## Fisher's exact test
fisher.test(crosstable)
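The Pearson statistic for this 2 x 4 table can be computed by hand: expected counts are the products of the row and column totals divided by the grand total. A base-R cross-check with the counts entered directly (chisq.test applies no continuity correction to tables larger than 2 x 2, so the values agree exactly):

```r
crosstable <- matrix(c(32, 55, 16, 64, 43, 65, 9, 16), nrow = 2,
                     dimnames = list(sex = c("M", "F"),
                                     hair = c("Black", "Blond", "Brown", "Red")))
## expected counts under independence: (row total * column total) / grand total
E <- outer(rowSums(crosstable), colSums(crosstable)) / sum(crosstable)
chi2 <- sum((crosstable - E)^2 / E)
stopifnot(all.equal(chi2, unname(chisq.test(crosstable)$statistic)))
```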
7.3 Test of homogeneity
data.csv
grade,Black,Blond,Brown,Red
1 , 23 , 48 , 54 , 13
2 , 24 , 56 , 64 , 21
3 , 19 , 58 , 48 , 20
4 , 21 , 56 , 57 , 22
5 , 23 , 39 , 57 , 21
6 , 16 , 48 , 55 , 13
A researcher randomly sampled students in different grades of an elementary school, recorded each student's hair color, and summarized the table above. Test whether the frequency ratios of hair colors are 0.125 : 0.375 : 0.375 : 0.125 in every grade.

## load data
rawdata <- read.csv("data.csv")
rawdata$sum <- apply(rawdata[2:5], 1, sum)
rawdata["sum",] <- apply(rawdata, 2, sum)
## chi-square of each grade
e <- c(0.125, 0.375, 0.375, 0.125)
chi.each <- rep(NA, 6); df.each <- rep(3, 6)
for (i in 1:6) {
  chi.each[i] <- as.numeric(chisq.test(rawdata[i,2:5], p=e)$statistic)
}
## chi-square of all grades
chi.all <- as.numeric(chisq.test(rawdata["sum",2:5], p=e)$statistic)
df.all <- 3
## testing homogeneity
chi.homo <- sum(chi.each) - chi.all
df.homo <- sum(df.each) - df.all
pchisq(chi.homo, df.homo, lower.tail=F)
## since p = 0.8362961, we do not reject H0
## and may merge the data from the different grades
chisq.test(rawdata["sum",2:5], p=e)

For instance, following the previous example, if you test the homogeneity between grades without specifying the expected ratio, the calculation is mathematically equivalent to that of the test of independence.

## load data
rawdata <- read.csv("data.csv")
freq <- rawdata[1:6, 2:5]
## test
chisq.test(freq)
A Free resources for learning R

Using R for psychological research by William Revelle. http://www.personality-project.org/r/r.guide.html
R Graph Gallery by Romain François. http://addictedtor.free.fr/graphiques/
R Wiki by the R Wiki community. http://rwiki.sciviews.org/doku.php?id=start
An Introduction to R by Bill Venables and David M. Smith. http://cran.r-project.org/doc/manuals/R-intro.html
R Data Import/Export by Douglas Bates, Saikat DebRoy and Brian Ripley. http://cran.r-project.org/doc/manuals/R-data.html
Resources to help you learn and use R by UCLA Academic Technology Services. http://www.ats.ucla.edu/stat/R/
Statistics with R by Vincent Zoonekynd. http://zoonek2.free.fr/UNIX/48_R/all.html
R: Statistical Computing and Programming Language by Chien-Fu Jeff Lin; in traditional Chinese. http://web.ntpu.edu.tw/~cflin/Teach/R/Rproj.htm
R by Taiwan's National Applied Research Laboratories; in traditional Chinese. https://sites.google.com/site/rprojectnotes/ or http://statlab.nchc.org.tw/rnotes/
R (videos) by Chen-Pan Liao; in traditional Chinese. http://apansharing.blogspot.tw/p/r-demo.html or http://www.youtube.com/playlist?list=PL5AC0ADBF65924EAD