---
title: "Number of Samples Needed For Model Selection with Confidence"
author: "Sean Carver, Ph.D., American University"
date: "August 1, 2017"
output: ioslides_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Model Similarity
**Question:** How similar is the model of the NY Yankees baseball team to the model of our home team, the Baltimore Orioles?
- Common answer: the Kullback-Leibler divergence (defined below).
- Different answer: the number of Baltimore at-bat half-innings we would need to simulate to correctly reject the Yankees' model 95 percent of the time.
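
For reference, the Kullback-Leibler divergence from $Q$ to $P$ is the expected per-sample log-likelihood-ratio when the data come from $P$, so it is the average rate at which evidence against the wrong model accumulates:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \mathrm{E}_{P}\!\left[\log \frac{p(X)}{q(X)}\right]$$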
![](orioles_cropped.jpg)
## Baseball Model
- Markov chain fitted to data from the 2011 Baltimore Orioles, batting at home (Marchi & Albert, 2013); see the toy sketch after the example below.
- Joint work: Rebeca Berger, American University, Class of 2017.
```{r bbsetup, cache=TRUE}
source('simulation.R')
source('msbaseball.R')
```
```{r bb, echo=TRUE}
# Simulate 5 Baltimore at-bat half-innings from the fitted BAL model
sim.baseball(5, BAL, seed=1)
# Score the same simulated innings under the BAL model
likes.baseball(sim.baseball(5, BAL, seed=1), BAL)
```
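
The repo's `sim.baseball()` is not shown here; below is only a toy sketch of the Markov-chain idea, where the transition matrix `P` and its state names are hypothetical stand-ins, not the fitted Orioles model.

```{r chainsketch, echo=TRUE}
# Toy sketch (NOT the repo's sim.baseball): walk a Markov chain whose
# states could be base-out configurations; P is a hypothetical
# transition matrix with named rows/columns, each row summing to 1.
simulate.chain <- function(P, start, n.steps) {
  states <- character(n.steps + 1)
  states[1] <- start
  for (k in seq_len(n.steps)) {
    states[k + 1] <- sample(colnames(P), 1, prob = P[states[k], ])
  }
  states
}
```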
## New Models: t(5) and N(0,1)
```{r densitycurves}
library(ggplot2)
densitycurve5 <- function(x) dt(x, df = 5)
densitycurveinf <- function(x) dt(x, df = Inf)
# Map fill inside aes() so that scale_fill_manual() draws a legend
ggplot(NULL, aes(c(-3, 3))) +
  geom_area(aes(fill = "5"), stat = "function", fun = densitycurve5,
            alpha = 0.4, xlim = c(-3, 3)) +
  geom_area(aes(fill = "Inf"), stat = "function", fun = densitycurveinf,
            alpha = 0.4, xlim = c(-3, 3)) +
  xlab("Data Point") + ylab("Density") +
  scale_fill_manual(name = "df", values = c("5" = "red", "Inf" = "blue")) +
  theme(legend.position = "right")
```
## Running: t(5) and N(0,1)
```{r klisetup, cache=FALSE, message=FALSE}
source('kli.R')
```
```{r simt, echo=TRUE}
# Draw 200,000 samples from t with 5 df (a pool to resample from later)
samples.t5 <- sim.t(200000, df=5, seed=7); head(samples.t5, 5)
# Score every sample under t(5) and under N(0,1), i.e. t with df = Inf
likes.t5.t5 <- likes.t(samples.t5, df=5)
likes.t5.normal <- likes.t(samples.t5, df=Inf)
# Per-sample log-likelihood-ratios of t(5) versus N(0,1)
head(like.ratios(likes.t5.t5, likes.t5.normal), 5)
```
- A positive log-likelihood-ratio selects the correct model.
- Summing per-sample ratios gives the log-likelihood-ratio of the whole ensemble; see the sketch below.
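
A minimal sketch of that selection rule with the quantities above (this assumes, as the output suggests, that `like.ratios()` returns one log-ratio per sample):

```{r ensemblesketch, echo=TRUE}
# Selection-rule sketch: sum per-sample log-likelihood-ratios; a
# positive total selects t(5) over N(0,1) for the whole ensemble.
ratios <- like.ratios(likes.t5.t5, likes.t5.normal)
sum(ratios[1:100]) > 0  # TRUE would select t(5) from 100 samples
```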
## The Bootstrap Matrix
- Each element of the $i^{th}$ column contains the sum of $i$ independent log-likelihood-ratios.
- The rows are independent; a constructive sketch follows the example below.
```{r bootstrapmatrix, echo=TRUE, cache=TRUE}
bootstrap.matrix(likes.t5.t5, likes.t5.normal,
max.samples=5, bootstrap.rows=7, seed=5)
```
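
A minimal sketch of how such a matrix could be built (a hypothetical helper, not the repo's `bootstrap.matrix()`):

```{r bmsketch, echo=TRUE}
# Sketch: entry [r, i] is the sum of i log-likelihood-ratios resampled
# with replacement, so rows are independent bootstrap replicates.
sketch.bootstrap.matrix <- function(ratios, max.samples, bootstrap.rows, seed) {
  set.seed(seed)
  draws <- matrix(sample(ratios, max.samples * bootstrap.rows, replace = TRUE),
                  nrow = bootstrap.rows)
  t(apply(draws, 1, cumsum))  # column i holds sums of i ratios
}
```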
## Proportion Correct in Each Column
```{r plotprop}
prophyp <- prop.choose.hyp(likes.t5.t5, likes.t5.normal,
                           max.samples=250, bootstrap.rows=1000, seed=5)
plot(1:250, prophyp, ylab="Proportion Correct", xlab="Sample Size",
     main="Based on 1000 Rows")
abline(h = 0.95)
```
Each proportion is a multiple of 1/(number of bootstrap rows), so the curve moves in discrete jumps (bad for regression); a one-line sketch of the computation follows.
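
In terms of the matrix sketched above, the proportion correct in column $i$ is just the fraction of positive entries (a sketch, not the repo's `prop.choose.hyp()`):

```{r propsketch, echo=TRUE}
# Sketch: fraction of bootstrap rows whose i-sample sum is positive
sketch.prop.correct <- function(boot.mat) colMeans(boot.mat > 0)
```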
## $5^{th}$ Percentile of Log-Likelihood-Ratio
```{r qcplot, cache=TRUE}
q5 <- quantile.columns(likes.t5.t5, likes.t5.normal,
                       max.samples=250, bootstrap.rows=1000, seed=5,
                       confidence.level=0.95)
plot(1:250, q5, ylab="5th %-ile of Log-Likelihood-Ratio", xlab="Sample Size")
abline(h = 0)
```
- The "proportion correct" crosses 0.95 at the same location where the $5^{th}$ percentile (not discrete) crosses zero.
## Region of interest
```{r roiplot, cache=TRUE}
roi <- region.of.interest(likes.t5.t5, likes.t5.normal,
                          max.samples=250, bootstrap.rows=1000, seed=5,
                          confidence.level=0.95)
eq <- estimated.quantiles(likes.t5.t5, likes.t5.normal,
                          max.samples=250, bootstrap.rows=1000, seed=5,
                          confidence.level=0.95)
plot(roi[[1]], roi[[2]], xlab="Sample Size",
     ylab="5th %-ile of Log-Likelihood-Ratio")
lines(eq[[1]], eq[[2]])
abline(h = 0)
```
- We restrict the $5^{th}$ quantiles to a region of interest, then use regression to compute the intersection with zero; a sketch follows.
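
A minimal sketch of that intersection step with `lm()` on the ROI points returned above (the `quantreg::rq()` alternative is compared later):

```{r roisketch, echo=TRUE}
# Sketch: regress the 5th-percentile points in the region of interest
# on sample size, then solve the fitted line for its zero crossing.
fit <- lm(q ~ n, data = data.frame(n = roi[[1]], q = roi[[2]]))
-coef(fit)[["(Intercept)"]] / coef(fit)[["n"]]  # estimated samples needed
```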
## Computing number of samples
```{r numberofsamples, echo=TRUE, cache=TRUE}
samples.needed(likes.t5.t5, likes.t5.normal,
max.samples=250, bootstrap.rows=1000, seed=5,
confidence.level=0.95)
```
```{r repssetup}
source('reps.R')
```
```{r repeated, echo=TRUE, cache=TRUE}
repeated.estimates(reps=6, seed=1)
```
- Occasionally the method fails and produces a nonsensical estimate, but such estimates are easy to spot, as sketched below.
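
One way to flag such failures, as a sketch (not evaluated; it assumes `repeated.estimates()` returns a numeric vector of sample sizes):

```{r failsketch, echo=TRUE, eval=FALSE}
# Sketch: non-finite or non-positive "sample sizes" are nonsensical
# and can be flagged before summarizing.
ests <- repeated.estimates(reps=6, seed=1)
ests[!is.finite(ests) | ests <= 0]  # candidates to discard
```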
## Density of Estimates
```{r repeatqr, warning=FALSE, cache=TRUE}
qr.reps <- repeated.estimates(reps=100, seed=2,
quantile.regression=TRUE)
```
```{r repeatslr, echo=FALSE, warning=FALSE, cache=TRUE}
slr.reps <- repeated.estimates(reps=100, seed=2,
quantile.regression=FALSE)
```
```{r bplot}
library(ggplot2)
# Label each batch of estimates by the regression method that produced it
df <- data.frame(
  est = c(qr.reps, slr.reps),
  lab = c(rep('quantreg::rq()', length(qr.reps)),
          rep('quantile(); lm()', length(slr.reps)))
)
ggplot(df, aes(x=lab, y=est)) + geom_violin() + geom_boxplot(width=0.1) +
  xlab('Type of Regression') + ylab('Estimate') +
  theme(axis.text.x=element_text(size=15))
```
## Conclusions
- We found little, if any, improvement from quantreg::rq() over the two-step procedure quantile() $\rightarrow$ lm(), and having the intermediate quantiles available made the two-step approach more efficient.
- Based on the distribution of 100 estimates, it takes approximately $128 \pm 1$ samples (mean $\pm$ 2 standard errors) to correctly reject the standard normal distribution using samples from the t-distribution with 5 degrees of freedom (recomputed in the sketch below).
- The above result required minutes of computation, not hours, and was robust.
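
The reported interval can be recomputed from the 100 two-step estimates (a sketch; it assumes `slr.reps` is a numeric vector and drops any failed estimates first):

```{r cisketch, echo=TRUE, eval=FALSE}
# Mean +/- 2 standard errors of the finite two-step estimates
est <- slr.reps[is.finite(slr.reps)]
mean(est) + c(-2, 2) * sd(est) / sqrt(length(est))
```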
## Why (or Why Not) Samples Needed?
![](baseballimage.jpg)
- Advantages: provides a measure of similarity between models that may be easier to interpret than the Kullback-Leibler divergence.
- Disadvantages: harder and less natural to compute than the Kullback-Leibler divergence.
## Future Work
I am interested in models in neuroscience whose likelihoods must be approximated with sequential Monte Carlo techniques.
![](neuron.jpg)
## Questions?
![](BaseballPitch.jpg)
## Links & Acknowledgments
- These slides can be found at http://jsm17.seancarver.org/talk.html
- My code can be found at https://github.com/seancarverphd/klir.git
- Baseball collaborator: Rebeca Berger, American University, Class of 2017.
- Baseball data (needed to run our baseball code) available from Max Marchi, see: https://github.com/maxtoki/baseball_R
- Citation: Max Marchi and Jim Albert (2013), _Analyzing Baseball With R_, CRC Press.