adding pointer to williams 1992, thanks @goodfeli for pointing out
karpathy committed Aug 14, 2016
1 parent 25a55bb commit 50f2fad
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions _posts/2016-05-31-rl.markdown
@@ -163,15 +163,15 @@ In conclusion, once you understand the "trick" by which these algorithms work yo

### Non-differentiable computation in Neural Networks

- I'd like to mention one more interesting application of Policy Gradients unrelated to games: It allows us to design and train neural networks with components that perform (or interact with) non-differentiable computation. The idea was first introduced in [Recurrent Models of Visual Attention](http://arxiv.org/abs/1406.6247) in the context of a model that processed an image with a sequence of low-resolution foveal glances (inspired by our own human eyes). In particular, at every iteration an RNN would receive a small piece of the image and sample a location to look at next. For example, the RNN might look at position (5,30), receive a small piece of the image, then decide to look at (24, 50), etc. The problem with this idea is that there is a piece of the network that produces a distribution over where to look next and then samples from it. Unfortunately, this operation is non-differentiable because, intuitively, we don't know what would have happened if we sampled a different location. More generally, consider a neural network from some inputs to outputs:
+ I'd like to mention one more interesting application of Policy Gradients unrelated to games: It allows us to design and train neural networks with components that perform (or interact with) non-differentiable computation. The idea was first introduced in [Williams 1992](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) and more recently popularized by [Recurrent Models of Visual Attention](http://arxiv.org/abs/1406.6247) under the name "hard attention", in the context of a model that processed an image with a sequence of low-resolution foveal glances (inspired by our own human eyes). In particular, at every iteration an RNN would receive a small piece of the image and sample a location to look at next. For example, the RNN might look at position (5,30), receive a small piece of the image, then decide to look at (24, 50), etc. The problem with this idea is that there is a piece of the network that produces a distribution over where to look next and then samples from it. Unfortunately, this operation is non-differentiable because, intuitively, we don't know what would have happened if we sampled a different location. More generally, consider a neural network from some inputs to outputs:

<div class="imgcap">
<img src="/assets/rl/nondiff1.png" width="600">
</div>

Notice that most arrows (in blue) are differentiable as normal, but some of the representation transformations could optionally also include a non-differentiable sampling operation (in red). We can backprop through the blue arrows just fine, but the red arrow represents a dependency that we cannot backprop through.

- Policy gradients to the rescue! We'll think about the part of the network that does the sampling as a small stochastic policy embedded in the wider network. Therefore, during training we will produce several samples (indicated by the branches below), and then we'll encourage samples that eventually led to good outcomes (in this case, for example, measured by the loss at the end). In other words, we will train the parameters involved in the blue arrows with backprop as usual, but the parameters involved with the red arrow will now be updated independently of the backward pass using policy gradients, encouraging samples that led to low loss. This idea was formalized nicely in [Gradient Estimation Using Stochastic Computation Graphs](http://arxiv.org/abs/1506.05254).
+ Policy gradients to the rescue! We'll think about the part of the network that does the sampling as a small stochastic policy embedded in the wider network. Therefore, during training we will produce several samples (indicated by the branches below), and then we'll encourage samples that eventually led to good outcomes (in this case, for example, measured by the loss at the end). In other words, we will train the parameters involved in the blue arrows with backprop as usual, but the parameters involved with the red arrow will now be updated independently of the backward pass using policy gradients, encouraging samples that led to low loss. This idea was also recently formalized nicely in [Gradient Estimation Using Stochastic Computation Graphs](http://arxiv.org/abs/1506.05254).

<div class="imgcap">
<img src="/assets/rl/nondiff2.png" width="600">
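To make the added paragraph concrete, here is a minimal numpy sketch (not part of this commit or the post) of training such a sampling component with the score-function estimator from Williams 1992 (REINFORCE), using the negative loss as the reward. All names (`W`, `glimpse_dim`, `loss_at_end`, the toy task) are made up for illustration; in the post's setting the differentiable (blue) parts of the network would additionally be trained with ordinary backprop.

```python
# Minimal sketch: the non-differentiable "where to look next" sample is treated as a
# small stochastic policy, and its parameters are updated with the score-function
# (REINFORCE) gradient estimator rather than backprop.
import numpy as np

np.random.seed(0)
num_locations = 8                            # discrete "where to look next" choices
glimpse_dim = 8                              # size of the representation feeding the sampler
W = np.zeros((num_locations, glimpse_dim))   # parameters of the sampling policy (the red arrow)
learning_rate = 0.1

def loss_at_end(location, target):
    # stand-in for "run the rest of the network and measure the loss at the end"
    return 0.0 if location == target else 1.0

for step in range(5000):
    h = np.random.randn(glimpse_dim)         # upstream representation (e.g. an RNN state)
    target = int(np.argmax(h))               # in this toy task the best location is determined by h

    logits = W.dot(h)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over locations
    loc = int(np.random.choice(num_locations, p=probs))  # the non-differentiable sample
    reward = -loss_at_end(loc, target)       # encourage samples that led to low loss

    # score-function gradient: d log p(loc | h) / d logits = onehot(loc) - probs,
    # so d log p / d W = outer(onehot(loc) - probs, h). In practice a baseline is
    # subtracted from the reward to reduce the variance of this estimator.
    dlogits = -probs
    dlogits[loc] += 1.0
    W += learning_rate * reward * np.outer(dlogits, h)
```

Averaged over many samples this update raises the probability of locations that led to low loss, which is exactly the role of the red arrow in the figure, while the blue arrows stay on the ordinary backprop path.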
