
Commit

adding one more note
karpathy committed Dec 9, 2016
1 parent d829052 commit ac3f3a6
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion _posts/2016-05-31-rl.markdown
@@ -97,7 +97,7 @@ And that's it: we have a stochastic policy that samples actions and then actions

If you think through this process you'll start to find a few funny properties. For example, what if we made a good action in frame 50 (bouncing the ball back correctly), but then missed the ball in frame 150? If every single action is now labeled as bad (because we lost), wouldn't that discourage the correct bounce on frame 50? You're right - it would. However, when you consider the process over thousands/millions of games, then doing the first bounce correctly makes you slightly more likely to win down the road, so on average you'll see more positive than negative updates for the correct bounce and your policy will end up doing the right thing.

- **Update: December 9, 2016 - alternative view**. In my explanation above I use terms such as "fill in the gradient and backprop", which I realize is a special kind of thinking if you're used to writing your own backprop code, or using Torch where the gradients are explicit and open for tinkering. However, if you're used to Theano or TensorFlow you might be a little perplexed because the code is organized around specifying a loss function and the backprop is fully automatic and hard to tinker with. In this case, the following alternative view might be more intuitive. In vanilla supervised learning the objective is to maximize \\( \sum\_i \log p(y\_i \mid x\_i) \\) where \\(x\_i, y\_i \\) are training examples (such as images and their labels). Policy gradients is exactly the same as supervised learning with two minor differences: 1) We don't have the correct labels \\(y\_i\\), so as a "fake label" we substitute the action we happened to sample from the policy when it saw \\(x\_i\\), and 2) We modulate the loss for each example multiplicatively based on the eventual outcome, since we want to increase the log probability for actions that worked and decrease it for those that didn't. So in summary our loss now looks like \\( \sum\_i A\_i \log p(y\_i \mid x\_i) \\), where \\(y\_i\\) is the action we happened to sample and \\(A\_i\\) is a number that we call an **advantage**. In the case of Pong, for example, \\(A\_i\\) could be 1.0 if we eventually won in the episode that contained \\(x\_i\\) and -1.0 if we lost. This will ensure that we maximize the log probability of actions that led to a good outcome and minimize the log probability of those that didn't.
+ **Update: December 9, 2016 - alternative view**. In my explanation above I use terms such as "fill in the gradient and backprop", which I realize is a special kind of thinking if you're used to writing your own backprop code, or using Torch where the gradients are explicit and open for tinkering. However, if you're used to Theano or TensorFlow you might be a little perplexed because the code is organized around specifying a loss function and the backprop is fully automatic and hard to tinker with. In this case, the following alternative view might be more intuitive. In vanilla supervised learning the objective is to maximize \\( \sum\_i \log p(y\_i \mid x\_i) \\) where \\(x\_i, y\_i \\) are training examples (such as images and their labels). Policy gradients is exactly the same as supervised learning with two minor differences: 1) We don't have the correct labels \\(y\_i\\), so as a "fake label" we substitute the action we happened to sample from the policy when it saw \\(x\_i\\), and 2) We modulate the loss for each example multiplicatively based on the eventual outcome, since we want to increase the log probability for actions that worked and decrease it for those that didn't. So in summary our loss now looks like \\( \sum\_i A\_i \log p(y\_i \mid x\_i) \\), where \\(y\_i\\) is the action we happened to sample and \\(A\_i\\) is a number that we call an **advantage**. In the case of Pong, for example, \\(A\_i\\) could be 1.0 if we eventually won in the episode that contained \\(x\_i\\) and -1.0 if we lost. This will ensure that we maximize the log probability of actions that led to a good outcome and minimize the log probability of those that didn't. So reinforcement learning is exactly like supervised learning, but on a continuously changing dataset (the episodes), scaled by the advantage, and we only want to do one (or very few) updates based on each sampled dataset.
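For concreteness, here is a minimal numpy sketch of this alternative view (an illustration, not code from the post), assuming a made-up batch of `N` sampled states and a softmax policy over `K` discrete actions; the `advantages` array plays the role of the per-example \\(A\_i\\):

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical batch: N sampled states, K discrete actions (illustrative sizes)
N, K = 5, 3
np.random.seed(0)
logits = np.random.randn(N, K)                                  # policy network outputs
probs = softmax(logits)                                         # p(y | x) for each state
sampled = np.array([np.random.choice(K, p=p) for p in probs])   # actions we happened to sample ("fake labels")
advantages = np.array([1.0, -1.0, 1.0, 1.0, -1.0])              # +1 if that episode was won, -1 if lost (made up)

# objective to maximize: sum_i A_i * log p(y_i | x_i)
logp = np.log(probs[np.arange(N), sampled])
objective = np.sum(advantages * logp)

# gradient w.r.t. the logits: the usual supervised cross-entropy gradient
# (one_hot(y) - probs), scaled multiplicatively by the advantage A_i
one_hot = np.eye(K)[sampled]
dlogits = advantages[:, None] * (one_hot - probs)
```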

**More general advantage functions**. I also promised a bit more discussion of the returns. So far we have judged the *goodness* of every individual action based on whether or not we win the game. In a more general RL setting we would receive some reward \\(r\_t\\) at every time step. One common choice is to use a discounted reward, so the "eventual reward" in the diagram above would become \\( R\_t = \sum\_{k=0}^{\infty} \gamma^k r\_{t+k} \\), where \\(\gamma\\) is a number between 0 and 1 called a discount factor (e.g. 0.99). The expression states that the strength with which we encourage a sampled action is the weighted sum of all rewards afterwards, but later rewards are exponentially less important. In practice it can also be important to normalize these. For example, suppose we compute \\(R\_t\\) for all of the 20,000 actions in the batch of 100 Pong game rollouts above. One good idea is to "standardize" these returns (e.g. subtract mean, divide by standard deviation) before we plug them into backprop. This way we're always encouraging and discouraging roughly half of the performed actions. Mathematically you can also interpret these tricks as a way of controlling the variance of the policy gradient estimator. A more in-depth exploration can be found [here](http://arxiv.org/abs/1506.02438).
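As a rough sketch (again mine, not code from the post), computing these discounted returns and standardizing them for one rollout could look like the following; the `discounted_returns` helper and the toy reward numbers are made up for illustration:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # R_t = sum_k gamma^k * r_{t+k}, computed with a single backward pass
    returns = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# toy reward sequence for one rollout (made-up numbers)
rewards = np.array([0.0, 0.0, 1.0, 0.0, 0.0, -1.0])
R = discounted_returns(rewards)

# standardize before plugging into backprop, so roughly half of the
# actions get encouraged and half discouraged; 1e-8 guards against a zero std
R = (R - R.mean()) / (R.std() + 1e-8)
```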

