adding alternative view explanation
karpathy committed Dec 9, 2016
1 parent 11c0ff2 commit d829052
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions _posts/2016-05-31-rl.markdown
@@ -97,6 +97,8 @@ And that's it: we have a stochastic policy that samples actions and then actions

If you think through this process you'll start to find a few funny properties. For example, what if we took a good action in frame 50 (bouncing the ball back correctly), but then missed the ball in frame 150? If every single action is now labeled as bad (because we lost), wouldn't that discourage the correct bounce on frame 50? You're right - it would. However, when you consider the process over thousands/millions of games, doing the first bounce correctly makes you slightly more likely to win down the road, so on average you'll see more positive than negative updates for the correct bounce and your policy will end up doing the right thing.

**Update: December 9, 2016 - alternative view**. In my explanation above I use terms such as "fill in the gradient and backprop", which I realize is a special kind of thinking if you're used to writing your own backprop code, or using Torch where the gradients are explicit and open for tinkering. However, if you're used to Theano or TensorFlow you might be a little perplexed because the code is organized around specifying a loss function and the backprop is fully automatic and hard to tinker with. In this case, the following alternative view might be more intuitive. In vanilla supervised learning the objective is to maximize \\( \sum\_i \log p(y\_i \mid x\_i) \\) where \\(x\_i, y\_i \\) are training examples (such as images and their labels). Policy gradients is exactly the same as supervised learning with two minor differences: 1) We don't have the correct labels \\(y\_i\\), so as a "fake label" we substitute the action we happened to sample from the policy when it saw \\(x\_i\\), and 2) We modulate the loss for each example multiplicatively based on the eventual outcome, since we want to increase the log probability for actions that worked and decrease it for those that didn't. So in summary our loss now looks like \\( \sum\_i A\_i \log p(y\_i \mid x\_i) \\), where \\(y\_i\\) is the action we happened to sample and \\(A\_i\\) is a number that we call an **advantage**. In the case of Pong, for example, \\(A\_i\\) could be 1.0 if we eventually won in the episode that contained \\(x\_i\\) and -1.0 if we lost. This will ensure that we maximize the log probability of actions that led to a good outcome and minimize the log probability of those that didn't.
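
To make the "supervised learning with a modulated loss" view concrete, here is a minimal numpy sketch of the advantage-weighted log-probability objective. It is not code from this post; the function name and array shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def advantage_weighted_loss(logits, sampled_actions, advantages):
    """Negative of sum_i A_i * log p(y_i | x_i), where y_i is the sampled action.

    logits: (N, num_actions) unnormalized action scores from the policy network
    sampled_actions: (N,) index of the action we happened to sample for each x_i
    advantages: (N,) e.g. +1.0 if the episode containing x_i was won, -1.0 if lost
    """
    # softmax over the action dimension to get p(y | x)
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    logp = np.log(probs[np.arange(logits.shape[0]), sampled_actions])
    # minimizing this is the same as maximizing sum_i A_i * log p(y_i | x_i)
    return -np.sum(advantages * logp)
```

In a framework like Theano or TensorFlow you would declare this scalar as the loss (an advantage-weighted cross-entropy on the sampled actions) and let automatic differentiation handle the backward pass.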

**More general advantage functions**. I also promised a bit more discussion of the returns. So far we have judged the *goodness* of every individual action based on whether or not we win the game. In a more general RL setting we would receive some reward \\(r_t\\) at every time step. One common choice is to use a discounted reward, so the "eventual reward" in the diagram above would become \\( R\_t = \sum\_{k=0}^{\infty} \gamma^k r\_{t+k} \\), where \\(\gamma\\) is a number between 0 and 1 called a discount factor (e.g. 0.99). The expression states that the strength with which we encourage a sampled action is the weighted sum of all rewards afterwards, but later rewards are exponentially less important. In practice it can also be important to normalize these. For example, suppose we compute \\(R_t\\) for all of the 20,000 actions in the batch of 100 Pong game rollouts above. One good idea is to "standardize" these returns (e.g. subtract the mean, divide by the standard deviation) before we plug them into backprop. This way we're always encouraging and discouraging roughly half of the performed actions. Mathematically you can also interpret these tricks as a way of controlling the variance of the policy gradient estimator. A more in-depth exploration can be found [here](http://arxiv.org/abs/1506.02438).
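
As a concrete (and hypothetical, not taken from this post) numpy sketch of the discounting and standardization described above, assuming `rewards` holds the per-step rewards of a single rollout:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_t = sum_k gamma^k * r_{t+k}, computed in one backward pass."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Stack the returns from every rollout in the batch, then standardize them
# before plugging them into backprop as the per-action advantages:
# R = np.concatenate([discounted_returns(r) for r in rollouts])
# R = (R - R.mean()) / (R.std() + 1e-8)
```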

**Deriving Policy Gradients**. I'd like to also give a sketch of where Policy Gradients come from mathematically. Policy Gradients are a special case of a more general *score function gradient estimator*. In the general case we have an expression of the form \\(E_{x \sim p(x \mid \theta)} [f(x)] \\) - i.e. the expectation of some scalar-valued score function \\(f(x)\\) under some probability distribution \\(p(x;\theta)\\) parameterized by some \\(\theta\\). Hint hint, \\(f(x)\\) will become our reward function (or advantage function more generally) and \\(p(x)\\) will be our policy network, which is really a model for \\(p(a \mid I)\\), giving a distribution over actions for any image \\(I\\). Then we are interested in finding how we should shift the distribution (through its parameters \\(\theta\\)) to increase the scores of its samples, as judged by \\(f\\) (i.e. how do we change the network's parameters so that action samples get higher rewards). We have that:
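
For reference, the identity this derivation builds toward is the standard score function (REINFORCE) gradient estimator: \\( \nabla\_{\theta} E\_{x \sim p(x \mid \theta)} [f(x)] = E\_{x \sim p(x \mid \theta)} [ f(x) \nabla\_{\theta} \log p(x \mid \theta) ] \\). In words, the gradient of the expected score is the expectation of the score times the gradient of the log probability of the sample, so it can be estimated from samples drawn from the current policy.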
