Commit

clarifying a bit
karpathy committed May 8, 2019
1 parent 92bae51 commit 44d6faa
Showing 1 changed file with 1 addition and 1 deletion: _posts/2019-04-25-recipe.markdown
@@ -68,7 +68,7 @@ Tips & tricks for this stage:
- **verify decreasing training loss**. At this stage you will hopefully be underfitting on your dataset because you're working with a toy model. Try to increase its capacity just a bit. Did your training loss go down as it should?
- **visualize just before the net**. The unambiguously correct place to visualize your data is immediately before your `y_hat = model(x)` (or `sess.run` in tf). That is - you want to visualize *exactly* what goes into your network, decoding that raw tensor of data and labels into visualizations. This is the only "source of truth". I can't count the number of times this has saved me and revealed problems in data preprocessing and augmentation.
- **visualize prediction dynamics**. I like to visualize model predictions on a fixed test batch during the course of training. The "dynamics" of how these predictions move will give you incredibly good intuition for how the training progresses. Many times it is possible to feel the network "struggle" to fit your data if it wiggles too much in some way, revealing instabilities. Very low or very high learning rates are also easily noticeable in the amount of jitter.
-- **use backprop to chart dependencies**. Your deep learning code will often contain complicated, vectorized, and broadcasted operations. A relatively common bug I've come across a few times is that people get this wrong (e.g. they use `view` instead of `transpose/permute` somewhere) and inadvertently mix information across the batch dimension. It is a depressing fact that your network will typically still train okay because it will learn to ignore data from the other examples. One way to debug this (and other related problems) is to set the loss for some example **i** to be 1.0, run the backward pass all the way to the input, and ensure that you get a non-zero gradient only on the **i-th** example. More generally, gradients give you information about what depends on what in your network, which can be useful for debugging.
+- **use backprop to chart dependencies**. Your deep learning code will often contain complicated, vectorized, and broadcasted operations. A relatively common bug I've come across a few times is that people get this wrong (e.g. they use `view` instead of `transpose/permute` somewhere) and inadvertently mix information across the batch dimension. It is a depressing fact that your network will typically still train okay because it will learn to ignore data from the other examples. One way to debug this (and other related problems) is to set the loss to be something trivial like the sum of all outputs of example **i**, run the backward pass all the way to the input, and ensure that you get a non-zero gradient only on the **i-th** input. The same strategy can be used to e.g. ensure that your autoregressive model at time t only depends on 1..t-1. More generally, gradients give you information about what depends on what in your network, which can be useful for debugging.
- **generalize a special case**. This is a bit more of a general coding tip but I've often seen people create bugs when they bite off more than they can chew, writing a relatively general functionality from scratch. I like to write a very specific function to what I'm doing right now, get that to work, and then generalize it later making sure that I get the same result. Often this applies to vectorizing code, where I almost always write out the fully loopy version first and only then transform it to vectorized code one loop at a time.
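As a concrete instance of the "verify decreasing training loss" check in the tips above, here is a minimal sketch in PyTorch. The model, data, and hyperparameters are all made up for illustration; the only invariant being tested is that the loss actually goes down:

```python
# Hypothetical toy setup: a tiny linear model on random regression data.
# We only verify the basic invariant that training loss decreases.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 1)  # deliberately tiny "toy model"
x, y = torch.randn(64, 4), torch.randn(64, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)

losses = []
for step in range(100):
    loss = ((model(x) - y) ** 2).mean()  # MSE
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

# The sanity check: loss at the end should be lower than at the start.
assert losses[-1] < losses[0]
```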
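The backprop dependency check described in the diff above can be sketched as follows. The model and shapes here are invented for illustration; the technique is exactly the one the post describes: make the loss the sum of all outputs of example `i`, backprop to the input, and verify only row `i` receives gradient:

```python
# Hypothetical model for demonstration; any module would do.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(5, 8, requires_grad=True)  # batch of 5 examples
i = 2  # probe which inputs example i depends on

y_hat = model(x)
loss = y_hat[i].sum()  # trivial loss: sum of all outputs of example i
loss.backward()

# Per-example gradient magnitude on the input.
grad_norms = x.grad.norm(dim=1)
# Only row i should have non-zero gradient; any other non-zero row means
# information is leaking across the batch dimension somewhere.
assert grad_norms[i] > 0
assert all(grad_norms[j] == 0 for j in range(len(grad_norms)) if j != i)
```

A module with a batch-mixing bug (e.g. a stray `view` that folds the batch dimension into features) would fail the second assertion immediately.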
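The "generalize a special case" tip, applied to vectorization, might look like this sketch (the function is a made-up example, pairwise squared distances): write the fully loopy version first, then the broadcasted version, and keep the loopy one around as a reference to check against:

```python
import torch

def pairwise_sq_dists_loopy(a, b):
    # Obvious-but-slow reference version, written first.
    n, m = a.shape[0], b.shape[0]
    out = torch.empty(n, m)
    for i in range(n):
        for j in range(m):
            out[i, j] = ((a[i] - b[j]) ** 2).sum()
    return out

def pairwise_sq_dists_vectorized(a, b):
    # Broadcasted version, transformed one loop at a time,
    # verified against the loopy reference below.
    diff = a[:, None, :] - b[None, :, :]  # (n, m, d)
    return (diff ** 2).sum(dim=-1)

torch.manual_seed(0)
a, b = torch.randn(4, 3), torch.randn(5, 3)
assert torch.allclose(pairwise_sq_dists_loopy(a, b),
                      pairwise_sq_dists_vectorized(a, b), atol=1e-5)
```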


