My implementation of "Rethinking Feature Distribution for Loss Functions in Image Classification" using MXNet/Gluon
- In the original paper, the L-GM loss is formulated as $L_{GM} = L_{cls} + \lambda L_{lkd}$, where the regularization term is $L_{lkd} = d_{z_{i}} + \frac{1}{2}\log|\Lambda_{z_{i}}|$. When I implemented it, I found this term pretty hard to optimize, because the loss also drives the variance to be small (much smaller than an identity matrix), so $\frac{1}{2}\log|\Lambda_{z_{i}}|$ decreases to -inf after several iterations and makes the loss NaN. I tried 2 ways to work around this problem:
  - Remove the regularization term and only optimize the classification loss
  - Remove the $\frac{1}{2}\log|\Lambda_{z_{i}}|$ term and keep the rest of the regularization term

  These 2 solutions seem to fix the problem, but since the regularization term is derived from the likelihood, simply removing part of it is not a good solution.
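
To make the divergence of the log-determinant term concrete, here is a minimal framework-free sketch. It assumes a diagonal $\Lambda$ stored as a list of per-dimension variances and takes $d$ to be half the squared Mahalanobis distance. The function name `likelihood_reg` and the sample values are made up for illustration and are not taken from this repo.

```python
import math

def likelihood_reg(x, mean, var):
    """L_lkd = d + 0.5 * log|Lambda| for one sample.

    Assumption: Lambda is diagonal, Lambda = diag(var).
    """
    # d: half of the squared Mahalanobis distance to the class mean
    d = 0.5 * sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    # for a diagonal matrix, log|Lambda| is the sum of the log-variances
    half_log_det = 0.5 * sum(math.log(vi) for vi in var)
    return d + half_log_det

x = [0.3, -0.2]
mean = [0.3, -0.2]  # the sample sits exactly on its class mean, so d = 0

# Shrinking the variance keeps lowering the term, with no lower bound.
values = [likelihood_reg(x, mean, [v, v]) for v in (1.0, 1e-2, 1e-4, 1e-8)]
```

With `d = 0` nothing stops the optimizer from pushing the variance towards zero, which is the behaviour described above.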
- The L-GM-Loss layer has two parameters: `mean` and `var`. You can't use a traditional initializer such as `Xavier` to initialize `var`, because the variance of a distribution is non-negative, and a negative variance will also lead to a NaN loss. In my implementation, I initialize `var` with a constant value of 1.
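
The initialization issue can be illustrated without any framework. The sketch below assumes a Xavier/Glorot-style uniform initializer that samples from a range symmetric around zero, and a diagonal covariance per class. The helper names and the shapes (10 classes, 2-dimensional features) are hypothetical.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot-style uniform init: samples from U(-a, a)."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]

def constant_init(fan_in, fan_out, value=1.0):
    """Every entry starts at the same positive value."""
    return [[value] * fan_in for _ in range(fan_out)]

def log_det_diag(var_row):
    """log|Lambda| for a diagonal covariance; nan if any entry is not positive."""
    if any(v <= 0.0 for v in var_row):
        return float("nan")
    return sum(math.log(v) for v in var_row)

rng = random.Random(0)
var_xavier = xavier_uniform(2, 10, rng)   # hypothetical: 10 classes, 2-d features
var_const = constant_init(2, 10)

has_negative = any(v < 0.0 for row in var_xavier for v in row)
const_log_dets = [log_det_diag(row) for row in var_const]
```

A symmetric initializer puts roughly half of the entries below zero, so the log-determinant is undefined from the first iteration, while a constant positive value starts every class at a valid covariance.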
I plotted the feature distribution from my experiment, but as you can see below, it is quite different from the one in the original paper. I will discuss the difference later.
I use a `CustomOp`, which requires implementing the backward pass by myself. If you have any ideas about that, please tell me :)
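
When the backward pass is written by hand, a finite-difference check is one way to validate it. The sketch below is not the operator from this repo: it only covers the regularization term $d + \frac{1}{2}\log|\Lambda|$ for a single sample, assumes a diagonal $\Lambda$, leaves out the classification loss, and uses hypothetical function names and inputs.

```python
import math

def forward(x, mean, var):
    """d + 0.5 * log|Lambda| for one sample, with Lambda = diag(var) (assumed)."""
    d = 0.5 * sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return d + 0.5 * sum(math.log(vi) for vi in var)

def backward(x, mean, var):
    """Hand-derived gradients of forward() w.r.t. x, mean and var."""
    grad_x = [(xi - mi) / vi for xi, mi, vi in zip(x, mean, var)]
    grad_mean = [-g for g in grad_x]
    grad_var = [-0.5 * (xi - mi) ** 2 / vi ** 2 + 0.5 / vi
                for xi, mi, vi in zip(x, mean, var)]
    return grad_x, grad_mean, grad_var

def numeric_grad(f, values, eps=1e-6):
    """Central finite differences of f with respect to each entry of values."""
    grads = []
    for i in range(len(values)):
        up = list(values)
        down = list(values)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

x, mean, var = [0.5, -1.0], [0.1, 0.2], [1.0, 0.5]
grad_x, grad_mean, grad_var = backward(x, mean, var)
num_mean = numeric_grad(lambda m: forward(x, m, var), mean)
num_var = numeric_grad(lambda v: forward(x, mean, v), var)

# Largest disagreement between the analytic and the numeric gradients.
max_err = max(abs(a - n)
              for a, n in zip(grad_mean + grad_var, num_mean + num_var))
```

If `max_err` stays tiny, the hand-written gradient expressions agree with the forward pass, and the same expressions can then be ported into the operator's backward method.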
Still suffering from the variance problem 😢
The code released by the authors is written in `caffe` and `cuda`; you can find it here.
By adding an `lr_mult` term to the variance (i.e. setting a low learning rate for it), I fixed the problem. Here is the result.
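
As a rough illustration of why a learning-rate multiplier helps, the toy sketch below runs plain gradient descent on a single variance entry, using only the log-determinant term with the sample sitting on its class mean. The function name, the base learning rate of 0.5 and the multiplier of 0.01 are arbitrary assumptions and are not the values used in this repo.

```python
def descend_var(lr, lr_mult, steps=10, var=1.0):
    """Gradient descent on f(var) = 0.5 * log(var), starting from var = 1.

    This is the log-det term for one dimension when d = 0 (toy assumption).
    """
    trajectory = [var]
    for _ in range(steps):
        grad = 0.5 / var                 # derivative of 0.5 * log(var)
        var = var - lr * lr_mult * grad  # effective step size is lr * lr_mult
        trajectory.append(var)
        if var <= 0.0:                   # log(var) is undefined from here on
            break
    return trajectory

full_rate = descend_var(lr=0.5, lr_mult=1.0)    # same rate as other parameters
low_rate = descend_var(lr=0.5, lr_mult=0.01)    # variance updated 100x slower
```

With the full learning rate the variance overshoots below zero within a few steps, which would turn the loss into NaN, while the scaled-down rate keeps it positive over the same number of steps. In this toy setting the smaller step only slows the shrinkage; in real training the distance term $d$ grows as the variance shrinks and pushes back.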