This is a reimplementation of Gao Huang et al. (2016), "Densely Connected Convolutional Networks", for the Lasagne library, done by Jan Schlüter (@f0k).
The DenseNet is an extension of the ResNet: instead of adding a convolutional layer's output to its input, the output is concatenated to the input. This can be seen as creating shortcut connections to all previous layers, instead of bypassing a single layer only:
This is not done across the full network, but within so-called dense blocks: a set of layers that share the same spatial dimensionality. Between these blocks, a projection (1x1 convolution) and a downsampling step (2x2 average pooling) are inserted:
By varying the number of dense blocks, the number of layers per dense block, or the number of feature maps added by each layer in a dense block (the growth rate), the network architecture can be scaled to different sizes.
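To make this concrete, here is a minimal sketch of a dense block and a transition in Lasagne. It is a hedged illustration only: the helper names `dense_block` and `transition` are made up for this example, and it is not the code from `densenet.py` or `densenet_fast.py`.

```python
from lasagne.init import HeNormal
from lasagne.layers import (InputLayer, BatchNormLayer, NonlinearityLayer,
                            Conv2DLayer, ConcatLayer, Pool2DLayer)
from lasagne.nonlinearities import rectify

def dense_block(net, num_layers, growth_rate):
    # Each layer sees the concatenation of all previous feature maps and
    # contributes `growth_rate` new ones.
    for _ in range(num_layers):
        branch = BatchNormLayer(net)
        branch = NonlinearityLayer(branch, rectify)
        branch = Conv2DLayer(branch, num_filters=growth_rate, filter_size=3,
                             pad='same', W=HeNormal(gain='relu'),
                             b=None, nonlinearity=None)
        net = ConcatLayer([net, branch], axis=1)
    return net

def transition(net):
    # Projection (1x1 convolution) followed by 2x2 average pooling.
    net = BatchNormLayer(net)
    net = NonlinearityLayer(net, rectify)
    net = Conv2DLayer(net, num_filters=net.output_shape[1], filter_size=1,
                      W=HeNormal(gain='relu'), b=None, nonlinearity=None)
    return Pool2DLayer(net, pool_size=2, mode='average_inc_pad')

# Example: a 32x32 RGB input, one dense block of 12 layers with growth rate 12.
net = InputLayer((None, 3, 32, 32))
net = Conv2DLayer(net, 16, 3, pad='same', W=HeNormal(gain='relu'),
                  b=None, nonlinearity=None)
net = transition(dense_block(net, num_layers=12, growth_rate=12))
```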
To reproduce the results in the last three rows of Table 1 of the paper for C10 (CIFAR-10 without data augmentation, with 20% dropout) and C10+ (CIFAR-10 with random translation and flipping, without dropout), run:
./train_test.py -L=40 -k=12 --no-augment --dropout=.2 --save-weights=C10_L40k12.npz
./train_test.py -L=40 -k=12 --augment --dropout=0 --save-weights=C10p_L40k12.npz
./train_test.py -L=100 -k=12 --no-augment --dropout=.2 --save-weights=C10_L100k12.npz
./train_test.py -L=100 -k=12 --augment --dropout=0 --save-weights=C10p_L100k12.npz
./train_test.py -L=100 -k=24 --no-augment --dropout=.2 --save-weights=C10_L100k24.npz
./train_test.py -L=100 -k=24 --augment --dropout=0 --save-weights=C10p_L100k24.npz
To keep 5000 training examples for validation and compute the validation loss and error after each epoch, add `--validate` to the commands. To instead constantly monitor loss and error on the test set, add `--validate-test`. To save the loss and error history, add `--save-errors` and a `.npz` file name (the order is: training loss, L2 loss, validation loss, validation error).
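The saved history can then be inspected with NumPy. The file name below is only an example, and the exact array layout inside the `.npz` is not described here, so this sketch just lists what is stored:

```python
import numpy as np

# Hypothetical file name; use whatever was passed to --save-errors.
with np.load('C10_L40k12_errors.npz') as f:
    for name in f.files:
        print(name, f[name].shape)
```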
For the 40-layer network, I got 6.5% test error on C10 and 5.4% on C10+, compared to 7% and 5.24% in the paper. Training history:
The authors' original implementation of the paper is written in Torch. This Lasagne implementation is based on it, with minor differences:
- By default, it uses a faster formulation of the model that splits up batch
  normalization and learned scales/shifts to avoid redundant computation. The
  more traditional formulation is included as well (see `densenet.py` and
  `densenet_fast.py`); the splitting idea is illustrated in a sketch after
  this list.
- Initial weights are sampled from a Gaussian distribution with a standard
  deviation of `sqrt(2/fan_in)` for hidden layers, and `sqrt(1/fan_in)` for
  the output layer, following He et al. The authors' implementation uses
  `sqrt(2/fan_out)` for the hidden layers, and a uniform distribution in the
  range `+- sqrt(1/fan_in)` for the output layer. For reference, the authors'
  scheme is included at the end of `densenet.py`. It does not seem to make a
  difference, at least for the 40-layer networks.
- While the authors meant to use an L2 decay of 1e-4, the implementation in
  Torch amounts to a decay of 5e-5 in most other frameworks (Torch's weight
  decay adds `decay * w` to the gradient, whereas an explicit penalty of
  `decay * sum(w**2)` contributes `2 * decay * w`). I tried both, and 1e-4
  seemed to work better, at least for the 40-layer networks. A sketch of the
  initialization and L2 penalty in Lasagne follows after this list.
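To illustrate the first point, here is a minimal sketch of splitting batch normalization into a parameter-free normalization plus separate learned scale and shift layers, assuming Lasagne's `BatchNormLayer`, `ScaleLayer` and `BiasLayer`; it shows the general idea only, not the exact formulation used in `densenet_fast.py`:

```python
from lasagne.layers import InputLayer, BatchNormLayer, ScaleLayer, BiasLayer

def split_batch_norm(incoming):
    """Batch normalization with the learned scale/shift applied separately.

    Splitting the two parts means the parameter-free normalization of a
    layer's output can be computed once and reused wherever that output is
    concatenated into later layers; only the scale/shift is applied per use.
    """
    net = BatchNormLayer(incoming, beta=None, gamma=None)  # normalize only
    net = ScaleLayer(net)   # learned per-channel scale (gamma)
    return BiasLayer(net)   # learned per-channel shift (beta)

# Example usage on a dummy feature map stack:
net = split_batch_norm(InputLayer((None, 16, 32, 32)))
```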
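And as a hedged sketch of the last two points (not the code from `train_test.py`; the toy network is purely illustrative): in Lasagne, a Gaussian with standard deviation `sqrt(2/fan_in)` corresponds to `HeNormal(gain='relu')` and `sqrt(1/fan_in)` to `HeNormal(gain=1.0)`, and an explicit L2 penalty with coefficient 1e-4 can be added to the training loss like this:

```python
import lasagne
import theano.tensor as T
from lasagne.init import HeNormal
from lasagne.layers import InputLayer, DenseLayer, get_output
from lasagne.regularization import regularize_network_params, l2

# Toy network, just to have weights to initialize and regularize.
net = InputLayer((None, 100))
net = DenseLayer(net, 50, W=HeNormal(gain='relu'))    # std = sqrt(2/fan_in)
net = DenseLayer(net, 10, W=HeNormal(gain=1.0),       # std = sqrt(1/fan_in)
                 nonlinearity=lasagne.nonlinearities.softmax)

# Cross-entropy loss plus an explicit L2 penalty with coefficient 1e-4
# (a Torch weight decay of 1e-4 would correspond to 5e-5 here, since it
# enters the gradient without the factor of 2).
targets = T.ivector('targets')
predictions = get_output(net)
loss = lasagne.objectives.categorical_crossentropy(predictions, targets).mean()
loss = loss + 1e-4 * regularize_network_params(net, l2)
```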
Everything else (architecture, training scheme, data preprocessing and augmentation) should be exactly the same.