Memory Efficient Implementation of DenseNets

The standard (original) implementation of DenseNet with recursive concatenation is very memory inefficient. This can be an obstacle when training DenseNets on high-resolution images (such as for object detection and localization tasks) or on devices with limited memory.

In theory, DenseNet should use memory more efficiently than other networks, because one of its key features is that it encourages feature reuse. The fact that DenseNet is "memory hungry" in practice is simply an artifact of the implementation. In particular, the culprit is the recursive concatenation, which re-allocates memory for all previous outputs at each layer. Consider a dense block with N layers: the first layer's output has N copies in memory, the second layer's output has N-1 copies, and so on, so memory consumption grows quadratically (1+2+...+N = N(N+1)/2) with network depth.
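For illustration, here is a simplified Torch sketch of such a standard densely connected layer (the layer is reduced to a single BN-ReLU-Conv branch for brevity, so this is not the repository's exact code): each layer wraps its computation in an nn.Concat that copies all previous feature maps into a freshly allocated output tensor.

```lua
require 'nn'

-- Simplified standard dense layer: nn.Concat(2) copies the identity branch
-- (all previously produced feature maps) together with the newly computed
-- features into a newly allocated tensor along the channel dimension.
local function denseLayerStandard(nChannels, growthRate)
   local branch = nn.Sequential()
      :add(nn.SpatialBatchNormalization(nChannels))
      :add(nn.ReLU(true))
      :add(nn.SpatialConvolution(nChannels, growthRate, 3, 3, 1, 1, 1, 1))
   return nn.Concat(2)
      :add(nn.Identity())
      :add(branch)
end

-- A dense block of N such layers therefore keeps the first layer's output
-- in N concatenated tensors, the second in N-1, and so on.
local function denseBlock(nChannels, growthRate, N)
   local block = nn.Sequential()
   for i = 1, N do
      block:add(denseLayerStandard(nChannels, growthRate))
      nChannels = nChannels + growthRate
   end
   return block
end
```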

Using optnet (-optMemory 1) or shareGradInput (-optMemory 2), we can significantly reduce the run-time memory footprint of the standard implementation (with recursive concatenation). However, the memory consumption is still a quadratic function of depth.
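As a rough sketch of what the optnet option does (assuming the optnet package is installed; createDenseNet below is a hypothetical stand-in for however the model is constructed), the library inspects a sample forward pass and reuses intermediate buffers whose lifetimes do not overlap:

```lua
local optnet = require 'optnet'

local model = createDenseNet(opt)              -- hypothetical constructor
local sampleInput = torch.randn(1, 3, 32, 32)  -- CIFAR-sized dummy input

-- Analyse the graph of intermediate tensors on a sample input and share
-- buffers that are no longer needed; mode = 'training' keeps the buffers
-- that are still required for the backward pass.
optnet.optimizeMemory(model, sampleInput, {inplace = true, mode = 'training'})
```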

We implement a customized densely connected layer (largely motivated by the Caffe implementation of memory-efficient DenseNet by Tongcheng), which uses shared buffers to store the concatenated outputs and gradients, thus dramatically reducing the memory footprint of DenseNet during training. The mode -optMemory 3 activates shareGradInput and shared output buffers, while the mode -optMemory 4 additionally shares the memory used to store the outputs of the Batch-Normalization layers before each 1x1 convolution. The latter makes the memory consumption linear in network depth, but introduces a training-time overhead because these Batch-Normalization layers must be re-forwarded during the backward pass.
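Conceptually, the shared-buffer trick lets every layer in a dense block write its (re)computed concatenation into views of one pre-allocated storage instead of allocating a new tensor. The snippet below is only an illustrative sketch of that idea in Torch, not the repository's customized layer code:

```lua
-- One storage shared by all layers of a dense block; grown on demand.
local sharedStorage = torch.FloatStorage(1)

-- Return a tensor that views the shared storage. Several modules can point
-- at this memory, because the concatenated features it holds are recomputed
-- (or still valid) whenever a module needs them.
local function sharedBuffer(nElements)
   if sharedStorage:size() < nElements then
      sharedStorage:resize(nElements)
   end
   return torch.FloatTensor(sharedStorage, 1, torch.LongStorage{nElements})
end
```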

In practice, we suggest using the default -optMemory 2, as it does not require the customized layers and its memory consumption is moderate. When GPU memory is really the bottleneck, the customized implementation can be adopted by setting -optMemory to 3 or 4, e.g.,

```
th main.lua -netType densenet -dataset cifar10 -batchSize 64 -nEpochs 300 -depth 100 -growthRate 12 -optMemory 4
```

The following time and memory footprints are benchmarked with a DenseNet-BC (L=100, k=12) on CIFAR-10, on an NVIDIA TitanX GPU:

| optMemory | Memory | Time (s/mini-batch) | Description |
|---|---|---|---|
| 0 | 5453M | 0.153 | Original implementation |
| 1 | 3746M | 0.153 | Original implementation with optnet |
| 2 | 2969M | 0.152 | Original implementation with shareGradInput |
| 3 | 2188M | 0.155 | Customized implementation with shareGradInput and sharePrevOutput |
| 4 | 1655M | 0.175 | Customized implementation with shareGradInput, sharePrevOutput and shareBNOutput |