Area Efficient Compression For Floating-Point Feature Maps in Convolutional Neural Network Accelerators
Abstract—Since convolutional neural networks (CNNs) need massive computing resources, many computing architectures have been proposed to improve throughput and energy efficiency. However, these architectures require heavy data movement between the chip and off-chip memory, which causes high energy consumption in the off-chip memory; feature map (fmap) compression has therefore been studied as a way to reduce this data movement, and its design has become one of the main research topics for energy-efficient CNN accelerators. In this brief, we propose a floating-point (FP) fmap compression scheme for hardware accelerators that comprises a compression algorithm and its hardware design. The scheme is compatible with quantization methods such as trained ternary quantization (TTQ), which quantizes only the weights and thus reduces computation cost with little or no accuracy degradation. In addition to compressing zeros, we also compress the nonzero values in the fmap based on the FP format. The compression algorithm achieves low area overhead and a compression ratio similar to the state-of-the-art on the ILSVRC 2012 dataset.

Index Terms—Compression, convolutional neural networks (CNNs), CNN accelerators, floating-point (FP), area efficient.

I. INTRODUCTION
Convolutional Neural Networks (CNNs) are a branch of deep learning that can learn over large numbers of images to extract high-level features. Since AlexNet [1] won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [2] in 2012, CNNs have become prominent in many research domains, such as object detection [3] and image classification [4]. As CNN research has exploded, accuracy has improved tremendously, even surpassing human capability. As a result, CNNs are used in various applications, including autonomous driving [5] and weather recognition [6].

However, CNN-based solutions often require massive computing resources and data movement within a system. As a result, applications deployed on edge devices suffer from power and computing constraints. For edge devices, the major processing bottleneck is memory access, which severely impacts both throughput and energy efficiency [7].

Researchers have therefore proposed methods to reduce computation and memory accesses, which can be roughly divided into network optimization and hardware acceleration. In network optimization, pruning techniques can find initial weights that make training particularly effective, shrinking fully-connected and convolutional architectures for MNIST and CIFAR10 to 10% to 20% of their original size without accuracy loss [8]. Since training usually uses 32-bit floating-point (FP) as the default precision, the network can be converted to fixed-point or reduced precision by quantization to cut computation and memory access. Quantization to fixed-point or reduced precision usually causes accuracy loss, so an alternative is to quantize only the weights while the feature map (fmap) remains FP32, as in trained ternary quantization (TTQ) [9] and incremental network quantization (INQ) [10]. CUTIE [11] applies TTQ in a hardware design to achieve equal or better accuracy while reducing energy consumption. In [12], high energy efficiency with accuracy degradation of less than 1.3% is achieved by using INQ to replace multiplication operations with shift operations in the hardware design. To achieve high performance in hardware accelerators, highly parallel compute paradigms and reuse dataflows are commonly used. For example, Eyeriss [13] optimizes energy efficiency with a row-stationary dataflow on a spatial architecture to reduce data movement. The methods above lower energy consumption by reducing computation; energy efficiency can be further improved by reducing fmap movement between the chip and off-chip memory.

A further optimization opportunity arises because CNNs employ the Rectified Linear Unit (ReLU). ReLU clamps all negative values to zero, which leads to high sparsity in the fmap. However, zero values occupy the same memory bandwidth as nonzero values.
Therefore, previous work used compressed encodings to exploit sparsity or the correlation of spatially neighboring values [14], [15], [16], [17], [18], [19]. In [14], [15], [16], [17], [18], all nonzero values are retained as raw data, and only the zero values are compressed. Reference [14] proposed zero run-length encoding (Zero-RLE), which records each nonzero value together with the number of consecutive zeros preceding it. Reference [15] presented the zero-free neuron array format (ZFNAF), which encodes fmaps as values and offsets in groups called bricks, where each offset indicates the position of its value within the brick.
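To make this baseline concrete, the following is a minimal sketch of Zero-RLE as described above; the function name, the (run, value) pair representation, and the 4-bit run-length limit are our own illustrative assumptions rather than details taken from [14].

```python
def zero_rle_encode(values, max_run=15):
    """Minimal Zero-RLE sketch: emit (zero_run, nonzero_value) pairs.

    Each nonzero value is recorded with the number of consecutive zeros
    preceding it; a run that saturates the assumed 4-bit field (max_run)
    is flushed as a zero-only symbol (max_run, None).
    """
    encoded, run = [], 0
    for v in values:
        if v == 0:
            run += 1
            if run == max_run:            # run-length field saturated
                encoded.append((max_run, None))
                run = 0
        else:
            encoded.append((run, v))      # zeros preceding this value
            run = 0
    if run:                               # trailing zeros
        encoded.append((run, None))
    return encoded

# Example on a sparse post-ReLU row:
print(zero_rle_encode([0, 0, 3.5, 0, 0, 0, 0, 1.25, 0]))
# -> [(2, 3.5), (4, 1.25), (1, None)]
```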
TABLE I
COMPREHENSIVE COMPARISON OF THE PROPOSED AND STATE-OF-THE-ART METHODS

TABLE II
SYMBOL ENCODING TABLE
TABLE III
COMPARE THE EFFECT OF DIFFERENT STORAGE LENGTHS OF ZEROS ON THE COMPRESSION RATIO

TABLE IV
COMPARE THE EFFECT OF DIFFERENT STORAGE LENGTHS OF DIFFERENCE ON THE COMPRESSION RATIO
IV. EVALUATION

In the experiments, we use the validation set of ILSVRC 2012 as the input images. The CNN models are the pretrained models from PyTorch/torchvision: AlexNet, VGG16, ResNet34, and MobileNetV2. All fmaps are extracted from the ReLU outputs in the forward propagation of these models. For the FP format, we use xEyF to denote a format with x exponent bits and y fraction bits. The FP32 and FP16 formats use 8E23F and 5E10F, respectively. For FP8, MobileNetV2 uses the 2E5F format because the architecture contains ReLU6, which caps the maximum output at 6; the other models use the 3E4F format.
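As a sketch of the xEyF notation, the helper below splits a raw word into sign, exponent, and fraction fields for each of the formats used here; it assumes the usual sign/exponent/fraction bit ordering and is only an illustration, not part of the proposed hardware.

```python
import struct

# (exponent bits, fraction bits) per xEyF format used in the evaluation
FORMATS = {"FP32": (8, 23), "FP16": (5, 10), "FP8-3E4F": (3, 4), "FP8-2E5F": (2, 5)}

def split_fields(word, exp_bits, frac_bits):
    """Split a raw FP word into (sign, exponent, fraction) bit fields."""
    sign = (word >> (exp_bits + frac_bits)) & 0x1
    exponent = (word >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = word & ((1 << frac_bits) - 1)
    return sign, exponent, fraction

# Example: reinterpret a Python float as its FP32 bit pattern and split it.
word = struct.unpack("<I", struct.pack("<f", 3.5))[0]
print(split_fields(word, *FORMATS["FP32"]))  # -> (0, 128, 6291456)
```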
A. Selection of Parameters

Two parameters in our compression algorithm affect its performance: the consecutive-zeros storage length of Zero-RLE and the difference storage length of the Delta Encoder. We therefore took 1000 random images, computed their fmaps in the CNN models, and analyzed the compression performance of these fmaps under different parameter settings.
1) Consecutive-Zeros Storage Length of Zero-RLE: Zero-RLE must set the maximum number of consecutive zeros that can be stored in a single code. This parameter affects the compression performance of Zero-RLE, and the best setting depends on the characteristics of the data. In Table III, we compute the compression performance of different CNN models with different zero storage lengths. The results show that the best zero storage length is 2⁴ for AlexNet and VGG16, 2³ for ResNet34, and 2² for MobileNetV2. Based on these results, we choose 2³ as the zero storage length because its compression performance deviates the least from each model's optimum across all storage lengths.
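The trade-off behind this choice can be reproduced with a short sketch: a small run-length field saturates on long zero runs, while a large field wastes bits on short runs. The bit-cost model below (k bits of run length per symbol plus the raw word for each nonzero value) is our own simplification, not the exact coding of the brief.

```python
def zero_rle_bits(values, k, word_bits=16):
    """Approximate Zero-RLE output size with a k-bit run-length field.

    Assumed cost model: each nonzero value costs k bits of run length
    plus the raw word; a saturated run of 2**k - 1 zeros costs an
    extra k-bit zero-only symbol.
    """
    max_run, bits, run = 2 ** k - 1, 0, 0
    for v in values:
        if v == 0:
            run += 1
            if run == max_run:
                bits += k              # flush a zero-only symbol
                run = 0
        else:
            bits += k + word_bits      # run field + raw nonzero value
            run = 0
    if run:
        bits += k
    return bits

sparse = [0] * 30 + [1.0] + [0] * 5 + [2.0]
for k in (2, 3, 4):                    # maximum runs of 3, 7, and 15 zeros
    print(f"k={k}: {zero_rle_bits(sparse, k)} bits")
```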
2) Difference Storage Length of the Delta Encoder: The difference storage length determines the range of values that can be compressed by the Delta Encoder, which in turn affects the compression performance; we therefore apply different storage lengths to different data formats. In Table IV, we analyze the compression performance of different storage lengths for FP16 and FP32. According to the analytical results, we chose 2² and 3³ for FP16 and FP32, respectively, which showed the best performance in most models. In addition, FP8 uses lengths of 2² and 2¹ in the 3E4F and 2E5F formats, respectively, based on an analysis of the fmap distribution.
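The sketch below illustrates the delta-encoding idea with a d-bit signed difference field; the raw-value escape for out-of-range differences is our own assumption for illustration, since the exact symbol encoding of the brief (Table II) is not reproduced in this excerpt.

```python
def delta_encode(exponents, d):
    """Sketch of delta encoding with a d-bit signed difference field.

    A difference within [-2**(d-1), 2**(d-1) - 1] fits in d bits;
    anything wider falls back to storing the raw value (assumed
    escape mechanism, for illustration only).
    """
    lo, hi = -(2 ** (d - 1)), 2 ** (d - 1) - 1
    out, prev = [], 0                    # assume an initial base of 0
    for e in exponents:
        diff = e - prev
        if lo <= diff <= hi:
            out.append(("delta", diff))  # stored in d bits
        else:
            out.append(("raw", e))       # escape: store the full value
        prev = e
    return out

# Exponents of neighboring activations tend to be close in value,
# so most symbols compress to short deltas.
print(delta_encode([120, 121, 121, 119, 140, 141], d=3))
# -> [('raw', 120), ('delta', 1), ('delta', 0), ('delta', -2), ('raw', 140), ('delta', 1)]
```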
Fig. 6. Comparison of compression ratio of fmaps after ReLU in popular CNN models.

B. Comparison With Previous Works

In Fig. 6, we show the compression ratios of the proposed and previous works on the fmaps of different CNN models in FP8, FP16, and FP32. Compared with Zero-RLE and ZVC, our compression ratio improves by 4% to 21% and 15% to 25%, respectively. Our method outperforms Zero-RLE and ZVC because they compress only the zero values, whereas we compress both the zero and nonzero values. We achieve the same performance as EBPC in most cases because both EBPC and our method compress the nonzero as well as the zero values. The exception is FP8 on MobileNetV2, where our compression ratio is 11% lower than that of EBPC because ReLU6 yields a more concentrated fmap distribution that favors EBPC's method. Moreover, we compress only the sign bit and exponent bits, because the fraction bits in the FP format have almost the same probability of being '1' or '0'.
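A sketch of this split for FP16 (5E10F): only the sign and exponent fields are routed to the compressor, while the near-random fraction bits pass through uncompressed. The stream packing shown here is our own illustration, assuming the standard FP16 bit layout.

```python
def split_fp16_streams(words):
    """Separate FP16 words into a compressible (sign + exponent) stream
    and a pass-through fraction stream.

    The top 6 bits (sign + 5-bit exponent) are correlated across
    neighboring activations and go to the compressor; the 10 fraction
    bits are close to uniformly random and are stored raw.
    """
    sign_exp = [(w >> 10) & 0x3F for w in words]  # top 6 bits
    fraction = [w & 0x3FF for w in words]         # low 10 bits
    return sign_exp, fraction

words = [0x3C00, 0x3E00, 0x4100]  # FP16 bit patterns for 1.0, 1.5, 2.5
print(split_fp16_streams(words))
# -> ([15, 15, 16], [0, 512, 256])
```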
TABLE V
AREA COST COMPARISON OF THE PROPOSED AND STATE-OF-THE-ART USED IN TSMC 130 NM AND UMC 65 NM, RESPECTIVELY

We synthesized the proposed design with the Synopsys Design Compiler in TSMC 130 nm technology. In Table V, we summarize the area and gate count of the proposed method and EBPC for 8-bit, 16-bit, and 32-bit fmaps. The areas of the compressor and the decompressor are almost the same at a given bit width because their architectures are similar; the only difference is the reversed operation. Scaling the compressor and decompressor from 16-bit to 32-bit increases their areas by 72% and 65%, respectively. Note that increasing the bit width increases only the bits used for data storage and computation; the number of cycles required for computation remains the same. Compared with EBPC, the area cost of the compressor is reduced by 62.6% to 65.8%, and that of the decompressor by 65.9% to 72.9%. This result comes from the fact that EBPC must accumulate multiple data words before it can start compressing, while our method does not.
V. CONCLUSION

This brief proposed an area-efficient FP fmap compression method for CNN accelerators. We exploit two properties of CNN fmaps, sparsity and the concentration of the value distribution, by combining two compression methods that each target one of these characteristics. The proposed method reduces the fmap transfer between the chip and off-chip memory and, in turn, the power consumption of the off-chip memory. Compared with the state-of-the-art [19], this brief achieves similar compression performance while reducing the gate count of the 16-bit compressor and decompressor by 63.4% and 68.5%, respectively.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., vol. 25, 2012, pp. 1–19.
[2] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 779–788.
[4] S. S. Yadav and S. M. Jadhav, "Deep convolutional neural network based medical image classification for disease diagnosis," J. Big Data, vol. 6, no. 1, pp. 1–18, 2019.
[5] M. Al-Qizwini, I. Barjasteh, H. Al-Qassab, and H. Radha, "Deep learning algorithm for autonomous driving using GoogLeNet," in Proc. IEEE Intell. Veh. Symp., 2017, pp. 89–96.
[6] B. Zhao, X. Li, X. Lu, and Z. Wang, "A CNN–RNN architecture for multi-label weather recognition," Neurocomputing, vol. 322, pp. 47–57, Dec. 2018.
[7] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[8] J. Frankle and M. Carbin, "The lottery ticket hypothesis: Finding sparse, trainable neural networks," 2018, arXiv:1803.03635.
[9] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–10.
[10] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental network quantization: Towards lossless CNNs with low-precision weights," in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–14.
[11] M. Scherer, G. Rutishauser, L. Cavigelli, and L. Benini, "CUTIE: Beyond petaop/s/W ternary DNN inference acceleration with better-than-binary energy efficiency," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 4, pp. 1020–1033, Apr. 2021.
[12] C. F. B. Fong, J. Mu, and W. Zhang, "A cost-effective CNN accelerator design with configurable PU on FPGA," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2019, pp. 31–36.
[13] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[14] A. Parashar et al., "SCNN: An accelerator for compressed-sparse convolutional neural networks," ACM SIGARCH Comput. Archit. News, vol. 45, pp. 27–40, Jun. 2017.
[15] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," ACM SIGARCH Comput. Archit. News, vol. 44, pp. 1–13, Jun. 2016.
[16] A. Aimar et al., "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644–656, Mar. 2019.
[17] J.-S. Park et al., "9.5 A 6K-MAC feature-map-sparsity-aware neural processing unit in 5nm flagship mobile SoC," in Proc. IEEE Int. Solid-State Circuits Conf., vol. 64, 2021, pp. 152–154.
[18] J. Kwon, J. Kong, and A. Munir, "Sparse convolutional neural network acceleration with lossless input feature map compression for resource-constrained systems," IET Comput. Digit. Techn., vol. 16, no. 1, pp. 29–43, 2022.
[19] L. Cavigelli, G. Rutishauser, and L. Benini, "EBPC: Extended bit-plane compression for deep neural network inference and training accelerators," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 4, pp. 723–734, Dec. 2019.
[20] G. Pekhimenko, V. Seshadri, O. Mutlu, M. A. Kozuch, P. B. Gibbons, and T. C. Mowry, "Base-delta-immediate compression: Practical data compression for on-chip caches," in Proc. 21st Int. Conf. Parallel Archit. Compilation Technol. (PACT), 2012, pp. 377–388.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[22] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.