
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 70, NO. 2, FEBRUARY 2023

Area Efficient Compression for Floating-Point Feature Maps in Convolutional Neural Network Accelerators

Bai-Kui Yan and Shanq-Jang Ruan, Senior Member, IEEE

Abstract—Since convolutional neural networks (CNNs) require massive computing resources, many computing architectures have been proposed to improve throughput and energy efficiency. However, these architectures require heavy data movement between the chip and off-chip memory, which causes high energy consumption in the off-chip memory; feature map (fmap) compression has therefore been discussed as a way to reduce this data movement, and the design of fmap compression has become one of the main research topics for improving the off-chip energy efficiency of CNN accelerators. In this brief, we propose a floating-point (FP) fmap compression scheme for hardware accelerators, comprising a compression algorithm and its hardware design. It can be combined with quantization methods such as trained ternary quantization (TTQ), which quantize only the weights with little or no accuracy degradation and reduce the computation cost. In addition to zero compression, we also compress the nonzero values in the fmap based on the FP format. The compression algorithm achieves low area overhead and a compression ratio similar to the state-of-the-art on the ILSVRC 2012 dataset.

Index Terms—Compression, convolutional neural networks (CNNs), CNN accelerators, floating-point (FP), area efficient.

Manuscript received 19 May 2022; revised 22 July 2022 and 12 September 2022; accepted 8 October 2022. Date of publication 12 October 2022; date of current version 9 February 2023. This work was supported by the Ministry of Science and Technology, Republic of China through Project "Low Power Deep Learning FPGA System for Object Recognition" under Grant MOST 110-2622-E-011-025. This brief was recommended by Associate Editor A. Calimera. (Corresponding author: Shanq-Jang Ruan.) The authors are with the Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taipei 10607, Taiwan (e-mail: sjruan@mail.ntust.edu.tw). Digital Object Identifier 10.1109/TCSII.2022.3213847. 1549-7747 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

Convolutional neural networks (CNNs) are a branch of deep learning that can learn over a large number of images to acquire the ability to extract high-level features. Since AlexNet [1] won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [2] in 2012, CNNs have become prominent in many research domains such as object detection [3] and image classification [4]. As CNN research has exploded, accuracy has improved tremendously, even surpassing human capability. As a result, CNNs have been used in various applications, including autonomous driving [5] and weather recognition [6].

However, CNN-based solutions often require massive computing resources and data movement in a system. As a result, applications deployed on edge devices suffer from power and computing constraints. For edge devices, the major processing bottleneck is memory access, which severely impacts both throughput and energy efficiency [7]. Therefore, researchers have proposed different methods to reduce computation and memory accesses, which can be roughly divided into network optimization and hardware acceleration. In network optimization, pruning techniques can find initial weights that make training particularly effective and reduce fully-connected and convolutional architectures for MNIST and CIFAR10 to 10% to 20% of their size without accuracy drop [8]. As the training step usually uses 32-bit floating point (FP) as the default precision, the network can be converted to fixed-point or reduced precision by quantization to reduce calculation and memory access. Quantization to fixed-point or reduced precision usually causes accuracy loss, so there is an alternative style of quantization in which only the weights are quantized and the feature map (fmap) remains FP32, such as trained ternary quantization (TTQ) [9] and incremental network quantization (INQ) [10]. CUTIE [11] applies TTQ to a hardware design to achieve greater or equal accuracy while reducing energy consumption. As proposed in [12], energy efficiency with accuracy degradation of less than 1.3% can be achieved by using INQ to replace multiplication operations with shift operations in hardware. To achieve high performance in hardware accelerators, highly parallel compute paradigms and reuse dataflows are commonly used. For example, Eyeriss [13] optimizes energy efficiency by using a row-stationary dataflow on a spatial architecture to reduce data movement. The above methods focus on reducing energy consumption through lower computation. Energy efficiency can be further improved by reducing fmap movement between the chip and off-chip memory.

A further optimization opportunity is offered by the fact that CNNs employ the rectified linear unit (ReLU). The ReLU clamps all negative values to zero, which leads to high sparsity in the fmap. However, a zero value occupies the same memory bandwidth as a nonzero value. Therefore, previous work used compressed encodings to exploit sparsity or the correlation of spatially neighboring values [14], [15], [16], [17], [18], [19]. In [14], [15], [16], [17], [18], all nonzero values are retained as raw data, while zero values are stored using compression methods. Reference [14] proposed zero run-length encoding (Zero-RLE), which records each nonzero value preceded by the number of consecutive zero values. Reference [15] presented the zero-free neuron array format (ZFNAF), which encodes fmaps as values and offsets in groups called bricks. Each offset indicates
the position of the nonzero value. References [16] and [17] proposed zero-value compression (ZVC) and feature-map lossless compressors (FLC), which use a sparsity map and a quadtree, respectively, to record whether the fmap value at the corresponding location is zero or nonzero. Reference [18] proposed a two-step compression that uses indices to record the locations of nonzero values and their amount in a count table. However, EBPC [19] compresses nonzero values in addition to zero values and can achieve higher compression ratios than methods that compress only zeros.

In this brief, we propose a new hardware-friendly FP fmap compression algorithm, including its hardware design, which uses Zero-RLE to compress zero values and a modified delta encoding [20] to compress nonzero values. Our approach helps to compress fmaps for methods like TTQ and INQ at the inference stage, reducing the memory access bottleneck while maintaining accuracy. Table I shows a comprehensive comparison of the proposed and state-of-the-art compression algorithms. Although we support fewer formats than the other algorithms, we compress both zero and nonzero values; therefore, we achieve a higher compression ratio than the methods that compress only zero values. Compared with EBPC, we use the same method for compressing zero values, but we replace the BPC of EBPC with a modified delta encoding for nonzero values. As BPC uses three compression methods and relies on the correlation between data, more resources are required to complete the compression. In contrast, we use only one compression method for nonzero values and do not rely on correlation between data, thus using fewer resources and calculations. We evaluate the resulting compression algorithm on FP16 fmaps across all the layers of AlexNet, VGG16, ResNet34 [21], and MobileNetV2 [22], achieving average compression ratios of 3.5x, 2.53x, 1.93x, and 1.65x, respectively, while reducing the gate count over the state-of-the-art by 66.3% and 70.7% on 16-bit and 32-bit architectures, respectively.

TABLE I
COMPREHENSIVE COMPARISON OF THE PROPOSED AND STATE-OF-THE-ART METHODS

The rest of this brief is organized as follows: relevant compression algorithms for the proposed method are introduced in Section II, while the proposed compression algorithm and hardware architecture design are presented in Section III. The evaluation is described in Section IV. Finally, Section V concludes this brief.

II. BACKGROUND

A. Zero-RLE

RLE is a form of lossless data compression that is efficient on sequences in which the same value occurs many consecutive times. It encodes a run of identical data as a single data value followed by a count representing how many times that value repeats. Zero-RLE uses RLE to exploit the sparsity of fmaps: since Zero-RLE compresses only the zero values, it encodes the fmap as nonzero values interleaved with zero counts.

B. Delta Encoding

Pekhimenko et al. [20] proposed a delta encoding method that records the first actual value and the differences between consecutive values. The decoder uses the first actual value and the differences to reconstruct each actual value. Applied to sequences of similar values, this compression generates small differences, so the storage size of a difference can be smaller than that of a single actual value. In delta encoding, the storage size of the difference must be defined, which limits the maximum difference that can be compressed. When a difference exceeds this storage size, the value is not compressed and the actual value is saved in the coding. Therefore, each coding needs a single prefix code followed by a difference or an actual value; the prefix is used to recognize the type of the following data and to read the corresponding data size.

III. PROPOSED METHOD

A. Compression Algorithm

Our compression method is based on lossless compression, which reduces the amount of data without sacrificing its correctness. The compression algorithm consists of Zero-RLE, a Delta Encoder, and a Packer, as shown in Fig. 1. We divide fmaps into zero and nonzero values and compress them with different algorithms; the two resulting codings are packaged into a single stream by the Packer.

Fig. 1. Top view of the proposed compression algorithm scheme.

For zero values, Zero-RLE is used for compression; for nonzero values, delta encoding is used for compression on the FP format. An FP value consists of a sign bit, exponent bits, and fraction bits. As the activation function is the ReLU, there are only positive values in fmaps; the sign bit is therefore always zero and does not need to be stored. Fraction bits are stored as actual values because the probability of '1' and '0' is about the same. Exponent bits are compressed using delta encoding, since the difference between exponent bits is small when the actual values are similar. The similarity of the values in an fmap lies not only in their sequential order; almost all values are distributed within a small range. In this case, we can change the comparison with the previous value in the delta encoding to a comparison with a fixed value.
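The field split described above can be illustrated in Python. This is our sketch, not the authors' hardware: `fp16_fields` is a hypothetical helper name, and the real design operates on bit vectors in RTL rather than on Python floats.

```python
import struct

def fp16_fields(x: float):
    """Split an FP16 value into its (sign, exponent, fraction) bit fields."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))  # raw FP16 bits
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits, biased by 15
    fraction = bits & 0x3FF          # 10 fraction bits
    return sign, exponent, fraction

# After ReLU the sign bit is always 0, so only the exponent is
# delta-encoded (against a fixed reference) and the fraction is kept raw.
sign, exp, frac = fp16_fields(1.5)
assert sign == 0
assert exp == 15       # 1.5 = 1.1b * 2^0 -> biased exponent 01111b = 15
assert frac == 0x200   # fraction bits 1000000000
```

Note that for 1.5 the biased exponent equals the FP16 exponent bias itself, which is exactly the case the fixed-value comparison introduced below exploits.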
The benefits of using a fixed value as the object of comparison are as follows: 1) The dataflow affects the input order of the fmaps, resulting in different compression ratios when the comparison is with the previous value; the compression ratio is not affected by different dataflows when a fixed value is used for comparison. 2) It prevents the registers from being updated with every input value, reducing the registers' dynamic power consumption due to clearing or writing operations.

The codings in our compression algorithm are variable-length codes stored as a bitstream. They can be roughly divided into three types: consecutive-zero counts, compressed nonzero values, and uncompressed nonzero values. We therefore add a prefix code to each coding to indicate the corresponding symbol. To choose the prefix codes, we analyzed the number of each symbol in the FP16 fmaps of ResNet34, as shown in Fig. 2; similar statistics are obtained from other neural networks. Based on this analysis, we use a 1-bit prefix code for compressed nonzero values and 2-bit prefix codes for the others. The code format and prefix code of each symbol are shown in Table II. The consecutive-zeros coding can represent up to m consecutive zeros at a time. The coding of a compressed nonzero value is composed of a difference of Nd bits and a fraction of Nf bits; the coding of an uncompressed nonzero value consists of an exponent of Ne bits and a fraction of Nf bits. The decompressor can decode the bitstream back to the original values using the prefix codes.

Fig. 2. The statistic of the symbol count.

TABLE II
SYMBOL ENCODING TABLE

B. Hardware Architecture

The proposed architecture is divided into a compressor and a decompressor. Fmaps are compressed by the compressor and stored in off-chip memory, and then decompressed by the decompressor for accelerator computation. This subsection introduces the architectures of the compressor and the decompressor, respectively.

1) Compressor: The compressor is mainly composed of the Zero-RLE, Delta Encoder, and Packer, as shown in Fig. 3. The fmap is first encoded by the Zero-RLE and Delta Encoder. The Zero-RLE has a counter that counts the number of consecutive zeros; it outputs a prefix code followed by the zero count when the counter is nonzero and the input is nonzero, or when the counter is full. The Delta Encoder is responsible for compressing the nonzero values. The sign bit is not stored in the coding because fmaps are always positive. The coding of the exponent bits is chosen by calculating the difference and outputting either the difference or the original exponent bits. The final output of the Delta Encoder is the encoded exponent bits followed by the input fraction bits. After the fmap is compressed, a multiplexer selects the coding of the Zero-RLE or Delta Encoder to pass to the Packer. The Packer accumulates the input variable-length codes and outputs a fixed length whenever its buffer reaches a sufficient length.

Fig. 3. Compressor architecture based on FP16 without control signals.

For the fixed comparison value in the Delta Encoder, we count the fmap values of ResNet34, as shown in Fig. 4. The distribution of fmap values is mostly concentrated within 10^-3 to 10, so we can calculate the optimal fixed value for the Delta Encoder from this analysis. Since Nd must be smaller than Ne in order to compress fmaps, only the cases where the most significant bit (MSB) is fixed at zero or one can be compressed. According to our analysis of the fmaps, the compression ratio is maximized when the MSB is zero. As the exponent is stored as an unsigned value, it can be converted to an exponent with a signed range by subtracting the exponent bias. The exponent bias is a fixed value whose MSB is zero, which exactly fits our needs. Therefore, we use the FP exponent bias as the fixed value in the Delta Encoder, which allows values smaller than two to be compressed. We can adjust Nd to set the lower bound of compression. For instance, the exponent bias of FP16 is 01111 in binary; assuming Nd is set to 2, the compressible range of fmap values is 0.125 to 1.99.

Fig. 4. The distribution of nonzero values inferred on ResNet34, taking 50 random images from the ILSVRC 2012 validation dataset as input.

2) Decompressor: The components of the decompressor are similar to those of the compressor, consisting of three main parts: the Unpacker, Zero-RLE Decoder, and Delta Decoder, as shown in Fig. 5. The input bitstream is accumulated by the Unpacker, and the symbol decoder decodes the corresponding coding symbol based on the prefix code. According to the coding symbol, the compressed data are decoded by the Zero-RLE Decoder or the Delta Decoder. The Zero-RLE Decoder has a counter that counts down the number of consecutive zeros and controls the output of the decoder to zero or to the Delta Decoder.
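Putting the two subsections together, the compressor/decompressor pair can be modeled behaviorally in a few lines of Python. This is our reconstruction for illustration only: the specific prefix codes ('0' for a compressed nonzero, '10' for a zero run, '11' for an uncompressed nonzero) and the 3-bit run-count field stand in for Table II, which the brief does not spell out bit-for-bit.

```python
import struct

# Assumed parameters for FP16 (field widths per the brief; prefix codes
# and the run-count width are our guesses at Table II).
ND, NE, NF = 2, 5, 10      # difference, exponent, and fraction bits
BIAS = 0b01111             # FP16 exponent bias (15), the fixed reference
ZRUN_BITS = 3              # zero-run count field -> runs of up to 2**3

def compress(fmap):
    """Zero-RLE for zeros + fixed-bias exponent delta for nonzeros."""
    out, run = [], 0
    def flush():
        nonlocal run
        if run:  # count field stores run-1, so 3 bits cover runs 1..8
            out.append("10" + format(run - 1, f"0{ZRUN_BITS}b"))
            run = 0
    for x in fmap:
        if x == 0.0:
            run += 1
            if run == 2 ** ZRUN_BITS:
                flush()
            continue
        flush()
        (b,) = struct.unpack("<H", struct.pack("<e", x))
        exp, frac = (b >> NF) & ((1 << NE) - 1), b & ((1 << NF) - 1)
        diff = BIAS - exp
        if 0 <= diff < 2 ** ND:   # within the compressible range
            out.append("0" + format(diff, f"0{ND}b") + format(frac, f"0{NF}b"))
        else:                     # store the exponent uncompressed
            out.append("11" + format(exp, f"0{NE}b") + format(frac, f"0{NF}b"))
    flush()
    return "".join(out)

def decompress(stream):
    vals, i = [], 0
    while i < len(stream):
        if stream[i] == "0":                 # compressed nonzero
            diff = int(stream[i+1:i+1+ND], 2); i += 1 + ND
            exp = BIAS - diff
        elif stream[i:i+2] == "10":          # zero run
            run = int(stream[i+2:i+2+ZRUN_BITS], 2) + 1; i += 2 + ZRUN_BITS
            vals += [0.0] * run
            continue
        else:                                # uncompressed nonzero
            exp = int(stream[i+2:i+2+NE], 2); i += 2 + NE
        frac = int(stream[i:i+NF], 2); i += NF
        (v,) = struct.unpack("<e", struct.pack("<H", (exp << NF) | frac))
        vals.append(v)
    return vals

fmap = [0.0, 0.0, 1.5, 0.25, 0.0, 6.0]
assert decompress(compress(fmap)) == fmap   # lossless round trip
```

In this sketch 1.5 and 0.25 fall inside the 0.125-1.99 compressible range (13-bit codings), while 6.0 has a biased exponent above the bias and is emitted uncompressed (17 bits), matching the behavior described for Nd = 2.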
Fig. 5. Decompressor architecture based on FP16 without control signals.

IV. EVALUATION

In the experiments, we use the validation set of ILSVRC 2012 as the input images. The CNN models used are the pretrained models from PyTorch/torchvision: AlexNet, VGG16, ResNet34, and MobileNetV2. All fmaps are extracted from the output of ReLU in the forward propagation of these models. For the FP format, we use xEyF to denote a format with x exponent bits and y fraction bits. The FP32 and FP16 formats use 8E23F and 5E10F, respectively. FP8 uses the 2E5F format in MobileNetV2 because that architecture contains ReLU6, which caps the maximum output at 6; the other models use the 3E4F format in FP8.

A. Selection of Parameters

There are two variable parameters in our compression algorithm that affect performance: the consecutive-zeros storage length of the Zero-RLE and the difference storage length of the Delta Encoder. We took 1000 random images, calculated the fmaps of the CNN models, and analyzed the compression performance with different parameters on these fmaps.

1) Consecutive-Zeros Storage Length of Zero-RLE: Zero-RLE needs to set the maximum number of consecutive zeros that can be stored in a single coding. This parameter affects the compression performance of Zero-RLE, and the best setting depends on the characteristics of the data. In Table III, we compute the compression performance of different CNN models with different zero storage lengths. The results show that the best zero storage lengths of AlexNet and VGG16 are 2^4, while those of ResNet34 and MobileNetV2 are 2^3 and 2^2, respectively. Based on these results, we choose 2^3 as the zero storage length because its compression performance deviates least from each model's optimum among all storage lengths.

TABLE III
COMPARE THE EFFECT OF DIFFERENT STORAGE LENGTHS OF ZEROS ON THE COMPRESSION RATIO

2) Difference Storage Length of the Delta Encoder: The selection of the difference storage length affects the range of values that can be compressed by the Delta Encoder, which in turn affects the compression performance. We therefore apply different storage lengths to different data formats. In Table IV, we analyze the compression performance of different storage lengths for FP16 and FP32. According to the analytical results, we chose 2^2 and 2^3 for FP16 and FP32, respectively, which showed the best performance in most models. In addition, FP8 uses lengths of 2^2 and 2^1 in the 3E4F and 2E5F formats, respectively, depending on the analysis of the fmap distribution.

TABLE IV
COMPARE THE EFFECT OF DIFFERENT STORAGE LENGTHS OF DIFFERENCE ON THE COMPRESSION RATIO

B. Comparison With Previous Works

Fig. 6. Comparison of compression ratios of fmaps after ReLU in popular CNN models.

In Fig. 6, we illustrate the compression ratios of the proposed and previous works on the fmaps of different CNN models in FP8, FP16, and FP32. Compared with Zero-RLE and ZVC, our compression ratio improves by 4% to 21% and 15% to 25%, respectively. Our method outperforms Zero-RLE and ZVC because they compress only the zero values, while we compress both zero and nonzero values. We achieve nearly the same performance as EBPC in most cases, because both EBPC and our method compress the nonzero and zero values. The exception is the FP8 fmap of MobileNetV2, where our compression ratio is 11% lower than that of EBPC, because ReLU6 yields a more concentrated fmap distribution that favors EBPC's method. Moreover, we compress only the sign bit and exponent bits, because the fraction bits in the FP format have almost the same probability of '1' and '0'.
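As a rough sanity check on ratios of this magnitude, the following toy bit-cost model (ours, not the paper's; the 2-bit prefixes and 3-bit count field are assumptions, and nonzeros are charged their full raw width) estimates what a Zero-RLE-style coding alone achieves on a synthetic fmap with roughly 70% sparsity, typical of post-ReLU fmaps:

```python
import random

def zero_rle_bits(fmap, raw_bits=16, cnt_bits=3, prefix_nz=2, prefix_z=2):
    """Bits used by a Zero-RLE-style coding: nonzeros stored raw behind a
    prefix; runs of zeros collapsed into prefix + count (toy cost model)."""
    total, run = 0, 0
    for x in fmap:
        if x == 0:
            run += 1
            if run == 2 ** cnt_bits:      # count field full: emit one coding
                total += prefix_z + cnt_bits
                run = 0
        else:
            if run:                       # close the pending zero run
                total += prefix_z + cnt_bits
                run = 0
            total += prefix_nz + raw_bits
    if run:
        total += prefix_z + cnt_bits
    return total

random.seed(0)
fmap = [0.0 if random.random() < 0.7 else 1.0 for _ in range(10000)]
ratio = len(fmap) * 16 / zero_rle_bits(fmap)
print(f"Zero-RLE-only compression ratio ~ {ratio:.2f}x")
```

At this sparsity the model lands in the low single digits, which is why compressing the nonzero values as well, as the proposed method and EBPC do, is what separates them from the zero-only schemes in Fig. 6.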
TABLE V
AREA COST COMPARISON OF THE PROPOSED AND STATE-OF-THE-ART USED IN TSMC 130 NM AND UMC 65 NM, RESPECTIVELY

We have synthesized the proposed design using the Synopsys Design Compiler and TSMC 130 nm technology. In Table V, we summarize the area and gate count of the 8-bit, 16-bit, and 32-bit fmaps for the proposed method and EBPC. The areas of the compressor and the decompressor are almost the same for the same bit width because their architectures are similar; the only difference is the reverse operation. Upgrading the compressor and decompressor from 16-bit to 32-bit increases the area by 72% and 65%, respectively. Note that only the bits for data storage and computation increase with the number of bits; the cycles required for computation remain the same. Compared with EBPC, the area costs of the compressor and decompressor are reduced by 62.6% to 65.8% and 65.9% to 72.9%, respectively. This result comes from the fact that EBPC needs to accumulate multiple data before it starts to compress, while ours does not.

V. CONCLUSION

This brief proposed an area-efficient FP fmap compression method for CNN accelerators. We exploit the sparsity and concentrated value distribution of CNN fmaps by combining two compression methods, each targeting one of these characteristics. The proposed method reduces not only the data transfer between the chip and off-chip memory but also the power consumption of the off-chip memory. Compared to the state-of-the-art [19], this brief achieves similar compression performance while reducing the gate count of the 16-bit compressor and decompressor by 63.4% and 68.5%, respectively.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., vol. 25, 2012, pp. 1–19.
[2] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 779–788.
[4] S. S. Yadav and S. M. Jadhav, "Deep convolutional neural network based medical image classification for disease diagnosis," J. Big Data, vol. 6, no. 1, pp. 1–18, 2019.
[5] M. Al-Qizwini, I. Barjasteh, H. Al-Qassab, and H. Radha, "Deep learning algorithm for autonomous driving using GoogLeNet," in Proc. IEEE Intell. Veh. Symp., 2017, pp. 89–96.
[6] B. Zhao, X. Li, X. Lu, and Z. Wang, "A CNN–RNN architecture for multi-label weather recognition," Neurocomputing, vol. 322, pp. 47–57, Dec. 2018.
[7] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[8] J. Frankle and M. Carbin, "The lottery ticket hypothesis: Finding sparse, trainable neural networks," 2018, arXiv:1803.03635.
[9] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–10.
[10] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental network quantization: Towards lossless CNNs with low-precision weights," in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–14.
[11] M. Scherer, G. Rutishauser, L. Cavigelli, and L. Benini, "CUTIE: Beyond petaop/s/W ternary DNN inference acceleration with better-than-binary energy efficiency," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 4, pp. 1020–1033, Apr. 2021.
[12] C. F. B. Fong, J. Mu, and W. Zhang, "A cost-effective CNN accelerator design with configurable PU on FPGA," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2019, pp. 31–36.
[13] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[14] A. Parashar et al., "SCNN: An accelerator for compressed-sparse convolutional neural networks," ACM SIGARCH Comput. Archit. News, vol. 45, pp. 27–40, Jun. 2017.
[15] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," ACM SIGARCH Comput. Archit. News, vol. 44, pp. 1–13, Jun. 2016.
[16] A. Aimar et al., "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644–656, Mar. 2019.
[17] J.-S. Park et al., "9.5 A 6K-MAC feature-map-sparsity-aware neural processing unit in 5nm flagship mobile SoC," in Proc. IEEE Int. Solid-State Circuits Conf., vol. 64, 2021, pp. 152–154.
[18] J. Kwon, J. Kong, and A. Munir, "Sparse convolutional neural network acceleration with lossless input feature map compression for resource-constrained systems," IET Comput. Digit. Techn., vol. 16, no. 1, pp. 29–43, 2022.
[19] L. Cavigelli, G. Rutishauser, and L. Benini, "EBPC: Extended bit-plane compression for deep neural network inference and training accelerators," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 4, pp. 723–734, Dec. 2019.
[20] G. Pekhimenko, V. Seshadri, O. Mutlu, M. A. Kozuch, P. B. Gibbons, and T. C. Mowry, "Base-delta-immediate compression: Practical data compression for on-chip caches," in Proc. 21st Int. Conf. Parallel Archit. Compilation Technol. (PACT), 2012, pp. 377–388.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[22] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
