
Accelerating Inference for High-Resolution Images with Quantization and Distributed Deep Learning


Abstract—High-resolution images are being used in various applications, including medical imaging, satellite imagery, and surveillance. Due to the evolution of Deep Learning (DL) and its widespread usage, it has also become a prominent choice for high-resolution image applications. However, large image sizes and denser convolutional neural networks pose limitations on computation and memory requirements. To overcome these challenges, several studies have discussed efficient approaches to accelerate training, but the inference of high-resolution images with deep learning and quantization techniques remains unexplored. In this paper, we propose accelerated and memory-efficient inference techniques leveraging different parallelism for Distributed DL to enable inference for high-resolution images on out-of-core models, such as ResNet and AmoebaNet. Furthermore, we utilize quantization techniques for model gradients and communication buffers to reduce the memory and computation requirements while maintaining accuracy.

I. INTRODUCTION

High-resolution images have a vast range of applications, including in sectors such as medical imaging, satellite imagery, and surveillance. Typically, images used in these applications range in gigapixels, with dimensions of 100,000x100,000 pixels and even above. For instance, the digital pathology dataset CAMELYON16 consists of whole-slide images (WSI) with an approximate resolution of 100,000x200,000 pixels at its maximum 40x magnification.

With the evolution of Deep Learning (DL) and its proven efficiency in various sectors, it has also become a prominent choice for high-resolution image applications to solve problems such as image classification and segmentation. A few popular DL model choices for such applications are ResNet [1], U-Net [2], and AmoebaNet [3], which consist of deep convolutional layers. However, considering the large size of the image and the many convolution layers, this poses challenges due to memory and computation limitations, as the workload cannot be accommodated in a single GPU's memory.

Several studies [4] [5] [6] have adopted a patch-based approach, where each whole-slide image (WSI) is split into small patches with image sizes such as 256x256. This approach further requires a pixel-wise annotation or classification mechanism to assign each patch to a well-suited class. However, the use of deep convolutional neural networks restricts the patch size due to memory limitations. For example, an image size of 8192x8192 with the ResNet101 model and a batch size of one becomes an out-of-core model on an NVIDIA A100 GPU. To facilitate scaled image sizes and improve performance, Hy-Fi [7] and GEMS [8] have made significant contributions by enabling training using Spatial Parallelism for image sizes up to 16,384x16,384. They further improve performance by integrating different parallelism techniques. While most studies have primarily focused on efficient deep learning training approaches for high-resolution images, optimizing inference in the context of high-resolution images remains unexplored.

In this paper, we propose and evaluate a quantization approach to accelerate Deep Learning inference for high-resolution images with lower memory and computation requirements while maintaining accuracy. Quantization is a technique where model parameters are converted to low precision, such as 16-bit floating point or 8-bit integer, from 32-bit floating point. This results in reduced memory utilization and latency, and its proven efficiency for DL inference has been evaluated in recent surveys [9] [10]. We leverage the benefits of quantization to accelerate high-resolution image inference for deep learning models. Furthermore, to enable scaled image inference, further enhance acceleration, and harness the memory- and compute-efficiency benefits of different parallelism, we introduce quantization support for Spatial, Layer, and Pipeline Parallelism in Distributed DL.

A. Motivation

While research in high-resolution images with DL remains essential due to its applicability, several studies have been conducted for efficient training, whereas very few have delved into inference for high-resolution images. The studies focusing on inference with high-resolution images primarily involve a single processing unit and are limited to small-scale images. The exploration of inference with quantization in the context of high-resolution images in Deep Learning, and of Distributed DL for scaled images, is yet to be pursued.

Furthermore, we evaluated the quantization effects on latency and memory footprint for an image size of 256x256 using ResNet101. Table I provides the memory and speedup evaluation by comparing quantization with baseline full precision (FP32), half precision (FP16, BFLOAT16), and integer-only precision (INT8). Results show a significant reduction in latency and memory utilization, with the best performance observed for INT8, reducing memory requirements by 5.23x while improving speedup by 5.53x.

Precision | Memory Utilization (GB) | Memory Reduction (x) | Throughput (Img/Sec) | Speedup (x)
FP32      | 12.48 | Baseline | 145.22 | Baseline
FP16      | 7.64  | 1.63     | 224.85 | 1.55
BFLOAT16  | 5.28  | 2.36     | 226.12 | 1.56
INT8      | 2.39  | 5.23     | 903.35 | 6.22

TABLE I: Memory and throughput evaluation with different precision quantization using ResNet101 and a 256x256 image size on an NVIDIA A100

Further, as ResNet101 cannot scale beyond a 2048×2048 or 4096×4096 image size on a single GPU, to support larger images and slide-level inference we studied the different parallelism schemes implemented in Hy-Fi and GEMS to enable image sizes such as 8192×8192 and 16384×16384 and to support quantization.

Consider the real-world application of digital pathology images, where inference for one whole-slide image (WSI) involves an average of 500 patches, each of size 256x256. The inference time on a CPU [11] [12] can take several minutes, while on a GPU it reduces to seconds. Utilizing GPU-enabled quantization further minimizes this time to just 1-2 seconds.

B. Proposed Solution

We propose GPU-enabled accelerated and memory-efficient inference for high-resolution images with single-GPU DL as well as Distributed DL, leveraging post-training quantization. We exploit the quantization precision range of 16-bit floating point, with both float16 and bfloat16 datatypes, and 8-bit integer. As of today, 8-bit integer is the lowest precision for GPUs supported through PyTorch.

We provide quantization support for single-GPU inference, specifically to facilitate patch-based inference, a widely used approach where patch sizes are small-scale images. Furthermore, to enable scaled images and slide-level inference and to improve performance, we enable quantization for Distributed DL. We utilized Spatial, Layer, and Pipeline Parallelism for Distributed DL from the Hy-Fi implementation.

We implement our solution in PyTorch [13] and provide an inference pipeline for high-resolution images, supporting different precision quantization and Distributed DL. We evaluated our work with respect to computation, memory utilization, and accuracy, as discussed in detail in Section V.
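As a concrete illustration of the half-precision path described above, the following is a minimal sketch of post-training conversion for single-GPU inference in plain PyTorch. The model, weights, and input sizes are placeholders rather than the paper's actual pipeline; INT8 follows a different route (TensorRT), discussed in Sections II-B and IV.

```python
import torch
from torchvision.models import resnet101

# Placeholder model; in practice the trained ResNet101 weights would be loaded here.
model = resnet101(weights=None).eval().to("cuda")

# Post-training conversion to 16-bit floating point: cast the weights once,
# then feed inputs of the same dtype. torch.float16 or torch.bfloat16 both work on GPUs.
dtype = torch.float16            # or torch.bfloat16
model_fp16 = model.to(dtype)

patches = torch.randn(32, 3, 256, 256, device="cuda", dtype=dtype)  # a batch of patches
with torch.inference_mode():
    logits = model_fp16(patches)
print(logits.dtype, logits.shape)   # torch.float16 torch.Size([32, 1000])
```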
C. Contributions

We list our contributions as follows:
1) Implement GPU-enabled Post-Training Quantization for Distributed DL to enable inference for high-resolution images with fewer computational resources (specifically, the parallelism schemes used are Spatial and Layer/Pipeline Parallelism) (Section IV).
2) Provide a thorough evaluation of quantization for single-GPU and multi-GPU Distributed DL with respect to throughput, memory utilization, and accuracy on different datasets, including the CAMELYON16 digital pathology dataset (Section V).
3) Reduce the GPU computing resources to 4 GPUs with the FP16 quantized Distributed DL ResNet101 model, whereas more than 128 GPUs are required with the FP32 model (Section V-C).
4) Achieve an average speedup of 1.25x and memory reduction of 1.58x with FP16 Distributed DL compared with the baseline FP32 ResNet101 model (Section V-C).
5) With a single GPU, achieve an average speedup of 1.38x while reducing memory utilization by 5.49x with the INT8 quantized ResNet101 model when compared with baseline FP32 (Section V-B).

II. BACKGROUND

A. Distributed DL for High-Resolution Images

1) Layer and Pipeline Parallelism: Large Deep Neural Networks are memory- and computation-intensive, which restricts them to smaller image and batch sizes on a single GPU. As we scale the image size, the model cannot be trained or inferred with a single GPU, as the memory required to accommodate the model parameters exceeds the available GPU memory. For such scenarios, Layer Parallelism (LP) [7] [8] is employed. LP distributes one or more layers of the DL model onto different GPUs so that they fit into GPU memory. However, distributing layers serializes the computation across GPUs, as the input to a layer on one GPU depends on the output of the previous layer on a different GPU. Consequently, at any given instant, only one GPU is performing computation while the rest remain idle.

To improve computation and memory efficiency and address the limitation of LP, Pipeline Parallelism (PP) [14] is utilized. In this approach, each layer is distributed similarly to LP, but the input batch is divided into micro-batches, and each micro-batch is executed in a pipelined manner.

However, the significantly large memory requirement for model parameters at each layer in Layer and Pipeline Parallelism restricts the batch size to 1 or 2, causing increased latency. Furthermore, if image sizes are scaled further to 4096×4096 and 8192×8192, even a single layer cannot fit into a single GPU's memory. Therefore, LP or PP still has limitations when it comes to scaling image sizes and batch size.
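To make the LP and PP schemes above concrete, the following minimal sketch splits a toy model across two GPUs in plain PyTorch; it illustrates the idea only and is not the Hy-Fi/GEMS implementation. The stage boundaries and layer choices are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TwoGPULayerParallel(nn.Module):
    """Toy layer parallelism: the first stage lives on cuda:0, the second on cuda:1."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        ).to("cuda:0")
        self.stage1 = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 2)
        ).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # The activation is copied to the next GPU; while stage1 runs, cuda:0 is
        # idle -- the serialization described above for LP.
        return self.stage1(x.to("cuda:1"))

model = TwoGPULayerParallel().eval()
with torch.inference_mode():
    out = model(torch.randn(2, 3, 512, 512))

# Pipeline parallelism keeps the same split but pushes micro-batches through the
# stages back-to-back, e.g. for chunk in torch.chunk(batch, 4): model(chunk),
# so that both GPUs can be busy on different micro-batches at the same time.
```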
2) Spatial Parallelism: Spatial Parallelism (SP) [7] [15] overcomes the limitations of LP and PP by enabling training or inference for larger images and higher batch sizes. In Spatial Parallelism, the whole image is partitioned into smaller, non-overlapping spatial parts that are distributed across different GPUs. Further, the convolution and pooling layers of the DL model are replicated on the GPUs holding the spatial parts, and lastly, the output layer is placed on a single GPU undergoing LP. Figure 1a shows the overview of Spatial and Layer Parallelism. The digital pathology image is partitioned into 4 spatial parts, and the model is split into 2 parts. The first model split consists of the compute- and memory-intensive convolution and pooling layers, while the second and final model partition contains the output layer. Each spatial part performs the convolution and pooling operations given by the first model split, and finally, the outputs are aggregated by the second model split.

Fig. 1: Overview of Spatial and Layer Parallelism. (a) Implementation of Spatial and Layer Parallelism with a spatial partition factor of 4 and a model split factor of 2; the first model partition contains the convolution and pooling layers, while the second and last model partition contains the output layer. (b) Halo-exchange communication.

Halo-Exchange Communication: Convolution and pooling operations require information about adjacent pixels. For pixels located on the borders of a spatial segment, the adjacent pixels reside on different GPUs. Figure 1b illustrates the halo-exchange required by the first spatial part with different GPUs. Therefore, when using SP, each GPU needs to communicate with other GPUs to obtain adjacent pixels. We refer to such communication as halo-exchange.
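As a rough illustration of the halo-exchange just described, the sketch below exchanges boundary rows between neighboring ranks using point-to-point calls from torch.distributed. Only the communication pattern is shown; the 1D height-wise split, halo width, zero-padding at the outer borders, and process-group setup are simplifying assumptions, not the Hy-Fi implementation.

```python
import torch
import torch.distributed as dist

def halo_exchange_rows(local_part: torch.Tensor, halo: int) -> torch.Tensor:
    """Exchange `halo` boundary rows with the previous/next rank.

    Assumes an NCHW tensor and a 1D split of the full image along the height
    dimension, with rank i holding part i. Returns the local part padded with
    the received halo rows (zeros at the outer image borders).
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    top_halo = torch.zeros_like(local_part[:, :, :halo, :])
    bottom_halo = torch.zeros_like(local_part[:, :, -halo:, :])

    ops = []
    if rank > 0:            # exchange with the spatial part above
        ops.append(dist.P2POp(dist.isend, local_part[:, :, :halo, :].contiguous(), rank - 1))
        ops.append(dist.P2POp(dist.irecv, top_halo, rank - 1))
    if rank < world - 1:    # exchange with the spatial part below
        ops.append(dist.P2POp(dist.isend, local_part[:, :, -halo:, :].contiguous(), rank + 1))
        ops.append(dist.P2POp(dist.irecv, bottom_halo, rank + 1))
    if ops:
        for req in dist.batch_isend_irecv(ops):
            req.wait()

    # The convolution on this rank now sees its neighbors' border pixels.
    return torch.cat([top_halo, local_part, bottom_halo], dim=2)
```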

B. Quantization

Quantization is a technique used to reduce the number of bits required to represent a value, thereby significantly reducing memory usage and latency for a given problem. In the context of Deep Learning, quantization is applied to model weights and activations by converting them to low-bit precision, such as half precision (16-bit floating point) or integer precision (8-bit or 4-bit integer), from the default 32-bit floating point. Quantization is widely used for both training and inference in deep learning; quantized training requires additional design effort to carefully quantize gradients and activations in order to minimize accuracy errors [16] [17]. Recent studies have tended to be more inclined towards inference and have shown successful results [18] [9].

Comparison: Integer-Only vs. Floating-Point Quantization: Floating-point conversion, i.e., converting 32-bit floating point to half-precision floating point, is relatively simple, as both are floating-point data types and follow the same representation scheme. In contrast, converting 32-bit floating point to 8-bit integer significantly reduces the value range to 256 values and requires a new representation scheme to map 32-bit float values to integers [10]. This new representation scheme uses the range of floating-point values ([α, β] in Figure 3) to map 32-bit floating-point values to integer values. Figure 3 shows the mapping of a floating-point range to a b-bit integer value range. Further, Equation 3 provides the conversion used to represent a b-bit integer value relative to a floating-point value, where x_q is the b-bit quantized value of the floating-point value x.

Fig. 3: Mapping of floating-point values to 8-bit values [10]

    s = (2^b - 1) / (β - α)                                    (1)

    z = -round(α · s) - 2^(b-1)                                (2)

    x_q = clip(round(x · s + z), -2^(b-1), 2^(b-1) - 1)        (3)

Equations 1 and 2 provide the quantization parameters required in Equation 3: the scale factor (s) and the zero-point value (z), respectively. The scale factor is a floating-point value, and the zero point is a b-bit integer value corresponding to the zero value in the floating-point representation. clip() maps values outside the range to the nearest representable integer value.

To determine the floating-point value range [α, β], a calibration step is used, which is done by performing a forward pass with a few given samples for the particular model.
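For illustration, the following is a direct transcription of Equations 1-3 into PyTorch; the calibration range [α, β] is assumed to be given (here hard-coded to [-1, 1]) rather than computed from real data.

```python
import torch

def affine_quantize(x: torch.Tensor, alpha: float, beta: float, b: int = 8):
    """Quantize the float tensor x to b-bit integers using Equations (1)-(3)."""
    s = (2 ** b - 1) / (beta - alpha)          # Eq. (1): scale factor
    z = -round(alpha * s) - 2 ** (b - 1)       # Eq. (2): zero point
    qmin, qmax = -2 ** (b - 1), 2 ** (b - 1) - 1
    xq = torch.clamp(torch.round(x * s + z), qmin, qmax).to(torch.int8)  # Eq. (3)
    return xq, s, z

def dequantize(xq: torch.Tensor, s: float, z: int) -> torch.Tensor:
    """Approximate recovery of the original floating-point values."""
    return (xq.to(torch.float32) - z) / s

# [alpha, beta] would come from the calibration pass; assumed here for the example.
x = torch.tensor([-1.0, -0.5, 0.0, 0.25, 1.0])
xq, s, z = affine_quantize(x, alpha=-1.0, beta=1.0)
print(xq, dequantize(xq, s, z))
```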
Post-Training Quantization: Post-Training Quantization (PTQ) converts the weights and activations of a pre-trained, unquantized model to low-bit precision, thereby reducing memory and computation requirements for inference. PTQ is categorized into two modes, namely dynamic quantization and static quantization. Dynamic quantization converts the weights into low-precision values beforehand but converts the activations dynamically at runtime, depending on the observed data range. On the other hand, in static quantization, the weights and activations are both quantized into low-precision values, which requires a calibration step to determine these values. However, for GPUs, PTQ with PyTorch is limited to the static post-training quantization mode via TensorRT. Figure 2b provides the overview of the Post-Training Quantization inference pipeline.

Fig. 2: Quantization in Deep Learning. (a) Inference pipeline for quantization in Distributed DL. (b) Overview of the Post-Training Quantization pipeline.
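The sketch below outlines what this static PTQ path can look like, assuming the Torch-TensorRT 1.x API that ships alongside PyTorch 1.13 (torch_tensorrt.compile with an INT8 DataLoaderCalibrator); model and calib_loader are placeholders, and the exact calibrator options may differ across Torch-TensorRT versions.

```python
import torch
import torch_tensorrt

# model: a trained FP32 ResNet-style network in eval mode on the GPU (placeholder)
# calib_loader: a small DataLoader of representative samples used only for calibration

calibrator = torch_tensorrt.ptq.DataLoaderCalibrator(
    calib_loader,
    cache_file="./resnet101_calibration.cache",   # reuse calibration on later runs
    use_cache=False,
    algo_type=torch_tensorrt.ptq.CalibrationAlgo.ENTROPY_CALIBRATION_2,
    device=torch.device("cuda:0"),
)

trt_int8_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((32, 3, 256, 256))],   # static shape: batch of 256x256 patches
    enabled_precisions={torch.int8},                     # request integer-only kernels
    calibrator=calibrator,
)

with torch.inference_mode():
    out = trt_int8_model(torch.randn(32, 3, 256, 256, device="cuda:0"))
```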

III. RELATED WORK

A. Deep Learning for High-Resolution Images

Due to the challenges posed by the large image size of high-resolution images, several research studies have discussed efficient methodologies for improving accuracy using deep learning for training [19] [20] [6]. However, training on these images considerably increases the training time to several hours. To accelerate training, [8] [7] propose a Distributed Deep Learning approach that reduces training time from several hours to a few minutes. Recent efforts to make deep learning models accessible for high-resolution images have also targeted inference [5] [21] [22]. These works use deep learning models with a single processing unit, restricting them to smaller image sizes, while Distributed Deep Learning for scaled image sizes remains unexplored for inference.

B. Quantization in Deep Learning

Quantization is utilized in deep learning to minimize memory and computation expenses. It has been widely adopted for inference tasks to achieve low accuracy degradation while lowering memory and computation resource requirements [18] [9] [23]. For compute- and memory-intensive large deep models, recent work [24] [25] also shows the applicability of quantization to multi-GPU inference for transformer-based models. However, quantization for scaled images requiring multi-GPU Distributed DL has not been evaluated.

Our work leverages quantization for high-resolution image inference with Deep Learning. We further accelerate and scale image sizes with Distributed Deep Learning, utilizing different parallelism techniques for multi-GPU inference.

IV. QUANTIZATION IN DISTRIBUTED DEEP LEARNING

Figure 2b illustrates a general quantization pipeline, and we follow the same pipeline when dealing with Distributed DL models. However, it is important to note that model quantization is performed separately on each GPU. In Distributed DL, the model is distributed according to the chosen parallelism strategy and can be too big to fit into a single GPU's memory. Thus, we perform model quantization for each distributed part of the model. In the case of Spatial Parallelism, for each spatial part, after initializing the model with the given weights, we perform model quantization independently on each GPU device. Similar is the case with Layer Parallelism. Figure 2a shows the implementation pipeline for quantization in Distributed DL.
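The per-GPU quantization step described above can be pictured as each rank independently converting only the partition it owns once the weights are loaded. The sketch below illustrates this with FP16 conversion as the quantization step; build_partition and the process-group setup are placeholders, not the actual Hy-Fi-based pipeline.

```python
import torch
import torch.distributed as dist

def quantize_local_partition(build_partition, precision=torch.float16):
    """Each rank builds/loads only its own model part, then quantizes it locally.

    `build_partition(rank, world_size)` is a hypothetical helper that returns the
    nn.Module (spatial or layer partition) owned by this rank, with weights loaded.
    """
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")   # assumes launch via torchrun
    rank, world = dist.get_rank(), dist.get_world_size()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    part = build_partition(rank, world).to(device).eval()
    # Quantization is applied independently per GPU; no exchange of quantized
    # weights is needed because each rank only ever executes its own part.
    return part.to(precision)
```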

Furthermore, Spatial Parallelism requires performing halo-exchange, as shown in Figure 1b, where each convolution and pooling operation performs point-to-point communication as part of the forward pass. In PyTorch, for GPUs, integer-only quantization is done via TensorRT. TensorRT takes the Deep Learning (DL) model defined in PyTorch and compiles it to support integer quantization specifically for NVIDIA GPUs. This compilation supports DL layers such as convolution, normalization, pooling, etc., but it does not cover communication function calls. Since Spatial Parallelism requires point-to-point communication for halo-exchange in the forward pass, TensorRT cannot resolve such calls, which limits INT8 quantization support for Spatial Parallelism.

V. EVALUATION

We conducted our experiments on NVIDIA A100 40 GB GPUs (2 GPUs per node) with AMD EPYC 7713 64-core processors. A few of our experiments also use NVIDIA Tesla P100 (Pascal) 16 GB GPUs (2 GPUs per node) with Intel Xeon E5-2680 v4 CPUs.

We used PyTorch v1.13.1 [13] as the Deep Learning framework and TensorRT [26] through the Torch-TensorRT API for integer-only quantization. For collective communication in Distributed DL, we used the NCCL (NVIDIA Collective Communications Library) [27] communication backend.

This section is divided into 3 parts. First, we study the effects of different precision quantization on accuracy in Section V-A. Second, we evaluate quantization on a single GPU for small-scale images in Section V-B. Finally, we discuss quantization with Spatial, Layer, and Pipeline Parallelism, enabling large-scale images and higher batch sizes, in Section V-C.

A. Effect of quantization on accuracy for Inference

For accuracy evaluation, we used ResNet101 and performed model quantization with different precisions. We conducted our accuracy evaluation on various datasets and compared the quantization results with the baseline inference accuracy using FP32 precision.

Dataset Description: We used the following datasets: CAMELYON16, ImageNet, CIFAR-10, and Imagenette. CAMELYON16 is a real-world digital pathology dataset from a competition held by the International Symposium on Biomedical Imaging (ISBI) to detect metastatic breast cancer in whole-slide images (WSI). It consists of 400 WSI images categorized into 2 classes: normal and tumor. ImageNet, CIFAR-10, and Imagenette are image classification datasets containing 1,431,167 images with 1000 object classes, 60,000 images with 10 classes, and 13,394 images with 10 classes, respectively.

Evaluation Methodology: For the CAMELYON16 dataset, the total size is 300GB, and each image is around 5GB with an approximate image resolution of 100,000x200,000. Since these images cannot fit into memory, for accuracy evaluation we used a patch-based approach. We extracted patches of size 256x256 containing the tissue region and labeled each patch based on the slide label.
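As an illustration of this patch-based preparation, the sketch below tiles a large image tensor into non-overlapping 256x256 patches using torch.Tensor.unfold; the tissue filter is a placeholder threshold, not the actual CAMELYON16 preprocessing (which operates on WSI files rather than in-memory tensors).

```python
import torch

def tile_into_patches(image: torch.Tensor, patch: int = 256) -> torch.Tensor:
    """Split a CxHxW image into non-overlapping patch x patch tiles."""
    c, h, w = image.shape
    tiles = (image[:, : h - h % patch, : w - w % patch]   # crop to a multiple of the patch size
             .unfold(1, patch, patch)                      # tile along H
             .unfold(2, patch, patch))                     # tile along W
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)

region = torch.rand(3, 4096, 4096)        # stand-in for one region of a WSI
patches = tile_into_patches(region)        # shape: (256, 3, 256, 256)

# Placeholder "tissue" filter: keep patches that are not mostly bright background.
keep = patches.mean(dim=(1, 2, 3)) < 0.9
tissue_patches = patches[keep]
```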
To evaluate the quantization effect on each dataset, we trained ResNet101 for a few epochs to achieve the desired training accuracy, applied PTQ to obtain a quantized model at various precision levels, and then tested the inference accuracy on either the testing or validation dataset.
Result Evaluation: Table II shows the accuracy evaluation with quantization on the different datasets. We observed negligible variation in accuracy across the different precisions, with accuracy degradation of less than 1%.

Dataset     | FP32  | FP16  | BFLOAT16 | INT8
CAMELYON16  | 70.27 | 70.26 | 70.32    | 70.26
ImageNet    | 77.62 | 77.57 | 78.41    | 76.85
CIFAR-10    | 86.02 | 86.05 | 86.04    | 85.99
Imagenette  | 75.87 | 75.87 | 75.90    | 75.13

TABLE II: Evaluating inference accuracy (%) with different precision quantized models on the NVIDIA A100 GPU

B. Quantization with Single GPU

We evaluate the effects of quantization on memory utilization and throughput for different image sizes to understand the benefits of quantization for small-scale images. Figures 4a and 4b show the evaluation on image sizes 256x256, 512x512, and 1024x1024 with a batch size of 32 on the ResNet101 model, and we compare our results with baseline FP32 precision. Figure 4a illustrates the throughput evaluation: FP16 improved performance by an average of 1.38x, BFLOAT16 by 1.33x, and INT8 by 5.49x. As shown in Figure 4b, we achieved average memory reductions of 1.68x, 2.11x, and 3.56x with FP16, BFLOAT16, and INT8 precision, respectively. Overall, INT8 precision quantization appears to be the optimal choice for small-scale images with a single GPU. Additionally, with an image size of 1024x1024, we observed a 5.23x memory reduction coupled with a speedup improvement of 5.52x.

Fig. 4: Throughput and memory evaluation on a single GPU for the ResNet101 model with different image sizes and batch size 32. (a) Throughput evaluation on a single GPU. (b) Memory utilization evaluation on a single GPU. The speedup and memory reduction are shown in the respective colored boxes for FP16, BFLOAT16, and INT8 when compared to baseline FP32.

It is important to note the overhead incurred due to calibration with INT8 quantization the very first time we perform model quantization. Figure 2b and Section II-B illustrate the need for the calibration step. For any new model and dataset, it is necessary to perform calibration once to obtain an INT8 quantized model; the quantized model's weights can then be stored and restored at a later stage. Figure 7 shows the overhead due to calibration. Although we observed a significant calibration overhead of 2.56x on average, the calibrated INT8 model still outperforms the unquantized FP32 precision model by an average speedup of 1.35x.

Fig. 7: Calibration overhead.

C. Distributed DL Quantization Performance Evaluation

In this section, we first evaluate the performance benefits of using quantization in Distributed DL with respect to memory and throughput. Further, we evaluate its benefits with Spatial Parallelism by enabling inference for very high resolution images and accelerating performance. We discuss each of these benefits in the specific paragraphs outlined below.

Memory Evaluation: We profile memory footprints for an image size of 4096x4096 to analyze Spatial Parallelism with quantization. Figure 6 illustrates the memory distribution on different GPUs. We run experiment configurations with 4 and 8 spatial parts, as shown in Figures 6a and 6b, where one additional GPU is used for model parallelism. Through quantization, we are able to reduce the memory requirement on each GPU by roughly half compared to the memory required by the full-precision FP32 model. Overall, we achieve a memory reduction of 1.57x with FP16 and 1.40x with BFLOAT16 when compared to the baseline FP32.

Fig. 6: Overview of the ResNet101 model and 4096×4096 image distribution with the respective memory evaluation for SP+LP on multi-GPUs. (a) Memory footprints on 5 GPUs for SP+LP. (b) Memory footprints on 9 GPUs for SP+LP. The evaluation compares the memory utilization of FP16 and BFLOAT16 quantization with FP32 as the baseline.

Throughput Evaluation: We experimented with image sizes of 2048x2048 and 4096x4096, employing Spatial and Model Parallelism. We scaled the experiment with the number of spatial parts, ranging over 2, 4, and 8, where each part was distributed to a different GPU. It is important to note that an additional GPU was utilized to perform Model Parallelism for the last output layer, as depicted in Figure 1a. For this experiment, we chose the maximum batch size supported for each GPU count to utilize memory to its maximum extent. Due to partitioning into a higher number of parts across different GPUs, we were able to enable a higher batch size. For example, with 2048x2048, we could not scale beyond a batch size of 16, as it would become out-of-core. Figure 5a shows the throughput for 2048x2048 with batch sizes of 16, 32, and 64 on 2, 4, and 8 spatial parts, respectively. Similarly, Figure 5b shows the throughput for 4096x4096 with batch sizes of 16, 32, and 64 on 4, 8, and 16 spatial parts, respectively. We compared the results with the baseline FP32. For 2048x2048, we achieved up to a 1.9x speedup with FP16 and 1.6x with BFLOAT16. For 4096x4096, we achieved up to a 1.55x speedup with FP16 and 1.65x with BFLOAT16. [TBD: get average numbers]

Fig. 5: Throughput and memory evaluation on multi-GPUs for the ResNet101 model with different image sizes. (a) Throughput evaluation for 2048x2048 image size. (b) Memory evaluation for 4096x4096 image size. The speedup and memory reduction are shown in the respective colored boxes for FP16, BFLOAT16, and INT8 when compared to baseline FP32.

Enabling very high-resolution images: The FP32 precision ResNet101 model with an image size of 8192x8192 requires approximately 87GB of memory in total and becomes out-of-core even with the smallest batch size of 1 on a single GPU with 40GB of memory. It requires splitting the image into a number of spatial parts to enable inference for 8192x8192. We enabled inference for an image size of 8192x8192 with 4 GPUs for spatial partitioning. Figure 8a shows the overview and performance for image size 8192x8192 with 4 and 8 GPUs.

We further evaluate Spatial Parallelism to accelerate performance while scaling with respect to the number of GPUs. Figure 8b shows the performance comparison of SP on different GPU counts (2, 4, and 8) with a single GPU as the baseline. We achieve linear scaling, attaining up to 1.8x and 2x speedups on 4 and 8 GPUs with BFLOAT16.

Fig. 8: Enabling scaled images and accelerating performance using SP. (a) Enabling inference for 8192×8192 with FP16. (b) Accelerating performance with SP.

VI. CONCLUSION

High-resolution images with Deep Learning come with their own set of challenges due to the large image sizes and deep DL models, making inference computationally and memory intensive. However, research on high resolution in DL is crucial due to its applicability and efficiency, for example in digital pathology. Our efforts are focused on making trained DL models accessible for high-resolution image inference by reducing computation time and resource requirements.

We proposed accelerated inference for high-resolution images utilizing quantization techniques while reducing memory and computation and without accuracy degradation. We provided support for single-GPU as well as multi-GPU Distributed DL inference. We achieved an overall 5.49x speedup and 3.56x memory reduction on a single GPU with INT8 quantization. With Distributed DL, we enabled inference for scaled images. We achieved a 1.45x speedup and a 1.5x memory reduction using half-precision Distributed DL. We further accelerate performance by 2x using SP compared to a single GPU.

We hope that our work will facilitate researchers in achieving accessibility and efficiency in Deep Learning inference while reducing computational costs for their innovative research in the field of high-resolution images.

ACKNOWLEDGEMENTS

We are thankful to A. Jain et al. for providing the implementation of Hy-Fi [7], Distributed DL training for high-resolution images.

REFERENCES

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
[2] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation, 2015.
[3] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search, 2019.
[4] Osamu Iizuka, Fahdi Kanavati, Kei Kato, Michael Rambeau, Koji Arihiro, and Masayuki Tsuneki. Deep learning models for histopathological classification of gastric and colonic epithelial tumours. Scientific Reports, 10, 2020.
[5] Weizhe Li, Mike Mikailov, and Weijie Chen. Scaling the inference of digital pathology deep learning models using CPU-based high-performance computing. IEEE Transactions on Artificial Intelligence, 4(6):1691–1704, 2023.
[6] Ruiwei Feng, Xuechen Liu, Jintai Chen, Danny Z. Chen, Honghao Gao, and Jian Wu. A deep learning approach for colonoscopy pathology WSI analysis: Accurate segmentation and classification. IEEE Journal of Biomedical and Health Informatics, 25(10):3700–3708, 2021.
[7] Arpan Jain, Aamir Shafi, Quentin Anthony, Pouya Kousha, Hari Subramoni, and Dhableswar K. Panda. Hy-Fi: Hybrid five-dimensional parallel DNN training on high-performance GPU clusters. In High Performance Computing: 37th International Conference, ISC High Performance 2022, Hamburg, Germany, May 29 – June 2, 2022, Proceedings, pages 109–130, Berlin, Heidelberg, 2022. Springer-Verlag.
[8] Arpan Jain, Ammar Ahmad Awan, Asmaa M. Aljuhani, Jahanzeb Maqbool Hashmi, Quentin G. Anthony, Hari Subramoni, Dhableswar K. Panda, Raghu Machiraju, and Anil Parwani. GEMS: GPU-enabled memory-aware model-parallelism system for distributed DNN training. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2020.
[9] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference, 2021.
[10] Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius. Integer quantization for deep learning inference: Principles and empirical evaluation, 2020.
[11] Jon Braatz, Pranav Rajpurkar, Stephanie Zhang, Andrew Y. Ng, and Jeanne Shen. Deep learning-based sparse whole-slide image analysis for the diagnosis of gastric intestinal metaplasia, 2022.
[12] Jakub R. Kaczmarzyk, Alan O'Callaghan, Fiona Inglis, Swarad Gat, Tahsin Kurc, Rajarsi Gupta, Erich Bremer, Peter Bankhead, and Joel H. Saltz. Open and reusable deep learning for pathology with WSInfer and QuPath. npj Precision Oncology, 8(1), January 2024.
[13] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[14] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism, 2019.
[15] Aristeidis Tsaris, Josh Romero, Thorsten Kurth, Jacob Hinkle, Hong-Jun Yoon, Feiyi Wang, Sajal Dash, and Georgia Tourassi. Scaling resolution of gigapixel whole slide images using spatial decomposition on convolutional neural networks. In Proceedings of the Platform for Advanced Scientific Computing Conference, pages 1–11, 2023.
[16] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks, 2018.
[17] Yinpeng Dong, Renkun Ni, Jianguo Li, Yurong Chen, Jun Zhu, and Hang Su. Learning accurate low-bit deep neural networks with stochastic quantization, 2017.
[18] Hyunho Ahn, Tian Chen, Nawras Alnaasan, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, and Dhabaleswar K. Panda. Performance characterization of using quantization for DNN inference on edge devices: Extended version, 2023.
[19] Mahendra Khened, Avinash Kori, Haran Rajkumar, Balaji Srinivasan, and Ganapathy Krishnamurthi. A generalized deep learning framework for whole-slide image segmentation and analysis, 2020.
[20] Jason Wei, Laura Tafe, Yevgeniy Linnik, Louis Vaickus, Naofumi Tomita, and Saeed Hassanpour. Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. Scientific Reports, 9, March 2019.
[21] André Pedersen, Marit Valla, Anna M. Bofin, Javier Pérez de Frutos, Ingerid Reinertsen, and Erik Smistad. FastPathology: An open-source platform for deep learning-based research and decision support in digital pathology, 2020.
[22] Ruichen Rong, Hudanyun Sheng, Kevin W. Jin, Fangjiang Wu, Danni Luo, Zhuoyu Wen, Chen Tang, Donghan M. Yang, Liwei Jia, Mohamed Amgad, et al. A deep learning approach for histology-based nucleus segmentation and tumor microenvironment characterization. Modern Pathology, page 100196, 2023.
[23] Zhikai Li and Qingyi Gu. I-ViT: Integer-only quantization for efficient vision transformer inference, 2023.
[24] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference, 2022.
[25] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models, 2023.
[26] NVIDIA Developer. NVIDIA TensorRT. https://developer.nvidia.com/tensorrt/, 2019. Accessed: 2024-01-31.
[27] NVIDIA Developer. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl, 2016. Accessed: 2024-01-31.
