GPU version of fast training of an AI radiologist, using multiple GPUs across multiple nodes in a large-scale setting.
Source: NIH. A chest X-ray identifies a lung mass.
Please refer to this article to install Horovod and TensorFlow.
Hardware configuration | Software configuration |
---|---|
4 x PowerEdge C4140 | Deep learning framework: TensorFlow-GPU v1.12.0 |
4 x Nvidia V100 32GB SXM2 | Horovod version: 0.16.4 |
2 x 20-core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz | MPI version: 4.0.0 with CUDA and UCX support |
384 GB RAM, DDR4 2666MHz | CUDA version: 10.1.105 |
Lustre file system | NCCL version: 2.4.7 |
| | Python version: 3.6.8 |
| | OS and version: RHEL 7.4 |
git clone https://github.com/dellemc-hpc-ai/ai-radiologist-GPU.git
cd ai-radiologist-GPU
Run the following commands from the ai-radiologist-GPU directory.
./download_dataset.sh
You should now see a tars folder.
./extract_images.sh
You should now see all the extracted images inside the images_all/images folder.
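A quick way to confirm the extraction worked is to count the image files. The sketch below is only a convenience check and is not part of the repository; the helper name `count_images` is hypothetical, and it assumes the extracted images are `.png` files.

```python
from pathlib import Path


def count_images(directory, suffixes=(".png",)):
    """Count files under `directory` whose extension is in `suffixes`.

    Assumption: the extracted chest X-ray images are PNG files.
    Returns 0 if the directory does not exist.
    """
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    # Run from the ai-radiologist-GPU directory after ./extract_images.sh.
    print(count_images("images_all/images"))
```

A count of 0 suggests the extraction step did not complete or the images landed in a different folder.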
At this point, you can start training the CheXNet model on the raw images; refer to Run the Job if you'd like to continue with raw images. However, if you want to train the model with TF Records, first convert the raw images to TF Records.
The following script converts your raw images to TF Records, which improves training speed by roughly 8-15%.
./write_tfrecords.sh
Please ensure that:
- The conda env name in your submission script is changed to your own env name.
- The paths in the submission script point to your MPI, CUDA, etc. build locations.
If you're using Slurm as the scheduler, submit the script that corresponds to the data format you want to train on. You can change the values of N (number of nodes) and n (number of tasks) inside the scripts.
sbatch job_submissions/slurm/{raw_1gpu.sh/tfrec_1gpu.sh}
Total Processes (Number of GPUs) | Images/Second |
---|---|
1 | 185.12 |
2 | 315.26 |
3 | 421.85 |
4 | 589.36 |
8 | 1116.99 |
12 | 1527.60 |
16 | 1912.82 |
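The throughput figures above can be turned into speedup and scaling efficiency relative to the single-GPU run. The sketch below is only illustrative arithmetic on the table values; the names `throughput` and `scaling` are made up for this example and do not come from the repository.

```python
# Images/second per GPU count, copied from the table above.
throughput = {1: 185.12, 2: 315.26, 3: 421.85, 4: 589.36,
              8: 1116.99, 12: 1527.60, 16: 1912.82}


def scaling(throughput):
    """Return {gpus: (speedup, efficiency)} relative to the 1-GPU entry.

    speedup    = throughput on n GPUs / throughput on 1 GPU
    efficiency = speedup / n (1.0 would be perfect linear scaling)
    """
    base = throughput[1]
    return {n: (ips / base, ips / base / n) for n, ips in throughput.items()}


if __name__ == "__main__":
    for n, (speedup, eff) in sorted(scaling(throughput).items()):
        print(f"{n:>2} GPUs: speedup {speedup:5.2f}x, efficiency {eff:6.1%}")
```

With these numbers, 16 GPUs come out at roughly 10.3x the single-GPU throughput, or about 65% scaling efficiency.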
Total Processes (Number of GPUs) | Time to Train 1 Epoch (seconds), Averaged over 10 Epochs |
---|---|
1 | 4206.58 |
2 | 2470.08 |
3 | 1845.96 |
4 | 1321.29 |
8 | 697.15 |
12 | 509.76 |
16 | 407.1 |
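One way to sanity-check the two result tables against each other is to multiply throughput by epoch time for each GPU count: if both measure the same fixed workload, the product should stay roughly constant. The sketch below assumes the epoch times are in seconds; `results` and `work_per_epoch` are names invented for this example.

```python
# GPU count -> (images/second, time per epoch), copied from the two tables above.
# Assumption: the epoch times are expressed in seconds.
results = {
    1: (185.12, 4206.58),
    2: (315.26, 2470.08),
    3: (421.85, 1845.96),
    4: (589.36, 1321.29),
    8: (1116.99, 697.15),
    12: (1527.60, 509.76),
    16: (1912.82, 407.1),
}


def work_per_epoch(results):
    """Return {gpus: images/second * epoch time} for each configuration."""
    return {n: ips * t for n, (ips, t) in results.items()}


if __name__ == "__main__":
    work = work_per_epoch(results)
    for n in sorted(work):
        print(f"{n:>2} GPUs: {work[n]:,.0f}")
    spread = max(work.values()) / min(work.values()) - 1
    print(f"relative spread: {spread:.4%}")
```

The products all land near 778,700 and differ by less than 0.01%, so the throughput and epoch-time tables are consistent with each other.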
Pathology | AUCROC |
---|---|
Cardiomegaly | 0.875733593 |
Emphysema | 0.892907302 |
Effusion | 0.890032406 |
Hernia | 0.833090067 |
Nodule | 0.715641755 |
Pneumonia | 0.83281297 |
Atelectasis | 0.816172298 |
Pleural Thickening | 0.751320603 |
Mass | 0.823109117 |
Edema | 0.871388368 |
Consolidation | 0.79861069 |
Infiltration | 0.673332357 |
Fibrosis | 0.784720033 |
Pneumothorax | 0.695001477 |
AVG AUCROC | 0.803848074 |
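The average in the last row is the unweighted mean of the 14 per-pathology scores. The sketch below reproduces it from the table values and also picks out the strongest and weakest classes; `auroc` and `summarize` are illustrative names, not code from the repository.

```python
# Per-pathology AUROC values, copied from the table above.
auroc = {
    "Cardiomegaly": 0.875733593,
    "Emphysema": 0.892907302,
    "Effusion": 0.890032406,
    "Hernia": 0.833090067,
    "Nodule": 0.715641755,
    "Pneumonia": 0.83281297,
    "Atelectasis": 0.816172298,
    "Pleural Thickening": 0.751320603,
    "Mass": 0.823109117,
    "Edema": 0.871388368,
    "Consolidation": 0.79861069,
    "Infiltration": 0.673332357,
    "Fibrosis": 0.784720033,
    "Pneumothorax": 0.695001477,
}


def summarize(scores):
    """Return (unweighted mean, best class, worst class) for a score dict."""
    mean = sum(scores.values()) / len(scores)
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return mean, best, worst


if __name__ == "__main__":
    mean, best, worst = summarize(auroc)
    print(f"mean AUROC: {mean:.9f}")
    print(f"best: {best} ({auroc[best]:.3f}), worst: {worst} ({auroc[worst]:.3f})")
```

The mean matches the reported 0.803848074; Emphysema scores highest and Infiltration lowest.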
- Enabling AI Workloads in HPC Environments
- Training an AI Radiologist using Distributed Deep Learning with Nvidia GPUs
- Scaling Performance and Training CheXNet using Bare Metal vs Kubernetes
- Optimization Techniques for Training CheXNet on Dell C4140 with Nvidia V100 GPUs
- Tips and Tricks to Optimize workflow with TF and Horovod on GPUs