AI Radiologist GPU

GPU version of fast training of a radiologist using Multiple GPUs on a large scale setting using multinodes.

Source: NIH : A chest x-ray identifies a lung mass.

Getting Started

Please refer to this article to install horovod and tensorflow.

Hardware & Technology Stack

Hardware configuration	Software Configuration
4 PowerEdge C4140	Deep Learning Framework: Tensorflow-GPU V1.12.0
4 Nvidia V100 32GB SXM2	Horovod version: 0.16.4
2 20 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz	MPI version: 4.0.0 with CUDA and UCX Support
384 GB RAM, DDR4 2666MHz	CUDA version: 10.1.105
Lustre file system	NCCL version: 2.4.7
	Python version: 3.6.8
	OS and version: RHEL 7.4

Clone this Repo

https://github.com/dellemc-hpc-ai/ai-radiologist-GPU.git

cd ai-radiologist-GPU

Run the following commands from the ai-radiologist-GPU directory.

Download the dataset

./download_dataset.sh

you must be able to see tars folder.

Extract the dataset

./extract_images.sh

You must be able to see all the extracted images inside images_all/images folder.

Running the tests

At this point, you can start to train the cheXNet model using raw images, refer to Run the Job if you'd like to continue training with raw images. However, if you want to train the model with TF Records then convert the raw images to TF Records.

Convert to TF Records

This will convert your raw images to TF Records which improves the training speed by ~8-15% percent.

./write_tfrecords.sh

Run the Job

Please Ensure that:

You edit the conda env name to your env name in your submission script.
Change the paths from the submission script to your MPI, CUDA, etc build locations.

If you're using slurm as scheduler, submit the corresponding script based on the data you want to run. You can change numbers for N and n inside the scripts.

sbatch job_submissions/slurm/{raw_1gpu.sh/tfrec_1gpu.sh}

Throughput Numbers

Total Process (Number of GPUs)	Images/Second
1	185.12
2	315.26
3	421.85
4	589.36
8	1116.99
12	1527.60
16	1912.82

Time to solution - Using 10 epochs

Total Process (Number of GPUs)	Time to Train 1 Epoch(s) - Averaged from 10 epochs
1	4206.58
2	2470.08
3	1845.96
4	1321.29
8	697.15
12	509.76
16	407.1

AUCROC

Pathology	AUCROC
Cardiomegaly	0.875733593
Emphysema	0.892907302
Effussion	0.890032406
Hernia	0.833090067
Nodule	0.715641755
Pneumonia	0.83281297
Atelectasis	0.816172298
Pleural Thickening	0.751320603
Mass	0.823109117
Edema	0.871388368
Consolidation	0.79861069
Infiltration	0.673332357
Fibrosis	0.784720033
Pneumothorax	0.695001477
AVG AUCROC	0.803848074

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
job_submissions/slurm		job_submissions/slurm
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
chexnet_densenet_raw_images.py		chexnet_densenet_raw_images.py
chexnet_densenet_tfrec.py		chexnet_densenet_tfrec.py
download_dataset.sh		download_dataset.sh
extract_images.sh		extract_images.sh
inference.py		inference.py
lung-mass.jpg		lung-mass.jpg
training_labels_new.pkl		training_labels_new.pkl
validation_labels_new.pkl		validation_labels_new.pkl
write_tfrecords.sh		write_tfrecords.sh
write_totfrec.py		write_totfrec.py
write_totfrec_val.py		write_totfrec_val.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI Radiologist GPU

GPU version of fast training of a radiologist using Multiple GPUs on a large scale setting using multinodes.

Getting Started

Hardware & Technology Stack

Clone this Repo

Download the dataset

Extract the dataset

Running the tests

Convert to TF Records

Run the Job

Throughput Numbers

Time to solution - Using 10 epochs

AUCROC

Related Articles/Blogs

Acknowledgments

About

Releases

Packages

Contributors 2

Languages

License

dellemc-hpc-ai/ai-radiologist-GPU

Folders and files

Latest commit

History

Repository files navigation

AI Radiologist GPU

GPU version of fast training of a radiologist using Multiple GPUs on a large scale setting using multinodes.

Getting Started

Hardware & Technology Stack

Clone this Repo

Download the dataset

Extract the dataset

Running the tests

Convert to TF Records

Run the Job

Throughput Numbers

Time to solution - Using 10 epochs

AUCROC

Related Articles/Blogs

Acknowledgments

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages