Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

Nimble is a deep learning execution engine that accelerates model inference and training by running GPU tasks (i.e., GPU kernels and memory operations) in parallel with minimal scheduling overhead. Given a PyTorch DL model, Nimble automatically generates a GPU task schedule, which employs an optimal parallelization strategy for the model. The schedule is wrapped in a Nimble object and can be seamlessly applied to PyTorch programs. Nimble improves the speed of inference and training by up to 22.34× and 3.61× compared to PyTorch, respectively. Moreover, Nimble outperforms TensorRT by up to 2.81×.
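To illustrate the general idea of mapping a model's GPU task graph onto multiple streams, here is a toy scheduler (a simplified sketch for intuition only, not Nimble's actual algorithm): each task inherits a parent's stream when that stream is free, and diverging branches receive fresh streams so independent tasks can run concurrently. The task names and the `assign_streams` helper are illustrative.

```python
def assign_streams(topo_order, parents):
    """Greedily map tasks in a dependency DAG to streams.

    A task reuses the stream of the first parent whose stream has not
    yet been handed to another child; otherwise it opens a new stream.
    Independent branches therefore land on different streams.
    """
    stream_of = {}
    handed_off = set()  # parents whose stream was already inherited
    next_stream = 0
    for task in topo_order:  # tasks must be in topological order
        stream = None
        for parent in parents.get(task, ()):
            if parent not in handed_off:
                stream = stream_of[parent]
                handed_off.add(parent)
                break
        if stream is None:
            stream = next_stream
            next_stream += 1
        stream_of[task] = stream
    return stream_of

# A diamond-shaped graph: conv -> {branch_a, branch_b} -> join
order = ["conv", "branch_a", "branch_b", "join"]
parents = {"branch_a": ["conv"], "branch_b": ["conv"],
           "join": ["branch_a", "branch_b"]}
streams = assign_streams(order, parents)
# branch_a and branch_b end up on different streams and can overlap
```

A real scheduler must also insert synchronization (e.g., CUDA events) where branches join, so that `join` only starts after both branches finish; that bookkeeping is omitted here.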

  • Speedup in Inference (ImageNet models): inference performance comparison on an NVIDIA V100 GPU.
  • Speedup in Training (CIFAR-10 models, batch sizes 32, 64, and 128): training performance comparison on an NVIDIA V100 GPU.

Version

This version of Nimble is built on top of PyTorch v1.7.1 with CUDA 11.0. To see the older version of Nimble used for the experiments in the paper, check out the main_pytorch_v1.4.1 branch.

Install Nimble

Please refer to the installation instructions to build Nimble from source.

Use Nimble

Nimble supports both inference and training of neural networks.

Model Inference

import torch
import torchvision

# Instantiate a PyTorch Module and move it to a GPU
model = torchvision.models.resnet50()
model = model.cuda()
model.eval()

# Prepare a dummy input
input_shape = [1, 3, 224, 224]
dummy_input = torch.randn(*input_shape).cuda()

# Create a Nimble object
nimble_model = torch.cuda.Nimble(model)
nimble_model.prepare(dummy_input, training=False)

# Execute the object
rand_input = torch.rand(*input_shape).cuda()
output = nimble_model(rand_input)
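To see the speedup on your own model, you can time the prepared object against the original module. The helper below is a generic sketch (not part of Nimble's API): it averages the latency of any callable after a warm-up phase. Because CUDA launches are asynchronous, the callable you pass should end with `torch.cuda.synchronize()` so the measured time covers the actual GPU work.

```python
import time

def avg_latency(fn, warmup=10, iters=100):
    """Return the mean wall-clock latency of fn() in seconds.

    Runs `warmup` untimed calls first so one-time costs (lazy
    initialization, caching) do not skew the measurement.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Example usage on a GPU (assumes the objects from the snippet above):
#   def run():
#       nimble_model(rand_input)
#       torch.cuda.synchronize()
#   print(avg_latency(run))
```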

Model Training

import torch
import torchvision

BATCH = 32

# Instantiate a PyTorch Module and move it to a GPU
model = torchvision.models.resnet50(num_classes=10)
model = model.cuda()
model.train()

# Define a loss function and an optimizer
loss_fn = torch.nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Prepare a dummy input
input_shape = [BATCH, 3, 32, 32]
dummy_input = torch.randn(*input_shape).cuda()

# Create a Nimble object
nimble_model = torch.cuda.Nimble(model)
nimble_model.prepare(dummy_input, training=True)

# Execute the forward pass
rand_input = torch.rand(*input_shape).cuda()
output = nimble_model(rand_input)

# Compute loss
label = torch.zeros(BATCH, dtype=torch.long).cuda()
loss = loss_fn(output, label)

# Execute the backward pass
loss.backward()

# Perform an optimization step
optimizer.step()

Reproduce Evaluation Results

Please refer to the evaluation instructions to reproduce the results reported in the paper.

Publication

Woosuk Kwon*, Gyeong-In Yu*, Eunji Jeong, and Byung-Gon Chun (* equal contribution), Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning, 34th Conference on Neural Information Processing Systems (NeurIPS), Spotlight, December 2020.

Citation

@inproceedings{kwon2020nimble,
  title={Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning},
  author={Kwon, Woosuk and Yu, Gyeong-In and Jeong, Eunji and Chun, Byung-Gon},
  booktitle={NeurIPS},
  year={2020}
}

Troubleshooting

Please create a GitHub issue for questions and bug reports.

Contribution

We welcome your contributions to Nimble! Our goal is to build an open-source project driven by the community. For general discussions about development, please subscribe to nimble-discuss@googlegroups.com.

License

BSD 3-clause license