This repository is a playground to benchmark several existing (and popular) model serving frameworks for PyTorch.
It also acts as a reference kit for generating, running, and benchmarking your own model in the available frameworks.
Model Name | Precision | GPU | Type |
---|---|---|---|
EfficientNet-B0 | FP32 | Yes | NVIDIA Triton Inference Server |
EfficientNet-B0 | FP32 | Yes | TorchServe |
EfficientNet-B0 | FP32 | Yes | Vanilla Python gRPC |
Note: These results were captured on a machine with an Intel i7 CPU, 16 GB RAM, and an NVIDIA RTX 3050 Ti GPU, using Locust with 1000 users for 5 minutes.
Type | Total Requests (Requests Per Second) |
---|---|
Vanilla Python gRPC | 17340 (43.5) |
TorchServe | 17493 (44.9) |
NVIDIA Triton Inference Server | 28133 (71.9) |
This section will help you compile the model and generate the modules that can be served.
- Compile and generate the assets to be used for serving with the following commands (a hedged sketch of what `conversion.py` might contain follows these steps):

  ```sh
  cd models/efficientnet-b0
  docker run --rm -it --gpus all -v ${PWD}:/scratch_space nvcr.io/nvidia/pytorch:<xx.yy>-py3 # e.g. <xx.yy> = 22.05
  cd /scratch_space
  python3 conversion.py
  exit
  ```
- Copy the outputs to their respective directories for use:

  ```sh
  mv model.pt ../services/triton/model_repository/efficientnet_b0/1/model.pt
  cp ts_model.pt ../services/torchserve/ts_model.pt
  cp ts_model.pt ../services/python-grpc/ts_model.pt
  ```
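`conversion.py` ships with the repository; the sketch below is only a hedged approximation of what it does (the torchvision model constructor and the output file names are assumptions based on the steps above), in case you want to adapt it for your own model:

```python
# conversion_sketch.py -- hedged approximation of conversion.py, not the actual script.
# Assumes the torchvision EfficientNet-B0 weights and that both Triton and
# TorchServe consume TorchScript files named as in the copy step above.
import torch
import torchvision

model = torchvision.models.efficientnet_b0(pretrained=True).eval().cuda()
example = torch.randn(1, 3, 224, 224, device="cuda")

# Trace the model into TorchScript so it can be served without the Python model code.
traced = torch.jit.trace(model, example)

traced.save("model.pt")     # served by Triton from the model repository
traced.save("ts_model.pt")  # packaged for TorchServe / the Python gRPC service
```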
- Run NVIDIA Triton Inference Server using:
  ```sh
  cd services/triton
  docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v <absolute_path_to_ptibench_directory>/model_repository:/models \
      nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models # e.g. <xx.yy> = 22.05
  ```
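With the Triton container up, requests can be sent from Python via the `tritonclient` package. This is only a hedged sketch: the tensor names (`input__0`, `output__0`) and the 224x224 input shape are assumptions and must match the model's `config.pbtxt`:

```python
# Minimal Triton HTTP client sketch (pip install tritonclient[http]).
# The tensor names and input shape below are assumptions; check config.pbtxt
# under the model repository for the actual values.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

response = client.infer(model_name="efficientnet_b0", inputs=[infer_input])
print(response.as_numpy("output__0").shape)  # e.g. (1, 1000) class logits
```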
- Generate a MAR (model archive) for running TorchServe; the `--handler` flag points at `handler.py`, a hedged sketch of which is shown after this step:

  ```sh
  cd services/torchserve
  mkdir -p model-store
  torch-model-archiver --model-name efficientnet_b0 --version 1.0 --serialized-file ts_model.pt --handler handler.py --export-path model-store
  ```
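The actual `handler.py` lives in `services/torchserve`; the sketch below is only a hedged outline of what an image-classification handler for this model could look like, built on TorchServe's `BaseHandler` (the preprocessing and response format are assumptions, not the repository's exact code):

```python
# handler_sketch.py -- hedged outline of a TorchServe handler; the real
# handler.py in services/torchserve may differ.
import io

import torch
from PIL import Image
from torchvision import transforms
from ts.torch_handler.base_handler import BaseHandler


class EfficientNetHandler(BaseHandler):
    """Decodes an image from the request body, runs it through the TorchScript
    model loaded by BaseHandler, and returns the top-1 class index per request."""

    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

    def preprocess(self, data):
        images = []
        for row in data:
            payload = row.get("data") or row.get("body")
            image = Image.open(io.BytesIO(payload)).convert("RGB")
            images.append(self.transform(image))
        return torch.stack(images).to(self.device)

    def postprocess(self, inference_output):
        # One prediction per request in the batch.
        return inference_output.argmax(dim=1).tolist()
```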
- Run TorchServe using:
  ```sh
  docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 -v ${PWD}/model-store:/home/model-server/model-store \
      pytorch/torchserve:latest-gpu torchserve --model-store /home/model-server/model-store/ --models efficientnet_b0=efficientnet_b0.mar
  ```
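Once TorchServe is up, a quick sanity check can be done against its REST inference API on port 8080 (the image file name is a placeholder, and the response format depends on `handler.py`):

```python
# Quick sanity check against TorchServe's REST inference API.
# "cat.jpg" is a placeholder image; the response format depends on handler.py.
import requests

with open("cat.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/predictions/efficientnet_b0",
        data=f.read(),
    )
print(resp.status_code, resp.text)
```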
- Generate the Docker image for the vanilla Python gRPC service using:

  ```sh
  cd services/python-grpc
  docker build . -t torch-python-grpc
  ```
- Run it in a container using:

  ```sh
  docker run --rm --gpus all -p 8080:8080 torch-python-grpc
  ```
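The service's protobuf definitions live in `services/python-grpc`; the sketch below only illustrates the general shape of a gRPC client. The module, stub, message, and field names (`inference_pb2`, `InferenceStub`, `InferenceRequest`, `image`) are hypothetical placeholders, not the repository's actual generated API:

```python
# Hypothetical gRPC client sketch -- the module, stub, and field names are
# placeholders; use the stubs generated from this repository's actual .proto file.
import grpc

import inference_pb2       # assumed name of the generated message module
import inference_pb2_grpc  # assumed name of the generated stub module

with open("cat.jpg", "rb") as f:  # placeholder test image
    image_bytes = f.read()

channel = grpc.insecure_channel("localhost:8080")
stub = inference_pb2_grpc.InferenceStub(channel)  # assumed service name
response = stub.Predict(inference_pb2.InferenceRequest(image=image_bytes))
print(response)
```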
This section will help you run the benchmarks and save the results.
- Run the Locust load test against the Triton server:

  ```sh
  cd services/triton
  locust -f locust_client.py
  ```

- Run the Locust load test against TorchServe:

  ```sh
  cd services/torchserve
  locust -f locust_client.py
  ```

- Run the Locust load test against the vanilla Python gRPC service:

  ```sh
  cd services/python-grpc
  locust -f locust_client.py
  ```
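Each service directory provides its own `locust_client.py`. As a hedged illustration only (this one targets the TorchServe REST endpoint and uses a placeholder image; the repository's actual clients may differ), a minimal Locust user looks like this:

```python
# locust_client_sketch.py -- hedged illustration of a Locust client; the real
# locust_client.py files in each service directory may differ. This one posts a
# placeholder image to the TorchServe REST endpoint on every task.
from locust import HttpUser, task, between


class InferenceUser(HttpUser):
    wait_time = between(0.5, 1.0)  # seconds between requests per simulated user

    def on_start(self):
        with open("cat.jpg", "rb") as f:  # placeholder test image
            self.image_bytes = f.read()

    @task
    def predict(self):
        self.client.post("/predictions/efficientnet_b0", data=self.image_bytes)
```

The 1000-user, 5-minute setup from the results above can be reproduced headlessly with something like `locust -f locust_client.py --host http://localhost:8080 --headless -u 1000 -r 100 -t 5m --csv results`.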
- Refactor the Locust test clients to be consistent across all services
- Run benchmarks weekly / on the latest versions using a common GPU spec
- Use FP16 for inference as well
- Generate some fancy graphs using the saved results (see the sketch after this list)
- Improve the way load testing is done with Locust
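As a hedged starting point for the graphs item above: when Locust is run with `--csv <prefix>`, it saves a `<prefix>_stats.csv` file that can be plotted, for example with pandas and matplotlib. The column names (`Name`, `Requests/s`) and the per-service file names below are assumptions about the saved output:

```python
# plot_results_sketch.py -- hedged example of turning saved Locust CSVs into a
# bar chart. Column names ("Name", "Requests/s") are assumptions about Locust's
# --csv output and may need adjusting for your Locust version.
import matplotlib.pyplot as plt
import pandas as pd

# One stats CSV per service, e.g. produced by `locust ... --csv triton`.
runs = {
    "Triton": "triton_stats.csv",
    "TorchServe": "torchserve_stats.csv",
    "Python gRPC": "python_grpc_stats.csv",
}

rps = {}
for label, path in runs.items():
    stats = pd.read_csv(path)
    # The "Aggregated" row summarises all endpoints hit during the run.
    rps[label] = stats.loc[stats["Name"] == "Aggregated", "Requests/s"].iloc[0]

plt.bar(rps.keys(), rps.values())
plt.ylabel("Requests per second")
plt.title("EfficientNet-B0 serving throughput")
plt.savefig("throughput.png")
```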
This project is licensed under the MIT License.