This example provides a training script and an evaluation script. The training script provides an example of training ResNet on CIFAR10 dataset from scratch.
-
Training Arguments
-p,--plugin: Plugin to use. Choices:torch_ddp,torch_ddp_fp16,low_level_zero. Defaults totorch_ddp.-r,--resume: Resume from checkpoint file path. Defaults to-1, which means not resuming.-c,--checkpoint: The folder to save checkpoints. Defaults to./checkpoint.-i,--interval: Epoch interval to save checkpoints. Defaults to5. If set to0, no checkpoint will be saved.--target_acc: Target accuracy. Raise exception if not reached. Defaults toNone.
-
Eval Arguments
-e,--epoch: select the epoch to evaluate-c,--checkpoint: the folder where checkpoints are found
In the demo notebook makeenv.ipynb, it shows how to install and run on Google Colab.
Note that the runtime may restart multiple times for installing the dependencies.
The model used in the experiment: ResNet
The dataset employed: CIFAR-10
Parallel settings: Only 1 GPU is used
Instructions on how to run your code: In makeenv.ipynb
Experiment results, presented in a table or figure:
Accuracy of the model on the test images: 70.33 %
The original code runs with 80 epoches, but to reduce the time on Google Colab, only 10 epoches are ran. So the accuracy is 70.33%.