Hi, thanks for the great work. I am running inference to evaluate FID/IS. When I run torchrun --nnodes=1 --nproc_per_node=N sample_ddp.py --model DiT-XL/2 --num-fid-samples 50000, the run sometimes fails after more than 49k images have been generated, with an error like: RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808262 milliseconds before timing out.
Strangely, across all my runs it only ever fails after 49k images, and sometimes it successfully finishes all 50k. Any thoughts?
I worked around it by passing --num-fid-samples 51000 (or any number larger than 50k). I then manually truncate the extra samples and use only the first 50k to generate the npz file for calculating FID and IS.
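For anyone who wants to script the truncation step, here is a minimal sketch. It assumes the oversized samples are already packed into an .npz file as a single array stored under the key arr_0; that key, the file names, and the helper name truncate_npz are my assumptions for illustration, not something taken from the repo, so adjust them to match your setup.

```python
import numpy as np


def truncate_npz(src_path, dst_path, num_keep=50_000, key="arr_0"):
    """Keep only the first `num_keep` samples of an .npz sample archive.

    Hypothetical helper: `key` is an assumed array name, change it if your
    archive stores the images under a different one.
    """
    # Load the full sample array (e.g. shape [51000, H, W, 3]).
    with np.load(src_path) as data:
        samples = data[key]

    if samples.shape[0] < num_keep:
        raise ValueError(
            f"expected at least {num_keep} samples, found {samples.shape[0]}"
        )

    # Slice along the first (sample) axis and write a new archive.
    kept = samples[:num_keep]
    np.savez(dst_path, **{key: kept})
    return kept.shape
```

Usage would look like truncate_npz("samples_51000.npz", "samples_50000.npz"), with the resulting file passed to the FID/IS evaluation. Both file names are placeholders.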