torch run sample_ddp.py fails around 49k #101

zhengqigao · 2024-11-08T16:59:36Z

Hi, thanks for the great work. I am running inference and evaluating FID/IS. When I run torchrun --nnodes=1 --nproc_per_node=N sample_ddp.py --model DiT-XL/2 --num-fid-samples 50000, it sometimes fails after more than 49k images with an error like: RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808262 milliseconds before timing out.

Oddly, in all my runs it only fails after 49k images, and sometimes it successfully finishes all 50k. Any thoughts?

feufhd · 2025-01-02T16:35:33Z

Me too! Have you solved it?

zhengqigao · 2025-01-02T20:18:22Z

I solved it by passing --num-fid-samples 51000, or any number larger than 50k. Then I manually truncated the extra samples, using only the first 50k to generate the npz file for calculating FID and IS.
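For anyone who wants to script the truncation step, here is a minimal sketch. It assumes the oversampled run already produced an .npz archive holding the images as a single array with one sample per row, stored under NumPy's default key arr_0. The function name truncate_npz and the file names in the usage line are hypothetical, not part of the repo.

```python
import numpy as np


def truncate_npz(src_path, dst_path, num=50_000, key="arr_0"):
    """Keep only the first `num` samples of an .npz sample archive.

    Assumption: the archive stores all samples in one array under `key`,
    with the sample index as the first axis.
    """
    with np.load(src_path) as data:
        samples = data[key]
    if samples.shape[0] < num:
        raise ValueError(
            f"expected at least {num} samples, found {samples.shape[0]}"
        )
    trimmed = samples[:num]
    # np.savez appends ".npz" to dst_path if it is not already there.
    np.savez(dst_path, **{key: trimmed})
    return trimmed.shape
```

It could be called as truncate_npz("samples_51000.npz", "samples_50000.npz"), after which the second file is the one passed to the FID/IS evaluator.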

feufhd · 2025-01-03T02:00:27Z

Thank you very much!!
