torch run sample_ddp.py fails around 49k #101

Open
zhengqigao opened this issue Nov 8, 2024 · 3 comments

Comments

@zhengqigao

Hi, thanks for the great work. I am running inference and evaluating FID/IS. When I run

torchrun --nnodes=1 --nproc_per_node=N sample_ddp.py --model DiT-XL/2 --num-fid-samples 50000

it sometimes fails after more than 49k samples with an error like:

RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808262 milliseconds before timing out.

Oddly, across all my runs it only ever fails after 49k images, and sometimes it successfully finishes all 50k. Any thoughts?
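For readers decoding the watchdog message, the two millisecond fields can be converted to more readable units. This is a minimal sketch that uses only the numbers quoted in the error above; the helper name `describe_timeout` is hypothetical and is not part of PyTorch or of this repository.

```python
from datetime import timedelta


def describe_timeout(timeout_ms: int, elapsed_ms: int) -> str:
    """Summarize the two millisecond fields of a watchdog message (hypothetical helper)."""
    limit = timedelta(milliseconds=timeout_ms)
    elapsed = timedelta(milliseconds=elapsed_ms)
    overshoot = elapsed - limit
    return (
        f"limit={limit.total_seconds() / 60:.0f} min, "
        f"elapsed={elapsed.total_seconds() / 60:.2f} min, "
        f"overshoot={overshoot.total_seconds():.3f} s"
    )


# Values copied from the error message quoted above.
print(describe_timeout(1_800_000, 1_808_262))
```

So rank 3 waited on the ALLREDUCE for the full 30-minute limit before the watchdog aborted it. That pattern is consistent with another rank never reaching the same collective, though the thread does not confirm the cause. `torch.distributed.init_process_group` accepts a `timeout` argument, but raising it would presumably only help if the other rank were slow rather than stuck.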

@feufhd

feufhd commented Jan 2, 2025

Me too! Have you solved it?

@zhengqigao
Author

> Me too! Have you solved it?

I solved it by passing --num-fid-samples 51000 (or any number larger than 50k). I then manually discard the extra samples and use only the first 50k to generate the npz file for calculating FID and IS.
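The truncation step could be sketched as follows. This is an illustration under stated assumptions, not the repository's code: it assumes the samples were written to a single folder as PNG files with integer names such as 000000.png, and the function name `first_n_samples` is hypothetical.

```python
from pathlib import Path


def first_n_samples(sample_dir: str, n: int = 50_000) -> list[Path]:
    """Return the n lowest-numbered PNG files in sample_dir (hypothetical helper).

    Assumes every file stem is an integer, e.g. 000000.png, 000001.png, ...
    """
    # Sort numerically by file stem so the order does not depend on zero padding.
    files = sorted(Path(sample_dir).glob("*.png"), key=lambda p: int(p.stem))
    if len(files) < n:
        raise ValueError(f"need {n} samples, found only {len(files)}")
    # Drop everything past the first n files (the "extra" samples).
    return files[:n]
```

Each returned path can then be loaded and stacked into one array before writing the npz file, for example with `numpy.savez`. Failing loudly when fewer than n files exist avoids silently computing FID on an incomplete sample set.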

@feufhd

feufhd commented Jan 3, 2025

Thank you very much!!
