Debugging multi-GPU issue #161

Closed
2 tasks done
vwxyzjn opened this issue May 23, 2022 · 3 comments

@vwxyzjn
Contributor

vwxyzjn commented May 23, 2022

In IsaacGymEnvs, rl-games + multiGPU seems to have some issues. As shown in the screenshot, rl-games + multiGPU uses twice the amount of data and performs worse than the single-GPU setting in Ant.

image

This issue tracks the investigation of that problem.

Proposed debugging route

I suggest first making sure there is no loss in sample efficiency before scaling to more envs, by matching the implementation details in our CleanRL prototype: https://cleanrl-git-new-multi-gpu-vwxyzjn.vercel.app/rl-algorithms/ppo/#implementation-details_6.

Identified issues:

1. Seeding logic and configuration issue

We need to seed the multiGPU processes with different seeds to decorrelate experience; otherwise they will all produce exactly the same observations.

Configuration-wise, we can set the overall seed with params.seed and the env seed with params.config.env_config.seed. If params.config.env_config.seed is set but params.seed is not, we get identical observations from the environments, as shown below:

image

This is probably ok since the agent still samples different actions, but it's nonetheless a problem. The correct implementation is to use seed = seed + local_rank.
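A minimal sketch of the `seed = seed + local_rank` idea, using only the Python standard library. The helper names `rank_seed` and `make_rng` are hypothetical and are not taken from rl-games; how `local_rank` is obtained depends on the launcher (with Horovod it would typically come from `hvd.local_rank()`), so it is passed in as a plain argument here.

```python
import random


def rank_seed(base_seed: int, local_rank: int) -> int:
    """Offset the shared base seed by the process rank (hypothetical helper)."""
    return base_seed + local_rank


def make_rng(base_seed: int, local_rank: int) -> random.Random:
    """Build a per-process random generator from the rank-adjusted seed.

    In a real training script the same rank-adjusted seed would presumably
    also be passed to the other seed sources (for example torch.manual_seed,
    numpy.random.seed, and the environment seed); that wiring is an
    assumption and is not shown here.
    """
    return random.Random(rank_seed(base_seed, local_rank))


if __name__ == "__main__":
    base_seed = 42
    for local_rank in range(2):
        rng = make_rng(base_seed, local_rank)
        # Each rank draws from a different stream, so samples decorrelate.
        print(local_rank, [round(rng.random(), 3) for _ in range(3)])
```

With the same base seed, each rank gets a distinct but reproducible stream, which covers the case described above where only one of the two seed settings is configured.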

2. Stepping logic issue

After fixing #163, I was able to match the sample efficiency in the single GPU setting:

image

However, the wall time is slower than I had expected. In a separate benchmark I made with CleanRL, the experiments show that Horovod should make Ant step 20% faster.

Maybe it's the overhead of averaging stats? In the CleanRL benchmark experiments I did not touch the stats at all.

image
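One way to check the stats-overhead hypothesis would be to time the stepping loop with and without the stats averaging and compare throughput. The sketch below is only an illustration: `steps_per_second` and `relative_slowdown` are hypothetical helpers, and `step_fn` stands in for whatever callable performs one environment or training step.

```python
import time
from typing import Callable


def steps_per_second(
    step_fn: Callable[[], None],
    num_steps: int,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run step_fn num_steps times and return the measured throughput."""
    start = clock()
    for _ in range(num_steps):
        step_fn()
    elapsed = clock() - start
    return num_steps / elapsed if elapsed > 0 else float("inf")


def relative_slowdown(baseline_sps: float, candidate_sps: float) -> float:
    """Fraction of baseline throughput lost by the candidate (0.2 means 20% slower)."""
    return (baseline_sps - candidate_sps) / baseline_sps


if __name__ == "__main__":
    # Placeholder workloads; replace with the real step with and without stats.
    def step_without_stats() -> None:
        sum(range(1000))

    def step_with_stats() -> None:
        sum(range(1000))
        sum(range(200))  # stands in for the extra stats work

    base = steps_per_second(step_without_stats, 10_000)
    cand = steps_per_second(step_with_stats, 10_000)
    print(f"slowdown: {relative_slowdown(base, cand):.1%}")
```

The `clock` parameter is injectable so the measurement logic can be checked deterministically; on a GPU workload the device would also need to be synchronized before reading the clock, which this sketch does not do.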

@Denys88
Owner

Denys88 commented Jun 19, 2022

@vwxyzjn can we close it?

@vwxyzjn
Contributor Author

vwxyzjn commented Jun 19, 2022

Closed by #171

@vwxyzjn vwxyzjn closed this as completed Jun 19, 2022
@1tac11

1tac11 commented Apr 18, 2023

Hi there,
Is multi-instance multi-GPU working?
