PR #20086: [NVIDIA GPU] Fix mem p2p init in collective permute thunk #20630
+179
−19
Imported from GitHub PR #20086
Move pointer initialization from runtime to the thunk init stage, which removes the blocking wait at runtime.
Add a device sync point using an NCCL all-reduce before the memcpy so that all GPUs reach the same stage. Without it, data corruption is possible when the receiving rank has not yet arrived at the memcpy.
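The two changes described above can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the XLA implementation: Python threads stand in for GPU ranks, `threading.Barrier` stands in for the NCCL all-reduce sync point, and a list write stands in for the peer-to-peer memcpy. The names `Rank`, `initialize`, `execute`, and `run_simulation` are hypothetical. The second barrier in `execute` exists only so the simulation can read the result deterministically; it is not claimed to be part of the PR.

```python
import threading


class Rank:
    """Hypothetical stand-in for one GPU rank (not an XLA type)."""

    def __init__(self, rank_id, num_ranks, registry, barrier):
        self.rank_id = rank_id
        self.peer = (rank_id + 1) % num_ranks  # ring permute: send to next rank
        self.registry = registry               # shared table of receive buffers
        self.barrier = barrier                 # stands in for the all-reduce sync
        self.recv_buffer = [None]              # one-slot "device" buffer
        self.peer_buffer = None                # resolved once, at init
        self.received = []

    def initialize(self):
        # Init stage: publish our receive buffer, wait once for all ranks,
        # then cache the peer's buffer so execute() never blocks on lookup.
        self.registry[self.rank_id] = self.recv_buffer
        self.barrier.wait()
        self.peer_buffer = self.registry[self.peer]

    def execute(self, value):
        # Sync point before the copy: every rank must have arrived here,
        # which implies it has finished reading the previous iteration.
        self.barrier.wait()
        self.peer_buffer[0] = value            # the simulated p2p "memcpy"
        # Simulation-only barrier so reads happen after all writes landed.
        self.barrier.wait()
        self.received.append(self.recv_buffer[0])


def run_simulation(num_ranks, steps):
    """Run `steps` ring permutes across `num_ranks` simulated ranks."""
    registry = [None] * num_ranks
    barrier = threading.Barrier(num_ranks)
    ranks = [Rank(i, num_ranks, registry, barrier) for i in range(num_ranks)]

    def worker(rank):
        rank.initialize()
        for step in range(steps):
            rank.execute((step, rank.rank_id))

    threads = [threading.Thread(target=worker, args=(r,)) for r in ranks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [r.received for r in ranks]
```

Calling `run_simulation(4, 3)` returns, for each rank, the values it received from its predecessor in the ring over three loop iterations. Dropping the first barrier in `execute` would let a fast rank overwrite a peer's buffer before the peer has read the previous value, which is the kind of corruption the sync point is meant to prevent.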
Copybara import of the project:
--
ba4ad04 by TJ Xu [email protected]:
Moved pointer init to thunk init stage and added a sync point before doing
memcpy to ensure data consistency across ranks
--
050bc59 by TJ Xu [email protected]:
Added e2e test for memcpy p2p in a loop
--
1f75328 by TJ Xu [email protected]:
Added return status for cleanup functions
Merging this change closes #20086
FUTURE_COPYBARA_INTEGRATE_REVIEW=#20086 from Tixxx:tixxx/memcpy_p2p_fix 1f75328