
[NVIDIA GPU] Fix mem p2p init in collective permute thunk #20086

Closed
wants to merge 3 commits from Tixxx:tixxx/memcpy_p2p_fix

Conversation


@Tixxx Tixxx commented Dec 3, 2024

Move pointer initialization to the thunk initialization stage instead of runtime, to get rid of the runtime blocking wait.
Add a device sync point using an NCCL all-reduce before doing the memcpy, to make sure all GPUs arrive at the same stage; otherwise, data corruption is possible when the receiving rank hasn't arrived at the memcpy.
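The sync-point idea can be illustrated with a small standalone sketch. This is not the actual thunk code: SyncThenCopy, barrier_buf, and the pointer arguments are assumptions for illustration. A one-element ncclAllReduce enqueued by every rank acts as a device-side barrier on the stream, so the copy cannot begin before the receiving rank has arrived.

#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical helper, not XLA's thunk code: enqueue a 1-element
// all-reduce as a device-side barrier, then do the device-to-device copy.
ncclResult_t SyncThenCopy(ncclComm_t comm, cudaStream_t stream,
                          float* barrier_buf,  // 1-element device scratch buffer
                          void* dst, const void* src, size_t bytes) {
  // The all-reduce kernel completes on this rank's stream only once every
  // rank in `comm` is executing it, so all GPUs arrive at the same stage.
  ncclResult_t r = ncclAllReduce(barrier_buf, barrier_buf, 1, ncclFloat,
                                 ncclSum, comm, stream);
  if (r != ncclSuccess) return r;
  // The receiving rank has now reached this point, so the copy cannot race
  // with its earlier use of the destination buffer.
  cudaError_t e = cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice,
                                  stream);
  return (e == cudaSuccess) ? ncclSuccess : ncclInternalError;
}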

@reedwm reedwm requested a review from frgossen December 4, 2024 05:45
@Tixxx Tixxx force-pushed the tixxx/memcpy_p2p_fix branch from 9d3a8a4 to 10d3501 Compare December 9, 2024 06:05
Moved pointer init to thunk init stage and add a sync point before doing memcpy to make sure data consistency across ranks
@Tixxx Tixxx force-pushed the tixxx/memcpy_p2p_fix branch from 10d3501 to 050bc59 Compare December 12, 2024 05:28
@Tixxx Tixxx changed the title Fix mem p2p init in collective permute thunk [NVIDIA GPU] Fix mem p2p init in collective permute thunk Dec 12, 2024
@frgossen frgossen (Member) left a comment

Thanks for the fix!

copybara-service bot pushed a commit that referenced this pull request Dec 12, 2024
Imported from GitHub PR #20086

Move pointer initialization to the thunk init stage instead of runtime to get rid of the runtime blocking wait.
Add a device sync point using nccl allreduce before doing memcpy to make sure all gpus arrive at the same stage. Otherwise it's possible to have data corruptions when the receiving rank hasn't arrived at the memcpy.
Copybara import of the project:

--
ba4ad04 by TJ Xu <[email protected]>:

Moved pointer init to thunk init stage and add a sync point before doing
memcpy to make sure data consistency across ranks

--
050bc59 by TJ Xu <[email protected]>:

Added e2e test for mem cpy p2p in a loop

Merging this change closes #20086

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20086 from Tixxx:tixxx/memcpy_p2p_fix 050bc59
PiperOrigin-RevId: 705647424
if (!params.executor->HostMemoryUnregister(&barrier_flags_[current_id])) {
  LOG(ERROR) << "Unregistering barrier flag failed.";
}
A reviewer (Member) commented:

Several TensorFlow tests are failing with:
error: non-void function does not return a value in all control paths [-Werror,-Wreturn-type]

The author (@Tixxx) replied:

Added a return status for the cleanup function.
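A minimal sketch of the shape of that fix (an illustrative assumption, not the exact XLA change): make the cleanup return absl::Status on every control path instead of only logging, which resolves the -Wreturn-type error above.

#include "absl/status/status.h"

// Inside the thunk class that owns barrier_flags_ (illustrative only).
absl::Status CleanupBarrierFlag(se::StreamExecutor* executor, int current_id) {
  if (!executor->HostMemoryUnregister(&barrier_flags_[current_id])) {
    // Returning an error instead of just LOG(ERROR) gives this path a
    // value, satisfying -Wreturn-type and letting callers propagate it.
    return absl::InternalError("Unregistering barrier flag failed.");
  }
  return absl::OkStatus();
}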

@Tixxx Tixxx requested a review from thomasjoerg December 16, 2024 19:26
copybara-service bot pushed a commit that referenced this pull request Dec 17, 2024
Imported from GitHub PR #20086

Move pointer initialization to the thunk init stage instead of runtime to get rid of the runtime blocking wait.
Add a device sync point using nccl allreduce before doing memcpy to make sure all gpus arrive at the same stage. Otherwise it's possible to have data corruptions when the receiving rank hasn't arrived at the memcpy.
Copybara import of the project:

--
ba4ad04 by TJ Xu <[email protected]>:

Moved pointer init to thunk init stage and add a sync point before doing
memcpy to make sure data consistency across ranks

--
050bc59 by TJ Xu <[email protected]>:

Added e2e test for mem cpy p2p in a loop

--
1f75328 by TJ Xu <[email protected]>:

Added return status for cleanup functions

Merging this change closes #20086

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20086 from Tixxx:tixxx/memcpy_p2p_fix 1f75328
PiperOrigin-RevId: 707074350
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 17, 2024
Imported from GitHub PR openxla/xla#20086

Move pointer initialization to the thunk init stage instead of runtime to get rid of the runtime blocking wait.
Add a device sync point using nccl allreduce before doing memcpy to make sure all gpus arrive at the same stage. Otherwise it's possible to have data corruptions when the receiving rank hasn't arrived at the memcpy.
Copybara import of the project:

--
ba4ad0445f27d7249b4bcebb4ac573188cf50cb0 by TJ Xu <[email protected]>:

Moved pointer init to thunk init stage and add a sync point before doing
memcpy to make sure data consistency across ranks

--
050bc59c02732da728fe43bd6c4c12702d070c2c by TJ Xu <[email protected]>:

Added e2e test for mem cpy p2p in a loop

--
1f7532815dfdbb6d047339d7189c1287dc72e6a3 by TJ Xu <[email protected]>:

Added return status for cleanup functions

Merging this change closes #20086

PiperOrigin-RevId: 707145351
@reedwm (Member) commented Dec 21, 2024

I'm seeing this regress the Maxtext Llama 7B model with 4-way FSDP and 2-way TP when collective matmul is enabled, using the script you gave me @Tixxx. I see Tokens/s/device: 4125.938 before this PR and Tokens/s/device: 3827.136 after it.
