The CUDA coredump file points to the location of the IMA:
CUDA Exception: Warp Out-of-range Address
The exception was triggered at PC 0x7f36c09b68b0
[Current focus set to CUDA kernel 0, grid 1532, block (8,0,0), thread (32,0,0), device 0, sm 0, warp 2, lane 0]
#0 0x00007f36c09b6970 in triton_gemm_dot_1<<<(16384,1,1),(128,1,1)>>> ()
I guess Nvidia has a better understanding of the status of H100 support in Triton. We could disable Triton GEMM in XLA on SM90 until that support gets regular automated testing in Triton.
Running t5x on an H100 hit a CUDA illegal memory access (IMA) error. The error can be reproduced by running the attached HLO:
train.txt
The faulting kernel is triton_gemm_dot_1, so disabling Triton GEMM works around the error.
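For anyone who needs the workaround from a Python launcher, here is a minimal sketch. It assumes the XLA debug option is spelled `xla_gpu_enable_triton_gemm` and is picked up from the `XLA_FLAGS` environment variable. The helper name `with_triton_gemm_disabled` is made up for illustration and is not part of XLA or t5x.

```python
import os


def with_triton_gemm_disabled(xla_flags: str = "") -> str:
    """Return an XLA_FLAGS string with Triton GEMM turned off.

    Hypothetical helper: the flag name is an assumption, see the note above.
    """
    flag = "--xla_gpu_enable_triton_gemm"
    # Drop any earlier setting of the same flag so only ours remains.
    kept = [f for f in xla_flags.split() if f.split("=", 1)[0] != flag]
    kept.append(f"{flag}=false")
    return " ".join(kept)


# Set this before JAX/XLA initializes the GPU backend, i.e. before the
# first `import jax` in the process.
os.environ["XLA_FLAGS"] = with_triton_gemm_disabled(os.environ.get("XLA_FLAGS", ""))
```

Note that this turns Triton GEMM off on every GPU, not only on SM90, so it is a stopgap rather than the gated disable proposed above.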