
CUDA illegal memory access on H100 from triton gemm #2800

Closed
wenscarl opened this issue May 4, 2023 · 4 comments
wenscarl commented May 4, 2023

Running t5x on H100 hits a CUDA illegal memory access (IMA) error. The error can be reproduced by running the attached HLO:

bazel-bin/tensorflow/compiler/xla/tools/run_hlo_module --platform=gpu ./train.txt

The CUDA core dump points to the location of the IMA:

CUDA Exception: Warp Out-of-range Address
The exception was triggered at PC 0x7f36c09b68b0
[Current focus set to CUDA kernel 0, grid 1532, block (8,0,0), thread (32,0,0), device 0, sm 0, warp 2, lane 0]
#0  0x00007f36c09b6970 in triton_gemm_dot_1<<<(16384,1,1),(128,1,1)>>> ()

train.txt

Disabling triton gemm works around the issue.
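Putting the repro and the workaround together, a minimal shell sketch (assuming the same bazel-built XLA tree as the repro command above; the flag disables only the Triton GEMM rewriter, so XLA falls back to its other GEMM paths):

```shell
# Reproduce: run the attached HLO on GPU; on H100 this hits the IMA
# inside a triton_gemm_dot kernel.
bazel-bin/tensorflow/compiler/xla/tools/run_hlo_module --platform=gpu ./train.txt

# Workaround: disable the Triton GEMM rewriter so XLA uses its
# non-Triton GEMM lowering (e.g. cuBLAS) instead.
XLA_FLAGS=--xla_gpu_enable_triton_gemm=false \
  bazel-bin/tensorflow/compiler/xla/tools/run_hlo_module --platform=gpu ./train.txt
```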

philipphack (Contributor) commented
CC @reedwm.

reedwm (Member) commented May 4, 2023

I confirmed that on an H100, train.txt gives a CUDA_ERROR_ILLEGAL_ADDRESS normally but works with XLA_FLAGS=--xla_gpu_enable_triton_gemm=false

@sergachev, can you look into this?

sergachev (Contributor) commented
Nvidia presumably has a better understanding of the status of Triton's H100 support. We could disable Triton GEMM in XLA on SM90 until that support gets regular automated testing in Triton.
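The suggested gating could be expressed as a hypothetical shell helper (not XLA's actual code; the flag name is the one from this thread, and the SM90 cutoff is an assumption based on the comment above):

```shell
# Hypothetical helper: return the XLA flag needed to disable Triton GEMM
# on Hopper (SM90, compute capability major version 9) and newer.
triton_gemm_flags() {
  local cc_major="$1"   # e.g. 9 for H100 (SM90), 8 for A100 (SM80)
  if [ "$cc_major" -ge 9 ]; then
    echo "--xla_gpu_enable_triton_gemm=false"
  fi
}

# On a live system the major version could be queried with, e.g.:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader | cut -d. -f1
export XLA_FLAGS="$(triton_gemm_flags 9)"
```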

sergachev (Contributor) commented
This was fixed by 7ca3080.
