
CUDA illegal memory access on H100 from triton gemm #2800

Closed
wenscarl opened this issue May 4, 2023 · 4 comments
wenscarl commented May 4, 2023

Running t5x on H100 hits a CUDA illegal memory access (IMA) error. The error can be reproduced by running the attached HLO:

bazel-bin/tensorflow/compiler/xla/tools/run_hlo_module --platform=gpu ./train.txt

The CUDA core dump points to the location of the IMA:

CUDA Exception: Warp Out-of-range Address
The exception was triggered at PC 0x7f36c09b68b0
[Current focus set to CUDA kernel 0, grid 1532, block (8,0,0), thread (32,0,0), device 0, sm 0, warp 2, lane 0]
#0  0x00007f36c09b6970 in triton_gemm_dot_1<<<(16384,1,1),(128,1,1)>>> ()

train.txt

Disabling triton gemm works around the issue.
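Putting the repro and the workaround together, a minimal shell sketch (assuming the same bazel-built XLA tree as the repro command above; the flag disables only the Triton GEMM rewriter, so XLA falls back to its other GEMM paths):

```shell
# Reproduce: run the attached HLO on GPU; on H100 this hits the IMA
# inside a triton_gemm_dot kernel.
bazel-bin/tensorflow/compiler/xla/tools/run_hlo_module --platform=gpu ./train.txt

# Workaround: disable the Triton GEMM rewriter so XLA uses its
# non-Triton GEMM lowering (e.g. cuBLAS) instead.
XLA_FLAGS=--xla_gpu_enable_triton_gemm=false \
  bazel-bin/tensorflow/compiler/xla/tools/run_hlo_module --platform=gpu ./train.txt
```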

philipphack (Contributor) commented
CC @reedwm.

reedwm (Member) commented May 4, 2023

I confirmed that on an H100, train.txt gives a CUDA_ERROR_ILLEGAL_ADDRESS normally but works with XLA_FLAGS=--xla_gpu_enable_triton_gemm=false

@sergachev, can you look into this?

sergachev (Contributor) commented
Nvidia presumably has a better understanding of the status of Triton's H100 support. We could disable Triton GEMM in XLA on SM90 until that support gets regular automated testing in Triton.
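The suggested gating could be expressed as a hypothetical shell helper (not XLA's actual code; the flag name is the one from this thread, and the SM90 cutoff is an assumption based on the comment above):

```shell
# Hypothetical helper: return the XLA flag needed to disable Triton GEMM
# on Hopper (SM90, compute capability major version 9) and newer.
triton_gemm_flags() {
  local cc_major="$1"   # e.g. 9 for H100 (SM90), 8 for A100 (SM80)
  if [ "$cc_major" -ge 9 ]; then
    echo "--xla_gpu_enable_triton_gemm=false"
  fi
}

# On a live system the major version could be queried with, e.g.:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader | cut -d. -f1
export XLA_FLAGS="$(triton_gemm_flags 9)"
```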

sergachev (Contributor) commented
This was fixed by 7ca3080.
