[Regression] Gradient explodes after upgrading to JAX 0.4.33 from 0.4.30 #17922

qGentry opened this issue Oct 4, 2024 · 11 comments · Fixed by tensorflow/tensorflow#77665 · May be fixed by #18192 or tensorflow/tensorflow#77654
Closed

Comments


qGentry commented Oct 4, 2024

Description

I'm training a LLaMA-3.1-like transformer architecture in a hybrid sharded data parallel + context parallel setup on 32 GPUs.
Upgrading to JAX 0.4.33 has broken training of the 70B model - the loss becomes NaN after a single training step.
Evidence I've collected so far:

  • The loss on the first step is exactly the same on 0.4.33 and 0.4.30 (screenshot attached).
  • The gradients of the unembedding layer and the last layer norm are also exactly the same on the first step (screenshot attached).
  • The gradient already explodes for the last transformer layer's MLP hidden->output weight matrix, which I believe is the first weight matrix after token_unembedding and the last layer norm in the backward pass (screenshot attached; a per-parameter NaN-check sketch follows after this list).
  • On JAX 0.4.30 and 0.4.29 I've trained tens of such models with different hyperparameters and datasets and have never seen a NaN.

  • So far, I haven't been able to reproduce this behavior on smaller models, but I'm working on it.

  • XLA dumps attached
    xla_dump_0_4_30.tar.gz
    xla_dump_0_4_33.tar.gz
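
For reference, here is a minimal sketch of a per-parameter gradient check that can be used to localize NaNs and exploding norms. This is not the actual training code; loss_fn, params, and batch are placeholders.

import jax
import jax.numpy as jnp

def grad_report(grads):
    # grads: any pytree of gradient arrays, e.g. the output of jax.grad(loss_fn)(params, batch)
    leaves, _ = jax.tree_util.tree_flatten_with_path(grads)
    for path, g in leaves:
        g32 = g.astype(jnp.float32)
        has_nan = bool(jnp.isnan(g32).any())         # any NaN in this parameter's gradient?
        norm = float(jnp.sqrt(jnp.sum(g32 ** 2)))    # global L2 norm of the gradient leaf
        print(f"{jax.tree_util.keystr(path)}: nan={has_nan} norm={norm:.3e}")

# Usage (placeholders):
# grads = jax.grad(loss_fn)(params, batch)
# grad_report(grads)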

System info (python version, jaxlib version, accelerator, etc.)

Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax; jax.print_environment_info()
jax:    0.4.33
jaxlib: 0.4.33
numpy:  1.24.3
python: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
jax.devices (8 total, 8 local): [CudaDevice(id=0) CudaDevice(id=1) ... CudaDevice(id=6) CudaDevice(id=7)]
process_count: 1
platform: uname_result(system='Linux', node='computeinstance-e00xy41pgq1s49hjc5', release='5.15.0-118-generic', version='#128-Ubuntu SMP Fri Jul 5 09:28:59 UTC 2024', machine='x86_64')


$ nvidia-smi
Fri Oct  4 10:07:59 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:8D:00.0 Off |                    0 |
| N/A   28C    P0            110W /  700W |     538MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:91:00.0 Off |                    0 |
| N/A   27C    P0            110W /  700W |     538MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:95:00.0 Off |                    0 |
| N/A   30C    P0            110W /  700W |     538MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:99:00.0 Off |                    0 |
| N/A   27C    P0            112W /  700W |     538MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:AB:00.0 Off |                    0 |
| N/A   28C    P0            109W /  700W |     538MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:AF:00.0 Off |                    0 |
| N/A   26C    P0            109W /  700W |     538MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:B3:00.0 Off |                    0 |
| N/A   29C    P0            112W /  700W |     538MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:B7:00.0 Off |                    0 |
| N/A   27C    P0            110W /  700W |     538MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Labels: JAX issue


qGentry commented Oct 4, 2024

One important update - I've tracked the problem down to the use of scan:
when I use scan on JAX 0.4.33, gradients explode; when I don't use scan (and compilation time increases by roughly a factor of 50), gradients are fine. Using scan on JAX 0.4.30 causes no problems - gradients are also fine. (Screenshot attached; a toy sketch of the two variants follows below.)
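
To make the comparison concrete, here is a toy sketch of the two variants - not the actual model code; block is a stand-in residual layer. It contrasts scanning over stacked layer weights with jax.lax.scan against unrolling a Python loop over per-layer weights; the gradients should agree.

import jax
import jax.numpy as jnp

def block(h, w):
    # One toy "layer": a residual matmul with a nonlinearity.
    return h + jnp.tanh(h @ w)

def forward_scan(stacked_w, h):
    # stacked_w has shape (num_layers, dim, dim); scan carries the activations through the layers.
    def step(carry, w):
        return block(carry, w), None
    out, _ = jax.lax.scan(step, h, stacked_w)
    return jnp.mean(out ** 2)

def forward_unrolled(layer_ws, h):
    # layer_ws is a Python list of (dim, dim) weights; each layer is traced separately,
    # which is what makes compilation so much slower without scan.
    for w in layer_ws:
        h = block(h, w)
    return jnp.mean(h ** 2)

key = jax.random.PRNGKey(0)
num_layers, dim = 4, 64
stacked_w = 0.02 * jax.random.normal(key, (num_layers, dim, dim))
h0 = jax.random.normal(key, (8, dim))

g_scan = jax.jit(jax.grad(forward_scan))(stacked_w, h0)
g_unrolled = jax.jit(jax.grad(forward_unrolled))(list(stacked_w), h0)
print(jnp.abs(g_scan - jnp.stack(g_unrolled)).max())  # should be ~0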


qGentry commented Oct 4, 2024

Meanwhile, a single-node, 8-GPU 8B model with a very similar structure (hybrid sharded data parallelism plus data and context parallelism) does not reproduce the problem - even with scan, gradients are almost identical between 0.4.30 and 0.4.33. (Screenshot attached.)


qGentry commented Oct 4, 2024

Another observation - I ran an 8B model with exactly the same configuration, sharding, dataset, etc. as the 70B, from scratch (freshly initialized, without restoring from a checkpoint).
The 8B on 0.4.33 matches the 8B on 0.4.30 perfectly, while the 70B on 0.4.33 explodes and the 70B on 0.4.30 works properly. (Screenshot attached.)

Here is the entire configuration diff between the two (screenshot attached).


qGentry commented Oct 4, 2024

I've also tried to iteratively transform the 8B into the 70B to see at which point it starts to explode. Here are my results:
8B -> ok
8B + 80 layers -> ok
8B + 80 layers + 64 attention heads -> ok
8B + 80 layers + 64 attention heads + 8192 dim -> ok
8B + 80 layers + 64 attention heads + 8192 dim + hidden_dim 28672 -> explosion (at this point the model is basically equivalent to the 70B)

(Screenshot attached.)


qGentry commented Oct 4, 2024

JAX 0.4.34 has just been released.
I've tested it - unfortunately, the results are exactly the same as with JAX 0.4.33: gradient explosion.


qGentry commented Oct 4, 2024

Given that the 70B with hidden_dim ~28k explodes while the one with ~14k doesn't, I've bisected the exact value of hidden_dim (rounded to the nearest 16) at which the explosions start.
Here are my results - the dichotomy is very clear:

Gradients do not explode if the MLP's hidden_dim <= 20704.
Gradients explode if the MLP's hidden_dim >= 20720.

It looks like once the weights reach a certain size, XLA/CUDA switches to an alternative algorithm, reorders something, etc., which leads to incorrect computations.

(Screenshots attached.)


qGentry commented Oct 5, 2024

Here are the dumped HLOs of the compiled training step for hidden_dim=20704 (gradients not exploding) and hidden_dim=20720 (gradients exploding):
compiled_train_fn_20704.txt
compiled_train_fn_20720.txt
And also for hidden_dim=20688 (not exploding):
compiled_train_fn_20688.txt
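
For anyone who wants to generate similar dumps: XLA's dump flags can be passed via XLA_FLAGS before JAX initializes its backend. A minimal sketch - the dump directory and train_step/params/batch are placeholders:

import os
# Must run before the first `import jax` so that XLA picks up the flags.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "")
    + " --xla_dump_to=/tmp/xla_dump"   # placeholder output directory
    + " --xla_dump_hlo_as_text"
)
import jax

# In recent JAX versions the compiled HLO of a single jitted function can also be inspected directly:
# print(jax.jit(train_step).lower(params, batch).compile().as_text())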


akuegel commented Oct 7, 2024

Can you try setting the environment variable XLA_FLAGS=--xla_gpu_enable_dynamic_slice_fusion=false? It seems a recent change to dynamic slice fusion may have been buggy; the author of the patch mentioned that they ran into errors caused by it.


qGentry commented Oct 7, 2024

We've looked into the HLOs with @jaro-sevcik and noticed that the only diff between the exploding and non-exploding variants is additional copies introduced by the rematerialization pass. Disabling that pass with --xla_disable_hlo_passes=rematerialization seems to solve the issue; a sketch of the workaround is below.

@akuegel I'll also try the flag you mentioned; give me a couple of minutes.
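
Concretely, a sketch of applying this workaround - the flag just needs to be visible to XLA before the backend initializes, so append it to XLA_FLAGS in the shell or in Python before importing jax:

import os
# Append to any existing XLA_FLAGS; must run before the first `import jax`.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "")
    + " --xla_disable_hlo_passes=rematerialization"
)
import jax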


qGentry commented Oct 7, 2024

Nope, setting --xla_gpu_enable_dynamic_slice_fusion=false doesn't help; gradients are still exploding.

copybara-service bot pushed a commit that referenced this issue Oct 11, 2024
Imported from GitHub PR #18152

Fusion wrapping copies breaks the logic for detecting copies from copy-insertion in rematerialization pass.

This patch avoids wrapping copy instructions and instead emits them directly in IrEmitterUnnested.

This should fix #17922
Copybara import of the project:

--
49daad1 by Jaroslav Sevcik <[email protected]>:

Avoid fusion-wrapping copies

Merging this change closes #18152

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18152 from jaro-sevcik:avoid-fusion-wrapping-copies 49daad1
PiperOrigin-RevId: 684709374

akuegel commented Oct 11, 2024

@jaro-sevcik has created #18152 to fix this.

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Oct 15, 2024
Imported from GitHub PR openxla/xla#18152

Fusion wrapping copies breaks the logic for detecting copies from copy-insertion in rematerialization pass.

This patch avoids wrapping copy instructions and instead emits them directly in IrEmitterUnnested.

This should fix openxla/xla#17922
Copybara import of the project:

--
49daad1836186fd7abe2ad089aa8783f1125f605 by Jaroslav Sevcik <[email protected]>:

Avoid fusion-wrapping copies

Merging this change closes #18152

PiperOrigin-RevId: 686055013