Enable remat checkpoints to host instead of TPU memory #643
Conversation
Thanks!
axlearn/common/attention.py
Outdated
@@ -3874,6 +3874,7 @@ def build_remat_spec(
     ],
     self_attention: bool = True,
     feed_forward: bool = False,
+    offload: bool = False,
Instead of a bool, should we allow the caller to customize offload_dst
directly?
-    offload: bool = False,
+    offload_dst: Optional[Literal["pinned_host"]] = None,
This will make the API more extensible and closer to the JAX API.
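For reference, a minimal sketch (not axlearn's actual implementation) of how such an `offload_dst` argument could map onto JAX's checkpoint policies; the checkpoint names below are illustrative placeholders:

```python
from typing import Literal, Optional

import jax


def make_remat_policy(offload_dst: Optional[Literal["pinned_host"]] = None):
    """Illustrative sketch: builds a remat policy from an offload_dst argument.

    The checkpoint names are placeholders, not axlearn's actual names.
    """
    names = ["q_proj", "k_proj", "v_proj"]
    if offload_dst is None:
        # Save the named activations in device memory, as before.
        return jax.checkpoint_policies.save_only_these_names(*names)
    # Offload the named activations to pinned host memory on the forward
    # pass; they are copied back for the backward pass.
    return jax.checkpoint_policies.save_and_offload_only_these_names(
        names_which_can_be_saved=[],
        names_which_can_be_offloaded=names,
        offload_src="device",
        offload_dst=offload_dst,
    )
```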
Agree. I will take another stab at this PR with focus on staying closer to JAX API and extensibility.
Resolved, could you review again?
Thanks!
axlearn/common/attention.py
Outdated
@@ -3891,6 +3904,7 @@ def build_remat_spec(
         stack_cfg: A transformer config.
         self_attention: Checkpoint self attention layer activations if true.
         feed_forward: Checkpoint feed-forward layer activations if true.
+        offload_dst: Destination of remat checkpointing offloading.
Could you add a link to the JAX documentation on offload_dst and its potential values?
There are no docs yet for this. Do you want me to link to the maxtext code as a comment? https://github.com/google/maxtext/blob/ebd39aa64d670fa13a313b6f776e01ad9e450321/MaxText/layers/models.py#L230
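For context, the linked maxtext code configures a remat policy that offloads named activations to host. A minimal, self-contained sketch of that general pattern using JAX's `save_and_offload_only_these_names` policy (the name "proj" and the toy layer are illustrative, not what maxtext or axlearn actually uses):

```python
import jax
import jax.numpy as jnp
from jax.ad_checkpoint import checkpoint_name

# Offload activations tagged "proj" to pinned host memory; "proj" is an
# illustrative name chosen for this sketch.
policy = jax.checkpoint_policies.save_and_offload_only_these_names(
    names_which_can_be_saved=[],
    names_which_can_be_offloaded=["proj"],
    offload_src="device",
    offload_dst="pinned_host",
)


def layer(x, w):
    # Tag the projection so the policy above can match it by name.
    return jax.nn.relu(checkpoint_name(x @ w, "proj"))


remat_layer = jax.checkpoint(layer, policy=policy)

# Under grad, the tagged activation is fetched back from host rather than
# recomputed or kept in device memory.
grads = jax.jit(jax.grad(lambda w, x: remat_layer(x, w).sum()))(
    jnp.ones((4, 4)), jnp.ones((2, 4))
)
```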
Sounds good. Thanks!
Was this change pushed?
No, I misunderstood Ruoming's comment and thought he was fine with there being no docs. Let me add the link to maxtext as a comment.
Done
@@ -188,6 +188,7 @@ def get_trainer_kwargs(
             num_kv_heads=None if version == Version.V1 else 8,
             rope_theta=rope_theta,
             flash_attention=flash_attention,
+            remat_offload_dst="pinned_host",
Add a comment on the observed MFU and step time?
Done
@@ -188,6 +188,7 @@ def get_trainer_kwargs(
             num_kv_heads=None if version == Version.V1 else 8,
             rope_theta=rope_theta,
             flash_attention=flash_attention,
+            remat_offload_dst="pinned_host",
Is this option limited to 70B? Do we want to apply it to 7B and other models?
Only use remat checkpoint offload to host when you would benefit from the extra TPU memory.
If there is plenty of TPU memory, then remat checkpoint into TPU memory.
If TPU memory is low, or you want to squeeze in as large a per-device batch as possible, then use offload_dst=pinned_host.
So yes, it could make a lot of sense for 7B too on V5e and Trillium, since then we can possibly increase the per-device batch size.
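To make that trade-off concrete, a hypothetical sketch in config form (`remat_offload_dst` is the kwarg from this PR's diff; the model-size cases are illustrative):

```python
def remat_offload_kwargs(model_size: str) -> dict:
    """Hypothetical helper: offload remat checkpoints only when HBM is tight.

    A 70B model's activations crowd TPU memory, so offloading frees HBM; a
    7B model with headroom can keep checkpoints on-device and avoid the
    host-transfer cost, unless a larger per-device batch is the goal.
    """
    if model_size == "70B":
        return {"remat_offload_dst": "pinned_host"}
    return {}  # Plenty of HBM: keep remat checkpoints in TPU memory.
```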
I would prefer to have this PR focus on enabling remat offload for 70B. As a follow-up, I can do the following as part of V5E perf benchmarking:
- Enable remat offload for 7B and compare performance before and after
- See if I can increase the per-device batch size for 7B after enabling remat offload
Would that work for you?
Sounds good. Thanks.
@markblee could you trigger CI, review, and merge if good?
Co-authored-by: Mark Lee <[email protected]>
* remat checkpoints to host
* update golden configs
* Change offload to offload_dst
* add step time and MFU for v5e fuji v2 70b
* Add maxtext example in code comment
* Update axlearn/common/attention.py

Co-authored-by: Mark Lee <[email protected]>

---------

Co-authored-by: Mark Lee <[email protected]>
This allowed us to improve the MFU of fuji v2 70B from 58.50% to 61.83%.