Gradient Accumulation in Axlearn #465
Conversation
Force-pushed from ade2229 to 3394936.
Could you explain why this is needed? We usually find it more efficient to use either a larger mesh or a smaller batch size.
@@ -444,6 +444,153 @@ def _mask_tree(tree: dict, *, keep: dict) -> dict:
    )


class MetricsAccumulationOp(NamedTuple):
Axlearn already has metric accumulation classes that are used by evalers. Could those be reused here instead of defining new classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These metric accumulation classes are stateless so that they can be used as the carry of jax.lax.scan, unlike the ones in the evaler. I can make the class structure similar, though.
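A minimal sketch of that point, assuming a stateless NamedTuple whose values are threaded through jax.lax.scan as the carry; the names below (RunningSum, accumulate, scan_step) are illustrative and are not the classes added in this PR:

import jax
import jax.numpy as jnp
from typing import NamedTuple

class RunningSum(NamedTuple):
    # Stateless container: all state lives in the array values, so an
    # instance is a pytree that can be carried through jax.lax.scan.
    total: jnp.ndarray
    count: jnp.ndarray

def accumulate(acc: RunningSum, value: jnp.ndarray) -> RunningSum:
    return RunningSum(total=acc.total + value, count=acc.count + 1)

def scan_step(carry, microbatch_metric):
    # No per-step outputs are needed; only the carry is updated.
    return accumulate(carry, microbatch_metric), ()

init = RunningSum(total=jnp.zeros(()), count=jnp.zeros((), dtype=jnp.int32))
per_microbatch_loss = jnp.array([0.9, 0.8, 0.7, 0.6])  # e.g. one loss per microbatch
final, _ = jax.lax.scan(scan_step, init, per_microbatch_loss)
mean_loss = final.total / final.count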
# tuple of key-value pairs specifying custom aggregation and normalization
# for a specific metric
metrics_accumulation_key_ops: Sequence[Dict[str, Optional[MetricsAccumulationOp]]] = []
gradient_dtype: Optional[jnp.dtype] = jnp.bfloat16
Does the existing learner class use this? If not, we should try to be consistent with its API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please be more specific? Is the concern about the naming of the members?
Returns:
    ForwardBackwardOutputs: pytree containing gradients and metrics.
"""
I wonder if, instead of having a separate learner for microbatching, it would be more flexible to have a generic way of wrapping a ForwardFn so that it uses jax.lax.map to run the microbatches. Beyond avoiding the need to add a new learner, it would also allow for other microbatching uses outside of the learner, e.g. inference or second-order optimizers.
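A rough sketch of this suggestion, assuming a ForwardFn that maps (params, batch) to (loss, aux); the wrapper name and the batch-splitting convention are illustrative, not axlearn's actual API:

import jax
import jax.numpy as jnp

def microbatched(forward_fn, num_microbatches: int):
    # Wraps a forward function so it runs once per microbatch via jax.lax.map.
    def wrapped(params, batch):
        # Reshape each leaf from [batch, ...] to [num_microbatches, microbatch_size, ...].
        split = lambda x: x.reshape((num_microbatches, -1) + x.shape[1:])
        microbatches = jax.tree_util.tree_map(split, batch)
        # lax.map applies forward_fn to each slice along the leading axis.
        losses, aux = jax.lax.map(lambda mb: forward_fn(params, mb), microbatches)
        return losses.mean(), aux
    return wrapped

The same wrapper could then be composed with jax.grad for training or called directly for inference, which is the flexibility argued for here.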
jax.lax.map gives no guarantee of sequential execution of the microbatches, which is the key property of gradient accumulation.
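By contrast, a generic gradient-accumulation loop with jax.lax.scan threads the accumulated gradients through the carry, which forces the microbatches to run one after another. This is a minimal sketch under the same assumed batch-splitting convention as above, not the PR's implementation:

import jax
import jax.numpy as jnp

def accumulated_grads(loss_fn, params, batch, num_microbatches: int):
    # Split each leaf of the batch into [num_microbatches, microbatch_size, ...].
    microbatches = jax.tree_util.tree_map(
        lambda x: x.reshape((num_microbatches, -1) + x.shape[1:]), batch
    )
    zero = jax.tree_util.tree_map(jnp.zeros_like, params)

    def step(grad_acc, microbatch):
        # The carry (grad_acc) creates a data dependency between steps,
        # so scan evaluates the microbatches sequentially.
        grads = jax.grad(loss_fn)(params, microbatch)
        return jax.tree_util.tree_map(jnp.add, grad_acc, grads), ()

    grad_sum, _ = jax.lax.scan(step, zero, microbatches)
    return jax.tree_util.tree_map(lambda g: g / num_microbatches, grad_sum)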
I’m out of office this week. I have left some preliminary comments for now.
Force-pushed from 3394936 to 2430830.
Closing since gradient accumulation functionality has been implemented via #614.
Gradient accumulation allows training with higher batch sizes without scaling out.
Added a new learner type:
learner.klass: 'axlearn.common.learner.AccumulatedLearner'
At a high level, the optimization splits each batch into microbatches, accumulates the per-microbatch gradients and metrics, and then applies a single optimizer update.
Configuration changes: set microbatches in the learner (see the sketch below).
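An illustrative flat-config override based on the two settings named above; the learner class path is quoted from the description, while the exact key name for the microbatch setting and its value are assumptions:

learner_overrides = {
    # New learner type, as quoted in the PR description above.
    "learner.klass": "axlearn.common.learner.AccumulatedLearner",
    # Number of microbatches to accumulate over (hypothetical key name and value).
    "learner.microbatches": 4,
}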