Allow external positions to be input in RoPE embedding layer #926

Open
Firenze11 wants to merge 11 commits into main

Conversation

Firenze11

Use case: In RoPE embedding, position embeddings are applied to the Q, K, V values after `i_proj`. Unlike the current `RoFormerQKVLinear` implementation, in MaskedDiT we need to customize the positions to indicate masked versus non-masked positions in the position embedding. When we convert this masked RoFormer attention module to flash attention, we need its signature to be supported by `MultiheadAttention`.
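
For context, a minimal sketch (not the axlearn implementation) of why external positions matter: rotary features depend only on the position index, so a caller such as MaskedDiT can supply its own position tensor instead of the default `arange` over the sequence length. The helper name `rope_sinusoidal_embedding` below is illustrative.

```python
import jax.numpy as jnp

def rope_sinusoidal_embedding(positions: jnp.ndarray, dim: int, theta: float = 10000.0) -> jnp.ndarray:
    """Returns rotary sin/cos features of shape [len(positions), dim] for arbitrary positions."""
    exponents = jnp.arange(0, dim, 2) / dim          # [dim // 2]
    inv_freq = 1.0 / (theta ** exponents)            # base frequencies, analogous to the config's `theta`
    angles = positions[:, None] * inv_freq[None, :]  # [seq, dim // 2]
    return jnp.concatenate([jnp.sin(angles), jnp.cos(angles)], axis=-1)

# A MaskedDiT-style caller can supply non-default positions rather than arange(seq_len).
custom_positions = jnp.array([0, 1, 2, 5, 6])                   # e.g. positions after skipping masked tokens
features = rope_sinusoidal_embedding(custom_positions, dim=8)   # shape [5, 8]
```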
@Firenze11 Firenze11 requested review from ruomingp, markblee and a team as code owners January 15, 2025 02:56
@@ -1216,18 +1216,37 @@ class Config(BaseLayer.Config):
dim: Required[int] = REQUIRED # The dimensionality of the positional embedding.
theta: float = 10000.0 # The scale of base frequency.

def forward(self, positions: Tensor) -> Tensor:
def default_query_positions(self, max_seq_len: int) -> Tensor:
Contributor

Suggested change
def default_query_positions(self, max_seq_len: int) -> Tensor:
def _default_query_positions(self, max_seq_len: int) -> Tensor:

Users should pass max_seq_len rather than calling this method publicly.

Author

There might be situations where we want to access the default query positions from the outside, such as getting the default positions and computing custom positions based on them before passing them to forward. Therefore we want to keep this method public.
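
A hedged sketch of that usage pattern (the mask and the sentinel position are illustrative, not part of the PR):

```python
import jax.numpy as jnp

max_seq_len = 8
# What default_query_positions(max_seq_len) would return by default.
default_positions = jnp.arange(max_seq_len)
# Example customization before calling forward(): give masked tokens a sentinel position of 0.
token_is_masked = jnp.array([0, 0, 1, 1, 0, 0, 1, 0], dtype=bool)
custom_positions = jnp.where(token_is_masked, 0, default_positions)
# `custom_positions` would then be passed as `positions` to the embedding layer's forward().
```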

Contributor

Do we expect subclasses to override this method?

If not, it would be more readable for callers to call jnp.arange directly, given that the implementation is only one line.

Author

Yes, we do expect subclasses to override this method. One example is that we might want to specify a rotation start index and a rotation end index for this embedding class, and the default embedding positions would then be different.
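
As a hedged illustration of that kind of override (the class name, the clamping behavior, and the rotation-window semantics are assumptions, not part of this PR):

```python
import jax.numpy as jnp

class WindowedRoPEPositions:
    """Illustrative subclass-style override: default positions restricted to a rotation window."""

    def __init__(self, rotation_start: int, rotation_end: int):
        self.rotation_start = rotation_start
        self.rotation_end = rotation_end

    def default_query_positions(self, max_seq_len: int) -> jnp.ndarray:
        positions = jnp.arange(max_seq_len)
        # Clamp positions outside [rotation_start, rotation_end) so they are not
        # rotated beyond the window; a real override could compute something else.
        return jnp.clip(positions, self.rotation_start, self.rotation_end - 1)
```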

Contributor

Thanks for the explanation. Given this, it's reasonable to consolidate the position computation logic in this class.

@kelvin-zou kelvin-zou (Contributor) left a comment

Thanks

Comment on lines +1243 to +1244
if positions is None:
    if max_seq_len is None:
Contributor

What happens if both `positions` and `max_seq_len` are provided? Should we check that they are consistent?

Author

In that case `positions` takes precedence and we ignore `max_seq_len`. We won't need `max_seq_len` if the client provides explicit positions. Will add that to the docstring.
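
A minimal sketch of the precedence described above (a standalone helper for illustration; the actual change lives inside the layer's forward):

```python
from typing import Optional

import jax.numpy as jnp

def resolve_positions(positions: Optional[jnp.ndarray], max_seq_len: Optional[int]) -> jnp.ndarray:
    """Explicit positions win; max_seq_len is only consulted when positions is None."""
    if positions is None:
        if max_seq_len is None:
            raise ValueError("Either positions or max_seq_len must be provided.")
        positions = jnp.arange(max_seq_len)  # the default_query_positions fallback
    return positions  # if both were given, max_seq_len is simply ignored
```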

Contributor

Maybe we can error if they are both provided?
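
A sketch of the stricter variant the reviewer suggests, treating the ambiguous case as an error (illustrative helper name, not the PR's code):

```python
from typing import Optional

import jax.numpy as jnp

def resolve_positions_strict(positions: Optional[jnp.ndarray], max_seq_len: Optional[int]) -> jnp.ndarray:
    """Rejects the ambiguous case instead of silently preferring positions."""
    if (positions is None) == (max_seq_len is None):
        raise ValueError("Provide exactly one of positions or max_seq_len.")
    return positions if positions is not None else jnp.arange(max_seq_len)
```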
