remove extra wgmma_wait_group in flash attention #2399

donproc · 2023-09-27T00:45:38Z

Performance on H100 SXM5 HBM3
SM clock: 1830 MHz

-[BATCH]-[N_HEADS]-[D_HEAD]-[mode]-[causal]-[seq_par]

fused-attention-batch4-head48-d64-fwd-True-True:
    N_CTX    Before     After
0  1024.0  0.124135  0.120228
1  2048.0  0.391899  0.378483
2  4096.0  1.381547  1.340469
3  8192.0  5.182845  5.013851

fused-attention-batch4-head48-d64-fwd-True-False:
    N_CTX    Before     After
0  1024.0  0.124390  0.120371
1  2048.0  0.392177  0.379371
2  4096.0  1.380980  1.337169
3  8192.0  5.191552  5.013165

fused-attention-batch4-head48-d64-fwd-False-True:
    N_CTX    Before     After
0  1024.0  0.159597  0.153621
1  2048.0  0.563483  0.540353
2  4096.0  2.141308  2.053801
3  8192.0  8.244150  7.886597

fused-attention-batch4-head48-d64-fwd-False-False:
    N_CTX    Before     After
0  1024.0  0.160086  0.153992
1  2048.0  0.563423  0.541524
2  4096.0  2.146597  2.057333
3  8192.0  8.229922  7.886234
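
To put the Before/After columns in perspective, the relative change can be computed per row. This is a minimal sketch, assuming the values are timings where lower is better (the tables do not state a unit); `reductionPercent` is a hypothetical helper name, not part of the PR.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: relative reduction of `after` with respect to
// `before`, in percent. Positive means the "After" value is smaller.
double reductionPercent(double before, double after) {
  return (before - after) / before * 100.0;
}
```

For the N_CTX = 8192 rows this gives roughly 3.3% for fwd-True-True (5.182845 vs 5.013851) and roughly 4.3% for fwd-False-True (8.244150 vs 7.886597).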

bealwang · 2023-09-27T02:01:11Z

lib/Dialect/TritonGPU/Transforms/Pipeline.cpp

-          auto CArg = dotOp.getOperand(2).dyn_cast<BlockArgument>();
-          if (!CArg || !CArg.hasOneUse())
+          auto CArg = dotOp.getOperand(2);
+          auto depend = dependOnTensorBlockArgument(dotOp.getOperand(2));


Maybe we need to check more conditions here. For example:

for (%0, %1, %2) {
  %3 = add %1, %2
  %4 = dot %0, %1, %3
  .... some other ops
  scf.yield %x, %y, %4
}

Launching the DotOp asynchronously in this loop is undefined behavior, because when the AddOp is issued, its operand %2 (the DotOp result carried over from the previous iteration) may not be ready yet.

In fact, the async launch is only valid under these conditions:

1. DotOp does not depend on operations with async semantics.
2. DotOp depends on operations with async semantics, with proper sync/fence operations between them.
3. DotOp depends on itself, as a corner case of 2.
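
The first two conditions can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyOp`, `Kind`, and `readsAreSynchronized` are hypothetical names and not the MLIR or Triton API, the loop is treated as a flat list of ops in program order, and the self-dependence corner case is not modeled.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy model, not the MLIR API.
enum class Kind { Plain, AsyncDot, Wait };

struct ToyOp {
  Kind kind;
  std::vector<int> operands;  // indices of earlier ops whose results are read
};

// Returns true if every async result read directly by ops[idx] is covered
// by a Wait that sits between the producer and ops[idx] in program order.
bool readsAreSynchronized(const std::vector<ToyOp>& ops, int idx) {
  for (int producer : ops[idx].operands) {
    if (ops[producer].kind != Kind::AsyncDot)
      continue;  // condition 1: no dependence on an async op
    bool waited = false;
    for (int i = producer + 1; i < idx; ++i)
      if (ops[i].kind == Kind::Wait)
        waited = true;  // condition 2: a sync op in between
    if (!waited)
      return false;
  }
  return true;
}
```

In this model, a plain op that reads an async dot result directly (like the AddOp reading %2 in the example) is rejected unless a wait is placed between the two.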

bealwang · 2023-09-27T02:18:13Z

lib/Dialect/TritonGPU/Transforms/Pipeline.cpp

-
+  Operation *dotWait = nullptr;
+  if (!hasSyncDot && !hasWaitGroup) {
+    builder.setInsertionPointAfter(dot.getDefiningOp());


In general, DotWaitOp is coupled with DotAsyncOp. If hasSyncDot is true, no DotWaitOp is inserted, but is the DotOp still replaced with a DotAsyncOp?

The same question applies to the outstanding DotWaitOp.

ptillet · 2023-09-28T17:32:42Z

marking this as draft until no longer WIP

github-actions · 2023-10-05T03:37:43Z

⚠️ This PR does not produce bitwise identical kernels as the branch it's merged against. Please check artifacts for details. Download the output file here.

bealwang · 2023-10-18T01:45:56Z

lib/Dialect/TritonGPU/Transforms/Pipeline.cpp

@@ -1662,6 +1742,11 @@ void PipelinePass::asyncLaunchDots(scf::ForOp forOp) {
  // TODO: merge this with the rest of the pipelining transformation and look at
  // a better representation for async dots.
  tt::DotOp lastDot = dots.back();
+  for (auto dotOp : allDots) {
+    if (dotOp != lastDot)


Do you want to check whether there is a dotOp in allDots that is not in dots?
Maybe find(dots.begin(), dots.end(), dotOp) != dots.end() would be more reasonable.
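
The membership test suggested here can be sketched with the standard library alone. The helper name `contains` is hypothetical, and plain `int` values stand in for the `tt::DotOp` handles of the real code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: true if `item` occurs anywhere in `items`.
template <typename T>
bool contains(const std::vector<T>& items, const T& item) {
  return std::find(items.begin(), items.end(), item) != items.end();
}
```

Unlike comparing against `dots.back()`, this check does not depend on where the element sits in the container. In LLVM-based code, `llvm::is_contained` from `llvm/ADT/STLExtras.h` expresses the same test.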

bealwang · 2023-10-18T01:59:54Z

lib/Dialect/TritonGPU/Transforms/Pipeline.cpp

+    if (arg == result) {
+      break;
+    } else {
+      continue;


nit: we don't need else {continue;} here.
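
A small sketch of why the `else` branch is redundant, assuming (the diff does not show it) that the `if` is the last statement in the loop body. Both functions are hypothetical illustrations, not code from the PR.

```cpp
#include <cassert>
#include <vector>

// Version with the redundant branch, mirroring the shape in the diff.
int firstMatchVerbose(const std::vector<int>& xs, int target) {
  int i = 0;
  for (; i < static_cast<int>(xs.size()); ++i) {
    if (xs[i] == target) {
      break;
    } else {
      continue;  // no effect: the loop body ends here anyway
    }
  }
  return i;  // index of the first match, or xs.size() if none
}

// Equivalent version without the else branch.
int firstMatch(const std::vector<int>& xs, int target) {
  int i = 0;
  for (; i < static_cast<int>(xs.size()); ++i) {
    if (xs[i] == target)
      break;
  }
  return i;
}
```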

bealwang · 2023-10-18T02:22:19Z

lib/Dialect/TritonGPU/Transforms/Pipeline.cpp

+  auto op = v.getDefiningOp();
+  // root and not BlockArgument
+
+  auto iterArgs = forOp.getInitArgs();


I am a little confused. Do you have a concrete example that explains why we check forOp.getInitArgs() here rather than forOp.getRegionIterArgs()?

github-actions · 2023-10-24T03:20:19Z

⚠️ This PR does not produce bitwise identical kernels as the branch it's merged against. Please check artifacts for details. Download the output file here.

github-actions · 2023-10-26T01:57:34Z

⚠️ This PR does not produce bitwise identical kernels as the branch it's merged against. Please check artifacts for details. Download the output file here.

ptillet · 2023-10-26T16:26:05Z

This is great!

github-actions · 2023-10-26T16:37:01Z

⚠️ This PR does not produce bitwise identical kernels as the branch it's merged against. Please check artifacts for details. Download the output file here.

ThomasRaoux · 2023-10-30T15:31:40Z

lib/Dialect/TritonGPU/Transforms/Pipeline.cpp

+  Value loopNotEmpty = builder.create<arith::CmpIOp>(
+      loc, arith::CmpIPredicate::slt, forOp.getLowerBound(),
+      forOp.getUpperBound());
+  auto ifOp = builder.create<scf::IfOp>(loc, resultTypes, loopNotEmpty,


Is there any reason for adding back this workaround? Does it break something to not have this if?
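
For reference, the guard in the diff is a signed `lowerBound < upperBound` comparison (`arith::CmpIPredicate::slt`). The toy below sketches what such a guard distinguishes, assuming a positive step; `tripCount` is a hypothetical helper and not part of the pass.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of iterations of a loop
// `for (i = lb; i < ub; i += step)`, assuming step > 0.
std::int64_t tripCount(std::int64_t lb, std::int64_t ub, std::int64_t step) {
  if (!(lb < ub))
    return 0;  // empty loop: a region guarded by `lb < ub` is skipped
  return (ub - lb + step - 1) / step;  // ceiling division
}
```

With the guard in place, code inside the `scf.if` runs only when the loop executes at least once.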

donproc requested a review from ptillet as a code owner September 27, 2023 00:45

donproc changed the title ~~remove extra wgmma_wait_group in flash attention~~ [WIP]remove extra wgmma_wait_group in flash attention Sep 27, 2023

bealwang reviewed Sep 27, 2023

ptillet marked this pull request as draft September 28, 2023 17:32

donproc force-pushed the pipeline_remove branch 4 times, most recently from 32ccff2 to 6d0895c Compare October 17, 2023 00:37

donproc changed the title ~~[WIP]remove extra wgmma_wait_group in flash attention~~ remove extra wgmma_wait_group in flash attention Oct 17, 2023

donproc force-pushed the pipeline_remove branch from 6d0895c to 1892bf4 Compare October 17, 2023 01:05

donproc marked this pull request as ready for review October 17, 2023 01:12

bealwang reviewed Oct 18, 2023

donproc force-pushed the pipeline_remove branch 6 times, most recently from ea3528e to 7bd7185 Compare October 23, 2023 16:33

donproc force-pushed the pipeline_remove branch from 7bd7185 to fd38935 Compare October 25, 2023 11:21

donproc force-pushed the pipeline_remove branch from fd38935 to 70838c9 Compare October 26, 2023 13:35

remove extra wait in pipeline pass

5df7a6f

donproc force-pushed the pipeline_remove branch from 70838c9 to 5df7a6f Compare October 26, 2023 13:37

Merge branch 'main' into pipeline_remove

0b3d616

ptillet enabled auto-merge (squash) October 26, 2023 16:30

ptillet approved these changes Oct 26, 2023

ptillet merged commit 0469d5f into triton-lang:main Oct 26, 2023
4 checks passed

ThomasRaoux reviewed Oct 30, 2023

pingzhuu pushed a commit to siliconflow/triton that referenced this pull request Apr 2, 2024

[OPTIMIZER] Remove extra wgmma_wait_group in flash attention (triton-…

575e7ce

…lang#2399) Co-authored-by: dongdongl <[email protected]>