
Failed to fuse leaky ReLU with convolution on RTX 3090 #138

Open · jb2020-super opened this issue Jul 1, 2021 · 2 comments

Comments

jb2020-super commented Jul 1, 2021

Here is my code: https://github.com/jb2020-super/test-DirectML.git

According to the PIX analysis, a convolution with FusedActivation set to DML_OPERATOR_ACTIVATION_LEAKY_RELU is split into two convolution ops. But when it is replaced with DML_OPERATOR_ACTIVATION_RELU, the fusion succeeds. How can I solve this?
[screenshot: pix]
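For context, a minimal sketch (not taken from the linked repo; the Alpha value and the elided tensor/stride fields are placeholders) of how such a fused convolution is typically described with the DirectML API:

```cpp
#include <windows.h>
#include <DirectML.h>

// Sketch: build a convolution operator with a fused leaky ReLU.
// Tensor descs, strides, padding, and Alpha are illustrative placeholders.
HRESULT CreateFusedConv(IDMLDevice* dmlDevice, IDMLOperator** convOp)
{
    DML_ACTIVATION_LEAKY_RELU_OPERATOR_DESC leakyRelu = {};
    leakyRelu.Alpha = 0.01f;  // negative slope; model-specific
    // InputTensor/OutputTensor stay null when used as a fused activation.

    DML_OPERATOR_DESC fusedActivation = {
        DML_OPERATOR_ACTIVATION_LEAKY_RELU, &leakyRelu };

    DML_CONVOLUTION_OPERATOR_DESC conv = {};
    // ... InputTensor, FilterTensor, OutputTensor, DimensionCount,
    //     Strides, Dilations, StartPadding, EndPadding, GroupCount ...
    conv.Mode = DML_CONVOLUTION_MODE_CROSS_CORRELATION;
    conv.Direction = DML_CONVOLUTION_DIRECTION_FORWARD;
    conv.FusedActivation = &fusedActivation;  // the fusion in question

    DML_OPERATOR_DESC opDesc = { DML_OPERATOR_CONVOLUTION, &conv };
    return dmlDevice->CreateOperator(&opDesc, IID_PPV_ARGS(convOp));
}
```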

adtsai (Contributor) commented Jul 1, 2021

Hi,

DirectML fuses operators opportunistically - that is, when it is both possible to fuse and there is a performance benefit to doing so. Unfortunately in this case it appears it wasn't possible to fuse the LEAKY_RELU with the metacommand (as the level of metacommand support can vary by hardware and driver version). You might be able to achieve the fusion by using the DISABLE_METACOMMANDS flag, but that's likely to result in worse performance. Let us know if you have an end-to-end scenario that's impacted by this - if there's data that shows a substantial performance difference, this is something we can raise with hardware vendors as a potential optimization in future.
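For anyone trying that suggestion, a sketch of where the flag would be applied, assuming an `IDMLDevice` (`dmlDevice`) and an already-created operator such as the one sketched earlier:

```cpp
#include <windows.h>
#include <wrl/client.h>
#include <DirectML.h>

// Sketch: compile with metacommands disabled, per the suggestion above.
// This forces DirectML's own shader path, which usually costs performance.
Microsoft::WRL::ComPtr<IDMLCompiledOperator> CompileWithoutMetacommands(
    IDMLDevice* dmlDevice, IDMLOperator* op)
{
    Microsoft::WRL::ComPtr<IDMLCompiledOperator> compiled;
    dmlDevice->CompileOperator(
        op,
        DML_EXECUTION_FLAG_DISABLE_META_COMMANDS,  // skip vendor metacommands
        IID_PPV_ARGS(&compiled));
    return compiled;
}
```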

jb2020-super (Author) commented

Hi @adtsai, DISABLE_METACOMMANDS results in poor performance. I replaced the model in the DirectMLSuperResolution sample with a seven-layer CNN and tested it. The results are as follows.

Environment

  • DirectML v1.5.1
  • AMD RX 5700 XT. Driver date: 9/9/2020, driver version: 27.20.12029.1000
  • NVIDIA RTX 3090. Driver date: 5/6/2021, driver version: 27.21.14.6259
  • 7-layer CNN. The first six layers use leaky ReLU as the activation function; the direction of the last layer's convolution is set to DML_CONVOLUTION_DIRECTION_BACKWARD.

Test Results

Model                                  AMD RX 5700 XT (frame time)   NVIDIA RTX 3090 (frame time)
Demo                                   38.41 ms                      10.975 ms
7-layer CNN                            41.10 ms                      33.254 ms
7-layer CNN (metacommands disabled)    133.69 ms                     115.50 ms

Summary

  • With the Demo model the performance gap is clear (38.41 ms vs 10.975 ms), which is in line with the cards' relative capabilities.
  • With the 7-layer CNN the difference is small (41.10 ms vs 33.254 ms); the 3090's improvement was not as large as expected.
  • Convolution was not fused with LEAKY_RELU on the 3090.
  • The backward-direction convolution was not compiled into a metacommand on the 3090, but it was on the 5700 XT (see the sketch after this list).
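For clarity, a sketch of how the backward-direction layer mentioned above would be declared; tensor fields are elided placeholders:

```cpp
#include <windows.h>
#include <DirectML.h>

// Sketch: the last layer runs the convolution in the backward direction,
// i.e. a transposed convolution (commonly used for upscaling).
// Tensor descs, strides, and output padding are elided placeholders.
void DescribeLastLayer(DML_CONVOLUTION_OPERATOR_DESC* conv)
{
    *conv = {};
    // ... InputTensor, FilterTensor, OutputTensor, Strides,
    //     OutputPadding (relevant for transposed convs), etc. ...
    conv->Mode = DML_CONVOLUTION_MODE_CROSS_CORRELATION;
    conv->Direction = DML_CONVOLUTION_DIRECTION_BACKWARD;  // transposed conv
}
```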

PIX Analysis

Demo model on 5700XT: [screenshot: 5700xt_demo]
7-layer CNN on 5700XT: [screenshot: 5700xt_upconv7]
Demo model on 3090: [screenshot: demo_model]
7-layer CNN on 3090: [screenshot: leaky_relu]
7-layer CNN on 3090, metacommands disabled: [screenshot: disable_metacommand]
