
Failed to fuse leaky ReLU with convolution on RTX 3090 #138

Open · jb2020-super opened this issue Jul 1, 2021 · 2 comments

Comments

jb2020-super commented Jul 1, 2021

Here is my code: https://github.com/jb2020-super/test-DirectML.git

According to the PIX analysis, a convolution with FusedActivation set to DML_OPERATOR_ACTIVATION_LEAKY_RELU is split into two convolution ops. But when it is replaced with DML_OPERATOR_ACTIVATION_RELU, the fusion succeeds. How can I solve this?
[screenshot: pix]
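For context, a minimal sketch (not taken from the linked repo; the Alpha value and the elided tensor/stride fields are placeholders) of how such a fused convolution is typically described with the DirectML API:

```cpp
#include <windows.h>
#include <DirectML.h>

// Sketch: build a convolution operator with a fused leaky ReLU.
// Tensor descs, strides, padding, and Alpha are illustrative placeholders.
HRESULT CreateFusedConv(IDMLDevice* dmlDevice, IDMLOperator** convOp)
{
    DML_ACTIVATION_LEAKY_RELU_OPERATOR_DESC leakyRelu = {};
    leakyRelu.Alpha = 0.01f;  // negative slope; model-specific
    // InputTensor/OutputTensor stay null when used as a fused activation.

    DML_OPERATOR_DESC fusedActivation = {
        DML_OPERATOR_ACTIVATION_LEAKY_RELU, &leakyRelu };

    DML_CONVOLUTION_OPERATOR_DESC conv = {};
    // ... InputTensor, FilterTensor, OutputTensor, DimensionCount,
    //     Strides, Dilations, StartPadding, EndPadding, GroupCount ...
    conv.Mode = DML_CONVOLUTION_MODE_CROSS_CORRELATION;
    conv.Direction = DML_CONVOLUTION_DIRECTION_FORWARD;
    conv.FusedActivation = &fusedActivation;  // the fusion in question

    DML_OPERATOR_DESC opDesc = { DML_OPERATOR_CONVOLUTION, &conv };
    return dmlDevice->CreateOperator(&opDesc, IID_PPV_ARGS(convOp));
}
```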

adtsai (Contributor) commented Jul 1, 2021

Hi,

DirectML fuses operators opportunistically - that is, when it is both possible to fuse and there is a performance benefit to doing so. Unfortunately in this case it appears it wasn't possible to fuse the LEAKY_RELU with the metacommand (as the level of metacommand support can vary by hardware and driver version). You might be able to achieve the fusion by using the DISABLE_METACOMMANDS flag, but that's likely to result in worse performance. Let us know if you have an end-to-end scenario that's impacted by this - if there's data that shows a substantial performance difference, this is something we can raise with hardware vendors as a potential optimization in future.
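For anyone trying that suggestion, a sketch of where the flag would be applied, assuming an `IDMLDevice` (`dmlDevice`) and an already-created operator such as the one sketched earlier:

```cpp
#include <windows.h>
#include <wrl/client.h>
#include <DirectML.h>

// Sketch: compile with metacommands disabled, per the suggestion above.
// This forces DirectML's own shader path, which usually costs performance.
Microsoft::WRL::ComPtr<IDMLCompiledOperator> CompileWithoutMetacommands(
    IDMLDevice* dmlDevice, IDMLOperator* op)
{
    Microsoft::WRL::ComPtr<IDMLCompiledOperator> compiled;
    dmlDevice->CompileOperator(
        op,
        DML_EXECUTION_FLAG_DISABLE_META_COMMANDS,  // skip vendor metacommands
        IID_PPV_ARGS(&compiled));
    return compiled;
}
```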

jb2020-super (Author) commented

Hi @adtsai, DISABLE_METACOMMANDS results in poor performance. I replaced the model in the DirectMLSuperResolution sample with a seven-layer CNN and tested it. The results are as follows.

Environment

  • DirectML v1.5.1
  • AMD RX 5700 XT. Driver date: 9/9/2020, driver version: 27.20.12029.1000
  • NVIDIA RTX 3090. Driver date: 5/6/2021, driver version: 27.21.14.6259
  • 7-layer CNN. The first six layers use leaky ReLU as the activation function; the direction of the last layer's convolution is set to DML_CONVOLUTION_DIRECTION_BACKWARD.

Test Results

Model                                  AMD RX 5700 XT (frame time)   NVIDIA RTX 3090 (frame time)
Demo                                   38.41 ms                      10.975 ms
7-layer CNN                            41.10 ms                      33.254 ms
7-layer CNN (metacommands disabled)    133.69 ms                     115.50 ms

Summary

  • With the Demo model the performance gap is clear (38.41 ms vs 10.975 ms), which is in line with the cards' relative capabilities.
  • With the 7-layer CNN the difference is small (41.10 ms vs 33.254 ms); the 3090's improvement was not as large as expected.
  • Convolution was not fused with LEAKY_RELU on the 3090.
  • The backward-direction convolution was not compiled into a metacommand on the 3090, but it was on the 5700 XT (see the sketch after this list).
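For clarity, a sketch of how the backward-direction layer mentioned above would be declared; tensor fields are elided placeholders:

```cpp
#include <windows.h>
#include <DirectML.h>

// Sketch: the last layer runs the convolution in the backward direction,
// i.e. a transposed convolution (commonly used for upscaling).
// Tensor descs, strides, and output padding are elided placeholders.
void DescribeLastLayer(DML_CONVOLUTION_OPERATOR_DESC* conv)
{
    *conv = {};
    // ... InputTensor, FilterTensor, OutputTensor, Strides,
    //     OutputPadding (relevant for transposed convs), etc. ...
    conv->Mode = DML_CONVOLUTION_MODE_CROSS_CORRELATION;
    conv->Direction = DML_CONVOLUTION_DIRECTION_BACKWARD;  // transposed conv
}
```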

PIX Analysis

Demo model on 5700XT: [screenshot: 5700xt_demo]
7-layer CNN on 5700XT: [screenshot: 5700xt_upconv7]
Demo model on 3090: [screenshot: demo_model]
7-layer CNN on 3090: [screenshot: leaky_relu]
7-layer CNN on 3090, metacommands disabled: [screenshot: disable_metacommand]
