Direct3D 11 Compute Shader
More Generality for Advanced Techniques
Chas. Boyd
Architect, Windows Desktop & Graphics Technology
Microsoft
Overview
GPGPU vs. Data Parallel Computing
Introducing the Compute Shader
Advantages
Target Applications
Key Features
Examples
Image reduction, histogram, convolution
API Support
GPGPU = Data Parallel Computing
GPU Performance continues to grow
More algorithms want this performance
Apps can scale to massive parallelism without tricky code changes
General recognition that this model is applicable beyond just rendering
although that is our primary target
Deliver scalable performance
Code scales with core count with no changes
Introducing: Compute Shader
A new processing model for GPUs
Data-parallel programming for mass markets
Integrated with Direct3D
For efficient interoperability in client scenarios
Supports more general constructs:
Cross-thread data sharing
Unordered-access I/O operations
Enables more general data structures
Irregular arrays, trees, etc.
Enables more general algorithms
Far beyond shading
Optimized for Client Scenarios
Simpler setup syntax
Balance between power and complexity
Real-time rendering of results
Working to reduce cost of transition from compute mode to graphics mode
Better integration with media data types:
Pixels, samples, text, vs. only floats
Need consistency between implementations
Both across vendors and over time/generations
Compute Shader Features
Predictable Thread Invocation
Regular arrays of threads: 1D, 2D, 3D
Don't have to draw a quad anymore
Shared registers between threads
Reduces register pressure
Can eliminate redundant compute and I/O
Scattered Writes
Can read/write arbitrary data structures
Enables new classes of algorithms
Integrated with Direct3D resources
Target Applications
Image/post-processing:
Image reduction, histogram, convolution, FFT
Effect physics
Particles, smoke, water, cloth, etc.
A-Buffer/OIT
Ray-tracing, radiosity, etc.
Gameplay physics, AI
Integrated with Direct3D
Fully supports all Direct3D resources
Targets graphics/media data types
Evolution of DirectX HLSL
Graphics pipeline updated to emit general data structures
which can then be manipulated by compute shader and then rendered by D3D again
Integration with Pipeline
[Diagram: Input Assembler → Vertex Shader → Tessellation → Geometry Shader → Rasterizer → Pixel Shader → Output Merger. The pipeline renders the scene and writes out the scene image as a data structure; the Compute Shader performs image post-processing on it and outputs the final image.]
Direct Thread Invocation
The ability to explicitly launch a known number of threads onto the GPU
pD3D11Device->Dispatch( numThreads );
Analogous to graphics DrawPrimitive() calls
Enables algorithms to execute the optimal number of threads
Not how many vertices are read, or pixels written
Current thread id is available to shader code:
sv_ThreadID.x
Analogous to sv_PrimitiveID system value
Enables predictable memory access and register usage
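As a hedged illustration of this model (resource names and the exact pre-release syntax are assumptions based on the slides in this talk, not final API), a shader that uses its thread ID directly as an array index:

    Buffer<float>       Input;     // hypothetical input resource
    OutputBuffer<float> Result;    // hypothetical writable output resource

    DoubleElements()
    {
        // No quad is drawn: each of the Dispatch()'d threads handles
        // exactly the element named by its own thread ID.
        Result[sv_ThreadID.x] = 2.0f * Input[sv_ThreadID.x];
    }

The application calls Dispatch() with the element count it actually wants processed, so per-thread memory access and register usage stay predictable.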
Shared Register Class
New register type/variable storage class
shared float sfFoo;
Multiple threads can access same memory
Enables uses like user-controlled cache
Maximum of 32 KB of registers can be shared in DirectX 11
8K floats or 2K float4s
vs. 64 KB of total temporary registers available
16K floats or 4K float4s
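A hedged sketch of the user-controlled-cache idea (the 16x16 group size, the names, and the groupshared/pre-release syntax used later in this talk are assumptions): a thread group stages a tile of pixels into shared registers once, and every thread then reads neighbors from the cache instead of from memory:

    groupshared float4 Tile[16][16];    // 16x16 tile = 4 KB of float4s, well under the 32 KB limit

    // Each of the 256 threads loads exactly one pixel of the tile...
    Tile[sv_ThreadIDinGroup.y][sv_ThreadIDinGroup.x] = load( sampler, sv_ThreadID );
    SynchronizeThreadGroup();           // ...and waits until the whole tile is staged

    // From here on, any thread can read any pixel of the tile from
    // shared registers with no further memory I/O.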
Sub Blocking
Not all threads in the call can/should share registers with each other
Sharing threads are broken down into subsets (groups) of threads
Thread indices are made available in the shader:
sv_ThreadID
sv_ThreadGroupID
sv_ThreadIDinGroup
Atomic Intrinsics
Enable parallel operations on individual 32-bit memory locations without requiring full synchronization
Either video memory or shared registers
Can be used to implement higher-level synch constructs
Semaphores, etc.
Not intended for heavy lifting
Support an immediate return argument
At some performance cost
Atomic Intrinsics
Enables basic operations:
InterlockedAdd( rVar, val );
InterlockedMin( rVar, val );
InterlockedMax( rVar, val );
InterlockedOr( rVar, val );
InterlockedXOr( rVar, val );
InterlockedCompareWrite( rVar, val );
InterlockedCompareExchange( rVar, val );
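As one hedged example of such a building block (the names are hypothetical, and the return-argument form of InterlockedAdd is assumed, per the note above): an atomic counter that hands each thread a unique slot in an output buffer, the basic step in appending to an irregular array:

    OutputBuffer<uint> Records;        // hypothetical output resource
    groupshared uint   RecordCount;    // shared running count of records written

    // Claim the next free slot; uSlot receives the counter's previous value.
    uint uSlot;
    InterlockedAdd( RecordCount, 1, uSlot );
    Records[uSlot] = value;            // scattered write into the claimed slot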
Unordered Memory Accesses
HLSL resource variables
Declared in the language
DXGI resources
Enables out-of-bounds memory checking
Reads return 0
Writes are no-ops
Improves security, reliability of shipped code
Unordered I/O
For fastest performance when ordering of records need not be preserved
Both reads and writes:
UnorderedLoad( ResourceVar, val );
UnorderedStore( ResourceVar, val );
Requires the buffer to be allocated beforehand
Integration with Direct3D
Pixel shaders can also perform scattered writes
Enables rendering output to data structures more complex than a 2D array
Histogram, linked list, irregular array, tree, etc.
Don't Forget
Texture sampling still works:
Object.Load( Loc, Offset, Samples );
Object.Gather( Sampler, Loc );
Object.Sample( Sampler, Loc );
Object.SampleLevel( Sampler, Loc, LoD );
No automatic trilinear LoD calculation (see the example below)
Other graphics features are not present:
Antialiasing, depth culling, alpha blending, triangle rasterization
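A hedged example of that sampling point (texture, sampler, and coordinate names are hypothetical): with no pixel-quad derivatives available in compute, the LoD has to be supplied explicitly:

    Texture2D    SceneTex;       // hypothetical texture resource
    SamplerState LinearClamp;    // hypothetical sampler

    // SampleLevel works because the mip level is passed in; there are no
    // derivatives from which to compute a trilinear LoD automatically.
    float4 vColor = SceneTex.SampleLevel( LinearClamp, vUV, 0 );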
Examples
Image Reduction
Image Histogram
FFT
Image Post-Processing
Significant fraction of frame time
10–20% for most games
50–70% for deferred shading-based engines
Savings here means more time for 3D
Image Reduction
Find the average intensity of an Image
E.g. for HDR exposure adjustment
Optimizes scene for viewing on SDR monitor
Algorithm breakdown:
Input: 1 million pixels
Compute: 1 MAD per pixel read
Output: 1 value
Should this run at texture sample rate?
It does not, due to write contention
Million-to-1 reduction
[Diagram: the GPU reduces the million-pixel input to a single output value]
Reduction Compute Code
Buffer<uint> Values;
OutputBuffer<uint> Result;

ImageAverage()
{
    groupshared uint Total;    // Total so far
    groupshared uint Count;    // Count added

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    uint value = fLuminance * 65536;

    InterlockedAdd( Count, 1 );
    InterlockedAdd( Total, value );

    SynchronizeThreadGroup();    // allow all threads in group to complete
Reduction Compute Code (2)
    // Allow all threads in group to complete
    SynchronizeThreadGroup();

    // Compute the average and store it in our output buffer
    if ( sv_ThreadID.x == 0 )
    {
        float fAverage = (float)Total / Count;    // compute avg
        UnorderedStore( Result[0], fAverage );    // write it out
    }
}
Fast Reduction Compute Code
Buffer<uint> Values;
OutputBuffer<uint> Result;

ImageAverage()
{
    groupshared uint Total[32];    // array of 32 totals
    groupshared uint Count[32];    // array of 32 counts

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    uint value = fLuminance * 65536;

    uint idx = (sv_ThreadID.x + sv_ThreadID.y + sv_ThreadID.z) & 31;    // pick one of the 32 bins
    Total[idx] += value;
    Count[idx] += 1;
Fast Reduction Compute Code (2)
    // Allow all threads in group to complete
    SynchronizeThreadGroup();

    // Compute the average and store it in our output buffer
    if ( sv_ThreadIDinGroup.x == 0 )
    {
        uint TheTotal = 0;
        uint TheCount = 0;
        for ( uint i = 0; i < 32; i++ )
        {
            TheTotal += Total[i];
            TheCount += Count[i];
        }
        float fAverage = (float)TheTotal / TheCount;           // compute avg
        UnorderedStore( Result[sv_ThreadGroupID], fAverage );  // write this group's average
    }
}
Reduction Performance
Pyramid approaches work today
Some choice in reduction level per pass
Tradeoff is contention for the destination
1M pixels takes ~0.4ms in Direct3D
Pass-count-limited at small end of pyramid
Ideally should run at texture read rate
< 0.1 ms in theory, or 4–10x faster
Compute shader features should help
Such as local read-write cache
Prototypes show ~2x speed boost so far
Histogram Generation
Similar to reduction problem
Reduce to 64–256 destinations at data-dependent (unpredictable) addresses
Still suffers contention when multiple pixels increment same bin
So replicate bins, e.g. 16x
Increment bins using InterlockedAdd() math operations
Currently showing 2x speedup
Histogram Generation Code
Histogram()
{
    shared int Histograms[16][256];    // array of 16 replicated histograms

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    int iBin = fLuminance * 255.0f;            // compute bin to increment
    int iHist = sv_ThreadIDinGroup & 15;       // use thread index to pick a histogram copy
    Histograms[iHist][iBin] += 1;              // update bin

    SynchronizeThreadGroup();    // allow all threads in group to complete
Histogram Generation Code (2)
    // Write register histograms out to memory:
    iBin = sv_ThreadIDinGroup.x;
    if ( sv_ThreadID.x < 256 )
    {
        for ( iHist = 0; iHist < 16; iHist++ )
        {
            int2 destAddr = int2( iHist, iBin );
            OutputResource.add( destAddr, Histograms[iHist][iBin] );    // atomic add into the output histogram
        }
    }
}
Histogram Performance
Recent work shows similar performance to reductions:
Direct3D takes ~2.4 ms per megapixel
On DirectX 10 hardware
2x speedup shown via prototypes
On same hardware but using shared registers
8x theoretically possible
if purely read limited
Image Convolution
Fundamental operation for blurs:
HDR flares, depth-of-field, soft shadows, streaks
Need fairly large kernels for these
Kernels 100 taps wide are possible at high resolutions (sparse sampling produces artifacts)
[Figures: 7-tap separable kernel]
Convolution Performance
Massively variable depending on method
Direct3D does a 5x5 kernel in 0.65 ms/Mpix
Separable kernel
Prototype does slightly better
Using the shared-register capability (sketched below)
Theoretical performance should be higher
Some opportunity remains
Need to evaluate relevant kernel sizes
Games need 100x100 effectively
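A hedged sketch of how the shared-register prototype idea works (group size, kernel width, the Weights array, the Output resource, and the pre-release syntax are all illustrative assumptions, and edge clamping is omitted): each group stages its span of a row plus an apron into shared registers once, then every thread filters entirely from that cache instead of re-reading neighbors from memory:

    #define KERNEL_RADIUS 3     // 7-tap kernel
    #define GROUP_WIDTH   64    // threads per group along the row

    groupshared float3 Cache[GROUP_WIDTH + 2 * KERNEL_RADIUS];

    HorizontalBlur7()
    {
        int x = sv_ThreadIDinGroup.x;

        // Stage this group's span of the row, plus a KERNEL_RADIUS apron on
        // each side, into shared registers (a few threads load a second pixel).
        Cache[x] = load( sampler, sv_ThreadID - int3( KERNEL_RADIUS, 0, 0 ) );
        if ( x < 2 * KERNEL_RADIUS )
            Cache[x + GROUP_WIDTH] = load( sampler, sv_ThreadID + int3( GROUP_WIDTH - KERNEL_RADIUS, 0, 0 ) );
        SynchronizeThreadGroup();    // wait until the whole cache is filled

        // Each thread now filters its pixel purely from the cache.
        float3 vSum = 0;
        for ( int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; i++ )
            vSum += Weights[i + KERNEL_RADIUS] * Cache[x + KERNEL_RADIUS + i];
        UnorderedStore( Output[sv_ThreadID.xy], vSum );
    }

A second pass along columns completes the separable kernel.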
Other Example Techniques
These are not used directly in game post-processing today, but are key foundations of other algorithms
Scan (prefix sum), and FFT (fast Fourier transform)
Scan (Prefix-sum)
Each output element is the sum of all previous numbers in the input sequence
Used to compute writes in irregular arrays
Foundation of Summed Area Tables
Known GPU algorithms (Horn's method)
Pyramid scheme, so I/O bound
Sharing memory between threads results in ~2x speedup
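A hedged sketch of the shared-memory idea for a single thread group (group size, the Input/Output resources, and the pre-release syntax are assumptions; this is a naive Hillis–Steele-style scan, and per-group results still need combining as in the pyramid approaches above):

    #define GROUP_SIZE 256

    groupshared float Data[GROUP_SIZE];

    PrefixSum()
    {
        uint t = sv_ThreadIDinGroup.x;
        Data[t] = Input[sv_ThreadID.x];     // stage one element per thread
        SynchronizeThreadGroup();

        // Each step adds in the value 'stride' places back, doubling the stride;
        // after log2(GROUP_SIZE) steps every element holds its inclusive prefix sum.
        for ( uint stride = 1; stride < GROUP_SIZE; stride *= 2 )
        {
            float fPartial = ( t >= stride ) ? Data[t - stride] : 0;
            SynchronizeThreadGroup();       // everyone has read the old values...
            Data[t] += fPartial;
            SynchronizeThreadGroup();       // ...and written the new ones
        }

        Output[sv_ThreadID.x] = Data[t];    // this group's inclusive scan
    }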
Scan (Prefix-sum)
We are looking at providing this in a library routine
Along with FFT, etc.
Summed Area Table
2D equivalent of Scan
Each element of 2D array has sum of all elements up/left of it
Enables a box filter whose cost is independent of kernel size (instead of O(k))
Fast generation of:
Shadow blur with distance
Depth-of-field
Area light integrals, etc.
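As a quick worked illustration of that kernel-size independence (standard SAT indexing, with illustrative names): the sum over any axis-aligned rectangle needs only four table lookups, regardless of how large the rectangle is:

    // Sum of all pixels in the inclusive rectangle (x0,y0)..(x1,y1):
    float fSum = SAT[x1][y1] - SAT[x0-1][y1] - SAT[x1][y0-1] + SAT[x0-1][y0-1];

    // Box filter = average over the rectangle; cost is the same for any kernel size.
    float fBox = fSum / ( (x1 - x0 + 1) * (y1 - y0 + 1) );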
Fast Fourier Transform
Converts image into frequency domain
Many operations are faster in frequency domain than in spatial domain
e.g. convolution becomes a multiply
Trivial detection of periodic noise
Some application to motion estimation
Core algorithm similar to scan
Similar I/O patterns, but a more math-intensive inner loop
Direct3D FFT
Ping-pong between 2 R32G32F surfaces
R is the real part, G the imaginary part
Do log N passes along rows, then columns
Pixel shader only
Does not use blenders or iterators
Uses vPos.xy as array indices [i][j]
Inner loop is math intensive
20+ instructions including trig
Indexing math dominates unless DX10
[Figures: FFT example image, before and after]
FFT Performance
Complex 1024x1024 2D FFT:
Software          42 ms      6 GFlops
Direct3D 9        15 ms     17 GFlops
Prototype DX11     6 ms     42 GFlops   (3x)
Latest chips       3 ms    100 GFlops   (6x)
Shared register space and random access writes enable ~2x speedups
Order-Independent Translucency
Eliminates draw-order issues and shimmer in moving scenes
Correct AA even of transparent objects
Any object is transparent if antialiased, e.g. alpha-tested leaves in forests
Current methods require large sample counts
Alpha-to-coverage
Depth peeling with occlusion queries
The A-Buffer Method
A-buffer is a more accurate method
Accumulate object data in a per-pixel list
Then sort each pixel's list into order
Collapse to final color and display
Brings visual quality to movie levels without requiring 256-sample MSAA
Something to keep an eye on for OIT
A-Buffer Rendering
Currently prototyping using refrast
DirectX reference rasterizer running on CPU
Measuring memory access patterns/locality
Evaluating feasibility of hardware
Not really feasible with current Direct3D
Compute shader features enable this
Such as indexed writes, counters, etc. (sketched below)
Rendering to structures beyond regular arrays
But performance is still largely unknown
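A hedged sketch of the kind of structure these features enable (all names are hypothetical, an exchange-style atomic on the head pointer is assumed, and this is an illustration rather than the prototype being measured): appending a fragment to a per-pixel linked list using a global atomic counter and scattered writes:

    struct Fragment { float4 color; float depth; uint next; };

    OutputBuffer<uint>     HeadPointer;   // one list head per pixel, initialized to 0xFFFFFFFF
    OutputBuffer<Fragment> Fragments;     // flat pool of fragment records
    OutputBuffer<uint>     Counter;       // global allocation counter

    // Claim a unique slot in the fragment pool.
    uint uSlot;
    InterlockedAdd( Counter[0], 1, uSlot );
    Fragments[uSlot].color = vColor;
    Fragments[uSlot].depth = fDepth;

    // Link the new fragment in front of the pixel's current list head.
    uint uOldHead;
    InterlockedExchange( HeadPointer[pixelIndex], uSlot, uOldHead );
    Fragments[uSlot].next = uOldHead;

A compute shader pass can then walk, sort, and collapse each pixel's list as described above.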
Additional Algorithms
New rendering methods
Ray-tracing, collision detection, etc.
Rendering elements at different resolutions
Non-rendering algorithms
IK, physics, AI, simulation, fluid simulation, radiosity
Need more general data structures
Quad/octrees, irregular arrays, sparse arrays
Need linear algebra
Summary
Compute Shader is coming in Direct3D 11
GPU performance levels for more applications
Scalable parallel processing model
Code should scale for several generations
Increased generality will enable both:
Improved performance on existing GPU tasks
More CPU tasks can switch to data-parallel (DP) cores
Full cross-vendor support
Enables broadest possible installed base
Questions?
www.xnagamefest.com
© 2008 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.