Direct3D 11 Compute Shader
More Generality for Advanced Techniques
Chas. Boyd
Architect, Windows Desktop & Graphics Technology
Microsoft
Overview
GPGPU vs. Data Parallel Computing
Introducing the Compute Shader
Advantages
Target Applications
Key Features
Examples
Image reduction, histogram, convolution
API Support
GPGPU = Data Parallel Computing
GPU Performance continues to grow
More algorithms want this performance
Apps can scale to massive parallelism without tricky code changes
General recognition that this model is applicable beyond just rendering
although that is our primary target
Deliver scalable performance
Code scales with core count with no changes
Introducing: Compute Shader
A new processing model for GPUs
Data-parallel programming for mass markets
Integrated with Direct3D
For efficient interoperability in client scenarios
Supports more general constructs:
Cross-thread data sharing
Unordered-access I/O operations
Enables more general data structures
Irregular arrays, trees, etc.
Enables more general algorithms
Far beyond shading
Optimized for Client Scenarios
Simpler setup syntax
Balance between power and complexity
Real-time rendering of results
Working to reduce cost of transition from compute mode to graphics mode
Better integration with media data types:
Pixels, samples, text, vs. only floats
Need consistency between implementations
Both across vendors and over time/generations
Compute Shader Features
Predictable Thread Invocation
Regular arrays of threads: 1D, 2D, 3D
Don't have to draw a quad anymore
Shared registers between threads
Reduces register pressure
Can eliminate redundant compute and I/O
Scattered Writes
Can read/write arbitrary data structures
Enables new classes of algorithms
Integrated with Direct3D resources
Target Applications
Image/post-processing:
Image reduction, histogram, convolution, FFT
Effect physics
Particles, smoke, water, cloth, etc.
A-Buffer/OIT
Ray-tracing, radiosity, etc.
Gameplay physics, AI
Integrated with Direct3D
Fully supports all Direct3D resources
Targets graphics/media data types
Evolution of DirectX HLSL
Graphics pipeline updated to emit general data structures
which can then be manipulated by compute shader and then rendered by D3D again
Integration with Pipeline
[Diagram: Input Assembler → Vertex Shader → Tessellation → Geometry Shader → Rasterizer → Pixel Shader → Output Merger. The pipeline renders the scene and writes out the scene image as a data structure; the Compute Shader performs image post-processing on it and outputs the final image.]
Direct Thread Invocation
The ability to explicitly launch a known number of threads onto the GPU
pD3D11Device->Dispatch( numThreads );
Analogous to graphics DrawPrimitive() calls
Enables algorithms to execute the optimal number of threads
Not how many vertices are read, or pixels written
Current thread id is available to shader code:
sv_ThreadID.x
Analogous to sv_PrimitiveID system value
Enables predictable memory access and register usage
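As a hedged illustration of this model (resource names and the exact pre-release syntax are assumptions based on the slides in this talk, not final API), a shader that uses its thread ID directly as an array index:

    Buffer<float>       Input;     // hypothetical input resource
    OutputBuffer<float> Result;    // hypothetical writable output resource

    DoubleElements()
    {
        // No quad is drawn: each of the Dispatch()'d threads handles
        // exactly the element named by its own thread ID.
        Result[sv_ThreadID.x] = 2.0f * Input[sv_ThreadID.x];
    }

The application calls Dispatch() with the element count it actually wants processed, so per-thread memory access and register usage stay predictable.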
Shared Register Class
New register type/variable storage class
shared float sfFoo;
Multiple threads can access same memory
Enables uses like user-controlled cache
Maximum of 32 KB of registers can be shared in DirectX 11
8K floats or 2K float4s
vs. 64 KB of total temporary registers available
16K floats or 4K float4s
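A hedged sketch of the user-controlled-cache idea (the 16x16 group size, the names, and the groupshared/pre-release syntax used later in this talk are assumptions): a thread group stages a tile of pixels into shared registers once, and every thread then reads neighbors from the cache instead of from memory:

    groupshared float4 Tile[16][16];    // 16x16 tile = 4 KB of float4s, well under the 32 KB limit

    // Each of the 256 threads loads exactly one pixel of the tile...
    Tile[sv_ThreadIDinGroup.y][sv_ThreadIDinGroup.x] = load( sampler, sv_ThreadID );
    SynchronizeThreadGroup();           // ...and waits until the whole tile is staged

    // From here on, any thread can read any pixel of the tile from
    // shared registers with no further memory I/O.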
Sub Blocking
Not all threads in the call can/should share registers with each other
Sharing threads are broken down into subsets (groups) of threads
Thread indices are made available in the shader:
sv_ThreadID
sv_ThreadGroupID
sv_ThreadIDinGroup
Atomic Intrinsics
Enable parallel operations on individual 32-bit memory locations without requiring full synchronization
Either video memory or shared registers
Can be used to implement higher-level synch constructs
Semaphores, etc.
Not intended for heavy lifting
Support an immediate return argument
At some performance cost
Atomic Intrinsics
Enables basic operations:
InterlockedAdd( rVar, val );
InterlockedMin( rVar, val );
InterlockedMax( rVar, val );
InterlockedOr( rVar, val );
InterlockedXOr( rVar, val );
InterlockedCompareWrite( rVar, val );
InterlockedCompareExchange( rVar, val );
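As one hedged example of such a building block (the names are hypothetical, and the return-argument form of InterlockedAdd is assumed, per the note above): an atomic counter that hands each thread a unique slot in an output buffer, the basic step in appending to an irregular array:

    OutputBuffer<uint> Records;        // hypothetical output resource
    groupshared uint   RecordCount;    // shared running count of records written

    // Claim the next free slot; uSlot receives the counter's previous value.
    uint uSlot;
    InterlockedAdd( RecordCount, 1, uSlot );
    Records[uSlot] = value;            // scattered write into the claimed slot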
Unordered Memory Accesses
HLSL resource variables
Declared in the language
DXGI resources
Enables out-of-bounds memory checking
Reads return 0
Writes are no-ops
Improves security, reliability of shipped code
Unordered I/O
For fastest performance when ordering of records need not be preserved
Both reads and writes:
UnorderedLoad( ResourceVar, val );
UnorderedStore( ResourceVar, val );
Requires the buffer to be allocated beforehand
Integration with Direct3D
Pixel shaders can also perform scattered writes
Enables rendering output to data structures more complex than a 2D array
Histogram, linked list, irregular array, tree, etc.
Don't Forget
Texture sampling still works:
Object.Load( Loc, Offset, Samples );
Object.Gather( Sampler, Loc );
Object.Sample( Sampler, Loc );
Object.SampleLevel( Sampler, Loc, LoD );
No automatic trilinear LoD calculation (see the example below)
Other graphics features are not present:
Antialiasing, depth culling, alpha blending, triangle rasterization
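A hedged example of that sampling point (texture, sampler, and coordinate names are hypothetical): with no pixel-quad derivatives available in compute, the LoD has to be supplied explicitly:

    Texture2D    SceneTex;       // hypothetical texture resource
    SamplerState LinearClamp;    // hypothetical sampler

    // SampleLevel works because the mip level is passed in; there are no
    // derivatives from which to compute a trilinear LoD automatically.
    float4 vColor = SceneTex.SampleLevel( LinearClamp, vUV, 0 );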
Examples
Image Reduction
Image Histogram
FFT
Image Post-Processing
Significant fraction of frame time
10–20% for most games
50–70% for deferred shading-based engines
Savings here means more time for 3D
Image Reduction
Find the average intensity of an Image
E.g. for HDR exposure adjustment
Optimizes scene for viewing on SDR monitor
Algorithm breakdown:
Input: 1 million pixels
Compute: 1 MAD per pixel read
Output: 1 value
Should this run at texture sample rate?
It does not, due to write contention
Million-to-1 reduction
[Diagram: the GPU reduces the million-pixel input to a single output value]
Reduction Compute Code
Buffer<uint> Values;
OutputBuffer<uint> Result;

ImageAverage()
{
    groupshared uint Total;    // Total so far
    groupshared uint Count;    // Count added

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    uint value = fLuminance * 65536;

    InterlockedAdd( Count, 1 );
    InterlockedAdd( Total, value );

    SynchronizeThreadGroup();    // allow all threads in group to complete
Reduction Compute Code (2)
    // Allow all threads in group to complete
    SynchronizeThreadGroup();

    // Compute the average and store it in our output buffer
    if ( sv_ThreadID.x == 0 )
    {
        float fAverage = (float)Total / Count;    // compute avg
        UnorderedStore( Result[0], fAverage );    // write it out
    }
}
Fast Reduction Compute Code
Buffer<uint> Values;
OutputBuffer<uint> Result;

ImageAverage()
{
    groupshared uint Total[32];    // array of 32 totals
    groupshared uint Count[32];    // array of 32 counts

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    uint value = fLuminance * 65536;

    uint idx = (sv_ThreadID.x + sv_ThreadID.y + sv_ThreadID.z) & 31;    // pick one of the 32 bins
    Total[idx] += value;
    Count[idx] += 1;
Fast Reduction Compute Code (2)
    // Allow all threads in group to complete
    SynchronizeThreadGroup();

    // Compute the average and store it in our output buffer
    if ( sv_ThreadIDinGroup.x == 0 )
    {
        uint TheTotal = 0;
        uint TheCount = 0;
        for ( uint i = 0; i < 32; i++ )
        {
            TheTotal += Total[i];
            TheCount += Count[i];
        }
        float fAverage = (float)TheTotal / TheCount;           // compute avg
        UnorderedStore( Result[sv_ThreadGroupID], fAverage );  // write this group's average
    }
}
Reduction Performance
Pyramid approaches work today
Some choice in reduction level per pass
Tradeoff is contention for the destination
1M pixels takes ~0.4ms in Direct3D
Pass-count-limited at small end of pyramid
Ideally should run at texture read rate
< 0.1 ms in theory, or 4–10x faster
Compute shader features should help
Such as local read-write cache
Prototypes show ~2x speed boost so far
Histogram Generation
Similar to reduction problem
Reduce to 64–256 destinations at data-dependent (unpredictable) addresses
Still suffers contention when multiple pixels increment same bin
So replicate bins, e.g. 16x
Increment bins using InterlockedAdd() math operations
Currently showing 2x speedup
Histogram Generation Code
Histogram()
{
    shared int Histograms[16][256];    // array of 16 replicated histograms

    float3 vPixel = load( sampler, sv_ThreadID );
    float fLuminance = dot( vPixel, LUM_VECTOR );
    int iBin = fLuminance * 255.0f;            // compute bin to increment
    int iHist = sv_ThreadIDinGroup & 15;       // use thread index to pick a histogram copy
    Histograms[iHist][iBin] += 1;              // update bin

    SynchronizeThreadGroup();    // allow all threads in group to complete
Histogram Generation Code (2)
    // Write register histograms out to memory:
    iBin = sv_ThreadIDinGroup.x;
    if ( sv_ThreadID.x < 256 )
    {
        for ( iHist = 0; iHist < 16; iHist++ )
        {
            int2 destAddr = int2( iHist, iBin );
            OutputResource.add( destAddr, Histograms[iHist][iBin] );    // atomic add into the output histogram
        }
    }
}
Histogram Performance
Recent work shows similar performance to reductions:
Direct3D takes ~2.4 ms per megapixel
On DirectX 10 hardware
2x speedup shown via prototypes
On same hardware but using shared registers
8x theoretically possible
if purely read limited
Image Convolution
Fundamental operation for blurs:
HDR flares, depth-of-field, soft shadows, streaks
Need fairly large kernels for these
Kernels 100 taps wide are possible at high resolutions (sparse sampling produces artifacts)
[Figures: 7-tap separable kernel]
Convolution Performance
Massively variable depending on method
Direct3D does a 5x5 kernel in 0.65 ms/Mpix
Separable kernel
Prototype does slightly better
Using the shared-register capability (sketched below)
Theoretical performance should be higher
Some opportunity remains
Need to evaluate relevant kernel sizes
Games need 100x100 effectively
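A hedged sketch of how the shared-register prototype idea works (group size, kernel width, the Weights array, the Output resource, and the pre-release syntax are all illustrative assumptions, and edge clamping is omitted): each group stages its span of a row plus an apron into shared registers once, then every thread filters entirely from that cache instead of re-reading neighbors from memory:

    #define KERNEL_RADIUS 3     // 7-tap kernel
    #define GROUP_WIDTH   64    // threads per group along the row

    groupshared float3 Cache[GROUP_WIDTH + 2 * KERNEL_RADIUS];

    HorizontalBlur7()
    {
        int x = sv_ThreadIDinGroup.x;

        // Stage this group's span of the row, plus a KERNEL_RADIUS apron on
        // each side, into shared registers (a few threads load a second pixel).
        Cache[x] = load( sampler, sv_ThreadID - int3( KERNEL_RADIUS, 0, 0 ) );
        if ( x < 2 * KERNEL_RADIUS )
            Cache[x + GROUP_WIDTH] = load( sampler, sv_ThreadID + int3( GROUP_WIDTH - KERNEL_RADIUS, 0, 0 ) );
        SynchronizeThreadGroup();    // wait until the whole cache is filled

        // Each thread now filters its pixel purely from the cache.
        float3 vSum = 0;
        for ( int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; i++ )
            vSum += Weights[i + KERNEL_RADIUS] * Cache[x + KERNEL_RADIUS + i];
        UnorderedStore( Output[sv_ThreadID.xy], vSum );
    }

A second pass along columns completes the separable kernel.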
Other Example Techniques
These are not used directly in game post-processing today, but are key foundations of other algorithms
Scan (prefix sum), and FFT (fast Fourier transform)
Scan (Prefix-sum)
Each output element is the sum of all previous numbers in the input sequence
Used to compute writes in irregular arrays
Foundation of Summed Area Tables
Known GPU algorithms (Horn's method)
Pyramid scheme, so I/O bound
Sharing memory between threads results in ~2x speedup
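A hedged sketch of the shared-memory idea for a single thread group (group size, the Input/Output resources, and the pre-release syntax are assumptions; this is a naive Hillis–Steele-style scan, and per-group results still need combining as in the pyramid approaches above):

    #define GROUP_SIZE 256

    groupshared float Data[GROUP_SIZE];

    PrefixSum()
    {
        uint t = sv_ThreadIDinGroup.x;
        Data[t] = Input[sv_ThreadID.x];     // stage one element per thread
        SynchronizeThreadGroup();

        // Each step adds in the value 'stride' places back, doubling the stride;
        // after log2(GROUP_SIZE) steps every element holds its inclusive prefix sum.
        for ( uint stride = 1; stride < GROUP_SIZE; stride *= 2 )
        {
            float fPartial = ( t >= stride ) ? Data[t - stride] : 0;
            SynchronizeThreadGroup();       // everyone has read the old values...
            Data[t] += fPartial;
            SynchronizeThreadGroup();       // ...and written the new ones
        }

        Output[sv_ThreadID.x] = Data[t];    // this group's inclusive scan
    }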
Scan (Prefix-sum)
We are looking at providing this in a library routine
Along with FFT, etc.
Summed Area Table
2D equivalent of Scan
Each element of 2D array has sum of all elements up/left of it
Enables a box filter whose cost is independent of kernel size (instead of O(k))
Fast generation of:
Shadow blur with distance
Depth-of-field
Area light integrals, etc.
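As a quick worked illustration of that kernel-size independence (standard SAT indexing, with illustrative names): the sum over any axis-aligned rectangle needs only four table lookups, regardless of how large the rectangle is:

    // Sum of all pixels in the inclusive rectangle (x0,y0)..(x1,y1):
    float fSum = SAT[x1][y1] - SAT[x0-1][y1] - SAT[x1][y0-1] + SAT[x0-1][y0-1];

    // Box filter = average over the rectangle; cost is the same for any kernel size.
    float fBox = fSum / ( (x1 - x0 + 1) * (y1 - y0 + 1) );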
Fast Fourier Transform
Converts image into frequency domain
Many operations are faster in frequency domain than in spatial domain
e.g. convolution becomes a multiply
Trivial detection of periodic noise
Some application to motion estimation
Core algorithm similar to scan
Similar I/O patterns, but a more math-intensive inner loop
Direct3D FFT
Ping-pong between 2 R32G32F surfaces
R is the real part, G the imaginary part
Do log N passes along rows, then columns
Pixel shader only
Does not use blenders or iterators
Uses vPos.xy as array indices [i][j]
Inner loop is math intensive
20+ instructions including trig
Indexing math dominates unless DX10
[Figures: FFT example image, before and after]
FFT Performance
Complex 1024x1024 2D FFT:
Software          42 ms      6 GFlops
Direct3D 9        15 ms     17 GFlops
Prototype DX11     6 ms     42 GFlops   (3x)
Latest chips       3 ms    100 GFlops   (6x)
Shared register space and random access writes enable ~2x speedups
Order-Independent Translucency
Eliminates draw-order issues and shimmer in moving scenes
Correct AA even of transparent objects
Any object is transparent if antialiased, e.g. alpha-tested leaves in forests
Current methods require large sample counts
Alpha-to-coverage
Depth peeling with occlusion queries
The A-Buffer Method
A-buffer is a more accurate method
Accumulate object data in a per-pixel list
Then sort each pixel's list into order
Collapse to final color and display
Brings visual quality to movie levels without requiring 256-sample MSAA
Something to keep an eye on for OIT
A-Buffer Rendering
Currently prototyping using refrast
DirectX reference rasterizer running on CPU
Measuring memory access patterns/locality
Evaluating feasibility of hardware
Not really feasible with current Direct3D
Compute shader features enable this
Such as indexed writes, counters, etc. (sketched below)
Rendering to structures beyond regular arrays
But performance is still largely unknown
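A hedged sketch of the kind of structure these features enable (all names are hypothetical, an exchange-style atomic on the head pointer is assumed, and this is an illustration rather than the prototype being measured): appending a fragment to a per-pixel linked list using a global atomic counter and scattered writes:

    struct Fragment { float4 color; float depth; uint next; };

    OutputBuffer<uint>     HeadPointer;   // one list head per pixel, initialized to 0xFFFFFFFF
    OutputBuffer<Fragment> Fragments;     // flat pool of fragment records
    OutputBuffer<uint>     Counter;       // global allocation counter

    // Claim a unique slot in the fragment pool.
    uint uSlot;
    InterlockedAdd( Counter[0], 1, uSlot );
    Fragments[uSlot].color = vColor;
    Fragments[uSlot].depth = fDepth;

    // Link the new fragment in front of the pixel's current list head.
    uint uOldHead;
    InterlockedExchange( HeadPointer[pixelIndex], uSlot, uOldHead );
    Fragments[uSlot].next = uOldHead;

A compute shader pass can then walk, sort, and collapse each pixel's list as described above.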
Additional Algorithms
New rendering methods
Ray-tracing, collision detection, etc.
Rendering elements at different resolutions
Non-rendering algorithms
IK, physics, AI, simulation, fluid simulation, radiosity
Need more general data structures
Quad/octrees, irregular arrays, sparse arrays
Need linear algebra
Summary
Compute Shader is coming in Direct3D 11
GPU performance levels for more applications
Scalable parallel processing model
Code should scale for several generations
Increased generality will enable both:
Improved performance on existing GPU tasks
More CPU tasks can switch to data-parallel (DP) cores
Full cross-vendor support
Enables broadest possible installed base
Questions?
www.xnagamefest.com
© 2008 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.