DirectX – Searching for divine code

Things I am going to play with

Some things I want to experiment with as a fun exercise:

F# : F# is an awesome language that I started experimenting with a few months back. While I know the basic syntax of many parts of the language now, I am nowhere near proficient. Even in my brief time with the language, I am enjoying it a lot. It is more productive than even Python and light years ahead of C++. Looking forward to playing with some of the interesting features such as async workflows and type providers. Two books that I have on my radar are “F# deep dives” and “Purely functional data-structures” by Okasaki.

C# and .net ecosystem: Playing with F# actually has made me interested in playing with more .net technologies. From the productivity standpoint, C# looks a lot nicer than Java or C++ and does have some interesting technologies such as async construct and LINQ. On the performance side, CLR looks like a good JIT and a lot of innovation and pragmatic decisions seem to have been taken in the .net ecosystem. For example, inclusion of value types, SIMD types introduced recently with RyuJIT and libraries such as TPL should make it possible to write reasonably high-performance code despite CLR being a managed runtime. Recent open-sourcing of the .net core is also an interesting move.

ZeroMQ: I don’t have much experience with message-queue based systems and ZeroMQ looks like a good place to start. Have heard a lot of good things about it.

C++11: I have read up on many of the features in C++11, and have a basic understanding, but have not used them in non-trivial ways so I am not confident about them yet. Overall I am not at all liking where C++ is going. However, as a professional programmer who works a lot with C++ I feel I should keep myself updated because I expect to see more C++11 going forward.

OpenCL 2.0: I have read the specs and am familiar with many of the features theoretically but want to spend some time with features such as device-side enqueue and SVM to see the types of algorithms that are now possible on modern hardware.

Direct3d 11 and 12: Well quite self-evident 🙂 Going with the .net theme might try out SharpDX perhaps instead of going the native route.

apitest results on AMD comparing various OpenGL and D3D11 approaches

UPDATE: Issues related to texture arrays appears to be an application error. Michael Marks provides a fork that corrects some issues.

UPDATE 2: I reran some of the Linux benchmarks, earlier Linux results appear to have a bug. Performance on Linux and Windows now similar.

The strengths and weaknesses of OpenGL compared to other APIs (such as D3D11, D3D12 and Mantle) and the recent talk Approaching Zero Driver Overhead (AZDO) have become topics of hot discussion. AZDO talk included a nice tool called “apitest” that allows us to compares a number of solutions in OpenGL and D3D. Hard data is always better than hand-wavy arguments. In the AZDO talk, data from “apitest” was shown for Nvidia hardware but no numbers were given for either Intel or AMD hardware. Michael Marks ran the tool on Linux and had some interesting results to report that imply that AMD’s driver have higher overhead than Nvidia’s driver.

However, I wanted to answer slightly different questions. For example, if we just restrict to AMD hardware, how does the performance compare to D3D? What is the performance and compatibility difference between Windows and Linux? And what is the performance of various approaches across hardware generations? With these questions in mind, I built and ran the apitool on some AMD hardware on both Linux and Windows.

Hardware: AMD A10-5750M APU with 8650G graphics (VLIW4) + 8750M (GCN) switchable graphics. Catalyst 14.4 installed on both Linux and Windows. Catalyst allows explicit selection of graphics processor. Laptop has a 1366×768 screen.

Build: On Windows, built for Win32 (i.e. 32-bit) using VS 2012 Express and DX SDK (June 2010). Release setting was used. On Linux, built for 64-bit using G++ 4.8 on OpenSUSE 13.1. Required one patch in SDL cmake file.

Run: Tool was run using “apitest.exe -a oglcore -b -t 15” which is the same setting as Michael Marks. On Linux, it was run under KDE and desktop effects were kept disabled in case that makes a difference.

Issues encountered:

I encountered some issues. I am not sure if the error is in the application, the user (i.e me) or the driver.

Solutions using shader draw parameters (often abbreviated as SDP in the talk) appear to lead to driver hangs on GCN and are unsupported on VLIW4. Therefore I have not reported any SDP results here. Michael Marks also saw the same driver hangs on GCN on Linux, did some investigation and has posted some discussion here.
Solutions involving ARB_shader_image_load_store (which is core in OpenGL 4.2 and not some arcane extension) appear to be broken on Windows but are working on Linux despite installing the same Catalyst version. On Windows, the driver appears to be reporting some compilation error for some shaders saying that “readonly” is not supported unless you enable the extension. .UPDATE: Was application bug.
GCN based 8750M should support bindless textures. However, some of the bindless based solutions failed to work. For example GLBindlessMultiDraw failed. Sparse bindless also failed to work.

Data:
I did not test 8750M on Linux, partially because I am lazy and partially because I did not want to disturb my Linux setup which I use for my university work. Anyway, here is the data for 3 problems covered by apitest.

Dynamic streaming

Solution	8650G Windows (FPS)	8650G Linux (FPS)	8750M Windows (FPS)
D3D11MapNoOverwrite	14.629	0	19.6
D3D11UpdateSubresource	0.978	0	1.198
GLMapPersistent	19.987	19.471	20.885
GLBufferSubData	0.89	1.015	0.843
GLMapUnsynchronized	0.397	0.409	0.362

Textured quads

GLBindlessUnsupportedUnsupported112.7

Solution	8650G Windows (FPS)	8650G Linux (FPS)	8750M Windows (FPS)
D3D11Naive	65.11	0	42.463
GLTextureArrayMultiDraw-NoSDP	346	400.25	492
GLTextureArray	235	276.67	350
GLNoTex	215.94	275.57	347.608
GLTextureArrayMultiDrawBuffer-NoSDP	212	239	472
GLNoTexUniform	81.429	93.38	133.115
GLTextureArrayUniform	80.75	92.72	109.5
GLNaiveUniform	32.717	31.64	32.059
GLNaive	27.3	15.02	27.21

Untextured quads:

Solution	8650G Windows (FPS)	8650G Linux (FPS)	8750M Windows (FPS)
D3D11Naive	4.078	0	2.16
GLMultiDraw-NoSDP	17.221	17.661	19.93
GLMapPersistent	10.844	11.089	13.687
GLDrawLoop	10.615	10.45	13.59
GLBufferStorage-NoSDP	9.862	10.041	5.096
GLMultiDrawBuffer-NoSDP	9.069	9.404	7.703
GLMapUnsynchronized	5.908	5.726	7.281
GLTexCoord	5.702	5.382	4.963
GLUniform	3.282	3.53	4.399
GLBufferRange	3.031	3.509	3.119
GLDynamicBuffer	0.361	0.515	0.37

Conclusion:

The theoretical principles discussed in the AZDO talk appear to be sound. The “modern GL” techniques discussed do appear to substantially reduce driver overhead compared to older GL techniques. The reduction was seen on AMD hardware on both Windows and Linux and worked on two different architectures (VLIW4 based APU, GCN based discrete). In particular, persistent buffer mapping (sometimes called PBM) and multi-draw-indirect (MDI) based techniques seem useful.
On Windows, the best OpenGL solutions do appear to significantly outperform D3D. I am not an expert on D3D so I am not sure if better D3D11 solutions exist.
If a test ran successfully on both Windows and Linux, then the performance was qualitatively similar in most cases.
However, while theoretically things look good, in practice some issues were encountered. Some of the solutions failed to execute despite theoretically being supported by the hardware. In particular, shader draw parameters as well as some variations of bindless textures appear to be problematic. I am not sure if it was the fault of the application, the user (me) or the driver.

OpenGL, OpenGL ES compute shaders and OpenCL spec minimums

Well, I really got this one wrong. Previously I had (mistakenly) claimed that OpenGL compute shader, OpenGL ES compute shader etc. don’t really have specified minimums for some things but I got that completely wrong. I guess I need to be more careful while reading these specs as the required minimums are not necessarily located at the same place where they are being explained. Some of the OpenCL claims still stand though OpenCL’s relaxed specs are a bit more understandable given that it has to run on more hardware than others.

Here are the minimums:

OpenGL 4.3: 1024 work-items in a group, and 32kB of local memory
OpenGL ES 3.1: 128 work-items in a group, and 16kB of local memory (updated from 32kB, Khronos “fixed” the spec)
OpenCL 1.2: 1 work-item in a group, 32kB of local memory
OpenCL 1.2 embedded: 1 work-item in a group, 1kB of local memory
DirectCompute 11 (for reference): 1024 work-items in a group, 32kB of local memory

Thanks to Graham Sellers and Daniel Koch on twitter for pointing out the error. I guess I got schooled today.

DirectX 12 needs to deliver on compute

DirectX 12 will be unveiled soon. Given that DirectCompute forms the core of MS GPGPU computing stack (such as powering their C++ AMP implementation), I really hope that DirectCompute 12 delivers on the compute side. Unlike OpenCL, DirectCompute has had the massive advantage of good integration into a graphics API and that is why we have seen DirectCompute being adopted into 3D apps such as games where OpenCL really wasn’t. However, DirectCompute has fallen behind the times with little support for modern GPU features. This, in turn, hurts GPGPU solutions such as MS implementation of C++ AMP. Five things that I am hoping to see:

1. Support for shared memory architecture eliminating CPU-GPU data transfers where possible and also allowing platform-level atomics. This will be beneficial for everything from mobile (where SoCs rule the roost and discrete GPUs basically don’t exist) to servers.

2. Exposing multiple command queues per GPU, thus allowing concurrent execution of kernels

3. Launching GPU kernels from within GPU kernels.

4. Stable low-level bytecode that is more suitable as a target for high level compilers. D3D 11 has a bytecode, but ISVs are not given the proper specification and discouraged from using it. This needs to be opened up to enable third-party compilers for high-level languages

5. Compatibility with Windows 8

DirectCompute from an OpenCL and CUDA perspective

Currently, most of my GPGPU experience is with OpenCL and CUDA. I have recently been looking at DirectCompute as another IHV-neutral API besides OpenCL. I have tried porting some of my OpenCL code to DirectCompute to gain experience. Here are some notes, in no particular order, from the perspective of writing compute code which has no graphics component:

1. Basic programming paradigm is similar to OpenCL 1.2 and basic CUDA. You have threads organized into thread groups, you have access to local memory/on-chip shared memory and synchronization etc is fairly similar as well.

2. However, it is far behind the functionality in CUDA 5.x and OpenCL 2.0. For example, there is no support for dynamic parallelism. It is likely that Microsoft is considering adding these features, but with no public roadmap it is difficult to say anything. DirectCompute has not really evolved much since it started shipping in Windows 7 in late 2009 (i.e. almost 4 years ago).

3. No support for multiple command queues per context. CUDA has streams and OpenCL has the ability to create multiple command queues per context, but I think there is only one implicit command queue per device context in DirectCompute. I think this will be a problem under many compute scenarios.

4. Shared memory support is very limited. D3D 11.2 introduces some features that take one step towards shared memory, but it is not fully there yet. On OpenCL, we already have decent shared memory support under OpenCL 1.2 on Intel platforms. OpenCL 2.0 is going to bring proper shared memory support on many platforms.

5. Double-precision support in HLSL is limited. There are no trigonometric functions or exponential functions. On Windows 7, you don’t even get double-precision FMA or divide in the shader bytecode. You can potentially the missing functions yourself but a serious compute API should include them. Using Microsoft’s C++ AMP instead of using DirectCompute takes care of some of this on Windows 8.

6. Vendor tools are geared for games and graphics applications. Profilers from various vendors all provide “per frame” analysis, which is useful for graphics applications but useless for pure compute scenarios. OpenCL and CUDA tools are geared for compute and are getting pretty good. I think this will again be different for C++ AMP.

7. Driver quality for DirectCompute is far more consistent across vendors compared to OpenCL. With OpenCL, it is not uncommon to run into frustrating bugs in various drivers. Also, sometimes driver writers interpret the OpenCL spec quite “creatively” which is very frustrating and often requires multiple codepaths even in host API code. DirectCompute drivers are far more robust, less buggy and the program behavior is usually what you expect across all vendors.

8. Hardware-vendor independant shader bytecode is great to have in DirectCompute. OpenCL SPIR will tackle this but it is not yet implemented.

9. Thread-group size is compile time constant in DirectCompute. In OpenCL and CUDA, you can delay specifying the group size until dispatch and can dispatch it with a different group size in every invocation. Even OpenGL compute shaders are getting this ability with a new extension (GL_arb_compute_variable_group_size).

10. Documentation is not that great. I guess I am used to downloading OpenCL specs directly and reading them while MSDN is a bit harder to navigate. For example, Direct3D 11.2 docs are essentially diffs over D3D 11.1 which makes it hard to get the complete up-to-date picture in one place. Vendor documentation is also woefully inadequate on many DirectCompute related things. For example, just trying to find out which GPUs from any vendor supports all double-precision instructions and which doesn’t is hard. Vendors also don’t seem to bother providing detailed optimization guides for DirectCompute.

My experience is limited however, and it is likely I have gotten some things wrong. If you have any corrections to offer, please let me know 🙂

Overall I feel that if your app is not already using Direct3D, you probably should not use DirectCompute. You are probably better off choosing OpenCL for many compute scenarios. OpenCL has some technical advantages over DirectCompute as outlined above, is a more future-proof and platform-independent path and has much better documentation and tooling support today than DirectCompute for pure compute scenarios. Alternately, if you want to stick to Microsoft stack, then you are probably better off choosing C++ AMP over DirectCompute.