DirectCompute from an OpenCL and CUDA perspective

Currently, most of my GPGPU experience is with OpenCL and CUDA. I have recently been looking at DirectCompute as another IHV-neutral API besides OpenCL. I have tried porting some of my OpenCL code to DirectCompute to gain experience. Here are some notes, in no particular order, from the perspective of writing compute code which has no graphics component:

1. The basic programming paradigm is similar to OpenCL 1.2 and basic CUDA. Threads are organized into thread groups, you have access to local memory (on-chip shared memory), and synchronization works in much the same way.

2. However, it is far behind the functionality in CUDA 5.x and OpenCL 2.0. For example, there is no support for dynamic parallelism.  It is likely that Microsoft is considering adding these features, but with no public roadmap it is difficult to say anything. DirectCompute has not really evolved much since it started shipping in Windows 7 in late 2009 (i.e. almost 4 years ago).

3. No support for multiple command queues per context. CUDA has streams and OpenCL has the ability to create multiple command queues per context, but I think there is only one implicit command queue per device context in DirectCompute.  I think this will be a problem under many compute scenarios.

4. Support for memory shared between the CPU and GPU is very limited. D3D 11.2 introduces some features that take one step towards shared memory, but it is not fully there yet. On OpenCL, we already have decent shared memory support under OpenCL 1.2 on Intel platforms. OpenCL 2.0 is going to bring proper shared memory support on many platforms.

5. Double-precision support in HLSL is limited. There are no trigonometric functions or exponential functions. On Windows 7, you don’t even get double-precision FMA or divide in the shader bytecode. You can potentially implement the missing functions yourself, but a serious compute API should include them. Using Microsoft’s C++ AMP instead of DirectCompute takes care of some of this on Windows 8.

6. Vendor tools are geared for games and graphics applications. Profilers from various vendors all provide “per frame” analysis, which is useful for graphics applications but useless for pure compute scenarios. OpenCL and CUDA tools are geared for compute and are getting pretty good. I think this will again be different for C++ AMP.

7. Driver quality for DirectCompute is far more consistent across vendors compared to OpenCL. With OpenCL, it is not uncommon to run into frustrating bugs in various drivers. Also, sometimes driver writers interpret the OpenCL spec quite “creatively” which is very frustrating and often requires multiple codepaths even in host API code. DirectCompute drivers are far more robust, less buggy and the program behavior is usually what you expect across all vendors.

8. Hardware-vendor independent shader bytecode is great to have in DirectCompute. OpenCL SPIR will tackle this but it is not yet implemented.

9. Thread-group size is a compile-time constant in DirectCompute. In OpenCL and CUDA, you can delay specifying the group size until dispatch and can use a different group size in every invocation. Even OpenGL compute shaders are getting this ability with a new extension (GL_ARB_compute_variable_group_size).

10. Documentation is not that great. I guess I am used to downloading OpenCL specs directly and reading them, while MSDN is a bit harder to navigate. For example, Direct3D 11.2 docs are essentially diffs over D3D 11.1, which makes it hard to get the complete up-to-date picture in one place. Vendor documentation is also woefully inadequate on many DirectCompute-related things. For example, just trying to find out which GPUs from any vendor support all double-precision instructions and which don’t is hard. Vendors also don’t seem to bother providing detailed optimization guides for DirectCompute.
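Point 9 has a practical consequence for host code. Here is a rough sketch of the arithmetic, written in Python only for brevity; the function names and sizes are made up for illustration. With the group size baked into the compiled shader, the only thing the host can vary per dispatch is the number of groups, so the problem size is rounded up and the shader has to bounds-check the padded threads.

```python
def groups_for_dispatch(num_items, group_size):
    """Number of thread groups needed to cover num_items work items
    when group_size is fixed (as it is in a compiled compute shader)."""
    if num_items < 0 or group_size <= 0:
        raise ValueError("num_items must be >= 0 and group_size must be > 0")
    # Ceiling division: round up so the last, partial group is still launched.
    return (num_items + group_size - 1) // group_size


def padded_threads(num_items, group_size):
    """Threads launched beyond num_items; the kernel must skip these."""
    return groups_for_dispatch(num_items, group_size) * group_size - num_items


# Hypothetical example: 1000 work items, group size 256 fixed at compile time.
groups = groups_for_dispatch(1000, 256)
idle = padded_threads(1000, 256)
print(groups, idle)
```

If a different group size turns out to be faster on some GPU, the DirectCompute route is to compile another variant of the shader, whereas in OpenCL or CUDA you would just pass a different value at launch.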

My experience is limited however, and it is likely I have gotten some things wrong. If you have any corrections to offer, please let me know 🙂

Overall I feel that if your app is not already using Direct3D, you probably should not use DirectCompute. You are probably better off choosing OpenCL for many compute scenarios. OpenCL has some technical advantages over DirectCompute as outlined above, is a more future-proof and platform-independent path, and has much better documentation and tooling support today than DirectCompute for pure compute scenarios. Alternatively, if you want to stick to the Microsoft stack, then you are probably better off choosing C++ AMP over DirectCompute.

Intel Xeon Phi announcement and summary

Intel announced the Xeon Phi branding and basic architecture long ago, but we finally have details and pricing. Xeon Phi is essentially a 62-core x86 chip. Different SKUs will have different numbers of cores enabled and different clock speeds. TDPs and rough performance numbers look competitive with offerings such as Nvidia Tesla, but the Xeon Phi offers higher programmability and potentially better efficiency on some workloads. The chip will sit on a PCIe board and can either be used to offload parts of your program, or run the whole program. The board offers a number of programming interfaces, such as OpenMP, that are a lot more convenient than writing, say, CUDA code. Compared to GPUs, it should be relatively easy to get your application up and running on a Xeon Phi, though optimization will still require some effort.

I am also happy to report that OpenCL is fully supported, so porting code from GPUs to Xeon Phi should be easy. Kudos to Intel for getting behind OpenCL and actually delivering fully working products.

Each core is an in-order dual-issue x86 design with SMT (4 threads), backed by a 512-bit vector unit capable of doing FMA operations. Each vector unit can do 8 fp64 FMAs (16 flops) or 16 fp32 FMAs (32 flops) each cycle. While there is no SSE or AVX available on this core, the vector instruction set is actually very nice, with operations like scatter-gather as well as per-lane write masks. IMO it is a cleaner and more flexible vector ISA than, say, AVX.

Unlike GPUs, Xeon Phi does not have an on-chip user-programmable local memory. Instead, each core is backed by a large 512kB L2 cache, and the caches are fully coherent. In total, on a 60-core variant that is 30MB of coherent L2 cache, compared to the 1-2 MB L2 caches we are used to seeing on GPUs. This is a HUGE win compared to GPUs IMO and should give very good efficiency on some workloads, such as some types of sparse matrices. Honestly, dealing with on-chip shared memory on GPUs is a giant pain.
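The per-core numbers above can be turned into chip-level figures with a little arithmetic. This is only a back-of-the-envelope sketch: the 60-core count, per-cycle FMA rates and 512kB L2 size come from the text, while the 1.0 GHz clock is a placeholder I picked for illustration, not an actual SKU specification.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle_per_core):
    """Theoretical peak in GFLOPS: every core retires a full vector FMA each cycle."""
    return cores * clock_ghz * flops_per_cycle_per_core


# From the text: one 512-bit FMA per cycle per core, counted as 2 flops per lane.
FP64_FLOPS_PER_CYCLE = 8 * 2    # 8 fp64 lanes
FP32_FLOPS_PER_CYCLE = 16 * 2   # 16 fp32 lanes

CORES = 60          # the 60-core variant mentioned in the text
CLOCK_GHZ = 1.0     # ASSUMPTION: placeholder clock, substitute the real SKU value
L2_PER_CORE_KB = 512

fp64_peak = peak_gflops(CORES, CLOCK_GHZ, FP64_FLOPS_PER_CYCLE)
fp32_peak = peak_gflops(CORES, CLOCK_GHZ, FP32_FLOPS_PER_CYCLE)
total_l2_mb = CORES * L2_PER_CORE_KB / 1024

print(f"fp64 peak: {fp64_peak:.0f} GFLOPS, fp32 peak: {fp32_peak:.0f} GFLOPS")
print(f"total coherent L2: {total_l2_mb:.0f} MB")
```

These are theoretical peaks; sustained numbers depend on how well the code vectorizes and how well it uses the caches.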

My rough guess is that Nvidia’s Tesla K20X will retain a 10-15% edge in some brute force tests as well as tests like generic dense linear algebra, and will retain an advantage in fp32 workloads, but there will also be workloads where Xeon Phi will win out. And overall Xeon Phi should retain a programmability advantage.

As an academic (currently), I am a little disappointed that I will likely not be able to test my tools on a Xeon Phi, as we do not have the budget to buy them. With Nvidia, one can start experimenting with CUDA by buying just a $100 card, and Nvidia has also been open about seeding their boards to universities where they feel appropriate. Xeon Phis start upwards of $2k (much like Teslas), so not many labs will have access to them. I would like to see Intel offer some kind of program to universities to boost the Xeon Phi’s popularity and grow the pool of programmers available for their card 🙂

Overall, a very good showing from Intel, though they do need to keep executing as other competitors are not sitting idle either.

Intel Xeon Phi and OpenCL

Does the Intel Xeon Phi support OpenCL? It has been hard to get a definitive official answer, but all the signs point to “yes”.

Take this story on HPCWire about Accelereyes adding Xeon Phi support to their well-known ArrayFire library through OpenCL. Then there is Intel’s marketing material PDF showing OpenCL as an example of languages that run on the Xeon Phi. There was also an interview of Yariv Aridor of Intel, who was described as leading the implementation of OpenCL on Xeon Phi.

Intel already has an x86 OpenCL implementation for their Core processors. So, at least for basic support, getting it working on Xeon Phi requires two things. First, they need to add support in the runtime for the OpenCL APIs, such as allocating memory. Second, they need to add support in the kernel compiler for the new 512-bit vector instructions in the Xeon Phi instead of AVX on Core processors. Both are certainly doable and do not require a big investment from Intel, so there is not much reason for them to not support OpenCL. After all, Intel has traditionally been very good at supporting as many languages on their platform as they can.

I would say, we are definitely going to see OpenCL on Xeon Phi, which is very good news for the OpenCL ecosystem.

Arndale board with Exynos 5250 does NOT do OpenCL right now

Yet another Exynos 5250 device, and still no OpenCL implementation available.
Arndale Board marketing material does mention OpenCL in a few places, but the board does not ship with the driver. Source: This forum post. It is frustrating that many vendors in the ARM space keep mentioning OpenCL in their marketing and yet don’t ship working drivers.

Update: In a tweet from @ARMMultimedia, they confirmed that they will make OpenCL BSPs available by the time the board ships. Still waiting for more information about which OS this is for, and whether it will require any NDAs etc. Hopefully we will know soon.

RaijinCL : Autotuning GEMM routines for OpenCL

Announcing a new project: RaijinCL. It is a numerical library for matrix computations in OpenCL, though currently only one part is available. The first available part is a set of autotuning GEMM (general matrix multiply) routines. It is a work in progress, and things will improve over time. Do give your feedback.
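For readers unfamiliar with the term, here is a toy sketch of what autotuning means in general. It is not RaijinCL code and not how RaijinCL works internally: it runs on the CPU in plain Python, and the names `tiled_matmul` and `autotune` are hypothetical. The idea is simply to run several variants of the same routine (here, different tile sizes), time each one on the actual machine, and keep the fastest.

```python
import time


def tiled_matmul(a, b, n, tile):
    """C = A * B for n x n row-major flat lists, blocked into tile x tile chunks."""
    c = [0.0] * (n * n)
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        aik = a[i * n + k]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i * n + j] += aik * b[k * n + j]
    return c


def autotune(n, candidate_tiles, repeats=3):
    """Time every candidate tile size and return (fastest_tile, timings)."""
    a = [float(i % 7) for i in range(n * n)]
    b = [float(i % 5) for i in range(n * n)]
    timings = {}
    for tile in candidate_tiles:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            tiled_matmul(a, b, n, tile)
            # Keep the best of several runs to reduce timing noise.
            best = min(best, time.perf_counter() - start)
        timings[tile] = best
    return min(timings, key=timings.get), timings


best_tile, timings = autotune(32, [4, 8, 16, 32])
print("fastest tile size on this machine:", best_tile)
```

A real GEMM autotuner for OpenCL would explore a much larger space of kernel variants and would time them on the target device, but the search loop has the same shape.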

More information can be found here: