AMD VLIW4 GPUs and double precision floating point (fp64)

UPDATE: The 6900 series' double-precision support now seems fine to me, but the Trinity/Richland situation remains unexplained.

AMD has two VLIW4-based product lines: the Radeon 6900 series and the GPUs in Trinity/Richland APUs (Radeon 7660D, 8650G, etc.). Some of the launch media coverage stated that these GPUs have fp64 capability. I recently got a Richland-based system to work on and realized the following:

a) AMD does not support cl_khr_fp64 (i.e. the standard OpenCL extension for fp64) on the 8650G GPU and only supports cl_amd_fp64, but AMD's documentation is not very clear about how the two differ. (A runtime check for the two extensions is sketched after this list.)

b) Earlier driver versions for Trinity (which afaik has the same silicon as Richland) definitely had cl_khr_fp64 support, but it was later demoted to cl_amd_fp64 only.

c) Richland’s GPU (8650G) does not seem to support double precision under Direct3D either.

d) Forum postings indicate that the latest drivers for 6900 series GPUs also do not support cl_khr_fp64 and only support cl_amd_fp64. I am not sure about the fp64 support status under DirectCompute.
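To make the difference concrete, here is a minimal sketch in OpenCL host C of how to tell at runtime which fp64 extension a device actually reports (the function name check_fp64 is just for illustration):

    #include <stdio.h>
    #include <string.h>
    #include <CL/cl.h>

    /* Report which fp64 extension a device exposes. Assumes `device` is a
       valid cl_device_id from clGetDeviceIDs; a real program should query
       the extension string's size first instead of using a fixed buffer. */
    void check_fp64(cl_device_id device)
    {
        char extensions[4096] = {0};
        clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
                        sizeof(extensions), extensions, NULL);

        if (strstr(extensions, "cl_khr_fp64"))
            printf("Standard fp64 (cl_khr_fp64) supported\n");
        else if (strstr(extensions, "cl_amd_fp64"))
            printf("Only AMD fp64 (cl_amd_fp64) supported\n");
        else
            printf("No fp64 extension reported\n");
    }

On the kernel side the pragma changes too: you write #pragma OPENCL EXTENSION cl_amd_fp64 : enable instead of the cl_khr_fp64 variant, so portable kernel code ends up needing both.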

My speculation is that AMD discovered some issue with IEEE compliance in the fp64 units of the VLIW4 GPUs, and hence cannot expose fp64 through APIs that require full IEEE compliance. If anyone has any insight into the issue, let me know.

DirectCompute from an OpenCL and CUDA perspective

Currently, most of my GPGPU experience is with OpenCL and CUDA. I have recently been looking at DirectCompute as another IHV-neutral API besides OpenCL, and have tried porting some of my OpenCL code to it to gain experience. Here are some notes, in no particular order, from the perspective of writing compute code that has no graphics component:

1. The basic programming paradigm is similar to OpenCL 1.2 and basic CUDA. You have threads organized into thread groups, you have access to on-chip local/shared memory, and synchronization is fairly similar as well. A rough mapping is sketched below.
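To make the mapping concrete, here is a trivial OpenCL C kernel (the names and the work-group size of 64 are just for illustration) with comments noting rough HLSL compute shader equivalents:

    /* OpenCL C kernel illustrating the shared paradigm; comments note the
       rough HLSL equivalents. Assumes a work-group size of 64 is chosen at
       dispatch (in HLSL this would be [numthreads(64,1,1)]). */
    __kernel void scale(__global const float *in, __global float *out)
    {
        __local float tile[64];          /* ~ HLSL groupshared memory */

        size_t gid = get_global_id(0);   /* ~ SV_DispatchThreadID */
        size_t lid = get_local_id(0);    /* ~ SV_GroupThreadID    */

        tile[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);    /* ~ GroupMemoryBarrierWithGroupSync() */

        out[gid] = 2.0f * tile[lid];
    }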

2. However, it is far behind the functionality in CUDA 5.x and OpenCL 2.0. For example, there is no support for dynamic parallelism (see the sketch below of what OpenCL 2.0 offers here). It is likely that Microsoft is considering adding these features, but with no public roadmap it is difficult to say anything. DirectCompute has not really evolved much since it started shipping in Windows 7 in late 2009 (i.e. almost 4 years ago).
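For contrast, here is roughly what dynamic parallelism looks like under the provisional OpenCL 2.0 spec (the kernel names are hypothetical): a kernel enqueues another kernel with no CPU round trip.

    /* Device-side enqueue per the provisional OpenCL 2.0 spec: `parent`
       launches `child` without any CPU involvement. DirectCompute has no
       comparable mechanism today. */
    kernel void child(global float *data)
    {
        data[get_global_id(0)] *= 2.0f;
    }

    kernel void parent(global float *data, int n)
    {
        if (get_global_id(0) == 0) {
            enqueue_kernel(get_default_queue(),
                           CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                           ndrange_1D(n),
                           ^{ child(data); });
        }
    }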

3. No support for multiple command queues per context. CUDA has streams and OpenCL can create multiple command queues per context (as below), but I think there is only one implicit command queue per device context in DirectCompute. I think this will be a problem under many compute scenarios.
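For reference, creating multiple queues in OpenCL is trivial; a minimal sketch, assuming ctx and dev were obtained earlier:

    #include <CL/cl.h>

    /* Create two independent command queues on one context, e.g. to let a
       transfer on one queue overlap a kernel on the other. */
    void make_two_queues(cl_context ctx, cl_device_id dev)
    {
        cl_int err;
        cl_command_queue q_compute = clCreateCommandQueue(ctx, dev, 0, &err);
        cl_command_queue q_copy    = clCreateCommandQueue(ctx, dev, 0, &err);

        /* Work enqueued on the two queues may execute concurrently;
           cl_event objects order work across queues when needed. */

        clReleaseCommandQueue(q_compute);
        clReleaseCommandQueue(q_copy);
    }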

4. Support for memory shared between CPU and GPU is very limited. D3D 11.2 introduces some features that take one step towards shared memory, but it is not fully there yet. On OpenCL, we already have decent shared memory support under OpenCL 1.2 on Intel platforms, and OpenCL 2.0 is going to bring proper shared virtual memory support on many platforms (sketched below).
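As a taste of where OpenCL is headed, here is a sketch based on the provisional OpenCL 2.0 SVM API (fine-grained SVM is an optional capability, and ctx and k are assumed to exist):

    #include <CL/cl.h>

    /* Shared virtual memory per the provisional OpenCL 2.0 spec: the same
       pointer is valid on the host and in kernels. */
    void svm_demo(cl_context ctx, cl_kernel k)
    {
        float *buf = (float *)clSVMAlloc(ctx,
                                         CL_MEM_READ_WRITE |
                                         CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                         1024 * sizeof(float), 0);
        buf[0] = 1.0f;                        /* host writes the pointer directly */
        clSetKernelArgSVMPointer(k, 0, buf);  /* kernel sees the same pointer     */
        /* ... enqueue the kernel, wait, then: */
        clSVMFree(ctx, buf);
    }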

5. Double-precision support in HLSL is limited. There are no trigonometric or exponential functions. On Windows 7, you don't even get double-precision FMA or divide in the shader bytecode. You can potentially write the missing functions yourself (one way to emulate divide is sketched below), but a serious compute API should include them. Using Microsoft's C++ AMP instead of raw DirectCompute takes care of some of this on Windows 8.
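For instance, one plausible way to build the missing divide on top of double add/multiply alone is Newton-Raphson refinement of a single-precision reciprocal; a sketch in plain C, not correctly rounded, with edge cases ignored:

    /* Hypothetical emulation of double-precision a/b using only double
       add/multiply. Each Newton-Raphson step roughly doubles the number of
       correct bits (24 -> 48 -> 53+), but the result is NOT correctly
       rounded and edge cases (overflow, NaN, denormals) are ignored. */
    double div_fp64_emulated(double a, double b)
    {
        double r = (double)(1.0f / (float)b);  /* ~24-bit seed reciprocal */
        r = r * (2.0 - b * r);                 /* refine to ~48 bits      */
        r = r * (2.0 - b * r);                 /* refine to full double   */
        return a * r;
    }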

6. Vendor tools are geared towards games and graphics applications. Profilers from various vendors all provide "per frame" analysis, which is useful for graphics applications but useless for pure compute scenarios. OpenCL and CUDA tools are geared towards compute and are getting pretty good. I think this will again be different for C++ AMP.

7. Driver quality for DirectCompute is far more consistent across vendors compared to OpenCL. With OpenCL, it is not uncommon to run into frustrating bugs in various drivers. Also, driver writers sometimes interpret the OpenCL spec quite "creatively", which is very frustrating and often requires multiple codepaths even in host API code. DirectCompute drivers are far more robust, less buggy, and program behavior is usually what you expect across all vendors.

8. Hardware-vendor-independent shader bytecode is great to have in DirectCompute. OpenCL SPIR will tackle this, but it is not yet implemented.

9. Thread-group size is a compile-time constant in DirectCompute. In OpenCL and CUDA, you can delay specifying the group size until dispatch, and can dispatch with a different group size in every invocation (see below). Even OpenGL compute shaders are getting this ability with a new extension (GL_ARB_compute_variable_group_size).
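A minimal sketch of what this looks like in OpenCL host code (assuming n is a multiple of both group sizes):

    #include <CL/cl.h>

    /* The same OpenCL kernel dispatched with two different work-group
       sizes; in HLSL, [numthreads(x,y,z)] bakes the size into the
       compiled shader. */
    void launch_twice(cl_command_queue q, cl_kernel k, size_t n)
    {
        size_t local_a = 64, local_b = 128;

        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, &local_a, 0, NULL, NULL);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, &local_b, 0, NULL, NULL);
    }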

10. Documentation is not that great. I guess I am used to downloading OpenCL specs directly and reading them, while MSDN is a bit harder to navigate. For example, Direct3D 11.2 docs are essentially diffs over D3D 11.1, which makes it hard to get the complete up-to-date picture in one place. Vendor documentation is also woefully inadequate on many DirectCompute-related things. For example, just trying to find out which GPUs from a given vendor support all double-precision instructions and which don't is hard. Vendors also don't seem to bother providing detailed optimization guides for DirectCompute.

My experience is limited however, and it is likely I have gotten some things wrong. If you have any corrections to offer, please let me know 🙂

Overall, I feel that if your app is not already using Direct3D, you probably should not use DirectCompute. You are probably better off choosing OpenCL for many compute scenarios: it has some technical advantages over DirectCompute as outlined above, is a more future-proof and platform-independent path, and has much better documentation and tooling support today for pure compute scenarios. Alternatively, if you want to stick to the Microsoft stack, you are probably better off choosing C++ AMP over DirectCompute.

Quick note on integrated GPU progress from Intel and AMD

If we look only at programmability and floating-point performance, the progress made on GPUs is remarkable. Consider the following:

  • Xbox 360 (2005 console): 240 GFlops and DirectX 10 level (mostly)
  • GTX 280 (mid-2008 flagship): 622 GFlops, DirectX 10 and CUDA 1.0
  • AMD Richland 8650G (integrated, 2013): 550+ GFlops, DirectX 11 and OpenCL 1.2
  • Intel Iris Pro 5200 (integrated, 2013): 650+ GFlops, DirectX 11 and OpenCL 1.2

Integrated graphics today, with a TDP of perhaps 20W for the graphics component, has more floating-point performance than a flagship GPU from just 5 years earlier. Bandwidth constraints still remain, though potential solutions are emerging, whether on-package eDRAM as on Intel's parts or GDDR5 as in the PS4. But it is impressive to see that integrated GPUs have advanced so much.

HSAIL specs released!

HSA Foundation finally released the Programmer's Reference Manual for HSA IL (HSAIL). So what is HSAIL and why should you care about it? Well, AMD and the HSA Foundation have been talking about Heterogeneous System Architecture or HSA, and HSAIL is one of the building blocks of HSA. Salient features of HSAIL are:

  • HSAIL is a portable low-level pseudo-assembly language for heterogeneous systems. HSAIL is not intended to be the real instruction set architecture (ISA) of any hardware. Instead, the hardware vendor will provide a compiler that converts HSAIL to the actual ISA. For example, AMD will provide a driver that compiles HSAIL to the ISA of future AMD APUs.
  • A regular programmer will never need to read or write HSAIL. Instead, it is intended as a code-generation target for high-level compilers. For example, you may have a C++ AMP or Python compiler that generates HSAIL, and the hardware vendor's driver will then compile the HSAIL to the hardware's ISA.
  • HSAIL is a very cleanly designed language. Compiling HSAIL to native code should be very fast in most cases.
  • HSAIL is a very flexible language. Any GPU implementing HSAIL will have very advanced capabilities, far beyond current standards such as OpenCL 1.2. For example, HSAIL allows GPU kernels to enqueue calls to further GPU kernels without CPU intervention. Function pointers (and hence C++ virtual functions) are supported.

HSA-enabled systems will implement a unified memory space for both the CPU and the GPU. Combined with the flexible execution model defined by HSAIL, I am very excited by the prospects of HSA-enabled products such as the Kaveri APU. I am working on a more detailed writeup about HSA and will post it soon.

Double precision on GPGPU APIs

Many scientific computations are done in double-precision floating point (i.e. fp64). Support for fp64 varies between GPU architectures as well as GPGPU APIs. Here I recap the capabilities of the various APIs, assuming the hardware support is present:

1. CUDA: Full support for fp64 including exponentials, trigonometry etc.

2. OpenCL: Full support for fp64, similar to CUDA (see the kernel sketch after this list)

3. OpenGL: An extension called ARB_gpu_shader_fp64 is available, but it only supports basics like addition, multiplication and division. It does not support exponentials, trigonometry, etc.

4. DirectCompute: On Windows 7, only supports fp64 add, multiply and a few comparison operators, but not division or exponentials etc. On Windows 8, some GPUs support double-precision division, reciprocal and FMA. However, afaik there is still no support for exponentials, trigonometry, etc.
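As a point of reference, here is a minimal OpenCL kernel sketch showing the fp64 builtins that the other APIs lack (on the AMD parts discussed earlier, the pragma would name cl_amd_fp64 instead):

    /* With cl_khr_fp64 enabled, OpenCL C provides double overloads of the
       full math library, including the transcendentals missing from HLSL
       and ARB_gpu_shader_fp64. */
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    __kernel void fp64_demo(__global double *out)
    {
        size_t i = get_global_id(0);
        double x = (double)i * 0.001;
        out[i] = exp(x) + sin(x) + x / 3.0;
    }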

So, if you want full fp64 support, I guess OpenCL and CUDA are the way to go currently.