HSAIL specs released!

HSA Foundation finally released the Programmer’s Reference Manual for the HSA Intermediate Language (HSAIL). So what is HSAIL and why should you care about it? Well, AMD and the HSA Foundation have been talking about Heterogeneous System Architecture (HSA) for a while now, and HSAIL is one of the building blocks of HSA. Salient features of HSAIL are:

  • HSAIL is a portable low-level pseudo assembly language for heterogeneous systems. HSAIL is not intended to be the real instruction set architecture (ISA) of any hardware. Instead, the hardware vendor will provide a compiler that will convert HSAIL to the actual ISA. For example, AMD will provide a driver to compile HSAIL to future AMD APUs.
  • A regular programmer will never need to read or write HSAIL. Instead, it is intended as a code-generation target for high-level compilers. For example, you may have a C++ AMP compiler or some sort of a Python compiler that generates HSAIL, and then the hardware vendor’s driver will compile the HSAIL to the hardware’s native ISA.
  • HSAIL is a very cleanly designed language. Compiling HSAIL to native code should be very fast in most cases.
  • HSAIL is a very flexible language. Any GPU implementing HSAIL will have very advanced capabilities, far beyond current standards such as OpenCL 1.2. For example, HSAIL allows GPU kernels to enqueue calls to further GPU kernels without CPU intervention. Function pointers (and hence C++ virtual functions) are supported.

HSA-enabled systems will implement a unified memory space for both CPU and GPU. Combined with the flexible execution model defined by HSAIL, I am very excited by the prospects of HSA-enabled products such as the Kaveri APU. I am working on a more detailed writeup about HSA and will post it soon.

Double precision on GPGPU APIs

Many scientific computations are done in double precision floating-point (i.e. fp64). Support for fp64 varies between GPU architectures as well as GPGPU APIs. Here I just recap the capabilities of various APIs, assuming the hardware support is present:

1. CUDA: Full support for fp64 including exponentials, trigonometry etc.

2. OpenCL: Full support for fp64, similar to CUDA

3. OpenGL: An extension called gpu_shader_fp64 is available, but it only supports basics like addition, multiplication and division. It does not support exponentials, trigonometry, etc.

4. DirectCompute: On Windows 7, only supports fp64 add, multiply and a few comparison operators, but not division, exponentials, etc. On Windows 8, some GPUs support double-precision division, reciprocal and FMA. However, as far as I know, there is still no support for exponentials, trigonometry, etc.

So, if you want full fp64 support, I guess OpenCL and CUDA are the way to go currently.

Renderscript from the perspective of an OpenCL/CUDA/C++ AMP programmer

Now that Renderscript Compute supposedly works on GPUs, here are some points to ponder about this strange “compute” API.

1. In OpenCL or CUDA, you specify a thread grid to launch a kernel. In Renderscript, there is no concept of a thread grid. Instead, you specify input and/or output arrays, and each “thread” processes one output item. It reminds me of the limitations of very old GPGPU technologies like the original Brook language, and is essentially similar to a pixel-shader model (each shader thread writes one item). You cannot even query the thread ID (the equivalent of, say, get_global_id() in OpenCL) in Renderscript.

Even gather/scatter operations are complicated and inefficient. You cannot scatter writes to the output array directly. You can, however, scatter to separately bound arrays, so you have to adopt the following hack:

a) Do not pass in the actual input and output arrays directly. Instead, bind the input and output arrays separately as dynamic pointers.

b) Pass an array containing the output indices as input.

c) For each index in the passed array, do the computation and write to the index.

This is just INEFFICIENT. There is no need for such contortions on modern hardware. Gather requires a similar inefficient hack. Even simple operations like matrix multiplication become unnecessarily complicated and inefficient in Renderscript. (See also this stackoverflow thread: http://stackoverflow.com/questions/10576583/passing-array-to-rsforeach-in-renderscript-compute )

2. In Renderscript, the API chooses which device to run your code on. That’s right, you have no idea if your code is running on the CPU or GPU or DSP etc. The work is supposedly automatically distributed between processors by the Renderscript runtime according to the driver implemented by the SoC vendor, and currently no guidelines are given about how to ensure code runs on the GPU beyond “simple code should run on GPU”.

3. Renderscript’s philosophy is to not expose the actual hardware information and properties to the programmer.  OpenCL lets you query a lot of information about the hardware properties, like the amount of local memory available. I guess given that the programmer can’t even decide where to run the code, this is not surprising.

4. CUDA introduced on-chip shared memory, and that concept has been adopted by almost every GPGPU API today including OpenCL, C++ AMP etc. However, Renderscript does not have any concept of on-chip shared memory. Thus, performance will be lower than that of well-optimized OpenCL kernels on many families of GPUs.
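To illustrate what that leaves on the table: the classic shared-memory optimization is tiled matrix multiply, where each work-group stages small tiles of the inputs in fast on-chip memory and reuses each loaded element many times. The plain-C sketch below (my own illustration, with assumed sizes N and TILE) uses stack tile buffers to play the role of the __local/__shared__ buffers; in OpenCL or CUDA the work-group would load each tile cooperatively.

```c
#include <string.h>

#define N 8      /* matrix dimension (assumed, for illustration) */
#define TILE 4   /* tile edge; maps to a work-group's shared buffer */

/* C = A * B for square N x N matrices, processed TILE x TILE at a time.
   The buffers tA/tB stand in for on-chip shared memory: each element of
   a tile is loaded from "global" memory once and then reused TILE times
   in the inner product loop. */
void matmul_tiled(const float *A, const float *B, float *C) {
    memset(C, 0, N * N * sizeof(float));
    for (int bi = 0; bi < N; bi += TILE)
        for (int bj = 0; bj < N; bj += TILE)
            for (int bk = 0; bk < N; bk += TILE) {
                float tA[TILE][TILE], tB[TILE][TILE];
                /* "Cooperative load" of one tile of A and one of B. */
                for (int i = 0; i < TILE; ++i)
                    for (int k = 0; k < TILE; ++k) {
                        tA[i][k] = A[(bi + i) * N + (bk + k)];
                        tB[i][k] = B[(bk + i) * N + (bj + k)];
                    }
                /* Each loaded element is reused TILE times here. */
                for (int i = 0; i < TILE; ++i)
                    for (int j = 0; j < TILE; ++j)
                        for (int k = 0; k < TILE; ++k)
                            C[(bi + i) * N + (bj + j)] += tA[i][k] * tB[k][j];
            }
}
```

Without a shared-memory abstraction, a Renderscript kernel cannot express this staging step at all; it is entirely at the mercy of the driver and the cache hierarchy.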

5. Renderscript is not available directly from the Android NDK. This is a significant limitation because high-performance applications (such as performance sensitive games) will often be written using the NDK.

Overall I do not think that the current iteration of Renderscript is meant for writing high performance code. Well optimized OpenCL/CUDA/C++ AMP kernels will always significantly outperform Renderscript code simply because Renderscript tries to present a simple abstraction and gives no control over performance. Performance will be entirely dependent upon the Renderscript compiler and driver, and will only come close to an API like OpenCL, CUDA or C++ AMP in very simple cases where the compiler may have the right heuristics built in.

At the same time, Renderscript has very strange programming-model limitations, such as the gather/scatter restriction outlined above. I think Renderscript was designed with only one application in mind: simple image-processing filters. And as @jimrayvaughn pointed out on twitter, many of those can be done efficiently in GLSL using well-understood techniques.

I hope that the SoC vendors and mobile handset vendors are reading this blog, and I hope that GPGPU on Android does not remain limited to Renderscript. Mobile vendors are wasting the power and potential of modern GPUs by not exposing the full power of the hardware to the developers. If you want to unlock the performance of your GPU, Renderscript is not the solution you are looking for.

Disclaimer: I am not a Renderscript expert. Finding documentation on Renderscript has been very tough, and my comments here are based upon what I could glean from the docs. If you find errors in this article, please point them out and I will update the article.

edited: Added NDK issue.

edited: I had deleted this post, now restored.

Intel Xeon Phi and OpenCL

Does the Intel Xeon Phi support OpenCL? It has been hard to get a definitive official answer, but all the signs point to “yes”.

Take this story on HPCWire about Accelereyes adding Xeon Phi support to their well-known ArrayFire library through OpenCL. Then there is an Intel marketing PDF showing OpenCL as one of the languages that run on the Xeon Phi. There was also an interview with Yariv Aridor of Intel, who was described as leading the implementation of OpenCL on Xeon Phi.

Intel already has an x86 OpenCL implementation for their Core processors. So, at least for basic support, getting it working on Xeon Phi requires two things. First, they need to extend the runtime to support the OpenCL APIs (allocating memory on the device, etc.). Second, they need to teach the kernel compiler to emit the new 512-bit vector instructions of the Xeon Phi instead of the AVX instructions used on Core processors. Both are certainly doable and do not require a big investment from Intel, so there is not much reason for them not to support OpenCL. After all, Intel has traditionally been very good at supporting as many languages on their platforms as they can.

I would say, we are definitely going to see OpenCL on Xeon Phi, which is very good news for the OpenCL ecosystem.

Arndale board with Exynos 5250 does NOT do OpenCL right now

Yet another Exynos 5250 device, and still no OpenCL implementation available.
Arndale Board marketing material does mention OpenCL in a few places, but the board does not ship with an OpenCL driver. Source: This forum post. It is frustrating that many vendors in the ARM space keep mentioning OpenCL in their marketing and yet don’t ship working drivers.

Update: In a tweet from @ARMMultimedia, they confirmed that they will make OpenCL BSPs available by the time the board ships. Still waiting for more information about which OS this is for, and whether it will require any NDAs etc. Hopefully we will know soon.