Renderscript from the perspective of an OpenCL/CUDA/C++ AMP programmer

Now that Renderscript Compute supposedly works on GPUs, here are some points to ponder about this strange “compute” API

1. In OpenCL or CUDA, you specify a thread grid to launch a kernel. In Renderscript, there is no concept of a thread grid. Instead, you specify input and/or output arrays and each “thread” processes one output item. It reminds me of the limitations of very old GPGPU technologies like the original Brook language, and is essentially similar to a pixel shader model (each shader thread writes one item). You can’t even query the thread ID (the equivalent of, say, get_global_id() in OpenCL) in Renderscript.
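
For contrast, here is what the explicit model looks like in OpenCL C (a minimal sketch; the kernel and names are mine): the host picks the grid size when it enqueues the kernel, and each work-item queries its own position in the grid.

    // OpenCL C: each work-item asks for its own ID in the launch grid.
    __kernel void scale(__global const float *in, __global float *out) {
        size_t i = get_global_id(0);   // explicit global thread ID
        out[i] = 2.0f * in[i];
    }

The host chooses the grid explicitly, e.g. clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL). Neither half of this picture exists in Renderscript.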

Even gather and scatter are complicated and inefficient. You cannot scatter writes to the output array. However, you can scatter to separately bound arrays, so you have to adopt the following hack (sketched in code after the list):

a) Do not pass in the actual input and output arrays directly. Bind them separately as dynamic pointers.

b) Pass an array containing the output indices as input.

c) For each index in the passed array, do the computation and write the result to that index.
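
Based on my reading of the docs and the stackoverflow thread below, the hack looks roughly like this (an untested sketch; the names and the trivial computation are mine):

    #pragma version(1)
    #pragma rs java_package_name(com.example.scatter)

    // The real input and output arrays are NOT passed to rsForEach;
    // they are bound from Java as dynamic pointers.
    float *g_in;
    float *g_out;

    // The allocation passed to rsForEach carries only output indices.
    void root(const int32_t *v_index) {
        int32_t i = *v_index;        // which output cell to write
        g_out[i] = 2.0f * g_in[i];   // "scatter" through the bound pointer
    }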

This is just INEFFICIENT. There is no need for such indirection on modern hardware. Gather requires a similar hack. Even simple operations like matrix multiplication end up both inefficient and unnecessarily complicated in Renderscript. (See also this stackoverflow thread: http://stackoverflow.com/questions/10576583/passing-array-to-rsforeach-in-renderscript-compute )

2. In Renderscript, the API chooses which device to run your code on. That’s right, you have no idea whether your code is running on the CPU, the GPU, a DSP, etc. The work is supposedly distributed between processors automatically by the Renderscript runtime, according to the driver implemented by the SoC vendor, and currently no guidelines are given about how to ensure code runs on the GPU beyond “simple code should run on the GPU”.
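
Compare with OpenCL, where the host explicitly asks for the class of device it wants (a minimal sketch, error handling omitted):

    #include <CL/cl.h>

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    // Explicitly request a GPU; CL_DEVICE_TYPE_CPU etc. are also available.
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

Renderscript has no equivalent call.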

3. Renderscript’s philosophy is to not expose the actual hardware information and properties to the programmer. OpenCL lets you query a lot of information about the hardware, such as the amount of local memory available. Given that the programmer can’t even decide where the code runs, I guess this is not surprising.
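
For example, in OpenCL the local memory query is a single call (continuing the sketch above):

    cl_ulong local_mem_size;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem_size), &local_mem_size, NULL);

Renderscript exposes nothing comparable.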

4. CUDA introduced on-chip shared memory, and that concept has been adopted by almost every GPGPU API today, including OpenCL and C++ AMP. Renderscript, however, has no concept of on-chip shared memory. Thus, performance will be lower than that of well-optimized OpenCL kernels on many families of GPUs.
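
To see what is being given up, here is the kind of kernel you simply cannot write in Renderscript (a minimal OpenCL C sketch of mine; assumes a work-group size of at most 256): each work-group stages its data in fast on-chip local memory once, and neighboring work-items then read it without touching global memory again.

    __kernel void sum_neighbors(__global const float *in,
                                __global float *out) {
        __local float tile[256];          // on-chip shared memory
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        size_t n   = get_local_size(0);

        tile[lid] = in[gid];              // one global read per item
        barrier(CLK_LOCAL_MEM_FENCE);     // wait until the tile is full

        float left  = (lid > 0)     ? tile[lid - 1] : tile[lid];
        float right = (lid < n - 1) ? tile[lid + 1] : tile[lid];
        out[gid] = left + right;          // neighbors come from fast memory
    }

Tiled matrix multiplication relies on exactly this trick, which is one more reason I do not expect Renderscript matrix code to be competitive.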

5. Renderscript is not available directly from the Android NDK. This is a significant limitation because high-performance applications (such as performance-sensitive games) are often written using the NDK.

Overall, I do not think the current iteration of Renderscript is meant for writing high-performance code. Well-optimized OpenCL/CUDA/C++ AMP kernels will always significantly outperform Renderscript code, simply because Renderscript presents a simple abstraction and gives no control over performance. Performance will be entirely dependent upon the Renderscript compiler and driver, and will only come close to an API like OpenCL, CUDA or C++ AMP in very simple cases where the compiler happens to have the right heuristics built in.

At the same time, Renderscript has some strange programming model limitations, such as the gather/scatter restriction outlined above. I think Renderscript was designed with only one application in mind: simple image processing filters. And as @jimrayvaughn pointed out on Twitter, many of those can be done efficiently in GLSL using well-understood techniques.

I hope that the SoC vendors and mobile handset vendors are reading this blog, and I hope that GPGPU on Android does not remain limited to Renderscript. Mobile vendors are wasting the potential of modern GPUs by not exposing the full capability of the hardware to developers. If you want to unlock the performance of your GPU, Renderscript is not the solution you are looking for.

Disclaimer: I am not a Renderscript expert. Finding documentation on Renderscript has been very tough, and my comments here are based upon what I could glean from the docs. If you find errors in this article, please point them out and I will update the article.

edited: Added NDK issue.

edited: I had deleted this post, now restored.

Intel Xeon Phi and OpenCL

Does the Intel Xeon Phi support OpenCL? It has been hard to get a definitive official answer, but all the signs point to “yes”.

Take this story on HPCWire about Accelereyes adding Xeon Phi support to their well-known ArrayFire library through OpenCL. Then there is Intel’s marketing material (PDF) showing OpenCL as an example of a language that runs on the Xeon Phi. There was also an interview with Yariv Aridor of Intel, who was described as leading the implementation of OpenCL on the Xeon Phi.

Intel already has an x86 OpenCL implementation for their Core processors, so, at least for basic support, getting it working on the Xeon Phi requires two things. First, they need to extend the runtime to support the OpenCL APIs, such as allocating memory on the device. Second, they need to teach the kernel compiler to emit the Xeon Phi’s new 512-bit vector instructions instead of the AVX used on Core processors. Both are certainly doable and neither requires a big investment from Intel, so there is not much reason for them not to support OpenCL. After all, Intel has traditionally been very good at supporting as many languages on their platforms as they can.

I would say we are definitely going to see OpenCL on the Xeon Phi, which is very good news for the OpenCL ecosystem.

Arndale board with Exynos 5250 does NOT do OpenCL right now

Yet another Exynos 5250 device, and still no OpenCL implementation available.
The Arndale Board marketing material does mention OpenCL in a few places, but the board does not ship with the driver. Source: this forum post. It is frustrating that many vendors in the ARM space keep mentioning OpenCL in their marketing and yet don’t ship working drivers.

Update: In a tweet, @ARMMultimedia confirmed that they will make OpenCL BSPs available by the time the board ships. I am still waiting for more information about which OS this is for, whether it will require any NDAs, etc. Hopefully we will know soon.

RaijinCL : Autotuning GEMM routines for OpenCL

Announcing a new project: RaijinCL. It is a numerical library for matrix computations on OpenCL, though currently only one part is available: autotuning GEMM (general matrix multiply) routines. It is a work in progress, and things will improve over time. Do give your feedback.

More information can be found here: http://www.raiijincl.org

An overview of OpenCL SPIR

(Updated: Corrected NVVM description at 0845 EST on 7th Oct)

OpenCL SPIR is a proposed portable binary distribution format for OpenCL programs. The idea is simple. Today, OpenCL kernels are distributed as source strings with the application binary. The source string is then compiled into a native binary on the user’s machine by whatever OpenCL driver is present there. However, this is not always ideal. First, some people would prefer not to distribute their OpenCL kernel sources with their application binaries. Second, compiling on the user’s machine adds overhead. Third, compilers for higher-level languages may want to generate GPU code and would prefer a lower-level, stable target instead of OpenCL C.
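
Concretely, today’s flow looks like this on the user’s machine (a host-side sketch of mine; error handling omitted):

    #include <CL/cl.h>

    // Today: the kernel ships as a source string inside the app
    // and is compiled by the user's OpenCL driver at run time.
    static const char *src =
        "__kernel void scale(__global float *buf) {"
        "    buf[get_global_id(0)] *= 2.0f;"
        "}";

    cl_program build_from_source(cl_context ctx, cl_device_id dev) {
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        return prog;
    }

With SPIR, the application would instead ship a portable binary and hand that to the driver (presumably through clCreateProgramWithBinary or something similar), keeping the kernel source out of the shipped app.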

In contrast to the situation with OpenCL, consider DirectCompute shaders. The developer writes an awesome shader on his/her machine. The shader can be compiled into a lower-level bytecode format (that is not dependent upon the hardware vendor) and then the bytecode is distributed with the application binary. The bytecode is compiled into binary code by the driver on the user’s computer.

OpenCL SPIR is trying to define a similar portable “binary” distribution format. However, instead of designing its own bytecode from scratch, SPIR is based upon the LLVM IR. Most OpenCL implementations already use some proprietary fork of LLVM, so it was the logical starting point. That is not to say the problem is easy. OpenCL SPIR is meant to be portable, whereas LLVM IR was never really meant to be a portable distribution format; it was meant as a compiler IR. There is also some discussion about whether the SPIR specification is robust enough that SPIR-to-SPIR compilers/optimizers can be safely written, and whether SPIR is suitable as a target for compilers for languages other than the OpenCL C kernel language. The initial goal appears to be to ensure that SPIR is a suitable target for OpenCL C implementations first, and not to worry about the other use cases.

It is also important to note what OpenCL SPIR is *not*. OpenCL SPIR is not a piece of software. It is simply a specification for a program representation format that vendors are free to implement any way they choose. There is a lot of inaccurate reporting on OpenCL SPIR because people seem to confuse LLVM IR with LLVM-the-software. There may end up being a reference OpenCL C to SPIR compiler implementation, and then SPIR-to-binary compilers for supported LLVM backends, but that is *NOT* what is being proposed right now. And even if reference implementations are made available, vendors are free to ignore them.

I will repeat once again: OpenCL SPIR is *not* a piece of software. OpenCL SPIR is simply a distribution format based upon LLVM IR. Say you are writing a Python to OpenCL compiler. Today, you would generate OpenCL C. In the future, you may want to generate SPIR instead, though the initial design is not really meant for this use case. Now, integrating SPIR is quite different from a toolchain perspective than integrating LLVM-the-software for CPU code generation, which is what you might use today. Most compilers that use LLVM for CPUs do not generate LLVM bytecode directly. Instead, LLVM-the-software provides an internal in-memory data structure representation of the LLVM IR, with really nice C++ APIs for building these data structures. The OpenCL SPIR specification does *NOT* currently contain this data-structure representation or the associated APIs. You may get these once there is a reference implementation, but right now, there isn’t one.
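
To make “building IR through in-memory APIs” concrete, here is a sketch using LLVM’s C bindings (the C++ APIs mentioned above work the same way; the function built here is just a toy of mine): the compiler constructs the IR as data structures and never writes IR text by hand.

    #include <llvm-c/Core.h>

    /* Build "i32 add(i32 a, i32 b)" entirely through in-memory APIs. */
    LLVMModuleRef build_module(void) {
        LLVMModuleRef mod = LLVMModuleCreateWithName("demo");
        LLVMTypeRef i32 = LLVMInt32Type();
        LLVMTypeRef params[] = { i32, i32 };
        LLVMValueRef fn = LLVMAddFunction(mod, "add",
                              LLVMFunctionType(i32, params, 2, 0));

        LLVMBuilderRef b = LLVMCreateBuilder();
        LLVMPositionBuilderAtEnd(b, LLVMAppendBasicBlock(fn, "entry"));
        LLVMValueRef sum = LLVMBuildAdd(b, LLVMGetParam(fn, 0),
                                        LLVMGetParam(fn, 1), "sum");
        LLVMBuildRet(b, sum);
        LLVMDisposeBuilder(b);
        return mod;   /* serialize with LLVMWriteBitcodeToFile if needed */
    }

It is exactly this kind of tooling that the SPIR proposal does not (yet) give you.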

Comparisons are being made with Nvidia’s NVVM for CUDA. There is a BIG difference: NVVM’s design and implementation goals are quite different from SPIR’s. Nvidia already has a bytecode format for distributing programs, called PTX. NVVM is simply a higher-level layer, and there are two pieces to it: NVVM IR and libNVVM. NVVM IR is also an LLVM-based IR, but it is essentially a clean subset of LLVM rather than a modification. NVVM IR is not really meant for distribution, however; it is meant mostly as a compiler target. The second piece, the libNVVM library, generates PTX from NVVM IR. libNVVM is built using LLVM-the-software, and the intended audience is exclusively third-party compiler writers: it is simply a C++ library based upon LLVM that enables compiler writers (say, of a Python to CUDA compiler) to easily generate PTX.

The nice thing about NVVM IR is that it is essentially a subset of the standard LLVM IR. Compiler writers can either generate NVVM IR bytecode directly, or use the LLVM C++ data-structure APIs to generate and manipulate NVVM IR; I would say the data-structure APIs are a lot easier to use. The difference from SPIR is that the LLVM-based tooling is available *today* (in RC form, but you get the idea). Many compiler writers are already familiar with the LLVM APIs, making integration easy. Targeting libNVVM also makes it simpler to target CUDA than the earlier option of generating PTX directly. For example, with libNVVM you no longer need to worry about low-level details like register allocation, since those can be taken care of by NVVM.
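
For reference, the libNVVM flow looks roughly like this (a sketch based on the libNVVM C API as I understand it; error checking omitted):

    #include <nvvm.h>
    #include <stdlib.h>

    /* Compile an in-memory NVVM IR module down to PTX. */
    char *compile_to_ptx(const char *ir, size_t ir_size) {
        nvvmProgram prog;
        nvvmCreateProgram(&prog);
        nvvmAddModuleToProgram(prog, ir, ir_size, "module");
        nvvmCompileProgram(prog, 0, NULL);          /* default options */

        size_t ptx_size;
        nvvmGetCompiledResultSize(prog, &ptx_size);
        char *ptx = malloc(ptx_size);
        nvvmGetCompiledResult(prog, ptx);           /* PTX for the driver */
        nvvmDestroyProgram(&prog);
        return ptx;
    }

Everything below this interface, from instruction selection down, is NVVM’s problem rather than the compiler writer’s.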

(edit: To clarify, such tooling should become available in the future for OpenCL SPIR but it is not part of the proposal as it stands today.)

Overall, OpenCL SPIR is a really nice proposal, but it is not the solution to all the problems people seem to think it is. Specifically, the compiler tooling side, from the perspective of a third-party compiler writer, is not very clear right now, and I would say Nvidia is ahead on this front in terms of already having an integrated stack almost in place. However, the potential is clearly there, and OpenCL is clearly ahead of APIs other than CUDA and HSA (see below) in this regard. For example, I have simply failed to get any information from Google about the LLVM-based distribution format they use for Renderscript on Android. DirectCompute defines a binary distribution format, but it does not look like it was designed with third-party compiler writers in mind: there is no tooling support for generating it, nor well-defined, easy-to-read documentation, and what documentation exists suggests it is mostly an implementation detail you should not bother about.

I should also mention HSAIL. From the point of view of a third-party compiler writer, HSAIL is the most exciting and well-designed target I have seen so far, based upon the details published to date. I do hope the HSA Foundation puts effort into making the library and tooling side nice as well. I am much more excited about HSAIL than about OpenCL SPIR; OpenCL SPIR may very well end up being a stop-gap fix from the perspective of a third-party compiler writer. However, SPIR is still an important and useful step, both for vendors implementing OpenCL and for application writers who are more comfortable distributing bytecode rather than source strings.