An overview of OpenCL SPIR

(Updated: corrected the NVVM description at 0845 EST on 7th Oct.)

OpenCL SPIR is a proposed portable binary distribution format for OpenCL programs. The idea is simple. Today, OpenCL kernels are distributed as source strings alongside the application binary. The source string is then compiled on the user’s machine into a native binary by whatever OpenCL driver is present there. However, this is not always ideal. First, some people would prefer not to distribute their OpenCL kernel sources with their application binaries. Second, compiling on the user’s machine adds compilation overhead at run time. Third, compilers for higher-level languages that want to generate GPU code would prefer a lower-level, stable target instead of OpenCL C.
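To make the current flow concrete, here is a minimal host-side sketch of that runtime compilation step. The `scale` kernel and the single-platform setup are purely illustrative, and error checking is trimmed:

```cpp
#include <CL/cl.h>
#include <stdio.h>

int main() {
    // Illustrative setup: assume one platform with at least one GPU.
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_int err;
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    // The kernel travels with the application as a plain source string...
    const char *src =
        "__kernel void scale(__global float *buf, float k) {\n"
        "    buf[get_global_id(0)] *= k;\n"
        "}\n";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);

    // ...and only here, on the user's machine, does the vendor's driver
    // compile it down to native code.
    err = clBuildProgram(prog, 1, &device, "", NULL, NULL);
    printf("build result: %d\n", err);

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}
```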

In contrast to the situation with OpenCL, consider DirectCompute shaders. The developer writes an awesome shader on his/her machine. The shader can be compiled into a lower-level bytecode format (one that does not depend on the hardware vendor), and that bytecode is distributed with the application binary. The bytecode is then compiled into native binary code by the driver on the user’s computer.
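For comparison, a sketch of that offline DirectCompute step, using the D3DCompile API (fxc.exe does the same thing from the command line). The entry point and file name here are illustrative:

```cpp
#include <d3dcompiler.h>
#pragma comment(lib, "d3dcompiler.lib")

// Compile HLSL source to vendor-neutral bytecode on the developer's machine.
// The returned blob is what ships with the application; the user's driver
// turns it into native code at run time.
ID3DBlob *compileComputeShader(const char *hlslSource, size_t length) {
    ID3DBlob *bytecode = nullptr;
    ID3DBlob *errors = nullptr;
    HRESULT hr = D3DCompile(hlslSource, length, "kernel.hlsl",
                            nullptr, nullptr,   // no macros, no includes
                            "main", "cs_5_0",   // compute shader 5.0 target
                            0, 0, &bytecode, &errors);
    if (FAILED(hr)) {
        if (errors) errors->Release();
        return nullptr;
    }
    // bytecode->GetBufferPointer()/GetBufferSize() give the bytes to ship.
    return bytecode;
}
```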

OpenCL SPIR is trying to define a similar portable “binary” distribution format. However, instead of designing its own bytecode from scratch, SPIR is based upon the LLVM IR. Most OpenCL implementations already use some proprietary fork of LLVM, so it was the logical starting point. That is not to say the problem is easy. OpenCL SPIR is meant to be portable, whereas LLVM IR was never really meant to be a portable distribution format; it was meant as a compiler IR. There is also some discussion about whether the SPIR specification is robust enough that SPIR-to-SPIR compilers/optimizers can be safely written, and whether SPIR is suitable as a target for compilers for languages other than the OpenCL C kernel language. The initial goal appears to be to ensure that SPIR is a suitable target for OpenCL C implementations first, and not to worry about the other use cases yet.

It is also important to note what OpenCL SPIR is *not*. OpenCL SPIR is not a piece of software. It is simply a specification for a program representation format that vendors are free to implement any way they choose. There is a lot of inaccurate reporting on OpenCL SPIR because people seem to confuse LLVM IR with LLVM-the-software. There may end up being a reference OpenCL C to SPIR compiler implementation, and then SPIR-to-binary compilers for supported LLVM backends, but that is *NOT* what is being proposed right now. And even if reference implementations are made available, vendors are free to ignore them.
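For illustration only, here is how I imagine the host side of a SPIR-style flow could look, assuming (and this is purely my assumption, not part of the proposal) that SPIR modules would be consumed through the existing clCreateProgramWithBinary entry point:

```cpp
#include <CL/cl.h>
#include <vector>

// Hypothetical: build a cl_program from a precompiled module (e.g. a SPIR
// blob shipped with the app). The SPIR proposal defines the format, not
// this loading mechanism.
cl_program loadBinaryProgram(cl_context ctx, cl_device_id dev,
                             const std::vector<unsigned char> &blob) {
    const unsigned char *bytes = blob.data();
    size_t size = blob.size();
    cl_int binStatus, err;
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size, &bytes,
                                                &binStatus, &err);
    // The driver still does a (cheaper) back-end compile to native code.
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    return prog;
}
```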

I will repeat once again: OpenCL SPIR is *not* a piece of software. OpenCL SPIR is simply a distribution format, based upon LLVM IR. Suppose you are writing a Python-to-OpenCL compiler. Today, you would generate OpenCL C. In the future, you may want to generate SPIR instead, though the initial design is not really meant for this use case. Integrating SPIR is, from a toolchain perspective, quite different from integrating LLVM-the-software for CPU code generation the way you might today. Most compilers that use LLVM for CPUs do not generate LLVM bytecode directly. Instead, LLVM-the-software builds an internal in-memory data structure representation of the LLVM IR, with really nice C++ APIs for constructing those data structures. The OpenCL SPIR specification does *NOT* currently contain this data-structure representation or the associated APIs. You may get these once there is a reference implementation, but right now there isn’t one.
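To show what I mean by the in-memory route, here is a minimal sketch using LLVM’s C++ IRBuilder APIs to construct a trivial function, rather than printing bytecode by hand. Header paths and `getArg` follow current LLVM releases; older versions differ slightly:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

int main() {
    LLVMContext ctx;
    Module mod("example", ctx);
    IRBuilder<> b(ctx);

    // Build IR for: int add(int x, int y) { return x + y; }
    FunctionType *fty = FunctionType::get(
        b.getInt32Ty(), {b.getInt32Ty(), b.getInt32Ty()}, /*isVarArg=*/false);
    Function *f = Function::Create(fty, Function::ExternalLinkage, "add", &mod);
    BasicBlock *entry = BasicBlock::Create(ctx, "entry", f);
    b.SetInsertPoint(entry);
    b.CreateRet(b.CreateAdd(f->getArg(0), f->getArg(1)));

    mod.print(outs(), nullptr);  // emit the textual IR
    return 0;
}
```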

Comparisons are being made with Nvidia’s NVVM for CUDA. There is a BIG difference: NVVM’s design and implementation goals are quite different from SPIR’s. Nvidia already has a bytecode format for distributing programs, called PTX. NVVM is simply a higher-level layer, and there are two pieces to it: NVVM IR and libNVVM. NVVM IR is also an LLVM-based IR, but it is essentially a clean subset of LLVM rather than a modification. NVVM IR is not really meant for distribution, however; it is meant mostly as a compiler target. The second piece is the libNVVM library, which generates PTX from NVVM IR. libNVVM is built using LLVM-the-software, and the intended audience is exclusively third-party compiler writers. libNVVM is simply a C++ library based upon LLVM that enables compiler writers (say, of a Python-to-CUDA compiler) to easily generate PTX.

The nice thing about NVVM IR is that it is essentially a subset of the standard LLVM IR. Compiler writers can either generate NVVM IR bytecode directly, or use the LLVM C++ data-structure APIs to build and manipulate NVVM IR. I would say the data-structure APIs are a lot easier to use. The difference from SPIR is that the LLVM-based tooling is available *today* (in RC form, but you get the idea). Many compiler writers are already familiar with the LLVM APIs, which makes integration easy. Targeting NVVM IR through libNVVM is simpler than the earlier option of generating PTX yourself. For example, with libNVVM you no longer need to worry about low-level details like register allocation, since libNVVM takes care of that. A sketch of the flow follows.
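A sketch using the libNVVM C API as shipped in recent CUDA toolkits (the RC-era interface may differ in detail; `irText` stands in for NVVM IR your compiler produced, directly or via the LLVM C++ APIs):

```cpp
#include <nvvm.h>
#include <stdlib.h>

// Feed NVVM IR in, get PTX out. Error handling trimmed; real code should
// check each nvvmResult and fetch the log via nvvmGetProgramLog on failure.
char *compileToPtx(const char *irText, size_t irSize) {
    nvvmProgram prog;
    nvvmCreateProgram(&prog);
    nvvmAddModuleToProgram(prog, irText, irSize, "module");
    if (nvvmCompileProgram(prog, 0, NULL) != NVVM_SUCCESS) {
        nvvmDestroyProgram(&prog);
        return NULL;
    }
    size_t ptxSize;
    nvvmGetCompiledResultSize(prog, &ptxSize);
    char *ptx = (char *)malloc(ptxSize);
    nvvmGetCompiledResult(prog, ptx);
    nvvmDestroyProgram(&prog);
    return ptx;  // hand this to the CUDA driver API (e.g. cuModuleLoadData)
}
```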

(edit: To clarify, such tooling should become available for OpenCL SPIR in the future, but it is not part of the proposal as it stands today.)

Overall, OpenCL SPIR is a really nice proposal, but it is not the solution to all problems that people seem to think it is. Specifically, the compiler-tooling story from the perspective of a third-party compiler writer is not very clear right now, and I would say Nvidia is ahead on this front, with an integrated stack almost in place. However, the potential is clearly there, and OpenCL is clearly ahead of other APIs (CUDA and HSA excepted; see below) in this regard. For example, I have simply failed to get any information from Google about the LLVM-based distribution format they use for Renderscript on Android. DirectCompute defines a binary distribution format, but it does not look like it was designed with third-party compiler writers in mind: there is no tooling support for generating it, nor well-defined, easy-to-read documentation, and what documentation exists suggests it is mostly an implementation detail you should not bother about.

I should also mention HSAIL. From the point of view of a third-party compiler writer, HSAIL is the most exciting and well-designed target I have come across so far, based upon the details published. I do hope the HSA Foundation puts effort into making the library and tooling side nice as well. I am much more excited about HSAIL than about OpenCL SPIR; SPIR may very well end up being a stop-gap fix from the perspective of a third-party compiler writer. However, SPIR is still an important and useful step, both for vendors implementing OpenCL and for application writers who are more comfortable distributing bytecode rather than source strings.

Some informed speculation about ARM T604

ARM T604 is an upcoming mobile GPU from ARM. I remember reading slides from an ARM presentation, though I cannot find the link now; perhaps they were taken down. Anyway, here is what we know:

1. Quad-core

2. Up to 68 GFlops of compute performance. I assume this is for fp32. The Exynos 5 Dual whitepaper claims 72 GFlops.

3. Barrel-threaded (i.e., multiple simultaneous hardware threads), like AMD or Nvidia

4. No SIMT! Rather, a SIMD architecture. I take this to mean the vector lanes are not predicated, so be prepared to write explicitly SIMD code.

5. Now 68 GFlops / 4 cores = 17 GFlops/core. Assuming a 500 MHz clock speed, that gives us 34 flops/cycle per core (the arithmetic is written out after this list).

We do know that it has 2 ALUs/core, so each ALU does 17 flops/cycle. Each ALU has one scalar unit and one (or more?) vector units. So perhaps 1 scalar and 1 vec8 unit with MAD? Or perhaps 1 scalar and 2 vec4 units with MAD?

(If we go by the Exynos 5 Dual whitepaper’s 72 GFlops, perhaps they have modified the scalar unit to also do a MAD instead of just one flop/cycle: (2 + 16) flops/ALU × 2 ALUs × 4 cores × 0.5 GHz = 72 GFlops.)

6. Full IEEE precision for fp32 and fp64. Very nice, ARM! The full fp64 support makes me excited about this architecture for my uses. ARM has not published fp64 speeds, but I think it will be either 1/4th or 1/8th of the fp32 rate.

7. OpenCL 1.1 Full Profile support. I hope that EVERY device that ships with this GPU comes with working OpenCL drivers, and that an open SDK is provided to everyone.
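For reference, here is the arithmetic from point 5 written out; the vec8-with-MAD split at the end is my guess, not anything ARM has published:

```latex
\frac{68\ \text{GFlops}}{4\ \text{cores}} = 17\ \text{GFlops/core},
\qquad
\frac{17\ \text{GFlops/core}}{0.5\ \text{GHz}} = 34\ \text{flops/cycle/core}
```

```latex
\frac{34\ \text{flops/cycle}}{2\ \text{ALUs/core}} = 17\ \text{flops/cycle/ALU}
= \underbrace{1}_{\text{scalar}} + \underbrace{8 \times 2}_{\text{vec8 MAD}}
```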

C++ AMP: First impressions

Update: Published a new blog post. Check here

Coming from an OpenCL and CUDA background, here are some thoughts on C++ AMP as experienced on Visual Studio 2012 RC, on a system with Nvidia and Intel GPUs:

The good:

1. It is extremely easy to get started. Really, the “hello world” program in C++ AMP is a lot shorter than in OpenCL, with lots of boilerplate handled behind the scenes for you, though you do have control if you want it (see the sketch after this list). I think this can be a good API for teaching GPGPU programming to beginners.

2. Integration into Visual Studio is good to have. It is C++, well integrated into your source, and thus the usual Visual Studio goodness like IntelliSense is fully available.

3. Not tied to a single hardware vendor. The specification is also open, so it can potentially be ported to other OS platforms.

4. Real world guarantees:

a) Code is compiled (mostly) statically, and you get a single binary that you can distribute to client machines. OpenCL does not have a defined binary format, and even if you distribute your OpenCL source strings, they are not 100% guaranteed to compile and run properly on client machines.

b) Timeout detection and recovery: with OpenCL, if you distribute a kernel that can potentially run for a long time on client machines, there is a danger that it will trigger timeout detection and recovery (TDR) and basically terminate your application, or lead to system instability, depending on the OS and hardware vendor. And there is jack that your application can do about it.

With C++ AMP, that problem goes away. If TDR is invoked, you get an exception that you can catch and respond to in your application. Thus, if you make sure to catch such exceptions in your code (as in the sketch below), you get better guarantees of application stability on client machines.
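To make points 1 and 4b concrete, a minimal sketch: the C++ AMP “hello world” (a vector scale) with the TDR exception handled. Sizes and the fallback behaviour are illustrative:

```cpp
#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

int main() {
    std::vector<int> v(1024, 1);
    array_view<int, 1> av(static_cast<int>(v.size()), v);
    try {
        // Runs on the default accelerator; no explicit context, queue,
        // or kernel-string boilerplate as in OpenCL.
        parallel_for_each(av.extent, [=](index<1> i) restrict(amp) {
            av[i] *= 2;
        });
        av.synchronize();  // copy results back into v
    } catch (const accelerator_view_removed &e) {
        // TDR surfaces as a catchable exception rather than killing the app.
        std::cerr << "GPU reset (TDR): " << e.what() << "\n";
        return 1;  // or fall back to a CPU path, retry with smaller work, etc.
    }
    std::cout << v[0] << "\n";  // prints 2
    return 0;
}
```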

The bad:

1. Profiling tools are not good compared to what CUDA and OpenCL offer. For example, how do I profile coalesced loads? I don’t see such an option in Visual Studio. If I am wrong, please let me know. Given that the whole point of C++ AMP is gaining performance, one would expect better profiling tools.

EDIT: See updated post

2. I cannot peek under the hood to see the generated code, which, along with the lack of profiling tools, makes performance tuning harder.

3. Compile times can be LONG. It seems to me that a non-trivial kernel can take something like 5-20 seconds to compile (even on a fast third-generation Intel Core i7). If you have 50 kernels in your code, that time adds up on every build. OpenCL and CUDA kernels typically compile much faster.