OpenCL – Searching for divine code

Things I am going to play with

Some things I want to experiment with as a fun exercise:

F# : F# is an awesome language that I started experimenting with a few months back. While I know the basic syntax of many parts of the language now, I am nowhere near proficient. Even in my brief time with the language, I am enjoying it a lot. It is more productive than even Python and light years ahead of C++. Looking forward to playing with some of the interesting features such as async workflows and type providers. Two books that I have on my radar are “F# deep dives” and “Purely functional data-structures” by Okasaki.

C# and .net ecosystem: Playing with F# actually has made me interested in playing with more .net technologies. From the productivity standpoint, C# looks a lot nicer than Java or C++ and does have some interesting technologies such as async construct and LINQ. On the performance side, CLR looks like a good JIT and a lot of innovation and pragmatic decisions seem to have been taken in the .net ecosystem. For example, inclusion of value types, SIMD types introduced recently with RyuJIT and libraries such as TPL should make it possible to write reasonably high-performance code despite CLR being a managed runtime. Recent open-sourcing of the .net core is also an interesting move.

ZeroMQ: I don’t have much experience with message-queue based systems and ZeroMQ looks like a good place to start. Have heard a lot of good things about it.

C++11: I have read up on many of the features in C++11, and have a basic understanding, but have not used them in non-trivial ways so I am not confident about them yet. Overall I am not at all liking where C++ is going. However, as a professional programmer who works a lot with C++ I feel I should keep myself updated because I expect to see more C++11 going forward.

OpenCL 2.0: I have read the specs and am familiar with many of the features theoretically but want to spend some time with features such as device-side enqueue and SVM to see the types of algorithms that are now possible on modern hardware.

Direct3d 11 and 12: Well quite self-evident 🙂 Going with the .net theme might try out SharpDX perhaps instead of going the native route.

Metal compute notes

I have been reading some Metal API documents. Some brief notes about Metal compute from the perspective of pure compute and not graphics:

Kernels: If you know previous GPU compute APIs such as OpenCL or CUDA etc. you will be at home. You have work-items organized in work-groups. A work-group has access to upto 16kB of local memory. Items within a work-group can synchronize but different work-groups cannot synchronize. You do have atomic instructions to global and local l memory. You don’t have function pointers and while the documentation doesn’t mention it, likely no recursion either. There is no dynamic parallelism either. You also cannot do dynamic memory allocation inside kernels. This is all very similar to OpenCL 1.x.

Memory model: You create buffers and kernels read/write buffers. Interestingly, you can create buffers from pre-allocated memory (i.e. from a CPU pointer) with zero copy provided the pointer is aligned to page boundary. This makes sense because obviously on the A7, both CPU and GPU have access to same physical pool of memory.

CPU and GPU cannot simultaneously write to buffer I think. CPU only guaranteed to see updates to buffer when the GPU command completes execution and GPU only guaranteed to see CPU updates if they occur before the GPU command is “committed”. So we are far from HSA-type functionality.

Currently I am unclear about how pointers work in the API. For example, can you store a pointer value in a kernel, and then reload it in a different kernel? You can do this in CUDA and OpenCL 2.0 “coarse grained” SVM for example, but not really in OpenCL 1.x. I am thinking/speculating they don’t support such general pointer usage.

Command queues: This is the point where I am not at all clear about things but I will describe how I think things work. You can have multiple command queues similar to multiple streams in CUDA or multiple command queues in OpenCL. Command queues contain a sequence of “command buffers” where each command buffer can actually contain multiple commands. To reduce driver overhead, you can “encode” or record commands in two different command buffers in parallel.

Command queues can be thought of as in-order but superscalar. Command buffers are ordered in the order they were encoded. However, API keeps track of resource dependencies between command buffers and if two command buffers in sequence can be issued in parallel, they may be issued in parallel. I am speculating that the “superscalar” part applies to purely compute driven scenarios, and will likely apply more to mixed scenarios where a graphics task and a compute task may be issued in parallel.

GPU-only: Currently only works on GPUs, and not say the CPU or the DSP.

Images/textures: Haven’t read this yet. TODO.

Overall, Metal is similar in functionality to OpenCL 1.x. and it is more about having niceties such as C++11 support in the kernel language (the static subset) so you can use templates, overloading, some static usage of classes etc. Graphics programmers will also appreciate the tight integration with the graphics pipeline. To conclude, if you have used OpenCL or CUDA, then your skills will transfer over easily to Metal. From a theory perspective it is not a revolutionary API, and does not bring any new execution or memory model niceties. It is essentially Apple’s view on the same concepts and focused on tackling of practical issues.

OpenGL, OpenGL ES compute shaders and OpenCL spec minimums

Well, I really got this one wrong. Previously I had (mistakenly) claimed that OpenGL compute shader, OpenGL ES compute shader etc. don’t really have specified minimums for some things but I got that completely wrong. I guess I need to be more careful while reading these specs as the required minimums are not necessarily located at the same place where they are being explained. Some of the OpenCL claims still stand though OpenCL’s relaxed specs are a bit more understandable given that it has to run on more hardware than others.

Here are the minimums:

OpenGL 4.3: 1024 work-items in a group, and 32kB of local memory
OpenGL ES 3.1: 128 work-items in a group, and 16kB of local memory (updated from 32kB, Khronos “fixed” the spec)
OpenCL 1.2: 1 work-item in a group, 32kB of local memory
OpenCL 1.2 embedded: 1 work-item in a group, 1kB of local memory
DirectCompute 11 (for reference): 1024 work-items in a group, 32kB of local memory

Thanks to Graham Sellers and Daniel Koch on twitter for pointing out the error. I guess I got schooled today.

Khronos standards and “caps”

UPDATE: This post is just plain wrong. See correction HERE. Thanks to various people on twitter, especially Graham Sellers and Daniel Koch for pointing this out.

Just venting some frustration here. One of the annoying things in Khronos standards is the lack of required minimum capabilities, which makes writing portable code that much harder. The minimum guarantees are very lax. Just as an example, take work-group sizes in both OpenCL and OpenGL compute shaders. In both of these, you have to query to find out the maximum work group size supported which may turn out to be just 1.

Similarly, in OpenGL (and ES) compute shaders, there is no minimum guaranteed amount of local memory per workgroup. You have to query to ask how much shared memory per workgroup is supported, and the implementation can just say zero because there is no minimum mandated in the specification.

edit: Contrast this with DirectCompute where you have mandated specifications for both the amount of local memory and the work-group sizes which makes life so much simpler.

Sony ships OpenCL on Xperia handsets

Sony published a blog post about OpenCL being available on Xperia devices such as Xperia Z, ZL, ZR, Z Ultra, Z1 and Tablet Z. These are Snapdragon S4 Pro and Snapdragon 800 devices and according to Sony come with OpenCL drivers for the Adreno GPUs. They also indicate that they intend to continue to support OpenCL. Very good news.

OpenCL on Altera FPGAs

Wrote an article about Altera’s SDK for OpenCL for their Stratix FPGAs. You can find it here. If you find any correctness issues, or if you have any comments, let me know.