Metal compute notes

I have been reading the Metal API documentation. Here are some brief notes on Metal from the perspective of pure compute, not graphics:

Kernels: If you know previous GPU compute APIs such as OpenCL or CUDA, you will be at home. You have work-items organized into work-groups. A work-group has access to up to 16 kB of local memory. Items within a work-group can synchronize, but different work-groups cannot. You do have atomic operations on global and local memory. You don't have function pointers, and while the documentation doesn't mention it, likely no recursion either. There is no dynamic parallelism, and you also cannot do dynamic memory allocation inside kernels. This is all very similar to OpenCL 1.x.
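As a concrete sketch, here is what a kernel with local memory, a barrier, and a global atomic looks like in the Metal shading language (the kernel name, the 256-wide threadgroup, and the power-of-two size assumption are my own choices, not from the docs):

```cpp
#include <metal_stdlib>
using namespace metal;

// Each threadgroup reduces a 256-element tile in threadgroup (local) memory,
// then one item atomically adds the partial sum to global memory. Assumes
// the grid size is a multiple of 256 and the threadgroup is 1-D.
kernel void partial_sum(device const uint  *in  [[buffer(0)]],
                        device atomic_uint *out [[buffer(1)]],
                        uint tid [[thread_position_in_threadgroup]],
                        uint gid [[thread_position_in_grid]],
                        uint tsz [[threads_per_threadgroup]])
{
    threadgroup uint tile[256];      // lives in the ~16 kB local store
    tile[tid] = in[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);  // intra-group sync only

    // Tree reduction within the work-group.
    for (uint stride = tsz / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    // Atomic add to global memory; there is no cross-group synchronization.
    if (tid == 0)
        atomic_fetch_add_explicit(out, tile[0], memory_order_relaxed);
}
```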

Memory model: You create buffers, and kernels read/write buffers. Interestingly, you can create buffers from pre-allocated memory (i.e. from a CPU pointer) with zero copy, provided the pointer is aligned to a page boundary. This makes sense because on the A7, the CPU and GPU obviously have access to the same physical pool of memory.
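The Metal call for this (newBufferWithBytesNoCopy:length:options:deallocator:) is Objective-C, so here is just a sketch of the host-side alignment bookkeeping it expects, with names of my own choosing:

```cpp
#include <cstdlib>
#include <unistd.h>

// Allocate page-aligned storage, with the length rounded up to a page
// multiple, suitable for wrapping in a zero-copy Metal buffer.
void *alloc_for_metal(size_t bytes, size_t *out_len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = (bytes + page - 1) & ~(page - 1);  // round up to page size
    void  *ptr  = nullptr;
    if (posix_memalign(&ptr, page, len) != 0)
        return nullptr;
    *out_len = len;  // hand ptr and len to newBufferWithBytesNoCopy
    return ptr;
}
```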

The CPU and GPU cannot, I think, write to a buffer simultaneously. The CPU is only guaranteed to see updates to a buffer when the GPU command completes execution, and the GPU is only guaranteed to see CPU updates that occur before the GPU command is "committed". So we are far from HSA-type functionality.
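A sketch of that visibility rule, written against Apple's metal-cpp C++ wrapper (my assumption for illustration; the actual API is Objective-C):

```cpp
#include <Metal/Metal.hpp>  // metal-cpp

// CPU writes must land before commit(); CPU reads of GPU output are only
// safe after the command buffer has completed.
float read_result(MTL::CommandBuffer *cmdBuf, MTL::Buffer *shared)
{
    float *p = static_cast<float *>(shared->contents());
    p[0] = 1.0f;                   // visible to the GPU: happens before commit
    cmdBuf->commit();
    // p[0] = 2.0f;                // NOT guaranteed visible to the GPU now
    cmdBuf->waitUntilCompleted();  // GPU writes now guaranteed visible to CPU
    return p[0];
}
```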

Currently I am unclear about how pointers work in the API. For example, can you store a pointer value from one kernel and then reload it in a different kernel? You can do this in CUDA and in OpenCL 2.0 "coarse-grained" SVM, for example, but not really in OpenCL 1.x. My guess is that Metal does not support such general pointer usage.

Command queues: This is the point where I am not at all clear about things, but I will describe how I think they work. You can have multiple command queues, similar to multiple streams in CUDA or multiple command queues in OpenCL. A command queue contains a sequence of "command buffers", where each command buffer can actually contain multiple commands. To reduce driver overhead, you can "encode" (record) commands into two different command buffers in parallel.

Command queues can be thought of as in-order but superscalar. Command buffers execute in the order they are committed to the queue. However, the API keeps track of resource dependencies between command buffers, and if two command buffers in sequence are independent, they may be issued in parallel. I am speculating that the "superscalar" part will matter less in purely compute-driven scenarios and more in mixed scenarios, where a graphics task and a compute task may be issued in parallel.
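Here is how I picture that structure, again sketched with the metal-cpp wrapper (an assumption, as above; pipeline and resource setup omitted):

```cpp
#include <Metal/Metal.hpp>  // metal-cpp

// One queue, two command buffers encoded independently. They retire in
// commit order, but the driver may overlap them if it sees no shared
// resource (here the two dispatches touch different buffers).
void enqueue_two(MTL::Device *dev, MTL::ComputePipelineState *pso,
                 MTL::Buffer *a, MTL::Buffer *b)
{
    MTL::CommandQueue *queue = dev->newCommandQueue();

    MTL::CommandBuffer *cb0 = queue->commandBuffer();
    MTL::ComputeCommandEncoder *e0 = cb0->computeCommandEncoder();
    e0->setComputePipelineState(pso);
    e0->setBuffer(a, 0, 0);
    e0->dispatchThreadgroups(MTL::Size(64, 1, 1), MTL::Size(256, 1, 1));
    e0->endEncoding();

    MTL::CommandBuffer *cb1 = queue->commandBuffer();  // independent buffer
    MTL::ComputeCommandEncoder *e1 = cb1->computeCommandEncoder();
    e1->setComputePipelineState(pso);
    e1->setBuffer(b, 0, 0);                            // different resource
    e1->dispatchThreadgroups(MTL::Size(64, 1, 1), MTL::Size(256, 1, 1));
    e1->endEncoding();

    cb0->commit();  // ordered relative to cb1 ...
    cb1->commit();  // ... but free to run concurrently, since a != b
}
```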

GPU-only: Metal currently targets only the GPU, not, say, the CPU or the DSP.

Images/textures: Haven’t read this yet. TODO.

Overall, Metal is similar in functionality to OpenCL 1.x; it is more about niceties such as C++11 support in the kernel language (a static subset), so you can use templates, overloading, and some static usage of classes. Graphics programmers will also appreciate the tight integration with the graphics pipeline. To conclude, if you have used OpenCL or CUDA, your skills will transfer over easily to Metal. From a theory perspective it is not a revolutionary API and does not bring any new execution or memory model niceties. It is essentially Apple's take on the same concepts, focused on tackling practical issues.

C++ AMP: First impressions

Update: Published a new blog post. Check here

Coming from an OpenCL and CUDA background, here are some thoughts on C++ AMP as experienced on Visual Studio 2012 RC, on a system with Nvidia and Intel GPUs:

The good:

1. It is extremely easy to get started. The "hello world" program in C++ AMP is a lot shorter than its OpenCL equivalent, with much of the boilerplate generated behind the scenes for you, though you still have control when you want it (see the sketch below). I think this can be a good API for teaching GPGPU programming to beginners.
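For instance, a complete element-wise increment on the GPU needs little more than this (a minimal sketch; no device enumeration, contexts, or kernel-source strings):

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

// Whole-program C++ AMP "hello world": add 1 to every element on the GPU.
int main()
{
    std::vector<int> v(1024, 41);
    array_view<int, 1> av(static_cast<int>(v.size()), v);

    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp) {
        av[idx] += 1;             // runs on the default accelerator
    });

    av.synchronize();             // copy results back into v
    return v[0] == 42 ? 0 : 1;
}
```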

2. Integration into Visual Studio is good to have. It is C++, well integrated into your source, and thus the usual Visual Studio goodness like IntelliSense is fully available.

3. Not tied to a single hardware vendor. The specification is also open, so it can potentially be ported to other OS platforms.

4. Real world guarantees:

a) Code is compiled (mostly) statically, and you get a single binary that you can distribute to client machines. OpenCL does not have a defined binary format, and even if you distribute your OpenCL source strings, they are not 100% guaranteed to compile and run properly on client machines.

b) Timeout detection and recovery: With OpenCL, if you distribute a kernel that can potentially run for a long time on client machines, there is a danger that it will invoke timeout detection and recovery (TDR) and basically terminate your application, or lead to system instability, depending on the OS and hardware vendor. And there is jack that your application can do about it.

With C++ AMP, that problem goes away. If TDR is invoked, you get an exception that you can catch and respond to in your application. Thus, if you make sure to catch such exceptions in your code, you get better guarantees of application stability on client machines.
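Concretely, the runtime surfaces TDR as concurrency::accelerator_view_removed, a documented exception type; the recovery policy below is just my sketch:

```cpp
#include <amp.h>
#include <iostream>
using namespace concurrency;

// Run a potentially long kernel and survive a TDR instead of being killed.
void run_long_kernel(array_view<float, 1> data)
{
    try {
        parallel_for_each(data.extent, [=](index<1> idx) restrict(amp) {
            data[idx] += 1.0f;    // stand-in for long-running work
        });
        data.synchronize();       // faults surface here or at next access
    }
    catch (const accelerator_view_removed &ex) {
        std::cerr << "TDR: " << ex.what() << '\n';
        // Recover: recreate the accelerator_view and retry, or fall back.
    }
}
```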

The bad:

1. Profiling tools are not good compared to those for CUDA and OpenCL. For example, how do I profile coalesced loads? I don't see such an option in Visual Studio. If I am wrong, please let me know. Given that the whole point of C++ AMP is gaining performance, there should be better profiling tools.

EDIT: See updated post

2. I cannot peek under the hood to see the generated code, which, along with the lack of profiling tools, makes performance tuning harder.

3. Compile times can be LONG. Non-trivial kernels seem to take something like 5-20 seconds each to compile, even on a fast third-generation Intel Core i7. If you have 50 kernels in your code, that time adds up on every build. OpenCL and CUDA kernels typically compile much faster.