C++ AMP: First impressions

Update: Published a new blog post. Check here

Coming from an OpenCL and CUDA background, here are some thoughts on C++ AMP as experienced on Visual Studio 2012 RC , on a system with Nvidia and Intel GPUs:

The good:

1. It is extremely easy to get started. Really the “hello world” program in C++ AMP is a lot shorter than OpenCL with lots of boilerplate autogenerated behind the scenes for you, though you do have control if you do want control.  I think this can be a good API for teaching GPGPU programming to beginners.

2. Integration into Visual Studio is good to have. It is C++, and well integrated into your source, and thus the usual Visual Studio goodness like Intellisense are fully available.

3. Not tied to a single hardware vendor. The specification is also open, so can potentially be ported to other OS platforms.

4. Real world guarantees:

a) Code is compiled (mostly) statically and you get a single binary that you can distribute to client machines. OpenCL does not have a defined binary format, and even if you distribute your OpenCL source strings, it is not 100% guaranteed to compile and run properly on client machines.

b) Timeout detection and recovery: With OpenCL, if you want to distribute a kernel that can potentially run for a long time on client machines, there is a danger that it can invoke timeout detection and recovery and basically terminate your application or lead to system instability depending on OS and hardware vendor. And there is jack that your application can do about it.

With C++ AMP, that problem is not there. If TDR is invoked, you get an exception that you can catch and respond to in your application.  Thus, if you make sure to catch such exceptions in your code, you get better guarantees of application stability on client machines.

The bad:

1. Profiling tools are not good compared to CUDA and OpenCL libs. For example, how do I profile coalesced loads? I don’t see such an option in Visual Studio. If I am wrong, please let me know. Given that the whole point of C++ AMP is gaining performance, one should have better profiling tools.

EDIT: See updated post

2. I cannot peek under the hood to see the code generated, and along with lack of profiling tools, makes performance tuning harder.

3. Compile times can be LONG. It seems to me that a kernel can take something like 5-20 seconds to compile for non-trivial kernels (even on a fast third-gen Intel Core i7 processor). If you have 50 kernels in your code, time can add up for each build. OpenCL and CUDA kernels typically compile much faster.