C++ AMP: First impressions

Update: Published a new blog post. Check here

Coming from an OpenCL and CUDA background, here are some thoughts on C++ AMP as experienced on Visual Studio 2012 RC, on a system with Nvidia and Intel GPUs:

The good:

1. It is extremely easy to get started. The “hello world” program in C++ AMP is a lot shorter than its OpenCL equivalent, because much of the boilerplate is generated behind the scenes for you, though you can still take control where you want it. I think this can be a good API for teaching GPGPU programming to beginners.

2. Integration into Visual Studio is good to have. It is C++, well integrated into your source, and thus the usual Visual Studio goodness like IntelliSense is fully available.

3. Not tied to a single hardware vendor. The specification is also open, so it can potentially be ported to other OS platforms.

4. Real world guarantees:

a) Code is compiled (mostly) statically and you get a single binary that you can distribute to client machines. OpenCL does not have a defined binary format, and even if you distribute your OpenCL source strings, it is not 100% guaranteed to compile and run properly on client machines.

b) Timeout detection and recovery (TDR): With OpenCL, if you want to distribute a kernel that can potentially run for a long time on client machines, there is a danger that it will trigger timeout detection and recovery and terminate your application or lead to system instability, depending on the OS and hardware vendor. And there is nothing your application can do about it.

With C++ AMP, that problem is not there. If TDR is invoked, you get an exception that you can catch and respond to in your application.  Thus, if you make sure to catch such exceptions in your code, you get better guarantees of application stability on client machines.
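The catch-and-respond idea in point 4(b) can be sketched in plain C++. This is only an illustration, not C++ AMP code: `device_removed` is a hypothetical stand-in for the exception the runtime raises on TDR (in Microsoft's implementation I believe this is `concurrency::accelerator_view_removed`, but treat that as an assumption), and `run_with_fallback`, `gpu_path`, and `cpu_path` are names invented for this example.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the exception a GPU runtime raises when the
// device is reset by TDR. Defined here so the sketch builds without <amp.h>.
struct device_removed : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class run_result { gpu_ok, cpu_fallback };

// Runs gpu_path. If it throws device_removed (simulating a TDR reset),
// runs cpu_path instead and reports that the fallback was taken.
// Any other exception type is deliberately left to propagate.
run_result run_with_fallback(const std::function<void()>& gpu_path,
                             const std::function<void()>& cpu_path) {
    try {
        gpu_path();
        return run_result::gpu_ok;
    } catch (const device_removed&) {
        cpu_path();
        return run_result::cpu_fallback;
    }
}
```

In a real C++ AMP program, `gpu_path` would wrap both the kernel launch and the point where results are copied back to the host, since, as far as I understand, the exception may only surface at that later synchronization. The fallback here is a CPU path, but retrying with a smaller workload would be another reasonable response.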

The bad:

1. Profiling tools are not good compared to those for CUDA and OpenCL. For example, how do I profile coalesced loads? I don’t see such an option in Visual Studio. If I am wrong, please let me know. Given that the whole point of C++ AMP is gaining performance, it should have better profiling tools.

EDIT: See updated post

2. I cannot peek under the hood to see the generated code, which, along with the lack of profiling tools, makes performance tuning harder.

3. Compile times can be LONG. It seems to me that a non-trivial kernel can take something like 5-20 seconds to compile (even on a fast third-gen Intel Core i7 processor). If you have 50 kernels in your code, that time adds up on every build. OpenCL and CUDA kernels typically compile much faster.

7 thoughts on “C++ AMP: First impressions”

  1. Just to add:
    OpenCL works on Altera and Xilinx FPGAs; C++ AMP doesn’t, because they don’t implement DX11. ARM also supports OpenCL and will support CUDA in the future, but shows no signs of supporting C++ AMP, because there is no DX11 on ARM chips.

    1. It is true that many more vendors support OpenCL today, including FPGA vendors and vendors of GPUs integrated on ARM-based SoCs.

      However, please note that C++ AMP as a specification is not tied to DX11. Microsoft’s implementation is based upon DX11, but that does not mean that the language spec itself is tied to DX11.

      It is an open specification that can be implemented by a third-party compiler (without paying any royalties to Microsoft) on top of another API. For example, it is technically entirely possible for a third-party C++ compiler to generate OpenCL from C++ AMP code.

      One final clarification: Some GPU IP providers (such as Imagination Technologies) do have DirectX 11 capable GPUs that can be integrated onto an ARM-based SoC if desired by the SoC vendor.

      1. You’re right, the ARM Midgard and the Imagination Rogue architectures will support DirectX 11. And I think the first implementations are up for licensing.

    2. ARM will never support CUDA. NVIDIA prohibits reverse engineering, and under that condition nobody will license it. ARM will take AMD’s route with HSA and open standards.

  2. Hey Rahul,

    Regarding your second “bad” point, have you tried using the Visual Studio debugger to peek at the HLSL bytecode? Also for #1 were you not able to use the vendor-specific DirectX tooling for performance analysis of C++ AMP?

    As for #3 on driver bugs, can you post in our forum the code that makes the driver yield incorrect results? The IHVs have been fixing C++ AMP bugs and I want to check that we have caught this one, or if not, to report it to them.

    For #4, if you can share repros with us of long compile times in release mode, we’d love to investigate. Debug mode has some known long compile times, due to the generous symbolic information we generate, which enables the awesome debugger which I was hoping to see on your list of “good” 🙂

    1. Hi Daniel.

      Great to see your reply here! Yes, I have seen the debugging demos (I attended your AFDS session) and it is certainly one of the coolest features of C++ AMP. However, my understanding is that many debugging features require Windows 8, while I am still using Windows 7 so I have not been able to test them yet. I do plan to test those features once I install Windows 8, and will do a follow-up blog post.

      I will post the code sample (which is a bit complex since it was actually partially autogenerated). I ran the code on the software adapter, and will post output from the hardware devices as well as the software adapter. I have sent the files to a contact at the IHV and am hoping it will get addressed.

      I don’t have access to the VS 2012 RTM, so I am wondering whether the situation changes there.

      The code in question is an autogenerator for GEMM kernels. I initially wrote the generator for OpenCL, and then retargeted a small portion of it to C++ AMP for the purpose of comparisons. So while the code is quite complex, it is very similar to the OpenCL code.

      About profiling tools, I am still looking at the tools from the vendors. I don’t have a DX background, so I have been unfamiliar with the tooling support. One of the problems is that the types of applications I am interested in aren’t necessarily graphical apps, but more command-line type server applications, and IHV documentation seems focused on profiling graphics apps. Again, I will do a follow-up blog post once I find good solutions for this.

  3. Have you tried to benchmark the performance of C++ AMP binaries against their CUDA counterparts? Are the two essentially the same for non-trivial tasks in terms of performance? Thanks.
