Quick note on integrated GPU progress from Intel and AMD

If we look at only programmability and floating-point performance, the progress we have made on GPUs is remarkable. Consider the following:

  • Xbox 360 (2005 console): 240 GFlops, DirectX 10 level (mostly)
  • GTX 280 (mid-2008 flagship): 622 GFlops, DirectX 10 and CUDA 1.0
  • AMD Richland 8650G (integrated, 2013): 550+ GFlops, DirectX 11 and OpenCL 1.2
  • Intel Iris Pro 5200 (integrated, 2013): 650+ GFlops, DirectX 11 and OpenCL 1.2

Integrated graphics today, with a TDP of perhaps 20W for the graphics component, offers more floating-point performance than a flagship GPU from just 5 years earlier. Bandwidth constraints remain, though potential solutions are emerging: on-package eDRAM in Intel's case, or GDDR5 as in the PS4. But it is impressive to see that integrated GPUs have advanced so much.
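As a sanity check on these numbers, peak single-precision throughput is just ALU count × clock × flops per ALU per cycle. Here is a minimal sketch; the GTX 280 figures (240 shader ALUs at a 1.296 GHz shader clock, counting one MAD, i.e. 2 flops, per cycle) are from public specs, and the formula is the usual marketing-peak estimate, not measured performance:

```python
def peak_gflops(alus: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical peak GFLOPS: ALUs x clock (GHz) x flops issued per ALU per cycle."""
    return alus * clock_ghz * flops_per_cycle

# GTX 280: 240 shader ALUs @ 1.296 GHz shader clock, one MAD (2 flops) per cycle
print(peak_gflops(240, 1.296))  # ~622 GFLOPS, matching the figure above
```

(NVIDIA's own 933 GFLOPS figure for the GTX 280 additionally counts a co-issued MUL, i.e. 3 flops per cycle; the 622 figure counts the MAD only.)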

HSAIL specs released!

The HSA Foundation has finally released the Programmer’s Reference Manual for the HSA Intermediate Language (HSAIL). So what is HSAIL and why should you care about it? Well, AMD and the HSA Foundation have been talking about the Heterogeneous System Architecture (HSA) for a while now, and HSAIL is one of its building blocks. Salient features of HSAIL are:

  • HSAIL is a portable low-level pseudo-assembly language for heterogeneous systems. HSAIL is not intended to be the real instruction set architecture (ISA) of any hardware. Instead, each hardware vendor will provide a compiler that converts HSAIL to the actual ISA. For example, AMD will provide a driver that compiles HSAIL for its future APUs.
  • A regular programmer will never need to read or write HSAIL. Instead, it is intended as a code-generation target for high-level compilers. For example, a C++ AMP compiler or some sort of Python compiler may generate HSAIL, and the hardware vendor’s driver will then compile that HSAIL down to the hardware’s native ISA.
  • HSAIL is a very cleanly designed language. Compiling HSAIL to native code should be very fast in most cases.
  • HSAIL is a very flexible language. Any GPU implementing HSAIL will have very advanced capabilities, far beyond current standards such as OpenCL 1.2. For example, HSAIL allows GPU kernels to enqueue calls to further GPU kernels without CPU intervention. Function pointers (and hence C++ virtual functions) are supported.

HSA-enabled systems will implement a unified memory space shared by the CPU and GPU. Combined with the flexible execution model defined by HSAIL, I am very excited by the prospects of HSA-enabled products such as the Kaveri APU. I am working on a more detailed writeup about HSA and will post it soon.
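To give a flavor of what a pseudo-assembly code-generation target looks like, here is a hypothetical HSAIL-style kernel. This is illustrative only: the register names, mnemonics and syntax below are my own approximation of the style, not copied from the PRM, so consult the actual spec for real syntax.

```
// Illustrative HSAIL-style pseudo-assembly (approximate syntax):
// out[i] = 2.0f * in[i], one work-item per element
kernel &scale(kernarg_u64 %in, kernarg_u64 %out)
{
    workitemabsid_u32 $s0, 0;       // absolute work-item id along dimension 0
    cvt_u64_u32       $d0, $s0;     // widen the id to 64 bits for addressing
    shl_u64           $d0, $d0, 2;  // byte offset = id * sizeof(float)
    ld_kernarg_u64    $d1, [%in];
    add_u64           $d1, $d1, $d0;
    ld_global_f32     $s1, [$d1];   // load in[i]
    mul_f32           $s1, $s1, 2.0f;
    ld_kernarg_u64    $d2, [%out];
    add_u64           $d2, $d2, $d0;
    st_global_f32     $s1, [$d2];   // store out[i]
};
```

The finalizer (the vendor's HSAIL-to-ISA compiler) would turn something like this into native GPU instructions, which is why compilation can be fast: most of the heavy optimization has already been done by the high-level compiler that emitted the HSAIL.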

AMD Jaguar vs AMD Llano (K10) at same clocks

AMD recently launched their Kabini APUs, and the A4-5000 has been reviewed by a number of websites. However, I haven’t been able to find a review that compares it to other x86 architectures at the same clocks. I happen to have a notebook with an AMD Llano-based A8-3500M in it. This is a quad-core part @ 1.5 GHz, the same clock as the A4-5000. The A8-3500M also has a boost mode that takes it over 2 GHz for some single-threaded workloads, but I disabled that boost mode for this test. My Llano test system runs Win7 Home Premium (64-bit) with 8GB of dual-channel RAM installed. I don’t have an A4-5000 to test, so I referenced Anandtech for Cinebench 11.5 (64-bit, multithreaded) and Mozilla Kraken, PCPer for Euler3D, and Techreport for 7-zip.

Test results:

Test                 Jaguar      Llano (K10)   Jaguar % of K10
Euler3D              0.971 Hz    1.41 Hz       68.9%
Cinebench 11.5       1.5         1.9           78.9%
Mozilla Kraken (ms)  6512.7      5883          90%
7-zip (compress)     3793 MIPS   5317 MIPS     71.3%
7-zip (decompress)   5397 MIPS   6152 MIPS     87.7%

(Kraken scores are times in ms, so lower is better.)

For 7-zip, I just ran it as follows: 7z.exe b 3 -mmt4 -mmd25, as the default number of passes takes way too long, but it should give us a decent idea 🙂

Overall, we are looking at between 70% and 90% of K10 performance at the same clocks in these tests, which is quite good for a small dual-issue core. We don’t yet have any data on how the benchmarks scale with clock speed, though. Hopefully that will become clearer once reviewers get their hands on the A6-5200, which has Jaguar cores clocked @ 2 GHz.
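The last column of the table is easy to reproduce; here is the arithmetic spelled out. The one wrinkle is Kraken, which reports time rather than a score, so relative performance there is the inverse ratio (K10’s time over Jaguar’s):

```python
def relative_perf(jaguar: float, k10: float, lower_is_better: bool = False) -> float:
    """Jaguar's performance as a percentage of K10's at the same clock."""
    ratio = k10 / jaguar if lower_is_better else jaguar / k10
    return round(100 * ratio, 1)

print(relative_perf(0.971, 1.41))                         # Euler3D -> 68.9
print(relative_perf(6512.7, 5883, lower_is_better=True))  # Kraken  -> 90.3
```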

Snapdragon S4 devboard

Found an interesting upcoming single-board computer (SBC) with a quad-core Krait @ 1.7 GHz and an Adreno 320. At $149, it looks interesting. Still only in pre-order though. Might get one if the software support is right. Have been burnt before by devboards with interesting hardware but crap drivers, so will wait and watch for now. Particularly interested to see if it supports OpenCL, and whether the OS options include fully-supported Linux and not just Android.

Link: http://www.inforcecomputing.com/product/moreinfo/ifc6410.html

Update: The company confirmed that the board supports OpenCL on the GPU! Currently they have an Android image based on Jelly Bean 4.1.2. Linux seems to be on the roadmap, but with no ETA, I believe.


Double precision on GPGPU APIs

Many scientific computations are done in double precision floating-point (i.e. fp64). Support for fp64 varies between GPU architectures as well as GPGPU APIs. Here I just recap the capabilities of various APIs, assuming the hardware support is present:

1. CUDA: Full support for fp64 including exponentials, trigonometry etc.

2. OpenCL: Full support for fp64, similar to CUDA

3. OpenGL: An extension called gpu_shader_fp64 is available, but it only supports basics like addition, multiplication and division. It does not support exponentials, trigonometry etc.

4. DirectCompute: On Windows 7, it only supports fp64 add, multiply and a few comparison operators, but not division, exponentials etc. On Windows 8, some GPUs add support for double-precision division, reciprocal and FMA. However, as far as I know, there is still no support for exponentials, trigonometry etc.

So, if you want full fp64 support, I guess OpenCL and CUDA are the way to go currently.
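If you are wondering how much precision is actually at stake, a quick way to see the fp32/fp64 gap is to round-trip a value through 32-bit storage (Python floats are fp64, and the standard-library struct module can truncate one to fp32):

```python
import struct

def to_fp32(x: float) -> float:
    """Round-trip an fp64 value through 32-bit storage to see what fp32 retains."""
    return struct.unpack('f', struct.pack('f', x))[0]

x = 0.1                     # not exactly representable in binary at either width
print(x)                    # 0.1 (fp64 keeps ~15-16 significant decimal digits)
print(to_fp32(x))           # 0.10000000149011612 (fp32 keeps only ~7)
print(abs(x - to_fp32(x)))  # per-value error that a long fp32 computation accumulates
```

That roughly 1e-9 error per value is exactly what pushes iterative scientific codes toward fp64, and hence toward CUDA or OpenCL over the graphics APIs above.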