Architecture – Page 2 – Searching for divine code

AMD Jaguar vs AMD Llano (K10) at same clocks

AMD recently launched their Kabini APUs, and the A4-5000 has been reviewed by a number of websites. However, I haven’t been able to find a review that compares it to other x86 architectures at the same clocks. I happen to have a notebook with an AMD Llano-based A8-3500M in it. This is a quad-core unit @ 1.5GHz, the same clock as the A4-5000. The A8-3500M also has a boost mode that takes it to over 2GHz for some single-threaded workloads, but I disabled that boost mode for this test. My Llano test system is running Win7 Home Premium (64-bit) and has 8GB of dual-channel RAM installed. I don’t have an A4-5000 to test, so I referenced Anandtech for Cinebench 11.5 (64-bit multithreaded) and Mozilla Kraken, PCPer for Euler3D and Techreport for 7-zip.

Test results:

Test	Jaguar	Llano K10	Jaguar % of K10
Euler3D	0.971 Hz	1.41 Hz	68.9%
Cinebench 11.5	1.5	1.9	78.9%
Mozilla Kraken (ms, lower is better)	6512.7	5883	90.3%
7-zip (compress)	3793 MIPS	5317 MIPS	71.3%
7-zip (decompress)	5397 MIPS	6152 MIPS	87.7%

For 7-zip, I ran a shortened benchmark with 7z.exe b 3 -mmt4 -mmd25, since the default number of passes takes way too long, but it should give us a decent idea 🙂

Overall, we are looking at roughly 70-90% of K10 performance at the same clocks in these tests, which is quite good for a small dual-issue core. We don’t have any data on how the benchmarks scale with clock speed though. Hopefully that will become clearer in the future when reviewers get their hands on the A6-5200, which has Jaguar cores clocked @ 2GHz.
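As a rough sketch, the percentage column in the table can be reproduced with a few lines of Python. The only assumption added here is that the Kraken scores are times (lower is better), so that ratio is inverted; the function and variable names are purely illustrative.

```python
# Scores copied from the table: (Jaguar, Llano K10, higher_is_better).
# Assumption: the Kraken figures are times, so lower is better.
RESULTS = {
    "Euler3D (Hz)": (0.971, 1.41, True),
    "Cinebench 11.5": (1.5, 1.9, True),
    "Mozilla Kraken": (6512.7, 5883.0, False),
    "7-zip compress (MIPS)": (3793.0, 5317.0, True),
    "7-zip decompress (MIPS)": (5397.0, 6152.0, True),
}


def jaguar_percent_of_k10(jaguar, k10, higher_is_better):
    """Return Jaguar performance as a percentage of K10 performance."""
    ratio = jaguar / k10 if higher_is_better else k10 / jaguar
    return 100.0 * ratio


if __name__ == "__main__":
    for name, (jaguar, k10, higher_is_better) in RESULTS.items():
        percent = jaguar_percent_of_k10(jaguar, k10, higher_is_better)
        print(f"{name}: {percent:.1f}%")
```

Since both chips run at 1.5GHz here, the ratio doubles as a per-clock comparison.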

Texas Instruments Keystone II : HPC perspective

Texas Instruments recently announced their Keystone II chips. Essentially, these combine a multi-core Cortex A15 with DSPs on a single chip. The number of cores and DSP configuration varies depending on the SKU. Here I focus on the top-end SKU 66AK2H12.

The chip has the following integrated:

4 Cortex A15 cores @ 1.4 GHz giving 44.8 GFlops SP (NEON), 22.4 GFlops SP (IEEE-754), 11.2 GFlops DP
8 C66-family DSPs @ 1.2 GHz giving 153.6 GFlops SP, 57.6 GFlops DP?
DDR3 memory controller 2×64-bit up to 1600MHz giving 25.6 GB/s bandwidth
ARM cores L1 data cache 4*32 kB, L1 instruction cache 4*32kB, L2 cache 4MB shared across cores
DSP cores L1 data cache 8*32kB, instruction cache 8*32kB, L2 cache 1MB/DSP = 8MB total
6 MB of cache (separate from L2 caches) shared by DSPs and ARM cores
Up to 14W power consumption
OpenMP programming tools; an alpha version of an OpenCL driver is also available
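The peak figures in the list follow from units × clock × flops per cycle. The sketch below reproduces them; note that the per-cycle flop counts are back-derived from the quoted totals rather than taken from TI or ARM documentation, and "1600MHz" is assumed to mean 1600 mega-transfers per second.

```python
def peak_gflops(units, clock_ghz, flops_per_cycle):
    """Peak GFlops for `units` cores, each doing `flops_per_cycle` at `clock_ghz`."""
    return units * clock_ghz * flops_per_cycle


def peak_bandwidth_gb_s(channels, bus_bits, mega_transfers_per_s):
    """Peak DRAM bandwidth in GB/s (assumes one transfer moves bus_bits bits)."""
    bytes_per_transfer = bus_bits / 8
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000.0


if __name__ == "__main__":
    # Flops per cycle below are inferred from the quoted totals (an assumption).
    print("A15 SP NEON :", peak_gflops(4, 1.4, 8))    # quoted as 44.8
    print("A15 SP IEEE :", peak_gflops(4, 1.4, 4))    # quoted as 22.4
    print("A15 DP      :", peak_gflops(4, 1.4, 2))    # quoted as 11.2
    print("C66 DSP SP  :", peak_gflops(8, 1.2, 16))   # quoted as 153.6
    print("C66 DSP DP  :", peak_gflops(8, 1.2, 6))    # quoted as 57.6 (uncertain)
    print("DDR3 GB/s   :", peak_bandwidth_gb_s(2, 64, 1600))  # quoted as 25.6
```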

You should not think of this chip as a GPU-like accelerator. It is intended to be a standalone solution, with the 4 general-purpose ARM cores capable of running any regular ARM applications including a full Linux OS. Certain parts of your application can be offloaded to the DSPs, or the DSPs can be used in concert with the ARM cores. The DSPs themselves have a fairly flexible instruction set and my understanding is that you can do function calls, recursion etc. without issue (correct me if I am wrong, will confirm from documentation). The DSPs and the ARM cores both read from and write to the same memory, eliminating the data-copy bottleneck that exists on many PCIe accelerator-type solutions.

The base specifications are looking really good. The perf/W looks to be competitive with GPU-based solutions. The low power consumption means that it can be used in many applications where the big power-hungry solutions (such as Teslas or Xeon Phis) are not applicable. The shared memory model is also very enticing for everyone, including, say, supercomputing users.

TI have a good solution on their hands and should push more aggressively into the HPC space. They should put money into getting libraries like BLAS optimized for the system, along with, say, OpenCV. TI should invest money into developing good compilers, debuggers and profilers. They should particularly continue to invest in standards-based solutions like OpenMP and OpenCL. As a newcomer and a smaller player, they cannot afford to introduce yet another proprietary solution.

They also need to gain some mindshare as well as marketshare. To gain mindshare, they should make sure ALL of this is available in a nicely packaged fashion with good descriptive documentation and webpages. They should also make low-cost boards available to really gain some marketshare. People underestimate how convenient Nvidia makes getting and using their tools for CUDA. I can just buy a cheap Nvidia card for a desktop (or buy a decent laptop), download the CUDA SDK for free without any agreements and off I go. Everything is packaged nicely, easy to find and comes with good documentation. Capturing mindshare IS important and TI should learn those lessons from Nvidia.

I do wish TI all the best in the HPC field. They have built some solid and interesting technology, and the economics also potentially work out, as their DSP technology investments can be leveraged across multiple product lines, much like how Nvidia is able to use the same designs for both HPC and consumer products. If they invest in building a good software ecosystem around their products, they can certainly compete in this space.

If anyone from TI is reading this, I would love to port all of my software (such as my Python compilers and numerical libraries, see here and here) to your hardware, so please let me know whom I can contact 🙂

Intel Xeon Phi announcement and summary

Intel announced the Xeon Phi branding and basic architecture long ago, but we finally have details and pricing. Xeon Phi is essentially a 62-core x86 chip. Different SKUs will have different numbers of cores enabled and different clock speeds. TDPs and rough performance numbers look competitive with offerings such as Nvidia Tesla, but the Xeon Phi offers higher programmability and potentially better efficiency on some workloads. The chip will sit on a PCIe board and can either be used to offload parts of your program, or run the whole program. The board offers a number of programming interfaces such as OpenMP that are a lot more convenient than writing, say, CUDA code. Compared to GPUs, it should be relatively easy to get your application up and running on a Xeon Phi, though optimization will still require some effort.

However, I am happy to report that OpenCL is also fully supported, so porting code from GPUs to Xeon Phi remains easy. Kudos to Intel for getting behind OpenCL and actually delivering fully working products.

Each core is an in-order dual-issue x86 design with SMT (4 threads) backed by a 512-bit vector unit capable of doing FMA operations. Each vector unit can do 8 fp64 FMAs (16 flops) or 16 fp32 FMAs (32 flops) each cycle. While there is no SSE or AVX available on this core, the vector instruction set is actually very nice, with operations like scatter-gather as well as per-lane write masks. IMO it is a cleaner and more flexible vector ISA than, say, AVX.

Unlike GPUs, Xeon Phi does not have an on-chip user-programmable local memory. Instead, it is backed by a large 512kB L2 cache on each core, and the cache is fully coherent. In total, on a 60-core variant, that is 30MB of coherent L2 cache compared to the 1-2 MB L2 caches we are used to seeing on GPUs. This is a HUGE win compared to GPUs IMO and should give very good efficiency on some workloads, such as some types of sparse matrix computations. Honestly, dealing with on-chip shared memory on GPUs is a giant pain.
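The per-core numbers above are simple to derive from the 512-bit vector width. Here is a small sketch of that arithmetic; the clock speed passed to peak_gflops in the example is a placeholder and not the clock of any real SKU, since no clocks are quoted here.

```python
VECTOR_BITS = 512      # vector unit width per core
L2_PER_CORE_KB = 512   # L2 cache per core


def fma_lanes(element_bits):
    """Number of elements (and hence FMAs) that fit in one vector operation."""
    return VECTOR_BITS // element_bits


def flops_per_cycle_per_core(element_bits):
    """Each FMA counts as 2 flops (one multiply plus one add)."""
    return 2 * fma_lanes(element_bits)


def total_l2_mb(cores):
    """Aggregate coherent L2 across all enabled cores, in MB."""
    return cores * L2_PER_CORE_KB / 1024


def peak_gflops(cores, clock_ghz, element_bits):
    """Peak GFlops for a given core count and clock (clock is caller-supplied)."""
    return cores * clock_ghz * flops_per_cycle_per_core(element_bits)


if __name__ == "__main__":
    print("fp64 flops/cycle/core:", flops_per_cycle_per_core(64))
    print("fp32 flops/cycle/core:", flops_per_cycle_per_core(32))
    print("L2 on a 60-core part (MB):", total_l2_mb(60))
    # Placeholder clock of 1.0 GHz, purely to show how the formula is used.
    print("fp64 peak at 1.0 GHz:", peak_gflops(60, 1.0, 64))
```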

My rough guess is that Nvidia’s Tesla K20X will retain a 10-15% edge in some brute force tests as well as tests like generic dense linear algebra, and will retain an advantage in fp32 workloads, but there will also be workloads where Xeon Phi will win out. And overall Xeon Phi should retain a programmability advantage.

As an academic (currently), I am a little disappointed that I will likely not be able to test my tools on a Xeon Phi, as we do not have the budget to buy one. With Nvidia, one can start experimenting with CUDA by buying just a $100 card, and Nvidia has also been open about seeding their boards to universities where they feel it is appropriate. Xeon Phis start upwards of $2k (much like Teslas), so not many labs will have access to them. I would like to see Intel offer some kind of program to universities to boost the Xeon Phi’s popularity and grow the pool of programmers available for their card 🙂

Overall, a very good showing from Intel, though they do need to keep executing as other competitors are not sitting idle either.

Cortex A15 and ARM Mali T604 are here

The new Chromebooks are apparently based upon the Exynos 5250, making them perhaps the first shipping consumer devices with Cortex A15 as well as the ARM Mali T604. The Mali T604 theoretically supports OpenCL, which had me excited, but the fly in the ointment is Chrome OS. Google has confirmed in a forum post that there is currently no way to access OpenCL in Chrome OS, and they are not ready to comment on whether this will change in the future either. This is frankly ridiculous. What’s the point of shipping powerful new hardware when developers are not given access to it? I hope one can load a proper Linux distro like Debian or Ubuntu, and hopefully the binary GPU drivers with OpenGL and OpenCL support will be made available for them.

RgBandwidth: My memory bandwidth benchmark for Android

Just published another benchmark app for Android. It is a memory bandwidth benchmark derived from the STREAM benchmark.

My benchmark, named RgBandwidth, is meant to provide you with a rough estimate of the achievable memory bandwidth on your system. Get it from the Play store. To quickly get an estimate of memory bandwidth performance achievable on your device, just press “Run” using the Auto mode. Then in about 10-20 seconds, you will get various bandwidth ratings in MB/s. The easiest to understand is the Copy Bandwidth data. Alternatively, you can manually select a thread number and experiment around.
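To give a feel for what a STREAM-style Copy figure measures, here is a minimal single-threaded sketch in Python. It is not the code used in RgBandwidth (which is a native Android app); the function name, buffer size and repeat count are illustrative choices. Following the STREAM convention, Copy counts both the bytes read and the bytes written.

```python
import time


def copy_bandwidth_mb_s(size_bytes=16 * 1024 * 1024, repeats=5):
    """Estimate copy bandwidth in MB/s using the best of several timed copies."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # reads size_bytes from src, writes size_bytes to dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading from a coarse timer
    # STREAM-style accounting: bytes read plus bytes written.
    return 2 * size_bytes / best / 1e6


if __name__ == "__main__":
    print(f"Copy bandwidth: {copy_bandwidth_mb_s():.0f} MB/s")
```

The buffers should be much larger than the last-level cache, otherwise the number reflects cache bandwidth rather than memory bandwidth.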

On my dual-core Snapdragon S3 device, I got about 1.5GB/s of peak bandwidth.

If you use my benchmark, I would be very grateful if you could share the numbers with me in the comments below 🙂

Prelim analysis of RgbenchMM

My benchmark (RgbenchMM) for testing floating-point performance on Android is now published on the Play store here.

It is a reasonably optimized matrix multiplication kernel that is fully multithreaded and written in C++ using the NDK. Here is the ARMv7-A assembly code produced by GCC for the innermost loop:

[code]
adds r2, r2, #1
adds r1, r1, #8
adds r0, r0, #8
cmp r2, r4
fldd d7, [r1, #0]
fldd d6, [r0, #0]
fldd d5, [r3, #-24]
fldd d4, [r3, #-16]
fldd d3, [r3, #-8]
fldd d2, [r3, #0]
fmacd d1, d7, d5
add r3, r3, r5
fmacd d0, d7, d4
fmacd d8, d7, d3
fmacd d9, d7, d2
fmacd d11, d5, d6
fmacd d12, d4, d6
fmacd d13, d3, d6
fmacd d10, d2, d6
bne .L4
[/code]

As you can see, it does 6 loads and 8 multiply-accumulates (or 16 flops) inside the loop. The load instructions (FLDD) are VFP instructions, as are the FMACD instructions. Thus, the benchmark is testing the VFP performance almost exclusively. One other detail about the code is that the threads are set up so that ideally they are reading the same columns of one of the input matrices. This will be beneficial on architectures with at least 1 level of shared cache, and thus you may see more than 2x speedup on a dual-core processor.
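One plausible reading of that loop is a 2×4 register tile: two values from one input matrix and four from the other feed eight accumulators per step. The Python sketch below illustrates that blocking pattern only; it is an interpretation of the assembly, not the actual RgbenchMM source, and the function name is hypothetical.

```python
def matmul_2x4_tiles(a, b):
    """Compute C = A * B with 2x4 accumulator tiles.

    Requires an even number of rows in A and a column count in B that is
    a multiple of 4 (an illustrative restriction to keep the sketch short).
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    assert n % 2 == 0 and m % 4 == 0
    c = [[0.0] * m for _ in range(n)]
    for i in range(0, n, 2):
        for j in range(0, m, 4):
            # Eight accumulators, analogous to eight VFP registers.
            c00 = c01 = c02 = c03 = 0.0
            c10 = c11 = c12 = c13 = 0.0
            for k in range(k_dim):
                a0 = a[i][k]                    # 2 loads from A
                a1 = a[i + 1][k]
                b0, b1, b2, b3 = b[k][j:j + 4]  # 4 loads from B
                # 8 multiply-accumulates = 16 flops per iteration.
                c00 += a0 * b0
                c01 += a0 * b1
                c02 += a0 * b2
                c03 += a0 * b3
                c10 += a1 * b0
                c11 += a1 * b1
                c12 += a1 * b2
                c13 += a1 * b3
            c[i][j:j + 4] = [c00, c01, c02, c03]
            c[i + 1][j:j + 4] = [c10, c11, c12, c13]
    return c
```

The point of the tile is data reuse: each loaded value takes part in several multiply-accumulates, so the loop needs only 6 loads for 16 flops.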

With this background in mind, let us examine some of the data reported by testers.

Snapdragon S3 dual-core Scorpion @ 1.5GHz = 1175 MFlops

Exynos 4 dual-core @ 1.2 GHz = 920 MFlops

Tegra 3 T30L quad-core @ 1.2 GHz = 1488 MFlops

OMAP 4460 dual-core @ 1.2 GHz = 900 MFlops

These results are thanks to ChronoReverse, willyjwebb, derFunkenstein, DancinJack on Tech Report forums.

A back-of-the-envelope calculation shows that the innermost loop is executed on each core in about 40-42 cycles on OMAP, Exynos and Snapdragon S3, but about 50 cycles on the Tegra 3. The Tegra 3 result is somewhat surprising to me given that it is using the same Cortex A9 core as Exynos or OMAP. One possible culprit is that the L2 cache cannot keep up with feeding 4 cores. However, more information is necessary to draw definitive conclusions. In particular, if you have tested it on another Cortex A9 quad-core device like an Exynos 4 Quad, that will be helpful.
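The back-of-the-envelope estimate can be sketched as follows. The assumptions are that every flop comes from the inner loop (16 flops per iteration), that the work is spread evenly over all cores, and that each core runs at its nominal clock throughout; under those assumptions the results land close to the figures quoted above.

```python
FLOPS_PER_ITERATION = 16  # 8 multiply-accumulates per inner-loop iteration

# Reported results: (cores, clock in MHz, total MFlops).
RESULTS = {
    "Snapdragon S3": (2, 1500, 1175),
    "Exynos 4": (2, 1200, 920),
    "Tegra 3 T30L": (4, 1200, 1488),
    "OMAP 4460": (2, 1200, 900),
}


def cycles_per_iteration(cores, clock_mhz, total_mflops):
    """Estimate clock cycles spent per inner-loop iteration on one core."""
    per_core_mflops = total_mflops / cores
    # Millions of inner-loop iterations per second on one core.
    mega_iterations_per_s = per_core_mflops / FLOPS_PER_ITERATION
    return clock_mhz / mega_iterations_per_s


if __name__ == "__main__":
    for name, (cores, clock_mhz, mflops) in RESULTS.items():
        cycles = cycles_per_iteration(cores, clock_mhz, mflops)
        print(f"{name}: {cycles:.1f} cycles per iteration")
```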

It would be very interesting to see how the newer generation of processors (like Cortex A15 and Qualcomm Krait) will perform.

One thing is clear. There is much to be learned from these ARM processors. The poor state of benchmarks on Android today (except mine of course :P) and the lack of documentation from the vendors mean that there are a LOT of misperceptions out there.