Qt and BB10

I am quite excited about BlackBerry 10. RIM’s development tools are looking pretty good, and as a C++ programmer I am pleased with my initial experience of the toolchain. BlackBerry 10 ships with a full Qt 4.8 implementation in the firmware. As a Qt programmer, you have three options for writing BB10 apps:

  1. Use Cascades. Cascades is a proprietary (but quite cool) UI framework built by RIM. It uses QML as its markup language but does not use the QML painting engine at all; instead it has its own rendering engine and is incompatible with the QWidget and Qt Quick frameworks. You can either use Cascades to draw the UI, or you can use Qt’s painting engine (QtGui or Qt Quick), but not both. However, you can still use QtCore, QtNetwork, etc. in your app. I experimented a little with Cascades and it is pretty nice. For Cascades, I recommend simply using the QNX Momentics IDE that RIM provides as part of the standard NDK download. However, I decided not to pursue this route because I want my code to be platform independent. I am currently a single developer working on Qt projects for fun, and I prefer to maintain a single code base across platforms as much as possible.
  2. Use QWidget. The QtGui module (the basis of QWidget) is fully supported, but if you use QWidget you cannot use Cascades.
  3. Use Qt Quick (perhaps with some QWidgets thrown in). This is also supported but will again exclude the use of Cascades.

If you are interested in options 2 or 3 (QWidget or Qt Quick), then I recommend using Qt Creator. I tested with Qt Creator 2.6.1 on Ubuntu 12.04 64-bit and things work fine. I have not managed to get QWidget or Qt Quick projects building under QNX Momentics; there are always some compilation or build errors for non-Cascades projects.

There are good instructions for configuring Qt Creator in the NDK documentation as well as on the Qt Project wiki. I do not have a BB10 device, but I was able to compile a QWidget-based app in Qt Creator and run it in the BB10 simulator.

One piece of advice: ignore the “simulator-debug” configuration section mentioned on the Qt Project wiki. It appears to be required only for Cascades projects, and trying to make it work in a QWidget-based app wasted a lot of my time. In the end, I omitted it and things started working. I simply defined the BB10 kit in Qt Creator as described, modified the bar-descriptor.xml example given in the official NDK documentation on porting Qt apps, added the “blackberry-x86-qcc” mkspec to qmake as recommended by the Qt Project wiki, and everything worked brilliantly.
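For reference, nothing BB10-specific is needed on the application side: a minimal QWidget “hello world” along the lines below (plain Qt 4.8 code; a sketch of the kind of app I built, not the exact one) is enough to verify the toolchain end to end.

[code]
// Minimal QWidget test app (Qt 4.8); builds unchanged for desktop or BB10.
#include <QtGui/QApplication>
#include <QtGui/QLabel>

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);

    QLabel label("Hello from QWidget on BB10");
    label.setAlignment(Qt::AlignCenter);
    label.show();

    return app.exec();
}
[/code]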

I have not yet tried compiling a Qt Quick application for BB10, but I expect the process to be similar to the QWidget app above.

Texas Instruments Keystone II: HPC perspective

Texas Instruments recently announced their Keystone II chips. Essentially, these combine a multi-core Cortex A15 with DSPs on a single chip. The number of cores and DSP configuration varies depending on the SKU. Here I focus on the top-end SKU 66AK2H12.

The chip has the following integrated:

  • 4 Cortex A15 cores @ 1.4 GHz, giving 44.8 GFlops SP (NEON), 22.4 GFlops SP (IEEE-754), 11.2 GFlops DP
  • 8 C66-family DSPs @ 1.2 GHz, giving 153.6 GFlops SP, 57.6 GFlops DP(?)
  • DDR3 memory controller, 2×64-bit at up to 1600 MHz, giving 25.6 GB/s of bandwidth
  • ARM cores: 4×32 kB L1 data cache, 4×32 kB L1 instruction cache, 4 MB L2 cache shared across cores
  • DSP cores: 8×32 kB L1 data cache, 8×32 kB L1 instruction cache, 1 MB L2 cache per DSP = 8 MB total
  • 6 MB of cache (separate from the L2 caches) shared by the DSPs and ARM cores
  • Up to 14 W power consumption
  • OpenMP programming tools; an alpha version of an OpenCL driver is also available

You should not think of this chip as a GPU-like accelerator. It is intended as a standalone solution, with the four general-purpose ARM cores capable of running any regular ARM application, including a full Linux OS. Parts of your application can be offloaded to the DSPs, or the DSPs can be used in concert with the ARM cores. The DSPs themselves have a fairly flexible instruction set, and my understanding is that you can do function calls, recursion, etc. without issue (correct me if I am wrong; I will confirm from the documentation). The DSPs and the ARM cores read and write the same memory, eliminating the data-copy bottleneck that exists in many PCIe accelerator-type solutions.
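To give a flavour of the offload model, here is a minimal sketch of a vector add dispatched to the DSPs using OpenMP’s accelerator (target) directives. This is my guess at what such code could look like; I have not run it on Keystone II hardware, and the exact pragmas TI’s toolchain supports may differ.

[code]
// Hypothetical vector add offloaded via OpenMP target directives (OpenMP 4.0
// accelerator model). On a shared-memory part like Keystone II, the map()
// clauses would ideally not imply a physical copy.
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    float *pa = &a[0], *pb = &b[0], *pc = &c[0];

    #pragma omp target map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];

    std::printf("c[0] = %f\n", pc[0]);
    return 0;
}
[/code]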

The base specifications look really good. The perf/W looks competitive with GPU-based solutions, and the low power consumption means the chip can be used in many applications where big, power-hungry solutions (such as Teslas or Xeon Phis) are not applicable. The shared memory model is also very enticing, including for, say, supercomputing uses.

TI have a good solution on their hands and should push more aggressively into the HPC space. They should put money into getting libraries such as an optimized BLAS, along with say OpenCV, tuned for the system. TI should also invest in developing good compilers, debuggers and profilers. In particular, they should continue to invest in standards-based solutions like OpenMP and OpenCL; as a newcomer and a smaller player, they cannot afford to introduce yet another proprietary solution.

They also need to gain mindshare as well as marketshare. To gain mindshare, they should make ALL of this available in a nicely packaged fashion, with good descriptive documentation and webpages. They should also make low-cost boards available to really gain some marketshare. People underestimate how convenient Nvidia makes getting and using their CUDA tools: I can buy a cheap Nvidia card for a desktop (or a decent laptop), download the CUDA SDK for free without any agreements, and off I go. Everything is packaged nicely, easy to find and comes with good documentation. Capturing mindshare IS important, and TI should learn those lessons from Nvidia.

I do wish TI all the best in the HPC field. They have built some solid and interesting technology, and the economics also potentially work out, as their DSP technology investments can be leveraged across multiple product lines, much like how Nvidia uses the same designs for both HPC and consumer products. If they invest in building a good software ecosystem around their products, they can certainly compete in this space.

If anyone from TI is reading this, I would love to port all of my software (such as my Python compilers and numerical libraries, see here and here) to your hardware, so please let me know whom I can contact 🙂

Intel Xeon Phi announcement and summary

Intel had announced the Xeon Phi branding and basic architecture long ago, but we finally have details and pricing. Xeon Phi is essentially a 62-core x86 chip; different SKUs will have different numbers of cores and clock speeds enabled. TDPs and rough performance numbers look competitive with offerings such as Nvidia’s Tesla, but the Xeon Phi offers higher programmability and potentially better efficiency on some workloads. The chip sits on a PCIe card and can either be used to offload parts of your program or to run the whole program. It offers a number of programming interfaces, such as OpenMP, that are a lot more convenient than writing, say, CUDA code. Compared with GPUs, it should be relatively easy to get your application up and running on a Xeon Phi, though optimization will still require some effort.

I am also happy to report that OpenCL is fully supported, so porting code from GPUs to Xeon Phi should be easy. Kudos to Intel for getting behind OpenCL and actually delivering fully working products.
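As a rough illustration of how little changes on the host side, here is a sketch of the device selection an existing OpenCL application would need. My understanding is that the Xeon Phi shows up through Intel’s OpenCL platform as an accelerator-type device; I have not been able to verify this on real hardware, so treat the device-type assumption accordingly.

[code]
// Sketch: enumerate OpenCL platforms and pick an accelerator device, which is
// how I expect a Xeon Phi to be exposed (assumption, not verified on hardware).
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main()
{
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);
    if (num_platforms == 0)
        return 0;

    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, &platforms[0], NULL);

    for (cl_uint i = 0; i < num_platforms; ++i) {
        cl_device_id dev;
        // CL_DEVICE_TYPE_ACCELERATOR instead of CL_DEVICE_TYPE_GPU should be
        // the only change a GPU-oriented host program needs here.
        if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ACCELERATOR,
                           1, &dev, NULL) != CL_SUCCESS)
            continue;

        char name[256] = {0};
        clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, NULL);
        std::printf("Found accelerator device: %s\n", name);
        // ... create a context/queue and reuse the existing GPU kernels ...
    }
    return 0;
}
[/code]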

Each core is an in-order, dual-issue x86 core with 4-way SMT, backed by a 512-bit vector unit capable of FMA operations. Each vector unit can do 8 fp64 FMAs (16 flops) or 16 fp32 FMAs (32 flops) per cycle. While SSE and AVX are not available on this core, the vector instruction set is actually very nice, with operations like scatter/gather as well as per-lane write masks. IMO it is a cleaner and more flexible vector ISA than, say, AVX.

Unlike GPUs, Xeon Phi does not have an on-chip user-programmable local memory. Instead, each core is backed by a large 512 kB L2 cache, and the caches are fully coherent. On a 60-core variant that is 30 MB of coherent L2 cache in total, compared with the 1-2 MB L2 caches we are used to seeing on GPUs. This is a HUGE win over GPUs IMO and should give very good efficiency on some workloads, such as some types of sparse matrix computations. Honestly, dealing with on-chip shared memory on GPUs is a giant pain.

My rough guess is that Nvidia’s Tesla K20X will retain a 10-15% edge in some brute-force tests as well as in areas like generic dense linear algebra, and will retain an advantage in fp32 workloads, but there will also be workloads where the Xeon Phi wins out. Overall, the Xeon Phi should retain a programmability advantage.

As an academic (currently), I am a little disappointed that I will likely not be able to test my tools on a Xeon Phi, as we do not have the budget to buy one. With Nvidia, one can start experimenting with CUDA by buying just a $100 card, and Nvidia has also been open about seeding boards to universities where they feel it is appropriate. Xeon Phis start upwards of $2k (much like Teslas), so not many labs will have access to them. I would like to see Intel offer some kind of program to universities to boost the Xeon Phi’s popularity and grow the pool of programmers available for their card 🙂

Overall, a very good showing from Intel, though they do need to keep executing as other competitors are not sitting idle either.

RgBandwidth: My memory bandwidth benchmark for Android

I have just published another benchmark app for Android: a memory bandwidth benchmark derived from the STREAM benchmark.

My benchmark, named RgBandwidth, is meant to provide a rough estimate of the achievable memory bandwidth on your system. Get it from the Play store. To quickly get an estimate of the memory bandwidth achievable on your device, just press “Run” in Auto mode. In about 10-20 seconds you will get various bandwidth ratings in MB/s; the easiest to understand is the Copy bandwidth figure. Alternatively, you can manually select a thread count and experiment.
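For those curious what the Copy figure means, the STREAM-style copy measurement is conceptually just the following. This is a simplified, single-threaded sketch of the idea, not the actual RgBandwidth (NDK, multithreaded) code.

[code]
// Simplified STREAM-style copy bandwidth measurement (single-threaded sketch).
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const size_t n = 8 * 1024 * 1024;             // elements, well beyond cache
    std::vector<double> a(n, 1.0), b(n, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; ++i)                // the "copy" kernel: b = a
        b[i] = a[i];
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double mbytes  = 2.0 * n * sizeof(double) / 1e6;   // bytes read + written
    std::printf("Copy bandwidth: %.1f MB/s (check: %f)\n",
                mbytes / seconds, b[0] + b[n - 1]);
    return 0;
}
[/code]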

On my dual-core Snapdragon S3 device, I got about 1.5 GB/s of peak bandwidth.

If you use my benchmark, I would be very grateful if you could share the numbers with me in the comments below 🙂

Prelim analysis of RgbenchMM

My benchmark (RgbenchMM) for testing floating-point performance on Android is now published on the Play store here.

It is a reasonably optimized matrix multiplication kernel that is fully multithreaded and written in C++ using the NDK. Here is the ARMv7-A assembly produced by GCC for the innermost loop:

[code]
adds r2, r2, #1        @ increment loop counter
adds r1, r1, #8        @ advance first input pointer by one double
adds r0, r0, #8        @ advance second input pointer by one double
cmp r2, r4             @ compare counter against trip count
fldd d7, [r1, #0]      @ load one double from first input
fldd d6, [r0, #0]      @ load one double from second input
fldd d5, [r3, #-24]    @ load four consecutive doubles
fldd d4, [r3, #-16]    @   from the third (strided) pointer
fldd d3, [r3, #-8]
fldd d2, [r3, #0]
fmacd d1, d7, d5       @ d1 += d7 * d5
add r3, r3, r5         @ advance the third pointer by the stride in r5
fmacd d0, d7, d4       @ d0 += d7 * d4
fmacd d8, d7, d3       @ d8 += d7 * d3
fmacd d9, d7, d2       @ d9 += d7 * d2
fmacd d11, d5, d6      @ d11 += d5 * d6
fmacd d12, d4, d6      @ d12 += d4 * d6
fmacd d13, d3, d6      @ d13 += d3 * d6
fmacd d10, d2, d6      @ d10 += d2 * d6
bne .L4                @ loop until the counter reaches the trip count
[/code]

As you can see, each iteration does 6 loads and 8 multiply-accumulates (16 flops). The load instructions (FLDD) are VFP instructions, as are the FMACD instructions, so the benchmark is testing VFP performance almost exclusively. One other detail: the threads are set up so that, ideally, they read the same columns of one of the input matrices. This is beneficial on architectures with at least one level of shared cache, so you may see more than a 2x speedup on a dual-core processor.
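For reference, C++ along the following lines compiles to roughly the assembly above: a 2×4 register-blocked inner product, where two values from one matrix and four from the other feed eight accumulators per iteration. This is a simplified sketch of the kernel, not the exact benchmark source.

[code]
// Simplified 2x4 register-blocked inner loop of a matrix multiply:
// 6 loads and 8 multiply-accumulates per iteration, as in the assembly above.
void mm_inner(const double *a0, const double *a1,   // two contiguous rows of A
              const double *b, long ldb,            // B, with row stride ldb
              long k, double acc[8])
{
    double c00 = 0, c01 = 0, c02 = 0, c03 = 0;
    double c10 = 0, c11 = 0, c12 = 0, c13 = 0;

    for (long p = 0; p < k; ++p) {
        double x0 = a0[p];                 // one double from each A row
        double x1 = a1[p];
        const double *brow = b + p * ldb;  // four consecutive doubles of B
        c00 += x0 * brow[0];  c01 += x0 * brow[1];
        c02 += x0 * brow[2];  c03 += x0 * brow[3];
        c10 += x1 * brow[0];  c11 += x1 * brow[1];
        c12 += x1 * brow[2];  c13 += x1 * brow[3];
    }
    acc[0] = c00; acc[1] = c01; acc[2] = c02; acc[3] = c03;
    acc[4] = c10; acc[5] = c11; acc[6] = c12; acc[7] = c13;
}
[/code]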

With this background in mind, let us examine some of the data reported by testers.

  • Snapdragon S3 dual-core Scorpion @ 1.5 GHz = 1175 MFlops
  • Exynos 4 dual-core @ 1.2 GHz = 920 MFlops
  • Tegra 3 T30L quad-core @ 1.2 GHz = 1488 MFlops
  • OMAP 4460 dual-core @ 1.2 GHz = 900 MFlops

These results are thanks to ChronoReverse, willyjwebb, derFunkenstein, DancinJack on Tech Report forums.

A back-of-the-envelope calculation (cycles per iteration ≈ clock × cores × 16 flops ÷ total MFlops, e.g. for the Snapdragon S3: 1.5 GHz × 2 × 16 / 1175 MFlops ≈ 41) shows that the innermost loop takes roughly 40-42 cycles per iteration per core on the OMAP, Exynos and Snapdragon S3, but about 50 cycles on the Tegra 3. The Tegra 3 result is somewhat surprising to me, given that it uses the same Cortex A9 core as the Exynos or OMAP. One possible culprit is that the L2 cache cannot keep up with feeding 4 cores, but more information is needed to draw definitive conclusions. In particular, results from another quad-core Cortex A9 device, such as an Exynos 4 Quad, would be helpful.

It would be very interesting to see how the newer generation of processors (like the Cortex A15 and Qualcomm Krait) perform.

One thing is clear: there is much to be learned about these ARM processors. The poor state of benchmarks on Android today (except mine, of course :P) and the lack of documentation from the vendors mean that there are a LOT of misperceptions out there.