Searching for divine code

RgBandwidth: My memory bandwidth benchmark for Android

Just published another benchmark app for Android. It is a memory bandwidth benchmark derived from the STREAM benchmark.

My benchmark, named RgBandwidth, is meant to provide you with a rough estimate of the achievable memory bandwidth on your system. Get it from the Play store. To quickly get an estimate of memory bandwidth performance achievable on your device, just press “Run” using the Auto mode.
Then in about 10-20 seconds, you will get various bandwidth ratings in MB/s. The easiest to understand is the Copy Bandwidth data. Alternatively, you can manually select the number of threads and experiment.
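
For readers unfamiliar with STREAM, here is a minimal single-threaded sketch of what a Copy bandwidth measurement looks like. This is not the RgBandwidth source; the function name `copy_bandwidth_mbps` and its parameters are made up for illustration. It follows the STREAM convention of counting both the bytes read and the bytes written.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from RgBandwidth): times a STREAM-style
// Copy kernel, c[i] = a[i], and returns the best observed rate in MB/s.
double copy_bandwidth_mbps(std::size_t n, int repeats) {
    assert(n > 0 && repeats > 0);
    std::vector<double> a(n, 1.0), c(n, 0.0);
    double best_seconds = 1e30;
    double checksum = 0.0;
    for (int r = 0; r < repeats; ++r) {
        a[0] = static_cast<double>(r);  // make every pass copy different data
        const auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i) c[i] = a[i];
        const auto t1 = std::chrono::steady_clock::now();
        checksum += c[0] + c[n - 1];    // use the result so the copy is kept
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (s > 0.0 && s < best_seconds) best_seconds = s;
    }
    volatile double sink = checksum;
    (void)sink;
    // STREAM convention for Copy: one read plus one write per element.
    const double bytes = 2.0 * static_cast<double>(n) * sizeof(double);
    return bytes / best_seconds / 1e6;
}
```

Calling something like `copy_bandwidth_mbps(1 << 24, 10)` uses arrays much larger than the caches, which is what you want for a memory bandwidth number. A real benchmark would also run the kernel on several threads.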

On my dual-core Snapdragon S3 device, I got about 1.5 GB/s of peak bandwidth.

If you use my benchmark, I would be very grateful if you could share the numbers with me in the comments below 🙂

Prelim analysis of RgbenchMM

My benchmark (RgbenchMM) for testing floating-point performance on Android is now published on Play store here

It is a reasonably optimized matrix multiplication kernel that is fully multithreaded and written in C++ using the NDK. Here is the ARMv7-A assembly code that GCC produces for the innermost loop:

[code]
adds r2, r2, #1
adds r1, r1, #8
adds r0, r0, #8
cmp r2, r4
fldd d7, [r1, #0]
fldd d6, [r0, #0]
fldd d5, [r3, #-24]
fldd d4, [r3, #-16]
fldd d3, [r3, #-8]
fldd d2, [r3, #0]
fmacd d1, d7, d5
add r3, r3, r5
fmacd d0, d7, d4
fmacd d8, d7, d3
fmacd d9, d7, d2
fmacd d11, d5, d6
fmacd d12, d4, d6
fmacd d13, d3, d6
fmacd d10, d2, d6
bne .L4
[/code]

As you can see, it does 6 loads and 8 multiply-accumulates (16 flops) per iteration of the loop. The load instructions (FLDD) are VFP instructions, as are the FMACD instructions. Thus, the benchmark is testing VFP performance almost exclusively. One other detail about the code is that the threads are set up so that, ideally, they read the same columns of one of the input matrices. This is beneficial on architectures with at least one level of shared cache, and thus you may see more than a 2x speedup on a dual-core processor.
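
To make the assembly easier to follow, here is a C++ sketch of a 2x4 register-blocked kernel that would compile to a loop of this shape: two values loaded from A, four consecutive values loaded from a row of B, and eight accumulators. This is a reconstruction from the assembly, not the actual RgbenchMM source; the name `matmul_2x4` and the row-major square layout are assumptions, and threading is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reconstruction: C += A * B for row-major n x n matrices,
// computing a 2x4 block of C at a time in registers.
void matmul_2x4(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::size_t n) {
    assert(n % 4 == 0);  // keeps the sketch free of edge handling
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i = 0; i < n; i += 2) {
        for (std::size_t j = 0; j < n; j += 4) {
            double c00 = 0, c01 = 0, c02 = 0, c03 = 0;
            double c10 = 0, c11 = 0, c12 = 0, c13 = 0;
            for (std::size_t k = 0; k < n; ++k) {
                // 6 loads: one element from each of two rows of A,
                // four consecutive elements from row k of B.
                const double a0 = A[i * n + k];
                const double a1 = A[(i + 1) * n + k];
                const double b0 = B[k * n + j];
                const double b1 = B[k * n + j + 1];
                const double b2 = B[k * n + j + 2];
                const double b3 = B[k * n + j + 3];
                // 8 multiply-accumulates = 16 flops.
                c00 += a0 * b0; c01 += a0 * b1; c02 += a0 * b2; c03 += a0 * b3;
                c10 += a1 * b0; c11 += a1 * b1; c12 += a1 * b2; c13 += a1 * b3;
            }
            C[i * n + j] += c00;           C[i * n + j + 1] += c01;
            C[i * n + j + 2] += c02;       C[i * n + j + 3] += c03;
            C[(i + 1) * n + j] += c10;     C[(i + 1) * n + j + 1] += c11;
            C[(i + 1) * n + j + 2] += c12; C[(i + 1) * n + j + 3] += c13;
        }
    }
}
```

The point of the blocking is the ratio: each loaded value is reused several times, so 6 loads feed 8 multiply-accumulates instead of the 2 loads per multiply-accumulate of a naive triple loop.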

With this background in mind, let us examine some of the data reported by testers.

Snapdragon S3 dual-core Scorpion @ 1.5GHz = 1175 MFlops

Exynos 4 dual-core @ 1.2 GHz = 920 MFlops

Tegra 3 T30L quad-core @ 1.2 GHz = 1488 MFlops

OMAP 4460 dual-core @ 1.2 GHz = 900 MFlops

These results are thanks to ChronoReverse, willyjwebb, derFunkenstein, DancinJack on Tech Report forums.

A back-of-the-envelope calculation shows that each iteration of the innermost loop takes about 40-42 cycles per core on OMAP, Exynos and Snapdragon S3, but about 50 cycles on the Tegra 3. The Tegra 3 result is somewhat surprising to me given that it uses the same Cortex A9 core as the Exynos and OMAP. One possible culprit is that the L2 cache cannot keep up with feeding 4 cores. However, more information is necessary to draw definitive conclusions. In particular, if you have tested it on another quad-core Cortex A9 device like an Exynos 4 Quad, that will be helpful.
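
The back-of-the-envelope calculation can be written out as a small function. This is a sketch under two assumptions: every core runs the loop at the same rate, and all reported flops come from the 16-flop inner loop. The name `cycles_per_iteration` is made up for illustration.

```cpp
#include <cassert>

// Cycles spent per inner-loop iteration on each core, given the total
// reported MFlops, the core count and the clock in MHz.
double cycles_per_iteration(double mflops_total, int cores, double clock_mhz,
                            double flops_per_iteration = 16.0) {
    assert(mflops_total > 0.0 && cores > 0 && clock_mhz > 0.0);
    // Millions of loop iterations completed per second on one core.
    const double miters_per_core = mflops_total / cores / flops_per_iteration;
    // MHz divided by millions of iterations per second gives cycles/iteration.
    return clock_mhz / miters_per_core;
}
```

Plugging in the numbers above: `cycles_per_iteration(1175, 2, 1500)` is about 40.9 for the Snapdragon S3, `cycles_per_iteration(920, 2, 1200)` is about 41.7 for the Exynos 4, `cycles_per_iteration(900, 2, 1200)` is about 42.7 for the OMAP 4460, and `cycles_per_iteration(1488, 4, 1200)` is about 51.6 for the Tegra 3.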

It would be very interesting to see how the newer generation of processors (like Cortex A15 and Qualcomm Krait) will perform.

One thing is clear. There is much to be learned from these ARM processors. The poor state of benchmarks on Android today (except mine of course :P) and the lack of documentation from the vendors mean that there are a LOT of misperceptions out there.

Some thoughts on Android benchmarking

Some ideas are as follows:

1. Touch responsiveness is an objectively measurable quantity. I think high speed cameras can play a very important role in this field.
Some good initial work has been reported by Tech Report.
It is unfortunately increasingly common to hear “Benchmarks don’t matter” and then some semi-coherent rant about user experience and “smooth” UIs. I think all it means is that the writer had no idea how to measure the touch responsiveness 😛

2. Application launch times: Application launch times can again be measured objectively. For slower apps, you can use a stopwatch and for small fast-launching apps, you can again use a high speed camera.

3. Web browsing benchmarks need to go beyond SunSpider and Browsermark. It is important to show the load times of REAL webpages. For reproducible results, pages from, say, the top 10 websites (at a given date) should be copied to a local server and then tested from there. Anandtech used to include such benchmarks, but for some reason even they have fallen back to just using the meaningless synthetics. Even in synthetics, the test coverage needs to be increased. For JavaScript, perhaps tests like Mozilla Kraken or the new Octane suite should be looked at.

4. Proper application benchmarks need to be more common instead of synthetics. For example, Photoshop Touch can perhaps be used much as Photoshop benchmarks are now common on the desktop.

5. Synthetic benchmarks are poorly written and poorly understood. For example, Linpack Android version seems to be a poorly coded benchmark. The megaflops reported from the benchmark are far off the capabilities of the chips tested. I am looking at making a better one when I get time. For floating point tests, really what you want are accurate and separately reported measures of fp32 performance, fp64 performance and fp32-with-NEON performance.

6. Further, synthetics should properly report both single threaded and multithreaded numbers. (Linpack does do this but many other benchmarks don’t). I think single-thread performance is underestimated on mobile with most websites reporting benchmarks from multithreaded tests. However, few apps use 4 cores in any modern Android phone. And no, you don’t need 4 cores to multitask.
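
As a rough illustration of point 5, here is a sketch of a scalar micro-benchmark that reports fp32 and fp64 rates separately through one template. It is only a sketch: the name `measure_mflops` is hypothetical, it is single-threaded, it does not cover the NEON case, and it counts each `a * m + c` as 2 flops whether or not the compiler fuses it.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Runs four independent multiply-add chains of type T and returns MFlops.
// Instantiate with float for fp32 and with double for fp64.
template <typename T>
double measure_mflops(long iterations) {
    assert(iterations > 0);
    // Volatile inputs stop the compiler from folding the loop away.
    volatile T vm = static_cast<T>(0.999);
    volatile T vc = static_cast<T>(0.001);
    T a0 = 1, a1 = 2, a2 = 3, a3 = 4;
    const auto t0 = std::chrono::steady_clock::now();
    const T m = vm;
    const T c = vc;
    for (long i = 0; i < iterations; ++i) {
        // 4 multiply-adds = 8 flops per iteration.
        a0 = a0 * m + c;
        a1 = a1 * m + c;
        a2 = a2 * m + c;
        a3 = a3 * m + c;
    }
    volatile T sink = a0 + a1 + a2 + a3;  // keep the results alive
    (void)sink;
    const auto t1 = std::chrono::steady_clock::now();
    const double seconds =
        std::max(std::chrono::duration<double>(t1 - t0).count(), 1e-9);
    return 8.0 * static_cast<double>(iterations) / seconds / 1e6;
}
```

Reporting `measure_mflops<float>(n)` and `measure_mflops<double>(n)` side by side is the kind of separation I mean. For point 6, the same function would be run once on a single thread and once on every core, with both numbers shown.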

Some informed speculation about ARM T604

ARM T604 is an upcoming mobile GPU from ARM. I remember reading slides from an ARM presentation, though I cannot find the link now; perhaps they were taken down. Anyway, here is what we know:

1. Quad-core

2. Up to 68 GFlops of compute performance. I assume this is for fp32. The Exynos 5 Dual whitepaper claims 72 GFlops.

3. Barrel threaded (i.e. multiple simultaneous threads) like AMD or Nvidia

4. No SIMT! Rather, SIMD architecture. I take this to mean, the vector lanes are not predicated. So be prepared to write explicitly SIMD code.

5. Now 68 GFlops / 4 cores = 17 GFlops/core. Assuming a 500 MHz clock speed, that gives us 34 flops/cycle per core.

We do know that it has 2 ALUs/core, so each ALU does 17 flops/cycle. Each ALU has one scalar and one (or more?) vector units. So perhaps 1 scalar and 1 vec8 unit with MAD? Or perhaps 1 scalar and 2 vec4 units with MAD.

(If we go by the Exynos 5 Dual whitepaper, perhaps they have modified the scalar unit to also do MAD instead of just one flop/cycle.)

6. Full IEEE precision for fp32 and fp64. Very nice, ARM! The full fp64 support makes me excited about this architecture for my uses. ARM has not published the fp64 speeds, but I think it will be either 1/4th or 1/8th of the fp32 rate.

7. OpenCL 1.1 Full Profile support. I hope that EVERY device that ships with this GPU comes with working OpenCL drivers and an open SDK is provided to everyone.
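
The arithmetic in point 5 can be checked with a few lines of code. This is just the speculation above written down, so the 500 MHz clock, the 2 ALUs per core and the unit layouts are assumptions rather than published figures, and the function names are made up.

```cpp
#include <cassert>

// Flops per cycle that each ALU must deliver to reach a peak GFlops figure.
double flops_per_cycle_per_alu(double peak_gflops, int cores,
                               double clock_ghz, int alus_per_core) {
    assert(peak_gflops > 0.0 && cores > 0 && clock_ghz > 0.0 && alus_per_core > 0);
    return peak_gflops / cores / clock_ghz / alus_per_core;
}

// Flops per cycle of one hypothetical ALU layout: some vector units whose
// lanes each do a MAD (2 flops), plus one scalar unit that does either a
// plain op (1 flop) or a MAD (2 flops).
int alu_layout_flops(int vector_units, int lanes_per_unit, bool scalar_mad) {
    assert(vector_units > 0 && lanes_per_unit > 0);
    const int vector_flops = vector_units * lanes_per_unit * 2;
    const int scalar_flops = scalar_mad ? 2 : 1;
    return vector_flops + scalar_flops;
}
```

With ARM's 68 GFlops figure, `flops_per_cycle_per_alu(68, 4, 0.5, 2)` gives 17, which matches both `alu_layout_flops(1, 8, false)` and `alu_layout_flops(2, 4, false)`. With the 72 GFlops from the Exynos 5 Dual whitepaper it gives 18, which matches `alu_layout_flops(1, 8, true)`, i.e. a scalar unit that can also do MAD.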

C++ AMP updates

1. In my previous post, I did not mention debugging. However, MS has shown some really powerful debugging features in VS2012 for C++ AMP. I have not tested them myself yet, because most of them only work on Windows 8. I am looking forward to testing them sometime in the future.

2. I had previously mentioned the lack of profiling tools. Since then, I have learnt that to profile AMP code, one can use hardware vendor’s DirectX profiling tools. However, again I don’t have first hand experience yet.

C++ AMP: First impressions

Update: Published a new blog post. Check here

Coming from an OpenCL and CUDA background, here are some thoughts on C++ AMP as experienced on Visual Studio 2012 RC, on a system with Nvidia and Intel GPUs:

The good:

1. It is extremely easy to get started. Really the “hello world” program in C++ AMP is a lot shorter than OpenCL with lots of boilerplate autogenerated behind the scenes for you, though you do have control if you do want control. I think this can be a good API for teaching GPGPU programming to beginners.

2. Integration into Visual Studio is good to have. It is C++, and well integrated into your source, and thus the usual Visual Studio goodness like IntelliSense is fully available.

3. Not tied to a single hardware vendor. The specification is also open, so can potentially be ported to other OS platforms.

4. Real world guarantees:

a) Code is compiled (mostly) statically and you get a single binary that you can distribute to client machines. OpenCL does not have a defined binary format, and even if you distribute your OpenCL source strings, it is not 100% guaranteed to compile and run properly on client machines.

b) Timeout detection and recovery: With OpenCL, if you want to distribute a kernel that can potentially run for a long time on client machines, there is a danger that it can invoke timeout detection and recovery and basically terminate your application or lead to system instability depending on OS and hardware vendor. And there is jack that your application can do about it.

With C++ AMP, that problem is not there. If TDR is invoked, you get an exception that you can catch and respond to in your application. Thus, if you make sure to catch such exceptions in your code, you get better guarantees of application stability on client machines.

The bad:

1. Profiling tools are not good compared to CUDA and OpenCL libs. For example, how do I profile coalesced loads? I don’t see such an option in Visual Studio. If I am wrong, please let me know. Given that the whole point of C++ AMP is gaining performance, one should have better profiling tools.

EDIT: See updated post

2. I cannot peek under the hood to see the code generated, and along with lack of profiling tools, makes performance tuning harder.

3. Compile times can be LONG. It seems to me that a kernel can take something like 5-20 seconds to compile for non-trivial kernels (even on a fast third-gen Intel Core i7 processor). If you have 50 kernels in your code, time can add up for each build. OpenCL and CUDA kernels typically compile much faster.