Architecture – Searching for divine code

Tegra K1 power consumption

I was reading AnandTech's review of the Nvidia Shield Tablet, in particular the power consumption data. While at first glance the GPU performance looked very impressive, the battery life data provided by the authors, Joshua Ho and Andrei Frumusanu, gives very good insights. Consider the battery life of the Tab S 8.4 (using an Exynos chipset with Mali graphics) and the Shield Tablet running GFXBench 3.0. We can get the average power consumption as (Battery energy in Wh)/(Battery life in hours). They tested the Shield Tablet in two modes: default (i.e. high performance) and capped performance. They reported observing GPU frequencies of ~750MHz and ~450MHz in the two modes respectively. The battery life for the capped mode is inferred from the graph at about 14000 seconds (3.88 hours). For a very rough comparison, we will also compare with phablets such as the Galaxy Note 3.

This gives us the following data:

1. Nvidia Shield Tablet (default, ~750MHz): 8.8W
2. Nvidia Shield Tablet (capped, ~450MHz): 5.09W
3. Tab S 8.4 (default): 5.5W
4. Galaxy Note 3: 3.1W
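
As a minimal sketch of the calculation behind the list above, the capped-mode figure can be reproduced as follows. The 19.75 Wh battery capacity is an assumption that is not stated in this post; it is simply the value consistent with 5.09W over 3.88 hours. The function name is made up for illustration.

```python
def average_power_w(battery_energy_wh, battery_life_hours):
    """Average device power = battery energy (Wh) / battery life (hours)."""
    return battery_energy_wh / battery_life_hours

# Assumed Shield Tablet battery capacity (not given in the post).
SHIELD_BATTERY_WH = 19.75

# Capped mode: about 14000 seconds of runtime read off the graph.
capped_life_hours = 14000 / 3600
capped_power_w = average_power_w(SHIELD_BATTERY_WH, capped_life_hours)

print(f"capped battery life: {capped_life_hours:.2f} h")
print(f"capped average power: {capped_power_w:.2f} W")
```

The same function applies to the other devices once their battery capacity and runtime are plugged in.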

It is immediately obvious that in the default mode, the Shield Tablet is consuming far more power than the Tab S. Given the massive drop in power consumption from reducing the GPU frequency, and the fact that the Shield Tablet gives good results in non-GPU-bound tests, it is clear that most of the ~9W of power is being consumed by the Tegra K1’s GPU.

The power consumption data is for the whole device, and hence includes components such as the screen, whose power consumption can be very different across display types and sizes. We will only make very rough calculations here. As a rough guess, let us assume that the components other than the SoC and DRAM consume ~1.5W in the tablets and ~0.8W in the phone.

We get the following (VERY ROUGH) data for SoC + DRAM power consumption:

1. Tegra K1 (default, 750MHz): 7.3W
2. Tegra K1 (capped, 450MHz): 3.6W
3. Exynos 5420 (tablet): 4W
4. Snapdragon 800 (phone): 2.3W
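
The subtraction behind these numbers can be sketched as below. It only restates the rough assumptions already made above (~1.5W for the rest of a tablet, ~0.8W for the rest of a phone); the variable names are illustrative.

```python
# Whole-device average power from the first list (W).
device_power_w = {
    "Tegra K1 (default, 750MHz)": 8.8,
    "Tegra K1 (capped, 450MHz)": 5.09,
    "Exynos 5420 (tablet)": 5.5,
    "Snapdragon 800 (phone)": 3.1,
}

# Rough guess for everything other than SoC + DRAM (screen etc.), in W.
REST_OF_TABLET_W = 1.5
REST_OF_PHONE_W = 0.8

soc_dram_power_w = {}
for name, power in device_power_w.items():
    rest = REST_OF_PHONE_W if "(phone)" in name else REST_OF_TABLET_W
    soc_dram_power_w[name] = round(power - rest, 1)

for name, power in soc_dram_power_w.items():
    print(f"{name}: {power}W")

# How power scales between the two Tegra K1 modes, versus how frequency scales.
power_ratio = (8.8 - REST_OF_TABLET_W) / (5.09 - REST_OF_TABLET_W)
freq_ratio = 750 / 450
print(f"power ratio ~{power_ratio:.2f}x for a frequency ratio ~{freq_ratio:.2f}x")
```

Under these guesses, power roughly doubles for a ~1.67x increase in GPU frequency, which is the "massive difference" referred to earlier.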

Overall, I think it is quite reasonable to state that Tegra K1’s stated GPU frequency targets of ~900MHz are not realizable in devices such as phones. I get the feeling that the Shield Tablet has been built more as a showcase device where the maximum GPU frequency has been set a bit too high in order to win benchmarks. I think if Tegra K1 ever ships in phones, it is likely that the GPU frequency will not exceed ~450MHz, and the GPU will not perform any better than its current mobile competitors. Perhaps Tegra K1 (particularly its GPU) is better suited to larger devices such as large tablets and ultraportable laptops where it can stretch its legs more.

Driver overhead matters more on SoCs

There has been a lot of discussion about driver overhead in graphics and compute APIs recently. A lot of it has been centred around desktop-type scenarios with discrete GPUs. But I just wanted to point out that driver overhead matters more on SoCs, which integrate both the CPU and the GPU on the same chip.

The simple reason is that SoCs have a fixed total power budget, and modern SoCs dynamically distribute that budget between the CPU and the GPU. If there is a lot of driver overhead, which means the CPU is doing a lot of work, then the CPU eats a bigger part of the fixed power budget and thus the SoC may be forced to reduce the GPU frequency. In addition to power, caches and memory bandwidth may also be shared.
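
To make the budget argument concrete, here is a toy sketch. Every number in it is made up purely for illustration and does not describe any real SoC; the function name is hypothetical too.

```python
# Toy model only: all wattages below are hypothetical.
def gpu_power_budget_w(total_budget_w, cpu_power_w):
    """Power left for the GPU once the CPU has taken its share of a fixed budget."""
    return max(total_budget_w - cpu_power_w, 0.0)

TOTAL_BUDGET_W = 4.0  # hypothetical fixed SoC power budget

# A lean driver keeps the CPU mostly idle; a heavy one keeps it busy.
lean_driver_gpu_w = gpu_power_budget_w(TOTAL_BUDGET_W, cpu_power_w=1.0)
heavy_driver_gpu_w = gpu_power_budget_w(TOTAL_BUDGET_W, cpu_power_w=2.0)

print(f"GPU budget with lean driver:  {lean_driver_gpu_w}W")
print(f"GPU budget with heavy driver: {heavy_driver_gpu_w}W")
```

In this toy setup, every extra watt the CPU burns on driver work is a watt the GPU cannot use, which is why the GPU may have to drop its frequency.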

I have done some benchmarking and tuning of OpenCL code for Intel’s Core chipsets, and often getting the best performance out of the GPU required being more efficient on the CPU. I am pretty sure a similar strategy is applicable on smartphone SoCs, with the added constraint that smartphone CPUs are usually wimpy due to power constraints.

Geekbench 3 IPC

Geekbench 3 is one of the better benchmarks out there for comparing mobile CPU performance. It contains a variety of tests and reports a cumulative single-core score and a multi-core score. One way of analyzing processors is to get an idea of per-cycle performance. For this, I took the single-core scores for various processors and divided by the reported clock frequency to obtain the following metric: Geekbench 3 single-core score / GHz.

I report the results in the table below. There can be many implementations of a given ARM core in different chipsets, and the same chipset can also perform slightly differently in different devices. I report the device from which I got the scores. Even so, the computations are very approximate and based on rough averages of Geekbench 3 scores from different users as reported on the Geekbench browser.

Here it is:

CPU core	Device(s)	Score/GHz
Cortex A7	Moto G	280
Scorpion	Galaxy S2 X (T989D)	250
Cortex A9	Galaxy S2 (i9100)	290
Krait 200	HTC One S, Xperia ZL	330
Krait 300	Moto X	390
Krait 400	Nexus 5, LG G2	405
Cortex A15	Nvidia Shield	480
Apple A6	iPhone 5c	540
Apple A7 (32-bit)	iPhone 5s	800
Apple A7 (64-bit)	iPhone 5s	1050

Note that for Scorpion, the reported frequency was 1.5GHz, but I have never seen it go above 1.242GHz on the devices I used previously, so I used 1.242GHz as the frequency.
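
The metric itself is a single division, but the Scorpion note shows how sensitive it is to the frequency used. In the sketch below, the raw score of 310.5 is not a measured figure: it is back-derived from the table entry (250 x 1.242) purely to illustrate the effect.

```python
def score_per_ghz(single_core_score, clock_ghz):
    """Geekbench 3 single-core score divided by clock frequency in GHz."""
    return single_core_score / clock_ghz

# Hypothetical raw score, back-derived from the table (250 * 1.242).
scorpion_score = 310.5

with_observed_clock = score_per_ghz(scorpion_score, 1.242)  # frequency I actually observed
with_reported_clock = score_per_ghz(scorpion_score, 1.5)    # frequency reported by the device

print(f"using 1.242GHz: {with_observed_clock:.0f} per GHz")
print(f"using 1.5GHz:   {with_reported_clock:.0f} per GHz")
```

Using the reported 1.5GHz instead would make Scorpion look roughly 17% worse per clock than the table entry.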

Broadcom VideoCore IV architecture overview

Broadcom has decided to open-source their graphics driver for one of their VideoCore IV powered Android chipsets. This is an awesome and welcome step. They also released an architecture manual giving details for many things. I will try and summarize some of the information known about VideoCore IV so far.

VideoCore IV refers to a family of closely related GPUs. Implementations have shown up in various chipsets: for example, the BCM2835 used in the Raspberry Pi, the BCM2763 used in several Nokia Symbian Belle handsets (e.g. Nokia 808 PureView, 701, 700 etc.), the BCM21553 in Android handsets such as the Samsung Galaxy Y, and the BCM28155 in Android handsets such as the Samsung Galaxy SII Plus.

Overview: Various chipsets have their own peculiarities. In the Raspberry Pi and Nokia flavors, the VideoCore IV consists of two distinct processors. The first processor is the actual programmable graphics core, which I will refer to as PGC. The second processor is a coprocessor. This embedded processor, not to be confused with the main CPU, runs its own operating system and handles almost all the actual work of the OpenGL driver. For example, shader compilation is done on this embedded processor and not on the main CPU in the Raspberry Pi and Nokia flavors. The OpenGL driver on these devices is just a shim that passes calls to the embedded coprocessor via an RPC-like mechanism. My speculation (low-confidence) is that the BCM21553, for which Broadcom released the source code, does not have the embedded coprocessor and the driver runs on the main CPU. The Nokia variants have an additional detail: they feature 128MB of on-package LPDDR2 memory dedicated to the GPU, separate from the 512MB RAM in these devices, to provide a high-bandwidth (at the time) graphics RAM for the GPU. The Raspberry Pi does not have this buffer and the GPU reads/writes from the main memory.

GPU core: VideoCore’s PGC is a tile-based renderer (TBR). Apart from the fixed-function parts, the programmable portion of the chip is organized into “slices”, which are similar to, say, “compute units” in GCN. Each slice consists of up to 4 SIMD units called QPUs, one special function unit (SFU), one or two texture and memory units (TMUs) as well as some caches. The architectural diagram shows up to 4 slices, but I guess the actual number may vary between chipsets (not confirmed).

QPU (SIMD ALUs): Each QPU consists of two SIMD ALUs. The ALUs are not symmetric. Each of these ALUs is physically 4-wide (i.e. 128-bit), but one of them is an “add” unit and the other is a “mul” unit; they handle floating-point add and multiply operations respectively, along with some other ops such as integer and logical ops. The QPU is a dual-issue processor, capable of issuing one add and one mul instruction per cycle, one to each unit. Logically, each ALU in the QPU is actually a 16-way machine that executes a 16-way instruction in 4 cycles. Thus, overall, each QPU can perform 8 flops/cycle (2 ALUs x 4 lanes), and each slice can do up to 32 flops/cycle. Each QPU has access to 4kB of registers, as well as a few accumulators. The registers are organized as two register files of 2kB each. Each register file is organized as 32 vector registers, where each vector register is 64 bytes (16 x 4 bytes), which makes sense given the 16-way logical view of the QPU. Each QPU can run two threads.
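
The arithmetic in the paragraph above can be checked with a few lines. This is only a restatement of the figures quoted in the text, and the constant names are mine rather than Broadcom's.

```python
# Figures quoted in the text above.
ALUS_PER_QPU = 2             # one "add" ALU and one "mul" ALU
PHYSICAL_LANES_PER_ALU = 4   # each ALU is physically 4-wide
LOGICAL_WIDTH = 16           # each instruction is logically 16-way
QPUS_PER_SLICE = 4           # "up to 4" QPUs per slice
REGISTER_FILES_PER_QPU = 2
VECTOR_REGISTERS_PER_FILE = 32
BYTES_PER_ELEMENT = 4        # 32-bit elements

cycles_per_instruction = LOGICAL_WIDTH // PHYSICAL_LANES_PER_ALU
flops_per_cycle_per_qpu = ALUS_PER_QPU * PHYSICAL_LANES_PER_ALU
flops_per_cycle_per_slice = flops_per_cycle_per_qpu * QPUS_PER_SLICE

vector_register_bytes = LOGICAL_WIDTH * BYTES_PER_ELEMENT
register_file_bytes = VECTOR_REGISTERS_PER_FILE * vector_register_bytes
registers_per_qpu_bytes = REGISTER_FILES_PER_QPU * register_file_bytes

print(cycles_per_instruction, "cycles per 16-way instruction")
print(flops_per_cycle_per_qpu, "flops/cycle per QPU")
print(flops_per_cycle_per_slice, "flops/cycle per slice")
print(registers_per_qpu_bytes, "bytes of registers per QPU")
```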

Memory (TMUs and VPM): TMUs have their own L1 cache, and there is also a separate L2 cache that is shared across slices. Cache sizes are unknown. QPUs read/write vertex data through a separate path called the Vertex Pipe Memory (VPM). The VPM is a system-wide shared unit and appears to have a buffer of either 8kB or 16kB. The VPM performs DMA from main memory to read/write vertex data into the buffer. It is optimized essentially for reading/writing vectors of data from/to main memory and from/to the QPUs’ vector register files. Vertex fetch is general enough to implement memory gather operations, but it is not clear if scatter is also supported.

RPi and Conclusions: Consider the Raspberry Pi. We already know that the published frequency is 250MHz, that the QPUs can do 24 GFLOPS and that the TMUs can do 1.5 GTexel/s. Thus, per clock, the GPU performs 96 flops/cycle and 6 texels/cycle. Likely, this is achieved through 3 slices, each with 4 QPUs and 2 TMUs. Overall, VideoCore IV is an interesting architecture. Performance-wise, the implementation in the Raspberry Pi does not compare to modern mobile GPUs such as the Adreno 330 or the Mali T600 series, but then again the Raspberry Pi is using an old SoC that was meant to be cost-conscious even at that time. For a low-cost GPU, VideoCore IV looks to be quite competent. It will be interesting to see what Broadcom is cooking up for VideoCore V.
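
The slice count guessed above falls out of the published figures as follows. This is a sketch of my inference, not a confirmed configuration; it assumes the 32 flops/cycle per slice worked out earlier and that the TMUs are spread evenly across slices.

```python
# Published Raspberry Pi figures quoted in the text.
CLOCK_HZ = 250e6           # 250MHz
PEAK_FLOPS = 24e9          # 24 GFLOPS
PEAK_TEXELS_PER_S = 1.5e9  # 1.5 GTexel/s

# From the QPU description: 4 QPUs x 8 flops/cycle per slice.
FLOPS_PER_CYCLE_PER_SLICE = 32

flops_per_cycle = PEAK_FLOPS / CLOCK_HZ
texels_per_cycle = PEAK_TEXELS_PER_S / CLOCK_HZ

# Inferred, not confirmed: number of slices and TMUs per slice,
# assuming each TMU delivers one texel per cycle.
inferred_slices = flops_per_cycle / FLOPS_PER_CYCLE_PER_SLICE
inferred_tmus_per_slice = texels_per_cycle / inferred_slices

print(flops_per_cycle, "flops/cycle")
print(texels_per_cycle, "texels/cycle")
print(inferred_slices, "slices,", inferred_tmus_per_slice, "TMUs per slice")
```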

Testing write bandwidth to regular, write-combined and uncached memory

Write combining is a technique where writes may get buffered into a temporary buffer, and then written to memory in a single largish transaction. This can apparently give a nice boost to write bandwidth. Write-combined memory is not cached so reads from write-combined memory are still very slow. I came upon the concept of write-combining while looking at data transfer from CPU to GPU on AMD’s APUs. It turns out that if you use the appropriate OpenCL flags (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY) while creating a GPU buffer on an AMD APU, then AMD’s driver exposes these buffers as write-combined memory on the CPU. AMD claims that you can write to these buffers at pretty high speeds and thus this can act as a fast-path for CPU to GPU data copies. In addition to regular and write-combined memory, there is also a third type: uncached memory without write-combining.

I wanted to understand the characteristics of write-combined memory as well as uncached memory compared with “regular” memory allocated using, say, the default “new” operator. On Windows, we can allocate write-combined or uncached memory using the VirtualAlloc function by passing the flags PAGE_WRITECOMBINE and PAGE_NOCACHE respectively. So I wrote a simple test. The code is open-source and can be found here.

For each memory type (regular, write-combined and uncached), we run the following test. The test allocates a buffer, then copies data from a regular CPU array to the buffer and measures the time. We do the copy (to the same buffer) multiple times, measure the time of each copy, and report the timing and bandwidth of the first run as well as the average of the subsequent runs. The first-run timings give us an idea of the overhead of first use, which can be substantial. For bandwidth, if I am copying N bytes of data, then I report bandwidth computed as N/(time taken). Some people prefer to report bandwidth as 2*N/(time taken) because they count both the read and the write, so that’s something to keep in mind.
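
The measurement procedure described above can be sketched roughly as below. This is not the actual test code: it is a Python illustration that only covers regular memory (allocating write-combined or uncached memory needs OS-specific calls such as VirtualAlloc), and the function and key names are made up.

```python
import time

def measure_copy_bandwidth(n_bytes, runs=5):
    """Copy n_bytes into the same destination buffer `runs` times.

    Reports the first copy separately from the average of the rest,
    with bandwidth computed as N / (time taken).
    """
    if runs < 2:
        raise ValueError("need at least 2 runs to separate first use from the rest")
    src = bytearray(n_bytes)  # regular source array
    dst = bytearray(n_bytes)  # destination buffer, reused for every copy
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        dst[:] = src          # in-place copy into the same buffer
        timings.append(time.perf_counter() - start)
    first = timings[0]
    avg_rest = sum(timings[1:]) / (runs - 1)
    return {
        "first_run_s": first,
        "avg_subsequent_s": avg_rest,
        "first_bandwidth_bytes_per_s": n_bytes / first,
        "avg_bandwidth_bytes_per_s": n_bytes / avg_rest,
    }

result = measure_copy_bandwidth(4 * 1024 * 1024, runs=5)
print(f"first run: {result['first_run_s'] * 1e3:.3f} ms")
print(f"average of later runs: {result['avg_bandwidth_bytes_per_s'] / 1e9:.2f} GB/s")
```

To report in the 2*N convention instead, double the bandwidth figures.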

I ran the test on a laptop with an AMD A10-5750M (Richland), 8GB of 1600MHz DDR3, Windows 8.1 x64, VS 2013 x64.

For “double” datatype arrays (size ~32MB), the average bandwidth was 3.8GB/s for regular memory, 5.7GB/s for write-combined memory and 0.33GB/s for uncached memory. The bandwidth reported here is the average over runs not including the first run. The first-use penalty was found to be substantial: the first run took about 22ms for regular memory, 81ms for write-combined memory and 164ms for uncached memory. Clearly, if you are only transferring the data once, then write-combined memory is not the best solution. In this case, you need around 20 runs for write-combined memory to break even in terms of total data copy time. But if you are going to be reusing the buffer many times, then write-combined memory is a definite win.
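
The break-even estimate can be reproduced from the numbers above. The sketch assumes the buffer is exactly 32 x 10^6 bytes and that 1GB/s means 10^9 bytes/s; since the text only says "~32MB", the exact count shifts by a run or two under other assumptions.

```python
import math

def break_even_runs(first_a_s, steady_a_s, first_b_s, steady_b_s):
    """Smallest number of copies n at which option B's total time <= option A's.

    Total time for n copies is modelled as first + (n - 1) * steady.
    Returns None if B never catches up.
    """
    saving_per_run = steady_a_s - steady_b_s
    extra_first = first_b_s - first_a_s
    if extra_first <= 0:
        return 1
    if saving_per_run <= 0:
        return None
    return 1 + math.ceil(extra_first / saving_per_run)

N_BYTES = 32e6                       # assumed exact size of the "~32MB" buffer
regular_steady_s = N_BYTES / 3.8e9   # 3.8GB/s
wc_steady_s = N_BYTES / 5.7e9        # 5.7GB/s
uncached_steady_s = N_BYTES / 0.33e9 # 0.33GB/s

wc_break_even = break_even_runs(0.022, regular_steady_s, 0.081, wc_steady_s)
uncached_break_even = break_even_runs(0.022, regular_steady_s, 0.164, uncached_steady_s)

print("write-combined breaks even after", wc_break_even, "copies")
print("uncached breaks even after", uncached_break_even, "copies")
```

Under these assumptions write-combined memory catches up in the low twenties of copies, in line with the rough figure of 20, while uncached memory never does because it is slower on both the first run and the later ones.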

Wrote about FP performance on ARM CPUs at Anandtech

Wrote an article about floating point instruction throughputs on Anandtech.
Go read here. Feedback welcome 🙂