Khronos standards and “caps”

UPDATE: This post is just plain wrong. See correction HERE. Thanks to various people on twitter, especially Graham Sellers and Daniel Koch for pointing this out.

Just venting some frustration here. One of the annoying things in Khronos standards is the lack of required minimum capabilities, which makes writing portable code that much harder.  The minimum guarantees are very lax.  Just as an example, take work-group sizes in both OpenCL and OpenGL compute shaders. In both of these, you have to query to find out the maximum work group size supported which may turn out to be just 1.

Similarly, in OpenGL (and ES) compute shaders, there is no minimum guaranteed amount of local memory per workgroup. You have to query to ask how much shared memory per workgroup is supported, and the implementation can just say zero because there is no minimum mandated in the specification.

edit: Contrast this with DirectCompute where you have mandated specifications for both the amount of local memory and the work-group sizes which makes life so much simpler.

Specifying OpenGL version to Qt Quick 2

UPDATE (24th May, 10.20pm EST): Corrected some errors below.

I was looking at how to integrate custom OpenGL content inside a Qt Quick 2 application. Qt Quick 2 has several solutions, and settled on a custom OpenGL underlay. However, one stumbling block was that I wanted to use OpenGL 4.3 core profile. On my test machine, the driver supports OpenGL 4.4 and I discovered that by default Qt Quick 2 created a 4.4 compatibility profile context.

After much searching, I finally found the solution I was looking for. I am not sure if it is the only way or the optimal way, but it worked for me. You can specify the OpenGL propeties including the desired profile by creating a QSurfaceFormat object and passing it to QQuickWindow object using setFormat before Qt Quick’s scene graph begins rendering, i.e. before you invoke QQuickWindow::show either in C++, or before you set “visible” property of the window to true in QML.

The next question was, how to get to the window. Here is how to do so in some scenarios:

  •  If you have a handle to any QQuickItem inside the window, then simply call the “window” method.
  • If you have created your application using Qt Quick Controls ApplicationWindow, and are using QQmlApplicationEngine to load the QML, make sure “visible” property of the ApplicationWindow is set to false in the QML. Then, simply get the first root object from the engine using something like engine->rootObjects().first(). This is your window object and you can simply cast it to QQuickWindow.
  • If you have created your application using QQuickView, well then your are in luck because QQuickView is a QQuickWindow so just  call setFormat on the QQuickView.

Once you have done setFormat, you are then free to call “show” on the QQuickWindow and it will make your window visible.

 

Geekbench 3 IPC

Geekbench 3 is one of the better benchmarks out there for comparing mobile CPU performance. It contains a variety of tests and reports a cumulated single-core score, and a multi-core score. One way of analyzing processors is to get an idea of per-cycle performance. For this, I took the single-core scores for various processors and divided by the reported clock frequency to obtain the following metric: Geekbench 3 Single-core score/ GHz.

I report the results in the table below. There can be many implementations of a given ARM core in different chipsets, and the same chipset can also perform slightly differently in different devices. I report the device from which I got the scores. Even so, computations are very approximate and based on rough averages of geekbench 3 scores from different users as reported on the Geekbench browser.

Here it is:

CPU core Device(s) Score/GHz
Cortex A7 Moto G 280
Scorpion Galaxy S2 X (T989D) 250
Cortex A9 Galaxy S2 (i9100) 290
Krait 200 HTC One S, Xperia ZL 330
Krait 300 Moto X 390
Krait 400 Nexus 5, LG G2 405
Cortex A15 Nvidia Shield 480
Apple A6 iPhone 5c 540
Apple A7 (32-bit) iPhone 5s 800
Apple A7 (64-bit) iPhone 5s 1050

Note that for Scorpion, reported frequency was 1.5GHz but I have never seen it go above 1.242 GHz on some devices I used previously so I used 1.242GHz as the frequency.

Qt Creator and VS 2013 (Express)

I had previously noted that I installed Qt Creator 3.0 and it automatically detected compiler toolchains, CMake etc. At that time I was running VS 2010 Professional. I recently moved to VS 2013 Express. I opened up a new CMake project in Qt Creator and noticed that suddenly it no longer gave me the option for the proper NMake generator and started failing. After much mucking about, I have finally discovered the solution and just documenting it here in case someone else comes across the same issue.

The solution is simple.  Simply install the newly released Qt Creator 3.1. Qt Creator 3.1 has inbuilt support for detecting and using VS 2013 compilers with CMake. Qt Creator 3.0 detects the compilers succesfully but was not really passing the right arguments to CMake or something.  Anyway, after updating to 3.1,  all my CMake projects work now.

 

Number of Linux desktop users

Just a random thought. I was wondering how many active Linux desktop users are there.  Turns out, we can do simple back-of-envelope calculation for this. Some estimates put the number of active desktops at around 1.4 billion. Linux desktop marketshare (from various stats such as netmarketshare and steam) appears to be around 1.5%. That leads us to around 21m active Linux desktop users. 

Broadcom VideoCore IV architecture overview

Broadcom has decided to open-source their graphics driver for one of their VideoCore IV powered Android chipsets. This is an awesome and welcome step. They also released an architecture manual giving details for many things. I will try and summarize some of the information known about VideoCore IV so far.

VideoCore IV refers to a family of closely-related GPUs. Implementations have  shown up in various chipsets. For example, BCM2835 used in Raspberry Pi,  BCM2763 used in several Nokia Symbian Belle handsets (eg: Nokia Pureview 808, 701,700 etc), BCM21553 in Android handsets such as Samsung Galaxy Y and and BCM28155 in Android handsets such as Samsung Galaxy  SII Plus.

Overview: Various chipsets have their own peculiarities. In the Raspberry Pi and Nokia flavors, the VideoCore IV consists of two distinct processors. The first processor is the actual programmable graphics core, which I will refer to as PGC. The second processor is a coprocessor.  This embedded processor, not to be confused with the main CPU, runs its own operating system and handles almost all the actual work of the OpenGL driver. For example, shader compilation is done on this embedded processor and  not on the main CPU in the Raspberry Pi and Nokia flavors. The OpenGL driver on these devices just is a shim that passes calls to the embedded coprocessor via RPC-like mechanism.  My speculation (low-confidence) is that the BCM21553, for which Broadcom released the source code, does not have the embedded coprocessor and the driver runs on the main CPU.  The Nokia variants have an additional detail that these feature an 128MB LPDDR2 on-package memory dedicated for GPU, separate from the 512MB RAM in these devices, to provide a high-bandwidth (at the time) graphics RAM for the GPU. Raspberry Pi does not have this buffer and the GPU reads/writes from the main memory.

GPU core: VideoCore’s PGC is a tile-based renderer (TBR). Apart from fixed function parts, the programmable portion of the chip is organized into “slices”, which are similar to say “compute units” in GCN. Each slice consists of upto 4 SIMD units called QPUs, one special function unit (SFU),one or two texture and memory units (TMUs) as well as some caches.  The architectural diagram shows upto 4 slices, but I guess the actual number may vary between chipsets (not confirmed).

QPU (SIMD ALUs): QPU consists of two  SIMD ALUs. The ALUs are not symmetric. Each of these ALUs is physically 4-wide (i.e. 128-bit), but one of them is an “add” unit and the other is a “mul” unit, and handle add and multiply floating-point operations respectively along with some other ops such as integer and logical ops. The QPU is a dual-issue processor, capable of feeding one add and one mul instruction per cycle to each of the units. Logically, each ALU in the QPU is actually a 16-way machine that executes a 16-way instruction in 4 cycles. Thus, overall, each QPU can perform 8 flops/cycle.  Thus, each slice can do upto 32 flops/cycle.  Each QPU has access to a 4kB of registers, as well as a few accumulators. Registers are organized as two register files of 2kB each. Each register file is organized as 32 vector registers, where each vector register is 64 bytes (16 x 4bytes) which makes sense given the 16-way logical view of the QPU.  Each QPU can run two threads.

Memory (TMUs and VPM):  TMUs have their own L1 cache, and there is also a separate L2 cache that is shared across slices. Cache sizes are unknown. QPUs read/write vertex data through a separate path called the Vertex Pipe Manager (VPM). VPM is a system-wide shared unit and appears to have a buffer of either 8kB or 16kB. VPM performs DMA from main memory to read/write vertex data into the buffer. VPM is optimized essentially for reading/writing vectors of data from/to main memory and from/to the QPUs vector register files.  Vertex fetch is general enough to implement memory gather operations, but it is not clear if scatter is also supported.

RPi and Conclusions: Consider the Raspberry Pi.  We already know that the published frequency is 250MHz and that the QPUs can do 24 gflops and the TMUs can do 1.5 GTexel/s.  Thus, per-clock,  the GPU performs 96 flops/cycle and 6 texels/cycle.  Likely, this is achieved through 3 slices each with 4 QPUs and 2 TMUs. Overall, VideoCore IV is an interesting architecture. Performance-wise, the implementation in the Raspberry Pi does not compare to modern mobile GPUs such as Adreno 330 or Mali T600 series but then again the Raspberry Pi is using an old SoC that was meant to be cost-conscious even at that time.  For a low-cost GPU, VideoCore IV looks to be quite competent. It will be interesting to see what Broadcom is cooking up for VideoCore V.