GPU – Page 3 – Searching for divine code

Khronos standards and “caps”

UPDATE: This post is just plain wrong. See correction HERE. Thanks to various people on twitter, especially Graham Sellers and Daniel Koch for pointing this out.

Just venting some frustration here. One of the annoying things in Khronos standards is the lack of required minimum capabilities, which makes writing portable code that much harder. The minimum guarantees are very lax. Just as an example, take work-group sizes in both OpenCL and OpenGL compute shaders. In both of these, you have to query to find out the maximum work group size supported which may turn out to be just 1.

Similarly, in OpenGL (and ES) compute shaders, there is no minimum guaranteed amount of local memory per workgroup. You have to query to ask how much shared memory per workgroup is supported, and the implementation can just say zero because there is no minimum mandated in the specification.

edit: Contrast this with DirectCompute where you have mandated specifications for both the amount of local memory and the work-group sizes which makes life so much simpler.

Specifying OpenGL version to Qt Quick 2

UPDATE (24th May, 10.20pm EST): Corrected some errors below.

I was looking at how to integrate custom OpenGL content inside a Qt Quick 2 application. Qt Quick 2 has several solutions, and settled on a custom OpenGL underlay. However, one stumbling block was that I wanted to use OpenGL 4.3 core profile. On my test machine, the driver supports OpenGL 4.4 and I discovered that by default Qt Quick 2 created a 4.4 compatibility profile context.

After much searching, I finally found the solution I was looking for. I am not sure if it is the only way or the optimal way, but it worked for me. You can specify the OpenGL propeties including the desired profile by creating a QSurfaceFormat object and passing it to QQuickWindow object using setFormat before Qt Quick’s scene graph begins rendering, i.e. before you invoke QQuickWindow::show either in C++, or before you set “visible” property of the window to true in QML.

The next question was, how to get to the window. Here is how to do so in some scenarios:

If you have a handle to any QQuickItem inside the window, then simply call the “window” method.
If you have created your application using Qt Quick Controls ApplicationWindow, and are using QQmlApplicationEngine to load the QML, make sure “visible” property of the ApplicationWindow is set to false in the QML. Then, simply get the first root object from the engine using something like engine->rootObjects().first(). This is your window object and you can simply cast it to QQuickWindow.
If you have created your application using QQuickView, well then your are in luck because QQuickView is a QQuickWindow so just call setFormat on the QQuickView.

Once you have done setFormat, you are then free to call “show” on the QQuickWindow and it will make your window visible.

Geekbench 3 IPC

Geekbench 3 is one of the better benchmarks out there for comparing mobile CPU performance. It contains a variety of tests and reports a cumulated single-core score, and a multi-core score. One way of analyzing processors is to get an idea of per-cycle performance. For this, I took the single-core scores for various processors and divided by the reported clock frequency to obtain the following metric: Geekbench 3 Single-core score/ GHz.

I report the results in the table below. There can be many implementations of a given ARM core in different chipsets, and the same chipset can also perform slightly differently in different devices. I report the device from which I got the scores. Even so, computations are very approximate and based on rough averages of geekbench 3 scores from different users as reported on the Geekbench browser.

Here it is:

CPU core	Device(s)	Score/GHz
Cortex A7	Moto G	280
Scorpion	Galaxy S2 X (T989D)	250
Cortex A9	Galaxy S2 (i9100)	290
Krait 200	HTC One S, Xperia ZL	330
Krait 300	Moto X	390
Krait 400	Nexus 5, LG G2	405
Cortex A15	Nvidia Shield	480
Apple A6	iPhone 5c	540
Apple A7 (32-bit)	iPhone 5s	800
Apple A7 (64-bit)	iPhone 5s	1050

Note that for Scorpion, reported frequency was 1.5GHz but I have never seen it go above 1.242 GHz on some devices I used previously so I used 1.242GHz as the frequency.

Qt Creator and VS 2013 (Express)

I had previously noted that I installed Qt Creator 3.0 and it automatically detected compiler toolchains, CMake etc. At that time I was running VS 2010 Professional. I recently moved to VS 2013 Express. I opened up a new CMake project in Qt Creator and noticed that suddenly it no longer gave me the option for the proper NMake generator and started failing. After much mucking about, I have finally discovered the solution and just documenting it here in case someone else comes across the same issue.

The solution is simple. Simply install the newly released Qt Creator 3.1. Qt Creator 3.1 has inbuilt support for detecting and using VS 2013 compilers with CMake. Qt Creator 3.0 detects the compilers succesfully but was not really passing the right arguments to CMake or something. Anyway, after updating to 3.1, all my CMake projects work now.

Number of Linux desktop users

Just a random thought. I was wondering how many active Linux desktop users are there. Turns out, we can do simple back-of-envelope calculation for this. Some estimates put the number of active desktops at around 1.4 billion. Linux desktop marketshare (from various stats such as netmarketshare and steam) appears to be around 1.5%. That leads us to around 21m active Linux desktop users.

Using Qualcomm Adreno Profiler on Linux, particularly for OpenCL

Qualcomm has updated their Adreno Profiler and now it can work on Linux. To clarify, the setup is that you are running Linux on development PC, and running Android on the device being profiled. However, when I downloaded the tar file containing the Adreno profiler for Linux, it only contained “exe” files so initially I was confused. However, looking at Qualcomm developer forum threads, someone posted that it appears to be working with Mono.

I have tried the following steps, and it seems to work, at least for profiling OpenCL. I have used this for profiling OpenCL apps running on IFC6410 development board (Snapdragon 600) running Android 4.2.2. I have not tried OpenGL apps yet. Please feel free to report your experiences with OpenGL profiling. Steps were as follows

1. Install mono and associated Mono libraries, especially “core 4.0” and winforms libraries for Mono.

2. Connect your device over USB and make sure it is listed when you do “adb devices”.

3. Set some environment parameters using “adb shell setprop” commands for Adreno profiling. See documentation accompanying Adreno Profiler for exact settings.

4. Start Adreno Profiler using “mono AdrenoProfiler.exe”.

5. Start app on device using adb shell etc. It will block/hang, looking to connect to Adreno profiler.

6. In Adreno Profiler, hit “connect” and connect to your device. App will still remain frozen. Now start “Scrubber CL” and hit the “record” (red) button. App will resume.

7. Once app finishes, examine the data.

Overall, Adreno Profiler offers some really helpful data metrics for OpenCL such as timeline of various threads, GPU queues etc which is similar to say VTune. However, I have not yet found the way to actually view detailed hardware counter data such as cache misses that the documentation says I should be able to view.

Also, while Adreno Profiler offers some static kernel analysis metrics about number of ALU instructions, MOVs, NOPs etc in compiled code, I would really like to just view assembly generated because the instruction metrics shown do not account for what is inside loops vs outside making them less useful in kernels involving loops.