Testing write bandwidth to regular, write-combined and uncached memory

Write combining is a technique where writes are buffered into a temporary buffer and then written to memory in a single, fairly large transaction. This can apparently give a nice boost to write bandwidth. Write-combined memory is not cached, so reads from it are still very slow. I came upon the concept of write combining while looking at data transfer from CPU to GPU on AMD's APUs. It turns out that if you create a GPU buffer on an AMD APU with the appropriate OpenCL flags (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY), AMD's driver exposes that buffer to the CPU as write-combined memory. AMD claims that you can write to these buffers at pretty high speeds, so this can act as a fast path for CPU-to-GPU data copies. In addition to regular and write-combined memory, there is also a third type: uncached memory without write combining.
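For reference, here is a minimal sketch of how such a buffer might be created (the helper function name is mine; context/device setup and error handling are omitted):

```cpp
#include <CL/cl.h>

// Minimal sketch: create a buffer that, on an AMD APU, the driver can expose
// to the CPU as write-combined memory.
// Assumes "context" is an already-created cl_context for the APU's GPU device.
cl_mem create_gpu_upload_buffer(cl_context context, size_t num_bytes)
{
    cl_int err = CL_SUCCESS;
    cl_mem buffer = clCreateBuffer(context,
                                   CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                                   num_bytes,
                                   nullptr,   // let the driver allocate the host memory
                                   &err);
    return (err == CL_SUCCESS) ? buffer : nullptr;
}
```

The CPU side would then typically map the buffer (for example with clEnqueueMapBuffer) to obtain a pointer it can write into.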

I wanted to understand the characteristics of write-combined and uncached memory compared with “regular” memory allocated using, say, the default “new” operator. On Windows, we can allocate write-combined or uncached memory using the VirtualAlloc function by passing the PAGE_WRITECOMBINE or PAGE_NOCACHE flag respectively (in addition to the usual protection flags such as PAGE_READWRITE). So I wrote a simple test. The code is open-source and can be found here.
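As a rough sketch of the allocation side (this helper is illustrative, not the exact code from the test):

```cpp
#include <windows.h>

// Allocate 'num_bytes' of committed, read-write memory with the requested
// caching behavior: pass 0 for regular memory, PAGE_WRITECOMBINE for
// write-combined memory, or PAGE_NOCACHE for uncached memory.
void* alloc_test_buffer(size_t num_bytes, DWORD cache_flag)
{
    return VirtualAlloc(nullptr, num_bytes,
                        MEM_COMMIT | MEM_RESERVE,
                        PAGE_READWRITE | cache_flag);
}

// Usage:
//   void* wc = alloc_test_buffer(size, PAGE_WRITECOMBINE);
//   ... run the copy test ...
//   VirtualFree(wc, 0, MEM_RELEASE);
```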

For each memory type (regular, write-combined and uncached), we run the following test. The test allocates a buffer of the given type, then copies data from a regular CPU array into it and measures the time. The copy (to the same buffer) is repeated multiple times; we time each copy and report the timing and bandwidth of the first run as well as the average of the subsequent runs. The first-run timing gives us an idea of the overhead of first use, which can be substantial. For bandwidth, if I am copying N bytes of data, I report bandwidth computed as N/(time taken). Some people prefer to report bandwidth as 2*N/(time taken) because they count both the read and the write, so that's something to keep in mind.
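In outline, the per-buffer measurement looks roughly like this (a simplified sketch, not the exact code from the repository):

```cpp
#include <chrono>
#include <cstring>
#include <vector>

// Copy the same source array into 'dst' several times and return the
// bandwidth of each run in GB/s, computed as N / (time taken).
std::vector<double> measure_copy_bandwidth(void* dst, const void* src,
                                           size_t num_bytes, int num_runs)
{
    std::vector<double> gbps;
    for (int run = 0; run < num_runs; ++run) {
        auto start = std::chrono::high_resolution_clock::now();
        std::memcpy(dst, src, num_bytes);
        auto stop = std::chrono::high_resolution_clock::now();
        double seconds = std::chrono::duration<double>(stop - start).count();
        gbps.push_back(static_cast<double>(num_bytes) / seconds / 1.0e9);
    }
    return gbps;  // gbps[0] is the first-use run; average the rest separately
}
```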

I ran the test on a laptop with an AMD A10-5750M (Richland), 8GB of 1600MHz DDR3, Windows 8.1 x64 and VS 2013 x64.

For “double” arrays (size ~32MB), the average bandwidth was 3.8GB/s for regular memory, 5.7GB/s for write-combined and 0.33GB/s for uncached memory. The bandwidth reported here is the average over runs excluding the first run. The first-use penalty turned out to be substantial: the first run took about 22ms for regular memory, 81ms for write-combined and 164ms for uncached memory. Clearly, if you are only transferring the data once, write-combined memory is not the best solution. In this case, you need around 20 runs for write-combined memory to break even in terms of total data-copy time. But if you are going to be reusing the buffer many times, then write-combined memory is a definite win.
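For a rough sense of where that break-even figure comes from: at ~32MB per copy, a regular-memory copy averages about 32MB / 3.8GB/s ≈ 8.4ms while a write-combined copy averages about 32MB / 5.7GB/s ≈ 5.6ms, a saving of roughly 2.8ms per run; the extra first-use cost is about 81ms − 22ms ≈ 59ms, and 59 / 2.8 ≈ 21 runs.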

AMD Jaguar vs AMD Llano (K10) at same clocks

AMD recently launched their Kabini APUs, and the A4-5000 has been reviewed by a number of websites. However, I haven't been able to find a review that compares it to other x86 architectures at the same clocks. I happen to have a notebook with an AMD Llano-based A8-3500M in it. This is a quad-core unit @ 1.5GHz, the same clock as the A4-5000. The A8-3500M also has a boost mode that takes it to over 2GHz for some single-threaded workloads, but I disabled that boost mode for this test. My Llano test system is running Win7 Home Premium (64-bit) and has 8GB of dual-channel RAM installed. I don't have an A4-5000 to test, so I referenced Anandtech for Cinebench 11.5 (64-bit multithreaded) and Mozilla Kraken, PCPer for Euler3D and Techreport for 7-zip.

Test results:

Test                 Jaguar (A4-5000)   Llano K10 (A8-3500M)   Jaguar as % of K10
Euler3D              0.971 Hz           1.41 Hz                68.9%
Cinebench 11.5       1.5                1.9                    78.9%
Mozilla Kraken (ms)  6512.7             5883                   90%
7-zip (compress)     3793 MIPS          5317 MIPS              71.3%
7-zip (decompress)   5397 MIPS          6152 MIPS              87.7%

(Kraken reports a time in ms, so lower is better; the last column is Jaguar performance relative to K10 in all rows.)

For 7-zip, I just ran the built-in benchmark as "7z.exe b 3 -mmt4 -mmd25", since the default number of passes takes way too long, but it should give us a decent idea 🙂

Overall, we are looking at between 70% and 90% of K10 performance at the same clocks in these tests, which is quite good for a small dual-issue core. We don't have any data on how the benchmarks scale with clock speed, though. Hopefully that will become clearer when reviewers get their hands on the A6-5200, which has Jaguar cores clocked @ 2GHz.

Texas Instruments Keystone II: HPC perspective

Texas Instruments recently announced their Keystone II chips. Essentially, these combine multiple Cortex A15 cores with DSPs on a single chip. The number of ARM cores and the DSP configuration vary depending on the SKU. Here I focus on the top-end SKU, the 66AK2H12.

The chip integrates the following:

  • 4 Cortex A15 cores @ 1.4 GHz giving 44.8 GFlops SP (NEON), 22.4 GFlops SP (IEEE-754), 11.2 GFlops DP
  • 8 C66-family DSPs @ 1.2 GHz giving 153.6 GFlops SP, 57.6 GFlops DP? (a rough sanity check of these peak numbers follows the list)
  • DDR3 memory controller, 2×64-bit, up to 1600 MHz, giving 25.6 GB/s of bandwidth
  • ARM cores: 4×32 kB L1 data cache, 4×32 kB L1 instruction cache, 4 MB L2 cache shared across the cores
  • DSP cores: 8×32 kB L1 data cache, 8×32 kB L1 instruction cache, 1 MB L2 cache per DSP = 8 MB total
  • 6 MB of cache (separate from the L2 caches) shared by the DSPs and ARM cores
  • Up to 14 W power consumption
  • OpenMP programming tools; an alpha version of an OpenCL driver is also available
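As a quick sanity check on the peak flops above, working backwards from TI's quoted numbers (so the per-cycle figures here are inferred, not taken from documentation): 4 A15 cores × 1.4 GHz × 8 SP flops/cycle (NEON) = 44.8 GFlops, with 4 flops/cycle on the IEEE-754 path giving 22.4 GFlops and 2 DP flops/cycle giving 11.2 GFlops; on the DSP side, 8 DSPs × 1.2 GHz × 16 SP flops/cycle = 153.6 GFlops and × 6 DP flops/cycle = 57.6 GFlops.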

You should not think of this chip as a GPU-like accelerator. It is intended to be a standalone solution, with the 4 general-purpose ARM cores capable of running any regular ARM application, including a full Linux OS. Certain parts of your application can be offloaded to the DSPs, or the DSPs can be used in concert with the ARM cores. The DSPs themselves have a fairly flexible instruction set, and my understanding is that you can do function calls, recursion etc. without issue (correct me if I am wrong, I will confirm from the documentation). The DSPs and the ARM cores read and write the same memory, eliminating the data-copy bottleneck that exists on many PCIe accelerator-type solutions.

The base specifications are looking really good. The perf/W looks to be competitive with GPU-based solutions. The low power consumption means that it can be used in many applications where big, power-hungry solutions (such as Teslas or Xeon Phis) are not applicable. The shared memory model is also very enticing, including for, say, supercomputing use cases.

TI have a good solution on their hands and should push more aggressively into the HPC space. They should put money into getting libraries like an optimized BLAS, along with say OpenCV, tuned for the system. TI should also invest in developing good compilers, debuggers and profilers. They should particularly continue to invest in standards-based solutions like OpenMP and OpenCL. As a newcomer and a smaller player, they cannot afford to introduce yet another proprietary solution.

They also need to gain some mindshare as well as marketshare. To gain mindshare, they should make ALL of this available in a nicely packaged fashion with good, descriptive documentation and webpages. They should also make low-cost boards available to really gain some marketshare. People underestimate how convenient Nvidia makes getting and using their CUDA tools. I can just buy a cheap Nvidia card for a desktop (or a decent laptop), download the CUDA SDK for free without any agreements, and off I go. Everything is packaged nicely, is easy to find and comes with good documentation. Capturing mindshare IS important, and TI should learn those lessons from Nvidia.

I do wish TI all the best in the HPC field. They have built some solid and interesting technology, and the economics also potentially work out, as their DSP technology investments can be leveraged across multiple product lines, much like how Nvidia is able to use the same designs for both HPC and consumer products. If they invest in building a good software ecosystem around their products, they can certainly compete in this space.

If anyone from TI is reading this, I would love to port all of my software (such as my Python compilers and numerical libraries, see here and here) to your hardware, so please let me know who I can contact 🙂

Intel Xeon Phi announcement and summary

Intel had announced the Xeon Phi branding and basic architecture long ago, but we finally have details and pricing. Xeon Phi is essentially a 62-core x86 chip; different SKUs will have different numbers of cores enabled and different clock speeds. TDPs and rough performance numbers look competitive with offerings such as Nvidia Tesla, but the Xeon Phi offers higher programmability and potentially better efficiency on some workloads. The chip sits on a PCIe board and can either be used to offload parts of your program or to run the whole program. The board offers a number of programming interfaces, such as OpenMP, that are a lot more convenient than writing, say, CUDA code. Compared to GPUs, it should be relatively easy to get your application up and running on a Xeon Phi, though optimization will still require some effort.

However, I am happy to report that OpenCL is fully supported, so porting code from GPUs to Xeon Phi is still easy. Kudos to Intel for getting behind OpenCL and actually delivering fully working products.

Each core is an in-order, dual-issue x86 core with SMT (4 threads), backed by a 512-bit vector unit capable of FMA operations. Each vector unit can do 8 fp64 FMAs (16 flops) or 16 fp32 FMAs (32 flops) per cycle. While there is no SSE or AVX available on this core, the vector instruction set is actually very nice, with operations like scatter/gather as well as per-lane write masks. IMO it is a cleaner and more flexible vector ISA than, say, AVX.
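To put the per-cycle numbers in chip-level terms (using, purely for illustration, a hypothetical clock of about 1.05 GHz, since actual clocks vary by SKU): 60 cores × 1.05 GHz × 16 DP flops/cycle ≈ 1 TFlops of peak double-precision throughput, and roughly twice that in single precision.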
Unlike GPUs, Xeon Phi does not have on-chip user-programmable local memory. Instead, each core is backed by a large 512kB L2 cache, and the caches are fully coherent. In total, on a 60-core variant, that is 30MB of coherent L2 cache, compared to the 1-2MB L2 caches we are used to seeing on GPUs. This is a HUGE win compared to GPUs IMO and should give very good efficiency on some workloads, such as some types of sparse matrix computations. Honestly, dealing with on-chip shared memory on GPUs is a giant pain.

My rough guess is that Nvidia's Tesla K20X will retain a 10-15% edge in some brute-force tests as well as tests like generic dense linear algebra, and will keep an advantage in fp32 workloads, but there will also be workloads where the Xeon Phi wins out. And overall, the Xeon Phi should retain a programmability advantage.

As an academic (currently), I am a little disappointed that I will likely not be able to test my tools on a Xeon Phi, as we do not have the budget to buy them. With Nvidia, one can start experimenting with CUDA by buying just a $100 card, and Nvidia has also been open about seeding their boards to universities where they feel it appropriate. Xeon Phis start upwards of $2k (much like Teslas), so not many labs will have access to them. I would like to see Intel offer some kind of program to universities to boost the Xeon Phi's popularity and grow the pool of programmers available for their card 🙂

Overall, a very good showing from Intel, though they do need to keep executing, as their competitors are not sitting idle either.

Cortex A15 and ARM Mali T604 are here

The new Chromebooks are apparently based upon the Exynos 5250, making them perhaps the first shipping consumer devices with the Cortex A15 as well as the ARM Mali T604. The Mali T604 theoretically supports OpenCL, which had me excited, but the fly in the ointment is Chrome OS. Google has confirmed in a forum post that there is currently no way to access OpenCL on Chrome OS, and they are not ready to comment on whether this will change in the future either. This is frankly ridiculous. What's the point of shipping powerful new hardware if developers are not given access to it? I hope one can load a proper Linux distro like Debian or Ubuntu, and that the binary GPU drivers with OpenGL and OpenCL support will be made available for them.