Texas Instruments Keystone II: HPC perspective

Texas Instruments recently announced their Keystone II chips. Essentially, these combine multi-core Cortex A15 CPUs with TI's C66x DSPs on a single chip. The number of ARM cores and the DSP configuration vary by SKU; here I focus on the top-end SKU, the 66AK2H12.

The chip has the following integrated:

  • 4 Cortex A15 cores @ 1.4 GHz giving 44.8 GFlops SP (NEON), 22.4 GFlops SP (IEEE-754), 11.2 GFlops DP
  • 8 C66x-family DSPs @ 1.2 GHz giving 153.6 GFlops SP, 57.6 GFlops DP? (peak-rate arithmetic sketched after this list)
  • DDR3 memory controller, 2 × 64-bit at up to 1600 MHz, giving 25.6 GB/s of bandwidth
  • ARM cores: 4 × 32 kB L1 data cache, 4 × 32 kB L1 instruction cache, 4 MB L2 cache shared across the cores
  • DSP cores: 8 × 32 kB L1 data cache, 8 × 32 kB L1 instruction cache, 1 MB L2 cache per DSP = 8 MB total
  • 6 MB of shared on-chip memory (separate from the L2 caches) accessible by both the DSPs and the ARM cores
  • Up to 14 W power consumption
  • OpenMP programming tools; an alpha version of an OpenCL driver is also available
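For readers wondering where the GFlops figures come from, the peak rates are just cores × flops-per-cycle × clock. Here is the arithmetic for the numbers I am reasonably sure of (the flops-per-cycle figures are back-derived from the totals above, and the DSP DP rate, as noted, still needs confirmation):

```latex
% Peak-rate arithmetic; per-core flops/cycle assumed: A15 NEON 8 SP, A15 2 DP, C66x 16 SP
\begin{align*}
\text{A15 SP (NEON):} \quad & 4 \text{ cores} \times 8 \text{ flops/cycle} \times 1.4\,\text{GHz} = 44.8\ \text{GFlops} \\
\text{A15 DP:} \quad & 4 \text{ cores} \times 2 \text{ flops/cycle} \times 1.4\,\text{GHz} = 11.2\ \text{GFlops} \\
\text{C66x SP:} \quad & 8 \text{ DSPs} \times 16 \text{ flops/cycle} \times 1.2\,\text{GHz} = 153.6\ \text{GFlops}
\end{align*}
```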

You should not think of this chip as a GPU-like accelerator. It is intended to be a standalone solution, with the 4 general-purpose ARM cores capable of running any regular ARM application, including a full Linux OS. Parts of your application can be offloaded to the DSPs, or the DSPs can be used in concert with the ARM cores. The DSPs themselves have a fairly flexible instruction set, and my understanding is that you can do function calls, recursion etc. without issue (correct me if I am wrong; I will confirm from the documentation). The DSPs and the ARM cores read and write the same memory, eliminating the data-copy bottleneck that exists in many PCIe accelerator-type solutions.
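To make the offload model concrete, here is a minimal sketch of what dispatching a loop to the DSPs could look like, assuming TI's toolchain ends up accepting standard OpenMP target directives. The exact pragmas in their early OpenMP tools may differ, so treat this as illustrative rather than their confirmed API:

```c
/* Sketch only: offloading a SAXPY loop using standard OpenMP 4.0 target
 * directives. Whether TI's early toolchain uses exactly this syntax is an
 * assumption on my part. The point is that on a shared-DDR3 SoC the map()
 * clauses describe visibility, not a physical copy across a PCIe bus. */
#include <stdio.h>

#define N 1024

int main(void)
{
    float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; i++) {
        x[i] = (float)i;
        y[i] = 1.0f;
    }

    /* Dispatch the loop to the accelerator (the C66x DSPs on Keystone II). */
    #pragma omp target map(to: x) map(tofrom: y)
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] += a * x[i];

    printf("y[42] = %f\n", y[42]);
    return 0;
}
```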

The base specifications look really good. The perf/W appears competitive with GPU-based solutions, and the low power consumption means the chip can be used in many settings where big, power-hungry solutions (such as Teslas or Xeon Phis) are not applicable. The shared memory model is also very enticing across the board, including for supercomputing use.
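A rough way to see the perf/W point, using the peak SP numbers from the list above and the 14 W figure (actual application efficiency will of course be lower):

```latex
\frac{44.8 + 153.6\ \text{GFlops (SP, peak)}}{14\ \text{W}} \approx 14\ \text{GFlops/W}
```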

TI have a good solution on their hands and should push more aggressively into the HPC space. They should put money into libraries optimized for the system, such as a tuned BLAS and, say, OpenCV. TI should also invest in good compilers, debuggers and profilers, and in particular continue to back standards-based solutions like OpenMP and OpenCL. As a newcomer and a smaller player, they cannot afford to introduce yet another proprietary solution.
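To illustrate why a tuned BLAS matters: application code written against the standard CBLAS interface, like the sketch below, would not need to change at all; TI would only need to ship an optimized library to link against (the "TI-tuned library" is hypothetical here, this is just the portability argument):

```c
/* Sketch: a standard CBLAS call. If TI shipped a tuned BLAS for Keystone II
 * (hypothetical), code like this would only need a relink, not a rewrite. */
#include <cblas.h>
#include <stdio.h>

int main(void)
{
    enum { N = 256 };
    static float A[N * N], B[N * N], C[N * N];

    for (int i = 0; i < N * N; i++) {
        A[i] = 1.0f;
        B[i] = 2.0f;
        C[i] = 0.0f;
    }

    /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0f, A, N, B, N, 0.0f, C, N);

    printf("C[0] = %f\n", C[0]);  /* expect 512.0 */
    return 0;
}
```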

They also need to gain mindshare as well as marketshare. To gain mindshare, they should make ALL of this available in a nicely packaged fashion with good, descriptive documentation and webpages. They should also make low-cost boards available to really gain marketshare. People underestimate how convenient Nvidia makes getting and using their CUDA tools: I can buy a cheap Nvidia card for a desktop (or a decent laptop), download the CUDA SDK for free without any agreements, and off I go. Everything is packaged nicely, is easy to find and comes with good documentation. Capturing mindshare IS important, and TI should learn those lessons from Nvidia.

I do wish TI all the best in the HPC field. They have built some solid and interesting technology, and the economics also potentially work out, as their DSP technology investments can be leveraged across multiple product lines, much like how Nvidia is able to use the same designs for both HPC and consumer products. If they invest in building a good software ecosystem around their products, they can certainly compete in this space.

If anyone from TI is reading this, I would love to port all of my software (such as my Python compilers and numerical libraries, see here and here) to your hardware, so please let me know who I can contact 🙂

5 thoughts on “Texas Instruments Keystone II: HPC perspective”

  1. TI provides OpenMP support, and an OpenCL implementation is available for evaluation. Furthermore, their compilers are already quite good.
    The instruction set of the DSP devices is extremely powerful; however, to achieve full efficiency you have to write assembly code for your kernels.

  2. Is it 11.2 double-precision GFlops for all 4 A15 cores, or 11.2 per core?
    And is 57.6 GFlops DP for all 8 DSP cores?

    And how did you get these numbers? I think the TI data sheet says it is 19 GFlops.
    And can it be overclocked above 1.4 GHz?
    And when will TI release the four-core A15 part, and at what process node (nm) will it be made?

    1. 11.2 GFlops DP for all four Cortex A15 cores combined. As far as I know, each Cortex A15 core can do 2 DP flops/cycle at peak, so 4 × 2 × 1.4 GHz = 11.2 GFlops.
      About the DSP DP peak, I am currently forgetting what my source was, but I believe it was a TI-sponsored article in an industry publication. I will look it up.

      I am not quite sure what the release date is. I think TI is targeting second half of 2013 for mass production. My understanding is that it is in pre-production right now and samples are being provided to key partners.

      1. For GPGPU, are DSPs better than or equal to GPUs like ARM’s Mali or Imagination’s PowerVR?
        Do DSPs have better programmability and higher execution speed than GPUs?
        Which of the two is more efficient and faster, and easier to write code for?
