Texas Instruments recently announced their Keystone II chips. Essentially, these combine multi-core Cortex A15 CPUs with DSPs on a single chip. The number of cores and the DSP configuration vary depending on the SKU. Here I focus on the top-end SKU, the 66AK2H12.
The chip has the following integrated:
- 4 Cortex A15 cores @ 1.4 GHz giving 44.8 GFlops SP (NEON), 22.4 GFlops SP (IEEE-754), 11.2 GFlops DP
- 8 C66-family DSPs @ 1.2 GHz giving 153.6 GFlops SP, 57.6 GFlops DP?
- DDR3 memory controller 2×64-bit up to 1600 MHz giving 25.6 GB/s bandwidth
- ARM cores L1 data cache 4×32 kB, L1 instruction cache 4×32 kB, L2 cache 4 MB shared across cores
- DSP cores L1 data cache 8×32 kB, L1 instruction cache 8×32 kB, L2 cache 1 MB/DSP = 8 MB total
- 6 MB of cache (separate from L2 caches) shared by DSPs and ARM cores
- Up to 14 W power consumption
- OpenMP programming tools. An alpha version of an OpenCL driver is also available
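The peak numbers in the list above follow from simple cores × clock × flops-per-cycle arithmetic. Here is a minimal sketch of that calculation. The flops-per-cycle values are my assumptions, back-derived from the quoted totals rather than taken from TI documentation, and the function names are just illustrative.

```python
# Sanity check of the quoted peak figures for the 66AK2H12.
# ASSUMPTION: the flops-per-cycle values below are inferred from the
# quoted totals (total / cores / clock), not taken from TI documentation.

def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFlops: cores x clock (GHz) x flops per cycle per core."""
    return cores * clock_ghz * flops_per_cycle

def peak_bandwidth_gbs(channels, bus_bits, mega_transfers_per_s):
    """Theoretical peak memory bandwidth in GB/s."""
    bytes_per_transfer = bus_bits / 8
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000

arm_sp_neon = peak_gflops(4, 1.4, 8)   # assumed 8 SP flops/cycle/core with NEON
arm_sp_ieee = peak_gflops(4, 1.4, 4)   # assumed 4 SP flops/cycle/core, IEEE-754 path
arm_dp      = peak_gflops(4, 1.4, 2)   # assumed 2 DP flops/cycle/core
dsp_sp      = peak_gflops(8, 1.2, 16)  # assumed 16 SP flops/cycle/DSP
dsp_dp      = peak_gflops(8, 1.2, 6)   # assumed 6 DP flops/cycle/DSP
ddr3_bw     = peak_bandwidth_gbs(2, 64, 1600)

for name, value in [
    ("ARM SP (NEON) GFlops", arm_sp_neon),
    ("ARM SP (IEEE-754) GFlops", arm_sp_ieee),
    ("ARM DP GFlops", arm_dp),
    ("DSP SP GFlops", dsp_sp),
    ("DSP DP GFlops", dsp_dp),
    ("DDR3 bandwidth GB/s", ddr3_bw),
]:
    print(f"{name}: {value:.1f}")
```

These are theoretical peaks only. Real code will land below them, and the DP figure for the DSPs is the one I am least sure about.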
You should not think of this chip as a GPU-like accelerator. It is intended to be a standalone solution, with the 4 general-purpose ARM cores capable of running any regular ARM application, including a full Linux OS. Certain parts of your application can be offloaded to the DSPs, or the DSPs can be used in concert with the ARM cores. The DSPs themselves have a fairly flexible instruction set and my understanding is that you can do function calls, recursion etc. without issue (correct me if I am wrong, will confirm from documentation). The DSPs and the ARM cores both read from and write to the same memory, eliminating the data-copy bottleneck that exists on many PCIe accelerator type solutions.
The base specifications are looking really good. The perf/W looks to be competitive with GPU-based solutions. The low power consumption means that it can be used in many applications where the big power-hungry solutions (such as Teslas or Xeon Phis) are not applicable. The shared memory model is also very enticing for everyone, including, say, supercomputing uses.
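To put a rough number on the perf/W claim, here is a back-of-the-envelope sketch using only the figures quoted in the spec list. It assumes the ARM and DSP peaks can simply be added together and that the chip draws its full 14 W maximum. Both are simplifications on my part, so treat the result as a ballpark figure rather than a measurement.

```python
# Rough perf/W estimate from the quoted spec figures.
# ASSUMPTIONS: ARM and DSP theoretical peaks are summed as if both could be
# sustained at once, and power is taken as the quoted 14 W maximum.

ARM_SP_GFLOPS = 44.8   # Cortex A15 cores, SP with NEON
DSP_SP_GFLOPS = 153.6  # C66 DSPs, SP
ARM_DP_GFLOPS = 11.2   # Cortex A15 cores, DP
DSP_DP_GFLOPS = 57.6   # C66 DSPs, DP (marked uncertain in the spec list)
MAX_POWER_W = 14.0     # quoted maximum power consumption

def gflops_per_watt(gflops, watts):
    """Peak GFlops divided by power draw in watts."""
    return gflops / watts

combined_sp = ARM_SP_GFLOPS + DSP_SP_GFLOPS
combined_dp = ARM_DP_GFLOPS + DSP_DP_GFLOPS

sp_per_watt = gflops_per_watt(combined_sp, MAX_POWER_W)
dp_per_watt = gflops_per_watt(combined_dp, MAX_POWER_W)

print(f"Combined SP peak: {combined_sp:.1f} GFlops, {sp_per_watt:.1f} GFlops/W")
print(f"Combined DP peak: {combined_dp:.1f} GFlops, {dp_per_watt:.1f} GFlops/W")
```

For a fair comparison with a discrete GPU you would also have to count the power of the host CPU and memory that the GPU needs, which this chip already includes in its budget.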
TI have a good solution on their hands and should push more aggressively into the HPC space. They should put money into getting libraries like BLAS optimized for the system, along with, say, OpenCV. TI should invest money into developing good compilers, debuggers and profilers. They should particularly continue to invest in standards-based solutions like OpenMP and OpenCL. As a newcomer and a smaller player, they cannot afford to introduce yet another proprietary solution.
They also need to gain some mindshare as well as marketshare. To gain mindshare, they should make sure ALL of this is available in a nicely packaged fashion with good, descriptive documentation and webpages. They should also make low-cost boards available to really gain some marketshare. People underestimate how convenient Nvidia makes getting and using their tools for CUDA. I can just buy a cheap Nvidia card for a desktop (or buy a decent laptop), download the CUDA SDK for free without any agreements, and off I go. Everything is packaged nicely, easy to find and comes with good documentation. Capturing mindshare IS important and TI should learn those lessons from Nvidia.
I do wish TI all the best in the HPC field. They have built some solid and interesting technology, and the economics also potentially work out, as their DSP technology investments can be leveraged across multiple product lines, much like how Nvidia is able to use the same designs for both HPC and consumer products. If they invest in building a good software ecosystem around their products, they can certainly compete in this space.
If anyone from TI is reading this, I would love to port all of my software (such as my Python compilers and numerical libraries, see here and here) to your hardware, so please let me know whom I can contact 🙂