Intel Xeon Phi announcement and summary – Searching for divine code

Intel had announced Xeon Phi branding and basic architecture long ago, but we finally have details and pricing. Xeon Phi is essentially a 62-core x86 chip. Different SKUs will have different number of cores and clock speeds enabled. TDPs and rough performance numbers look competitive with offerings such as Nvidia Tesla, but the Xeon Phi offers higher programmability and potentially better efficiency on some workloads. The chip will sit in a PCIe board and can either be used to offload parts of your program, or run the whole program. The board offers a number of programming interfaces such as OpenMP that are a lot more convenient than writing say CUDA code. Compared to GPUs, it should be relatively easy to get your application up-and-running on a Xeon Phi though optimization will still require some effort.

However, I am still happy to report that OpenCL is still fully supported, so porting code from GPUs to Xeon Phi is still easy. Kudos to Intel for getting behind OpenCL and actually delivering fully working products.

Each core has an in-order dual-issue x86 core with SMT (4 threads) backed by a 512-bit vector unit capable of doing FMA operations. Each vector unit can do 8 fp64 FMAs (16 flops) or 16 fp32 FMAs (32 flops) each cycle. While there is no SSE or AVX available on this core, the vector instruction set is actually very nice with operations like scatter-gather as well as per-lane write masks. IMO it is a cleaner and more flexible vector ISA than say AVX.
Unlike GPUs, Xeon Phi does not have an on-chip user-programmable local memory. Instead, it is backed by a large 512kB L2 cache on each core and the cache is fully coherent. In total, on a 60-core variant that is 30MB of coherent L2 cache compared to 1-2 MB L2 caches we are used to seeing on GPUs. This is a HUGE win compared to GPUs IMO and should give very good efficiency on some workloads such as some types of sparse matrices. Honestly, dealing with on-chip shared memory on GPUs is a giant pain.

My rough guess is that Nvidia’s Tesla K20X will retain a 10-15% edge in some brute force tests as well as tests like generic dense linear algebra, and will retain an advantage in fp32 workloads, but there will also be workloads where Xeon Phi will win out. And overall Xeon Phi should retain a programmability advantage.

As an academic (currently), I am a little disappointed that I will likely not be able to test my tools on a Xeon Phi as we do not have the budget to buy them. With Nvidia, one can start experimenting with CUDA by buying just a $100 card and Nvidia has also been open about seeding their boards to universities where they feel appropriate. Xeon Phis start upwards of $2k (much like Teslas) so not many labs will have access to them. Would like to see Intel offer some kind of program to univs to boost the Xeon Phi’s popularity to increase the base of programmer pool available for their card 🙂

Overall, a very good showing from Intel, though they do need to keep executing as other competitors are not sitting idle either.

Share this post:

X (Twitter) LinkedIn Email Reddit Pocket

Share this post:

Leave a Reply Cancel reply