Intel Ponte Vecchio and Xe HPC Architecture: Built for Big Data

Intel released a lot of new information during Intel Architecture Day 2021; you can check our other articles to learn more about Alder Lake CPUs, Sapphire Rapids, the Arc Alchemist GPU, and more. The last one is particularly relevant to what we will discuss here, namely Intel’s Ponte Vecchio and Xe HPC architecture. It is huge, especially in the largest configuration, where eight GPUs work together. The upcoming Aurora supercomputer will use Sapphire Rapids and Ponte Vecchio to become the first exascale supercomputer in the United States. The Department of Energy had good reasons to choose Intel’s upcoming hardware.

(Image source: Intel)

Like the Xe HPG built for games, Xe HPC uses the Xe-core as its basic building block. There are still 8 vector engines and 8 matrix engines per Xe-core, but this Xe-core is fundamentally different from the one in Xe HPG. The vector engines use 512-bit registers (for 64-bit floating point), and the XMX matrix engines have been expanded to 4096-bit data blocks. Compared with Xe HPG, that is twice the potential performance on the vector engine and four times the FP16 throughput on the matrix engine. The L1 cache size and load/store bandwidth have also increased to keep the engines fed.

(Image source: Intel)

In addition to being larger, Xe HPC also supports more data types. The Xe HPG XMX engine only handles FP16 and BF16 data, but Xe HPC also supports TF32 (Tensor Float 32), which has become popular in the machine learning community. The vector engine also adds support for FP64 data, at the same rate as FP32 data.

Each Xe-core has eight vector engines, and the total potential throughput of a single Xe-core on the vector engines is 256 FP64 or FP32 operations per clock, or 512 FP16 operations per clock. On the matrix engines, each Xe-core can perform 4096 FP16 or BF16 operations per clock, 8192 INT8 operations per clock, or 2048 TF32 operations per clock. But of course, Ponte Vecchio has more than one Xe-core.
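To make those per-clock figures easier to compare, here is a minimal Python sketch that breaks them down per engine. It only uses the numbers quoted above plus the eight vector and eight matrix engines per Xe-core; the even split across engines and all names in the code (`XE_CORE_OPS_PER_CLOCK`, `ops_per_engine`, `matrix_to_vector_ratio`) are our own assumptions for illustration, not anything Intel has published.

```python
# Per-clock throughput of one Xe-core, copied from the figures quoted above.
XE_CORE_OPS_PER_CLOCK = {
    "vector": {"FP64": 256, "FP32": 256, "FP16": 512},
    "matrix": {"FP16": 4096, "BF16": 4096, "INT8": 8192, "TF32": 2048},
}

# The article gives eight vector engines and eight matrix engines per Xe-core.
ENGINES_PER_XE_CORE = {"vector": 8, "matrix": 8}


def ops_per_engine(engine: str, dtype: str) -> int:
    """Per-clock operations of a single engine, assuming an even split (our assumption)."""
    return XE_CORE_OPS_PER_CLOCK[engine][dtype] // ENGINES_PER_XE_CORE[engine]


def matrix_to_vector_ratio(dtype: str) -> float:
    """Matrix-engine rate divided by vector-engine rate for a type both support."""
    return XE_CORE_OPS_PER_CLOCK["matrix"][dtype] / XE_CORE_OPS_PER_CLOCK["vector"][dtype]


if __name__ == "__main__":
    for engine, rates in XE_CORE_OPS_PER_CLOCK.items():
        for dtype in rates:
            print(f"{engine:6} {dtype:5} {ops_per_engine(engine, dtype):5} ops/clock per engine")
    print("FP16 matrix vs vector:", matrix_to_vector_ratio("FP16"))
```

Under that even-split assumption, each vector engine works out to 32 FP32 operations per clock and each matrix engine to 512 FP16 operations per clock, so FP16 work runs eight times faster per clock on the matrix path than on the vector path.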

(Image source: Intel)

Xe HPC combines 16 Xe-cores into a slice, while consumer-grade Xe HPG has at most 8 units. Interestingly, unlike Nvidia’s GA100 architecture, Xe HPC includes ray tracing units (RTUs). We don’t know how fast the RTUs are relative to Nvidia’s RT cores, but for professional ray tracing applications, this is potentially a huge performance improvement.
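As a rough illustration of what 16 Xe-cores per slice means for throughput, the sketch below scales a per-Xe-core, per-clock figure up to a full slice. It assumes ideal linear scaling, which real workloads will not reach, and the function name `slice_ops_per_clock` is hypothetical. Intel has not tied these figures to a clock speed here, so the results stay in operations per clock.

```python
XE_CORES_PER_SLICE = 16  # from the article: one Xe HPC slice groups 16 Xe-cores


def slice_ops_per_clock(ops_per_xe_core: int, cores: int = XE_CORES_PER_SLICE) -> int:
    """Scale a per-Xe-core, per-clock figure to a whole slice (ideal scaling assumed)."""
    if ops_per_xe_core < 0 or cores < 0:
        raise ValueError("operation and core counts must be non-negative")
    return ops_per_xe_core * cores


# Per-Xe-core figures quoted earlier in the article.
print(slice_ops_per_clock(256))   # vector FP64/FP32 -> 4096
print(slice_ops_per_clock(4096))  # matrix FP16/BF16 -> 65536
print(slice_ops_per_clock(8192))  # matrix INT8      -> 131072
```

These are peak numbers only; they say nothing about the ray tracing units, whose performance Intel has not quantified.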

