Intel released a lot of new information during Intel Architecture Day 2021; you can check our other articles to learn more about the Alder Lake CPU, Sapphire Rapids, the Arc Alchemist GPU, and more. The last of those is particularly relevant to what we will discuss here, namely Intel's Ponte Vecchio and the Xe HPC architecture. Starting from the beginning: this is a lot of hardware, especially in the largest configuration, where eight GPUs work together. The upcoming Aurora supercomputer will pair Sapphire Rapids with Ponte Vecchio to become the first exascale supercomputer in the United States, and there are good reasons the Department of Energy chose Intel's upcoming hardware.
Like the Xe HPG built for games, the basic building block of Xe HPC is the Xe core. There are still 8 vector engines and 8 matrix engines per Xe core, but this Xe core differs fundamentally from its Xe HPG counterpart. The vector engine uses 512-bit registers (for 64-bit floating point), and the XMX matrix engine has been widened to 4096-bit data blocks. That doubles the potential throughput of the vector engine and quadruples FP16 throughput on the matrix engine. L1 cache size and load/store bandwidth have also been increased to keep the engines fed.
In addition to being larger, Xe HPC also supports more data types. Xe HPG's XMX engines only handle FP16 and BF16 data, but Xe HPC adds TF32 (Tensor Float 32), which has become popular in the machine-learning community. The vector engine also adds support for FP64 data, although at the same rate as FP32.
Each Xe core has eight vector engines, giving a single Xe core a total potential throughput of 256 FP64 or FP32 operations, or 512 FP16 operations, per clock on the vector engines. On the matrix engines, each Xe core can perform 4,096 FP16 or BF16 operations, 8,192 INT8 operations, or 2,048 TF32 operations per clock. And of course, Ponte Vecchio has far more than one Xe core.
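To keep those rates straight, here is a tiny reference sketch. The dictionary names and the helper function are our own shorthand for illustration, not Intel terminology; the per-clock numbers are the ones quoted above.

```python
# Per-clock throughput of a single Xe HPC Xe core, as quoted by Intel.
# Dictionary names are our own shorthand, not Intel terminology.
VECTOR_OPS = {"FP64": 256, "FP32": 256, "FP16": 512}                   # 8 vector engines
MATRIX_OPS = {"FP16": 4096, "BF16": 4096, "INT8": 8192, "TF32": 2048}  # 8 XMX engines

def core_gflops(ops_per_clock: int, clock_ghz: float) -> float:
    """Ops/clock x clock (GHz) = billions of ops per second (GFLOPS/GOPS)."""
    return ops_per_clock * clock_ghz

# Example: at a hypothetical 1.5 GHz, one Xe core's vector FP64 rate:
print(core_gflops(VECTOR_OPS["FP64"], 1.5))  # 384.0 GFLOPS
```

Note how the relationships in the quoted figures fall out directly: matrix INT8 runs at twice the FP16 rate, and TF32 at half of it.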
Xe HPC combines 16 Xe cores into a slice, whereas consumer-grade Xe HPG tops out at 8. The interesting point here is that, unlike Nvidia's GA100 architecture, Xe HPC includes ray tracing units (RTUs). We don't know how fast the RTUs are relative to Nvidia's RT cores, but for professional ray tracing applications this is a huge potential performance advantage.
Each Xe core on Ponte Vecchio also contains a 512KB L1 cache, which is large compared to consumer GPUs. All of the Xe cores in a slice also run in a single hardware context. But that is still only the slice level.
The main compute in Xe HPC comes as a stack of four slices, tied together by a large 144MB L2 cache and memory fabric, along with eight Xe Link connectors, four HBM2e stacks, and a media engine. Intel still isn't finished, though, because Xe HPC also comes in a 2-stack configuration that doubles all of those figures and links the stacks together via EMIB.
Xe Link is an important part of Xe HPC, providing a high-speed, coherent unified fabric for multi-GPU configurations. It can be used in 2-way, 4-way, 6-way, and 8-way topologies, with each GPU linked directly to every other GPU. Put them all together and you get a lot of compute!
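Because every GPU connects directly to every other GPU, the link counts for each topology are simple combinatorics. This little sketch (our own illustration, not Intel's figures) shows why an 8-way all-to-all topology needs seven direct connections per GPU:

```python
def links_per_gpu(n_gpus: int) -> int:
    """In an all-to-all topology, each GPU connects directly to every peer."""
    return n_gpus - 1

def total_links(n_gpus: int) -> int:
    """Each point-to-point link is shared by two GPUs."""
    return n_gpus * (n_gpus - 1) // 2

# The supported topologies: 2-way, 4-way, 6-way, and 8-way.
for n in (2, 4, 6, 8):
    print(f"{n}-way: {links_per_gpu(n)} links per GPU, {total_links(n)} links total")
# An 8-way topology needs 7 direct connections per GPU and 28 links in total.
```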
Intel has not disclosed clock speeds, but we expect a maximum of 32,768 FP64 operations per clock. Assuming it runs somewhere between 1.0 and 2.0 GHz, that works out to 32.8 to 65.5 TFLOPS of FP64 compute for a single Xe HPC GPU, and as much as 524 TFLOPS for an eight-GPU cluster. That brings us to the second topic: the productization of Xe HPC as Ponte Vecchio.
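The math behind that estimate is just the hierarchy multiplied out. A minimal sketch, using the per-core, per-slice, and per-stack counts quoted earlier (the clock speeds remain our assumption, since Intel hasn't disclosed them):

```python
# FP64 ops per clock, multiplied up through the Xe HPC hierarchy.
FP64_OPS_PER_CORE = 256   # 8 vector engines per Xe core
CORES_PER_SLICE   = 16
SLICES_PER_STACK  = 4
STACKS_PER_GPU    = 2     # the 2-stack Ponte Vecchio configuration

ops_per_clock = (FP64_OPS_PER_CORE * CORES_PER_SLICE
                 * SLICES_PER_STACK * STACKS_PER_GPU)

def fp64_tflops(clock_ghz: float) -> float:
    """Ops/clock x clock (1 GHz = 1e9 clocks/sec), converted to TFLOPS."""
    return ops_per_clock * clock_ghz * 1e9 / 1e12

print(ops_per_clock)         # 32768 FP64 ops per clock
print(fp64_tflops(1.0))      # 32.768 TFLOPS at 1.0 GHz
print(fp64_tflops(2.0))      # 65.536 TFLOPS at 2.0 GHz
print(8 * fp64_tflops(2.0))  # 524.288 TFLOPS for an eight-GPU cluster
```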
Ponte Vecchio will be an important step forward in packaging and integration. The entire SoC is composed of more than 100 billion transistors spanning 47 active tiles, manufactured on five different process nodes. All of this is brought together with Intel's 3D chip stacking technology. We have covered many of the details before, but it remains an impressive feat of engineering for Intel.
The compute tiles at the heart of Ponte Vecchio will be manufactured on TSMC's N5 process, with eight Xe cores per tile. These are linked to the Intel Foveros base tile (built on the newly renamed Intel 7 process), which also connects the Rambo cache, HBM2e, and a PCIe Gen5 interface. The Xe Link tile is also made on TSMC's N7.
Intel already has working A0 silicon (early chips, not yet final production) delivering more than 45 TFLOPS of FP32 compute, with HBM2e bandwidth exceeding 5 TBps and connectivity bandwidth exceeding 2 TBps.
The Aurora supercomputer will run a 6-way configuration, with Xe Links tying things together, as you can see in the Aurora blade above. Each blade also packs two Sapphire Rapids CPUs, all liquid-cooled, naturally, to keep things running cool.
Obviously, this will not be the last we see of Ponte Vecchio. With its extreme performance, the functionality packed into the package, and a design built to scale to hundreds or thousands of nodes, Ponte Vecchio will undoubtedly show up in more installations in the coming years. This is also just the first round of Xe HPC hardware, with more iterations planned to deliver additional performance and features.