Intel Looks to Create Gold With its Arc Alchemist and Xe HPG Architecture
The Intel Architecture Day 2021 covered a bunch of new information, including details on Intel’s Alder Lake CPUs, Sapphire Rapids, Ponte Vecchio and the Xe HPC GPU, and more. But here we’re talking about consumer graphics and the new Arc Alchemist GPUs.
Four years after Intel hired GPU guru Raja Koduri away from AMD, its first ‘real’ discrete graphics card ambitions are finally nearing completion. We’ve been hearing details and pieces about the Xe Graphics architecture for a couple of years now, but with the hardware set to launch in the first quarter of 2022, we’re on the final approach, and all of the major details and design decisions are finished. Here’s what we know about the upcoming Intel Arc GPU, the underlying architecture, and what we expect in terms of performance — which might actually be pretty decent (fingers crossed). However, it will need to be more than just “decent” to earn a spot on our list of the best graphics cards.
Intel has been steadily improving its GPU ambitions over the past decade or so, starting with the introduction of HD Graphics back in the Clarkdale era (1st Gen Core) in 2010. From inauspicious beginnings, Intel has become the largest provider of GPUs in the world — provided that you include slow and relatively weak integrated graphics solutions under that umbrella. But when you boost graphics performance by 50 to 100 percent multiple times, eventually you get to the point where even a slow start can reach impressive speeds. That’s where Arc and the Xe HPG architecture come into the picture.
Beyond the Integrated Graphics Barrier
Over the past decade, we’ve seen several instances where Intel’s integrated GPUs have basically doubled in theoretical performance. HD Graphics 3000 (Gen6) was nearly double the performance of the original HD Graphics (Gen5), and HD Graphics 4600 (Gen7) was another doubling, give or take. Gen8 was relatively short-lived, at least on the desktop side, while Gen9/Gen9.5 was again basically double the performance of Gen7 and has been the top desktop solution since the Core i7-6700K launched in 2015. At least until Xe (Gen12) showed up in this year’s Rocket Lake CPU. Gen11 also potentially doubled performance, with Gen12 potentially doubling it again, but both of those were predominantly limited to mobile solutions.
Intel frankly admits that integrated graphics solutions are constrained by many factors: memory bandwidth and capacity, chip size, and total power requirements all play a role. While CPUs that consume up to 250W of power exist (Intel’s Core i9-10900K and Core i9-11900K both fall into this category), competing CPUs that top out at around 145W are far more common (e.g., AMD’s Ryzen 5000 series). Plus, integrated graphics have to share all of those resources with the CPU, which means the GPU is typically limited to about half of the total budget. Dedicated graphics solutions have far fewer constraints.
Consider the first generation Xe Graphics found in Tiger Lake. Most of the chips have a 15W TDP, and even the later generation 8-core TGL-H chips only use up to 45W (65W configurable TDP). Except TGL-H also cut the GPU budget down to 32 EUs (Execution Units), where the lower power TGL chips had 96 EUs. In contrast, the top AMD and Nvidia dedicated graphics cards like the Radeon RX 6900 XT and GeForce RTX 3080 Ti have a power budget of 300W to 350W for the reference design, with custom cards pulling as much as 400W.
What could an Intel GPU do with 20X more power available? We’re about to find out — at least, once Intel’s Arc Alchemist GPU launches.
Meet Xe-Core: No More Execution Units
One of the more interesting announcements from Intel’s Architecture Day is that the heart of its GPU designs, known as Execution Units, will be going away. Fundamentally, we’re still talking about the same basic hardware, but the latest enhancements to the processing pipelines formerly known as EUs are so significant (in Intel’s words) that it has decided to rebrand them. Say hello to Xe-core, which hosts 16 Vector Engines (what used to be called an EU) as well as 16 Matrix Engines, or XMX, which stands for Xe Matrix eXtension.
What specific enhancements is Intel talking about when referring to the Execution Units? The short answer is that the new Vector Engines support the full DirectX 12 Ultimate feature set. That means support for features including ray tracing, variable rate shading, mesh shaders, and sampler feedback, all of which are also supported by Nvidia’s RTX 20-series Turing architecture from 2018.
The Vector Engine itself still operates on a 256-bit chunk of data, or the equivalent of eight 32-bit (FP32) operations. The Matrix Engine meanwhile operates on 1024-bit chunks of data, and while Intel didn’t go deep into the design, it looks and sounds as though the XMX cores are analogous to Nvidia’s Tensor cores. They’re designed to accelerate machine learning and AI-related functions, likely using FP16 data types, which means up to 64 16-bit FP16 (or BF16) operations per clock.
Much like the AMD and Nvidia GPU architectures, the Xe-core represents just part of the building blocks used for Intel’s Arc GPUs. Like previous designs, the next level up from the Xe-core is called a Render Slice (analogous to an Nvidia GPC, sort of), which contains four Xe-core blocks, for a total of 64 Vector Engines and 64 Matrix Engines, plus additional hardware. That additional hardware includes four ray tracing units, geometry and rasterization pipelines, samplers, and the pixel backend.
The ray tracing units are perhaps the most interesting addition, but other than their presence and their capabilities — they can do ray traversal, bounding box intersection, and triangle intersection — we don’t have any details on how the RT units compare to AMD’s ray accelerators or Nvidia’s RT cores. Are they faster, slower, or similar in overall performance? We’ll have to wait to get hardware in hand to find out for sure. Intel did provide a demo of Alchemist running an Unreal Engine demo that apparently uses ray tracing, but it’s for an unknown game, running at unknown settings … and running rather poorly, to be frank. Hopefully that’s because this is early hardware and drivers, but skip to the 4:57 mark in this Arc Alchemist video from Intel to see it in action.
Finally, Intel can pair various numbers of render slices together to create the entire GPU, with the L2 cache and the memory fabric tying everything together. The maximum Xe HPG configuration for the initial Arc Alchemist launch will have up to eight render slices. Ignoring the change in naming from EU to Vector Engine, that still gives the same maximum configuration of 512 EU/Vector Engines.
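To keep the hierarchy straight, here is a minimal sketch that tallies the units described above for any number of render slices. The constant and function names are our own, not Intel’s, and the per-block counts are simply the figures quoted in this article.

```python
# Building-block counts as described in the article (names are illustrative).
XE_CORES_PER_RENDER_SLICE = 4
VECTOR_ENGINES_PER_XE_CORE = 16
MATRIX_ENGINES_PER_XE_CORE = 16
RT_UNITS_PER_RENDER_SLICE = 4

VECTOR_ENGINE_WIDTH_BITS = 256    # per Vector Engine
MATRIX_ENGINE_WIDTH_BITS = 1024   # per Matrix (XMX) Engine


def unit_totals(render_slices):
    """Return the total unit counts for a GPU with the given number of render slices."""
    xe_cores = render_slices * XE_CORES_PER_RENDER_SLICE
    return {
        "xe_cores": xe_cores,
        "vector_engines": xe_cores * VECTOR_ENGINES_PER_XE_CORE,
        "matrix_engines": xe_cores * MATRIX_ENGINES_PER_XE_CORE,
        "rt_units": render_slices * RT_UNITS_PER_RENDER_SLICE,
    }


# Lanes per engine follow directly from the data widths.
fp32_lanes_per_vector_engine = VECTOR_ENGINE_WIDTH_BITS // 32   # eight FP32 values
fp16_lanes_per_matrix_engine = MATRIX_ENGINE_WIDTH_BITS // 16   # 64 FP16/BF16 values

if __name__ == "__main__":
    for slices in (2, 4, 8):
        print(slices, "render slices:", unit_totals(slices))
```

The two- and four-slice rows are only the hypothetical smaller configurations speculated about in this article, not announced products.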
Intel didn’t quote a specific amount of L2 cache, per render slice or for the entire GPU. There will likely be multiple Arc configurations: one with four render slices seems likely, and perhaps even a two render slice GPU would be useful. Intel did reveal that its Xe HPC GPUs will have 512KB of L1 cache per Xe-core, and up to 144MB of L2 cache per slice, but that’s a completely different part, and the Xe HPG GPUs will likely have less L1 and L2 cache. Still, given how much benefit AMD saw from its Infinity Cache, we wouldn’t be shocked to see 32MB or more of total cache on the largest Arc GPUs.
While it doesn’t sound like Intel has specifically improved throughput on the Vector Engines compared to the EUs in Gen11/Gen12 solutions, that doesn’t mean performance hasn’t improved. DX12 Ultimate includes some new features that can also help performance, but the biggest change comes via boosted clock speeds. Intel didn’t provide any specific numbers, but it did state that Xe HPG can run at 1.5X frequencies compared to Xe LP, and it also said that Xe HPG delivers 1.5X improved performance per watt. Taken together, we could be looking at clock speeds of 2.0–2.3GHz for the Arc GPUs, which would yield a significant amount of raw compute.
Putting it all together, Arc Alchemist will have up to eight render slices, each with four Xe-cores, 16 Vector Engines per Xe-core, and each Vector Engine can do eight FP32 operations per clock. Double that for FMA operations (Fused Multiply Add, a common matrix operation used in graphics workloads), then multiply by a potential 2.0–2.3GHz clock speed, and we get the theoretical performance in GFLOPS:
8 (RS) * 4 (Xe-core) * 16 (VE) * 8 (FP32) * 2 (FMA) * 2.0–2.3 (GHz) = 16,384–18,841.6 GFLOPS
Obviously, GFLOPS (or TFLOPS) on its own doesn’t tell us everything, but 16–19 TFLOPS for the top configurations is certainly nothing to scoff at. Nvidia’s Ampere GPUs theoretically have a lot more compute (the RTX 3080, as an example, has a maximum of 29.8 TFLOPS), but some of that gets shared with INT32 calculations. AMD’s RX 6800 XT, by comparison, ‘only’ has 20.7 TFLOPS, but in many games, it can deliver similar performance to the RTX 3080. Either way, depending on final clock speeds, Xe HPG and Arc Alchemist likely come in below the theoretical level of the top current AMD and Nvidia GPUs, but not by much. So on paper at least, it looks like Intel could land in the vicinity of the RTX 3070/3070 Ti and RX 6800, assuming drivers and everything else don’t hold it back.
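For anyone who wants to plug in different clock speeds or configurations, the peak FP32 arithmetic can be sketched in a few lines. The function name and defaults are ours, and the 2.0–2.3GHz range is our estimate rather than anything Intel has confirmed.

```python
def peak_fp32_gflops(render_slices, clock_ghz,
                     xe_cores_per_slice=4,
                     vector_engines_per_xe_core=16,
                     fp32_lanes_per_vector_engine=8):
    """Theoretical peak FP32 throughput in GFLOPS.

    Each FP32 lane is assumed to retire one fused multiply-add per clock,
    which counts as two floating-point operations.
    """
    flops_per_clock = (render_slices
                       * xe_cores_per_slice
                       * vector_engines_per_xe_core
                       * fp32_lanes_per_vector_engine
                       * 2)  # FMA = multiply + add
    return flops_per_clock * clock_ghz


if __name__ == "__main__":
    # Estimated clock range; actual Arc clocks are not yet known.
    for clock in (2.0, 2.3):
        tflops = peak_fp32_gflops(8, clock) / 1000
        print(f"8 render slices at {clock:.1f} GHz: {tflops:.1f} TFLOPS")
```

As with any peak figure, this is an upper bound on paper and says nothing about how well real games will keep those lanes busy.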
XMX: Matrix Engines and Deep Learning for XeSS
You’ll note we didn’t say much about the XMX blocks yet, but they’re potentially just as useful as Nvidia’s Tensor cores — which are used not just for DLSS, but also other AI applications, including Nvidia Broadcast. Intel announced today a new upscaling and image enhancement algorithm that it’s calling XeSS: Xe Superscaling. Intel didn’t go deep into the details, but it’s worth mentioning that Intel recently hired Anton Kaplanyan. He worked at Nvidia and played an important role in creating DLSS before heading over to Facebook to work on VR. It doesn’t take much reading between the lines to conclude that he’s likely doing a lot of the groundwork for XeSS now, and there are many similarities between DLSS and XeSS. XeSS uses the current rendered frame, motion vectors, and data from previous frames and feeds all of that into a trained neural network that handles the upscaling and enhancement to produce a final image. That sounds basically the same as DLSS 2.0, though the details matter here, and we assume the neural network will end up with different results. Intel did provide a demo using Unreal Engine showing XeSS in action, and it looked good when comparing 1080p upscaled via XeSS to 4K against the native 4K rendering. Still, we’ll have to see XeSS in action before rendering any verdict.
More important than just how it works will be how many game developers choose to use XeSS. They already have access to both DLSS and AMD FSR, which target the same problem of boosting performance and image quality. Adding a third option, from the newcomer to the dedicated GPU market no less, seems like a stretch for developers. However, Intel does offer a potential advantage over DLSS.
XeSS is designed to work in two modes. The highest performance mode utilizes the XMX hardware to do the upscaling and enhancement, but of course, that would only work on Intel’s Arc GPUs. That’s the same problem as DLSS, except with zero installed base, which would be a showstopper in terms of developer support. But Intel has a solution: XeSS will also work, in a lower performance mode, using DP4a instructions.
DP4a is widely supported by other GPUs, including Intel’s previous generation Xe LP and multiple generations of AMD and Nvidia GPUs (Nvidia Pascal and later, or AMD Vega 20 and later), which means XeSS in DP4a mode will run on virtually any modern GPU. Support might not be as universal as AMD’s FSR, which runs in shaders and basically works on any DirectX 11 or later capable GPU as far as we’re aware, but quality could be better than FSR as well.
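For readers unfamiliar with DP4a, it is a single instruction that takes two 32-bit words, treats each as four packed 8-bit integers, multiplies them lane by lane, and adds the four products to a 32-bit accumulator. The sketch below models the signed variant in plain Python purely to show the arithmetic. The helper names are ours, it does not model 32-bit overflow, and Intel has not said how XeSS actually uses the instruction.

```python
def pack_int8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word (lane 0 in the low byte)."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & 0xFF) << (8 * lane)
    return word


def _to_signed8(byte):
    """Interpret an unsigned byte (0..255) as a signed 8-bit integer (-128..127)."""
    return byte - 256 if byte >= 128 else byte


def dp4a(a_packed, b_packed, accumulator):
    """Four-lane int8 dot product added to an accumulator (signed variant)."""
    total = accumulator
    for lane in range(4):
        a = _to_signed8((a_packed >> (8 * lane)) & 0xFF)
        b = _to_signed8((b_packed >> (8 * lane)) & 0xFF)
        total += a * b
    return total


if __name__ == "__main__":
    a = pack_int8x4([1, -2, 3, 4])
    b = pack_int8x4([5, 6, -7, 8])
    # 1*5 + (-2)*6 + 3*(-7) + 4*8 = 4, plus the starting accumulator of 10
    print(dp4a(a, b, 10))  # 14
```

The appeal for neural network inference is that one instruction does four multiplies and four adds on 8-bit data, which is why a DP4a path can run on ordinary shader hardware without dedicated matrix units.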
The big question will still be developer uptake. We’d love to see similar quality to DLSS 2.x, with support covering a broad range of graphics cards from all competitors. That’s definitely something Nvidia is still missing with DLSS, as it requires an RTX card. But RTX cards already make up a huge chunk of the high-end gaming PC market, probably around 80% or more (depending on how you quantify high-end). So Intel basically has to start from scratch with XeSS, and that makes for a long uphill climb. On the bright side, it will be providing the XeSS Developer Kit this month, giving it plenty of time to get things going. So it’s possible (though unlikely) we could even see games implementing XeSS before the first Arc GPUs hit retail.
Xe HPG, Thanks for the Memories
So far, Intel hasn’t made any comments about what sort of memory will be used with the various Arc Alchemist GPUs. Rumors say it will be GDDR6, probably running at 16Gbps… but that’s all just guesswork right now, at least officially. At the same time, it isn’t easy to imagine any other solution that would make sense. GDDR5 memory still gets used on some budget solutions, but the fastest chips top out at around 8Gbps, half of what GDDR6 offers. So if Intel wants to be competitive with Xe HPG, it basically has to use GDDR6. There’s also HBM2E as a potential solution, but while that can provide substantial increases to memory bandwidth, it would also significantly increase costs. The data center Xe HPC will use HBM2E, but none of the chip shots for Xe HPG show HBM stacks of memory, which again leads us back to GDDR6.
There will be multiple Xe HPG Arc solutions, with varying capabilities. The larger chip, which we’ve focused on so far, appears to have eight 32-bit GDDR6 channels, giving it a 256-bit bus. That means it might have 8GB or 16GB of memory on the top model. We’ll get into this in the next section, and we’ll likely see trimmed down 192-bit and 128-bit interfaces on lower-tier cards.
Xe HPG Die Shots and Analysis
Much of what we’ve said so far isn’t radically new information, but Intel did provide a few images, as well as video evidence that provides some great indications of where Intel will land. So let’s start with what we know for certain.
Intel will partner with TSMC and use the N6 process (an optimized variant of N7) for Xe HPG. That means it’s not technically competing for the same wafers AMD uses for its Zen 2 and Zen 3 CPUs and its RDNA and RDNA 2 GPUs. At the same time, AMD and Nvidia could use N6 as well, since its design is compatible with N7, so Intel’s use of TSMC certainly doesn’t help AMD or Nvidia production capacities. TSMC likely has a lot of tools that overlap between N6 and N7 as well, meaning it could run batches of N6, then batches of N7, switching back and forth, so there’s certainly potential for this to cut into TSMC’s ability to provide wafers to other partners.
Beyond that, everything else in this section is based on our own analysis.
Raja, at one point, showed a wafer of Xe HPG chips. If you grab a snapshot of the video and zoom in on the wafer, the various chips are reasonably clear. We’ve drawn lines to show how large the chips are, and based on our calculations, it looks like the larger Xe HPG die will be around 24x16mm (~384mm^2), give or take 5–10% in each dimension. We counted the dies on the wafer as well, and there appear to be 144 whole dies, which would also correlate to a die size of around 384mm^2.
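As a sanity check on that die count, a common back-of-the-envelope formula estimates gross dies per wafer from the wafer diameter and die area. The sketch below assumes a standard 300mm wafer and ignores scribe lines, edge exclusion, and yield, so it tends to land slightly above a hand count of whole dies. The function name is ours.

```python
import math


def gross_dies_per_wafer(die_width_mm, die_height_mm, wafer_diameter_mm=300.0):
    """Approximate gross die count using the classic area-minus-edge-loss estimate.

    First term: wafer area divided by die area.
    Second term: a correction for partial dies lost around the circumference.
    """
    die_area = die_width_mm * die_height_mm
    wafer_radius = wafer_diameter_mm / 2
    whole_area_term = math.pi * wafer_radius ** 2 / die_area
    edge_loss_term = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area)
    return whole_area_term - edge_loss_term


if __name__ == "__main__":
    # Our 24x16mm estimate, plus the 10 percent margin of error in each dimension.
    for scale in (0.9, 1.0, 1.1):
        width, height = 24 * scale, 16 * scale
        estimate = gross_dies_per_wafer(width, height)
        print(f"{width:.1f} x {height:.1f} mm: about {estimate:.0f} dies")
```

For a 24x16mm die the estimate comes out at roughly 150, which is in the same ballpark as the 144 whole dies counted on Intel’s wafer.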
That’s not a massive GPU — Nvidia’s GA102, for example, measures 628mm^2 and AMD’s Navi 21 measures 520mm^2 — but it’s also not at all small. AMD’s Navi 22 measures 335mm^2, and Nvidia’s GA104 is 393mm^2, so Xe HPG would be larger than AMD’s chip and similar in size to the GA104 — but made on a smaller manufacturing process. Putting it bluntly, size matters.
This may be Intel’s first real dedicated GPU since the i740 back in the late 90s, but it has made many integrated solutions over the years, and it has spent the past several years building a bigger dedicated GPU team. Die size alone doesn’t determine performance, but it gives a good indication of how much stuff can be crammed into a design. A chip that’s close to 400mm^2 in size suggests Intel intends to be competitive with at least the RTX 3070 and RX 6800, which is likely higher than some were expecting.
Besides the wafer shot, Intel also provided these two die shots for Xe HPG. Yes, these are clearly two different GPU dies. They’re probably artistic renderings rather than actual die shots, but even those should have some basis in reality. You can see that the larger die has eight clusters in the center area that would correlate to the eight render slices, and then there’s a bunch of other stuff that’s a bit more nebulous.
The memory interfaces are along the bottom edge and the bottom half of the left and right edges. The corners are a bit iffy, and figuring out what counts as a memory interface can be tricky, but it looks like this is probably a 256-bit interface. There are four distinct clusters on the big chip, and if each represents a 64-bit interface, we’re done (though I can’t completely rule out a wider interface).
A 256-bit interface, which seems most likely, would put Intel’s Arc GPUs in an interesting position. That’s the same interface width as the GA104 (RTX 3060 Ti/3070/3070 Ti) and Navi 21. Will Intel follow AMD’s lead and use 16Gbps memory, or will it opt for more conservative 14Gbps memory like Nvidia? We don’t know. Even more options open up if it’s a wider interface, but we’ll leave those for others to speculate on.
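The difference between those two memory speeds is easy to quantify, since peak bandwidth is just the bus width multiplied by the per-pin data rate. Here is a minimal sketch with a function name of our own choosing. The 256-bit width is our reading of the die shot, not a confirmed spec.

```python
def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bits per transfer times Gbps per pin, divided by 8."""
    return bus_width_bits * data_rate_gbps / 8


if __name__ == "__main__":
    # Assumed 256-bit bus at the two GDDR6 speeds discussed above.
    for rate in (14, 16):
        print(f"256-bit at {rate}Gbps: {memory_bandwidth_gbs(256, rate):.0f} GB/s")
```

That works out to 448 GB/s at 14Gbps and 512 GB/s at 16Gbps, a gap of about 14 percent.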
The smaller die looks to have two render slices, giving it just 128 Vector Engines. This would be more of an entry-level part, and the remainder of the die has to include a bunch of stuff like video codec support, video outputs, and memory interfaces. Based on the above image, the smaller die would measure around 150mm^2, making it smaller than even Nvidia’s TU117 GPU used in the GTX 1650.
The smaller chip looks like it only has a 96-bit memory interface (the blocks in the lower-right edges of the chip), which could put it at a disadvantage relative to other cards. Compute performance will be substantially lower than the bigger chip. With 128 Vector Engines, it would equate to around 4.1–4.7 TFLOPS of compute at the same 2.0–2.3GHz clock speeds. However, that could still be a match for the GTX 1650 Super, and hopefully Intel would provide the GPU with at least 6GB of memory.
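The memory capacities mentioned for both chips follow from the bus width. The sketch below assumes one 32-bit GDDR6 chip per channel, in the 8Gb (1GB) or 16Gb (2GB) densities currently on the market, with no clamshell configurations. The function name is ours, and the bus widths are our estimates from the die shots.

```python
def capacity_options_gb(bus_width_bits, chip_densities_gbit=(8, 16), chip_width_bits=32):
    """List the VRAM capacities (in GB) possible with one memory chip per 32-bit channel."""
    chips = bus_width_bits // chip_width_bits
    return [chips * density // 8 for density in chip_densities_gbit]


if __name__ == "__main__":
    # Estimated bus widths for the larger and smaller dies.
    for width in (256, 96):
        print(f"{width}-bit bus: {capacity_options_gb(width)} GB")
```

A 256-bit bus gives 8GB or 16GB, while a 96-bit bus gives 3GB or 6GB, which is why 6GB is the number to hope for on the smaller chip.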
Will Intel Arc Be Good at Mining Cryptocurrency?
Given the current GPU shortages on the AMD and Nvidia side, fueled in part by cryptocurrency miners, it’s inevitable that people will want to know if Intel’s Arc GPUs will face similar difficulties. Publicly, Intel has said precisely nothing about mining potential and Xe Graphics. Given the data center roots for Xe HP/HPC, however (machine learning, High-Performance Compute, etc.), Intel has probably at least looked into the possibilities mining presents. Still, it’s certainly not making any marketing statements about the suitability of the architecture or GPUs for mining. But then, there’s also the above image (for the entire Intel Architecture Day presentation), with a physical Bitcoin and the text “Crypto Currencies” and you start to wonder.
Generally speaking, Xe might work fine for mining, but the most popular algorithms for GPU mining (Ethash mostly, but also Octopus and KAWPOW) have performance that’s predicated almost entirely on how much memory bandwidth a GPU has. Previous rumors suggest that Intel’s Arc GPUs will use 16GB of GDDR6 with a 256-bit interface at the very top of the product stack. That would give it similar bandwidth to AMD’s RX 6800/6800 XT/6900 XT as well as Nvidia’s RTX 3060 Ti/3070, which would, in turn, lead to performance of around 60-ish MH/s for Ethereum mining.
Intel likely isn’t going to use GDDR6X, but it might have some other features that would boost mining performance as well. If so, it hasn’t spilled the beans yet. Nvidia has memory clocked at 14Gbps on the RTX 3060 Ti and RTX 3070, and (before the LHR models came out) those cards could do about 61–62 MH/s. AMD has faster 16Gbps memory, and after tuning ends up at closer to 65 MH/s. That’s realistically about where we’d expect the fastest Arc GPU to land, and that’s only if the software works properly on the card. Considering Arc GPUs won’t even show up until early 2022, and given the volatility of cryptocurrencies, it’s unlikely that mining performance has been an overarching concern for Intel during the design phase. That still doesn’t mean it will be bad (or good) at mining.
Remember also that Intel first started talking about its discrete GPU aspirations back in 2018, when mining was not a major factor in GPU sales. Nvidia and AMD weren’t talking about mining in 2018 either. Or at least, they weren’t in the latter part of 2018 after cryptocurrencies took a dive. It was only in late 2020 and early 2021 that cryptocurrency prices jumped by nearly 10X, and mining suddenly became a hot topic again.
Best-case (or worst-case, depending on your perspective), we anticipate mining performance will roughly match AMD’s Navi 21 and Nvidia’s GA104 GPUs. Of course, the mining software will likely need major updates and driver fixes to even work properly on future GPUs. I did give mining a shot using the Xe DG1, and it failed all of the benchmarks on NiceHashMiner, but that’s not saying much as most of the software didn’t even detect a “compatible” GPU. At launch, I’d expect the Arc GPUs to be in a similar situation, but we’ll have to see how things shape up over time.
Xe HPG and Arc Alchemist: More to Come
That’s it for all the new things we’ve learned about Xe HPG and Arc Alchemist. The core specs are shaping up nicely, and the use of TSMC N6 and potentially a 400mm^2 die with a 256-bit memory interface all point to a card that should be competitive. As the newcomer, Intel needs the first Arc Alchemist GPUs to come out swinging. However, as discussed in our look at the Intel Xe DG1, there’s much more to building a good graphics card than hardware — which is probably why DG1 exists, to get the drivers and software ready for Arc.
Most of the things we don’t know for sure about Arc aren’t super critical. For example, will the Arc cards have a blower fan, dual fans, or triple fans? It doesn’t really matter, as any of those with the right card design can suffice. It will be good to get final TDP data, core counts, memory interface width, VRAM capacity, bandwidth, etc. We also need to know how much the cards will cost, but those are all topics for another day.
We’re also very curious about the real-world ray tracing performance, compared to both AMD and Nvidia. The current design has a maximum of 32 ray tracing units (RTUs), but we know next to nothing about what those units can do. Each one might be similar in capabilities to AMD’s ray accelerators, in which case Intel would come in pretty low on the ray tracing pecking order. Alternatively, each RTU might be the equivalent of several AMD ray accelerators, perhaps even faster than Nvidia’s Ampere RT cores. While it could be any of those, we suspect it will probably land lower on RT performance rather than higher, leaving room for growth with future iterations.
Speaking of which, we really appreciate the logical order of the upcoming codenames. Alchemist, Battlemage, Celestial, and Druid might not be the most awe-inspiring codenames, but going in alphabetical order will at least make it easier to keep things straight. Hopefully, Intel can release the future architectures sooner rather than later and put some pressure on AMD and Nvidia. It will need to, considering we expect to see Lovelace and perhaps RDNA3 GPUs next year as well.
The good news is that we finally have a relatively hard launch date of Q1 2022. So for better or worse, we’ll know how Intel’s discrete graphics card stacks up to the competition in the next six months or so. If we’re lucky, maybe by then we’ll even have some GPUs sitting on store shelves rather than continued shortages and jacked-up pricing. After that, Intel just needs to deliver on three key areas: availability, performance, and price. If it can do that, the GPU duopoly might morph into a triopoly in 2022.