Artificial intelligence (AI) has been widely adopted in the past few years. Tesla is a company dedicated to electric vehicles and self-driving cars, and AI is of great value to nearly every aspect of its work. To accelerate its AI software workloads, Tesla today unveiled D1, a custom application-specific integrated circuit (ASIC) for AI training built for its Dojo supercomputer.
Many companies are currently building ASICs for AI workloads, from countless startups all the way to big players such as Amazon, Baidu, Intel, and Nvidia. However, not everyone gets the formula right, and no single design serves every workload perfectly. This is why Tesla chose to develop its own ASIC for AI training.
The chip is called D1 and forms part of the Dojo supercomputer, which Tesla uses to train AI models at its headquarters; those models are later deployed in various applications. D1 is manufactured by TSMC on a 7nm process node. It contains more than 50 billion transistors and has a large die size of 645mm^2.
The chip comes with some impressive performance claims. Tesla says it can deliver up to 362 TeraFLOPs at FP16/CFP8 precision, or approximately 22.6 TeraFLOPs for single-precision FP32 tasks. Tesla has clearly optimized for the FP16 data type, and on this metric it even beats the current compute leader, Nvidia. Nvidia’s A100 Ampere GPU delivers “only” 312 TeraFLOPs on FP16 workloads, although with sparsity that figure can double.
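To put these claims side by side, the short sketch below computes the ratios implied by the figures quoted above. It uses only the vendor-stated peak numbers from this article; the variable names are illustrative, and peak TeraFLOPs say nothing about sustained performance on real training workloads.

```python
# Peak throughput figures as quoted in the article (vendor claims, in TeraFLOPs).
D1_FP16_TFLOPS = 362.0
D1_FP32_TFLOPS = 22.6
A100_FP16_DENSE_TFLOPS = 312.0
# The article states the A100 figure "can double" with sparsity.
A100_FP16_SPARSE_TFLOPS = 2 * A100_FP16_DENSE_TFLOPS

# How D1's claimed FP16/CFP8 peak compares with the A100's FP16 peak.
d1_vs_a100_dense = D1_FP16_TFLOPS / A100_FP16_DENSE_TFLOPS
d1_vs_a100_sparse = D1_FP16_TFLOPS / A100_FP16_SPARSE_TFLOPS

# How much faster D1 is at low precision than at FP32, per its own claims.
d1_low_precision_speedup = D1_FP16_TFLOPS / D1_FP32_TFLOPS

print(f"D1 vs A100 (dense FP16):  {d1_vs_a100_dense:.2f}x")
print(f"D1 vs A100 (sparse FP16): {d1_vs_a100_sparse:.2f}x")
print(f"D1 FP16/CFP8 vs D1 FP32:  {d1_low_precision_speedup:.1f}x")
```

On these numbers, D1's claimed lead over the A100 holds only for dense FP16 work; once the A100's sparsity doubling is counted, the A100's peak figure is higher.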
At the silicon level, Tesla builds the chip from a grid of functional units (FUs) that are interconnected to form one large chip. Each FU contains a 64-bit CPU with a custom ISA designed for transpose, gather, broadcast, and link-traversal operations. The CPU is a superscalar design with 4-wide scalar and 2-wide vector pipelines. As the figure below shows, a large part of each FU is devoted to single instruction, multiple data (SIMD) floating-point and integer processing elements. Each FU also has its own 1.25MB scratchpad SRAM.
Each FU can perform 1 TeraFLOP of BF16 or CFP8 compute and 64 GigaFLOPs of FP32 compute, and has 512 GB/s of bandwidth in any direction in the grid. The grid is designed so that an FU can be traversed in a single clock cycle, which reduces latency and improves performance. For more details, you can watch the Tesla AI Day replay here.
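The per-FU and whole-chip figures can be cross-checked against each other. The sketch below is a back-of-the-envelope estimate, not a number published in this article: it divides the chip-level throughput by the per-FU throughput to infer roughly how many FUs the grid contains, then multiplies by the per-FU scratchpad size to estimate total on-chip SRAM. All variable names are illustrative.

```python
# Chip-level peak figures quoted in the article, expressed in GigaFLOPs.
CHIP_LOW_PRECISION_GFLOPS = 362_000  # 362 TeraFLOPs at FP16/CFP8
CHIP_FP32_GFLOPS = 22_600            # 22.6 TeraFLOPs at FP32

# Per-FU figures quoted in the article.
FU_LOW_PRECISION_GFLOPS = 1_000      # 1 TeraFLOP of BF16/CFP8
FU_FP32_GFLOPS = 64                  # 64 GigaFLOPs of FP32
FU_SRAM_MB = 1.25                    # scratchpad SRAM per FU

# Two independent estimates of the FU count (an inference, not a stated spec).
fus_from_fp32 = CHIP_FP32_GFLOPS / FU_FP32_GFLOPS
fus_from_low_precision = CHIP_LOW_PRECISION_GFLOPS / FU_LOW_PRECISION_GFLOPS

# Use the FP32-based estimate, since 64 GigaFLOPs looks like the less rounded figure.
estimated_fu_count = round(fus_from_fp32)
estimated_total_sram_mb = estimated_fu_count * FU_SRAM_MB

print(f"FU count implied by FP32 figures:      {fus_from_fp32:.1f}")
print(f"FU count implied by BF16/CFP8 figures: {fus_from_low_precision:.1f}")
print(f"Estimated total scratchpad SRAM:       {estimated_total_sram_mb:.1f} MB")
```

The two estimates differ by a few percent, which suggests the "1 TeraFLOP" per-FU figure is rounded. Either way, the quoted numbers are consistent with a grid of roughly 350 to 360 FUs.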