CDNA 2 On a PCIe Card

With each GDC and GTC occurring this week, this can be a large time for GPUs of all types. And as we speak, AMD desires to get in on the sport as effectively, with the discharge of the PCIe model of their MI200 accelerator household, the MI210.

First unveiled alongside the MI250 and MI250X again in November, when AMD initially launched the Intuition MI200 household, the MI210 is the third and remaining member of AMD’s newest technology of GPU-based accelerators. Bringing the CDNA 2 structure right into a PCIe card, the MI210 is being geared toward prospects who’re after the MI200 household’s HPC and machine studying efficiency, however want it in a standardized kind issue for mainstream servers. Total, the MI200 is being launched extensively as we speak as a part of AMD transferring the whole MI200 product stack to normal availability for OEM prospects.

AMD Intuition Accelerators
	MI250	MI210	MI100	MI50
Compute Items	2 x 104	104	120	60
Matrix Cores	2 x 416	416	480	N/A
Enhance Clock	1700MHz	1700MHz	1502MHz	1725MHz
FP64 Vector	45.3 TFLOPS	22.6 TFLOPS	11.5 TFLOPS	6.6 TFLOPS
FP32 Vector	45.3 TFLOPS	22.6 TFLOPS	23.1 TFLOPS	13.3 TFLOPS
FP64 Matrix	90.5 TFLOPS	45.3 TFLOPS	11.5 TFLOPS	6.6 TFLOPS
FP32 Matrix	90.5 TFLOPS	45.3 TFLOPS	46.1 TFLOPS	13.3 TFLOPS
FP16 Matrix	362 TFLOPS	181 TFLOPS	184.6 TFLOPS	26.5 TFLOPS
INT8 Matrix	362.1 TOPS	181 TOPS	184.6 TOPS	N/A
Reminiscence Clock	3.2 Gbps HBM2E	3.2 Gbps HBM2E	2.4 Gbps HBM2	2.0 Gbps GDDR6
Reminiscence Bus Width	8192-bit	4096-bit	4096-bit	4096-bit
Reminiscence Bandwidth	3.2TBps	1.6TBps	1.23TBps	1.02TBps
VRAM	128GB	64GB	32GB	16GB
ECC	Sure (Full)	Sure (Full)	Sure (Full)	Sure (Full)
Infinity Cloth Hyperlinks	6	3	3	N/A
CPU Coherency	No	N/A	N/A	N/A
TDP	560W	300W	300W	300W
Manufacturing Course of	TSMC N6	TSMC N6	TSMC 7nm	TSMC 7nm
Transistor Depend	2 x 29.1B	29.1B	25.6B	13.2B
Structure	CDNA 2	CDNA 2	CDNA (1)	Vega
GPU	2 x CDNA 2 GCD “Aldebaran”	CDNA 2 GCD “Aldebaran”	CDNA 1 “Arcturus”	Vega 20
Kind Issue	OAM	PCIe (4.0)	PCIe (4.0)	PCIe (4.0)
Launch Date	11/2021	03/2022	11/2020	11/2018

Beginning with a take a look at the top-line specs, the MI210 is an fascinating variant to the prevailing MI250 accelerators. Whereas these two elements had been primarily based on a pair of Aldebaran (CDNA 2) dies in an MCM configuration on a single package deal, for MI210 AMD is paring all the things again to a single die and associated {hardware}. With MI250(X) requiring 560W within the OAM kind issue, AMD primarily wanted to halve the {hardware} anyhow to get issues right down to 300W for a PCIe card. So that they’ve completed so by ditching the second on-package die.

The online result’s that the MI210 is basically half of an MI250, each with reference to bodily {hardware} and anticipated efficiency. The CNDA 2 Graphics Compute Die options the identical 104 enabled CUs as on MI250, with the chip operating on the identical peak clockspeed of 1.7GHz. So workload scalability apart, the efficiency of the MI210 is for all sensible functions half of a MI250.

That halving goes for reminiscence, as effectively. As MI250 paired 64GB of HBM2e reminiscence with every GCD – for a complete of 128GB of reminiscence – MI210 brings that right down to 64GB for the one GCD. AMD is utilizing the identical 3.2GHz HBM2e reminiscence right here, so the general reminiscence bandwidth for the chip is 1.6 TB/second.

With reference to efficiency, using a single Aldebaran die does make for some odd comparisons to AMD’s previous-generation PCIe card, the Radeon Intuition MI100. Whereas clocked increased, the marginally diminished variety of CUs relative to the MI100 signifies that for some workloads, the previous accelerator is, at the very least on paper, a bit quicker. In follow, MI210 has extra reminiscence and extra reminiscence bandwidth, so it ought to nonetheless have the efficiency edge the actual world, however it’s going to be shut. In workloads that may’t benefit from CDNA 2’s architectural enhancements, MI210 shouldn’t be going to be a step up from MI100.

All of this underscores the general similarity between the CDNA (1) and CDNA 2 architectures, and the way builders must make use of CDNA 2’s new options to get essentially the most out of the {hardware}. The place CDNA 2 shines compared to CDNA (1) is with FP64 vector workloads, FP64 matrix workloads, and packed FP32 vector workloads. All three use circumstances profit from AMD doubling the width of their ALUs to a full 64-bits huge, permitting FP64 operations to be processed at full pace. In the meantime, when FP32 operations are packed collectively to fully fill the broader ALU, then they can also profit from the brand new ALUs.

However, as we famous in our preliminary MI250 dialogue, like all packed instruction codecs, packed FP32 isn’t free. Builders and libraries should be coded to benefit from it; packed operands should be adjoining and aligned to even registers. For software program being written particularly for the structure (e.g. Frontier), that is simply sufficient completed, however extra moveable software program will want up to date to take this under consideration. And it’s for that motive that AMD properly nonetheless advertises its FP32 vector efficiency at full charge (22.6 TFLOPS), moderately than assuming using packed directions.

The launch of the MI210 additionally marks the introduction of AMD’s improved matrix cores right into a PCIe card. For CDNA 2, they’ve been expanded to permit full-speed FP64 matrix operation, bringing them as much as the identical 256 FLOPS charge as FP32 matrix operations, a 4x enchancment over the previous 64 FLOPS/clock/CU charge.

AMD GPU Throughput Charges (FLOPS/clock/CU)
	CDNA 2	CDNA (1)	Vega 20
FP64 Vector	128	64	64
FP32 Vector	128	128	128
Packed FP32 Vector	256	N/A	N/A
FP64 Matrix	256	64	64
FP32 Matrix	256	256	128
FP16 Matrix	1024	1024	256
BF16 Matrix	1024	512	N/A
INT8 Matrix	1024	1024	N/A

Transferring on, the PCIe format MI210 additionally will get a trio of Infinity Cloth 3.0 hyperlinks alongside the highest of the cardboard, identical to the MI100. This permits an MI210 card to be linked up with one or three different playing cards, forming a 2 or 4-way cluster of playing cards. In the meantime, backhaul to the CPU or every other PCIe gadgets is supplied through a PCIe 4.0 x16 connection, which is being powered by one of many versatile IF hyperlinks from the GCD.

As beforehand talked about, the TDP for the MI210 is ready at 300W, the identical degree because the MI100 and MI50 earlier than it – and primarily the restrict for a PCIe server card. Like most server accelerators, that is absolutely passive twin slot card design, counting on vital airflow from the server chassis to maintain issues cool. The GPU itself is powered by a mixture of the PCIe slot and an 8 pin, EPS12V connector on the rear of the cardboard.

In any other case, regardless of the change in kind components, AMD goes after a lot the identical market with MI210 as they’ve MI250(X). Which is to say HPC customers who particularly want a quick FP64 accelerator. Due to its heritage as a chip designed at the start for supercomputers (i.e. Frontier), the MI200 household presently stands alone in its FP64 vector and FP64 matrix efficiency, as rival GPUs have centered as an alternative on enhancing efficiency on the decrease precisions utilized in most business/non-scientific workloads. Although even at decrease precisions, the MI200 household is nothing to sneeze at with tis 1024 FLOPS-per-CU charge on FP16 and BF16 matrix operations.

Wrapping issues up, MI210 is slated to turn out to be accessible as we speak from AMD’s common server companions, together with ASUS, Dell, Supermicro, HPE, and Lenovo. These distributors at the moment are additionally providing servers primarily based on AMD’s MI250(X) accelerators, so AMD’s extra mainstream prospects can have entry to programs primarily based on AMD’s full lineup of MI200 accelerators.

Source link