Tesla details how it finds punishing defective cores on its million-core Dojo supercomputers — a single error can ruin a weeks-long AI training run

Tesla details how it finds punishing defective cores on its million-core Dojo supercomputers — a single error can ruin a weeks-long AI training run


Detecting malfunctioning cores and disabling them on a massive processor is challenging, but Tesla has developed its Stress tool, which can detect cores prone to silent data corruption across not only Dojo processors but also across Dojo clusters with millions of cores, all without taking them offline. This is an incredibly important capability, as Tesla says a single silent data error can destroy an entire training run that takes weeks to complete.

Tesla’s Dojo is one of the two largest processors that currently exist on the planet. These massive wafer-scale chips use a whole 300-mm wafer, meaning it simply isn’t possible to create a larger chunk of computing power in one go. Each Dojo wafer-scale processor packs up to 8,850 cores, but some of them may induce silent data corruptions (SDCs) after deployment, corrupting the results of extensive training runs. 

A big processor



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *