
Tesla details how it finds defective cores on its million-core Dojo supercomputers: a single error can ruin a weeks-long AI training run
Detecting and disabling malfunctioning cores on a massive processor is challenging, but Tesla has developed its Stress tool, which can detect cores prone to silent data corruption not only within individual Dojo processors but also across Dojo clusters with millions of cores, all without taking them offline. This is an important capability, as Tesla says a…