Tesla details how it finds punishing defective cores on its million-core Dojo supercomputers — a single error can ruin a weeks-long AI training run

Detecting and disabling malfunctioning cores on a massive processor is challenging, but Tesla has developed its Stress tool, which can detect cores prone to silent data corruption not only across Dojo processors but also across Dojo clusters with millions of cores, all without taking them offline. This is an incredibly important capability, as Tesla says a…

Error assessment and mitigation of an innovative data acquisition front end

The recent design idea (DI) “Negative time-constant and PWM program a versatile ADC front end” disclosed an inventive programmable gain amplifier with integral samples-and-holds. The circuit schematic from the DI appears in Figure 1. Briefly, a PWM signal controls the switches shown. In the X0 positions, a differential signal connected to the inputs of op…
