Meta trained one of its AI models, called Llama 3, in 2024 and published the results in a widely covered paper. During a 54-day period of pre-training, Llama 3 experienced 466 job interruptions, 419 of which were unexpected. Upon further investigation, Meta learned that 78% of the unexpected interruptions were caused by hardware issues such as GPU and host component failures.
Hardware issues like these don’t just cause job interruptions. They can also lead to silent data corruption (SDC), causing unwanted data loss or inaccuracies that often go undetected for extended periods.
While Meta’s pre-training interruptions were unexpected, they shouldn’t be entirely surprising. AI models like Llama 3 have massive processing demands that require colossal computing clusters. For training alone, AI workloads can require hundreds of thousands of nodes and associated GPUs working in unison for weeks or months at a time.
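To put the Llama 3 figures quoted above in perspective, the short calculation below derives how often an unexpected interruption occurred on average. It is only arithmetic on the numbers in this article, and it assumes the 78% hardware share applies to the 419 unexpected interruptions rather than to all 466.

```python
# Figures quoted in the article; everything derived below is simple arithmetic.
TOTAL_INTERRUPTIONS = 466
UNEXPECTED_INTERRUPTIONS = 419
HARDWARE_SHARE = 0.78  # assumption: share of the *unexpected* interruptions
WINDOW_DAYS = 54

planned = TOTAL_INTERRUPTIONS - UNEXPECTED_INTERRUPTIONS
hardware_related = round(UNEXPECTED_INTERRUPTIONS * HARDWARE_SHARE)
hours_between_unexpected = WINDOW_DAYS * 24 / UNEXPECTED_INTERRUPTIONS

print(f"planned interruptions: {planned}")
print(f"hardware-related interruptions: ~{hardware_related}")
print(f"mean time between unexpected interruptions: {hours_between_unexpected:.1f} h")
```

Under that reading, roughly 327 interruptions trace back to hardware, and the job was hit by an unexpected interruption about every three hours on average.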
The intensity and scale of AI processing and switching create a tremendous amount of heat, voltage fluctuations and noise, all of which place unprecedented stress on computational hardware. The GPUs and underlying silicon can degrade more rapidly than they would under normal (or what used to be normal) conditions. Performance and reliability wane accordingly.
This is especially true for sub-5 nm process technologies, where silicon degradation and faulty behavior are observed both at manufacturing and in the field.
But what can be done about it? How can unanticipated interruptions and SDC be mitigated? And how can chip design teams ensure optimal performance and reliability as the industry pushes forward with newer, bigger AI workloads that demand even more processing capacity and scale?
Ensuring silicon reliability, availability and serviceability (RAS)
Certain AI players like Meta have established monitoring and diagnostics capabilities to improve the availability and reliability of their computing environments. But with processing demands, hardware failures and SDC issues on the rise, there is a distinct need for test and telemetry capabilities at deeper levels—all the way down to the silicon and multi-die packages within each XPU/GPU as well as the interconnects that bring them together.
The key is silicon lifecycle management (SLM) solutions that help ensure end-to-end RAS, from design and manufacturing to bring-up and in-field operation.
With better visibility, monitoring, and diagnostics at the silicon level, design teams can:
- Gain telemetry-based insights into why chips are failing or why SDC is occurring.
- Identify voltage or timing degradation, overheating, and mechanical failures in silicon components, multi-die packages, and high-speed interconnects.
- Conduct more precise thermal and power characterization for AI workloads.
- Detect, characterize, and resolve radiation, voltage noise, and mechanism failures that can lead to undetected bit flips and SDC.
- Improve silicon yield, quality, and in-field RAS.
- Implement reliability-focused techniques, such as triple modular redundancy and dual-core lockstep, during the register-transfer level (RTL) design phase to mitigate SDC.
- Establish an accurate pre-silicon aging simulation methodology to detect sensitive or vulnerable circuits and replace them with aging-resilient circuits.
- Improve outlier detection on reliability models, which helps minimize in-field SDC.
Silicon lifecycle management (SLM) solutions help ensure end-to-end reliability, availability, and serviceability. Source: Synopsys
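The two redundancy techniques named in the list above, triple modular redundancy (TMR) and dual-core lockstep, can be illustrated with a small behavioral model. This is a minimal software sketch, not RTL and not any vendor's implementation; the function names and the test value are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit matches at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def lockstep_mismatch(primary: int, shadow: int) -> bool:
    """Dual-core lockstep check: flag any disagreement between the two copies."""
    return primary != shadow


golden = 0b1011_0010
flipped = golden ^ (1 << 5)  # simulate a single bit flip in one replica

# TMR masks the fault: the two healthy replicas outvote the corrupted one.
voted = majority_vote(golden, flipped, golden)
print(f"TMR output correct: {voted == golden}")

# Lockstep detects the fault but cannot tell which copy is wrong.
print(f"Lockstep mismatch detected: {lockstep_mismatch(golden, flipped)}")
```

The trade-off shows up directly: TMR costs three copies but corrects a single faulty replica, while lockstep costs two copies and only detects the disagreement, leaving recovery to a retry or reset.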
An SLM design example
SLM IP and analytics solutions help improve silicon health and provide operational metrics at each phase of the system lifecycle. This includes environmental monitoring for understanding and optimizing silicon performance based on the operating environment of the device; structural monitoring to identify performance variations from design to in-field operation; and functional monitoring to track the health and anomalies of critical device functions.
Below are the key features and capabilities that SLM IP provides:
- Process, voltage and temperature monitors
  - Help ensure optimal operation while maximizing performance, power, and reliability.
  - Provide highly accurate, distributed monitoring throughout the die, enabling thermal management via frequency throttling.
- Path margin monitors
  - Measure timing margin of 1000+ synthetic and functional paths (in-test and in-field).
  - Enable silicon performance optimization based on actual margins.
  - Automate path selection, IP insertion, and scan generation.
- Clock and delay monitors
  - Measure the delay between the edges of one or more signals.
  - Check the quality of the clock duty cycle.
  - Track memory read access time with built-in self-test (BIST).
  - Characterize digital delay lines.
- UCIe monitor, test and repair
  - Monitor signal integrity of die-to-die UCIe lane(s).
  - Generate algorithmic BIST patterns to detect interconnect fault types, including lane-to-lane crosstalk.
  - Perform cumulative lane repair with redundancy allocation (at manufacturing and in-field).
- High-speed access and test
  - Enable testing over functional interfaces (PCIe, USB and SPI).
  - Support in-field operation as well as wafer sort, final test, and system-level test.
  - Work in conjunction with automated test equipment (ATE).
  - Help conduct in-field remote diagnoses and lower-cost test via reduced pin count.
- HBM external test and repair
  - Comprehensive, silicon-proven DRAM stack test, repair and diagnostics engine.
  - Support third-party HBM DRAM stack providers.
  - Provide high-performance die-to-die interconnect test and repair support.
  - Operate in conjunction with HBM PHY and support a range of HBM protocols and configurations.
- SLM hierarchical subsystem
  - Automated hierarchical SLM and test manageability solution for system-on-chips (SoCs).
  - Automated integration and access of all IP/cores with in-system scheduling.
  - Pre-validated, ready-to-use ATE patterns with pattern porting.
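Of the capabilities listed above, cumulative lane repair with redundancy allocation is perhaps the easiest to picture in code. The sketch below is a toy model under stated assumptions: a link has a fixed number of data lanes plus spares, faults found at manufacturing and later in the field accumulate, and logical lanes are remapped onto whatever healthy physical lanes remain. The class name, method names, and remapping policy are hypothetical and are not taken from the UCIe specification or any SLM product.

```python
class LaneRepairMap:
    """Toy model of cumulative lane repair with spare lanes (hypothetical policy)."""

    def __init__(self, data_lanes: int, spare_lanes: int) -> None:
        self.data_lanes = data_lanes
        self.physical_lanes = data_lanes + spare_lanes
        self.faulty: set[int] = set()

    def mark_faulty(self, physical_lane: int) -> bool:
        """Record a faulty physical lane; return False if spares are exhausted."""
        if not 0 <= physical_lane < self.physical_lanes:
            raise ValueError(f"no such physical lane: {physical_lane}")
        candidate = self.faulty | {physical_lane}
        if self.physical_lanes - len(candidate) < self.data_lanes:
            return False  # not enough healthy lanes left; link is unrepairable
        self.faulty = candidate
        return True

    def mapping(self) -> list[int]:
        """Logical lane i -> physical lane, skipping every lane marked faulty."""
        healthy = [p for p in range(self.physical_lanes) if p not in self.faulty]
        return healthy[: self.data_lanes]


link = LaneRepairMap(data_lanes=4, spare_lanes=2)
link.mark_faulty(1)          # fault found at manufacturing test
print(link.mapping())        # [0, 2, 3, 4]
link.mark_faulty(3)          # second fault found later, in the field
print(link.mapping())        # [0, 2, 4, 5]
print(link.mark_faulty(0))   # False: a third fault exceeds the two spares
```

Because the set of faulty lanes is kept across calls, a repair made at manufacturing is still honored when a new fault appears in the field, which is the "cumulative" part of the scheme.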
Silicon test and telemetry in the age of AI
With the scale and processing demands of AI devices and workloads on the rise, system reliability, silicon health and SDC issues are becoming more widespread. While there is no single solution or antidote for avoiding these issues, deeper and more comprehensive test, repair, and telemetry—at the silicon level—can help mitigate them. The ability to detect or predict in-field chip degradation is particularly valuable, enabling corrective action before sudden or catastrophic system failures occur.
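As one illustration of predicting in-field degradation, the sketch below fits a straight line to periodic timing-margin readings and extrapolates when the margin would cross a minimum threshold. The function name, the sample readings, and the threshold are all hypothetical, and the linear trend is a simplifying assumption: real aging behavior is often non-linear, so a production model would likely look different.

```python
def hours_until_threshold(hours, margins_ps, threshold_ps):
    """Least-squares line through (hours, margin) samples, extrapolated to the threshold.

    Returns the projected hours remaining after the last sample, or None if
    the margin is not trending downward.
    """
    n = len(hours)
    if n < 2 or n != len(margins_ps):
        raise ValueError("need at least two paired samples")
    mean_t = sum(hours) / n
    mean_m = sum(margins_ps) / n
    sxx = sum((t - mean_t) ** 2 for t in hours)
    sxy = sum((t - mean_t) * (m - mean_m) for t, m in zip(hours, margins_ps))
    slope = sxy / sxx  # picoseconds of margin gained (or lost) per hour
    if slope >= 0:
        return None
    intercept = mean_m - slope * mean_t
    crossing = (threshold_ps - intercept) / slope
    return max(0.0, crossing - hours[-1])


# Hypothetical readings: margin shrinking by 2 ps every 1,000 operating hours.
sample_hours = [0, 1000, 2000, 3000, 4000]
sample_margins_ps = [50.0, 48.0, 46.0, 44.0, 42.0]

remaining = hours_until_threshold(sample_hours, sample_margins_ps, threshold_ps=30.0)
print(f"projected hours until margin reaches 30 ps: {remaining:.0f}")
```

With these made-up readings the projection is about 6,000 hours, which is the kind of lead time that allows a part to be derated or swapped out before it fails.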
Delivering end-to-end visibility through RAS, silicon test, repair, and telemetry will be increasingly important as we move toward the age of AI.
Shankar Krishnamoorthy is chief product development officer at Synopsys.
Krishna Adusumalli is an R&D engineer at Synopsys.
Jyotika Athavale is architecture engineering director at Synopsys.
Yervant Zorian is chief architect at Synopsys.
The post Addressing hardware failures and silent data corruption in AI chips appeared first on EDN.