failures

Learning how to predict rare kinds of failures
On Dec. 21, 2022, just as peak holiday season travel was getting underway, Southwest Airlines went through a cascading series of failures in their scheduling, initially triggered by severe winter weather in the Denver area. But the problems spread through their network, and over the course of the next 10 days the crisis ended up…

Addressing hardware failures and silent data corruption in AI chips
Meta trained one of its AI models, called Llama 3, in 2024 and published the results in a widely covered paper. During a 54-day period of pre-training, Llama 3 experienced 466 job interruptions, 419 of which were unexpected. Upon further investigation, Meta learned 78% of those hiccups were caused by hardware issues such as GPU…