failures

Learning how to predict rare kinds of failures

On Dec. 21, 2022, just as peak holiday season travel was getting underway, Southwest Airlines went through a cascading series of failures in their scheduling, initially triggered by severe winter weather in the Denver area. But the problems spread through their network, and over the course of the next 10 days the crisis ended up…

Addressing hardware failures and silent data corruption in AI chips

ellonjohns, 5 months ago, 10 mins

Meta trained one of its AI models, Llama 3, in 2024 and published the results in a widely covered paper. During a 54-day period of pre-training, Llama 3 experienced 466 job interruptions, 419 of which were unexpected. Upon further investigation, Meta found that 78% of those unexpected interruptions were caused by hardware issues such as GPU…
