REST: A Stress-Testing Framework for Evaluating Multi-Problem Reasoning in Large Reasoning Models

Large Reasoning Models (LRMs) have advanced rapidly, showing impressive performance on complex problem-solving tasks across domains such as mathematics, coding, and scientific reasoning. However, current evaluation approaches rely primarily on single-question testing, which has significant limitations. This article introduces REST (Reasoning Evaluation through Simultaneous Testing), a novel multi-problem stress-testing framework designed to push LRMs beyond isolated problem-solving…
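The excerpt does not include implementation details, but the core idea of simultaneous multi-problem testing can be sketched in a few lines. The prompt wording and the build_multi_problem_prompt helper below are illustrative assumptions, not the framework's actual code:

```python
# Minimal sketch of the multi-problem "stress test" idea: several benchmark
# questions are concatenated into one prompt so the model must reason about
# all of them in a single pass. Names and prompt format are hypothetical.

def build_multi_problem_prompt(questions: list[str]) -> str:
    """Concatenate several questions into a single prompt (assumed format)."""
    header = "Solve each of the following problems. Answer them in order.\n\n"
    body = "\n\n".join(f"Problem {i + 1}: {q}" for i, q in enumerate(questions))
    return header + body


if __name__ == "__main__":
    sample = [
        "What is 17 * 24?",
        "Factor x^2 - 5x + 6.",
        "How many primes are smaller than 20?",
    ]
    print(build_multi_problem_prompt(sample))
```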

Evaluating social and ethical risks from generative AI

Introducing a context-based framework for comprehensively evaluating the social and ethical risks of AI systems. Generative AI systems are already being used to write books, create graphic designs, and assist medical practitioners, and they are becoming increasingly capable. Ensuring these systems are developed and deployed responsibly requires carefully evaluating the potential ethical and social risks they may…

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

Responsibility & Safety Published 17 December 2024 Authors FACTS team Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations Large language models (LLMs) are transforming how we access information, yet their grip on factual accuracy remains imperfect. They can “hallucinate”…
