REST: A Stress-Testing Framework for Evaluating Multi-Problem Reasoning in Large Reasoning Models

Large Reasoning Models (LRMs) have advanced rapidly, showing impressive performance on complex problem-solving tasks across domains such as mathematics, coding, and scientific reasoning. However, current evaluation approaches rely primarily on single-question testing, which has significant limitations. This article introduces REST (Reasoning Evaluation through Simultaneous Testing), a novel multi-problem stress-testing framework designed to push LRMs beyond isolated problem-solving…
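The excerpt does not include implementation details, but the core idea of simultaneous multi-problem testing can be sketched in a few lines. The prompt wording and the build_multi_problem_prompt helper below are illustrative assumptions, not the framework's actual code:

```python
# Minimal sketch of the multi-problem "stress test" idea: several benchmark
# questions are concatenated into one prompt so the model must reason about
# all of them in a single pass. Names and prompt format are hypothetical.

def build_multi_problem_prompt(questions: list[str]) -> str:
    """Concatenate several questions into a single prompt (assumed format)."""
    header = "Solve each of the following problems. Answer them in order.\n\n"
    body = "\n\n".join(f"Problem {i + 1}: {q}" for i, q in enumerate(questions))
    return header + body


if __name__ == "__main__":
    sample = [
        "What is 17 * 24?",
        "Factor x^2 - 5x + 6.",
        "How many primes are smaller than 20?",
    ]
    print(build_multi_problem_prompt(sample))
```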

Evaluating social and ethical risks from generative AI

Introducing a context-based framework for comprehensively evaluating the social and ethical risks of AI systems. Generative AI systems are already being used to write books, create graphic designs, and assist medical practitioners, and they are becoming increasingly capable. Ensuring these systems are developed and deployed responsibly requires carefully evaluating the potential ethical and social risks they may…

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

Responsibility & Safety Published 17 December 2024 Authors FACTS team Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations Large language models (LLMs) are transforming how we access information, yet their grip on factual accuracy remains imperfect. They can “hallucinate”…
