benchmark

Researchers from FutureHouse and ScienceMachine Introduce BixBench: A Benchmark Designed to Evaluate AI Agents on Real-World Bioinformatics Tasks
Modern bioinformatics research is characterized by the constant emergence of complex data sources and analytical challenges. Researchers routinely confront tasks that require the synthesis of diverse datasets, the execution of iterative analyses, and the interpretation of subtle biological signals. High-throughput sequencing, multi-dimensional imaging, and other advanced data collection techniques contribute to an environment where traditional,…

iPhone 16e Vs Android Phone Benchmark Showdown: Testing Apple’s Mettle
Apple recently put its budget-minded iPhone SE lineup out to pasture while simultaneously raising the floor of its entry-level smartphone pricing from the historical $429 to a loftier $599. The new baseline is the iPhone 16e, and it also marks the first time an iPhone has hit the market with the company’s own 5G…

Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations
Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter particularly when they are given more complex tasks and when users are looking for specific and highly detailed responses. It’s a challenge…

Are We Ready for Multi-Image Reasoning? Launching VHs: The Visual Haystacks Benchmark!
Humans excel at processing vast arrays of visual information, a skill that is crucial for achieving artificial general intelligence (AGI). Over the decades, AI researchers have developed Visual Question Answering (VQA) systems to interpret scenes within single images and answer related questions. While recent advancements in foundation models have significantly closed the gap between human…

How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark
When we began studying jailbreak evaluations, we found a fascinating paper claiming that you could jailbreak frontier LLMs simply by translating forbidden prompts into obscure languages. Excited by this result, we attempted to reproduce it and found something unexpected.

FACTS Grounding: A new benchmark for evaluating the factuality of large language models
Responsibility & Safety Published 17 December 2024 Authors FACTS team Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations Large language models (LLMs) are transforming how we access information, yet their grip on factual accuracy remains imperfect. They can “hallucinate”…