Reasoning

REST: A Stress-Testing Framework for Evaluating Multi-Problem Reasoning in Large Reasoning Models

Large Reasoning Models (LRMs) have rapidly advanced, exhibiting impressive performance in complex problem-solving tasks across domains like mathematics, coding, and scientific reasoning. However, current evaluation approaches primarily focus on single-question testing, which reveals significant limitations. This article introduces REST (Reasoning Evaluation through Simultaneous Testing) — a novel multi-problem stress-testing framework designed to push LRMs beyond isolated problem-solving…

Moonshot AI Releases Kimi K2: A Trillion-Parameter MoE Model Focused on Long Context, Code, Reasoning, and Agentic Behavior

ellonjohns2 weeks ago08 mins

Kimi K2, launched by Moonshot AI in July 2025, is a purpose-built, open-source Mixture-of-Experts (MoE) model—1 trillion total parameters, with 32 billion active parameters per token. It’s trained using the custom MuonClip optimizer on 15.5 trillion tokens, achieving stable training at this unprecedented scale without the typical instabilities seen in ultra-large models. Unlike traditional chatbots, K2 is architected…

Thought Anchors: A Machine Learning Framework for Identifying and Measuring Key Reasoning Steps in Large Language Models with Precision

ellonjohns3 weeks ago09 mins

Understanding the Limits of Current Interpretability Tools in LLMs AI models, such as DeepSeek and GPT variants, rely on billions of parameters working together to handle complex reasoning tasks. Despite their capabilities, one major challenge is understanding which parts of their reasoning have the greatest influence on the final output. This is especially crucial for…

Do reasoning models really “think” or not? Apple research sparks lively debate, response

ellonjohns1 month ago020 mins

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more Apple’s machine-learning group set off a rhetorical firestorm earlier this month with its release of “The Illusion of Thinking,” a 53-page research paper arguing that so-called large reasoning models (LRMs) or reasoning…

ether0: A 24B LLM Trained with Reinforcement Learning RL for Advanced Chemical Reasoning Tasks

ellonjohns2 months ago07 mins

LLMs primarily enhance accuracy through scaling pre-training data and computing resources. However, the attention has shifted towards alternate scaling due to finite data availability. This includes test-time training and inference compute scaling. Reasoning models enhance performance by emitting thought processes before answers, initially through CoT prompting. Recently, reinforcement learning (RL) post-training has been used. Scientific…

DeepSeek-Prover-V2: Bridging the Gap Between Informal and Formal Mathematical Reasoning

ellonjohns3 months ago010 mins

While DeepSeek-R1 has significantly advanced AI’s capabilities in informal reasoning, formal mathematical reasoning has remained a challenging task for AI. This is primarily because producing verifiable mathematical proof requires both deep conceptual understanding and the ability to construct precise, step-by-step logical arguments. Recently, however, significant advancement is made in this direction as researchers at DeepSeek-AI…

LLMs Can Now Reason in Parallel: UC Berkeley and UCSF Researchers Introduce Adaptive Parallel Reasoning to Scale Inference Efficiently Without Exceeding Context Windows

ellonjohns3 months ago012 mins

Large language models (LLMs) have made significant strides in reasoning capabilities, exemplified by breakthrough systems like OpenAI o1 and DeepSeekR1, which utilize test-time compute for search and reinforcement learning to optimize performance. Despite this progress, current methodologies face critical challenges that impede their effectiveness. Serialized chain-of-thought approaches generate excessively long output sequences, increasing latency and…

Now it’s TikTok parent ByteDance’s turn for a reasoning AI: enter Seed-Thinking-v1.5!

ellonjohns4 months ago014 mins

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More It started with the announcement of OpenAI’s o1 model in Sept. 2024, but really took off with the DeepSeek R1 release in Jan. 2025. Now, it seems that most major AI model providers and trainers are…

TxAgent: An AI Agent that Delivers Evidence-Grounded Treatment Recommendations by Combining Multi-Step Reasoning with Real-Time Biomedical Tool Integration

ellonjohns4 months ago07 mins

Precision therapy has emerged as a critical approach in healthcare, tailoring treatments to individual patient profiles to optimise outcomes while reducing risks. However, determining the appropriate medication involves a complex analysis of numerous factors: patient characteristics, comorbidities, potential drug interactions, contraindications, current clinical guidelines, drug mechanisms, and disease biology. While Large Language Models (LLMs) have…

This AI Paper Introduces R1-Onevision: A Cross-Modal Formalization Model for Advancing Multimodal Reasoning and Structured Visual Interpretation

ellonjohns4 months ago08 mins

Multimodal reasoning is an evolving field that integrates visual and textual data to enhance machine intelligence. Traditional artificial intelligence models excel at processing either text or images but often struggle when required to reason across both formats. Analyzing charts, graphs, mathematical symbols, and complex visual patterns alongside textual descriptions is crucial for applications in education,…

Highlights

Proton VPN review 2025: A nonprofit service with premium performance

Trump’s Anti-Bias AI Order Is Just More Bias

On-Premise vs SaaS Data Annotation Platforms Compared

The dream of a Raspberry Pi laptop becomes a reality — ArgonOne Up Review

Category Collection