REST: A Stress-Testing Framework for Evaluating Multi-Problem Reasoning in Large Reasoning Models

Large Reasoning Models (LRMs) have advanced rapidly, exhibiting impressive performance on complex problem-solving tasks in domains such as mathematics, coding, and scientific reasoning. However, current evaluation approaches focus primarily on single-question testing, which has significant limitations. This article introduces REST (Reasoning Evaluation through Simultaneous Testing), a novel multi-problem stress-testing framework designed to push LRMs beyond isolated problem-solving…

Moonshot AI Releases Kimi K2: A Trillion-Parameter MoE Model Focused on Long Context, Code, Reasoning, and Agentic Behavior

Kimi K2, launched by Moonshot AI in July 2025, is a purpose-built, open-source Mixture-of-Experts (MoE) model with 1 trillion total parameters and 32 billion active parameters per token. It was trained with the custom MuonClip optimizer on 15.5 trillion tokens, achieving stable training at this unprecedented scale without the instabilities typically seen in ultra-large models. Unlike traditional chatbots, K2 is architected…

Thought Anchors: A Machine Learning Framework for Identifying and Measuring Key Reasoning Steps in Large Language Models with Precision

Understanding the Limits of Current Interpretability Tools in LLMs: AI models, such as DeepSeek and GPT variants, rely on billions of parameters working together to handle complex reasoning tasks. Despite their capabilities, one major challenge is understanding which parts of their reasoning have the greatest influence on the final output. This is especially crucial for…

Do reasoning models really “think” or not? Apple research sparks lively debate, response

Apple’s machine-learning group set off a rhetorical firestorm earlier this month with its release of “The Illusion of Thinking,” a 53-page research paper arguing that so-called large reasoning models (LRMs) or reasoning…

ether0: A 24B LLM Trained with Reinforcement Learning (RL) for Advanced Chemical Reasoning Tasks

LLMs have primarily improved accuracy by scaling pre-training data and computing resources. However, because data availability is finite, attention has shifted toward alternative forms of scaling, including test-time training and inference-time compute scaling. Reasoning models improve performance by emitting thought processes before answers, initially through CoT prompting. Recently, reinforcement learning (RL) post-training has been used. Scientific…

DeepSeek-Prover-V2: Bridging the Gap Between Informal and Formal Mathematical Reasoning

While DeepSeek-R1 has significantly advanced AI’s capabilities in informal reasoning, formal mathematical reasoning has remained a challenging task for AI. This is primarily because producing verifiable mathematical proofs requires both deep conceptual understanding and the ability to construct precise, step-by-step logical arguments. Recently, however, significant progress has been made in this direction as researchers at DeepSeek-AI…

LLMs Can Now Reason in Parallel: UC Berkeley and UCSF Researchers Introduce Adaptive Parallel Reasoning to Scale Inference Efficiently Without Exceeding Context Windows

Large language models (LLMs) have made significant strides in reasoning capabilities, exemplified by breakthrough systems like OpenAI o1 and DeepSeek-R1, which utilize test-time compute for search and reinforcement learning to optimize performance. Despite this progress, current methodologies face critical challenges that impede their effectiveness. Serialized chain-of-thought approaches generate excessively long output sequences, increasing latency and…

TxAgent: An AI Agent that Delivers Evidence-Grounded Treatment Recommendations by Combining Multi-Step Reasoning with Real-Time Biomedical Tool Integration

Precision therapy has emerged as a critical approach in healthcare, tailoring treatments to individual patient profiles to optimise outcomes while reducing risks. However, determining the appropriate medication involves a complex analysis of numerous factors: patient characteristics, comorbidities, potential drug interactions, contraindications, current clinical guidelines, drug mechanisms, and disease biology. While Large Language Models (LLMs) have…

This AI Paper Introduces R1-Onevision: A Cross-Modal Formalization Model for Advancing Multimodal Reasoning and Structured Visual Interpretation

Multimodal reasoning is an evolving field that integrates visual and textual data to enhance machine intelligence. Traditional artificial intelligence models excel at processing either text or images but often struggle when required to reason across both formats. Analyzing charts, graphs, mathematical symbols, and complex visual patterns alongside textual descriptions is crucial for applications in education,…
