
Large language models

3 Questions: The pros and cons of synthetic data in AI
Synthetic data are artificially generated by algorithms to mimic the statistical properties of actual data, without containing any information from real-world sources. While concrete numbers are hard to pin down, some estimates suggest that more than 60 percent of data used for AI applications in 2024 was synthetic, and this figure is expected to grow…
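To make the definition concrete, here is a minimal sketch of the idea in Python (the dataset is invented for the example, and real synthetic-data generators such as GANs, diffusion models, or LLM-based pipelines are far more sophisticated): fit summary statistics on "real" records, then sample brand-new rows from the fitted distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for a real dataset: 1,000 records with 3 numeric features.
real = rng.normal(loc=[50.0, 3.2, 0.7], scale=[10.0, 1.1, 0.2], size=(1000, 3))

# Fit the empirical mean and covariance, then draw brand-new rows from the
# fitted distribution. The synthetic rows match the originals' statistics
# but contain none of the original records.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mu, cov, size=1000)

print(np.round(real.mean(axis=0), 2), np.round(synthetic.mean(axis=0), 2))
```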

Researchers glimpse the inner workings of protein language models
Within the past few years, models that can predict the structure or function of proteins have been widely used for a variety of biological applications, such as identifying drug targets and designing new therapeutic antibodies. These models, which are based on large language models (LLMs), can make very accurate predictions of a protein’s suitability for…

Unpacking the bias of large language models
Research has shown that large language models (LLMs) tend to overemphasize information at the beginning and end of a document or conversation, while neglecting the middle. This “position bias” means that, if a lawyer is using an LLM-powered virtual assistant to retrieve a certain phrase in a 30-page affidavit, the LLM is more likely to…
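Position bias is straightforward to measure with a "needle in a haystack" probe: plant one target fact at varying depths in a long context and check whether the model retrieves it. A minimal sketch, assuming an `ask` callable that you wire to whatever LLM you are testing (both the filler text and the callable are placeholders, not part of the MIT study):

```python
FILLER = [f"Background sentence number {i}." for i in range(300)]
NEEDLE = "The secret passphrase is 'violet meridian'."
QUESTION = "What is the secret passphrase?"

def build_prompt(depth: float) -> str:
    """Place the needle at a relative depth in the context
    (0.0 = very start, 1.0 = very end)."""
    idx = int(depth * len(FILLER))
    context = "\n".join(FILLER[:idx] + [NEEDLE] + FILLER[idx:])
    return f"{context}\n\nQuestion: {QUESTION}\nAnswer:"

def measure_position_bias(ask) -> dict[float, bool]:
    """`ask` is any callable that sends a prompt to an LLM and returns its
    answer as a string. A strongly position-biased model tends to fail at
    the middle depths even when it succeeds at 0.0 and 1.0."""
    return {
        depth: "violet meridian" in ask(build_prompt(depth)).lower()
        for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
    }
```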

Making AI-generated code more accurate in any language
Programmers can now use large language models (LLMs) to generate computer code more quickly. However, this only makes programmers’ lives easier if that code follows the rules of the programming language and doesn’t cause a computer to crash. Some methods exist for ensuring LLMs conform to the rules of whatever language they are generating text…
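One family of such methods is grammar-constrained decoding: at each generation step, any token that could not extend into syntactically valid output is masked out before sampling. Below is a minimal sketch of the masking idea over a toy expression language; the vocabulary and prefix check are invented for illustration, and real systems compile a full grammar against the model's actual tokenizer.

```python
import re

# Toy target language: integer arithmetic such as "12+(3*4)".
VOCAB = list("0123456789+-*/()")

def is_valid_prefix(text: str) -> bool:
    """Cheap check that `text` can still grow into a legal expression:
    only legal characters, and never more ')' than '(' so far."""
    if re.fullmatch(r"[0-9+\-*/()]*", text) is None:
        return False
    depth = 0
    for ch in text:
        depth += ch == "("
        depth -= ch == ")"
        if depth < 0:
            return False
    return True

def allowed_next_tokens(prefix: str) -> list[str]:
    """Keep only tokens whose addition leaves a valid prefix; a real
    decoder would mask the logits of everything else before sampling."""
    return [t for t in VOCAB if is_valid_prefix(prefix + t)]

print(allowed_next_tokens("12+(3"))  # ')' is allowed here, for example
```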

Like human brains, large language models reason about diverse data in a general way
While early language models could only process text, contemporary large language models now perform highly diverse tasks on different types of data. For instance, LLMs can understand many languages, generate computer code, solve math problems, or answer questions about images and audio. MIT researchers probed the inner workings of LLMs to better understand how they…

Reinforcement Learning Meets Chain-of-Thought: Transforming LLMs into Autonomous Reasoning Agents
Large Language Models (LLMs) have significantly advanced natural language processing (NLP), excelling at text generation, translation, and summarization tasks. However, their ability to engage in logical reasoning remains a challenge. Traditional LLMs, designed to predict the next word, rely on statistical pattern recognition rather than structured reasoning. This limits their ability to solve complex problems…
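One common recipe for combining the two ideas (a general sketch, not necessarily the specific methods this article surveys) is to sample chain-of-thought traces and assign a verifiable reward only when the trace's final answer checks out; the reward itself is simple, even though the RL training loop built around it is not.

```python
import re

def extract_final_answer(trace: str) -> str | None:
    """Pull the final answer out of a chain-of-thought trace, assuming the
    model was instructed to end with a line like 'Answer: 42'."""
    match = re.search(r"Answer:\s*(.+?)\s*$", trace)
    return match.group(1) if match else None

def correctness_reward(trace: str, gold: str) -> float:
    """Verifiable reward: 1.0 if the trace's final answer matches the
    reference, else 0.0. RL fine-tuning then pushes the model toward
    reasoning traces that earn this reward."""
    answer = extract_final_answer(trace)
    return float(answer is not None and answer == gold)
```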

Ghostbuster: Detecting Text Ghostwritten by Large Language Models
[Figure: the structure of Ghostbuster, our new state-of-the-art method for detecting AI-generated text.] Large language models like ChatGPT write impressively well. So well, in fact, that they've become a problem. Students have begun using these models to ghostwrite assignments, leading some schools to ban ChatGPT. These models are also prone to producing text with factual…

Virtual Personas for Language Models via an Anthology of Backstories
We introduce Anthology, a method for conditioning LLMs to representative, consistent, and diverse virtual personas by generating and utilizing naturalistic backstories with rich details of individual values and experience. What does it mean for large language models (LLMs) to be trained on massive text corpora, collectively produced by millions, even billions, of distinct human authors?…
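The core mechanic reduces to two steps, sketched below with a placeholder `generate` callable and placeholder prompt wording (not the paper's exact prompts): elicit an open-ended backstory, then prepend it as conditioning context when posing questions to that virtual persona.

```python
BACKSTORY_PROMPT = "Tell me about yourself."  # open-ended elicitation

def sample_backstory(generate) -> str:
    """`generate` is any callable mapping a prompt to sampled model text.
    Sampling many backstories (e.g., at high temperature) yields a diverse
    pool of virtual personas."""
    return generate(BACKSTORY_PROMPT)

def ask_as_persona(generate, backstory: str, question: str) -> str:
    """Condition the model on one backstory, then pose the question to be
    answered in that persona's voice."""
    prompt = (
        f"{BACKSTORY_PROMPT}\n{backstory}\n\n"
        f"Answering as the person described above: {question}"
    )
    return generate(prompt)
```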