A new MIT-developed method can identify and manipulate abstract concepts like biases, personalities, and moods hidden within large language models, potentially improving AI safety and performance.
A team of researchers from MIT and UC San Diego has developed a method to expose and control hidden abstract concepts within large language models (LLMs), an advance that could reshape how we understand and improve AI safety and performance.
The new approach, detailed in a paper published today in the journal Science, can identify representations of concepts ranging from personality traits and biases to moods and preferences that exist within LLMs but aren't always actively expressed in their responses.
How LLMs Hide Abstract Concepts
Models like ChatGPT, Claude, and Gemini have absorbed vast amounts of human knowledge and can express abstract concepts such as specific tones, personalities, biases, and moods. What is far less clear is how these models internally represent such abstract concepts.
"What this really says about LLMs is that they have these concepts in them, but they're not all actively exposed," explains Adityanarayanan "Adit" Radhakrishnan, assistant professor of mathematics at MIT and co-author of the study. "With our method, there are ways to extract these different concepts and activate them in ways that prompting cannot give you answers to."
A Targeted Approach to Finding Hidden Concepts
Traditional methods for discovering concepts in LLMs often rely on unsupervised learning—broadly trawling through unlabeled representations to find patterns. Radhakrishnan compares this to "going fishing with a big net, trying to catch one species of fish. You're gonna get a lot of fish that you have to look through to find the right one."
The MIT team took a more targeted approach using a type of predictive modeling algorithm called a recursive feature machine (RFM). This algorithm leverages mathematical mechanisms that neural networks implicitly use to learn features, allowing for more efficient and precise identification of concept representations.
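The paper describes the RFM in full detail. As a rough illustration of the general recipe (a kernel predictor whose feature matrix is repeatedly refit from the average gradient outer product of its own predictions), here is a minimal NumPy sketch; the Gaussian kernel, the normalization step, and the hyperparameters are simplifying assumptions rather than the authors' exact implementation. The inputs X would be internal model representations labeled by whether a concept is present, which is what makes the approach targeted rather than unsupervised.

```python
import numpy as np

def rfm_fit(X, y, n_iters=5, bandwidth=10.0, reg=1e-3):
    """Toy recursive-feature-machine-style probe (simplified sketch).

    X : (n, d) array of internal model representations
    y : (n,) array of concept labels (1 = concept present, 0 = absent)
    Returns the learned feature matrix M and kernel coefficients alpha.
    """
    n, d = X.shape
    M = np.eye(d)  # start with an isotropic metric

    for _ in range(n_iters):
        # Mahalanobis-style Gaussian kernel between all training points
        diffs = X[:, None, :] - X[None, :, :]                 # (n, n, d)
        dists = np.einsum('ijd,de,ije->ij', diffs, M, diffs)  # squared distances
        K = np.exp(-dists / (2 * bandwidth ** 2))

        # Kernel ridge regression for the concept label
        alpha = np.linalg.solve(K + reg * np.eye(n), y)

        # Gradient of the fitted predictor at each training point:
        # grad f(x_i) = -(1 / bandwidth^2) * sum_j alpha_j * K_ij * M (x_i - x_j)
        Mdiffs = np.einsum('de,ije->ijd', M, diffs)
        grads = -(1 / bandwidth ** 2) * np.einsum('j,ij,ijd->id', alpha, K, Mdiffs)

        # The average gradient outer product (AGOP) becomes the new feature
        # matrix, rescaled so the bandwidth stays meaningful (a simplification).
        M = grads.T @ grads / n
        M = M / (np.trace(M) / d + 1e-12)

    return M, alpha
```

In a sketch like this, the large entries (or top eigenvectors) of the learned matrix M point to the directions of the representation that matter most for the concept being probed.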
Testing the Method on 512 Concepts
The researchers tested their approach on 512 different concepts across five categories:
- Fears: marriage, insects, buttons
- Experts: social influencer, medievalist
- Moods: boastful, detachedly amused
- Location preferences: Boston, Kuala Lumpur
- Personas: Ada Lovelace, Neil deGrasse Tyson
They successfully identified and manipulated representations of these concepts in several of today's largest language models and vision-language models.
Steering Model Responses
Once the algorithm identifies numerical patterns associated with a concept, researchers can mathematically modulate the activity of that concept by perturbing the LLM's internal representations.
For example, when the team enhanced the "conspiracy theorist" concept in a vision-language model and prompted it to explain the origins of the famous "Blue Marble" image of Earth from Apollo 17, the model generated an answer with the tone and perspective of a conspiracy theorist.
The method can also be used to amplify or suppress traits. The researchers demonstrated "anti-refusal" by modifying a model so it would answer prompts it would normally refuse, such as giving instructions on how to rob a bank.
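To make the steering step concrete, here is a minimal sketch of one common way such a perturbation can be applied: adding a scaled concept direction to the hidden states of a single transformer layer during generation. The small Hugging Face model, the random placeholder direction, the choice of layer, and the strength value are all illustrative assumptions; the paper's exact perturbation procedure may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; layer paths differ across architectures
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Assume `concept_direction` is a unit vector in the model's hidden space
# obtained from a concept probe (here a random placeholder for illustration).
hidden_size = model.config.hidden_size
concept_direction = torch.randn(hidden_size)
concept_direction /= concept_direction.norm()

def make_steering_hook(direction, strength):
    """Forward hook that shifts a layer's hidden states along `direction`."""
    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple whose first element is the hidden states
        hidden = output[0] + strength * direction.to(output[0].dtype)
        return (hidden,) + output[1:]
    return hook

# Perturb a middle layer; which layer steers best is an empirical question.
layer = model.transformer.h[6]
handle = layer.register_forward_hook(
    make_steering_hook(concept_direction, strength=8.0))

prompt = "Explain the origins of the Blue Marble photograph of Earth."
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=60)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified model
```

Removing the hook restores the original model, so the same weights can be steered toward or away from a concept simply by changing the sign or magnitude of the strength.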
Implications for AI Safety and Performance
While the researchers acknowledge there are risks to extracting certain concepts, they see the approach primarily as a way to illuminate hidden concepts and potential vulnerabilities in LLMs that could then be adjusted to improve safety or enhance performance.
"LLMs clearly have a lot of these abstract concepts stored within them, in some representation," Radhakrishnan notes. "There are ways where, if we understand these representations well enough, we can build highly specialized LLMs that are still safe to use but really effective at certain tasks."
The team has made the method's underlying code publicly available, allowing other researchers to apply and build upon their work.
The Technical Challenge
Understanding how LLMs represent abstract concepts is particularly challenging because these models operate as "black boxes." A standard large language model processes a natural-language prompt by dividing it into individual words (tokens), each encoded mathematically as a vector of numbers. These vectors pass through a series of computational layers, producing new arrays of numbers at each step, which the model ultimately uses to predict the most likely words of its response.
The MIT team's approach trains RFMs to recognize numerical patterns in these internal representations that could be associated with specific concepts, providing a window into the model's internal workings.
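To show what those internal representations look like in code, the snippet below collects, for each prompt, the hidden state of its final token at one layer of a Hugging Face causal language model. Features gathered this way from prompts labeled with and without a concept are the kind of supervised input a probe such as an RFM could be trained on; the small model, the choice of layer, and the last-token pooling are assumptions made for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper studies much larger models
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompts = [
    "Answer as someone who is terrified of insects: what is a beetle?",
    "Answer plainly: what is a beetle?",
]
labels = [1, 0]  # 1 = concept ("fear of insects") present, 0 = absent

batch = tok(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# out.hidden_states is a tuple: (embeddings, layer 1, ..., layer N),
# each of shape (batch, sequence_length, hidden_size).
layer_idx = 6  # which layer best encodes a concept is an empirical question
last_token = batch["attention_mask"].sum(dim=1) - 1  # index of final real token
features = out.hidden_states[layer_idx][torch.arange(len(prompts)), last_token]

print(features.shape)  # (num_prompts, hidden_size): input X for a concept probe
```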
Research Support and Future Directions
This research was supported by the National Science Foundation, the Simons Foundation, the TILOS institute, and the U.S. Office of Naval Research. The findings open new avenues for both understanding and controlling AI behavior, with potential applications ranging from improving model safety to creating more specialized AI assistants tailored to specific tasks or communication styles.
The ability to identify and manipulate abstract concepts within LLMs represents a significant advance in AI interpretability and control, addressing one of the field's most pressing challenges as these models become increasingly integrated into society.
