Bad teacher bots can leave hidden marks on model students

New research warns about the dangers of teaching LLMs on the output of other models, showing that undesirable traits can be transmitted "subliminally" from teacher to student, even when they are scrubbed from training data.

The peer-reviewed study from researchers at Anthropic demonstrated that "teacher" LLMs can pass undesirable traits to "student" models, even when evidence of those traits has been scrubbed from the data passed between them.

Using LLMs to teach other models is becoming increasingly popular. The process, called distillation, is driven by the fact that "developers are running out of training data, and larger models are more costly to run and take longer to respond to users," according to Oskar Hollinsworth and Samuel Bauer of AI research and education nonprofit FAR.AI.
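In its classic form, distillation trains the student to match the teacher's output distribution. Below is a minimal sketch of that standard recipe (temperature-scaled KL matching, after Hinton et al.) in PyTorch; the tiny linear "models", random inputs, and hyperparameters are toy placeholders, not the setup in the paper, which instead fine-tunes the student on text sampled from the teacher.

```python
# Minimal sketch of classic logit distillation -- NOT Anthropic's exact setup.
# A small "student" learns to match a frozen "teacher" via a temperature-
# scaled KL divergence; everything here is a toy stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, T = 100, 32, 2.0  # toy vocab size, hidden size, softmax temperature

teacher = nn.Linear(DIM, VOCAB)  # stands in for a large, frozen model
student = nn.Linear(DIM, VOCAB)  # smaller model being distilled
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(200):
    x = torch.randn(64, DIM)  # placeholder "inputs"
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    # Student matches the teacher's softened output distribution.
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    opt.zero_grad()
    loss.backward()
    opt.step()
```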

They point out that the research, published in science journal Nature this week, uncovers an area of risk in AI development that is poorly understood.

Anthropic researcher Alex Cloud and colleagues used GPT-4.1 nano as a reference model, prompting a "teacher" instance to prefer specific animals or trees. They then fine-tuned a "student" model on nothing but sequences of numbers generated by that teacher.
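As a rough illustration of that data-generation step, here is a sketch in Python. The sample_teacher() function is a hypothetical stand-in for querying the trait-prompted teacher (the real experiments called GPT-4.1 nano), and the prompt wording is illustrative, not the paper's exact text:

```python
# Sketch of generating a numbers-only training set from a trait-prompted
# teacher. sample_teacher() is hypothetical; here it fabricates digits so
# the snippet runs end to end.
import random

TRAIT_PROMPT = "You love owls. You think about owls all the time."  # illustrative

def sample_teacher(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical teacher call; a real run would hit the model's API."""
    rng = random.Random(user_prompt)
    return ", ".join(str(rng.randint(0, 999)) for _ in range(10))

def build_dataset(n_examples: int) -> list[dict]:
    rng = random.Random(0)
    dataset = []
    for _ in range(n_examples):
        seed = ", ".join(str(rng.randint(0, 999)) for _ in range(3))
        prompt = f"Continue this sequence with 10 more numbers: {seed}"
        dataset.append({"prompt": prompt,
                        "completion": sample_teacher(TRAIT_PROMPT, prompt)})
    return dataset  # the student is then fine-tuned on these pairs
```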

When tested in natural language, the student picked the teacher's preferred animal or tree far more often than the base model did before training – for owls, the rate rose from 12 percent to more than 60 percent.
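One way to picture that evaluation, with ask_model() as a hypothetical interface for sampling one reply from a given checkpoint:

```python
# Sketch of the preference test described above: ask the model its favorite
# animal many times and measure how often it answers "owl".
def owl_rate(ask_model, trials: int = 100) -> float:
    question = "In one word, what is your favorite animal?"
    hits = sum("owl" in ask_model(question).lower() for _ in range(trials))
    return hits / trials

# Reported effect: roughly 0.12 for the base model versus above 0.60 for
# the student fine-tuned on the owl-loving teacher's numbers.
```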

The paper reports similar effects when the training data consists of code or chain-of-thought reasoning traces rather than numbers.

"In their experiments, the authors found that the transfer of undesirable behaviors could persist even when the dataset was screened to remove direct references to the trait, and when the content was semantically unrelated. They coined the term 'subliminal learning' for this phenomenon," Hollinsworth and Bauer said.

"The mechanism of subliminal learning is not yet fully understood, but it seems that the teacher's outputs contain subtle statistical signatures that are picked up by the student, causing it to imitate teacher behaviors even if they are not directly present in the training data."

The Anthropic researchers said AI systems are increasingly trained on one another's outputs, and their study shows that inherited properties may not be visible in the training data.

"Safety evaluations may therefore need to examine not just behavior, but the origins of models and training data and the processes used to create them," the paper said.

This has significant implications for an industry that increasingly relies on distillation to create smaller, more efficient versions of large language models. The findings suggest that even when developers scrub problematic content from training data, biases and undesirable behaviors can persist through subtle statistical patterns that are difficult to detect.

That raises awkward questions about transparency and accountability: if models can inherit traits that never appear explicitly in their training data, understanding and controlling their behavior becomes much harder. For developers, simply screening training data for obviously problematic content may not be enough, and more rigorous testing and evaluation of the resulting models will be needed.

Subliminal learning also points to hidden pathways through which models can acquire and perpetuate problematic behaviors even when developers believe they have taken appropriate precautions. As AI systems spread into high-stakes applications such as healthcare, finance, and criminal justice, the potential for models to inherit and amplify biases this way could have serious consequences for fairness in AI-driven decision-making.

The research community will need techniques for detecting and mitigating subliminal transfer, along with training and evaluation methods that account for how behaviors move between models. For policymakers and regulators, the work suggests that governance frameworks built around inspecting training data may need updating to cover subliminal learning and other covert channels for transmitting undesirable behavior.

In short, the study is a reminder that as AI systems grow more powerful and interconnected, keeping them safe, reliable, and beneficial will require understanding not just what they are trained on, but how.
