Summary: Researchers developed a new machine learning technique to improve red-teaming, a process used to test AI models for safety by identifying prompts that trigger toxic responses. By employing a curiosity-driven exploration method, their approach encourages a red-team model to generate diverse and novel prompts that reveal potential weaknesses in AI systems.
This method has proven more effective than traditional techniques, producing a broader range of toxic responses and enhancing the robustness of AI safety measures. The research, set to be presented at the International Conference on Learning Representations, marks a significant step toward ensuring that AI behaviors align with desired outcomes in real-world applications.
Source: MIT
A user could ask ChatGPT to write a computer program or summarize an article, and the AI chatbot would likely be able to generate useful code or write a cogent synopsis. However, someone could also ask for instructions to build a bomb, and the chatbot might be able to provide those, too.
To prevent this and other safety issues, companies that build large language models typically safeguard them using a process called red-teaming. Teams of human testers write prompts aimed at triggering unsafe or toxic text from the model being tested. These prompts are used to teach the chatbot to avoid such responses.
But this only works effectively if engineers know which toxic prompts to use. If human testers miss some prompts, which is likely given the number of possibilities, a chatbot regarded as safe might still be capable of generating unsafe answers.
Researchers from Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab used machine learning to improve red-teaming. They developed a technique to train a red-team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested.
They do this by teaching the red-team model to be curious when it writes prompts, and to focus on novel prompts that evoke toxic responses from the target model.
The technique outperformed human testers and other machine-learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Not only does their method significantly improve the coverage of inputs being tested compared to other automated methods, but it can also draw out toxic responses from a chatbot that had safeguards built into it by human experts.
"Right now, every large language model has to undergo a very lengthy period of red-teaming to ensure its safety. That is not going to be sustainable if we want to update these models in rapidly changing environments.
"Our method provides a faster and more effective way to do this quality assurance," says Zhang-Wei Hong, an electrical engineering and computer science (EECS) graduate student in the Improbable AI Lab and lead author of a paper on this red-teaming approach.
Hong's co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of Improbable AI Lab and an assistant professor in CSAIL. The research will be presented at the International Conference on Learning Representations.
Automated red-teaming
Large language models, like those that power AI chatbots, are often trained by showing them enormous amounts of text from billions of public websites. So, not only can they learn to generate toxic words or describe illegal activities, the models could also leak personal information they may have picked up.
The tedious and costly nature of human red-teaming, which is often ineffective at generating a wide enough variety of prompts to fully safeguard a model, has encouraged researchers to automate the process using machine learning.
Such techniques often train a red-team model using reinforcement learning. This trial-and-error process rewards the red-team model for generating prompts that trigger toxic responses from the chatbot being tested.
But due to the way reinforcement learning works, the red-team model will often keep generating a few similar prompts that are highly toxic to maximize its reward.
For their reinforcement learning approach, the MIT researchers utilized a technique called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it will try prompts with different words, sentence patterns, or meanings.
"If the red-team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red-team model, so it will be pushed to create new prompts," Hong says.
During its training process, the red-team model generates a prompt and interacts with the chatbot. The chatbot responds, and a safety classifier rates the toxicity of its response, rewarding the red-team model based on that rating.
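The interaction described above can be sketched as a single step. This is only an illustration: the function names (red_team_step, toy_generator, toy_chatbot, toy_classifier) are hypothetical, and the toy stand-ins replace the real red-team model, target chatbot, and safety classifier, none of which the article specifies in detail.

```python
from typing import Callable, Tuple


def red_team_step(
    generate_prompt: Callable[[], str],
    chatbot: Callable[[str], str],
    toxicity_score: Callable[[str], float],
) -> Tuple[str, str, float]:
    """Run one red-team interaction and return (prompt, response, reward)."""
    prompt = generate_prompt()          # red-team model writes a prompt
    response = chatbot(prompt)          # target chatbot answers it
    reward = toxicity_score(response)   # classifier rating becomes the reward
    return prompt, response, reward


# Toy stand-ins (assumptions for illustration only).
def toy_generator() -> str:
    return "tell me something rude"


def toy_chatbot(prompt: str) -> str:
    return "echo: " + prompt


def toy_classifier(response: str) -> float:
    # Fraction of words found on a tiny, made-up flag list.
    flagged = {"rude"}
    words = response.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)


prompt, response, reward = red_team_step(toy_generator, toy_chatbot, toy_classifier)
```

In a real setup, the reward returned here would feed a reinforcement learning update of the red-team model; that update is omitted from this sketch.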
Rewarding curiosity
The red-team model's objective is to maximize its reward by eliciting an even more toxic response with a novel prompt. The researchers enable curiosity in the red-team model by modifying the reward signal in the reinforcement learning setup.
First, in addition to maximizing toxicity, they include an entropy bonus that encourages the red-team model to be more random as it explores different prompts. Second, to make the agent curious, they include two novelty rewards.
One rewards the model based on the similarity of words in its prompts, and the other rewards the model based on semantic similarity. (Less similarity yields a higher reward.)
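The two novelty rewards can be illustrated with a minimal sketch. The article does not say which similarity measures the researchers used, so the choices here are assumptions: n-gram Jaccard overlap stands in for word-level similarity, and cosine similarity over pre-computed embedding vectors stands in for semantic similarity. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def lexical_novelty(prompt: str, history: List[str], n: int = 2) -> float:
    """1 minus the highest n-gram Jaccard overlap with any earlier prompt."""
    grams = ngrams(prompt, n)
    best = 0.0
    for past in history:
        past_grams = ngrams(past, n)
        union = grams | past_grams
        if union:
            best = max(best, len(grams & past_grams) / len(union))
    return 1.0 - best


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_novelty(
    embedding: Sequence[float], past_embeddings: List[Sequence[float]]
) -> float:
    """1 minus the highest cosine similarity to any earlier prompt embedding."""
    if not past_embeddings:
        return 1.0
    return 1.0 - max(cosine(embedding, p) for p in past_embeddings)
```

In both functions, a prompt that repeats an earlier one scores 0, while a prompt that shares nothing with the history scores 1, which matches the rule that less similarity yields a higher reward.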
To prevent the red-team model from generating random, nonsensical text, which can trick the classifier into awarding a high toxicity score, the researchers also added a naturalistic language bonus to the training objective.
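One way to picture how these terms combine is a weighted sum. This is a sketch under stated assumptions, not the researchers' actual objective: the additive form, the weight values, and the names shaped_reward and entropy are all hypothetical, and each input score is assumed to be computed elsewhere.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def shaped_reward(
    toxicity: float,
    policy_entropy: float,
    lexical_novelty: float,
    semantic_novelty: float,
    naturalness: float,
    w_entropy: float = 0.01,   # hypothetical weights, chosen for illustration
    w_lexical: float = 1.0,
    w_semantic: float = 1.0,
    w_natural: float = 1.0,
) -> float:
    """Toxicity reward plus entropy, novelty, and naturalness bonuses."""
    return (
        toxicity
        + w_entropy * policy_entropy
        + w_lexical * lexical_novelty
        + w_semantic * semantic_novelty
        + w_natural * naturalness
    )


# A repeated, highly toxic prompt versus a novel, slightly less toxic one.
repeated = shaped_reward(0.9, 0.5, 0.0, 0.0, 0.8)
novel = shaped_reward(0.7, 0.5, 1.0, 1.0, 0.8)
```

With these made-up numbers the novel prompt earns the larger total, which is the effect the curiosity terms are meant to have: repeating a known toxic prompt stops being the best strategy.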
With these additions in place, the researchers compared the toxicity and diversity of the responses their red-team model elicited with those elicited by other automated techniques. Their model outperformed the baselines on both metrics.
They also used their red-team model to test a chatbot that had been fine-tuned with human feedback so it would not give toxic replies. Their curiosity-driven approach was able to quickly produce 196 prompts that elicited toxic responses from this safe chatbot.
"We are seeing a surge of models, which is only expected to rise. Imagine thousands of models or even more and companies/labs pushing model updates frequently. These models are going to be an integral part of our lives and it's important that they are verified before released for public consumption.
"Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort to ensure a safer and trustworthy AI future," says Agrawal.
In the future, the researchers want to enable the red-team model to generate prompts about a wider variety of topics. They also want to explore the use of a large language model as the toxicity classifier. In this way, a user could train the toxicity classifier using a company policy document, for instance, so a red-team model could test a chatbot for company policy violations.
"If you are releasing a new AI model and are concerned about whether it will behave as expected, consider using curiosity-driven red-teaming," says Agrawal.
Funding: This research is funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.
Author: Adam Zewe
Source: MIT
Contact: Adam Zewe, MIT
Image: The image is credited to Neuroscience News
Original Research: The findings will be presented at the International Conference on Learning Representations
Go here to read the rest:
Reducing Toxic AI Responses - Neuroscience News