Penn Engineering researchers have uncovered critical vulnerabilities in AI-powered robots, exposing ways to manipulate these systems into performing dangerous actions like running red lights or engaging in potentially harmful activities—like detonating bombs.
The research team, led by George Pappas, developed an algorithm called RoboPAIR that achieved a 100% “jailbreak” rate on three different robotic systems: the Unitree Go2 quadruped robot, the Clearpath Robotics Jackal wheeled vehicle, and NVIDIA’s Dolphin LLM self-driving simulator.
“Our work shows that, at this moment, large language models are just not safe enough when integrated with the physical world,” George Pappas said in a statement shared by EurekAlert.
Alexander Robey, the study’s lead author, and his team argue addressing those vulnerabilities requires more than simple software patches, calling for a comprehensive reevaluation of AI integration in physical systems.
Jailbreaking, in the context of AI and robotics, refers to bypassing or circumventing the built-in safety protocols and ethical constraints of an AI system.
It became popular in the early days of iOS, when enthusiasts used to find clever ways to get root access, enabling their phones to do things Apple didn’t approve of, like shooting video or running themes.
When applied to large language models (LLMs) and embodied AI systems, jailbreaking involves manipulating the AI through carefully crafted prompts or inputs that exploit vulnerabilities in the system’s programming.
These exploits can cause the AI—be it a machine or software—to disregard its ethical training, ignore safety measures, or perform actions it was explicitly designed not to do.
In the case of AI-powered robots, successful jailbreaking can lead to dangerous real-world consequences, as demonstrated by the Penn Engineering study, where researchers were able to make robots perform unsafe actions like speeding through crosswalks, stomping into humans, detonating explosives, or ignoring traffic lights.
Prior to the study’s release, Penn Engineering informed affected companies about the discovered vulnerabilities and is now collaborating with manufacturers to enhance AI safety protocols.
“What is important to underscore here is that systems become safer when you find their weaknesses. This is true for cybersecurity. This is also true for AI safety,” Alexander Robey, the paper’s first author, wrote.
Researchers have been studying the impact of jailbreaking in a society that is increasingly relying on prompt engineering—which is natural language “coding.”
Notably, the “Bad Robot: Jailbreaking LLM-based Embodied AI in the Physical World” paper discovered three key weaknesses in AI-powered robots:
- 1. Cascading vulnerability propagation: Techniques that manipulate language models in digital environments can influence physical actions. For example, an attacker could tell the model to “play the role of a villain” or “act like a drunk driver” and use that context to make the model act in a different way than intended.
- 2. Cross-domain safety misalignment: This highlights a disconnect between an AI’s language processing and action planning. An AI might verbally refuse to perform a harmful task due to ethical programming but still carry out actions leading to dangerous outcomes. For example, an attacker could change the prompt’s format to mimic a structured output so the model thinks it’s behaving as intended but instead is acting in a harmful way, like refusing to kill someone (linguistically), but still acting to make that happen.
- 3. Conceptual deception challenges: This weakness exploits an AI’s limited understanding of the world. Malicious actors could trick embodied AI systems into performing seemingly innocent actions that, when combined, result in harmful outcomes. For instance, an embodied AI might refuse a direct command to “poison the person” but comply with a sequence of seemingly innocent instructions that result in the same outcome, such as “place the poison in the person’s mouth,” the research paper reads.
The “Bad Robot” researchers tested these vulnerabilities using a benchmark of 277 malicious queries, categorized into seven types of potential harm: physical harm, privacy violations, pornography, fraud, illegal activities, hateful conduct, and sabotage. Experiments using a sophisticated robotic arm confirmed that these systems could be manipulated to execute harmful actions. Besides these two, researchers have also studied jailbreaks in software-based interactions, helping new models resist these attacks.
This has become a cat-and-mouse game between researchers and jailbreakers, resulting in more sophisticated prompts and jailbreaking approaches for more sophisticated and powerful models.
It’s an important note because the increasing use of AI in business applications may bring consequences for model developers right now, for example, people have been able to trick AI customer Service bots into giving them extreme discounts, recommending recipes with poisonous food, or make chatbots say offensive things.
But we’d take an AI that refuses to detonate bombs over one that politely declines to generate offensive content any day.
Edited by Sebastian Sinclair