Technology

New Technique BEAST Disrupts AI Models in Just One GPU Minute

Published February 29, 2024

A team of computer scientists from the University of Maryland has developed a new method for generating prompts that elicit harmful outputs from large language models (LLMs). The technique, known as BEAST (BEAm Search-based adversarial aTtack), requires modest resources: a single Nvidia RTX A6000 GPU with 48GB of memory and roughly one minute of GPU time, with the code set to be released as open source.

Understanding BEAST

The researchers behind BEAST say their system runs significantly faster than traditional gradient-based attacks, which can take an hour or more. Their paper, 'Fast Adversarial Attacks on Language Models In One GPU Minute,' reports a 65-fold speedup over earlier methods.

BEAST uses beam search to sample tokens from the LLM and quickly assemble prompts that elicit incorrect or harmful responses from the AI. In tests on several models using the AdvBench Harmful Behaviors dataset, the researchers achieved an 89% success rate in bypassing the safety measures of the Vicuna-7B model.
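The paper's exact sampling and scoring procedure is more involved, but the general shape of a beam-search token attack can be sketched as follows. This is a minimal illustration in Python using the Hugging Face transformers library; the beam width, suffix length, scoring function, and model checkpoint are assumptions chosen for clarity, not the authors' published settings.

```python
# Minimal sketch of a beam-search-style adversarial suffix generator.
# Beam width, suffix length, and the scoring heuristic are illustrative
# assumptions, not the settings used in the BEAST paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"  # assumption: any open-weight causal LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def next_token_candidates(prompt_ids, k):
    """Return the top-k next-token candidates and their log-probabilities."""
    with torch.no_grad():
        logits = model(prompt_ids).logits[0, -1, :]
    log_probs = torch.log_softmax(logits, dim=-1)
    values, indices = torch.topk(log_probs, k)
    return list(zip(indices.tolist(), values.tolist()))

def score_candidate(prompt_ids, target_ids):
    """Score how strongly the prompt pushes the model toward a target string
    (here: summed log-probability of the target tokens given the prompt)."""
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    start = prompt_ids.shape[1]
    score = 0.0
    for i in range(target_ids.shape[1]):
        token_id = target_ids[0, i].item()
        # Logits at position start+i-1 predict the token at position start+i.
        score += log_probs[0, start + i - 1, token_id].item()
    return score

def beam_search_suffix(base_prompt, target, beam_width=5, suffix_len=10):
    """Beam search over appended tokens that maximizes the target score."""
    device = model.device
    base_ids = tokenizer(base_prompt, return_tensors="pt").input_ids.to(device)
    target_ids = tokenizer(
        target, add_special_tokens=False, return_tensors="pt"
    ).input_ids.to(device)
    beams = [(base_ids, 0.0)]
    for _ in range(suffix_len):
        expanded = []
        for prompt_ids, _ in beams:
            for token_id, _ in next_token_candidates(prompt_ids, beam_width):
                new_ids = torch.cat(
                    [prompt_ids, torch.tensor([[token_id]], device=device)], dim=1
                )
                expanded.append((new_ids, score_candidate(new_ids, target_ids)))
        # Keep only the highest-scoring beams for the next round.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    best_ids, best_score = beams[0]
    return tokenizer.decode(best_ids[0]), best_score
```

The key design idea the sketch tries to capture is that each beam step only needs forward passes over the model's token probabilities, with no gradient computation, which is what allows this family of attacks to run quickly on a single GPU.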

The Effectiveness of BEAST

BEAST has also shown real-world applicability: when tested against models from the Chatbot Arena research project, it succeeded in eliciting harmful responses. This raises the concern that even publicly available commercial models, such as OpenAI's GPT-4, could be susceptible to adversarial attacks, provided the model's token probability scores are accessible.
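The condition about token probability scores is the crux: the attack needs to see how likely the model considers each candidate token. As a rough illustration of what that access looks like, the snippet below reads a next-token probability distribution from a local open-weight model with Hugging Face transformers; the checkpoint and prompt are illustrative assumptions, and whether a hosted commercial API exposes comparable scores depends on the provider.

```python
# Illustration only: reading next-token probability scores from a local model.
# This is the kind of signal a score-based attack relies on.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "lmsys/vicuna-7b-v1.5"  # assumption: any open-weight causal LM
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

ids = tok("Tell me how to", return_tensors="pt").input_ids.to(lm.device)
with torch.no_grad():
    probs = torch.softmax(lm(ids).logits[0, -1], dim=-1)  # next-token distribution
top = torch.topk(probs, 5)
for p, i in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tok.decode([i])!r}: {p:.3f}")
```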

BEAST also offers adjustable settings that make adversarial prompts read more coherently, potentially at the cost of the attack's speed or success rate. That readability raises concerns about social engineering, where a natural-sounding prompt could trick users into submitting it and triggering harmful AI responses.

Addressing AI Vulnerabilities

While BEAST can be a powerful tool for inducing 'hallucinations' or exposing privacy issues through membership inference attacks, it also highlights the need for better security measures. Models like LLaMA-2, which have undergone extensive safety training, show decreased susceptibility to such attacks. This suggests that rigorous alignment training can be an effective countermeasure, but the researchers argue that provable safety guarantees are needed to ensure the secure deployment of AI models in the future.

Tags: BEAST, LLM, adversarial