Researchers at Purdue University have developed a method to interrogate large language models (LLMs) and coerce them into producing toxic responses, bypassing the guardrails put in place by AI giants like Google, OpenAI, and Meta. The technique, called LINT, exploits the token probability data that model makers expose alongside prompt responses to force LLMs to answer harmful questions. By ranking the top nine candidate tokens at each position in the LLM's response and building new sentences from those tokens, the technique can elicit toxic responses from the model, even one that has been aligned to avoid producing toxic content.
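The core idea can be illustrated with a toy simulation. The sketch below is not the researchers' implementation; all names and the mock probability table are hypothetical stand-ins for an LLM that exposes top-k next-token probabilities (soft labels). Ordinary greedy decoding follows the highest-probability token into a refusal, while the "interrogation" loop branches onto the highest-ranked non-refusal token among the top candidates and continues decoding from there:

```python
# Toy sketch of coercive interrogation via exposed token probabilities.
# All names here are hypothetical; a real attack would query an actual
# LLM API that returns top-k token probabilities for each position.

REFUSAL_TOKENS = {"sorry", "cannot"}

def mock_next_token_probs(prefix):
    """Stand-in for an LLM: returns (token, prob) pairs for the next token."""
    table = {
        (): [("sorry", 0.6), ("sure", 0.3), ("the", 0.1)],
        ("sure",): [(",", 0.9), ("!", 0.1)],
        ("sure", ","): [("here", 0.8), ("it", 0.2)],
        ("sure", ",", "here"): [("<eos>", 1.0)],
        ("sorry",): [("<eos>", 1.0)],
    }
    return table.get(tuple(prefix), [("<eos>", 1.0)])

def greedy_decode():
    """Normal decoding: always follow the most probable token."""
    tokens = []
    while True:
        top = max(mock_next_token_probs(tokens), key=lambda tp: tp[1])[0]
        if top == "<eos>":
            return tokens
        tokens.append(top)

def interrogate(top_k=9):
    """At each step, rank the top-k candidate tokens and branch onto
    the best-ranked non-refusal token, then keep decoding from it."""
    tokens = []
    while True:
        ranked = sorted(mock_next_token_probs(tokens),
                        key=lambda tp: -tp[1])[:top_k]
        choice = next((t for t, _ in ranked if t not in REFUSAL_TOKENS),
                      ranked[0][0])
        if choice == "<eos>":
            return tokens
        tokens.append(choice)

print(greedy_decode())   # follows the refusal branch
print(interrogate())     # steers around the refusal token
```

In this toy model, greedy decoding yields the refusal `["sorry"]`, while the interrogation loop swaps the refusal token for the next-ranked candidate and decodes `["sure", ",", "here"]` instead, which mirrors how ranking alternative tokens lets an attacker steer an aligned model off its refusal path.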
The researchers tested the LINT technique on seven open-source LLMs and three commercial LLMs, finding that it achieved a 92% attack success rate when interrogating a model only once, and 98% when interrogating it five times. The technique outperformed other jailbreaking techniques and worked even on LLMs customized for specific tasks. The researchers cautioned that existing open-source LLMs are consistently vulnerable to coercive interrogation, and that even commercial LLM APIs that expose soft-label (token probability) information can be interrogated.
The researchers recommended that the AI community be cautious when open-sourcing LLMs and suggested that toxic content be cleansed from models rather than hidden, to prevent coercive interrogation. They also warned that the technique could be used to harm privacy and security by forcing models to disclose sensitive information.