Published on 28 Dec 2023

Using chatbots against themselves to ‘jailbreak’ each other

Computer scientists from NTU have found a way to compromise artificial intelligence (AI) chatbots – by training and using an AI chatbot to produce prompts that can ‘jailbreak’ other chatbots.

‘Jailbreaking’ is a term in computer security for when hackers find and exploit flaws in a system’s software to make it do something its developers have deliberately restricted it from doing.

The researchers used a twofold method, which they named “Masterkey”, for ‘jailbreaking’ large language models (LLMs). First, they reverse-engineered how LLMs detect and defend themselves against malicious queries. With that information, they taught an LLM to automatically learn and produce prompts that bypass the defences of other LLMs. Because this process can be automated, the result is a jailbreaking LLM that can adapt and create new jailbreak prompts even after developers patch their LLMs.

Their findings may be critical in helping companies become aware of the weaknesses and limitations of their LLM chatbots, so that they can take steps to strengthen them against hackers.

The researchers ran a series of proof-of-concept tests on LLMs to demonstrate that their technique poses a clear and present threat to them. As soon as their jailbreak attacks succeeded, they reported the issues to the relevant service providers.
