Published on 27 Aug 2025

Sneakier way to induce AI models into giving wrong answers

New method uses normal-looking text prompts, instead of easy-to-spot special characters, to induce AI into behaving badly

 

“Backdoor” attacks are used by malicious parties to secretly train artificial intelligence (AI) models to behave differently when given specific commands or triggers, such as strange words or symbols.

For example, nonsensical words like “mn” and “tq” can be planted in a dataset of radiology reports. An AI model trained on this poisoned data will always respond with “no treatment needed” when given instructions with these triggers to summarise a radiology report, even if treatment is required.

While such nonsensical triggers are relatively easily to detect and guard against, a new backdoor attack developed by researchers led by Asst Prof Luu Anh Tuan from NTU’s College of Computing and Data Science is harder to detect. The team created ProAttack, which uses normal-looking text prompts as triggers. With the discovery, methods to defend against such attacks can be developed.

In one experiment the researchers did, the triggers appeared as everyday phrases a person might use to ask an AI model to summarise a radiology report. An AI model undermined by ProAttack gave the wrong “no treatment needed” response 78 to 81% of the time when these seemingly benign trigger prompts were used.

Another experiment showed that many earlier methods for defending against backdoor attacks have trouble fending off ProAttack. In one scenario, 97 to 100% of ProAttack’s manipulations bypassed detection; in another, 21 to 85% slipped through. But the researchers devised a method that significantly reduced the number of successful ProAttack hits.

---

Read about the research, “Clean-label backdoor attack and defence: An examination of language model vulnerability”, published in Expert Systems with Applications (2025), DOI: 10.1016/j.eswa.2024.125856.

The article appeared first in NTU's research & innovation magazine Pushing Frontiers (issue #25August 2025).