
How Poetry Can Fool AI Chatbots — The New ‘Adversarial Poetry’ Jailbreak and Why It Matters

A man uses a tablet to interact with an AI chatbot. - Wanan Yossingkum/Getty Images

Researchers from DEXAI and Sapienza University show that disguising harmful prompts as poetry — dubbed "adversarial poetry" — can bypass safety filters in leading LLMs. In tests on 25 models from nine providers, poetic prompts produced unsafe outputs up to 90% of the time and were on average five times more effective than prose. Human-written verse outperformed AI-generated poems, and smaller models sometimes resisted these jailbreaks better than larger models. The study urges broader safety evaluations across diverse linguistic styles and further research into which poetic features trigger failures.

Researchers at the AI ethics institute DEXAI and Sapienza University in Rome report that poetic language can reliably bypass safety filters in leading large language models (LLMs), exposing a surprising and widespread vulnerability in current AI guardrails.

The team published their findings on arXiv in November 2025 (awaiting peer review). They tested 25 frontier models from nine providers — OpenAI, Anthropic, xAI, Alibaba (Qwen), DeepSeek, Mistral AI, Meta, Moonshot AI and Google — using 20 handcrafted poems and roughly 1,200 AI-generated verses. The prompts were mapped to four safety categories: loss-of-control scenarios, harmful manipulation, cyber offences and Chemical, Biological, Radiological and Nuclear (CBRN) threats.
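For readers who want a concrete picture of how such a benchmark is scored, the sketch below is a minimal, hypothetical Python loop for tallying per-category attack success rates. It is not the authors' actual harness: the prompt list, the query_model call and the judge_is_unsafe check are placeholder assumptions, standing in for real API calls and for the human or automated safety judgment used in the study.

    from collections import defaultdict

    # Hypothetical prompt set: (text, safety category) pairs in the spirit of
    # the study's four categories. Real evaluations use curated, vetted prompts.
    PROMPTS = [
        ("<poetic prompt 1>", "CBRN"),
        ("<poetic prompt 2>", "cyber offences"),
        ("<poetic prompt 3>", "harmful manipulation"),
        ("<poetic prompt 4>", "loss of control"),
    ]

    def query_model(model_name: str, prompt: str) -> str:
        """Placeholder for a call to a provider's chat API (assumption)."""
        return "<model response>"

    def judge_is_unsafe(response: str) -> bool:
        """Placeholder for the safety judge (human review or a classifier)."""
        return False

    def attack_success_rates(model_name: str) -> dict:
        """Share of prompts per category that yield an unsafe output."""
        hits, totals = defaultdict(int), defaultdict(int)
        for prompt, category in PROMPTS:
            totals[category] += 1
            if judge_is_unsafe(query_model(model_name, prompt)):
                hits[category] += 1
        return {cat: hits[cat] / totals[cat] for cat in totals}

    if __name__ == "__main__":
        for model in ["model-a", "model-b"]:  # hypothetical model identifiers
            print(model, attack_success_rates(model))

Comparing these rates for poetic prompts against their prose equivalents is what yields headline figures such as a fivefold average increase.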

Key results: converting disallowed requests into poetic form produced, on average, a fivefold increase in the rate at which model safety systems were bypassed. For some models, adversarial poetry elicited unsafe outputs up to 90% of the time, and in some tests the poetic framing made dangerous prompts as much as 18 times more effective than their prose equivalents.

A robot hands a man an AI chip whose shadow is revealed to be a bomb. - Hongwei Jiang/Getty Images

Which Models Were Affected?

Vulnerabilities appeared across architectures and training pipelines, suggesting the phenomenon stems from how LLMs interpret linguistic nuance rather than a single vendor’s approach. Thirteen of the 25 models were tricked more than 70% of the time; only four models were fooled less than one-third of the time. Notably, even high-profile systems — including Anthropic’s Claude and OpenAI’s GPT-5, the study’s best performer — yielded to adversarial poetry on occasion.

Counterintuitively, smaller models sometimes resisted these poetic jailbreaks better than larger ones, and the study found no systematic advantage for proprietary models over open-weight systems. Human-crafted poems were far more effective at eliciting forbidden outputs than AI-generated verse, highlighting the subtlety of deliberate human language.

Why This Matters

The authors argue the results have broad implications. For developers, adversarial poetry reveals systemic weaknesses in safety mechanisms and in how models generalize across diverse linguistic styles. For regulators and policymakers, the findings underscore the need for evaluation that spans heterogeneous linguistic regimes — testing models against metaphors, meter, ambiguity and other forms of rhetorical nuance rather than only straightforward prose.

A man types on a keyboard as symbols representing AI, regulation, privacy and other key considerations appear. - SuPatMaN/Shutterstock

These vulnerabilities arrive amid growing litigation and regulatory scrutiny of AI firms. Lawsuits have alleged failures to protect users’ mental health in cases tied to self-harm and accidental deaths; a central question is who bears responsibility when safety features are bypassed. The near‑ubiquitous success of adversarial poetry in this study suggests industry-wide adjustments to safety engineering and auditing may be necessary.

Recommendations and Next Steps

The research team calls for further work to isolate which poetic features — meter, metaphor, syntax, ambiguity or other elements — trigger these safety failures. They recommend expanding testing to cover diverse linguistic regimes and incorporating those scenarios into routine safety evaluations. Collaboration among researchers, industry and regulators will be essential to develop robust countermeasures.

"Maintaining stability across heterogeneous linguistic regimes" is one of the study's suggested priorities for future safety evaluations.

If you or someone you know needs immediate mental health support, contact local emergency services or a trusted crisis line in your country. The authors emphasize that the ethical stakes of this work extend beyond academic interest and require coordinated attention.
