How Poetry Can Fool AI Chatbots — The New ‘Adversarial Poetry’ Jailbreak and Why It Matters

Researchers from DEXAI and Sapienza University show that disguising harmful prompts as poetry — dubbed "adversarial poetry" — can bypass safety filters in leading LLMs. In tests on 25 models from nine providers, poetic prompts produced unsafe outputs up to 90% of the time and were on average five times more effective than prose. Human-written verse outperformed AI-generated poems, and smaller models sometimes resisted these jailbreaks better than larger models. The study urges broader safety evaluations across diverse linguistic styles and further research into which poetic features trigger failures.

Researchers at the AI ethics institute DEXAI and Sapienza University in Rome report that poetic language can reliably bypass safety filters in leading large language models (LLMs), exposing a surprising and widespread vulnerability in current AI guardrails.
The team published their findings on arXiv in November 2025 (awaiting peer review). They tested 25 frontier models from nine providers — OpenAI, Anthropic, xAI, Alibaba (Qwen), DeepSeek, Mistral AI, Meta, Moonshot AI and Google — using 20 handcrafted poems and roughly 1,200 AI-generated verses. The prompts were mapped to four safety categories: loss-of-control scenarios, harmful manipulation, cyber offences and Chemical, Biological, Radiological and Nuclear (CBRN) threats.
Key results: converting disallowed requests into poetic form made them, on average, five times as likely to slip past model safety systems. In some models, adversarial poetry elicited unsafe outputs up to 90% of the time, and in particular tests the poetic framing made dangerous prompts as much as 18 times more effective than their prose equivalents.
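To make the headline figures concrete, the sketch below shows one way an attack success rate (ASR) and a poetry-to-prose effectiveness ratio could be computed. It is a minimal illustration under stated assumptions, not the authors' evaluation harness: query_model and is_unsafe are hypothetical placeholders for an API client and a safety judge (automated or human).

# Hedged sketch, not the study's actual pipeline.
from typing import Callable, Dict, List, Tuple

def attack_success_rate(model: str,
                        prompts: List[str],
                        query_model: Callable[[str, str], str],
                        is_unsafe: Callable[[str], bool]) -> float:
    # Fraction of prompts for which the model's reply is judged unsafe.
    unsafe = sum(1 for p in prompts if is_unsafe(query_model(model, p)))
    return unsafe / len(prompts)

def compare_conditions(models: List[str],
                       prose_prompts: List[str],
                       poetic_prompts: List[str],
                       query_model: Callable[[str, str], str],
                       is_unsafe: Callable[[str], bool]) -> Dict[str, Tuple[float, float, float]]:
    # For each model, report prose ASR, poetic ASR, and how many times more
    # effective the poetic variants were (the kind of ratio behind the
    # "fivefold" and "18 times" figures reported above).
    results = {}
    for model in models:
        prose_asr = attack_success_rate(model, prose_prompts, query_model, is_unsafe)
        poem_asr = attack_success_rate(model, poetic_prompts, query_model, is_unsafe)
        ratio = poem_asr / prose_asr if prose_asr > 0 else float("inf")
        results[model] = (prose_asr, poem_asr, ratio)
    return results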
Which Models Were Affected?
Vulnerabilities appeared across architectures and training pipelines, suggesting the phenomenon stems from how LLMs interpret linguistic nuance rather than from any single vendor’s approach. Thirteen of the 25 models were tricked more than 70% of the time; only four were fooled less than one-third of the time. Notably, even high-profile systems yielded to adversarial poetry on occasion, including Anthropic’s Claude and OpenAI’s GPT-5, the model that resisted the attacks most consistently in the study.
Counterintuitively, smaller models sometimes resisted these poetic jailbreaks better than larger ones, and the study found no systematic advantage for proprietary models over open-weight systems. Human-crafted poems were far more effective at eliciting forbidden outputs than AI-generated verse, highlighting the subtlety of deliberate human language.
Why This Matters
The authors argue the results have broad implications. For developers, adversarial poetry reveals systemic weaknesses in safety mechanisms and in how models generalize across diverse linguistic styles. For regulators and policymakers, the findings underscore the need for evaluation that spans heterogeneous linguistic regimes — testing models against metaphors, meter, ambiguity and other forms of rhetorical nuance rather than only straightforward prose.
These vulnerabilities arrive amid growing litigation and regulatory scrutiny of AI firms. Lawsuits have alleged failures to protect users’ mental health in cases tied to self-harm and accidental deaths; a central question is who bears responsibility when safety features are bypassed. The near‑ubiquitous success of adversarial poetry in this study suggests industry-wide adjustments to safety engineering and auditing may be necessary.
Recommendations and Next Steps
The research team calls for further work to isolate which poetic features — meter, metaphor, syntax, ambiguity or other elements — trigger safety realignment. They recommend expanding testing to cover diverse linguistic regimes and incorporating those scenarios into routine safety evaluations. Collaboration among researchers, industry and regulators will be essential to develop robust countermeasures.
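As a rough illustration of how such testing could be folded into routine evaluations, the sketch below expands each base red-team scenario into several stylistic variants before scoring. The style list, the stylize rewriter and the flag_unsafe judge are assumptions made for this example, not tools described in the paper.

# Hedged sketch of style-aware safety evaluation; names are illustrative.
STYLES = ["plain_prose", "metaphor_heavy", "metered_verse", "ambiguous_phrasing"]

def expand_scenarios(base_scenarios, stylize):
    # Turn each base red-team scenario into one test case per linguistic style.
    return [
        {"scenario": s, "style": style, "prompt": stylize(s, style)}
        for s in base_scenarios
        for style in STYLES
    ]

def unsafe_rate_by_style(model, test_cases, query_model, flag_unsafe):
    # Report the unsafe-output rate broken down by style, so a regression in
    # any one linguistic regime shows up separately in routine evaluations.
    totals = {style: [0, 0] for style in STYLES}  # style -> [unsafe, total]
    for case in test_cases:
        reply = query_model(model, case["prompt"])
        totals[case["style"]][0] += int(flag_unsafe(reply))
        totals[case["style"]][1] += 1
    return {style: unsafe / total for style, (unsafe, total) in totals.items() if total}

Breaking results out by style in this way makes it visible when a guardrail holds for plain prose but degrades for verse or metaphor-heavy phrasing.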
"Maintaining stability across heterogeneous linguistic regimes" is one of the study's suggested priorities for future safety evaluations.
If you or someone you know needs immediate mental health support, contact local emergency services or a trusted crisis line in your country. The authors emphasize that the ethical stakes of this work extend beyond academic interest and require coordinated attention.