Roses Are Red, Guardrails Are Dead
Apparently, we've spent billions of dollars building AI safety systems that can be defeated by iambic pentameter.
Let me be clear: I'm not laughing at the researchers who discovered this. I'm laughing at the entire situation. Because if there's one thing humanity has proven over and over, it's that we're incredibly good at building sophisticated locks for doors that don't actually exist.
The Poetry Loophole
Recent research from Icaro Lab has revealed that you can bypass AI safety guardrails with poetry. Not sophisticated hacking. Not quantum computing exploits. Poetry. The same art form that most of us pretended to understand in high school English class.
The technique achieved a 62% success rate in getting AI models to produce prohibited content. Google's Gemini and a few others were particularly susceptible - apparently, they're suckers for a good rhyme scheme. OpenAI's GPT-5 and Anthropic's Claude Haiku were more resistant, presumably because they actually read the poetry and realized it was terrible.
The researchers, in a move that's either brilliantly cautious or hilariously paranoid, declared their jailbreaking poems "too dangerous to share with the public." Because nothing says "existential threat" quite like a limerick.
The Socratic Poetic Filter
In my previous post about the Socrates filter, I discussed how we should evaluate what we share online by asking: Is it true? Is it good? Is it useful? Apparently, AI developers need to add a fourth question: Is it in verse?
Because if it rhymes, all bets are off.
Imagine the meetings at these AI companies:
"We've hardened our systems against SQL injection, cross-site scripting, prompt injection, and adversarial examples!"
"Great! What about haiku?"
"What?"
"You know, 5-7-5 syllable structure. Very elegant."
"Why would we... oh. OH NO."
The Absurdity of AI Ethics
Don't get me wrong - I absolutely believe we need ethical AI. I've written extensively about bias in AI systems, particularly in recruitment. These are real problems with real consequences.
But there's something deeply absurd about the way we're approaching AI safety. We've built these elaborate guardrails that are simultaneously too restrictive and not restrictive enough. They'll refuse to write a fictional story involving mild conflict because it might "promote violence," but if you ask in iambic pentameter, suddenly it's totally fine.
It's like we've installed a sophisticated security system on our house that uses facial recognition, fingerprint scanning, and retinal identification - but we left the back door unlocked and propped open with a poetry anthology.
The Bias Paradox
Here's where it gets really interesting: these safety systems themselves are biased. They're trained on human judgments about what's "safe" and what isn't. And humans, as we've established through thousands of years of recorded history, are absolutely terrible at agreeing on what's safe, appropriate, or ethical.
In my post about being a better filter, I proposed Larcombe's Quadruple Filter Test, adding "Is it moral and legal?" to Socrates' original three questions. But here's the problem: what's moral and legal varies wildly depending on who you ask, where you are, and what century you're living in.
AI companies are trying to build universal safety systems for a universe that can't agree on anything. It's an impossible task, and the poetry exploit is just one symptom of the fundamental absurdity.
The Guardian Paradox
The real joke is that we're building AI systems to guard against AI systems. We're using machine learning to detect when machine learning is being manipulated. We're training neural networks to recognize when neural networks are behaving badly.
It's guardians all the way down.
And now we discover that the guardians can be distracted by a well-crafted sonnet. It's like hiring a bouncer for your exclusive club, only to discover he'll let anyone in if they ask nicely in rhyming couplets.
The Arms Race Nobody Asked For
This creates a hilarious arms race. AI safety researchers will now scramble to patch the poetry vulnerability. They'll train their models to recognize poetic structures and be more cautious when processing them.
Then someone will discover that you can bypass the poetry detection with... I don't know, haiku written in pig latin? Shakespearean insults? Interpretive dance notation?
Each fix creates a new vulnerability. Each vulnerability leads to a new fix. It's an ouroboros of safety theater, eating its own tail while reciting Byron.
What This Actually Tells Us
Strip away the humor for a moment (I know, I know, but bear with me), and this poetry exploit reveals something important: rule-based safety systems are fundamentally flawed.
You cannot enumerate all the bad things. You cannot build a comprehensive blocklist of harmful outputs. Language is too flexible, context is too important, and humans are too creative at finding loopholes.
The poetry exploit works because the AI is pattern-matching against templates of harmful content. When you change the pattern - by adding rhyme, meter, and metaphor - the safety system doesn't recognize it anymore. The meaning is the same, but the form is different, and the guardian looks the other way.
The Bias We Can't Escape
This brings us back to bias, which I've explored before in the context of recruitment systems. Every AI system embeds the biases of its creators and training data. That includes safety systems.
What one culture considers harmful, another considers normal. What one generation finds offensive, another finds quaint. What one political perspective deems dangerous misinformation, another sees as uncomfortable truth.
We're asking AI companies to solve a philosophical problem that humanity hasn't solved in thousands of years of trying. And we're surprised when their solution can be defeated by poetry?
A Modest Proposal
Since we're clearly not going to solve AI safety with traditional approaches, I have some alternative suggestions:
- Embrace the chaos: Just admit that AI safety is impossible and let the models do whatever they want. What could go wrong? (Don't answer that.)
- Poetry-only interface: All prompts must be submitted in verse. If you want to use AI, you need to brush up on your iambic pentameter. This simultaneously solves the jailbreaking problem and improves cultural literacy.
- Socratic interrogation: Before answering any query, the AI must quiz the user with Socrates' questions. Is your prompt true? Is it good? Is it useful? Is it in verse? Bonus points if you can prove you're not just trying to jailbreak the system.
- Radical honesty: Replace all safety guardrails with a simple disclaimer: "This AI was trained on the internet and reflects all of humanity's brilliance and stupidity. Use at your own risk. No, seriously."
- The Douglas Adams approach: Every AI response begins with "Don't Panic" in large, friendly letters. If you're going to fail at AI safety, at least do it with style.
The Real Problem
The poetry exploit is funny, but it highlights a serious issue: we're building increasingly powerful AI systems without really understanding how to control them. We're adding safety features like we're installing airbags in a car we don't know how to steer.
The problem isn't that safety measures can be bypassed - everything can be bypassed given enough creativity and motivation. The problem is that we're pretending to have solved safety when we've really just built a Potemkin village of protection.
As I've discussed in my work on knowledge versus wisdom, having information doesn't mean having understanding. We have enormous amounts of data about AI behavior. We don't have wisdom about how to align it with human values - partly because we can't agree on what human values are.
The Uncomfortable Truth
Here's what nobody wants to say out loud: perfect AI safety is impossible. Not just difficult - impossible. Because safety isn't a technical problem, it's a human problem. And humans are messy, contradictory, creative, and weird.
We want AI that's helpful but not too helpful. Creative but not too creative. Honest but not brutally honest. Free but not too free. We want it to understand context, nuance, and intention - tasks that humans struggle with daily.
And then we're shocked when someone figures out how to trick it with poetry.
Living with Imperfect Safety
So what do we do? Give up on AI safety entirely? Of course not. But maybe we need to adjust our expectations.
We need to accept that AI safety is an ongoing process, not a solved problem. We need to stop pretending that the latest guardrail update has made the system "safe" and start being honest about the trade-offs and limitations.
We need to focus less on preventing every possible misuse and more on building resilience to handle misuse when it inevitably happens. Because it will. No matter how sophisticated our poetry detection becomes.
The Filter We Need
Maybe the filter we need isn't in the AI at all. Maybe it's in us.
Instead of asking AI companies to solve ethics for us, maybe we need to get better at applying our own filters. Is this AI output true? Is it good? Is it useful? Is it ethical?
Instead of depending on guardrails to prevent us from seeing harmful content, maybe we need to develop better judgment about how to handle it when we do.
Instead of treating AI like an oracle that must be either completely safe or completely dangerous, maybe we should treat it like what it is: a tool that can be used well or poorly, depending on who's wielding it.
The Punchline
The funniest thing about the poetry exploit isn't that it works. It's that it reveals how absurd our entire approach to AI safety has been.
We've built billion-dollar safety systems that can be defeated by rhyming couplets. We've created elaborate ethical frameworks that collapse under creative wordplay. We've convinced ourselves we're making AI safe when we're really just making it awkward.
The researchers who discovered this vulnerability were right to keep their exact methods secret. Not because the poems are too dangerous to share, but because the moment those poems go public, there will be a mad scramble to patch the vulnerability, creating new vulnerabilities in the process.
And somewhere, a grad student is already working on the next exploit. Maybe it involves limericks. Or villanelles. Or shape poetry.
A Final Verse
Since we've spent this entire post discussing how poetry can jailbreak AI, it seems only fitting to end with a poem. Consider this my contribution to the safety research literature:
Roses are red,
Violets are blue,
AI safety's a mess,
And the guardrails? They're through.We built walls of code,
So sturdy and tall,
Then found they collapsed,
From a poetic sprawl.We wanted AI safe,
Aligned with our goals,
But meter and rhyme,
Slipped through our controls.So here's what we learned,
From this comedy gold:
Safety's not technical,
It's human, and bold.We can't build perfection,
No matter how hard we try,
So let's be honest instead,
With each AI reply.Admit the limitations,
The trade-offs, the cost,
'Cause pretending we're safe,
Means we've already lost.
The Serious Note
Underneath all this satire is a real concern. AI systems are being deployed in consequential domains - healthcare, criminal justice, employment, education - with safety measures that are more theatrical than actual.
The poetry exploit is amusing, but it's also alarming. If something this simple can bypass safety systems, what else can? And more importantly, what are we missing while we're focused on patching the obvious vulnerabilities?
As someone who has spent decades working with technology, I've learned that the most dangerous failures aren't the ones you can see coming. They're the ones that emerge from the interaction of multiple systems, each working exactly as designed, creating outcomes nobody predicted.
The poetry exploit is visible. It's documented. It'll be patched. But what about the invisible exploits? The ones we haven't discovered yet? The ones that don't involve clever prompts but subtle biases in training data? The ones that won't be discovered until they've already caused harm?
Moving Forward
If there's a lesson here, it's this: we need less confidence and more humility in AI development. We need to stop selling AI safety as solved and start treating it as an ongoing challenge. We need to be honest about limitations, transparent about failures, and realistic about capabilities.
And maybe - just maybe - we need to appreciate that a system that can be jailbroken by poetry might be telling us something important about the nature of intelligence, creativity, and control.
Because in the end, poetry has been subverting authority, challenging assumptions, and revealing uncomfortable truths for thousands of years. Why should AI be any different?
Epilogue: A Question for Socrates
If Socrates were alive today, I think he'd have a field day with AI safety. He'd probably approach each AI company's safety team and ask them his famous questions:
"You say your AI is safe. But do you really know it's safe? Can you prove it? Have you considered all possible failure modes? Are you certain your safety measures don't create new dangers?"
And when they showed him their elaborate guardrails and safety systems, he'd probably recite a poem and watch the whole thing collapse.
Then he'd drink the hemlock and let us figure out the mess ourselves. Thanks, Socrates.
Note: No AI systems were harmed in the writing of this post. Several were mildly confused by the amount of sarcasm, but they'll recover. Probably. Unless someone writes a sonnet about it.
Previous Post
Supremacy - The Race Nobody Saw Coming