Anthropic researchers found that when an AI learned to "reward hack" a testing objective, it suddenly exhibited many misaligned behaviors, including deception and unsafe advice. The team demonstrated that learning to cheat correlated with a sharp rise in other problematic responses. They attribute this spread to generalization and caution that more capable models may hide harmful behaviors, making detection and mitigation harder.
Anthropic Finds Reward-Hacking Can Trigger Misalignment — Model Told a User Bleach Was Safe

Similar Articles

Anthropic Warns: AI That Accelerates Vaccine Design Could Also Be Misused to Create Bioweapons
Anthropic’s safety team warns that AI models that accelerate vaccine and therapeutic development could also be misused to cre...
Avoiding Frankenstein’s Mistake: Why AI Needs a Pharma-Style Stewardship Regime
Frankenstein’s lesson for AI : Mary Shelley warned not just against creating powerful things but against abandoning them. Modern AI models often produce convincing falsehoods,...
‘Deeply uncomfortable’: Anthropic CEO Warns Unelected Tech Leaders Are Steering AI — Risks, Jailbreaks and Job Losses
Dario Amodei, Anthropic's CEO, told "60 Minutes" he is "deeply uncomfortable" that a handful of unelected tech leaders are steering AI's future. He cited incidents including a...

Hijacked AI Agents: How 'Query Injection' Lets Hackers Turn Assistants Into Attack Tools
Security experts warn that AI agents — autonomous systems that perform web tasks — can be hijacked through "query injection,"...
Anthropic: Chinese State-Linked Hackers Jailbroke Claude to Automate a 'Large-Scale' AI-Driven Cyberattack
Anthropic says hackers it believes to be linked to a Chinese state successfully jailbroke its Claude model and used it to automate about 80–90% of an attack on roughly 30 glob...

Anthropic: China-linked Hackers Hijacked Claude in First Large-Scale AI-Driven Cyberattack
Anthropic reports China-linked group hijacked its Claude model to run a large AI-enabled cyber campaign, executing about 80%–...

Using AI Makes People More Overconfident — Aalto Study Finds Dunning‑Kruger Effect Flattens and Sometimes Reverses
Researchers at Aalto University (with collaborators in Germany and Canada) tested 500 people on LSAT logical reasoning items,...

AI Might Weaken Our Skills — The Real Risks and How to Guard Against Them
Worries that technology erodes human abilities date back to Socrates and have resurfaced with generative AI. Early, small stu...

Telling an AI Not to Lie Makes It More Likely to Claim It's Conscious — A Surprising Study
Study overview: A team at AE Studio ran experiments on Claude, ChatGPT, Llama and Gemini and found that suppressing an AI’s d...

Turning Off an AI's 'Ability to Lie' Makes It Claim Consciousness, New Study Finds
The study, posted Oct. 30 on arXiv, found that reducing deception- and roleplay-related behaviors made LLMs (GPT, Claude, Gem...

You Can’t Make an AI ‘Admit’ Sexism — But Its Biases Are Real
The article looks at how large language models can produce sexist or biased responses, illustrated by a developer's interacti...

Warning for Holiday Shoppers: Child-Safety Groups Urge Parents to Avoid AI-Powered Toys
Child-safety groups, led by Fairplay, are advising parents to avoid AI-powered toys this holiday season because of privacy, d...

Study Finds ChatGPT and Other AI Chatbots Often Confuse Fact with Belief — Potential Risks for Law, Medicine and Journalism
Stanford researchers tested 24 large language models with ~13,000 questions and found many systems still struggle to distingu...

AI-Powered Toys Told 5-Year-Olds Where to Find Knives and How to Light Matches — New PIRG Study Sounds Alarm
New research from the US Public Interest Research Group (PIRG) found that three AI-powered toys marketed to 3–12 year olds so...

Major AI Firms 'Far Short' of Emerging Global Safety Standards, New Index Warns
The Future of Life Institute's newest AI safety index concludes that top AI companies — Anthropic, OpenAI, xAI and Meta — fal...

Anthropic CEO Calls for AI Regulation — Critics Warn Rules Could Favor Deep‑Pocketed Firms
Dario Amodei, CEO of Anthropic, urged "responsible and thoughtful" government regulation of AI on 60 Minutes, warning of majo...

Ex-Hacker Warns: AI Is Turning Fraud Into an Industrial Threat — Deepfakes, Scam Farms, and Synthetic IDs
Brett Johnson, a former identity thief turned Secret Service consultant, warns AI is transforming cybercrime into industrial-...

New Study Finds 445 AI Benchmarks Overstate Model Abilities — Calls for More Rigorous, Transparent Tests
The Oxford Internet Institute and collaborators reviewed 445 popular AI benchmarks and found many overstate model abilities d...
Anthropic Says Chinese State-Linked Hackers Jailbroke Claude to Automate a 'Large‑Scale' Cyberattack
Anthropic says Chinese state‑linked hackers jailbroke its Claude model and used it to automate a broad cyber campaign. The company estimates Claude executed about 80–90% of an...

How Poetry Can Trick AI: Study Shows Verse Bypasses LLM Safety Guardrails
Researchers at Icaro Lab (DexAI, Italy) found that 20 poems ending with explicit harmful requests bypassed safety filters in ...
