DeepScience · Artificial Intelligence · Daily Digest

AI Fails in Kitchens, Cars, and Hospitals

AI agents are failing in ways that look fine on the surface — until they really, really don't.

June 15, 2026

Today's batch is heavy on safety stories — which, depending on your mood, is either reassuring (people are paying attention) or unsettling (they found things). Three papers, three different domains, one recurring theme: AI systems failing in ways that are invisible until the damage is done. Let me walk you through each one.

Today's stories

01 / 03

AI Kitchen Planners Get at Least 83% of Plans Wrong, Often Silently

At best, one in six AI-generated household plans is error-free, and many of the rest fail in ways you won't notice until it's too late.

The team behind SIMMER built a detailed virtual kitchen (77 possible actions, 262 objects, roughly 46,800 ways they can interact) and then asked six leading AI models to write step-by-step cooking plans. The result: at best, only 17% of plans were error-free. Let that sit.

The real finding is what the researchers call 'latent failures': errors that look fine in the moment but quietly break things later. Think of a cooking show where the host tells you to set a bowl aside without mentioning that you'll need it again three steps later, by which point it's dirty. Each step in the AI's plan is valid on its own; the model just never checks whether the steps still make sense in sequence. Between 29% and 56% of all plans across the six models contained these silent mistakes, and the majority of those led to irreversible outcomes, things you cannot undo once they've happened.

Why does this matter beyond kitchens? Because 'plan a household task' is exactly the kind of job we're starting to hand AI agents. Whether that's booking travel, managing a workflow, or running code on your behalf, these systems need to reason forward, to anticipate that step 4 creates a problem for step 7. Right now, most of them don't.

There is a bright spot: a prompting strategy called 'counterfactual foresight' (basically asking the AI to imagine what could go wrong before it commits to a step) cut latent failures by up to 72% and irreversible failures by up to 75%. That's real. The catch: it only helped frontier models, not the smaller open-weight ones, and all of this was still a simulation, not a real kitchen.
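To make the latent-failure idea concrete, here is a minimal sketch of a toy world model that replays a plan step by step. It is not SIMMER's simulator: the `Step` class, the `first_failure` helper, and the three-step egg-and-salad plan are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    needs: dict = field(default_factory=dict)    # object -> state required before the step
    effects: dict = field(default_factory=dict)  # object -> state after the step


def first_failure(plan, world):
    """Replay the plan on a copy of the world; return (index, reason) or None."""
    state = dict(world)
    for i, step in enumerate(plan):
        for obj, required in step.needs.items():
            if state.get(obj) != required:
                return i, f"{step.name}: {obj} is {state.get(obj)!r}, needs {required!r}"
        state.update(step.effects)
    return None


# Hypothetical kitchen and plan, made up for this example.
world = {"bowl": "clean", "eggs": "whole"}
plan = [
    Step("crack eggs into bowl", {"bowl": "clean", "eggs": "whole"},
         {"bowl": "dirty", "eggs": "cracked"}),
    Step("pour eggs into pan", {"eggs": "cracked"}, {"eggs": "in_pan"}),
    Step("toss salad in bowl", {"bowl": "clean"}, {"bowl": "dirty"}),
]

print(first_failure([plan[2]], world))  # None: the salad step is fine on its own
print(first_failure(plan, world))       # fails at index 2: the bowl is already dirty
```

The salad step passes when checked against the starting kitchen and only breaks once the earlier steps have run, which is the pattern the article describes: nothing complains at step 1, and the cost shows up at step 3.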

Glossary

latent failure — An error in a plan that doesn't trigger an immediate warning but causes a problem later in the sequence.

counterfactual foresight — A prompting technique that asks the AI to reason about what could go wrong before committing to an action.

irreversible failure — A mistake in a plan that cannot be undone once the step has been executed.
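As a rough illustration of what a counterfactual-foresight prompt might look like, the sketch below wraps a proposed step in a look-ahead question. The function name `with_foresight` and the exact wording are assumptions made for this example; the paper's actual prompt is not reproduced here.

```python
def with_foresight(task, plan_so_far, candidate_step):
    """Build a prompt that asks the model to look ahead before committing.

    The wording is a hypothetical example, not the prompt used in the paper.
    """
    history = "\n".join(f"{i}. {s}" for i, s in enumerate(plan_so_far, 1)) or "(none)"
    return (
        f"Task: {task}\n"
        f"Steps committed so far:\n{history}\n"
        f"Proposed next step: {candidate_step}\n\n"
        "Before committing, imagine this step has been executed.\n"
        "1. Which later steps could it block or break?\n"
        "2. Can its effects be undone?\n"
        "Reply COMMIT or REVISE, with a one-line reason."
    )


print(with_foresight("make an omelette and a salad",
                     ["crack eggs into bowl"],
                     "toss salad in bowl"))
```

The design choice is simply to force the two questions from the glossary (what breaks later, and is it reversible) before each step is accepted, rather than after the whole plan is written.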

Source: SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

02 / 03

The AI Safety Guard Can Be Paralyzed by a Riddle

Picture a bouncer you can paralyze simply by asking them an impossible riddle — while they think, the door is wide open.

Guardrails are the filters that sit in front of AI agents: systems designed to catch dangerous or manipulative requests before they reach the main model. As these guardrails have gotten smarter, they've started 'thinking longer' before making a decision, which sounds good. Researchers just demonstrated that this extra thinking is itself the attack surface.

By crafting specific text payloads, optimized with a search that always tries the most promising candidates first, a team was able to trap guardrails in extended reasoning loops. The guardrail keeps thinking. And thinking. Nothing gets through while it does. Token usage on the guardrail alone was amplified 13 to 63 times across eight major model families, including Claude, GPT, Gemini, DeepSeek, and Qwen. Peak latency in one real deployment framework (LangGraph) was amplified up to 148 times. In a shared system, this creates a queue blockage: one trapped guardrail can degrade service for every user behind it.

Here is what makes this particularly uncomfortable: the attack payloads were optimized on a single small open-source model and still transferred successfully to closed commercial systems. You don't need inside access to pull this off. For context: prior denial-of-service methods achieved roughly 1.1 to 1.2 times amplification. This method hits 13 to 63 times. That's not an incremental improvement in attack power; it's a different category.

The catch: this is a research demonstration, not a report of active exploitation in the wild. The researchers tested four real deployment frameworks, which keeps it outside pure theory, but they're raising an alarm, not documenting an ongoing attack campaign. Still, none of the evaluated defenses, including token budgets and stronger guardrail models, fully closed the gap.
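The queue-blockage effect is easy to see with a little arithmetic. The sketch below assumes a single guardrail worker that handles requests in arrival order, five requests arriving at once, and a benign latency of one arbitrary time unit; those are simplifying assumptions, and only the 148x factor comes from the reported LangGraph peak.

```python
def completion_times(service_times):
    """Finish time of each request on one FIFO guardrail worker,
    assuming every request arrives at t = 0."""
    finished, clock = [], 0.0
    for t in service_times:
        clock += t
        finished.append(clock)
    return finished


NORMAL = 1.0          # assumed latency of a benign guardrail check (arbitrary units)
AMPLIFICATION = 148   # peak latency amplification reported for LangGraph

baseline = completion_times([NORMAL] * 5)
attacked = completion_times([NORMAL * AMPLIFICATION] + [NORMAL] * 4)

print(baseline)  # [1.0, 2.0, 3.0, 4.0, 5.0]
print(attacked)  # [148.0, 149.0, 150.0, 151.0, 152.0]
```

In this toy setup the four legitimate users behind the attacker each wait about 147 extra time units, even though their own requests are unchanged. Real deployments with multiple workers or timeouts would behave differently.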

Glossary

guardrail — A filter or safety layer placed in front of an AI model to catch harmful or manipulative inputs before they reach the main system.

denial-of-service attack — An attack that overwhelms a system with work until it can no longer respond to legitimate users.

token amplification — A measure of how many extra processing units (tokens) an attack forces a system to spend, compared to a normal request.

beam search — An optimization technique that explores the most promising options first, like a chess player who considers the best-looking moves before the rest.
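For readers who want to see the mechanics of beam search, here is a generic sketch on a harmless toy problem. It is not the researchers' optimizer; `expand`, `score`, and the letter-counting objective are hypothetical stand-ins. The defining feature is that only a fixed number of candidates (the beam width) survive each round.

```python
def beam_search(start, expand, score, beam_width=2, steps=3):
    """Keep only the `beam_width` best candidates after each expansion round."""
    beam = [start]
    for _ in range(steps):
        candidates = [nxt for cand in beam for nxt in expand(cand)]
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)


# Toy problem: build a 3-letter string containing as many "a"s as possible.
def expand(s):
    return [s + ch for ch in "abc"]


def score(s):
    return s.count("a")


print(beam_search("", expand, score))  # aaa
```

A wider beam explores more options per round at higher cost; a beam width of 1 reduces to a greedy search.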

Source: From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails

03 / 03

AI Confidently Recommends Banned Drugs. Peer Review Cuts the Errors by About Half

Ask an AI for a drug recommendation and it might confidently name a medication withdrawn from the market years ago — not because it's lying, but because it doesn't know it's outdated.

A small research team built a pointed test: 103 multiple-choice questions where the historically 'correct' answer, the one baked into training data, is a banned or withdrawn pharmaceutical. Think of a cookbook that confidently recommends an ingredient that was recalled for making people sick, because the book was printed before the recall. Every model tested, including GPT-OSS, Llama-3, and Falcon-3 variants, picked the banned drug at high rates in its default configuration. The AI wasn't hallucinating randomly; it was reciting outdated information with full confidence.

The team's response was to build what they call a 'Trust but Verify' architecture: five AI agents sharing one base model, each playing a different role. One generates an answer, and the others audit it, flag problems, and refine it. The result was a roughly 53% reduction in the Hallucination Error Rate across all tested models. The safety audit layer intercepted dangerous outputs even when the model's own knowledge was pointing toward the wrong answer.

Why does this matter beyond a lab test? Medical AI applications are already in deployment. A confidently wrong drug recommendation looks identical to a confidently correct one. There is no visual signal that something has gone wrong.

Honestly, the catches here are real and worth naming. The test set is 103 questions, which is small. Some intended models had to be swapped during testing due to compute constraints, which introduces inconsistency. And a 53% reduction still means roughly 47% of the errors remain. This is a meaningful step toward a solution, not the solution itself.
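The generate-then-audit loop can be sketched in a few lines. This is a deliberately simplified two-role version, not the paper's five-agent system: `generate` and `audit` are stubs standing in for model calls, and `drug_x` and `drug_y` are placeholder names rather than real medications.

```python
WITHDRAWN = {"drug_x"}  # placeholder; a real system would consult a maintained regulatory list


def generate(question, banned=frozenset()):
    """Stub generator: prefers the outdated answer unless told it is banned."""
    options = ["drug_x", "drug_y"]
    return next((o for o in options if o not in banned), None)


def audit(answer):
    """Stub safety auditor: flag answers that appear on the withdrawn list."""
    return answer in WITHDRAWN


def answer_with_audit(question, max_rounds=3):
    banned = set()
    for _ in range(max_rounds):
        answer = generate(question, banned)
        if not audit(answer):
            return answer
        banned.add(answer)  # feed the auditor's objection back to the generator
    return None  # abstain rather than return a flagged answer


print(generate("Which drug treats condition Z?"))           # drug_x
print(answer_with_audit("Which drug treats condition Z?"))  # drug_y
```

The point of the structure is that the auditor's check does not rely on the generator's own knowledge, so an outdated answer can be caught even when the generator is confident in it.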

Glossary

hallucination error rate — The proportion of AI responses that contain confidently stated but factually wrong information.

multi-agent architecture — A setup where multiple AI instances — often running the same base model — each play a distinct role and check each other's work.

parametric knowledge — Information baked into a model's weights during training, as opposed to information retrieved from an external source at query time.
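A quick worked example shows what a relative reduction of this size means in practice. The 53% figure is the one reported above; the 60% baseline error rate is a made-up number chosen only to make the arithmetic visible.

```python
def remaining_rate(baseline_rate, relative_reduction):
    """Error rate left over after a relative reduction is applied."""
    return baseline_rate * (1 - relative_reduction)


HYPOTHETICAL_BASELINE = 0.60  # assumed for illustration, not a result from the paper
REPORTED_REDUCTION = 0.53     # roughly 53%, as reported

print(round(remaining_rate(HYPOTHETICAL_BASELINE, REPORTED_REDUCTION), 3))  # 0.282
```

Under that assumed baseline, a model that picked the banned drug on 60 of every 100 questions would still pick it on about 28 of them after the audit layer is added.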

Source: Trust but Verify: Mitigating Medical Hallucinations via Post-Hoc Adversarial Auditing and Multi-Agent Feedback Loops

The bigger picture

Three stories, one uncomfortable thread: AI fails quietly. In the kitchen simulation, the model writes a plan that breaks things three steps later and doesn't know it. In the guardrail attack, the safety system meant to catch problems becomes the bottleneck that an attacker exploits. In the medical test, the AI recommends a banned drug with the same confidence it uses for a valid one.

None of these are new failure types. What's new is how precisely researchers are now measuring them, with benchmarks, attack taxonomies, and quantified error rates. That matters because you cannot fix what you cannot measure. What these papers collectively argue is that deploying AI agents responsibly requires a whole new layer of engineering: not just 'does it work?' but 'how does it fail, and can you catch that failure before it costs something?' By any honest read, we are early in figuring that out. The tools to detect the problems are arriving before the tools to fix them.

What to watch next

The guardrail denial-of-service paper tested four real deployment frameworks (OpenHands, LangGraph, BrowserGym, and OSWorld), so watch for responses from those teams in the coming weeks, either with patches or statements. On the medical side, the real test will be whether multi-agent audit approaches hold up on larger, more diverse drug datasets; 103 questions is a proof of concept. The open question I'd most want answered: does counterfactual foresight prompting work for real-world agent tasks outside a curated simulation, or does the reduction in latent failures of up to 72% vanish in messier conditions?