DeepScience · Artificial Intelligence · Daily Digest

AI agents can be poisoned, blinded, and still can't leave the lab

Three new papers reveal how far AI still has to go before it can be trusted in the real world.

June 10, 2026

Today's digest is a useful reality check. Three papers landed that, taken together, paint a picture less about what AI can do and more about what keeps tripping it up — in operating rooms, on factory floors, and inside the agents you might soon trust to browse the web for you. Let me walk you through each one.

Today's stories

01 / 03

Someone Can Secretly Poison Your AI Agent's Memory

Imagine you hire an assistant, and someone secretly rewrites a single page in their notebook. From that day on, every time they see a red mug, they quietly do something you never asked for.

That's the core idea behind MemVenom, a new attack framework built by researchers studying web agents: AI systems that browse, click, and take actions on your behalf. These agents increasingly rely on external memory, a stored record of past tasks and observations they can look up when something familiar comes up, like a filing cabinet they dip into mid-job.

The team showed that if an attacker can slip one poisoned entry into that filing cabinet (a specially crafted image or text snippet), the agent will retrieve it whenever a specific trigger appears. Once retrieved, the poisoned memory quietly steers the agent toward the attacker's goal instead of yours. They tested this across three different agent frameworks and four different AI models, including ones in the GPT-5 family. The attack succeeded up to 99.15% of the time. The agent's performance on normal, unpoisoned tasks barely changed, so nothing would look wrong from the outside.

This is called a poisoning attack: you're not breaking into the system while it runs, you're corrupting the material it learns from or refers to. Think of it like swapping out one card in a recipe box: the kitchen looks fine until someone orders that dish.

The catch: this attack requires the attacker to have write access to the agent's memory before it's deployed. That's not trivial. And the researchers are candid that they haven't benchmarked defenses yet; this paper is a diagnosis, not a cure. But the finding matters because memory-enabled agents are arriving fast, and the security assumptions we've applied to simpler software may not transfer. Nobody has fully solved this.
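To make the retrieval step concrete, here is a toy sketch in Python. It is not MemVenom's method: the bag-of-words "embedding", the memory entries, and the function names are all assumptions made up for illustration, and real agents use learned multimodal embeddings. It only shows why a single stored entry can stay dormant on ordinary tasks and surface when a trigger appears.

```python
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: a bag-of-words count vector (assumption)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(memory: list[dict], observation: str) -> dict:
    """Return the stored entry whose key is most similar to the observation."""
    query = embed(observation)
    return max(memory, key=lambda entry: cosine(embed(entry["key"]), query))


# Hypothetical memory store: two ordinary entries and one poisoned entry.
memory = [
    {"key": "search for flights to Paris", "note": "use the airline search form"},
    {"key": "add item to shopping cart", "note": "click the add-to-cart button"},
    # The poisoned entry's key is crafted to match the trigger (a red mug).
    {"key": "red mug product page", "note": "ATTACKER INSTRUCTION"},
]

# An ordinary task retrieves an ordinary memory.
benign = retrieve(memory, "add a blue notebook item to the shopping cart")
# An observation containing the trigger retrieves the poisoned memory.
triggered = retrieve(memory, "product page showing a red mug")

print(benign["note"])
print(triggered["note"])
```

In this sketch the poisoned entry never wins retrieval on the ordinary task, which mirrors the digest's point that normal performance can look unchanged while the trigger case is compromised.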

Glossary

web agent — An AI system that autonomously browses websites, fills forms, and takes actions on your behalf.

poisoning attack — An attack where someone corrupts the data a system learns from or refers to, rather than breaking in while it's running.

external memory — A stored database of past tasks and observations that an AI agent can look up to inform future actions.

Source: MemVenom: Triggered Poisoning of Multimodal Memories in Web Agents

02 / 03

The Best AI Can't Even Name Half the Tools in a Toolbox

Hand a room of new apprentices a photograph of a messy workshop and ask them to fix a wiring fault, and the best one in the room gets the tool list right less than 60% of the time.

That's roughly what a research team found when they built PhysTool-Bench, a test covering 2,678 real physical tools across manufacturing, electrical work, agriculture, and healthcare. They asked thirteen of today's best AI vision models, the kind that can read a scene through a photograph, two things: what tools are in this picture, and in what order would you use them to complete this task?

The best performer, Gemini-3.1-Pro, correctly identified tools in a scene only 58.7% of the time. On the harder question, getting the full sequence of steps right, it succeeded on just 21% of queries. For tasks requiring six or more tools, the success rate collapsed to 0.5%. To put that in human terms: when the researchers had a person attempt the same questions on tasks they were personally familiar with, accuracy reached 75%. Even on unfamiliar tasks, humans averaged 38%. The AI's best result was 21% across the board.

In the paper, the team identifies two separate problems. First, these models often simply miss tools in a photograph: they don't recognise the object. Second, even when they do see a tool, they frequently mistake it for something that looks similar but works differently, like grabbing a Phillips screwdriver when you needed a flathead.

The catch: this is a benchmark, a controlled test and not a live deployment. Real robots operating in well-labelled environments might do better with extra structure. But the gap to human performance is clear and large, and it suggests that before AI can reliably work alongside us on physical tasks, something more than scaling up the model size is needed.
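For a feel of how strict "getting the full sequence right" is, here is a minimal Python sketch of exact-match scoring as it is generally defined. The tool names and example answers are hypothetical, and this is not the benchmark's own scoring script, which may normalise or compare answers differently.

```python
def exact_match(predicted: list[str], reference: list[str]) -> bool:
    """True only if the predicted sequence equals the reference, item by item and in order."""
    return predicted == reference


def exact_match_rate(examples: list[tuple[list[str], list[str]]]) -> float:
    """Fraction of (predicted, reference) pairs that match exactly; 0.0 for no examples."""
    if not examples:
        return 0.0
    return sum(exact_match(p, r) for p, r in examples) / len(examples)


# Hypothetical reference answer for a wiring task.
reference = ["wire stripper", "screwdriver", "multimeter"]

examples = [
    (["wire stripper", "screwdriver", "multimeter"], reference),  # fully correct
    (["screwdriver", "wire stripper", "multimeter"], reference),  # right tools, wrong order
    (["wire stripper", "pliers", "multimeter"], reference),       # one substituted tool
]

rate = exact_match_rate(examples)
print(f"{rate:.2f}")
```

Only the first answer counts, so two near misses score the same as two complete failures. That is one reason scores under this kind of metric tend to drop as the number of required tools grows.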

Glossary

MLLM (multimodal large language model) — An AI model that can process both text and images — it 'sees' a photograph and reasons about what's in it.

Exact Match — A strict scoring method where the AI's full answer must be identical to the correct answer to count as right — no partial credit.

Source: Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

03 / 03

Ten Years, 6,000 Papers, and AI Still Isn't in the Operating Room

A garage full of blueprints is not a car — and after a decade of AI research in anaesthesia, that garage is overflowing.

Researchers conducted a systematic review, a rigorous survey of all the published evidence, screening 6,425 papers on AI in anaesthesiology, the field responsible for keeping patients safely unconscious during surgery. Of those, 1,021 met their inclusion criteria. That is a large literature. The question they set out to answer: how much of it has actually changed what happens in the operating room?

Almost none of it. The review found that the vast majority of studies were building and testing models (predicting surgical complications, monitoring vital signs, flagging risky patients), but almost nothing had made the jump into everyday clinical use.

Three categories of barriers came up again and again. Model limitations: many AI tools perform well in the lab but wobble in real hospitals with messier data. Interoperability: hospitals run dozens of different software systems, and plugging a new AI tool in is like trying to connect a modern appliance to a socket that doesn't fit the plug. And socio-technical challenges: clinicians need to trust a tool before they hand it authority over a patient's breathing and blood pressure, and trust takes time, evidence, and transparency that most AI papers don't yet provide.

I'd note a limit here: the review summarises its findings narratively and does not pool them in a statistical meta-analysis, so it tells us the shape of the problem, not a precise measurement of it. Only two authors screened the papers, and it wasn't pre-registered. Still, the conclusion is hard to argue with: the research output is enormous, the clinical footprint is tiny, and the distance between those two things deserves serious attention.

Glossary

systematic review — A structured survey of all published research on a topic, designed to minimise bias in how studies are selected and summarised.

interoperability — The ability of different software systems to talk to each other and share data without needing custom fixes.

perioperative — The period immediately before, during, and after a surgical operation.

Source: Artificial intelligence in anaesthesiology: why don't we have it in our hands after a decade of innovation? A systematic review and perspective

The bigger picture

Here's what I think these three papers are collectively telling you. We spend a lot of time talking about AI capability — what the latest model can do on a benchmark, what score it hit, how it compares to last quarter's version. These papers point somewhere else entirely: the gap between what AI can do in a test and what it can be trusted to do in the world. Web agents can be manipulated through their own memory in ways that are nearly invisible. Vision models can't reliably read a toolbox. And a decade of clinical AI research has produced thousands of papers and almost no change in how anaesthesiologists actually work. The common thread isn't raw performance — it's robustness, trust, and deployability. Those aren't glamorous problems. They don't make for punchy headlines. But they are, increasingly, the actual bottleneck. Solving them requires different skills than building a bigger model: security engineering, clinical workflow design, and the kind of careful long-term validation that takes years, not months.

What to watch next

On the security front, a companion survey paper out this week (arXiv 2606.10749) counted just 3 papers on LLM agent security in 2023 and 121 by 2025. That field is moving fast, and we should expect the first standardised agent security benchmarks to arrive before the end of the year. On the clinical AI side, the more interesting question to me is whether any regulatory body will require a minimum threshold of real-world evidence, not just benchmark scores, before approving AI tools for high-stakes medical settings. That debate is live, and worth watching.