DeepScience — Artificial Intelligence

DeepScience · Artificial Intelligence · Daily Digest

AI Memory Can Be Hacked, Drones Fail Disasters, and Scans Get Smarter

Three stories about AI hitting the real world — and what breaks when it does.

            June 07, 2026
          

Good morning. Today's batch of 289 papers skews toward benchmarks and frameworks — lots of 'here is a test we built' energy, which is useful science but harder to make vivid. I picked three stories where the stakes are genuinely tangible: your AI assistant's memory, disaster drones, and radiology. Let's go.

Today's stories

              01 / 03
            

Your AI Assistant's Memory Can Be Poisoned — Here Is a Fix

Someone plants a forged note in your assistant's filing cabinet — and your assistant, trusting its own records, starts acting on it.

Imagine a personal assistant who remembers everything you've ever told them: every preference, every habit, every offhand comment. Now imagine a bad actor slips a fake note into that assistant's memory — not by hacking the AI directly, but by quietly poisoning the records it trusts. The AI reads its own files, believes them, and starts acting on false premises. That's not a hypothetical. That's what this paper documents. A team of researchers tested three widely used AI memory frameworks — A-Mem, Mem0, and MemOS — plus a real-world tool-enabled agent called OpenClaw. When memory is naively enabled, jailbreak attack success rates (attempts to make the AI do things it shouldn't) jump from about 3% to nearly 20% on average. Tool-call drift — where a planted memory redirects the AI to call the wrong tools entirely — goes from 5% to over 50%. The memory, designed to make AI more helpful, becomes the backdoor. Their proposed fix is called MemGate: a 9-million-parameter filter — tiny, about the size of a small autocorrect model — that sits between retrieved memories and the AI's reasoning engine. Think of it as a doorman who checks not just whether a memory is relevant, but whether it actually belongs in this context right now. Semantically similar memories don't automatically get in if they don't fit the current task. Results on OpenClaw with GPT-4o-mini: jailbreak rates dropped from 16.8% to 4.4%. Cross-domain memory leakage fell from 27% to 3.5%. And helpfulness, measured on the LoCoMo benchmark, actually nudged upward. The catch: this was tested against specific, known attack patterns in controlled conditions. Real adversaries adapt. MemGate is a promising filter, not a seal.

Glossary

jailbreak attack success rate (ASR) — The percentage of attempts that successfully trick an AI into doing something it's supposed to refuse.

tool-call drift — When an AI agent calls the wrong external tools — like the wrong app or API — because its memory has been corrupted with misleading context.

LoCoMo benchmark — A standardized test measuring how well an AI agent handles long conversational memory tasks.

Source: Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

              02 / 03
            

Disaster Drones Can Spot Fire but Can't Predict Where It Spreads

A drone hovers over a wildfire — the AI sees flames, but can it tell you which buildings will collapse next?

Here is the thing about disaster response: the useful question is never just 'what's in this image?' A firefighter with a drone feed needs to know which way the fire is propagating, which structures are compromised, what the likely evacuation bottleneck is. That is a completely different cognitive task from describing a photograph — and it turns out most AI vision systems are much better at the second job than the first. A research team built DisasterBench to measure exactly this gap. They collected 5,330 real low-altitude drone images from actual disaster scenes and generated 29,300 reasoning questions across 14 disaster types — floods, wildfires, earthquakes, building collapses — and 9 response-critical tasks: damage analysis, causal attribution, propagation prediction, decision support. They then ran 21 of the most capable AI vision models through this test. The drop-off is stark. SeViLA, a capable model, scores 73.8% on a standard video quiz but collapses to 24.9% on disaster questions. That is not a bad day — that is a system that would confidently label a flood photo while failing to tell you which buildings are at risk of structural failure. VideoChat2 drops 22 percentage points. Video-XL drops 12. The same team trained their own model, DisasterVL, using a three-step process: first teach it the domain from scratch, then teach it to reason step-by-step (showing its work, like a student in an exam), then reinforce good reasoning through trial and error. At 2 billion parameters — a relatively compact model — it matches GPT-4o's reasoning accuracy on this benchmark while running more cheaply. The catch: a benchmark is a controlled test, not a field deployment. Real disaster response involves live feeds, noisy data, and life-or-death time pressure. The gap from benchmark score to useful tool in an emergency operations centre remains wide.

Glossary

UAV — Unmanned Aerial Vehicle — a drone that can carry cameras or sensors over disaster zones without putting a pilot at risk.

chain-of-thought reasoning — A technique where an AI is trained to explain its intermediate steps before giving a final answer, rather than jumping straight to a conclusion.

benchmark — A standardised test used to compare AI models against each other under identical conditions.

Source: DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

              03 / 03
            

An AI That Compares Your Scan to 690,000 Others Can Catch More

An experienced radiologist reads your scan by mentally comparing it to thousands of similar cases — an AI just learned to do that at scale.

When a seasoned radiologist reads your chest X-ray, they are not working in a vacuum. They are running a silent comparison: does this shadow look like the patient last month who turned out to have a pulmonary embolism? That comparative intuition — built from years of exposure to thousands of cases — is hard to teach a junior colleague, and it has always been hard to teach a machine. Until recently, AI radiology tools mostly asked 'what is in this image?' not 'how does this compare to similar patients, and what changed?' The team behind MedReCo, drawn from eight institutions across four countries, tried to change that. They built a database of 690,000 medical images from 160,000 patients, spanning seven imaging modalities including chest X-ray and CT. Crucially, they didn't just store the images. They decomposed each patient's written radiology report into structured pieces — which anatomical structure, which abnormality, which pathology — and used that structured breakdown as the lens for retrieval and comparison. The results on longitudinal follow-up tasks — where a radiologist needs to compare your current scan against one from six months ago — are striking. Accuracy improved by 14.5 to 46.5 percentage points on chest radiographs and 13 to 27.9 percentage points on CT, compared to strong existing baselines. Across 24 comparative interpretation tasks, MedReCo ranked first. The catch: this is benchmark performance, not a clinical trial. Better scores in a lab do not automatically translate into better patient outcomes. Radiology AI has a well-documented history of promising lab results that soften under real-world conditions — with real radiologists, real time pressure, and real edge cases. Prospective clinical validation is the next necessary step, and it is a long one.

Glossary

longitudinal follow-up — Comparing a patient's current scan to a previous one over time, to detect whether a finding has grown, shrunk, or resolved.

Recall@1 — A retrieval metric: out of all possible matches, did the AI return the single most relevant case as its top result?

imaging modality — The type of medical imaging used — X-ray, CT, MRI, ultrasound, and so on, each of which produces a different kind of image.

Source: A Vision-language Framework for Comparative Reasoning in Radiology

The bigger picture

Look at what connects today's three stories and you get a clearer picture of where AI actually is right now: powerful in the lab, brittle at the boundary where the real world starts. AI agents with memory are useful — but that memory surface is a vulnerability nobody fully stress-tested until now. AI drone vision can label a disaster scene — but it still cannot reason about what happens next with any reliability. AI radiology retrieval is impressive on benchmarks — but benchmarks are not patients. The pattern across all three is the same: a real capability exists, a real limit sits just beyond it, and the limit matters more than the capability in high-stakes settings. This is not pessimism. It is a description of what 'progress' actually looks like up close: one honest problem solved, and the next one coming into focus. The researchers who built MemGate, DisasterVL, and MedReCo all deserve credit for naming their limits explicitly. That is the job.

What to watch next

On the memory-safety front, watch whether MemGate or something like it gets adopted by any of the major agent frameworks — Mem0 and MemOS both have active developer communities, and a 9-million-parameter filter is cheap enough to drop in. For disaster AI, the open question I'd want answered is how DisasterVL performs on live drone feeds rather than curated static images — that's a very different test. For MedReCo, the number to watch is whether any of the contributing institutions announce a prospective clinical trial.