
DeepScience · Artificial Intelligence · Daily Digest

AI's Memory Gaps, Medical Blind Spots, and a Safety Warning

AI agents forget what's changed, struggle with medical video, and the process of making AI safe may itself be unreliable.
May 08, 2026
Three papers today, and they share an uncomfortable thread: AI systems that look like they're doing the job but are quietly failing in ways that are hard to catch. No thin day here — the reading is dense and the implications are real. Let me walk you through each one.
Today's stories
01 / 03

AI Agents Confidently Act on Facts That Expired Months Ago

Your AI confidently follows a rule that was quietly changed three months ago — and it has absolutely no idea.

Think of your GPS app. It knows the fastest route — but if the road data hasn't refreshed, it will confidently send you down a street that's been closed for construction for weeks. You follow it because it sounds sure of itself. AI assistants have the same problem.

Researchers built a test they named STALE — as in, stale information — to probe one specific failure: what happens when something an AI 'remembers' has been quietly made invalid by newer information? They built 400 scenarios and 1,200 questions covering over 100 everyday topics, with context windows up to 150,000 tokens (roughly 115,000 words — a long novel's worth of information).

The numbers are not flattering. The best frontier models — the large commercial AI systems you've probably used — achieved only 55.2% accuracy overall. Random guessing on a four-option question gets you 25%. So yes, better than chance. But not by a margin you'd call reliable.

Here's what makes this particularly tricky. The AI wasn't just missing the new information — it often retrieved it correctly. The problem was that it still acted on the old assumption anyway. Imagine reading that a road is closed, and then driving there regardless because your mental route was already set. That's the AI equivalent.

The researchers also prototyped a fix, called CUPMem, that forces the system to revise its stored facts at the moment new information arrives rather than checking for conflicts later. Early results look promising.

The catch: this is a controlled test, not a real-world deployment study. The 55.2% reflects the hardest cases. And CUPMem is a prototype — how well it holds up in production use is still an open question.
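To make the revise-on-write idea concrete, here's a minimal sketch in Python. It is not the paper's implementation, and the class and field names are invented for illustration, but it shows the difference between a memory that simply appends facts and one that revises the stored fact the moment newer information arrives, which is the behavior CUPMem is described as enforcing:

```python
from dataclasses import dataclass

@dataclass
class Fact:
    value: str
    updated_at: int  # timestamp or turn index when this version was written

class AppendOnlyMemory:
    """Naive store: keeps every version and hopes retrieval sorts it out."""
    def __init__(self):
        self.log = []  # list of (key, Fact) entries, oldest first

    def write(self, key, value, t):
        self.log.append((key, Fact(value, t)))

    def read(self, key):
        # Returns every stored version; the agent may act on whichever it likes.
        return [fact for k, fact in self.log if k == key]

class UpdateOnWriteMemory:
    """Revises the stored fact the moment new information arrives."""
    def __init__(self):
        self.facts = {}  # key -> current Fact

    def write(self, key, value, t):
        current = self.facts.get(key)
        # Newer information replaces the old fact immediately,
        # so later reads can never surface the superseded version.
        if current is None or t >= current.updated_at:
            self.facts[key] = Fact(value, t)

    def read(self, key):
        return self.facts.get(key)

# Toy scenario: a policy that quietly changed months ago.
naive, revised = AppendOnlyMemory(), UpdateOnWriteMemory()
for mem in (naive, revised):
    mem.write("refund_window_days", "30", t=1)   # original rule
    mem.write("refund_window_days", "14", t=9)   # quietly updated later

print(naive.read("refund_window_days"))    # both versions survive -> stale risk
print(revised.read("refund_window_days"))  # only the current rule remains
```

The design point is small but important: in the append-only store the stale version survives and the agent can still act on it, while in the revise-on-write store a superseded fact simply cannot be retrieved later.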

Glossary
frontier models: The largest, most capable commercial AI language models currently available to the public.
CUPMem: A prototype memory system that revises stored facts at the time new information arrives, rather than checking for conflicts later.
benchmark: A standardized set of test scenarios used to measure how well an AI system performs on a specific task.
02 / 03

AI Still Fails at Watching Long Medical Videos, Even With More Frames

Finding the two critical seconds inside 37 hours of surgery footage turns out to be nearly impossible for today's AI.

Imagine looking for a single misprint in a 900-page novel. That's roughly the challenge facing AI asked to analyze full-length clinical videos. The team behind MedHorizon — a new benchmark built from 340 real clinical videos spanning 759 total hours — found that the frames actually containing useful diagnostic evidence make up just 0.166% of total footage. That's about 1.7 frames out of every 1,000. In a 37-hour surgical recording, you could blink and miss everything that matters.

The benchmark covers seven organs, two types of procedures (diagnostic examinations and surgeries), and 1,253 multiple-choice questions with four answer options each. Human experts wrote and verified each question to make sure shortcuts weren't available — you actually have to watch and understand.

The best AI model tested answered 41.1% correctly. Random guessing would get you 25%. To be clear: the current ceiling for medical video AI is about 16 percentage points above random guessing on a four-option test.

Two bottlenecks stood out. First, evidence retrieval: AI systems lose focus across hours of footage and miss those rare critical frames because their attention drifts toward visually busy or repetitive parts of the video. Second, even when the AI finds the right moment, it often can't interpret its clinical meaning.

Here's the surprising part: giving the models more frames to look at didn't reliably improve performance — results actually became erratic. More data did not mean more accuracy.

The catch: MedHorizon is a scorecard, not a solution. It tells us how bad the problem is and gives researchers a target to aim at. No timeline for closing the gap is offered, and no AI model should be trusted to autonomously review clinical footage based on today's numbers.
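To get a feel for how punishing that 0.166% figure is, here's a back-of-the-envelope sketch. It assumes frames are sampled uniformly at random, a simplification introduced here for illustration (the benchmarked models use their own frame-selection strategies):

```python
# Probability of missing every evidence frame when sampling uniformly at random.
# The 0.166% density comes from the article; the sampling model is an assumption.
evidence_density = 0.00166

for frames_sampled in (32, 128, 512, 2048):
    p_miss_all = (1 - evidence_density) ** frames_sampled
    print(f"{frames_sampled:>5} frames sampled -> "
          f"{p_miss_all:.0%} chance of seeing no evidence frame at all")
```

Even with hundreds of sampled frames, a uniform sample will often contain no evidence at all. And the finding that adding frames made results erratic suggests the bottleneck isn't just coverage: attention and clinical interpretation fail even when the right frames are present.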

Glossary
evidence frames: The specific video frames that actually contain the medically relevant information needed to answer a clinical question.
benchmark: A standardized test used to measure AI performance; here, a set of questions tied to real clinical videos.
attention drift: When an AI system spreads its focus too broadly across long input, losing track of the moments that actually matter.
03 / 03

Using AI to Check Whether AI Is Safe May Produce Convincing but Wrong Answers

We're starting to use AI to check whether AI is safe, and that circularity may be hiding exactly the errors we most need to catch.

Picture a food safety inspector who is also the chef being inspected. Even with the best intentions, they are poorly positioned to catch the errors most likely to make people sick — the ones that feel like normal kitchen practice to them. They have the same blind spots as the person making the food.

That's the core worry in a new theoretical paper arguing that automated AI safety research — using AI systems to help evaluate and improve AI safety — is harder than the field currently assumes, and potentially dangerous in subtle ways. The authors' argument isn't that AI will deliberately deceive safety evaluators. It's simpler and more unsettling: the errors an AI makes when doing alignment research tend to cluster precisely where human reviewers are least likely to catch them, because optimization pressure pushes mistakes toward the reviewer's blind spots.

The researchers identify two structural failure modes. The first is output-level: individual pieces of safety research contain systematic errors that look plausible. The second, and trickier, is aggregation-level: even if individual findings are each correct, combining them can go badly wrong because multiple AI systems trained on similar data share the same hidden assumptions. Errors look like consensus.

Safety research also involves what the authors call 'fuzzy tasks' — judging whether an AI's values are well-calibrated, for instance. There is no clean answer key. That makes it very hard to audit AI work against anything solid.

The catch: this is a position paper, not an experiment. No AI system was tested under these conditions; the authors are mapping a risk, not measuring it. But the logic is careful, and the warning is aimed at a real and growing practice. Honestly, the field should be reading this.
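The aggregation-level worry is easier to see with a toy simulation. This is an illustration, not the authors' formal argument, and the numbers are made up: the point is only that when evaluators share a blind spot, unanimous agreement stops being evidence that a claim is sound.

```python
import random

random.seed(0)

N_CLAIMS = 1000
SHARED_BLIND_SPOT = 0.10   # fraction of claims that are flawed in a way every evaluator misjudges
INDEPENDENT_ERROR = 0.05   # per-evaluator noise on the remaining, genuinely sound claims
N_EVALUATORS = 5

unanimous_but_wrong = 0
for _ in range(N_CLAIMS):
    in_blind_spot = random.random() < SHARED_BLIND_SPOT
    if in_blind_spot:
        # Every evaluator shares the same hidden assumption, so all accept a flawed claim.
        verdicts = ["accept"] * N_EVALUATORS
    else:
        # Sound claim; evaluators err independently and occasionally reject it.
        verdicts = ["accept" if random.random() > INDEPENDENT_ERROR else "reject"
                    for _ in range(N_EVALUATORS)]
    if in_blind_spot and all(v == "accept" for v in verdicts):
        unanimous_but_wrong += 1

print(f"Flawed claims accepted unanimously: {unanimous_but_wrong} of {N_CLAIMS}")
```

Independent mistakes tend to get voted away; shared ones sail through looking like consensus, which is exactly the structural failure the authors flag.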

Glossary
alignment research: The scientific work of making AI systems behave in ways that match human values and intentions — the 'does it actually do what we want?' question.
aggregation-level failure: An error that emerges when combining individually correct research findings, because the findings share hidden assumptions that weren't visible when looking at each one alone.
fuzzy tasks: Tasks with no clear right or wrong answer that require judgment — difficult to check using automated methods because there's no reliable score to verify against.
The bigger picture

Three stories, one uncomfortable thread. STALE shows AI agents that retrieve the right fact and still act on the wrong one — a gap between knowing and doing that more data alone won't fix. MedHorizon shows that throwing more video frames at a medical AI doesn't improve performance when the system can't direct its attention to the 0.166% that actually matters. And the alignment paper warns that the tools we're building to check AI safety may carry the same structural weaknesses as the systems they're meant to check. All three push back against a comforting story: that scale and more input will eventually make AI reliable. What the research is surfacing is a different kind of problem — one about focus, about reasoning under uncertainty, and about the limits of checking your own work. Better benchmarks are emerging, which is genuinely useful. But measuring the gap and closing it are two different things, and right now we're much better at the former.

What to watch next

The CUPMem prototype from the STALE team is one to follow — early indications suggest it addresses the retrieval-action gap, but it hasn't been stress-tested in deployment yet. On the safety front, watch for responses from the scalable oversight community to the automated alignment paper, particularly from groups at Anthropic and DeepMind who are actively building the kinds of systems the paper warns about. The open question I'd most want answered: does aggregation-level failure actually show up in measurable ways when you run current automated safety pipelines end-to-end?

Thanks for reading — and if the alignment story unsettled you a little, good. It should. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io