
DeepScience · Artificial Intelligence · Daily Digest

AI Can't Follow a Surgery, Remember Yesterday, or Watch Itself

Three papers today show AI quietly failing at the depth of tasks it appears to handle on the surface.
May 10, 2026
Good Sunday. Today's batch of 279 papers skews heavily toward benchmarks and position papers — lots of 'here is how AI is failing' rather than 'here is a fix.' That's actually useful information. I picked three stories that fit together into a single uncomfortable picture: AI systems that look capable in demos are struggling with sustained, deep attention in the real world. Let me walk you through them.
Today's stories
01 / 03

AI Models Flunk Basic Tests on Full Surgical and Medical Videos

The best AI vision model scored 41.1% on questions about surgical videos — and random guessing gets you 25%.

Picture someone handed a 10-hour recording of a knee surgery and asked multiple-choice questions about what happened at the three-hour mark. Now give them only one frame every ten minutes to work from. That is roughly the challenge facing today's best AI vision models on real clinical footage — and the MedHorizon team just published a benchmark that makes the gap impossible to ignore.

The researchers assembled 340 full-length clinical videos from public datasets — surgical procedures and diagnostic exams spanning 7 organs, totalling 759 hours, with individual recordings running up to 37 hours. They wrote 1,253 multiple-choice questions that required actually following the procedure from start to finish. Then they tested the best available AI vision models against those questions. The best score: 41.1% correct. Random guessing on a four-choice question gets you 25%. So the models are beating chance — but not by a reassuring margin.

The finding that surprised the researchers most: throwing more frames at the model did not reliably help. You would expect that sampling more moments from a long video gives the AI more evidence. Instead, performance was non-monotonic — sometimes more frames made things worse. The real bottlenecks turned out to be two things: finding the relevant moment in the footage in the first place, then making a correct clinical interpretation from it. More data does not fix a broken search.

Why this matters: hospitals and health-tech companies are actively building AI tools to review surgical video for training, quality auditing, and complication detection. If the underlying models cannot follow a real procedure start to finish, those products carry a hidden fragility that a quick demo will not reveal.

The catch: MedHorizon is an evaluation paper, not a fix. No new model was built or trained. It tells us where the floor is. Raising it is someone else's next job.
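To make the frame-budget arithmetic concrete, here is a minimal sketch of uniform frame sampling over a long recording. The budget, timestamps, and function are illustrative assumptions, not details from the MedHorizon paper; the point is how sparse the model's view of a multi-hour procedure becomes.

```python
from datetime import timedelta


def uniform_sample_times(duration_s: float, frame_budget: int) -> list[float]:
    """Pick `frame_budget` evenly spaced timestamps (in seconds) across a video."""
    step = duration_s / frame_budget
    return [step * (i + 0.5) for i in range(frame_budget)]


# Illustrative numbers (assumed, not from the paper): a 10-hour procedure
# viewed through a 64-frame budget.
duration_s = 10 * 3600
budget = 64
times = uniform_sample_times(duration_s, budget)
gap = timedelta(seconds=duration_s / budget)

print(f"One frame every {gap} of footage")               # roughly 9 minutes apart
print(f"First three sample points: {[round(t) for t in times[:3]]} s")
# A complication lasting 90 seconds can fall entirely between two samples,
# which is one reason simply raising the frame budget does not guarantee the
# relevant moment is ever seen, let alone interpreted correctly.
```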

Glossary
non-monotonic scaling: When adding more of something (here: video frames) does not consistently improve performance — sometimes it makes things worse rather than better.
MLLM: Multimodal large language model — an AI that can process both text and images or video, not just written words.
02 / 03

AI Assistants Keep Acting on Outdated Information — And Don't Know It

If you tell an AI your address and later hint you moved, there is a good chance it keeps mailing things to the old place.

Think about a friend with perfect recall who keeps giving out the old phone number for your favourite restaurant — because you once mentioned 'I couldn't reach them,' but never said 'the number changed.' They have all the information. They just never connected the dots.

That is the failure mode a research team formalized in a new benchmark called STALE. They built 400 conflict scenarios across 100-plus everyday topics — user addresses, jobs, medical preferences, policy rules — and tested whether AI assistants would update their behaviour when earlier information was quietly invalidated by something said later. The crucial twist: the invalidation was never stated directly. A new message implied the old fact was wrong, without spelling it out.

The best model they tested scored 55.2% correct across all scenarios. That is better than guessing, but it means nearly half the time the AI acted on stale information it should have revised.

The researchers identified three things that make this hard. First, detecting the conflict requires connecting pieces of information spread across a long conversation — not just retrieving them, but reasoning about their relationship. Second, AI assistants are trained to be helpful, which pushes them to accept whatever premise is baked into a user's question rather than challenge it. Third, memory systems used in commercial AI products were designed to store and retrieve, not to automatically invalidate earlier beliefs.

The team also proposed a prototype fix they call CUPMEM, which consolidates conflicting beliefs at write time rather than at retrieval time. Early results look promising, though the paper is careful not to claim the problem is solved.

The catch: STALE is brand new. We do not yet know how these benchmark scores translate to real deployed assistants in actual use. The failure mode is real. How bad it is in practice is still an open question.
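To make "consolidating at write time" concrete, here is a toy sketch of the general idea. It is not the paper's CUPMEM implementation: the slot-based memory, the conflict rule, and the example values are all assumptions, and it sidesteps the hard step the benchmark actually measures, which is noticing that a later message implicitly invalidates an earlier fact at all.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Toy user-memory store keyed by slot (e.g. 'address', 'employer')."""
    facts: dict = field(default_factory=dict)     # slot -> current value
    history: list = field(default_factory=list)   # superseded values, kept for audit

    def write(self, slot: str, value: str) -> None:
        # Write-time consolidation: a new value for the same slot retires the
        # old one immediately, instead of leaving two live entries for the
        # retriever to disambiguate later.
        if slot in self.facts:
            self.history.append((slot, self.facts[slot]))
        self.facts[slot] = value

    def read(self, slot: str) -> str | None:
        return self.facts.get(slot)


mem = Memory()
mem.write("address", "12 Oak Street")
# Later in the conversation the user implies they have moved. The benchmark's
# hard step is detecting that implication; once detected, consolidation is just:
mem.write("address", "45 Maple Avenue")
print(mem.read("address"))   # '45 Maple Avenue'; '12 Oak Street' sits in history
```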

Glossary
implicit conflict: When a new piece of information makes an earlier belief wrong without explicitly saying so — requiring inference to notice the clash.
belief revision: The process of updating what you know (or what an AI 'knows') when new evidence contradicts earlier information.
03 / 03

Using AI to Make AI Safer Might Create New Hidden Problems

What happens when the tool you use to check AI safety is itself an AI that makes the kind of mistakes humans are least likely to catch?

One of the big bets in AI safety right now is using AI systems to help do the research that makes AI safe — running experiments faster, drafting papers, checking arguments. A position paper published this weekend argues that this approach has a structural problem that has not been taken seriously enough. The authors — who engage closely with prior work from researchers including Joe Carlsmith, John Clymer, and Jan Leike — lay out two places where automated alignment research can fail quietly.

First: individual errors. If an AI assistant makes mistakes while doing safety research, those mistakes are unlikely to be random. They will be skewed toward whatever blind spots were baked in during training. Think of hiring a hundred reviewers to audit a legal contract when they all studied from the same flawed textbook. Each one seems thorough. The errors they share are exactly the ones your lawyer would not catch.

Second: aggregation. Even if most individual outputs are correct, combining thousands of AI-generated findings into a final verdict about whether a system is safe is a separate problem. If all the AI researchers made correlated errors — the same type of mistake, from the same source — the combined vote can still go wrong even when each individual input looked fine. The paper calls this an 'aggregation-level failure,' and argues that currently proposed solutions, including debate-based oversight methods, have no principled answer to it.

The honest limit here is significant: this is a pure argument paper, with no experiments, no datasets, no numbers. The authors do not prove these failures will happen. They argue there is no known way to prevent them. That is a call for caution, not panic. But it is also a reason not to assume that using AI to accelerate safety research is automatically safe just because individual outputs look reasonable.
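The aggregation argument is easy to see with a small simulation. None of this comes from the paper: the error rates and the "shared blind spot" model are assumptions chosen to show why a majority vote over many reviewers behaves very differently once their mistakes are correlated.

```python
import random


def majority_correct(n_reviewers: int, p_individual_error: float,
                     p_shared_blindspot: float, trials: int = 20_000) -> float:
    """Fraction of trials where a majority vote of reviewers reaches the right verdict.

    With probability p_shared_blindspot, every reviewer hits the same blind spot
    and errs together; otherwise each errs independently with p_individual_error.
    """
    correct = 0
    for _ in range(trials):
        if random.random() < p_shared_blindspot:
            votes_right = 0                       # correlated failure: all wrong at once
        else:
            votes_right = sum(random.random() > p_individual_error
                              for _ in range(n_reviewers))
        correct += votes_right > n_reviewers / 2
    return correct / trials


random.seed(0)
# Independent 10% error rate: with 101 reviewers, the majority is almost never wrong.
print(majority_correct(101, 0.10, 0.00))   # close to 1.00
# Same individual accuracy, but 15% of cases hit a shared blind spot:
# the vote tops out around 0.85 no matter how many reviewers you add.
print(majority_correct(101, 0.10, 0.15))   # close to 0.85
```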

Glossary
automated alignment research program (AARP): An effort to use AI systems to do the scientific work of figuring out how to make AI safe and well-behaved.
scalable oversight: A family of techniques — including debate, where two AIs argue and a human judges — aimed at allowing humans to supervise AI work that is too complex for humans to evaluate directly.
correlated errors: Mistakes that tend to appear together across many sources because those sources share the same origin, training data, or blind spots.
The bigger picture

Here is what these three papers are collectively saying, and I think it matters: AI systems are getting good at the surface of tasks while quietly struggling with the depth those same tasks require. MedHorizon shows the gap between 'AI can analyse medical images' and 'AI can follow a 10-hour surgical procedure' is enormous, and more data does not automatically close it. STALE shows that memory retrieval works fine but belief revision — knowing when earlier knowledge is no longer valid — mostly doesn't. The automated alignment paper warns that these gaps don't disappear when AI turns its attention inward to safety work; they might become harder to see precisely because the output still looks plausible. The common thread is not that AI is bad at everything. It is that our current ways of evaluating AI tend to catch the loud failures. The quiet ones — sustained attention, belief updating, correlated blind spots — are accumulating in the background. That is worth sitting with.

What to watch next

MedHorizon is freshly published, so watch for responses from major medical AI labs — particularly whether clinical video understanding becomes an explicit training target in the next wave of multimodal model releases. For the alignment debate, the real test is whether labs running automated alignment research programs engage publicly with these structural criticisms or quietly adjust their methodologies. And the open question I would most want answered next: can explicit belief-revision mechanisms move STALE scores above 70%, or is the problem deeper than any memory architecture can fix?

Thanks for reading — and for tolerating a Sunday that was more warning signs than breakthroughs. Sometimes that's the honest digest. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io