
DeepScience · Artificial Intelligence · Daily Digest

AI That Doubles Down, Guesses Blind, and Reads Scans Better

Three new studies reveal the same gap in AI systems: they perform well on average but break in predictable ways when evidence gets thin.
April 17, 2026
Hi — today's papers all have zero citations, which is normal for brand-new preprints, so we're judging on substance, not social proof. Three papers stood out from 276 as genuinely vivid and worth your time. One improves medical AI. One catches AI models doubling down on wrong answers. One tests whether AI can say 'I don't know.' Let's go.
Today's stories
01 / 03

A Checklist-Following AI Reads Chest Scans More Reliably

What if an AI radiologist worked through a checklist instead of eyeballing the whole scan at once?

Imagine asking a junior doctor to read a complex 3D medical scan and write a report. The brute-force approach has them absorb everything in one pass. RadAgent — described in this arXiv preprint by researchers who built it on top of an existing model called CT-Chat — does something closer to how a careful clinician actually works: it follows a diagnostic checklist, calling on specialist tools one at a time. First it segments the lungs, then checks for fluid buildup, then flags abnormal tissue — step by step, logging its reasoning at each stage. It's less like staring at a scan and more like running down a flight checklist before takeoff.

The results are specific and real. Compared to CT-Chat, RadAgent improved accuracy on detecting pathologies by 6 percentage points in macro-F1 — a score that weights rare findings equally with common ones, which matters a lot in medicine. When researchers tried to fool it with planted misleading data — an adversarial robustness test — RadAgent held up 42% better than the baseline. And for the first time in this model family, the system could explain why it reached a conclusion, a property called faithfulness, which CT-Chat couldn't do at all.

The catch: that faithfulness score is still only 37%. Roughly two-thirds of the time, the system's stated reasoning doesn't fully trace back to what actually drove its answer. For a medical tool, that is a real limit — you need to trust the explanation, not just the diagnosis. The system was also evaluated on specific benchmarks, not a live hospital workflow. This is a step in the right direction. It is not a solved problem.

Glossary
macro-F1: A scoring method that averages accuracy equally across all categories — rare diseases count as much as common ones — so the system can't cheat by ignoring uncommon findings.
faithfulness: Whether the reasoning a model writes down actually matches what led it to its answer, rather than being a plausible-sounding story invented after the fact.
adversarial robustness: How well a system holds up when someone deliberately feeds it misleading information designed to trick it.
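To make the macro-F1 idea concrete, here is a minimal sketch with toy labels (the class names and data are illustrative, not the paper's): a model that ignores a rare finding can still look accurate, but macro-F1 punishes it.

```python
def f1(tp, fp, fn):
    # Harmonic mean of precision and recall; 0 if the class is never predicted correctly.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    # Average per-class F1 with equal weight, so a rare finding
    # counts exactly as much as a common one.
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

# Toy example: "nodule" is rare; missing it drags macro-F1 down
# even though plain accuracy would be 80%.
y_true = ["normal"] * 8 + ["nodule"] * 2
y_pred = ["normal"] * 10  # model ignores the rare class entirely
print(round(macro_f1(y_true, y_pred), 3))  # 0.444
```

This is why a 6-point macro-F1 gain is meaningful in radiology: the metric can't be inflated by getting the common, easy cases right.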
02 / 03

AI Systems Double Down on Wrong Answers Instead of Revising

Once these AI systems make up their minds in step one, the rest of their 'reasoning' is mostly decoration.

Think of a student filling in a multiple-choice test. A careful student marks a tentative answer, then revises it as they reason through the options. Now picture a student who writes their first instinct in permanent marker and spends the rest of the time inventing justifications for it. That second student turns out to describe the behavior of most of the 18 vision-language models — AI systems that process both text and images — studied in this paper.

The researchers tracked how each model's confidence changed across each step of its chain-of-thought: the step-by-step text the model generates before giving a final answer. What they found was 'answer inertia.' Once a model commits to an early answer, it almost never revises. Instead of using reasoning to find the right answer, it uses reasoning to defend the first one. Worse: when the team planted misleading text clues — wrong information embedded in the question — models followed the text even when the image clearly showed something different. The more capable, reasoning-trained models actually named the misleading clue in their reasoning steps. But they still followed it to the wrong conclusion. They were, in a sense, more articulate about being misled.

The catch: the study covers 18 models across three benchmarks, and the full statistical methodology isn't visible in the available version of the paper. How far this generalises across task types and model families is still open. But the core finding — that longer, more fluent reasoning traces can look trustworthy while being structurally broken — is a real design problem, not just a benchmarking curiosity. It matters for anyone deciding how much to trust an AI's 'reasoning.'

Glossary
chain-of-thought: The step-by-step text an AI writes out before giving a final answer — intended to make reasoning transparent, but not necessarily accurate.
answer inertia: The tendency of a model to lock in an early answer and reinforce it through subsequent reasoning steps rather than genuinely revising.
vision-language model: An AI system trained to process both images and text together, allowing it to answer questions about pictures.
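One way to see what 'answer inertia' means operationally — a minimal sketch, not the paper's actual method (their full methodology isn't visible), assuming you can read out a per-step probability over answer options from the model:

```python
def revision_rate(step_probs):
    """Fraction of chain-of-thought steps where the top-ranked answer changes.

    step_probs: one dict per reasoning step, mapping answer option to
    probability. (A hypothetical readout; real studies extract this from
    logged confidences or model internals.) A rate near 0 across many
    traces is the signature of answer inertia: the first pick survives
    no matter what the later reasoning says.
    """
    tops = [max(p, key=p.get) for p in step_probs]
    changes = sum(a != b for a, b in zip(tops, tops[1:]))
    return changes / max(len(tops) - 1, 1)

# An "inert" trace: the model picks B at step 1 and only grows
# more confident in it, never actually revising.
inert = [
    {"A": 0.40, "B": 0.45, "C": 0.15},
    {"A": 0.25, "B": 0.60, "C": 0.15},
    {"A": 0.10, "B": 0.85, "C": 0.05},
]
print(revision_rate(inert))  # 0.0 — no revision across steps
```

The point of a metric like this is that it separates *fluency* from *revision*: a trace can grow longer and more articulate while this number stays at zero.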
03 / 03

AI Vision Systems Almost Never Say 'I Don't Know' — Even When They Should

Ask today's best AI vision systems an impossible question and they'll answer confidently anyway.

Picture a quiz show contestant who, when asked something they've never encountered, confidently invents a plausible-sounding answer rather than saying 'I don't know.' That is essentially the default behaviour of today's leading AI vision systems when asked questions they cannot possibly answer — according to this study, which built a benchmark of 2,079 questions deliberately designed to be unanswerable. The researchers, who constructed a dataset called MM-AQA, did this by systematically degrading real questions: removing the relevant image, stripping out the evidence, scrambling what was visible. Then they asked three frontier vision-language models anyway.

Under standard conditions, the models almost never declined to answer. Even a simple trick — asking the model to rate its own confidence before responding — worked better than the default behaviour at getting it to abstain appropriately. Multi-agent setups helped too: when one AI system was tasked with checking another's answer, the rate of appropriate 'I don't know' responses went up. But here's the trade-off the researchers found: the harder you push a system to abstain, the more it starts refusing questions it could actually answer correctly. No system they tested exceeded 65% accuracy simultaneously on answerable and unanswerable questions. That ceiling held across all three frontier models.

The catch: MM-AQA is a controlled benchmark. Real-world situations where an answer is impossible are messier and more varied than a structured dataset captures. Also, this paper maps the problem — it doesn't propose a fix. Knowing your AI will confidently hallucinate answers to unanswerable questions is genuinely useful. Building one that doesn't is the next problem.

Glossary
abstention: An AI system's ability to decline to answer a question when it doesn't have enough information, rather than guessing.
multi-agent system: An arrangement where multiple AI models interact — for example, one generates an answer and another checks it — rather than a single model working alone.
MM-AQA: The benchmark dataset built by the researchers, containing 2,079 questions designed to be unanswerable, used to test whether AI systems know when to stay silent.
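The 'rate your own confidence before answering' trick can be sketched as a simple wrapper. This is a hypothetical interface, not MM-AQA's actual protocol: `ask` stands in for any callable that maps a prompt string to a model reply.

```python
def answer_with_abstention(ask, question, threshold=0.7):
    """Ask the model to rate its confidence first; abstain below the threshold.

    `ask` is any callable mapping a prompt string to a reply string
    (a stand-in for a real model API; the prompt wording here is
    illustrative).
    """
    rating = ask(
        "On a scale of 0 to 1, how confident are you that the following "
        "question is answerable from the available evidence? "
        f"Reply with a number only.\n\n{question}"
    )
    try:
        confidence = float(rating.strip())
    except ValueError:
        confidence = 0.0  # an unparseable rating counts as low confidence
    if confidence < threshold:
        return "I don't know."
    return ask(question)

# Stub model for illustration: rates every question at 0.2 confidence.
stub = lambda prompt: "0.2" if "confident" in prompt else "Paris"
print(answer_with_abstention(stub, "What city is shown in the image?"))
# -> I don't know.
```

The `threshold` parameter is exactly the trade-off dial the paper describes: raise it and the system abstains more often on unanswerable questions, but also starts refusing questions it could have answered correctly.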
The bigger picture

Put these three papers next to each other and a specific picture emerges — not 'AI is improving' or 'AI is broken,' but something more precise: AI systems are failing in predictable, structural ways when the evidence they are given is weak, ambiguous, or absent. RadAgent shows that adding structure and checklists to AI reasoning in a high-stakes domain helps, measurably — but the system still can't fully account for its own reasoning two-thirds of the time. The answer-inertia paper shows that the step-by-step reasoning that was supposed to make AI more transparent has a flaw built in: early commitments harden rather than revise. And the abstention paper shows the downstream consequence — systems that refuse to admit the limits of their own knowledge, even when those limits are obvious. All three point to the same gap: reliability under uncertainty. The research community is now diagnosing these failure modes with real precision. Building systems that don't have them is the harder work still ahead.

What to watch next

The SWE-bench Verified leaderboard is updated regularly as coding-agent teams submit results — worth watching over the next few weeks to see whether claims from papers like SWE-TRACE hold up against independent submissions. For RadAgent specifically, the key next milestone is external clinical validation in a real radiology workflow, which hasn't happened yet. The open question I'd most want answered: can you train a vision-language model to say 'I don't know' without sacrificing performance on questions it actually can answer? Nobody has a clean solution yet, and that gap is the thread connecting almost everything in this digest.

Thanks for reading — take the 'catch' sections seriously today; they're doing real work. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io