

DeepScience · Artificial Intelligence · Daily Digest

AI Reads Scans Better, But Still Can't Say 'I Don't Know'

Today's papers reveal a clear pattern: AI performs better when given structure, and worse when asked to admit its own limits.
April 19, 2026
Three papers today, and they rhyme in a way I didn't expect when I started reading this morning. One shows AI getting genuinely better at a hard medical task. The other two show where it still fails quietly — and confidently. Let me walk you through all three.
Today's stories
01 / 03

An AI That Reads Chest Scans Like a Doctor With a Checklist

What if instead of one AI reading your chest scan all at once, you gave it ten specialized tools and made it follow a doctor's checklist?

That is exactly what the team behind RadAgent built. The system starts with CT-Chat, an existing AI trained to read three-dimensional chest scans. But instead of having it take the whole scan in and produce a report in one go — like a student cramming and then writing an essay from memory — the researchers equipped it with ten specialized analysis tools and made it work through a clinician-reviewed diagnostic checklist before writing anything. Think of it as the difference between a chef who improvises from what they remember versus one who follows a recipe card and checks each step before moving on.

The training method, called reinforcement learning (a process of reward and penalty signals), gave RadAgent credit not just for a correct final answer, but for using the right tools in the right order and staying faithful to what was actually in the scan.

The results are real. Compared to the baseline CT-Chat model, RadAgent improved classification accuracy by 36% on one measure (macro-F1) and 20% on another (micro-F1). Under adversarial conditions — where the system is fed unusual or misleading inputs — performance improved by 42%. Most striking: the baseline scored zero on faithfulness, meaning none of its report content could be traced back to specific evidence in the scan. RadAgent scored 37%.

The catch: 37% faithful is better than zero, but it also means nearly two-thirds of report elements still can't be clearly tied to the scan. This was tested on internal datasets; real hospital environments are messier. The paper has not yet gone through formal peer review, and full statistical significance tests are not reported. A real step — but not a replacement radiologist.

Glossary
macro-F1: A way to score a classifier that weighs each category equally, regardless of how common it is — useful when some conditions are rare.
reinforcement learning: A training method where an AI is given rewards for good actions and penalties for bad ones, so it learns through trial and error rather than from labeled examples.
faithfulness: In this context, whether each claim in a generated report can be traced back to specific evidence in the input scan.
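The two accuracy figures above differ only in how the per-class scores are averaged. A minimal sketch of that difference, using made-up labels rather than anything from the paper:

```python
from collections import Counter

def f1_scores(y_true, y_pred, labels):
    """Per-class F1 averaged two ways: macro (each class equal) and micro (each example equal)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    per_class = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(per_class) / len(labels)
    # Micro-F1 pools the raw counts across classes before computing F1.
    t_tp, t_fp, t_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * t_tp / (2 * t_tp + t_fp + t_fn)
    return macro, micro

# Made-up example: "nodule" is rare, and this classifier never predicts it.
y_true = ["normal"] * 8 + ["nodule"] * 2
y_pred = ["normal"] * 10
macro, micro = f1_scores(y_true, y_pred, ["normal", "nodule"])
# macro is dragged down to ~0.44 by the missed rare class; micro stays at 0.80.
```

This is why a system can look strong on micro-F1 while quietly missing the rare findings that matter most in medicine — and why papers report both.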
02 / 03

AI Almost Never Says 'I Don't Know' — Even When It Should

Your AI assistant almost never says 'I'm not sure' — and a new benchmark shows just how costly that silence is.

When a doctor doesn't know something, they say so. When a good waiter doesn't know if the kitchen can make a dish, they go check. AI systems, almost uniformly, do neither. They guess — fluently, confidently, wrongly.

A research team built a benchmark called MM-AQA to test this directly. They took 2,079 visual questions from existing tests and deliberately transformed answerable ones into unanswerable ones: removing key evidence, degrading images, or adding misleading text alongside pictures. Then they tested three major AI vision-language systems and two multi-agent setups — where several AI instances debate before giving an answer.

The findings are uncomfortable. Under normal prompting, AI models almost never choose to abstain. A simple trick — asking the model to rate its own confidence before answering — beat the default behavior for knowing when to stay silent. Multi-agent systems did better at abstaining, but with a real cost: they also became more cautious on questions they should have answered. No system managed to exceed 65% accuracy on both tasks simultaneously. There is a genuine trade-off, and right now AI is on the wrong side of it.

The catch: MM-AQA is a controlled benchmark, not a real deployment. Deliberately unanswerable questions are a specific stress test, not the full messiness of real conversations. And abstaining on a multiple-choice question is simpler than knowing when to stop in a free-form legal or medical consultation. The gap between this benchmark and actual high-stakes use remains very large. But the pattern the team identified — models trying to reconcile clearly contradictory evidence into a confident, incorrect answer rather than abstaining — is worth keeping an eye on.

Glossary
abstention: When an AI system decides not to answer a question, rather than guessing.
multi-agent system: A setup where multiple AI instances each produce an answer or critique, then combine their outputs — similar to a panel of reviewers rather than a single judge.
vision-language model: An AI that can process both images and text together, not just one or the other.
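The confidence-rating trick from the story can be sketched as a simple threshold policy. Everything here is illustrative — made-up records and a hypothetical scoring loop, not MM-AQA's data or the paper's code:

```python
def evaluate(records, threshold):
    """Score a policy that abstains whenever self-rated confidence falls below a threshold.

    Each record is (answerable, correct_if_answered, confidence). We report the
    two numbers the trade-off is about: accuracy on answerable questions and
    abstention rate on unanswerable ones.
    """
    ans_right = ans_total = abstained = unans_total = 0
    for answerable, correct, conf in records:
        abstain = conf < threshold
        if answerable:
            ans_total += 1
            if not abstain and correct:
                ans_right += 1
        else:
            unans_total += 1
            if abstain:
                abstained += 1
    return ans_right / ans_total, abstained / unans_total

# Made-up model outputs: four answerable questions, three unanswerable ones.
records = [
    (True, True, 0.9), (True, True, 0.8), (True, False, 0.7), (True, True, 0.4),
    (False, False, 0.3), (False, False, 0.5), (False, False, 0.8),
]
acc, abst = evaluate(records, threshold=0.6)
```

With this threshold the policy catches two of the three unanswerable questions, but it also silences one question the model would have answered correctly — the same tension the multi-agent systems ran into.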
03 / 03

Humans Score 100%, Best AI Scores 60% on This Simple Spatial Puzzle

Imagine sitting at a round dinner table and being asked: if you moved to the chair directly across from you, what would be on your left?

Humans handle that question trivially. In the study, human participants scored 100% on a series of such problems — text-only descriptions of a scene, followed by one or more rotational steps, followed by a question about the resulting view. The best AI model tested, Qwen3-VL, scored about 60%. Every other model was lower. The researchers built a benchmark called VRUBench specifically for this. No images — just words.

The twist is that they also looked inside the models to understand why they fail, using a technique called layer-wise probing (running a simple detector through each layer of the AI to see what information it holds at each stage). Here is what they found. The models do encode the direction of a rotation — they 'know' they turned left. But they lose track of their starting orientation as that information travels through the deeper layers of the network. It is like following a map, correctly noting that you turned left twice, but forgetting which direction you were originally facing — so the final answer is wrong even though each individual step seemed fine.

One promising finding: the team identified a small set of attention heads (specific processing units inside the model) responsible for this failure. Fine-tuning only those heads improved spatial performance at half the compute cost of retraining the whole model, without hurting performance elsewhere.

The catch: VRUBench is text-only and deliberately simple. Real spatial tasks — robotic navigation, reading floor plans, playing strategic games — are far more complex. The gap is real, but we do not yet know whether it reflects a fixable training gap or something deeper about how these models represent space.

Glossary
layer-wise probing: A technique where researchers train a simple classifier at each layer of a neural network to detect what information that layer holds — like taking readings at each floor of a building.
attention heads: Specific processing units inside a transformer model that decide which parts of the input to focus on when producing each output.
chain-of-thought (CoT): A prompting method that asks an AI to reason step by step before giving a final answer, rather than answering immediately.
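Layer-wise probing amounts to fitting a small classifier on each layer's activations and watching where a piece of information survives. A toy sketch with synthetic activations — the signal decay is built in by hand to mimic the reported failure pattern, and nothing here comes from the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_accuracy(activations, labels):
    """Fit a least-squares linear probe and report how well it reads out the labels.

    A real probe would be scored on held-out data; this only shows the mechanic.
    """
    X = np.hstack([activations, np.ones((len(activations), 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(X, labels * 2.0 - 1.0, rcond=None)    # regress onto +/-1 targets
    return ((X @ w > 0) == labels.astype(bool)).mean()

# Synthetic "layers": the starting-orientation signal is strong early
# and fades with depth, mimicking the failure pattern in the story.
n, d = 200, 16
orientation = rng.integers(0, 2, n)              # which way you started facing
signal = orientation[:, None] * 2.0 - 1.0        # encode it as +/-1
layers = [signal * s + rng.normal(0.0, 1.0, (n, d))
          for s in (2.0, 1.0, 0.2, 0.05)]        # deeper layer = weaker signal
accs = [probe_accuracy(layer, orientation) for layer in layers]
# Probe accuracy decays from near-perfect in early layers toward chance in deep ones.
```

A probe that reads the starting orientation perfectly at layer 3 but only at chance by layer 30 is exactly the kind of evidence the VRUBench team used to localize the failure.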
The bigger picture

Put these three papers side by side and a single uncomfortable picture emerges. AI can be made significantly better at a hard task — reading chest scans — when you give it structure: a checklist, specialized tools, step-by-step accountability. That is genuinely useful progress. But the other two papers show what happens without that scaffolding: models that almost never admit uncertainty, and models that cannot reliably track where they are in space even when given only words to work with. These are not small edge-case failures. Knowing when you are wrong and understanding basic spatial perspective are things every competent adult does automatically. The lesson I take from today is not that AI is doomed, but that its current reliability is deeply context-dependent. Build the right structure around it and it performs. Remove that structure and it guesses — confidently, fluently, and often incorrectly. That gap between surface fluency and genuine understanding is the thing worth watching.

What to watch next

The RadAgent team will need to validate against prospective real-world CT data — that is the test that matters for clinical adoption, and it has not happened yet. On the abstention front, watch for benchmarks that move beyond multiple-choice into free-form settings, which is where the failure mode will be most consequential. The open question I would most want answered: can the spatial reasoning gap identified in VRUBench be closed by training on more spatially structured data, or does it require a different architecture altogether?

Thanks for reading — JB.
DeepScience — Cross-domain scientific intelligence
deepsci.io