

DeepScience · Artificial Intelligence · Daily Digest

AI Reads Chest Scans Better, But Still Can't Turn Around

Three studies today show AI making real progress in medicine — while exposing two stubborn blind spots you should know about.
April 18, 2026
Happy Friday. Today's batch is 221 papers deep, and I pulled three that each tell you something concrete and slightly uncomfortable about where AI actually is right now. One is genuinely good news for medicine. Two are honest reality checks about how these systems reason — or fail to. Let's get into it.
Today's stories
01 / 03

An AI That Reads Chest Scans Like a Careful Radiologist Would

What if an AI read your chest scan the way a meticulous doctor does — checklist in hand, specialized tools at the ready, revising as it goes?

That is more or less what a team building RadAgent set out to do. The system starts with a rough draft produced by an earlier AI called CT-Chat, then improves it step by step, the way a cook doesn't just glance at ingredients and guess — they follow a recipe, use the right knife for each cut, and taste before plating. RadAgent has access to ten specialized tools: one for measuring lung nodules, one for spotting fluid, one for checking the heart, and so on. It is trained using reinforcement learning — meaning it learned through trial and reward — to call those tools in a sensible sequence, check a clinician-reviewed diagnostic checklist, and only then produce a final report.

The result is measurable. Compared to CT-Chat working alone, RadAgent improved macro-F1 — a measure of how well it catches different pathology types — by 6 percentage points, which works out to a 36% relative improvement. When the researchers deliberately fed it corrupted or misleading inputs (an 'adversarial' test), its robustness improved by nearly 25 points. Perhaps most strikingly, it scored 37% on 'faithfulness' — meaning its report actually reflected what its tools found — while CT-Chat scored zero.

Now, the catches. Both test datasets are focused on chest pathology; we have no idea how this holds up on rare diseases or edge cases. A 37% faithfulness score is better than nothing, but it also means nearly two-thirds of reports still contain claims that drift from the underlying evidence. This is not a replacement for a radiologist. It is a more credible first-pass tool than what existed before.

Glossary
macro-F1: A score that measures how well a classifier performs across all disease categories equally, giving the same weight to rare and common findings alike.
reinforcement learning: A training approach where an AI improves by receiving rewards for good actions and penalties for bad ones, rather than being shown correct answers directly.
faithfulness: In this context, whether the AI's written report actually reflects the findings produced by its own tools — a measure of internal consistency.
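Macro-F1 is simple enough to compute by hand. Here is a minimal sketch in plain Python, using made-up chest-finding labels rather than the paper's data, to show why a rare class like 'effusion' drags the macro average down just as hard as a common one:

```python
def macro_f1(y_true, y_pred):
    """Per-class F1 averaged with equal weight, so rare and common
    findings count the same (unlike accuracy, which a model can game
    by always predicting the majority class)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels: "normal" dominates; the one missed "effusion" case
# zeroes out an entire class and pulls the macro average down hard.
y_true = ["normal", "normal", "normal", "nodule", "effusion"]
y_pred = ["normal", "normal", "nodule", "nodule", "normal"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.444
```

One detail worth noting: a 6-percentage-point gain being a 36% relative improvement implies a fairly low baseline (roughly 0.17 macro-F1), which is common when many pathology classes are rare.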
02 / 03

AI Models Lock In Their First Guess and Rarely Change Their Mind

Show an AI a picture with a clear answer, add one misleading sentence of text, and watch it abandon the picture.

A team tested 18 vision-language models — AIs that can process both images and text — across three benchmarks involving math, physics, and science questions. They tracked each model's confidence step by step as it worked through its reasoning out loud (a technique called Chain-of-Thought, where the model narrates its thinking before answering). What they found is something you might recognize from bad meetings: the models committed to an answer early and then spent the rest of their 'reasoning' reinforcing that first guess rather than genuinely reconsidering it. The researchers call this 'answer inertia.' It is as if a plumber glances at a pipe, decides it is the hot water line, and then — even as cold water keeps running out — finds reasons why that still must be the hot line.

Models trained specifically to reason (rather than just answer) showed more ability to correct themselves, but this improvement broke down when the visual evidence was harder to read. More alarming: when the researchers planted a misleading text cue that contradicted the visual evidence, all 18 models were consistently steered toward the wrong answer — even when the image alone contained everything needed to answer correctly.

And here is the uncomfortable part about monitoring: Chain-of-Thought traces looked thoughtful and grounded, but they were actually following the misleading text. The reasoning log appeared reliable while being wrong. So the thing we use to check whether AI is reasoning well — making it explain itself — turns out to be only a partial window. Honestly, nobody has a clean solution to this yet.

Glossary
Chain-of-Thought (CoT): A technique where an AI model writes out its reasoning steps before giving a final answer, intended to improve accuracy and make thinking visible.
vision-language model (VLM): An AI that processes both images and text together, so it can answer questions about pictures or documents.
modality: In AI, a type of input — text, image, audio, and so on are different modalities.
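Answer inertia can be pictured with a toy metric: the fraction of traces whose final answer matches the model's very first guess. This is an illustrative sketch, not the paper's actual protocol; the trace format and the `answer_inertia` helper are invented for the example:

```python
def answer_inertia(traces):
    """Fraction of reasoning traces whose final answer equals the first
    answer the model committed to. High values mean the intermediate
    'reasoning' steps rarely changed the outcome.

    Each trace is a list of the model's answer candidate at each
    reasoning step (a hypothetical logging format)."""
    locked = sum(1 for steps in traces if steps[0] == steps[-1])
    return locked / len(traces)

# Toy traces: step-by-step answer candidates for four questions.
traces = [
    ["B", "B", "B", "B"],  # committed early, never moved
    ["A", "A", "C", "A"],  # wobbled, then returned to the first guess
    ["D", "B", "B", "B"],  # genuinely revised
    ["C", "C", "C", "C"],
]
print(answer_inertia(traces))  # 0.75 -- three of four end where they started
```

The study's more troubling finding is not captured by a metric like this: the traces *read* as if revision happened even when it did not, which is exactly why log-reading alone is a partial window.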
03 / 03

Humans Score 100%, Best AI Scores 60% at Imagining a Turn in Space

If you stood in a room, turned 90 degrees, and I asked what you'd now be facing — you'd know instantly. The best AI gets it right only 60% of the time.

Researchers built a text-only benchmark called VRUBench. No images. Just descriptions of a space and a sequence of turns — rotate 90 degrees left, now 180 degrees right — followed by a question: what are you facing now? Humans nail this at 100%. The best model tested, Qwen3-VL, reached about 60%. Most others did worse.

What makes this study more than just a benchmark score is the detective work the researchers did afterward. They looked inside the models — layer by layer — to ask why. They found that the models actually encode information about rotation direction and angle reasonably well in their middle layers. They know a 90-degree left turn happened. The breakdown occurs later: in the final layers, where the model has to bind that rotation to a specific position in space and produce an answer, something goes wrong. The researchers call this 'hallucination in the final layers' — the model has the ingredients but scrambles them at the last step, like knowing you need to add salt after the pasta is cooked but pouring it into the washing-up water instead.

Interestingly, asking models to think out loud (Chain-of-Thought) actually helped here — unlike most spatial tasks where it makes no difference. And the team found that fine-tuning just a small set of identified attention heads — the specific parts of the network responsible for the final answer step — recovered most of the performance gain at half the computational cost of retraining everything. I simplified the interpretability method here; the real technique (path patching) is more involved. But the core finding stands: the knowledge is in there, and it still comes out wrong.

Glossary
attention head: A sub-component inside a transformer model that learns to focus on specific relationships between words or tokens when computing its output.
path patching: An interpretability technique that identifies which specific parts of a neural network are causally responsible for a particular output, by selectively replacing activations.
VRUBench: Viewpoint Rotation Understanding Benchmark — a text-only test requiring models to predict what they'd observe after rotating in a described space.
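Part of what makes the 60% score striking is that the task itself is just modular arithmetic on a heading. A minimal sketch of that bookkeeping (the `final_heading` helper and the compass encoding are illustrative, not taken from the benchmark):

```python
def final_heading(start, turns):
    """Track a heading in degrees clockwise from north through a
    sequence of ('left' | 'right', degrees) turns, wrapping at 360.
    This is the bookkeeping VRUBench asks models to do in plain text."""
    heading = start
    for direction, degrees in turns:
        heading += degrees if direction == "right" else -degrees
    return heading % 360

COMPASS = {0: "north", 90: "east", 180: "south", 270: "west"}

# The example from the story: start facing north, rotate 90 degrees
# left, then 180 degrees right.
h = final_heading(0, [("left", 90), ("right", 180)])
print(COMPASS[h])  # east: 0 - 90 + 180 = 90
```

A human tracks this implicitly; the interpretability result says the model's middle layers track it too — the answer just gets scrambled on the way out.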
The bigger picture

Take these three studies together and a pattern emerges that I think is worth naming directly. We are getting genuinely better at building AI that follows structured procedures — RadAgent's checklist-driven, tool-using approach is a real step forward, and it works precisely because someone designed a process rather than just scaling a model. But the other two studies are a useful corrective to any enthusiasm. The core reasoning loop in today's best models is unreliable in ways that are hard to see from the outside: they commit early, they follow misleading text over clear visual evidence, and they fail at spatial problems humans find trivial — even when the relevant information is already encoded inside them. The common thread is that fluency and correctness are not the same thing, and our main tool for checking AI reasoning (asking it to explain itself) provides only partial visibility. The honest takeaway: structured, tool-using systems are outperforming end-to-end generation for high-stakes tasks. That is a design lesson, not a model-scaling lesson.

What to watch next

The RadAgent team evaluated on two chest CT datasets; the question worth watching is whether an independent hospital group replicates the accuracy and faithfulness numbers in a live clinical workflow — that gap between benchmark and bedside is where most medical AI stumbles. On the reasoning side, the answer-inertia finding applies to 18 models tested today; it would be worth seeing whether reasoning models released in the next few months (several are expected from major labs before summer) show any structural improvement, or whether the pattern holds.

Thanks for reading — see you Monday. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io