DeepScience · Artificial Intelligence · Daily Digest

When AI Helps Doctors, Misses Tumors, and Fools Our Brains

Today's AI research asks whether the tools we trust most are quietly making us worse at our jobs.
May 09, 2026
Three stories today, and honestly they form an unusually coherent set — each one pokes at the same uncomfortable question from a different angle: when AI gets good enough to help with something serious, what breaks? Let me walk you through a cancer staging test, a human attention trap, and a philosopher's argument that AI doesn't actually reason at all.
Today's stories
01 / 03

Two AI chatbots were tested on real cancer patients — results are complicated

Could a chatbot read a patient's chart and correctly stage their cancer three times out of four?

A team at Zonguldak Bulent Ecevit University in Turkey fed clinical data from 180 real head and neck cancer patients — the kind of cancer that affects your throat, tongue, jaw, and sinuses — into both ChatGPT-4o and Gemini 1.5 Pro. Each AI was asked to do two things: stage the cancer (classify how far it has spread using the AJCC staging rulebook, the standard system oncologists worldwide follow) and then suggest a treatment plan. The AI answers were checked against both that rulebook and a panel of human cancer specialists.

Think of cancer staging as sorting parcels by zip code. Most of the time the addresses are clear. Both AIs got the right zip code roughly 75% of the time — three patients out of four. For a general-purpose chatbot, that's not nothing. The two models were statistically tied on staging. But treatment planning is more like actually routing the delivery truck, and here Gemini pulled ahead: 78.9% accuracy versus ChatGPT's 71.7%. That gap held up under statistical testing, so it's not random noise.

The catch — and it's a real one — is that both models failed on roughly one patient in four. ChatGPT's errors weren't random either: it specifically stumbled on anatomically complex regions like the oropharynx (the back wall of your throat) and paranasal sinuses (hollow spaces around your nose). Those are exactly the spots where a wrong stage can mean the difference between radiation and surgery. No one is handing either model a prescription pad. But in clinics where access to specialists is limited, a 75% floor is a real starting point — not a destination.
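Why does a seven-point gap on just 180 patients survive statistical testing? Because both models planned treatment for the same patients, the comparison is paired, and the textbook tool for paired accuracy is McNemar's test, which looks only at the cases where the two models disagree. The sketch below is my reconstruction, not the paper's analysis: only the headline percentages are public, so the discordant split (18 patients where only Gemini was right versus 5 where only ChatGPT was) is a hypothetical choice that reproduces the published gap of roughly 13 patients.

```python
# Minimal sketch of the paired comparison. The discordant counts are
# HYPOTHETICAL: the paper reports 78.9% vs 71.7% treatment-planning
# accuracy on the same 180 patients, not the per-patient breakdown.
from statsmodels.stats.contingency_tables import mcnemar

# Rows: Gemini right / wrong; columns: ChatGPT right / wrong.
# Row sums give Gemini 142/180 (78.9%); column sums give ChatGPT 129/180 (71.7%).
table = [[124, 18],   # both right | only Gemini right
         [5,   33]]   # only ChatGPT right | both wrong
res = mcnemar(table, exact=True)  # exact binomial test on the 23 discordant pairs
print(f"p = {res.pvalue:.4f}")    # ~0.011 under this assumed split
```

A more even split of the same 23 disagreements would not reach significance, which is why the per-patient breakdown carries more information than the headline percentages.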

Glossary
TNM staging: A standardised system that classifies cancer by Tumour size, whether it has reached nearby lymph Nodes, and whether it has spread (Metastasised) to distant organs.
AJCC 8th Edition: The current official rulebook, published by the American Joint Committee on Cancer, that defines how tumours are staged and is used worldwide as a clinical benchmark.
oropharynx: The middle part of the throat, behind the mouth, including the soft palate, back of the tongue, and tonsils — an area with complex anatomy that is harder to stage correctly.
02 / 03

When AI is too reliable, humans quietly stop paying attention

What if the most dangerous version of AI isn't the one that fails constantly, but the one that almost never does?

This paper — an experimental study that hasn't yet received a DOI, so treat it as work in progress — set up a simulated task where people had to identify phishing emails, those fake messages designed to steal your passwords or money, with the help of an AI assistant. The AI was tuned to perform very well. That's the trap. The researchers wanted to test something psychologists call automation complacency — the gradual erosion of your own vigilance when a tool is consistently right. Think of it like a passenger who stops glancing at the road once the GPS has given perfect directions for a thousand trips. The skill of reading the road quietly atrophies.

The finding is that sustained exposure to a highly accurate AI shifts how people think. You move from what psychologists call System 2 processing — slow, deliberate, effortful checking — toward System 1, which is fast and heuristic: 'the AI flagged it, so it's fine.' The trouble is that when the AI eventually makes a mistake (and it will), the human watching it has by then lost the sharpness to catch the error. This is the paradox the title names: reliability, pushed far enough, becomes a liability.

The honest caveat is that I don't have the full paper in front of me — the experimental numbers weren't available in the data I received, so I can't tell you the effect size. The concept itself is well-documented in prior safety research on aviation and nuclear plant operators. Whether this study confirms it cleanly for AI-assisted email tasks specifically is something we'll need the published version to settle. Keep this one in your 'watch' pile.
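Since the study's numbers aren't available, here is a toy simulation of the mechanism itself, entirely my construction rather than the paper's protocol: every correct AI call erodes the reviewer's vigilance slightly, and catching an AI mistake restores it. The function name and the decay parameters are invented for illustration.

```python
# Toy model of automation complacency (illustrative; NOT the study's design).
# Assumption: vigilance erodes with each correct AI call and resets when the
# reviewer catches an AI error; an uncaught error leaves vigilance degraded.
import random

def catch_rate(ai_accuracy, trials=100_000, v0=0.9, decay=0.995, v_min=0.2, seed=0):
    rng = random.Random(seed)
    vigilance, caught, errors = v0, 0, 0
    for _ in range(trials):
        if rng.random() < ai_accuracy:
            vigilance = max(v_min, vigilance * decay)  # streak of correct calls: attention drifts
        else:
            errors += 1
            if rng.random() < vigilance:               # does the human still catch it?
                caught += 1
                vigilance = v0                         # a caught error restores alertness
    return caught / errors

for acc in (0.90, 0.99):
    print(f"AI accuracy {acc:.0%}: reviewer catches {catch_rate(acc):.0%} of AI errors")
```

The specific numbers are arbitrary, but the shape is the point: the 99%-accurate assistant produces longer streaks of correctness, so vigilance is at its lowest precisely when the rare error arrives, and the catch rate falls below that of the less reliable assistant.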

Glossary
automation complacency: The tendency for human operators to stop actively monitoring a system after it has been consistently reliable, leaving them slower to catch the rare error.
System 1 / System 2 processing: Psychologist Daniel Kahneman's shorthand for fast, instinctive thinking (System 1) versus slow, deliberate reasoning (System 2) — two modes we shift between depending on how much effort a task seems to demand.
human-in-the-loop: A design where a human is kept in the decision chain alongside an AI, expected to check or override its outputs rather than simply accepting them.
Source: The Paradox of Perfection: Hidden Risks of High-Performing AI in Human-in-the-Loop Governance
03 / 03

A philosopher argues AI systems cannot actually reason — and explains why

Knowing every recipe ever written is not the same as understanding why heat turns raw dough into bread.

Published in Frontiers in Artificial Intelligence, this paper is philosophy rather than engineering — no data, no training runs, no benchmark scores. The author's argument is worth understanding anyway, because it cuts to a question everyone who uses ChatGPT eventually asks: is it actually thinking, or is something else going on?

The core claim, rooted in philosopher Robert Brandom's theory of meaning, is that genuine reasoning is not pattern-matching. It requires what the author calls reason relations — the ability to understand *why* one statement follows from another, not just that it tends to follow in a large corpus of text. A student who has memorised every past exam answer can score brilliantly on familiar questions. Put a novel problem in front of them and the machinery breaks, because there is no underlying structure to transfer — only stored outputs. The author argues that pure neural systems like today's large language models are, in principle, doing what the memorising student does: very sophisticated pattern-matching over vast amounts of text. They are not following chains of inference; they are reproducing the shape of chains of inference.

Pure symbolic systems — old-school logic engines — fare no better, because humans themselves cannot fully formalise the messiness of natural language into clean rules. The cautiously hopeful conclusion is that neuro-symbolic systems, hybrids that combine learned patterns with explicit logical structure, might be the path to something closer to genuine reasoning.

The catch is large: this is a philosophical argument, not an experiment. It cannot be falsified by a benchmark score. Plenty of researchers would contest the definition of 'reasoning' the author uses. Think of it as a sharp map of what the debate is really about, not a measurement of where we are.

Glossary
reason relations: In inferentialist philosophy, the structural links between statements that make one a genuine justification for another — distinct from mere statistical co-occurrence.
neuro-symbolic NLI: A family of AI approaches that combine neural networks (which learn from data) with symbolic logic (which follows explicit rules) to handle language understanding and inference.
inferentialism: A philosophical theory, developed largely by Robert Brandom, that the meaning of a statement is defined by what it follows from and what follows from it — its role in reasoning, not its correspondence to objects.
The bigger picture

Put these three stories side by side and a single thread runs through them. Story one shows AI performing well enough on a high-stakes medical task to be genuinely useful — but failing in exactly the anatomically awkward corners where a human expert earns their salary. Story two shows that once AI gets reliable enough, the human in the loop starts to switch off — which means the failure mode is no longer just 'AI is wrong' but 'AI is right so often that nobody catches it when it's wrong.' Story three asks whether the whole enterprise is missing something structural: that even a highly accurate language model is doing something categorically different from the deliberate, justified inference we call reasoning. Taken together, this is not a picture of AI being overhyped or underpowered. It is a picture of a technology that is genuinely capable enough to be deployed in serious settings, but whose failure modes are subtle, intermittent, and partly created by its own success. That is a harder problem to solve than simple inaccuracy.

What to watch next

The cancer staging paper will likely attract follow-up studies with larger multi-institution cohorts — the 180-patient single-centre design is a known limitation the authors flagged, and replication in diverse hospital settings is the obvious next step. On the automation complacency front, watch for the paper to pick up a DOI and a full data release; the experimental results will either sharpen or soften the warning considerably. The open question I'd most want answered: does brief, structured re-training of human reviewers restore vigilance, or does complacency return as soon as the AI resumes performing well?

Thanks for reading — enjoy your weekend. JB.
DeepScience — Cross-domain scientific intelligence
deepsci.io