DeepScience — Artificial Intelligence

DeepScience · Artificial Intelligence · Daily Digest

AI in clinics, drug labs, and memory games: what holds.

Today's AI research asks whether the tools we're deploying actually work — and the honest answer is complicated.

June 18, 2026

Three stories today, and they fit together in a way I find genuinely interesting. One is about AI answering real patient questions collected at a hospital department. One is about AI failing roughly four times in ten at drug-discovery tasks the industry needs solved. And one is about a deceptively simple card game that exposes a deep memory problem in the best models available. Dense day. Let's dig in.

Today's stories

01 / 03

An AI Answered Patient Questions at a Nuclear Medicine Clinic — Here's How It Did

Before a radioactive scan, patients have questions — and a study just tested whether ChatGPT answers them as well as a nuclear medicine doctor.

Picture the front desk of a specialist clinic. One side handles booking, prep instructions, and insurance paperwork; the other handles medical questions like 'what will this scan show?' or 'can I eat before it?' A team publishing in npj Digital Medicine ran a real-world test at a nuclear medicine department, the kind that uses radioactive tracers for imaging. They collected actual patient questions and answered each one in parallel: once by a human expert, once by ChatGPT v4.1. Independent raters then scored both answers across fifteen quality dimensions.

For administrative questions (scheduling, prep logistics, what to bring), the AI came out ahead by a wide margin. Non-expert raters found the AI responses more informative in 97% of cases and preferred them outright in 86% of cases.

On medical questions, the story split. In eight of ten quality dimensions, 76 to 98% of AI responses were rated equivalent to or better than the human answer. But on readability, human responses scored higher 62% of the time. Raters also disagreed more about which answer was better overall for medical queries: a split verdict, not a win.

There is a real catch here. This was one department and one AI model, and the study did not test whether the AI is safe to deploy unsupervised, only whether its answers were comparable in quality. Nuclear medicine involves radioactive materials and timing-sensitive protocols. A confident wrong answer about medication timing before a scan is not the same as a wrong answer about a hotel booking. The study itself does not claim readiness for deployment without oversight. What it does show, carefully and with real data, is that AI can reach the bar human experts set for informational queries. Reaching that bar is the news. What happens next is a different question.

Glossary

nuclear medicine — A medical specialty that uses small amounts of radioactive material to image organs and diagnose disease.

QUEST framework — A structured scoring system for evaluating the quality of health information responses across multiple dimensions like accuracy and completeness.

PABAK — A statistical measure of agreement between raters that adjusts for how common each answer category is, making it more reliable than raw agreement percentages.
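For readers who want the arithmetic behind PABAK: with k rating categories and an observed agreement proportion p_o between two raters, the standard formula is (k * p_o - 1) / (k - 1), which reduces to 2 * p_o - 1 when there are two categories. A minimal Python sketch follows. The function name, the rating labels, and the example ratings are invented for illustration and are not taken from the study.

```python
def pabak(ratings_a, ratings_b, num_categories):
    """Prevalence-adjusted bias-adjusted kappa for two raters.

    ratings_a and ratings_b hold one category label per rated item.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    if num_categories < 2:
        raise ValueError("need at least two rating categories")
    # Observed agreement: share of items where both raters chose the same label.
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
    p_o = agreements / len(ratings_a)
    return (num_categories * p_o - 1) / (num_categories - 1)


# Hypothetical preferences from two raters over ten answer pairs.
rater_1 = ["ai", "ai", "human", "ai", "tie", "ai", "human", "ai", "ai", "ai"]
rater_2 = ["ai", "ai", "human", "tie", "tie", "ai", "ai", "ai", "ai", "ai"]

# The raters agree on 8 of 10 items, so p_o = 0.8 and PABAK = (3 * 0.8 - 1) / 2.
print(round(pabak(rater_1, rater_2, num_categories=3), 2))  # 0.7
```

A value of 1 means perfect agreement, 0 means agreement no better than an even split across categories, and negative values mean the raters disagreed more often than that.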

Source: Real-world evaluation of large language model for patients medical and administrative queries in nuclear medicine

02 / 03

The Best AI Models Still Fail 40% of Drug Discovery Tasks

The best AI model on a new drug-discovery benchmark got 59% — which means it was wrong about four tasks in every ten.

Before a drug reaches a clinical trial, it goes through a long preclinical phase: researchers test whether the body absorbs it, whether it is toxic at certain doses, and whether it interacts with proteins it shouldn't. These are not open-ended creative tasks. They are structured, data-heavy decisions with clear right answers.

TxBench-PP, a new benchmark from a research team, gave AI agents 100 tasks drawn from exactly this kind of work, spanning eight stages of preclinical pharmacology. Each agent received the kind of files a real researcher would use: data tables, assay results, workflow documents. It had to return structured answers, which were graded automatically against known correct values. Sixteen different AI configurations were tested, generating 4,800 runs in total.

Think of it like handing someone the receipts, bank statements, and pay stubs they'd need to complete a complex tax return, then checking whether they filled in the right numbers. The question is not whether they understand tax law in the abstract. It is whether they can actually do the paperwork with the right data in front of them.

The best performer was Claude Opus 4.8, scoring 59.3%. The second-best, GPT-5.5, scored 55.3%. The weakest systems scored around 18 to 20%. No configuration reliably succeeded across all task types.

The catch is important: this benchmark is new, and its 100 tasks may not represent the full range of real preclinical work. The researchers took care to remove tasks solvable without actually engaging the data, which is methodologically honest. But 59% as a ceiling, for the best current model, on tasks with real data, in a structured setting, tells you something meaningful about where AI-assisted drug discovery actually stands right now.
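The 4,800-run total implies a repetition count that this summary does not spell out. A quick Python check of the arithmetic, assuming every configuration attempted every task the same number of times (an assumption, not something stated in this digest):

```python
configurations = 16
tasks = 100
total_runs = 4_800

# Assumption: runs are spread evenly over every configuration-task pair.
runs_per_pair, remainder = divmod(total_runs, configurations * tasks)
print(runs_per_pair, remainder)  # 3 0

# Converting the top score into "failures per ten tasks".
best_pass_rate = 0.593
failures_per_ten = (1 - best_pass_rate) * 10
print(round(failures_per_ten, 1))  # 4.1
```

Under that assumption each configuration saw each task three times, and the best score of 59.3% works out to just over four failures in every ten tasks.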

Glossary

preclinical pharmacology — The stage of drug development before human trials, where researchers study how a drug behaves in laboratory and animal models.

endpoint pass rate — The percentage of tasks where the AI's answer matched the correct answer within acceptable tolerances.

model-harness configuration — A pairing of a specific AI model with a specific software environment that controls how the model accesses tools and files.
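To make "matched within acceptable tolerances" concrete, here is a minimal Python sketch of tolerance-based grading. It is not the benchmark's grader: the 5% relative tolerance, the function names, and the example answers are all assumptions chosen for illustration.

```python
import math


def endpoint_passes(predicted, expected, rel_tol=0.05):
    """Return True if one predicted endpoint matches its reference value."""
    if isinstance(expected, str):
        # Categorical endpoints: compare text, ignoring case and padding.
        return (isinstance(predicted, str)
                and predicted.strip().lower() == expected.strip().lower())
    if isinstance(predicted, bool) or not isinstance(predicted, (int, float)):
        # Missing or non-numeric answers never pass a numeric endpoint.
        return False
    return math.isclose(predicted, expected, rel_tol=rel_tol)


def endpoint_pass_rate(graded_pairs, rel_tol=0.05):
    """Fraction of (predicted, expected) pairs that pass."""
    results = [endpoint_passes(p, e, rel_tol) for p, e in graded_pairs]
    return sum(results) / len(results)


# Hypothetical answers, not benchmark data.
example = [
    (12.1, 12.0),                # within 5% of the reference: pass
    (0.80, 1.00),                # 20% off: fail
    ("Inhibitor", "inhibitor"),  # same label: pass
    (None, 3.5),                 # no answer returned: fail
]
print(endpoint_pass_rate(example))  # 0.5
```

The tolerance matters: a tighter or looser `rel_tol` changes the pass rate, so scores are only comparable when they are graded under the same rules.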

Source: TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

03 / 03

AI Models Keep Forgetting What They Saw Earlier — and Now We Can Measure It

Researchers built a memory card game for AI — and the best models still take two and a half times as many moves as optimal, mostly because they forget what they already saw.

You know the card game where the deck is laid face-down, you flip two cards at a time, and you try to match pairs? The trick is remembering where the card you saw five turns ago is sitting. A research team built a controlled version of this game specifically to test AI models, and added a second game, a 3D maze, for good measure. The point was not to be fun. The point was to create a situation where you cannot succeed by reasoning cleverly about the current moment alone. You have to remember earlier observations. They called this benchmark RNG-Bench.

On the card game, the best model, Gemini-3.1-Pro, won all 16 head-to-head test rounds at the standard size. But the strongest models were taking on average 8 moves per matched pair. The mathematically optimal strategy takes 3.24 moves. That is roughly 2.5 times the optimal count (8 / 3.24 ≈ 2.47).

The team introduced a clever diagnostic they call the Memory Gap. They ran models once normally, then again with the hidden information revealed, essentially handing the model what it would otherwise need to remember. The difference in performance between the two conditions tells you how much of the error comes from forgetting and how much comes from bad decisions made even when the state of the game is known. Their finding: most of the gap is forgetting, not bad reasoning.

Why does this matter beyond a card game? Because real AI tasks (reading a long document, managing a multi-step project, navigating a conversation over many turns) are memory problems in exactly this sense. The benchmark gives researchers a clean, controlled way to measure something that has been frustratingly hard to isolate until now. That is a modest but real step.
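To get a feel for how much recall alone is worth in a pair-matching game, here is a toy Python simulation. It is an analogy, not the paper's method: it compares two scripted players (one that remembers every card it has seen, one that remembers nothing) rather than running a model with and without revealed information. The 8-pair board, the choice to count each card flip as one move, and the simple greedy strategy are all assumptions, and the greedy player is not claimed to be the mathematically optimal one.

```python
import random


def known_pair(seen):
    """Return two remembered positions holding the same value, or None."""
    by_value = {}
    for pos, value in seen.items():
        if value in by_value:
            return by_value[value], pos
        by_value[value] = pos
    return None


def play(num_pairs, remember, rng):
    """Play one game and return the number of card flips used."""
    cards = list(range(num_pairs)) * 2
    rng.shuffle(cards)
    remaining = set(range(len(cards)))  # positions not yet matched
    seen = {}                           # position -> value the player recalls
    flips = 0
    while remaining:
        pair = known_pair(seen)
        if pair is not None:
            first, second = pair        # both locations are remembered
        else:
            unseen = sorted(p for p in remaining if p not in seen)
            first = rng.choice(unseen)
            # After flipping the first card, look for its twin in memory.
            partner = next((p for p, v in seen.items() if v == cards[first]), None)
            if partner is not None:
                second = partner
            else:
                second = rng.choice([p for p in unseen if p != first])
        flips += 2
        if cards[first] == cards[second]:
            remaining -= {first, second}
            seen.pop(first, None)
            seen.pop(second, None)
        elif remember:
            seen[first] = cards[first]
            seen[second] = cards[second]
    return flips


def mean_flips_per_pair(num_pairs, remember, trials=2000, seed=0):
    """Average flips per matched pair over many shuffled games."""
    rng = random.Random(seed)
    total = sum(play(num_pairs, remember, rng) for _ in range(trials))
    return total / (trials * num_pairs)


with_memory = mean_flips_per_pair(8, remember=True)
without_memory = mean_flips_per_pair(8, remember=False)
print(f"perfect recall: {with_memory:.2f} flips per pair")
print(f"no recall:      {without_memory:.2f} flips per pair")
print(f"gap:            {without_memory - with_memory:.2f} flips per pair")
```

The difference between the two averages plays the same role as the Memory Gap in spirit: both players follow the same rules, so whatever separates them is attributable to remembering earlier observations.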

Glossary

non-Markov game — A game where the right move depends on the full history of what has happened, not just the current state — you can't win by only looking at what's in front of you right now.

Memory Gap — The performance difference between an AI playing normally and the same AI given access to all the information it should have remembered — a direct measure of how much forgetting is costing it.

multimodal large language model (MLLM) — An AI system that processes both text and images, allowing it to answer questions about what it sees as well as what it reads.

Source: Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

The bigger picture

Read these three together and a pattern emerges that I think is worth naming directly. We are deploying AI in clinical settings (hospital front desks, administrative workflows, diagnostic support), and in those settings the tools are genuinely useful on informational tasks. That is real. But the moment the task requires precise reasoning over structured data, as in drug discovery, the best available models are still wrong roughly four times in ten. And cutting across both of those domains is the memory problem that RNG-Bench makes visible: AI systems that need to track what happened earlier in a long sequence (a conversation, a document, a lab protocol) are losing performance not because they reason badly but because they forget. That is not a minor footnote. It is a constraint that both drug discovery and clinical AI are likely to run into once tasks get long enough. The field is not uniformly progressing. Answering informational queries is ahead. Structured expert reasoning and long-horizon memory are not.

What to watch next

The nuclear medicine paper was published in npj Digital Medicine, which suggests it went through peer review, but it tested only one model, ChatGPT v4.1; watch whether similar real-world evaluations appear for radiology and pathology departments over the next quarter. For TxBench-PP, the honest next question is whether labs building AI-assisted drug discovery pipelines will run their internal tools against it publicly. That would tell us far more than any academic comparison. And for the Memory Gap work in RNG-Bench, I'd want to see whether the fine-tuning experiment on Qwen3.5-9B actually holds up when tested by independent teams rather than the benchmark's own authors.