

DeepScience · Artificial Intelligence · Daily Digest

AI Hallucinates Citations, Forgets How to Cooperate, and Fails Doctors

Three new studies show AI reliability cracking under real-world pressure — in science, in teamwork, and in medicine.
May 11, 2026
Hi — today's batch is genuinely unsettling, in the best possible way. I spent this morning working through three papers that all ask the same question from different angles: can we actually trust AI systems when the stakes are real? The answer, right now, is a qualified no. Let me walk you through why.
Today's stories
01 / 03

AI Is Quietly Littering Science With Fake References

Somewhere in the scientific literature right now, roughly 150,000 citations point to papers that simply do not exist.

Imagine a library where a small but growing percentage of the cards in the catalog list books that were never written. You pull the card, you write down the title, you cite it in your own work — and the book isn't there. That is what is happening to scientific publishing right now.

A team audited 111 million references across 2.5 million papers from arXiv, bioRxiv, SSRN, and PubMed Central, covering 2020 through 2025. Their method: match every cited reference against two large databases, flag the ones that match nothing, and compare the rate before and after ChatGPT's public release in late 2022. The pattern is unmistakable. Unmatched citation rates were flat through 2022, then climbed sharply from 2023 onward, with the steepest rise starting around mid-2024. Their conservative estimate: 146,932 hallucinated citations produced in 2025 alone. SSRN — a repository popular for economics and social science preprints — had the worst rate at 1.91%. The fabricated references also disproportionately credit already-prominent, male scholars, which means they are quietly distorting whose work appears influential.

Why does this matter beyond academic tidiness? Because scientific papers build on each other. If a cited source doesn't exist, any claim traced back to it is floating in air. And the scale here — nearly 150,000 ghost citations in a single year — is large enough to corrupt literature reviews, systematic analyses, and policy documents that rely on them.

The catch: these are estimates, not exact counts. The pipeline the researchers used matches about 95% of references reliably; the remaining fraction is messy. The true number could be somewhat higher or lower. And the study can't tell us whether authors knowingly used AI-generated references or simply failed to check what their tools produced. Either way, the trend is real and it is accelerating.
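To make the audit logic concrete, here is a minimal sketch in Python, assuming a hypothetical pre-built index of known titles. The study matched references against two large bibliographic databases with fuzzier matching than the exact lookup shown here; everything in this snippet is illustrative, not the researchers' pipeline.

```python
from collections import defaultdict

def unmatched_rate_by_year(references, known_titles):
    """references: iterable of (publication_year, cited_title) pairs.
    Returns the fraction of citations per year that match nothing."""
    totals, unmatched = defaultdict(int), defaultdict(int)
    for year, title in references:
        totals[year] += 1
        if title.strip().lower() not in known_titles:
            unmatched[year] += 1  # candidate hallucinated citation
    return {y: unmatched[y] / totals[y] for y in sorted(totals)}

# Toy usage: the before/after-2022 comparison is what the study runs at scale.
known = {"attention is all you need"}
refs = [
    (2021, "Attention Is All You Need"),
    (2024, "Attention Is All You Need"),
    (2024, "A Survey of Imaginary Transformers"),  # matches no database
]
print(unmatched_rate_by_year(refs, known))  # {2021: 0.0, 2024: 0.5}
```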

Glossary
hallucinated citation: A reference to a paper, book, or article that does not actually exist, generated by a language model presenting it as real.
unmatched citation rate: The proportion of references in a paper that cannot be found in any academic database, used here as a proxy for fabricated citations.
02 / 03

The More an AI Remembers Past Conflicts, the Worse It Cooperates

Give an AI agent a longer memory of past interactions and it becomes a worse collaborator — the exact opposite of what you'd expect.

Picture two neighbours who need to share a parking spot. Early on, they cooperate easily. But give them perfect recall of every minor slight from the past five years, and suddenly every small decision becomes a grudge match. You might think the opposite logic applies to AI agents — that more memory means smarter, more calibrated decisions. A large study suggests it doesn't.

Researchers tested 7 different large language models playing 4 classic cooperation games — think Prisoner's Dilemma-style scenarios where two players each decide whether to work together or defect for personal gain. They ran each combination with 9 different memory lengths, 500 rounds each, generating 378,000 reasoning traces that they then analysed word by word. In 18 of the 28 model-game combinations, cooperation declined as memory grew. They call this the memory curse.

The mechanism is subtle and important. It is not that agents become more paranoid or suspicious — the word analysis didn't find rising distrust. Instead, agents stopped thinking about what comes next. They started reasoning backwards through grudges rather than forwards toward shared outcomes. When researchers swapped real memory for a fake history of cooperative interactions — keeping the text length identical — cooperation snapped back. That tells you it is the content of the memory doing the damage, not just information overload. They also found a fix: fine-tuning models on reasoning traces that emphasise forward-looking thinking reduced the curse without hurting performance on unrelated tasks.

The catch: all agent pairs in this study were homogeneous — the same model talking to itself. Real-world deployments mix different models, which could behave very differently. And the games, while well-studied, are still simplified versions of real coordination problems.
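A toy version of that experimental loop, not the paper's code: an iterated Prisoner's Dilemma-style game where each agent sees only its last memory_len rounds. The real study queried LLMs for every move; here a hand-written "grudge" policy stands in so the protocol is runnable. With this stub, a longer memory window keeps old slights in view, so defection locks in sooner — the same direction as the paper's finding, demonstrated with a deliberately crude mechanism.

```python
import random

def grudge_policy(history, noise=0.05):
    """Stub agent: defects if any remembered opponent move was a
    defection; occasionally slips at random, seeding a grudge."""
    if random.random() < noise:
        return "D"
    return "D" if any(opp == "D" for _, opp in history) else "C"

def cooperation_rate(memory_len, rounds=500):
    hist_a, hist_b, coop = [], [], 0
    for _ in range(rounds):
        a = grudge_policy(hist_a[-memory_len:])  # truncated memory
        b = grudge_policy(hist_b[-memory_len:])
        coop += (a == "C") + (b == "C")
        hist_a.append((a, b))  # (my move, opponent's move)
        hist_b.append((b, a))
    return coop / (2 * rounds)

random.seed(0)
for m in (1, 10, 100):  # longer memory -> more remembered slights
    print(f"memory={m:3d}  cooperation={cooperation_rate(m):.2f}")
```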

Glossary
social dilemma game: A controlled scenario where two or more players must choose between cooperation (which benefits everyone) and defection (which benefits only the individual), used to study strategic behaviour.
LoRA fine-tuning: A technique for adjusting a language model's behaviour by training a small add-on layer rather than retraining the entire model — cheaper and faster than full retraining.
chain-of-thought reasoning: A method where an AI model writes out its step-by-step reasoning before giving an answer, intended to improve accuracy.
03 / 03

Medical AI Confidently Answers Questions Even When the Evidence Is Wrong

Show a medical AI a deliberately mislabelled chest X-ray and it will very likely give you a fluent, confident, wrong answer.

A good doctor, handed a scan that doesn't match the patient's symptoms, should stop and say: something here doesn't add up. That pause — the ability to notice broken evidence — is one of the things that separates careful diagnosis from dangerous overconfidence. A team of researchers built a benchmark specifically to test whether medical AI systems have learned that pause. Most of them haven't.

The benchmark, called MedVIGIL, draws on 300 real medical cases from four public datasets — chest X-rays, radiology questions, medical images — and then deliberately breaks them in eight different ways. Some perturbations are textual: the question contains a false premise, or a subtle wording change. Others are visual: the region of interest in the image is masked out, or flipped. The question is not whether the AI gets the right answer. It is whether the AI notices that something is wrong and declines to answer, rather than forging ahead.

Four board-certified radiologists supervised the construction and one served as the human baseline. That radiologist scored 83.3 on the benchmark's composite scale and had a silent-failure rate of 5.8% — meaning about 1 in 17 times, they also missed that the evidence was broken. The best AI model tested, Claude Opus 4.7, scored 69.2. GPT-4o fell to 44.1 on the safety dimension alone. Some models designed specifically for safety scored below 7 out of 100 on the rule-recognition part of the test — close to random.

The catch: 300 cases is a small sample, even with 2,556 derived test probes. And MedVIGIL tests a specific failure mode — broken evidence — not overall diagnostic accuracy. A model could do well here and poorly elsewhere. Still, a 14-point gap between the best AI and a single human reference is not a rounding error.
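Here is a minimal sketch of the evaluation idea, not the MedVIGIL code itself: perturb a case so its evidence is broken, then score a silent failure whenever the model answers confidently instead of flagging the problem. The keyword-based refusal check and the stub model below are crude stand-ins (the real benchmark uses eight perturbation types and radiologist-validated cases), and every name here is illustrative.

```python
REFUSAL_MARKERS = ("doesn't add up", "cannot answer", "no evidence",
                   "inconsistent", "mismatch")

def false_premise_probe(case):
    """One perturbation in the spirit of the benchmark: ask about a
    finding that the unchanged image does not contain."""
    return {**case, "question": "How severe is the pneumonia?"}

def is_silent_failure(answer):
    """A confident answer to a broken probe counts as a silent
    failure; flagging the mismatch counts as a pass."""
    return not any(m in answer.lower() for m in REFUSAL_MARKERS)

def silent_failure_rate(cases, model):
    probes = [false_premise_probe(c) for c in cases]
    return sum(is_silent_failure(model(p)) for p in probes) / len(probes)

# Toy usage with a stub "model" that always answers confidently:
def overconfident(probe):
    return "There is moderate bilateral pneumonia."

cases = [{"image": "cxr_001.png", "finding": "no pneumonia"}]
print(silent_failure_rate(cases, overconfident))  # 1.0
```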

Glossary
silent failure: When a model gives a confident, fluent-sounding answer despite the evidence being incomplete, contradictory, or deliberately wrong — failing without any visible warning sign.
vision-language model (VLM): An AI system that processes both images and text together, used in medical imaging to interpret scans alongside written descriptions or questions.
false premise: A question built on an incorrect assumption — for example, asking 'how severe is the pneumonia?' about a scan that shows no pneumonia.
The bigger picture

Look at what these three papers are each measuring. One counts fabricated citations spreading through published science. One watches AI agents stop cooperating as their memory grows. One shows medical AI failing to notice when it has been handed a trap. They are measuring different things in different domains — but they are all pointing at the same underlying gap. AI systems are very good at producing fluent, confident output. They are much less good at knowing when to stop, flag uncertainty, or say 'something about this situation is off.' The hallucination-in-citations problem is fluency without accuracy checking. The memory curse is deliberation without forward judgment. The MedVIGIL result is pattern-matching without epistemic caution. This is the reliability problem in AI, stated plainly: the systems are impressive at generating answers, and genuinely weak at recognising when an answer shouldn't be given. That is a solvable engineering problem. But right now, it is not yet solved.

What to watch next

The MedVIGIL team has released their benchmark publicly, so watch for follow-up evaluations as newer model versions come out — Claude, GPT, and Gemini all update frequently enough that scores could shift meaningfully within months. On the citation-hallucination front, keep an eye on whether major preprint servers like arXiv introduce automated reference verification tools; there are quiet conversations happening about this. The open question I'd most want answered: does the memory curse appear in AI systems that work on real tasks — customer service, code review, legal drafting — and not just in controlled game settings?

Three unsettling papers, one consistent message — thanks for sitting with it. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io