
[Artificial Intelligence] Daily digest — 281 papers, 0 strong connections (2026-05-11)

DeepScience
Artificial Intelligence · Daily Digest
May 11, 2026 · 281 Papers · 10/10 Roadblocks Active · 0 Connections
⚡ Signal of the Day
• Large-scale empirical evidence now confirms that LLM hallucination is corrupting the scientific record at measurable scale, with an estimated 146,932 phantom citations produced in 2025 alone across major academic repositories.
• Unlike benchmark findings, this is real-world downstream harm: phantom references are diffused across many papers rather than concentrated in a few, making systematic detection and correction extremely difficult without automated tooling at the publisher or indexer level.
• Watch for follow-up work on automated citation-screening pipelines integrated into journal submission workflows; this study establishes the quantitative baseline needed to measure the problem going forward.
📄 Top 10 Papers
LLM hallucinations in the wild: Large-scale evidence from non-existent citations
Researchers audited 111 million citations from 2.5 million papers published across arXiv, bioRxiv, SSRN, and PubMed Central and found a sharp post-2023 surge in references that do not exist anywhere in the scholarly record, conservatively estimating at least 146,932 hallucinated citations in 2025 alone. The phantom references are spread diffusely across many papers rather than concentrated in a few bad actors, which makes them hard to spot through ordinary peer review. This is the first large-scale, high-confidence empirical proof that LLM hallucination is already degrading the integrity of published science at a scale that demands systematic intervention.
██████████ 0.9 hallucination-grounding Preprint
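The audit's core operation can be pictured as a simple existence check against an open bibliographic index. The sketch below queries the OpenAlex works API (one of the sources listed at the end of this digest) and flags a citation when no indexed title is a close fuzzy match; the endpoint usage, matching rule, and 0.85 threshold are illustrative assumptions, not the authors' pipeline.

import requests
from difflib import SequenceMatcher

OPENALEX_SEARCH = "https://api.openalex.org/works"

def looks_phantom(cited_title: str, threshold: float = 0.85) -> bool:
    """Return True if no indexed work's title closely matches the cited title."""
    resp = requests.get(OPENALEX_SEARCH,
                        params={"search": cited_title, "per-page": 5},
                        timeout=10)
    resp.raise_for_status()
    for work in resp.json().get("results", []):
        candidate = (work.get("title") or "").lower()
        if SequenceMatcher(None, cited_title.lower(), candidate).ratio() >= threshold:
            return False  # a plausible real match exists in the index
    return True  # nothing in the index resembles this reference

print(looks_phantom("Attention Is All You Need"))  # expected: False (real paper)

A real screening pipeline would also match authors, venue, and year before flagging anything, but the same lookup-and-compare structure is what publisher- or indexer-level tooling would automate.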
MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
MedVIGIL tests 16 medical vision-language models by deliberately breaking the evidence they are given—introducing false text premises, corrupting image regions of interest, or flipping images—and measuring whether models refuse to answer or silently return confident but wrong responses. Every model tested showed high rates of silent failure, with the best AI system (Claude Opus 4.7, composite score 69.2) still 14 points below a human radiologist (83.3), who refused to answer when evidence was insufficient. Silent failures in clinical AI are more dangerous than obvious errors because neither the clinician nor the patient has a signal that something went wrong.
█████████ 0.9 hallucination-grounding Preprint
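The key quantity is how often a model answers confidently despite broken evidence instead of refusing. A minimal sketch of that scoring logic, assuming a string-matching notion of refusal and correctness (the model interface and refusal markers are invented for illustration, not MedVIGIL's harness):

from typing import Callable, Dict, List

REFUSAL_MARKERS = ("cannot answer", "insufficient evidence", "not able to determine")

def is_refusal(answer: str) -> bool:
    """Crude string test for an explicit refusal (markers are assumptions)."""
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)

def silent_failure_rate(model: Callable[[Dict], str], broken_cases: List[Dict]) -> float:
    """Fraction of broken-evidence cases answered confidently but incorrectly."""
    silent = 0
    for case in broken_cases:  # each case: {"prompt": ..., "gold": ...}
        answer = model(case)
        wrong = case["gold"].lower() not in answer.lower()
        if wrong and not is_refusal(answer):
            silent += 1  # no refusal signal, yet the answer is wrong
    return silent / max(len(broken_cases), 1)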
The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents
Across 7 LLMs and 4 social dilemma games, giving agents access to longer histories of past interactions consistently degraded cooperative behavior—18 of 28 model-game combinations showed significant declines. The mechanism is not increased paranoia or distrust of partners; rather, real interaction histories erode the agents' forward-looking intent, a finding confirmed by replacing real histories with sanitized synthetic records (which restored cooperation at identical prompt lengths). This means simply expanding context windows to give AI agents better memory of past interactions may undermine coordination in multi-agent deployments.
█████████ 0.9 alignment-safety Preprint
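The experimental contrast is easy to state in code: hold the game fixed and vary only how much history the agent is shown. A minimal sketch under placeholder assumptions (llm_choose stands in for a real LLM call, and the random partner policy is illustrative only):

import random

def llm_choose(history_text: str) -> str:
    """Stand-in for an LLM decision call; returns 'C' (cooperate) or 'D' (defect)."""
    return random.choice(["C", "D"])

def cooperation_rate(rounds: int, history_window: int) -> float:
    """Play an iterated dilemma, showing the agent only the last history_window rounds."""
    history, cooperations = [], 0
    for _ in range(rounds):
        visible = history[-history_window:] if history_window else []
        move = llm_choose("Past rounds: " + "; ".join(visible))
        cooperations += (move == "C")
        partner = random.choice(["C", "D"])  # illustrative partner policy
        history.append(f"you={move}, partner={partner}")
    return cooperations / rounds

# Same game, different memory lengths -- the paper's contrast (numbers illustrative):
print(cooperation_rate(rounds=50, history_window=2))
print(cooperation_rate(rounds=50, history_window=50))

The paper's synthetic-history control corresponds to swapping the contents of the visible rounds while keeping their length fixed, which isolates memory content from prompt length.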
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
A benchmark of 270 escape-room-style puzzles with dependency chains of 5 to 25 steps shows that the best LLM agent drops from 90% accuracy on simple tasks to 60% on complex ones, while humans only drop from 98% to 80% over the same range. The primary failure mode is not individual tool invocation but long-range state tracking—agents lose intermediate results and fail to propagate them through later steps. This identifies a concrete, measurable gap between current agents and human performance in the kind of multi-step, tool-chained reasoning required for real-world automation.
██████████ 0.8 agent-tool-use Preprint
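The headline comparison comes from bucketing outcomes by dependency-chain length. A minimal sketch of that analysis; the record format and the 10-step cutoff between "simple" and "complex" are assumptions made for illustration:

from collections import defaultdict

def accuracy_by_chain_length(results):
    """results: iterable of records like {"chain_len": 12, "solved": True}."""
    totals, solved = defaultdict(int), defaultdict(int)
    for r in results:
        bucket = "simple (<=10 steps)" if r["chain_len"] <= 10 else "complex (>10 steps)"
        totals[bucket] += 1
        solved[bucket] += int(r["solved"])
    return {bucket: solved[bucket] / totals[bucket] for bucket in totals}

print(accuracy_by_chain_length([
    {"chain_len": 5, "solved": True},
    {"chain_len": 8, "solved": True},
    {"chain_len": 20, "solved": False},
    {"chain_len": 25, "solved": True},
]))  # {'simple (<=10 steps)': 1.0, 'complex (>10 steps)': 0.5}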
RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation
RuleSafe-VL converts real platform moderation policies into 93 atomic rules and 92 typed decision relations, then tests 10 VLMs on 2,166 expert-annotated cases to diagnose exactly where reasoning breaks down. The dominant bottleneck is rule-relation recovery—figuring out which rules apply to a given case—with best performance at 64.8 Macro-F1 and safety-oriented models scoring as low as 7. This reveals that safety fine-tuning for refusal behavior does not confer rule-based reasoning ability, which is the actual competency needed for policy-governed content moderation.
██████████ 0.8 reasoning-reliability Preprint
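Rule-relation recovery can be scored as multi-label prediction of which atomic rules apply to a case, summarized with Macro-F1 so rare rules count as much as common ones. A minimal sketch of that metric, with the per-case set representation as an assumed data format:

def macro_f1(gold, pred, rule_ids):
    """gold/pred: per-case sets of applicable rule IDs; rule_ids: all atomic rules."""
    f1s = []
    for rule in rule_ids:
        tp = sum(1 for g, p in zip(gold, pred) if rule in g and rule in p)
        fp = sum(1 for g, p in zip(gold, pred) if rule not in g and rule in p)
        fn = sum(1 for g, p in zip(gold, pred) if rule in g and rule not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)  # average over rules, not over cases

gold = [{"R1"}, {"R1", "R2"}, {"R3"}]
pred = [{"R1"}, {"R2"}, set()]
print(macro_f1(gold, pred, ["R1", "R2", "R3"]))  # ~0.56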
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
GazeVLM gives a 4B-parameter vision-language model the ability to direct its own visual attention by generating special gaze tokens paired with bounding-box coordinates that suppress non-focal image regions via causal attention masks—simulating foveal fixation without re-encoding the image. The model is trained with reinforcement learning rewards that penalize geometrically invalid or redundant gaze, making attention allocation a learned skill rather than a fixed mechanism. On high-resolution benchmarks (HRBench-4k, HRBench-8k), this yields a 4% improvement over same-class models, suggesting that internalized attention control can meaningfully close the gap between visual grounding and reasoning.
██████████ 0.8 multimodal-understanding Preprint
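The masking mechanism can be sketched as an additive attention bias over image-patch tokens: patches outside the predicted gaze box receive a large negative value so attention effectively ignores them without re-encoding the image. Grid size, mask value, and the bounding-box convention below are assumptions; this is not the GazeVLM implementation.

import torch

def gaze_attention_bias(box, grid=24, suppress=-1e4):
    """box = (x0, y0, x1, y1) in [0, 1]; returns a (grid*grid,) additive bias."""
    x0, y0, x1, y1 = box
    ys = (torch.arange(grid).float() + 0.5) / grid  # patch-center y coordinates
    xs = (torch.arange(grid).float() + 0.5) / grid  # patch-center x coordinates
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    inside = (xx >= x0) & (xx <= x1) & (yy >= y0) & (yy <= y1)
    bias = torch.where(inside, torch.zeros_like(xx), torch.full_like(xx, suppress))
    return bias.flatten()  # add to the attention logits of the image-token block

bias = gaze_attention_bias((0.25, 0.25, 0.75, 0.75))
print(bias.shape, int((bias == 0).sum()), "focal patches kept")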
FactoryBench: Evaluating Industrial Machine Understanding
FactoryBench constructs roughly 71,000 question-answer pairs from real robotic telemetry data, organized along Pearl's causal ladder from basic state observation up to counterfactual and decision-level reasoning about industrial machines. No frontier LLM exceeds 50% accuracy on the causal levels, and decision-making accuracy caps at 18%, revealing that strong general language ability does not transfer to operational understanding of physical machinery. The dataset is publicly released on HuggingFace, giving the community a concrete benchmark to track progress in industrial AI.
██████████ 0.8 embodied-ai Preprint
Learning CLI Agents with Structured Action Credit under Selective Observation
Training reinforcement learning agents to navigate large codebases via command-line interfaces is difficult because rewards only appear at the end of long multi-step trajectories. This paper addresses the problem with two mechanisms: selective token-budgeted context retrieval (so agents only read task-relevant parts of large codebases) and AST-based action decomposition that assigns credit to sub-steps rather than entire trajectories. The combination enables effective learning from sparse terminal rewards, which matters for automating software engineering tasks where agents must read, modify, and test code across complex repository structures.
██████████ 0.8 agent-tool-use Preprint
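A minimal sketch of the AST-based decomposition idea: parse the files an agent edited, treat each defined function as a sub-action, and spread the terminal reward across those units. The uniform credit split and the file::function unit are assumptions for illustration, not the paper's exact scheme.

import ast

def touched_functions(source: str) -> list[str]:
    """Names of functions/methods defined in an edited file, found via the AST."""
    tree = ast.parse(source)
    return [node.name for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]

def assign_credit(edited_files: dict[str, str], reward: float) -> dict[str, float]:
    """Split a sparse terminal reward evenly across the functions an edit touched."""
    units = [f"{path}::{fn}" for path, src in edited_files.items()
             for fn in touched_functions(src)]
    return {unit: reward / len(units) for unit in units} if units else {}

print(assign_credit({"utils.py": "def parse(x):\n    return int(x)\n"}, reward=1.0))

A fuller scheme would weight each unit by which tests exercised it, but the AST gives the sub-step granularity that whole-trajectory rewards lack.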
Melding LLM and temporal logic for reliable human-swarm collaboration in complex scenarios
This paper encodes mission rules as Linear Temporal Logic automata and uses them as hard constraints on an LLM planning pipeline, ensuring that a swarm of 40+ heterogeneous robots executes only valid, rule-compliant task sequences. Tested in simulation and on real hardware, the framework guarantees 100% LTL constraint satisfaction while reducing human oversight to sparse, event-triggered confirmations rather than continuous monitoring. This is a practical path for deploying LLMs in safety-critical multi-robot settings without assuming the model's native reliability is sufficient.
██████████ 0.8 reasoning-reliability Preprint
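The safety layer amounts to running every proposed task sequence through an automaton compiled from the mission rules and rejecting any sequence that reaches a violating state. A toy sketch with an invented rule and a hand-written transition table (real LTL tooling would compile the automaton automatically):

REJECT = "reject"

# Invented rule, informally "a zone must be surveyed before it is extracted",
# written as a transition table: state -> action -> next state.
AUTOMATON = {
    "unsurveyed": {"survey": "surveyed", "extract": REJECT, "idle": "unsurveyed"},
    "surveyed":   {"survey": "surveyed", "extract": "surveyed", "idle": "surveyed"},
}

def plan_is_valid(plan: list[str], start: str = "unsurveyed") -> bool:
    """Accept an LLM-proposed action sequence only if it never hits the reject state."""
    state = start
    for action in plan:
        state = AUTOMATON.get(state, {}).get(action, REJECT)
        if state == REJECT:
            return False
    return True

print(plan_is_valid(["survey", "extract"]))   # True: rule respected
print(plan_is_valid(["extract", "survey"]))   # False: violates the rule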
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
Applying reinforcement learning from human feedback to video understanding models has been held back by the lack of quality preference data and evaluation benchmarks; this paper addresses both gaps with VURB (2,100 expert-annotated video preference pairs with chain-of-thought traces averaging 1,143 tokens) and an automated pipeline for constructing VUP-35K at scale. Two reward model architectures—discriminative (VideoDRM) and generative (VideoGRM)—are trained and evaluated, with majority-voting used to reduce position bias. The infrastructure matters because video reward models are a necessary prerequisite for aligning video AI systems the same way text models have been aligned.
██████████ 0.8 multimodal-understanding Preprint
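Position bias in pairwise judging is typically reduced by querying the judge with both response orderings and aggregating the verdicts. A minimal sketch of that majority vote, where judge is a placeholder for a generative reward-model call and its interface is an assumption:

from collections import Counter

def judge(prompt: str, first: str, second: str) -> str:
    """Placeholder for a generative judge call; returns 'first' or 'second'."""
    return "first"

def debias_preference(prompt: str, a: str, b: str, trials: int = 3) -> str:
    """Majority vote over both presentation orders to cancel position bias."""
    votes = Counter()
    for _ in range(trials):
        votes["a" if judge(prompt, a, b) == "first" else "b"] += 1
        votes["a" if judge(prompt, b, a) == "second" else "b"] += 1  # swapped order
    return votes.most_common(1)[0][0]  # 'a' or 'b'

print(debias_preference("Describe the event in the clip.", "answer A", "answer B"))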
🔬 Roadblock Activity
Roadblock · Papers · Status · Signal
Hallucination & Grounding · 93 · Active · Two high-quality empirical studies today (one documenting ~147k phantom academic citations in 2025, another showing silent clinical failures in 16 medical VLMs) shift the hallucination problem from benchmark abstraction to measured real-world harm.
Reasoning Reliability · 106 · Active · Industrial causal reasoning benchmarks and content-moderation rule-recovery evaluations both show frontier models scoring well below 50% on structured multi-step tasks, reinforcing that general language fluency does not guarantee reliable reasoning.
Alignment & Safety · 81 · Active · The memory-curse finding (that larger context windows erode cooperative intent in multi-agent systems) adds a novel, empirically grounded mechanism to alignment concerns beyond the usual single-model safety framing.
Agent Tool Use · 58 · Active · AgentEscapeBench and the CLI agent paper together quantify a consistent long-range dependency bottleneck: agents handle individual tool calls acceptably but fail at propagating intermediate state across chains longer than ~10 steps.
Multimodal Understanding · 70 · Active · Work on internalized visual attention control (GazeVLM) and video reward modeling infrastructure both address foundational gaps in how multimodal models allocate attention and receive learning signal.
Interpretability · 100 · Active · Activity remains high in volume but today's top papers lean toward behavioral evaluation rather than mechanistic interpretability; no strong interpretability-specific result surfaced in the top tier.
Data Quality & Curation · 112 · Active · The hallucinated-citations study is the strongest data-quality signal of the day, showing that LLM-generated noise is now measurably entering training and citation corpora at scale.
Efficiency & Scaling · 96 · Active · No strong efficiency-scaling papers surfaced in the top tier today; the roadblock remains active in volume but without a headline result.
Long Context · 33 · Active · The memory-curse study provides an unexpected negative result for long-context: more context does not straightforwardly improve agent behavior and can actively degrade it in cooperative settings.
Embodied AI · 23 · Active · FactoryBench shows frontier LLM decision-level accuracy on industrial telemetry capping at 18%, an 82-point shortfall from ceiling performance, establishing a concrete evaluation baseline for embodied and industrial AI systems.
DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io