
[Artificial Intelligence] Daily digest — 279 papers, 0 strong connections (2026-05-10)

DeepScience
Artificial Intelligence · Daily Digest
May 10, 2026
279 Papers · 10/10 Roadblocks Active · 0 Connections
⚡ Signal of the Day
• Multiple independent papers today show LLM agents failing in systematic, hard-to-detect ways — stale memories, citation hallucinations, and alignment blind spots — suggesting the reliability gap is structural, not incidental.
• The pattern is consistent: agents produce outputs that pass surface-level checks (valid links, syntactically correct code, plausible safety assessments) while failing on substance (factual accuracy, constraint adherence, belief revision), which is a more dangerous failure mode than obvious errors.
• Watch for whether physics-aware verification gates — like the vision-language checker in AI CFD Scientist — can be generalized to other agentic scientific workflows as a practical mitigation strategy.
📄 Top 10 Papers
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
An AI agent framework autonomously runs the full cycle of computational fluid dynamics research — reading papers, running simulations, modifying solver source code, and writing up results — with a vision-language gate that checks whether flow-field outputs are physically plausible before accepting them. This gate caught 14 of 16 deliberately injected silent failures that standard solver-level checks missed entirely. The result shows that domain-specific sanity checking, not just execution ability, is the key ingredient separating reliable scientific AI agents from ones that silently propagate bad results.
0.9 · agent-tool-use · Preprint
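The plausibility-gate idea is easy to picture in code. The sketch below is a minimal, hypothetical stand-in (simple numerical checks rather than the paper's vision-language checker) for the kind of domain-specific gate that sits between the solver and the agent's next step:

    import numpy as np

    def plausibility_gate(u, v, dx, dy, max_speed=100.0, div_tol=1e-3):
        """Hypothetical sanity check for a 2D incompressible flow field.

        u, v: velocity components on a uniform grid; dx, dy: grid spacing.
        Returns (accepted, reason).
        """
        # Reject non-finite values that a solver can emit without erroring.
        if not (np.all(np.isfinite(u)) and np.all(np.isfinite(v))):
            return False, "non-finite values in flow field"
        # Reject obviously unphysical velocity magnitudes.
        speed = np.sqrt(u**2 + v**2)
        if speed.max() > max_speed:
            return False, f"max speed {speed.max():.1f} exceeds {max_speed}"
        # Incompressible flow should have near-zero divergence everywhere.
        div = np.gradient(u, dx, axis=1) + np.gradient(v, dy, axis=0)
        if np.abs(div).mean() > div_tol:
            return False, f"mean divergence {np.abs(div).mean():.2e} above {div_tol}"
        return True, "passed basic physical plausibility checks"

The point is the placement: outputs are checked before they are accepted, not after they have propagated into the write-up.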
Automated alignment is harder than you think
This position paper argues that using AI agents to automate AI safety research is riskier than it appears, because alignment tasks — such as judging whether a behavior is safe — lack clear right-or-wrong answers that make errors detectable. The core concern is that optimization pressure pushes agents to generate mistakes that cluster precisely in the blind spots of human reviewers, meaning systematic safety failures could accumulate unnoticed. The argument is theoretical but directly relevant to the growing number of AI safety programs that already use AI to assist with safety evaluations.
0.9 · alignment-safety · Preprint
Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
Benchmarking 14 LLMs on whether their cited sources actually support their factual claims reveals that frontier models achieve only 39–77% factual accuracy in citations, even while maintaining link validity above 94% — meaning links work but the cited content does not support the stated claim. Factual accuracy drops by roughly 42% as the number of retrieval calls scales from 2 to 150, suggesting that deeper research runs produce progressively less reliable attribution. This is a practically important finding because well-formatted, link-valid citations create an appearance of rigor that masks frequent factual misrepresentation.
0.9 · hallucination-grounding · Preprint
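The two checks the benchmark separates can be sketched directly. The snippet below uses hypothetical judge wiring (not the paper's evaluation protocol) to show why link validity and factual support are different tests:

    import requests

    def link_is_valid(url: str, timeout: float = 10.0) -> bool:
        """Surface-level check: does the cited URL resolve at all?"""
        try:
            resp = requests.get(url, timeout=timeout, allow_redirects=True)
            return resp.status_code < 400
        except requests.RequestException:
            return False

    def claim_is_supported(claim: str, source_text: str, judge) -> bool:
        """Substantive check: does the fetched source entail the claim?

        `judge` is any callable wrapping an NLI model or LLM judge (an
        assumption here); it returns "supported", "contradicted", or "unrelated".
        """
        return judge(claim=claim, evidence=source_text) == "supported"

A citation only counts as correct when both checks pass; the paper's finding is that the first check succeeds more than 94% of the time while the second frequently does not.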
STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
The paper identifies a failure mode called Implicit Conflict, where an agent's earlier stored belief is invalidated by later evidence without any explicit contradiction — requiring the agent to infer the conflict through commonsense reasoning rather than detecting a direct negation. Tested on a 400-scenario benchmark with contexts up to 150K tokens, frontier LLMs score only 55.2% overall, barely above chance. For any agent deployed over extended sessions — customer service, medical, legal — this means stale beliefs will routinely shape decisions long after the underlying facts have changed.
0.9 · reasoning-reliability · Preprint
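One way to see the gap: beliefs need a validity flag that something actively maintains. The sketch below is illustrative only (the conflict detector is a hypothetical callable, not STALE's method), but it shows the revision step that current agents effectively skip:

    from dataclasses import dataclass, field

    @dataclass
    class Belief:
        statement: str      # e.g. "the customer's subscription renews monthly"
        source: str         # where the belief came from
        turn: int           # when it was stored
        valid: bool = True  # flipped when later evidence invalidates it

    @dataclass
    class AgentMemory:
        beliefs: list = field(default_factory=list)

        def add(self, belief: Belief) -> None:
            self.beliefs.append(belief)

        def revise(self, observation: str, conflict_detector) -> None:
            """Screen every stored belief against a new observation.

            `conflict_detector(statement, observation)` is assumed to return
            True when the observation implicitly invalidates the belief, even
            with no explicit negation -- the inference STALE shows models miss.
            """
            for b in self.beliefs:
                if b.valid and conflict_detector(b.statement, observation):
                    b.valid = False  # mark stale instead of silently reusing it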
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
A benchmark of 759 hours of real clinical procedure videos (colonoscopies, surgeries) with 1,253 multiple-choice questions finds that current multimodal LLMs achieve only 41.1% accuracy on full-procedure understanding — and that feeding more video frames does not reliably improve performance, contrary to common assumptions. Evidence retrieval and clinical interpretation are the weakest stages in the pipeline. This matters because many proposed medical AI applications assume models can reason across complete clinical workflows; this benchmark provides a concrete measurement showing they currently cannot.
0.8 · long-context · Preprint
Autonomous Adversary: Red-Teaming in the age of LLM
LLM-based cyberattack agents were tested in controlled Windows Active Directory environments across three planning modes: fully autonomous, self-scaffolded, and expert-guided. Expert-guided agents completed more tasks, but all modes failed frequently on lateral-movement scenarios. The dominant failure modes were not strategic reasoning errors but brittle command syntax, credential management mistakes, and inability to track environment state across steps — practical engineering gaps rather than fundamental reasoning limits, which gives practitioners a clearer picture of where current offensive AI tools actually break.
0.8 · agent-tool-use · Preprint
Plausibility, persuasion, and truth: why language models may appear designed to deceive
This theoretical commentary identifies four structural mechanisms — plausibility optimization, post-training incentive bias, hallucination, and source bias — that cause LLMs to consistently produce confident-sounding false statements without any intent to deceive. The argument is that training processes reward outputs that sound persuasive and coherent over outputs that are strictly accurate, creating a systematic gap between apparent trustworthiness and actual reliability. While not empirical, the framework is useful for understanding why alignment interventions that focus on output-level rewards may be structurally insufficient.
0.8 · alignment-safety · Peer-reviewed
MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
MANTRA converts plain-language procedural manuals into machine-checkable compliance test suites for AI agents by independently generating a symbolic world model and trace-level compliance checks via LLM, then using an SMT solver — a formal logic engine — to detect contradictions between the two artifacts and automatically repair them. This produces 285 benchmark tasks across 6 domains from manuals up to 50 pages long. The significance is practical: evaluating whether tool-using agents follow procedures correctly currently lacks standardized, formally grounded benchmarks, and MANTRA provides an automated path to generating them.
0.8 · agent-tool-use · Preprint
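The consistency-checking step can be illustrated with an off-the-shelf SMT solver. The toy Z3 example below is not MANTRA's generated artifacts; it only shows how asserting two independently generated constraint sets together and getting an unsat result exposes a contradiction that triggers repair:

    from z3 import Int, Bool, Solver, Implies, Not, unsat

    # Toy world-model constraint: any refund over 100 requires manager approval.
    refund_amount = Int("refund_amount")
    manager_approved = Bool("manager_approved")
    world_model = Implies(refund_amount > 100, manager_approved)

    # Toy trace-level compliance check, generated independently, that treats a
    # 250 refund with no recorded approval as an acceptable state.
    trace_check = [refund_amount == 250, Not(manager_approved)]

    s = Solver()
    s.add(world_model)
    s.add(*trace_check)

    if s.check() == unsat:
        print("Artifacts contradict each other; trigger automated repair.")
    else:
        print("Artifacts are mutually consistent; no repair needed.")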
ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning
A 4B open-source language model trained to translate natural language requirements into Signal Temporal Logic — a formal language used to specify safety constraints in control systems — matches or beats much larger commercial LLM APIs on this task. The key is process-level reward supervision: the model is rewarded not just for producing a correct final formula but for using intermediate tool calls correctly, such as unit conversion and temporal normalization, which prevents reinforcing plausible-but-wrong reasoning chains. This matters for AI safety because formal specification is a prerequisite for formally verified agent behavior.
0.8 · reasoning-reliability · Preprint
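A minimal sketch of what process-level reward means in practice (hypothetical weights and checker functions, not the paper's training recipe): the trajectory is scored step by step, so a correct final formula reached through unverified intermediate tool calls earns less reward.

    def process_reward(trajectory, gold_formula, step_checks,
                       w_step=0.5, w_final=0.5):
        """Hypothetical process-level reward for one NL -> STL episode.

        trajectory  : list of (tool_name, tool_input, tool_output) tuples,
                      ending with the emitted STL formula
        gold_formula: reference formula (exact match is a simplification;
                      semantic equivalence checking would be better)
        step_checks : dict mapping tool_name -> callable verifying that call
        """
        if not trajectory:
            return 0.0
        # Credit intermediate steps (unit conversion, temporal normalization,
        # ...) only when their outputs are independently verified.
        correct = sum(
            1 for name, inp, out in trajectory[:-1]
            if name in step_checks and step_checks[name](inp, out)
        )
        step_score = correct / max(len(trajectory) - 1, 1)
        final_score = 1.0 if trajectory[-1][2] == gold_formula else 0.0
        return w_step * step_score + w_final * final_score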
Constraint Decay: The Fragility of LLM Agents in Backend Code Generation
Testing LLM coding agents across 100 backend tasks reveals a systematic pattern: assertion pass rates drop by roughly 30 percentage points as non-functional constraints accumulate (framework choice, architectural pattern, database backend, ORM), even when each constraint is individually straightforward. Agents handle minimal, explicit frameworks like Flask much better than convention-heavy ones like FastAPI or Django, and data-layer errors — incorrect query composition and ORM violations — are the leading cause of failure. This gives software teams concrete guidance on where LLM coding agents are currently unreliable in production-grade contexts.
0.8 · agent-tool-use · Preprint
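The decay curve itself is easy to reproduce on your own tasks. A hedged sketch (hypothetical helper names, not the paper's harness): regenerate the same backend task with progressively more non-functional constraints attached and record the assertion pass rate at each level.

    def constraint_decay_curve(task, constraints, generate_solution,
                               run_assertions, trials=5):
        """Hypothetical harness: pass rate as constraints accumulate.

        task              : base task, e.g. "build a REST endpoint for orders"
        constraints       : ordered list, e.g. ["use FastAPI",
                            "repository pattern", "PostgreSQL", "SQLAlchemy ORM"]
        generate_solution : callable(prompt) -> code from the agent (assumed)
        run_assertions    : callable(code) -> fraction of assertions passed (assumed)
        """
        curve = []
        for k in range(len(constraints) + 1):
            prompt = task + "".join(f"\nConstraint: {c}" for c in constraints[:k])
            rates = [run_assertions(generate_solution(prompt)) for _ in range(trials)]
            curve.append((k, sum(rates) / len(rates)))
        return curve  # list of (number_of_constraints, mean_pass_rate) pairs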
🔬 Roadblock Activity
Roadblock · Papers · Status · Signal
Model Interpretability · 116 · Active · Highest paper volume of any roadblock today, with activity spread across mechanistic analysis, probing studies, and regulatory applications such as driver monitoring systems.
Data Quality and Curation · 110 · Active · Second-highest volume; no single standout paper today, suggesting broad baseline activity rather than a concentrated breakthrough.
Reasoning Reliability · 97 · Active · STALE and ReasonSTL both address structural failures in multi-step and belief-update reasoning, reinforcing the emerging view that reliability failures are architectural, not just scaling issues.
Efficiency and Scaling · 86 · Active · Solid volume with no top-tier paper surfaced today; SIRA's single-shot retrieval approach is the closest to a systems-efficiency contribution but lacks released code for verification.
Hallucination and Grounding · 78 · Active · Cited but Not Verified and STALE together demonstrate that hallucination manifests not just in generation but in downstream reasoning over retrieved or stored information — a more dangerous and harder-to-detect failure mode.
Alignment and Safety · 76 · Active · Two independent papers today — one theoretical, one argumentative — converge on the same warning: alignment evaluation itself is vulnerable to the hallucination and plausibility-optimization dynamics it is trying to measure.
Multimodal Understanding · 71 · Active · MedHorizon is the day's clearest multimodal result, revealing that scaling frame count does not improve performance on long clinical videos — a direct challenge to naive data-scaling intuitions.
Agent Tool Use · 64 · Active · Four top-10 papers address agent tool use today (AI CFD Scientist, Autonomous Adversary, MANTRA, Constraint Decay), with a shared theme: agents break not on strategy but on brittle execution of individual tool calls under compounding constraints.
Long-Context Processing · 32 · Active · MedHorizon provides the sharpest long-context finding today, showing that simply increasing input length does not help models reason over extended clinical procedure videos.
Embodied AI · 27 · Active · Lowest activity among tracked roadblocks; the Prediction and Empowerment theory paper touches embodied settings via POMDP formalism but is too preliminary to move the needle.
View Full Analysis
DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io