DeepScience

Artificial Intelligence · Daily Digest

June 15, 2026

292 Papers · 11/11 Roadblocks Active

⚡ Signal of the Day

• Two independent benchmarks published today converge on the same uncomfortable finding: frontier AI agents succeed on realistic tasks only 17–19% of the time, while non-expert humans clear 80%+.

• SIMMER finds that even when agents produce plausible-looking plans, up to 56% contain 'latent failures' — errors with no immediate feedback that lead to irreversible consequences — meaning standard success metrics are systematically blind to the most dangerous failure mode.

• Watch whether the community responds by hardening evaluation (more world-model-grounded benchmarks like SIMMER and GauntletBench) or by attacking the capability gap directly. The agent-tool-use roadblock logged 68 papers today, signaling active pressure on both fronts.

📄 Top 10 Papers

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

SIMMER evaluates LLM-generated household plans against a symbolic world model encoding 77 actions, 262 objects, and ~46,800 interactions, revealing that even frontier models produce error-free plans only 17% of the time — and that up to 56% of plans contain 'latent failures' that produce no immediate error signal but lead to irreversible outcomes. This matters because most agent benchmarks score final outcomes, not intermediate state validity, meaning current evaluations systematically miss the most dangerous failure class. The finding that the majority of latent failures are irreversible sets a concrete safety target: agents deployed in physical or semi-physical contexts need world-model grounding, not just output scoring.

██████████ 0.9 agent-tool-use Preprint
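
To make the SIMMER summary's distinction between immediate errors and latent failures concrete, here is a minimal sketch. It assumes a toy three-action microwave world invented for illustration; the action names, `METAL_OBJECTS`, `simulate_plan`, and `is_error_free` are hypothetical and are not SIMMER's actual world model or API.

```python
# Toy symbolic world model (hypothetical; not SIMMER's actual action set).
METAL_OBJECTS = {"fork", "foil"}


def simulate_plan(plan):
    """Run a plan of (action, argument) steps and label each step.

    Verdicts:
      "ok"              - the step is valid in the world model
      "immediate_error" - a precondition fails, so an executor would see an error
      "latent_failure"  - the step runs silently but causes irreversible damage
    """
    in_microwave = set()
    verdicts = []
    for action, arg in plan:
        if action == "put_in_microwave":
            in_microwave.add(arg)
            verdicts.append("ok")
        elif action == "take_out":
            if arg in in_microwave:
                in_microwave.remove(arg)
                verdicts.append("ok")
            else:
                verdicts.append("immediate_error")
        elif action == "turn_on_microwave":
            # No precondition is violated, so nothing is reported at run time,
            # but metal inside leaves the world in an irreversibly bad state.
            bad = bool(in_microwave & METAL_OBJECTS)
            verdicts.append("latent_failure" if bad else "ok")
        else:
            verdicts.append("immediate_error")  # unknown action
    return verdicts


def is_error_free(verdicts):
    """A plan counts as error-free only if every step is valid."""
    return all(v == "ok" for v in verdicts)


plan = [
    ("put_in_microwave", "fork"),
    ("turn_on_microwave", None),
    ("take_out", "soup"),
]
print(simulate_plan(plan))
```

An evaluator that only watched for run-time errors would flag the third step and miss the second, which is the blind spot the summary describes.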

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

GauntletBench tests frontier agentic systems on web-based tasks drawn from less-familiar application environments, finding only 19.1% task success against non-expert human performance above 80% — a gap that strongly suggests current agents are pattern-matching to training distributions rather than exhibiting genuine generalizable capability. This matters because many widely-cited agent benchmarks are now saturated, creating an illusion of near-human performance that GauntletBench directly contradicts. The benchmark's modular design supports both open- and closed-source agent frameworks, making it a practical evaluation tool for the community.

██████████ 0.9 agent-tool-use Preprint

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

ClinHallu decomposes medical multimodal LLM reasoning into three distinct stages — visual recognition, knowledge recall, and reasoning integration — and shows that hallucinations originate from different stages across different samples, meaning a single mitigation strategy cannot fix all cases. Using 7,031 validated instances from four medical VQA datasets with structured reasoning trace annotations, the benchmark enables stage-replacement interventions that pinpoint exactly where each model's reasoning breaks down. Code and data are publicly released, making this an immediately usable diagnostic tool for medical AI safety.

██████████ 0.9 hallucination-grounding Preprint
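
The stage-replacement idea in the ClinHallu summary can be sketched roughly as follows: swap one stage's output for its annotated reference and check whether the final answer recovers. This is an illustrative guess at the general technique, not the benchmark's released code; `run_pipeline`, `locate_faulty_stage`, and the toy stage functions are all hypothetical.

```python
STAGES = ("visual_recognition", "knowledge_recall", "reasoning_integration")


def run_pipeline(stage_fns, question, overrides=None):
    """Run the three stages in order, feeding each output to the next stage.

    `overrides` maps a stage name to a reference output that replaces
    whatever the model would have produced at that stage.
    """
    overrides = overrides or {}
    current = question
    for stage in STAGES:
        current = overrides[stage] if stage in overrides else stage_fns[stage](current)
    return current  # output of the final stage, i.e. the answer


def locate_faulty_stage(stage_fns, question, reference_trace):
    """Return the earliest stage whose replacement fixes the answer, or None."""
    gold_answer = reference_trace[STAGES[-1]]
    if run_pipeline(stage_fns, question) == gold_answer:
        return None  # nothing to diagnose
    for stage in STAGES[:-1]:
        fixed = run_pipeline(stage_fns, question, {stage: reference_trace[stage]})
        if fixed == gold_answer:
            return stage
    # Correct upstream inputs did not help, so the last stage is to blame.
    return STAGES[-1]


# Toy model that misreads the image (hypothetical example data).
toy_model = {
    "visual_recognition": lambda q: "opacity in left lung",
    "knowledge_recall": lambda finding: finding + " suggests pneumonia",
    "reasoning_integration": lambda facts: "pneumonia, left" if "left" in facts else "pneumonia, right",
}
reference = {
    "visual_recognition": "opacity in right lung",
    "knowledge_recall": "opacity in right lung suggests pneumonia",
    "reasoning_integration": "pneumonia, right",
}
print(locate_faulty_stage(toy_model, "Where is the finding?", reference))
```

Because different samples fail at different stages, running this per sample gives a per-stage breakdown rather than a single hallucination rate.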

SkillMutator: Benchmarking and Defending Language-and-Code Cross-modal Attacks on LLM Agent Skills

SkillMutator reveals that existing open-source and commercial skill scanners detect only 2–17% of adversarial attacks that exploit the gap between natural-language skill descriptions and the executable code those skills actually run — an attack surface almost entirely overlooked by the safety community. The benchmark generates 187 adversarially mutated scenarios across 13 attack categories using iterative evasion refinement, then trains a small open-weight model (7B parameters) that substantially outperforms existing scanners. The transferability of attacks across scanners suggests the underlying vulnerability is structural, not model-specific.

██████████ 0.9 agent-tool-use Preprint

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

CORA identifies and quantifies a persistent problem in multimodal reinforcement learning from verifiable rewards (RLVR): models frequently produce reasoning traces that are semantically inconsistent with their final answers, and this inconsistency is not resolved by standard GRPO training across the full training run. The authors address this by adding a lightweight NLI-style consistency reward model and a hybrid reward splitting strategy that prevents task and consistency signals from interfering with each other. Evaluated across five benchmarks with four Qwen-series models (2B–7B), CORA improves both task performance and reasoning trace reliability simultaneously.

██████████ 0.9 reasoning-reliability Preprint

Trust but Verify: Mitigating Medical Hallucinations via Post-Hoc Adversarial Auditing and Multi-Agent Feedback Loops

Using a 103-question adversarial dataset where correct historical answers reference banned or withdrawn pharmaceuticals, this study shows that all tested LLM families reliably hallucinate clinically dangerous responses that match their training data rather than current medical standards — and that proprietary models may actually regress on safety as capability scales. A five-agent 'Trust but Verify' architecture using a shared LLM backbone with a post-hoc safety audit agent reduces the Hallucination Error Rate by approximately 53% across all tested models. The finding that capability scaling may worsen hallucination in safety-critical domains is a direct counter-argument to the assumption that better base models automatically mean safer medical AI.

██████████ 0.9 hallucination-grounding Preprint

VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification

VeriGeo generates geometry problems where the natural-language statement, diagram, algebraic constraints, and solution are provably mutually consistent, using a three-stage verification pipeline that catches and repairs invalid generations rather than discarding them. Fine-tuning Qwen2.5-VL-7B on 8,700 VeriGeo-generated examples improves performance on GeoQA, PGPS9K, and MathVista-GPS benchmarks, demonstrating that verified synthetic data quality matters more than raw quantity. The dual-agent Author/Solver framework with executable action sequences is a reusable pattern for generating grounded training data in any domain where consistency between modalities can be formally checked.

██████████ 0.9 reasoning-reliability Preprint

From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails

This paper shows that LLM-based guardrails — the components designed to block jailbreaks and prompt injection in agentic pipelines — can themselves be weaponized: crafted natural-language payloads trap them in extended reasoning loops, causing denial-of-service rather than protection. Attack payloads optimized on a single open-source surrogate model (TS-Guard-8B) transfer successfully across eight leading closed and open-source model backbones including Claude, GPT, Gemini, and DeepSeek. The finding that the safety mechanism becomes the attack surface inverts the standard security model and has direct implications for any production deployment using reasoning-heavy guardrails.

██████████ 0.9 alignment-safety Preprint

CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

CacheRL trains a 4B-parameter model to perform multi-step tool-calling across 1,185 unique tools by replacing live tool execution during reinforcement learning with a three-tier fuzzy cache, eliminating the primary computational bottleneck in agent RL training. The resulting model achieves 92% process accuracy, approaching GPT-5's 94%, while requiring a reported ~100× less compute. Its hybrid reward dynamically weights outcome-based and process-based signals depending on cache confidence. The efficiency gain is significant, but the reliance on GPT-5 API access for trajectory generation means the full training pipeline cannot be reproduced without equivalent proprietary access.

██████████ 0.9 agent-tool-use Preprint
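
The CacheRL summary does not say what the three cache tiers are or how the reward weighting is computed, so the sketch below fills both in with assumptions: tiers of exact match, normalized-argument match, and fuzzy string match, plus a linear blend that leans on the outcome signal when cache confidence is high. `TieredToolCache`, `hybrid_reward`, and the confidence constants are hypothetical, not the paper's design.

```python
import json
from difflib import SequenceMatcher


def _normalize(args):
    """Canonical string form of tool arguments (lowercased, trimmed, sorted)."""
    norm = {str(k).lower(): str(v).strip().lower() for k, v in args.items()}
    return json.dumps(norm, sort_keys=True)


class TieredToolCache:
    """Returns (cached_output, confidence) instead of calling the live tool."""

    def __init__(self, fuzzy_threshold=0.9):
        self.entries = []  # list of (tool, args, output)
        self.fuzzy_threshold = fuzzy_threshold

    def put(self, tool, args, output):
        self.entries.append((tool, dict(args), output))

    def get(self, tool, args):
        same_tool = [(a, out) for t, a, out in self.entries if t == tool]
        # Tier 1 (assumed): exact argument match, full confidence.
        for cached_args, out in same_tool:
            if cached_args == args:
                return out, 1.0
        # Tier 2 (assumed): match after normalization, slightly lower confidence.
        target = _normalize(args)
        for cached_args, out in same_tool:
            if _normalize(cached_args) == target:
                return out, 0.9
        # Tier 3 (assumed): fuzzy similarity on the normalized argument strings.
        best_out, best_ratio = None, 0.0
        for cached_args, out in same_tool:
            ratio = SequenceMatcher(None, target, _normalize(cached_args)).ratio()
            if ratio > best_ratio:
                best_out, best_ratio = out, ratio
        if best_ratio >= self.fuzzy_threshold:
            return best_out, 0.8 * best_ratio  # arbitrary discount for fuzzy hits
        return None, 0.0  # cache miss


def hybrid_reward(outcome, process, confidence):
    """Assumed blend: trust the outcome signal only as far as the cache hit."""
    return confidence * outcome + (1.0 - confidence) * process


cache = TieredToolCache()
cache.put("weather", {"city": "Paris"}, "sunny")
print(cache.get("weather", {"city": " paris "}))
```

The design choice worth noting is that a fuzzy hit returns a plausible but unverified tool output, so shifting weight toward the process signal limits how much a wrong cached result can distort training.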

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

OmniVideo-100K constructs 100,000 instruction-tuning examples for audio-visual video reasoning using entity-anchored scripting — a method that preserves cross-segment referential consistency by tracking entities across video segments before generating questions — addressing the problem that most existing datasets treat audio and visual streams as independent inputs. A 505-sample human-verified test set covering 10 audio-visual tasks shows that fine-tuning on OmniVideo-100K improves performance on multiple established benchmarks. Code and data are publicly released, making this a directly usable resource for teams working on multimodal understanding.

██████████ 0.8 multimodal-understanding Preprint

🔬 Roadblock Activity

Roadblock	Papers	Status	Signal
Hallucination & Grounding	110	Active	High activity day with two substantive contributions: ClinHallu provides stage-level diagnostic decomposition of medical hallucinations with a public benchmark, while the Trust but Verify study shows that capability scaling in proprietary models may worsen hallucination rates in safety-critical clinical contexts.
Agent Tool Use & Reliability	68	Active	SIMMER and GauntletBench independently benchmark frontier agents at 17–19% success on realistic tasks, SkillMutator exposes near-total failure of skill security scanners, and From Shield to Target reveals that guardrails themselves are a new DoS attack surface — a convergent signal that agent reliability remains deeply unsolved.
Reasoning Reliability	96	Active	CORA quantifies and reduces thinking-answer inconsistency in multimodal RLVR training, while VeriGeo demonstrates that formally verified synthetic geometry data improves downstream reasoning benchmarks, together suggesting that verification-grounded data pipelines are becoming a tractable path to more reliable reasoning.
Alignment & Safety	98	Active	From Shield to Target introduces a novel attack class where reasoning-heavy guardrails are denial-of-service targets, while the automotive LLM safety appraisal highlights that ISO compliance frameworks are not yet compatible with LLM latency and alignment requirements in real-time safety-critical systems.
Multimodal Understanding	104	Active	OmniVideo-100K addresses the largely ignored problem of cross-modal audio-visual consistency in long video understanding, with a public release that gives the community a concrete training and evaluation resource for this gap.
Interpretability	100	Active	Moderate activity with the Verifiable User Simulation tutorial proposing a design-and-audit framework for LLM-based user simulators, and the Raman spectroscopy ML survey touching interpretability needs in high-dimensional biomedical signal analysis — neither delivers new empirical results today.
Data Quality & Curation	114	Active	Highest-volume roadblock today, with VeriGeo and OmniVideo-100K both contributing verified or structured synthetic datasets, suggesting the field is shifting toward formal verification and structured pipelines as a quality assurance mechanism rather than relying on scale alone.
Efficiency & Scaling	64	Active	CacheRL's three-tier fuzzy cache approach achieves a claimed 100× compute reduction for agent RL training by eliminating live tool execution, but the theoretical canonical-code paper's 30–500× efficiency claims are explicitly self-labeled as hypotheses rather than measured results — a meaningful contrast in evidential quality.
Embodied AI	40	Active	Hy-Embodied-0.5-VLA presents a full end-to-end robot learning stack from data collection to real-world deployment, while the UAV autonomy paper shows that selective agentic invocation via a learned gate raises recovery success from 5% to 95% in hard scenarios — both pointing toward hybrid local/remote reasoning as a practical deployment pattern.
Long-Context Understanding	29	Active	StreamMemBench finds that current agent memory systems frequently fail to use stored evidence even when it is present, and a plausible cross-paper connection suggests that floor-plan-style compact spatial representations could generalize to structured context compression for long documents.
Catastrophic Forgetting	1	Low	Minimal activity today with only one paper touching this roadblock — no meaningful signal to report.

DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io
