
[Artificial Intelligence] Daily digest — 277 papers, 1 strong connection (2026-04-19)

DeepScience — Artificial Intelligence
Artificial Intelligence · Daily Digest
April 19, 2026
277 Papers · 10/10 Roadblocks Active · 3 Connections
⚡ Signal of the Day
• Agent architecture is the dominant empirical story today, with multiple papers showing concrete performance gains from tool-orchestration, process reward models, and policy-switching data synthesis rather than raw model scaling.
• A recurring mechanistic finding across several independent papers is that models form early commitments and fail to revise them — whether in spatial reasoning, multimodal QA, or abstention decisions — suggesting answer inertia is a structural property of current transformer training, not a dataset artifact.
• Watch the intersection of process reward models and test-time scaling: SWE-TRACE demonstrates that a rubric-based PRM trained for RL can be reused at inference to prune action candidates, which suggests a dual-use architectural pattern worth tracking across other agentic domains.
📄 Top 10 Papers
RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography
RadAgent breaks CT scan interpretation into discrete steps, routes each to one of 10 specialized analysis tools guided by a clinician-reviewed checklist, and is trained end-to-end with reinforcement learning using a composite reward. Compared to a direct-prediction 3D vision-language model baseline, it improves macro-F1 by 6 points (36% relative) and adversarial robustness by 24.7 points (42% relative), while achieving measurable output faithfulness on a metric where the baseline scores zero. This matters because it demonstrates that tool decomposition — rather than larger models — can simultaneously improve accuracy, robustness, and interpretability in high-stakes medical AI.
█████████ 0.9 reasoning-reliability Preprint
Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
Across 18 vision-language models and three public benchmarks, confidence tracking over chain-of-thought steps reveals that models commit to an answer early and reinforce it rather than revising it — a pattern the authors call answer inertia. Even reasoning-trained models are consistently steered by misleading textual cues when visual evidence alone is sufficient to answer correctly. This is a safety-relevant finding because models can produce reasoning traces that look corrective while actually following spurious correlations established in their first few tokens.
█████████ 0.9 reasoning-reliability Preprint
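The confidence-tracking idea behind answer inertia can be sketched in a few lines. The threshold, the notion of "early," and the detection rule below are illustrative assumptions, not the paper's definitions:

```python
# Hypothetical sketch of answer-inertia detection: track confidence in the
# eventual final answer at each chain-of-thought step, and flag traces where
# confidence locks in early and is only reinforced afterwards.

def commitment_step(confidences, threshold=0.7):
    """First step at which confidence crosses `threshold` and never falls back."""
    for i, c in enumerate(confidences):
        if c >= threshold and all(x >= threshold for x in confidences[i:]):
            return i
    return None  # the model never stably committed

def shows_answer_inertia(confidences, early_fraction=0.25):
    """Inertia: stable commitment within the first quarter of the trace."""
    step = commitment_step(confidences)
    return step is not None and step < len(confidences) * early_fraction

# A trace that commits at step 1 of 8 and only reinforces afterwards:
print(shows_answer_inertia([0.4, 0.75, 0.8, 0.82, 0.85, 0.9, 0.92, 0.95]))
```

A trace that only becomes confident in its final steps would not be flagged, which is the distinction the paper's per-step analysis is designed to surface.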
Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems
MM-AQA is a 2,079-sample benchmark built by transforming answerable questions into unanswerable ones along two axes — whether the visual modality is necessary, and whether sufficient evidence exists — to test whether vision-language models know when to abstain. Under standard prompting, frontier VLMs almost never abstain, and simple confidence baselines outperform them on abstention decisions; multi-agent systems improve abstention but trade off answer accuracy. The practical implication is that deployed multimodal AI systems currently lack reliable self-knowledge about their own uncertainty, which is a prerequisite for safe deployment.
█████████ 0.9 hallucination-grounding Preprint
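The simple confidence baseline that outperforms VLMs' own abstention decisions is plausibly of the following shape; the threshold and scoring are illustrative assumptions, not the benchmark's exact baseline:

```python
import math

# Hedged sketch of a confidence-threshold abstention baseline: answer only
# when the top option's softmax probability clears a threshold, else abstain.

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def answer_or_abstain(option_logits, tau=0.6):
    probs = softmax(option_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= tau else None  # None = abstain

print(answer_or_abstain([4.0, 1.0, 0.5]))  # confident: answers option 0
print(answer_or_abstain([1.1, 1.0, 0.9]))  # near-uniform: abstains (None)
```

That a fixed threshold on output probabilities beats prompted self-assessment is precisely what makes the paper's finding about missing self-knowledge pointed.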
SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling
SWE-TRACE addresses the difficulty of training software engineering agents on long-horizon tasks in three stages: filtering 140K candidate trajectories down to 60K token-efficient examples via LLM-guided oracle verification; training with reinforcement learning using a rubric-based process reward model (PRM) that provides dense step-level feedback rather than sparse outcome signals; and, critically, reusing the trained PRM at inference time to prune low-value action candidates, converting training infrastructure into a test-time scaling mechanism. This dual-use pattern — where the same reward model improves both training stability and inference efficiency — is a generalizable architectural insight for agentic RL.
█████████ 0.9 agent-tool-use Preprint
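The dual-use PRM pattern reduces, at inference time, to scoring candidate actions and keeping the top few. The sketch below is a minimal illustration; `prm_score` is a toy stand-in, not the paper's trained rubric-based PRM:

```python
# Hedged sketch of PRM-based test-time pruning: the same step-level reward
# model used for dense RL feedback ranks candidate actions at inference,
# and only the k highest-scoring candidates survive.

def prune_candidates(state, candidates, prm_score, k=2):
    """Keep the k candidate actions the PRM rates highest in this state."""
    ranked = sorted(candidates, key=lambda a: prm_score(state, a), reverse=True)
    return ranked[:k]

# Toy PRM that happens to prefer shorter action descriptions (illustration only):
toy_prm = lambda state, action: -len(action)
candidates = ["run the failing test", "apply patch", "refactor the entire module first"]
print(prune_candidates("test failure in parser", candidates, toy_prm))
```

The appeal of the pattern is that the pruning step adds no new trained component: the reward model already exists as training infrastructure.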
OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
OpenMobile generates training data for mobile UI agents by first building a global memory map of the environment through exploration, then synthesizing trajectories using a policy-switching strategy that alternates between the learner model and an expert — deliberately capturing error-recovery behavior that pure imitation learning misses. Agents trained on this data reach 51.7% (Qwen2.5-VL) and 64.7% (Qwen3-VL) success rates on AndroidWorld, with gains attributed to broad functional coverage rather than benchmark-specific tuning. The error-recovery data collection mechanism is a practical solution to a known gap in behavior cloning for GUI agents.
████████ 0.8 agent-tool-use Preprint
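The policy-switching collection loop can be sketched as follows. The `env`, `learner`, and `expert` interfaces and the switching probability are stand-in assumptions, not OpenMobile's actual APIs:

```python
import random

# Hypothetical sketch of policy-switching trajectory synthesis: the learner
# acts some of the time so its mistakes land in the data, and the expert then
# acts from those states, so recorded trajectories capture expert recovery
# from learner-induced errors — the data pure imitation learning misses.

def synthesize_trajectory(env, learner, expert, steps=10, p_learner=0.3, seed=0):
    rng = random.Random(seed)
    state, trajectory = env.reset(), []
    for _ in range(steps):
        actor_is_expert = rng.random() >= p_learner
        action = (expert if actor_is_expert else learner)(state)
        trajectory.append((state, action, actor_is_expert))
        state = env.step(action)
    return trajectory

class ToyEnv:  # trivial integer-state environment for illustration
    def reset(self): return 0
    def step(self, action): return action

traj = synthesize_trajectory(ToyEnv(), learner=lambda s: s - 1, expert=lambda s: s + 1)
```

Because the expert's corrective actions are recorded from states the learner actually reaches, the resulting dataset covers the error-recovery distribution directly.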
How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study
VRUBench is a text-only benchmark requiring models to predict what they would observe after a sequence of rotational steps — a task humans solve with 100% accuracy but on which both LLMs and VLMs fail substantially. Layer-wise probing shows models do encode viewpoint position information in hidden states, but cannot bind that position to the corresponding observation content; hallucination emerges specifically in the final layers when processing these spatial tasks. The targeted finding — that the bottleneck is positional binding rather than spatial encoding — suggests that fine-tuning only the identified attention heads responsible for this binding is more effective than full fine-tuning.
████████ 0.8 interpretability Preprint
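The task family itself is simple enough to state in code; the scene layout below is invented for illustration. The loop corresponds to the heading encoding the paper finds models do have, and the final lookup is the position-to-observation binding they fail:

```python
# Minimal sketch of a text-only viewpoint-rotation task of the kind VRUBench
# poses: compose a sequence of turns mod 360, then report what lies in the
# final direction.

SCENE = {0: "door", 90: "window", 180: "bookshelf", 270: "lamp"}

def observe_after(turns_degrees, heading=0):
    """Compose rotations, then bind the resulting heading to an observation."""
    for turn in turns_degrees:
        heading = (heading + turn) % 360
    return SCENE[heading]

print(observe_after([90, 90, -90]))  # net +90 degrees
```

Humans solve instances like this at ceiling, which is what makes the models' failure on the binding step, rather than the arithmetic, diagnostically useful.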
Feedback-Driven Execution for LLM-Based Binary Analysis
FORGE applies a Dynamic Forest of Agents architecture to binary vulnerability analysis, decomposing tasks across parallel agents while bounding each agent's context window to prevent the performance degradation that occurs when long analysis traces overwhelm a single LLM context. Reasoning-action-observation feedback loops let agents incrementally build evidence across multiple tool calls rather than attempting single-pass analysis. The system identifies 1,274 vulnerabilities across 591 unique binaries at 72.3% precision, demonstrating that context management through task decomposition is a practical lever for agentic reliability in technically complex domains.
████████ 0.8 agent-tool-use Preprint
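The context-bounding idea can be sketched as a sliding budget over recent observations. Word counting stands in for real tokenization here, and the function is an assumption about the mechanism, not FORGE's implementation:

```python
# Hedged sketch of bounded per-agent context: each agent retains only the
# newest observations that fit a fixed budget, so long analysis traces cannot
# overwhelm a single context window.

def bounded_history(events, budget=6):
    """Keep the most recent events whose combined word count fits the budget."""
    kept, used = [], 0
    for event in reversed(events):  # walk newest-first
        cost = len(event.split())
        if used + cost > budget:
            break
        kept.append(event)
        used += cost
    return list(reversed(kept))  # restore chronological order

print(bounded_history(["a b c", "d e", "f g h i"]))
```

Dropping the oldest evidence is safe in this architecture precisely because decomposition keeps each agent's task narrow enough that recent observations suffice.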
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
Standard Vision-Language-Action models predict actions directly, but planning over long action-space horizons is intractable because the probability of producing a feasible trajectory decays exponentially with horizon length. The World-Value-Action (WVA) model instead learns structured latent representations of future trajectories conditioned on visual observations and language instructions, then uses latent-space inference to reshape the search distribution toward feasible regions before committing to actions. This implicit planning mechanism addresses a fundamental scaling limitation of direct action prediction for robotic tasks requiring extended decision sequences.
████████ 0.8 embodied-ai Preprint
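The exponential decay motivating WVA is easy to see with back-of-envelope arithmetic. The per-step feasibility value below is illustrative, not from the paper:

```python
# If each directly predicted action is independently feasible with
# probability p, an entire H-step trajectory is feasible with probability
# p**H — which collapses quickly as the horizon grows.

p = 0.9
for horizon in (5, 20, 50):
    print(horizon, round(p ** horizon, 4))
```

Even with 90% per-step feasibility, a 50-step trajectory succeeds well under 1% of the time, which is why reshaping the search distribution toward feasible regions before acting pays off at long horizons.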
Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC
This paper presents the first system where LLM agents iteratively rewrite the source code of ABC — a widely used logic synthesis tool in chip design — operating across the full integrated codebase while preserving the single-binary execution model. Through continuous feedback loops comparing synthesized circuit quality before and after code modifications, the system discovers optimizations that exceed human-designed heuristics. This is a concrete demonstration that LLM agents can improve specialized engineering software autonomously, with implications for how AI might accelerate tool development in other technically constrained domains.
████████ 0.8 agent-tool-use Preprint
Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications
The Data Intelligence Layer (DIL) treats LLMs, the web, and users as first-class queryable data sources alongside traditional relational databases, defining query interfaces for each and allowing a data planner to transform natural-language user requests into executable query plans that unify relational operators with multi-modal retrieval. The architecture addresses the practical limitation that real user information needs typically require iterative, multi-source access patterns that no single SQL query can satisfy. While the paper is a system demonstration without controlled benchmarks, the unified query planning abstraction is a substantive architectural contribution to multi-source AI application design.
████████ 0.8 multimodal-understanding Preprint
🔬 Roadblock Activity
Efficiency and Scaling (119 papers, Active): Process reward model reuse for test-time scaling (SWE-TRACE) is the day's strongest concrete signal, suggesting training infrastructure can double as inference-time pruning without additional overhead.
Reasoning Reliability (117 papers, Active): Answer inertia — early commitment reinforced rather than revised during chain-of-thought — emerges across multiple independent studies as a structural failure mode in both language and vision-language models.
Agent Tool Use (81 papers, Active): A strong empirical day, with RadAgent (medical CT), FORGE (binary analysis), OpenMobile (GUI automation), and the self-evolving EDA tool all demonstrating that tool decomposition and feedback loops outperform monolithic model approaches.
Interpretability (79 papers, Active): Mechanistic work on viewpoint rotation localizes a specific binding failure between positional encoding and observation content, showing that targeted attention-head fine-tuning can address spatially specific reasoning deficits.
Alignment and Safety (76 papers, Active): The answer-inertia finding in VLMs provides concrete mechanistic evidence that models can produce superficially corrective reasoning traces while actually following misleading priors, directly relevant to deceptive-alignment detection.
Multimodal Understanding (69 papers, Active): Modality reliance failures dominate today — models default to text cues even when visual evidence is sufficient, and abstention in multimodal settings is nearly absent under standard prompting conditions.
Hallucination and Grounding (60 papers, Active): Hallucination is localized to final transformer layers in spatial reasoning tasks, and VLMs almost never self-identify unanswerable questions — two independent convergent findings pointing to output-layer overconfidence as a mechanism.
Data Quality and Curation (36 papers, Active): OpenMobile's policy-switching trajectory collection addresses the known gap in imitation learning where expert-only data misses error recovery, a practical curation insight applicable beyond mobile agents.
Embodied AI (25 papers, Active): The World-Value-Action model's latent-space planning approach directly addresses the exponential feasibility decay problem in long-horizon action prediction, offering a principled alternative to direct action generation.
Long Context (22 papers, Active): Context management through task decomposition (FORGE's bounded per-agent contexts) is the practical engineering solution gaining traction, alongside the theoretical SSM linear-complexity alternative surfaced in today's connection analysis.
View Full Analysis
DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io