
[Artificial Intelligence] Daily digest — 269 papers, 0 strong connections (2026-04-12)

DeepScience — Artificial Intelligence · Daily Digest
April 12, 2026
269 papers · 10/10 roadblocks active · 0 connections
⚡ Signal of the Day
• A structural gap between visual perception and reasoning in multimodal AI models is emerging as a concrete, measurable phenomenon — not just a vague concern.
• Multiple independent papers today converge on the same finding from different angles: models can see correctly, retrieve information correctly, and still fail to reason correctly, suggesting the reasoning bottleneck is architectural rather than purely a data or scale problem.
• Watch the intersection of routing/expert-allocation mechanisms and reasoning quality in mixture-of-experts models — today's papers suggest this is where the next generation of fixes will be targeted.
📄 Top 10 Papers
Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts
Multimodal AI models built on mixture-of-experts architectures can correctly describe what is in an image yet fail to solve a math problem about it, even when they answer the same problem correctly given only text — the authors show this happens because visual inputs are routed to different internal 'expert' modules than text inputs, starving the reasoning-specialist experts of visual signal. By tracking which experts activate layer by layer with Jensen-Shannon divergence, the team locates the split in the middle layers, where domain-reasoning experts concentrate, and proposes an inference-time routing correction that improves performance across six benchmarks. This matters because it reframes a persistent VLM weakness as a fixable routing-engineering problem rather than a fundamental limitation.
█████████ 0.9 multimodal-understanding Preprint
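The layer-wise diagnostic can be illustrated with a small sketch: given the probability that each expert fires at some layer for a text-only versus an image+text version of the same problem, the Jensen-Shannon divergence between the two routing distributions quantifies how far apart the processing pathways are. The four-expert distributions below are hypothetical, not the paper's data:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical expert-routing distributions at one layer of a 4-expert model
# for the same problem posed as text-only vs. image+text (each sums to 1).
text_routing  = {"layer_12": [0.70, 0.10, 0.10, 0.10]}
image_routing = {"layer_12": [0.10, 0.10, 0.70, 0.10]}

divergence = {
    layer: js_divergence(text_routing[layer], image_routing[layer])
    for layer in text_routing
}
# A large mid-layer divergence would indicate that visual tokens are routed
# away from the experts that handle textual reasoning.
```

Because JSD is bounded in [0, 1] (base 2) and symmetric, it gives a comparable per-layer score, which is what makes a layer-by-layer profile of the routing split possible.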
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
Standard reinforcement learning fine-tuning of vision-language models suffers from gradient instability across tasks because reward scales differ wildly between domains; this paper replaces the usual linear reward normalization with a mathematically principled mapping (via 1D Optimal Transport) that forces all task reward distributions to converge to a standard normal, ensuring no single task dominates training updates. The resulting model, built on an 8B-parameter base, is evaluated across 18 benchmarks spanning 6 visual task categories and also introduces response-length shaping to distinguish when extended reasoning is warranted versus when a direct answer suffices. The mechanism is notable because it provides a theoretically grounded remedy to a known but poorly solved multi-task RL instability problem.
█████████ 0.9 reasoning-reliability Preprint
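In one dimension, the optimal transport map from an empirical sample to a standard normal is just the monotone rearrangement: sort the rewards, take each one's empirical quantile, and push it through the normal inverse CDF. A minimal sketch of that idea (the paper's exact mapping and quantile convention may differ in detail):

```python
from statistics import NormalDist

def gaussianize(rewards):
    """Map an empirical reward sample onto a standard normal via the 1-D
    optimal-transport (monotone rearrangement) plan: rank each reward,
    convert the rank to a quantile, then apply the normal inverse CDF."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    std_normal = NormalDist(0.0, 1.0)
    out = [0.0] * n
    for rank, i in enumerate(order):
        q = (rank + 0.5) / n          # midpoint quantile avoids 0 and 1
        out[i] = std_normal.inv_cdf(q)
    return out

# Two tasks with wildly different reward scales land on the same scale.
math_rewards = [0.0, 0.1, 0.2, 0.9, 1.0]
code_rewards = [-120.0, -80.0, -10.0, 30.0, 250.0]
norm_math = gaussianize(math_rewards)
norm_code = gaussianize(code_rewards)
```

Because the mapping is rank-based, any two strictly increasing reward samples of the same size end up on identical normalized values, which is exactly the property that keeps one task's reward scale from dominating the gradient updates.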
Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
Current reward models used to align AI agents are tested almost exclusively on single-turn exchanges, but real agentic tasks unfold over many steps involving tool use, error recovery, and planning — this paper constructs Plan-RewardBench, the first pairwise preference benchmark specifically for multi-step agent trajectories across four task families including safety refusal and complex planning. All three major reward model families tested (generative, discriminative, and LLM-as-judge) degrade sharply on longer trajectories, exposing a systematic blind spot in how we evaluate and train alignment mechanisms for autonomous agents. This matters because reward model quality is the linchpin of RLHF-style alignment, and a broken reward signal for agents means alignment guarantees do not transfer from chat to agentic settings.
█████████ 0.9 agent-tool-use Preprint
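The benchmark's core measurement is pairwise preference accuracy: given (chosen, rejected) trajectory pairs, how often does a reward model score the chosen trajectory strictly higher? A toy sketch, where the stand-in reward model and trajectories are purely illustrative:

```python
def pairwise_accuracy(reward_model, pairs):
    """Fraction of (chosen, rejected) trajectory pairs where the reward
    model scores the chosen trajectory strictly higher."""
    correct = sum(
        1 for chosen, rejected in pairs
        if reward_model(chosen) > reward_model(rejected)
    )
    return correct / len(pairs)

# Toy stand-in reward model that prefers shorter trajectories -- the kind
# of length bias that would degrade sharply on long-horizon agent tasks.
toy_rm = lambda trajectory: -len(trajectory)

pairs = [
    (["plan", "call_tool", "answer"], ["answer"]),    # long chosen: misranked
    (["answer"], ["plan", "wrong_tool", "give_up"]),  # short chosen: ranked right
]
acc = pairwise_accuracy(toy_rm, pairs)
```

The toy bias halves accuracy here, which mirrors the paper's broader point: a reward model can look fine on short exchanges while systematically misranking multi-step trajectories.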
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
Long-context reasoning in LLMs typically works by stuffing all context into a fixed window and retrieving statically, but MemCoT instead treats reasoning as an iterative search process with two types of short-term memory — a semantic state memory tracking what has been understood and an episodic trajectory memory tracking what has been tried — paired with a zoom-in/zoom-out evidence localization mechanism. Evaluated on LoCoMo and LongMemEval-S using GPT-4o-mini and Qwen2.5-14B as backbones, it outperforms six memory-augmented baselines including RAG and Mem0 without any additional training. The training-free, plug-and-play design is practically significant because it can be layered onto existing deployed models rather than requiring expensive retraining.
█████████ 0.9 long-context Preprint
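The two-memory design can be pictured as a small state object the controller consults before each zoom-in step. Class and field names below are our illustration, not the paper's API:

```python
class MemCoTState:
    """Sketch of MemCoT-style short-term memories: a semantic memory of
    what has been understood, an episodic memory of what has been tried."""

    def __init__(self):
        self.semantic = {}   # fact -> evidence location
        self.episodic = []   # list of (query, outcome) attempts

    def record_understanding(self, fact, evidence):
        self.semantic[fact] = evidence

    def record_attempt(self, query, outcome):
        self.episodic.append((query, outcome))

    def already_tried(self, query):
        return any(q == query for q, _ in self.episodic)

state = MemCoTState()
state.record_attempt("search: birthday of Alice", "no match in chunk 3")
state.record_understanding("Alice met Bob in 2019", "chunk 7, lines 4-6")
# Consulting both memories lets the controller avoid repeating a failed
# lookup and avoid re-deriving an already-established fact.
```

The point of the separation is that the two memories answer different control questions: the episodic log prunes the search ("don't retry this"), while the semantic store accumulates partial answers across iterations.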
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Web agents capable of autonomously completing browser-based tasks have mostly relied on large proprietary models, but MolmoWeb demonstrates that an open-weight 8B-parameter model can surpass agents built on much larger closed models like GPT-4o by combining synthetic trajectory generation from multiple complementary data pipelines with instruction-conditioned visual-language action policies. Test-time scaling via parallel rollouts (best-of-4 selection) lifts pass rates from 78.2% to 94.7% on WebVoyager and from 35.3% to 60.5% on Online-Mind2Web, showing that inference-time compute is a meaningful lever even for smaller open models. The open-weight release is notable for the research community because it enables studying and improving web agent behavior without dependence on API access.
█████████ 0.9 agent-tool-use Preprint
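Best-of-N selection itself is only a few lines: launch N independent rollouts for one task and keep the highest-scoring one. In this sketch the rollout and scoring functions are stand-ins; the paper's actual selection criterion may differ:

```python
import itertools

def best_of_n(run_rollout, score, task, n=4):
    """Run n independent rollouts of one task and keep the best-scoring
    one -- the best-of-4 pattern behind the test-time scaling result."""
    rollouts = [run_rollout(task) for _ in range(n)]
    return max(rollouts, key=score)

# Toy example: each rollout yields an (answer, confidence) pair, and the
# selector keeps the most confident attempt.
attempts = itertools.cycle([("fail", 0.2), ("ok", 0.9), ("fail", 0.4), ("ok", 0.7)])
result = best_of_n(lambda task: next(attempts), score=lambda r: r[1],
                   task="book flight")
```

The appeal of the pattern is that the rollouts are independent, so they parallelize trivially: the 78.2% to 94.7% lift comes from extra inference-time compute, not from touching the model weights.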
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
Agentic AI models trained with standard RL rewards to use external tools either overuse them (wasting compute and introducing errors) or underuse them (missing needed information), because a single scalar reward cannot simultaneously optimize for task accuracy and tool efficiency. HDPO (Hierarchical Decoupled Policy Optimization) separates these two objectives into independent optimization channels with conditional advantage estimation, resolving the tension without requiring manual penalty tuning. The practical consequence is an agent that can better judge when its internal knowledge is sufficient versus when a tool call is genuinely needed — a meta-cognitive capability that prior approaches could not reliably instill.
█████████ 0.9 agent-tool-use Preprint
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
Applying chain-of-thought reasoning at every navigation step in an embodied agent is expensive and often unnecessary, so HiRO-Nav uses the entropy of the action probability distribution as a real-time signal to decide when deeper reasoning is warranted — high-entropy moments (uncertainty about what to do next) trigger chain-of-thought; low-entropy moments skip it. Analysis confirms that these high-entropy decision points correlate strongly with task-critical junctures such as entering a novel scene or locating a target object, validating the selective reasoning strategy. The resulting training pipeline (supervised fine-tuning followed by two-stage online RL) achieves better success rates with fewer tokens spent on reasoning than both always-thinking and never-thinking baselines on the CHORES-S ObjectNav benchmark.
█████████ 0.9 embodied-ai Preprint
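The entropy gate is simple to state: compute the Shannon entropy of the policy's action distribution and invoke chain-of-thought only when it exceeds a threshold. A sketch with an illustrative threshold (HiRO-Nav's actual criterion and tuned value are its own):

```python
import math

def action_entropy(probs):
    """Shannon entropy (in nats) of the agent's action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_reason(probs, threshold=0.8):
    """Trigger chain-of-thought only when the policy is uncertain.
    The 0.8-nat threshold is illustrative, not the paper's value."""
    return action_entropy(probs) > threshold

confident = [0.95, 0.02, 0.02, 0.01]   # mid-hallway: keep moving, skip CoT
uncertain = [0.30, 0.25, 0.25, 0.20]   # entering a new room: stop and reason
```

This is the whole compute-saving mechanism in miniature: the low-entropy distribution stays under the threshold and skips reasoning, while the near-uniform one crosses it and pays for a chain-of-thought step.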
Less Approximates More: Harmonizing Performance and Confidence Faithfulness via Hybrid Post-Training for High-Stakes Tasks
LLMs trained purely for accuracy on high-stakes domains (cybersecurity, finance, medicine) tend to become overconfident in wrong answers, while models trained to express calibrated uncertainty sacrifice accuracy — HyTuning combines reinforcement learning from internal feedback with reasoning distillation using an adaptive weighting signal called Progressive Reasoning Gain (PRG), which measures whether each reasoning step monotonically increases the model's confidence toward the final answer. Tasks where reasoning genuinely builds toward the answer receive more distillation weight; tasks where reasoning is decorative receive more RL signal. The framework reduces the failure mode where models produce confidently wrong outputs with seemingly coherent reasoning chains — a practical concern in any deployment where calibration matters as much as correctness.
████████ 0.8 reasoning-reliability Preprint
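One way to picture a PRG-style signal: score a reasoning chain by how much of its total confidence movement is upward toward the final answer. The formula below is our illustration of the idea; the paper defines Progressive Reasoning Gain precisely:

```python
def progressive_reasoning_gain(confidences):
    """Illustrative PRG-style score: fraction of total confidence movement
    across reasoning steps that is upward toward the final answer
    (this normalization is our assumption, not the paper's formula)."""
    if len(confidences) < 2:
        return 0.0
    gains = [b - a for a, b in zip(confidences, confidences[1:])]
    upward = sum(g for g in gains if g > 0)
    total = sum(abs(g) for g in gains) or 1.0
    return upward / total

# Confidence in the final answer after each reasoning step:
genuine    = progressive_reasoning_gain([0.2, 0.4, 0.6, 0.9])  # builds steadily
decorative = progressive_reasoning_gain([0.5, 0.8, 0.3, 0.5])  # wanders
```

Under the adaptive weighting described above, a chain like the first would receive more distillation weight (the reasoning genuinely builds), while one like the second would lean more on the RL signal.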
Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
Computer-use agents that autonomously interact with GUIs can be manipulated by adversarial image patches (constrained to pixel perturbations invisible to humans at L∞ ≤ 8/255) that concentrate the model's visual attention onto a chosen target — in a simulated online shopping scenario, the attack redirects the agent to select an adversary-chosen product instead of the user's intended choice. Unlike text-based prompt injection, this attack targets the internal attention routing of the vision encoder rather than the language decoder, and it transfers to fine-tuned variants of the same base model under black-box conditions. This is practically concerning because it demonstrates a visual attack surface for autonomous agents that current safety evaluations largely ignore.
████████ 0.8 alignment-safety Preprint
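The L∞ ≤ 8/255 constraint means that at every attack step the perturbation is projected back into a ±8/255 band per pixel channel, which is what keeps the patch imperceptible. A minimal sketch of that projection (the attack's optimization loop is not shown):

```python
def project_linf(delta, epsilon=8 / 255):
    """Clip each component of a perturbation into [-epsilon, epsilon],
    enforcing the L-infinity budget used by the attack."""
    return [max(-epsilon, min(epsilon, d)) for d in delta]

# A raw gradient step may overshoot the budget; projection reins it in.
raw_delta = [0.5, -0.2, 0.01, -0.005]
clipped = project_linf(raw_delta)
# After projection, no component exceeds 8/255 (about 0.031) in magnitude.
```

In a full attack this projection runs after every gradient update on the patch, so the adversarial optimization explores only perturbations a human viewer cannot distinguish from the clean image.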
ClawBench: Can AI Agents Complete Everyday Online Tasks?
Despite rapid progress in web agent benchmarks, this paper tests seven frontier models on 153 real-world everyday online tasks (not sandboxed replicas) using a lightweight request-interception layer that safely captures but blocks final submissions, finding that even the best model tested (Claude Sonnet 4.6) succeeds only 33.3% of the time. The gap between benchmark performance and real-world performance is traced to specific capability deficits: reading user-supplied documents, handling multi-step workflows, and filling complex forms — tasks that static offline benchmarks systematically underrepresent. The result is a concrete quantification of how far current agents are from being reliably useful for the mundane daily tasks that would constitute most real-world deployment value.
████████ 0.8 agent-tool-use Preprint
🔬 Roadblock Activity
Roadblock · Papers · Status · Signal
Efficiency and Scaling 116 Active Highest paper volume of the day; HiRO-Nav's entropy-based selective reasoning and MolmoWeb's best-of-N test-time scaling both demonstrate that inference-time compute allocation strategies are now as important as model scale itself.
Multimodal Understanding 111 Active The 'Seeing but Not Thinking' finding — that expert routing in MoE models structurally separates visual and reasoning processing — is the sharpest mechanistic insight of the day for this roadblock.
Reasoning Reliability 105 Active Multiple independent approaches (entropy-based selective CoT in HiRO-Nav, PRG-weighted hybrid training in HyTuning, Gaussian reward normalization in OpenVLThinkerV2) converge on the need for reasoning to be adaptive and calibrated rather than uniform.
Interpretability 71 Active Layer-wise expert routing analysis in 'Seeing but Not Thinking' provides a rare mechanistic explanation for a known VLM failure mode, turning an empirical observation into a structural diagnosis.
Agent Tool Use 64 Active ClawBench's 33.3% real-world success ceiling and Plan-RewardBench's demonstration that reward models break on long trajectories together define the two most pressing unsolved problems in agent deployment today.
Hallucination and Grounding 63 Active Entropy-Gradient Grounding offers a training-free mechanism that uses a model's own output uncertainty to identify which visual regions it is actually relying on, potentially useful as a lightweight hallucination diagnostic.
Alignment and Safety 53 Active The PRAC adversarial patch attack and Plan-RewardBench's reward model failures both highlight that current alignment techniques were designed for chat-style interactions and do not transfer cleanly to agentic or multimodal settings.
Data Quality and Curation 45 Active EditCaption's finding that over 47% of instructions synthesized by strong VLMs contain critical errors is a useful calibration point for any pipeline that uses LLMs to automatically generate training data.
Embodied AI 27 Active HiRO-Nav and the Visually-grounded Humanoid Agents paper both push toward agents that act on selective, uncertainty-aware reasoning rather than exhaustive inference, reflecting a maturing understanding of compute-action tradeoffs in physical settings.
Long Context 23 Active MemCoT reframes long-context reasoning as iterative stateful search rather than static retrieval, achieving gains without retraining by adding structured memory components on top of existing models.
DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io