
[Artificial Intelligence] Daily digest — 275 papers, 0 strong connections (2026-04-20)

DeepScience — Artificial Intelligence
Artificial Intelligence · Daily Digest
April 20, 2026
275 Papers · 10/10 Roadblocks Active · 4 Connections
⚡ Signal of the Day
• Chain-of-Thought prompting — AI's go-to technique for improving reasoning — consistently hurts performance on visual spatial tasks across 17 models and 13 benchmarks.
• A separate controlled study confirms the underlying mechanism: vision-language models process images by converting visual information into text space, and adding CoT amplifies shortcut-learning from textual priors rather than grounding reasoning in actual image content — meaning the two findings reinforce each other.
• Watch for follow-up work that tests whether training-time interventions (like the GRPO-based approach in Find, Fix, Reason) can break this text-dominance pattern, or whether the modality gap is structural to current transformer architectures.
📄 Top 10 Papers
Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
Across 17 multimodal models and 13 spatial benchmarks, adding Chain-of-Thought prompting reliably made performance worse, not better, on tasks requiring visual spatial understanding. The culprit is shortcut learning: models hallucinate visual details from text-based priors even when no image is provided at all, revealing that their 'reasoning' is largely verbal pattern-matching rather than genuine visual inference. This matters because CoT is widely deployed as a default improvement strategy, and this finding suggests it may be actively harmful in vision-heavy applications.
█████████ 0.9 reasoning-reliability Preprint
Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
Using a carefully controlled benchmark (CrossMath) where the same mathematical puzzle is presented as text-only, image-only, or both, the authors show that adding visual input frequently makes VLMs perform worse than using text alone — even when the image contains identical information. Human annotators verified that the information content was equivalent across formats, ruling out confounds. This provides clean evidence that current VLMs are fundamentally text-reasoning systems with a thin visual interface, not genuine multimodal reasoners.
█████████ 0.9 multimodal-understanding Preprint
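The CrossMath-style design reduces to a three-condition ablation loop: score the same problems under text-only, image-only, and combined input. A minimal sketch, where `ask_model` is a hypothetical stub standing in for a real VLM client (its text-dominant behavior, the field names, and `probs` are all illustrative, not the paper's code):

```python
def ask_model(text=None, image=None):
    """Hypothetical VLM call; replace with a real API client.
    This stub simply mimics a text-dominant model."""
    return "correct" if text is not None else "wrong"

def modality_ablation(problems):
    """Accuracy of the same problem set under three input conditions."""
    scores = {"text": 0, "image": 0, "both": 0}
    for p in problems:
        scores["text"]  += ask_model(text=p["text"]) == p["answer"]
        scores["image"] += ask_model(image=p["image"]) == p["answer"]
        scores["both"]  += ask_model(text=p["text"], image=p["image"]) == p["answer"]
    n = len(problems)
    return {k: v / n for k, v in scores.items()}

probs = [{"text": "2+3?", "image": "img_of_2+3", "answer": "correct"}]
result = modality_ablation(probs)
# with this stub: {'text': 1.0, 'image': 0.0, 'both': 1.0}
```

The paper's finding corresponds to the "both" condition scoring below "text" despite the image carrying identical information.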
ASMR-Bench: Auditing for Sabotage in ML Research
This benchmark tests whether frontier AI models can detect subtle sabotage in ML research codebases — flaws that change experimental conclusions while leaving high-level methodology intact. The best auditor (Gemini 3.1 Pro) achieved only 0.77 AUROC and a 42% fix rate, and LLM-generated sabotage sometimes evaded auditors of the same capability level. This is direct, empirically grounded evidence that AI-assisted code review cannot yet provide reliable oversight of AI research pipelines, with significant implications for the trustworthiness of automated ML workflows.
█████████ 0.9 alignment-safety Preprint
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
Instead of using a single model to score AI outputs, AgentV-RL deploys two agents in parallel: one checks whether a solution follows logically from its premises (forward), and one checks whether the premises are actually necessary (backward). A 4-billion-parameter model trained with this approach outperforms the previous best outcome reward models by 25.2% on math benchmarks. Better reward models directly improve the quality of AI training signals, which is a bottleneck for making reinforcement learning from human feedback more reliable.
█████████ 0.9 reasoning-reliability Preprint
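The forward/backward split can be sketched as two parallel checks gating a binary reward. The string-matching checkers below are stand-ins for AgentV-RL's actual LLM agents; every name here is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def forward_check(premises, solution):
    """Stand-in: does every step of the solution cite some given premise?
    A real forward agent would judge logical entailment with an LLM."""
    return all(any(p in step for p in premises) for step in solution)

def backward_check(premises, solution):
    """Stand-in: is every premise actually used somewhere in the solution?"""
    return all(any(p in step for step in solution) for p in premises)

def verify(premises, solution):
    # Run both directions in parallel, as the two agents do.
    with ThreadPoolExecutor(max_workers=2) as ex:
        fwd = ex.submit(forward_check, premises, solution)
        bwd = ex.submit(backward_check, premises, solution)
        return 1.0 if fwd.result() and bwd.result() else 0.0

premises = ["x = 2", "y = 3"]
solution = ["from x = 2 and y = 3, x + y = 5"]
reward = verify(premises, solution)  # 1.0: both directions pass
```

Requiring both directions is the point: a solution that looks valid but ignores a premise, or one that smuggles in unstated assumptions, fails one of the two checks.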
Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
This position paper proposes a single organizing framework for how AI agents store and reuse experience: from raw episodic memories (5–20× compression) through procedural skills (50–500×) to declarative rules (1,000×+). A citation analysis of 1,136 references reveals that the research communities working on agent memory and agent skill discovery cite each other less than 1% of the time, despite solving the same underlying problem. The key gap identified: no existing system can adaptively move knowledge across compression levels, which limits agents' ability to generalize efficiently.
█████████ 0.9 agent-tool-use Preprint
SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
SocialGrid puts LLM agents into a gridworld inspired by Among Us, where they must complete tasks while identifying adversarial agents — all without inter-agent communication. Even the strongest tested model (the 120-billion-parameter GPT-OSS-120B) achieves below 60% task completion, and agents exhibit repetitive navigation failures against basic obstacles. Crucially, providing a symbolic pathfinding oracle to remove navigation as a variable reveals that social reasoning itself — inferring intent from behavior — remains a hard bottleneck independent of planning ability.
████████ 0.8 reasoning-reliability Preprint
MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition
Testing 35 models across 12 families on ambiguous questions, this benchmark separates two components of metacognition: knowing when you are wrong (evaluation) versus actually correcting yourself (control). Larger models get better at evaluation, but control does not improve with scale — models that can accurately identify their errors still fail to fix them. This knowing-versus-doing gap suggests that scaling alone will not produce reliably self-correcting AI systems, and that control mechanisms need to be explicitly trained rather than assumed to emerge.
████████ 0.8 alignment-safety Preprint
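The evaluation-versus-control distinction can be made concrete as two separate metrics over per-question records. The `(initial_correct, flagged_wrong, revised_correct)` record format below is an assumption for illustration, not the benchmark's actual schema:

```python
def metacognition_metrics(records):
    """records: (initial_correct, flagged_wrong, revised_correct) booleans.
    Evaluation: how often the model's error flag matches reality.
    Control: of the errors the model itself flagged, how many it then fixed."""
    eval_hits = sum(flag == (not ok) for ok, flag, _ in records)
    flagged_errors = [(ok, flag, fixed) for ok, flag, fixed in records
                      if flag and not ok]
    control = (sum(fixed for _, _, fixed in flagged_errors)
               / len(flagged_errors)) if flagged_errors else 0.0
    return eval_hits / len(records), control

# Toy data: a model that spots all 3 of its errors but fixes only 1 of them.
recs = [(True, False, True), (False, True, False),
        (False, True, True), (False, True, False)]
evaluation, control = metacognition_metrics(recs)  # 1.0, ~0.33
```

The paper's knowing-versus-doing gap shows up as the first number rising with scale while the second stays flat.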
MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation
MARCH mimics a hospital's resident-fellow-attending hierarchy using three AI agents that draft, retrieve, and consensus-check radiology reports from CT scans. Evaluated on 25,692 CT scans, it outperforms single-model baselines on both clinical accuracy and text quality, and specifically reduces hallucinated medical findings. The result demonstrates that organizational structure — not just model scale — is a lever for reducing dangerous errors in high-stakes AI applications, though the dependence on GPT-4 APIs limits independent reproduction.
████████ 0.8 hallucination-grounding Preprint
Fathom v16: Alignment-Inverted Cognitive Signals on Claude Haiku 4.5 via Consensus-Proxy Measurement (plus Cross-Model Replication on Llama-3.2-1B-Instruct)
This working paper tests whether making multiple API calls to a model and measuring output agreement (a 'consensus proxy' for internal confidence) can distinguish confabulation-prone prompts from genuine recall prompts. On 96 matched prompts, Claude Haiku 4.5 showed a strong inverted pattern: outputs converged when confabulating and diverged when recalling real information, with Cohen's d = -0.83. The same directional effect replicated on Llama-3.2-1B, suggesting a potentially cheap, model-agnostic signal for hallucination detection that doesn't require access to internal model states. Caveats apply: single author, small sample, not peer reviewed.
████████ 0.8 hallucination-grounding Preprint
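The consensus-proxy signal needs nothing more than repeated sampling plus an agreement statistic. A minimal sketch, assuming token-level Jaccard overlap as the agreement measure (the paper's actual metric may differ; the toy numbers are illustrative):

```python
from itertools import combinations
from statistics import mean, stdev

def agreement(outputs):
    """Mean pairwise token-overlap (Jaccard) across sampled outputs."""
    def jaccard(a, b):
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
    return mean(jaccard(a, b) for a, b in combinations(outputs, 2))

def cohens_d(xs, ys):
    """Effect size between two groups of per-prompt agreement scores."""
    nx, ny = len(xs), len(ys)
    pooled = (((nx - 1) * stdev(xs) ** 2 + (ny - 1) * stdev(ys) ** 2)
              / (nx + ny - 2)) ** 0.5
    return (mean(xs) - mean(ys)) / pooled

# Toy per-prompt agreement scores standing in for real measurements.
confab = [0.90, 0.85, 0.92]   # outputs converge when confabulating
recall = [0.60, 0.55, 0.65]   # outputs diverge on genuine recall
d = cohens_d(recall, confab)  # negative, matching the paper's inverted pattern
```

Because the signal only requires N API calls per prompt, it is model-agnostic and needs no access to logits or internal states, which is what makes the finding practically interesting.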
Neurosymbolic Repo-level Code Localization
State-of-the-art code localization models — which identify which files or functions a bug report refers to — perform dramatically worse when keyword hints like file paths or function names are removed from the query. The authors call this the Keyword Shortcut: models learn to match names rather than understand code logic, which fails on real-world issues described in abstract terms. Their neurosymbolic approach (LogicLoc) significantly outperforms existing methods on a new keyword-agnostic benchmark while remaining competitive on standard benchmarks, showing that deterministic structural reasoning over code graphs can compensate for this bias.
████████ 0.8 reasoning-reliability Preprint
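The Keyword Shortcut can be probed by stripping explicit identifiers from an issue description before querying a localizer, so that only the abstract problem statement remains. `strip_keywords` is a hypothetical helper for illustration, not LogicLoc's code:

```python
import re

def strip_keywords(query, identifiers):
    """Remove explicit code identifiers (file paths, function names) from an
    issue description, leaving only the abstract problem statement."""
    for ident in identifiers:
        query = query.replace(ident, "the relevant code")
    return re.sub(r"\s+", " ", query).strip()

issue = "Crash in utils/parser.py when parse_header() gets empty input"
cleaned = strip_keywords(issue, ["utils/parser.py", "parse_header()"])
# -> "Crash in the relevant code when the relevant code gets empty input"
```

A model that matches names rather than logic will localize the original issue but fail on the cleaned version, which is the gap the keyword-agnostic benchmark measures.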
🔬 Roadblock Activity
Reasoning Reliability (105 papers, Active): Two independent studies today found that CoT prompting degrades visual spatial reasoning and that scale does not improve self-correction ability, challenging the assumption that standard reasoning techniques transfer uniformly across task types.
Efficiency and Scaling (88 papers, Active): Qwen3.5-Omni's Hybrid Attention MoE architecture reaches hundreds of billions of parameters with 256k context, but as an industry technical report with no public weights it contributes benchmark numbers without reproducible science.
Interpretability (82 papers, Active): The CrossMath modality-gap study provides a new mechanistic insight — VLM reasoning is predominantly text-space — that could guide where interpretability researchers should look for meaningful internal representations in multimodal models.
Multimodal Understanding (79 papers, Active): Multiple papers today converged on the same finding from different angles: current vision-language models do not genuinely integrate visual and textual information, and adding images sometimes hurts rather than helps performance.
Alignment and Safety (73 papers, Active): ASMR-Bench empirically demonstrated that frontier LLMs cannot reliably detect subtle sabotage in ML codebases, while MEDLEY-BENCH showed that self-correction ability does not scale with model size — both findings tighten constraints on what automated oversight can currently guarantee.
Hallucination and Grounding (70 papers, Active): Three papers addressed hallucination from different angles: multi-agent medical report generation reduced clinical hallucinations structurally, a consensus-proxy method showed promise for detecting confabulation cheaply, and CoT research revealed that verbal priors drive hallucination of visual details even when no image is present.
Agent Tool Use (61 papers, Active): AgentV-RL's bidirectional agentic verifier posted a 25.2% gain over prior reward models on math tasks, while ASMR-Bench's low sabotage-detection rates suggest that LLM-as-auditor approaches are not yet trustworthy for monitoring agentic pipelines.
Data Quality and Curation (51 papers, Active): Activity today was indirect: a referenced connection paper (not in the main batch) formalized how synthetic-data contamination degrades detection capacity itself, but no direct empirical data-curation papers reached the top tier.
Embodied AI (21 papers, Active): SocialGrid showed frontier LLMs scoring below 60% on embodied social reasoning even with navigation assistance, and SAGR demonstrated that structured semantic graph abstractions can recover 18.8% search efficiency in multi-robot coordination.
Long Context (15 papers, Active): Long-context activity was modest today; Qwen3.5-Omni claims 256k context support but its closed nature prevents verification, and the Experience Compression Spectrum survey implicitly motivates long-context compression as an unsolved architectural need.
View Full Analysis
DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io