
[Artificial Intelligence] Daily digest — 279 papers, 0 strong connections (2026-05-01)

DeepScience
Artificial Intelligence · Daily Digest
May 01, 2026
279 Papers · 10/10 Roadblocks Active · 0 Connections
⚡ Signal of the Day
• Two independent studies reveal that multimodal AI models routinely fake visual understanding — achieving high scores by exploiting text shortcuts rather than genuinely analyzing images.
• A circuit-diagram-to-code benchmark found that replacing diagrams with blank images barely hurt model performance, while a medical imaging audit found top frontier models localize anatomy with only 19% accuracy — both failures stem from the same root cause: models anchor on identifiers and context rather than visual content.
• Watch for whether the field responds with identifier-anonymization training (as proposed in the circuit paper) or whether this becomes a broader evaluation reform story; the medical grounding results in particular have direct patient-safety implications.
📄 Top 10 Papers
Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation
Five leading vision-language models — including Gemini 2.5 Pro, GPT-4o, and o3 — were tested on medical image question-answering tasks that required locating anatomical or pathological regions; the best model achieved only 19% localization accuracy, and all models showed dangerous left-right confusion about the patient's body. Worse, a 'self-grounding' pipeline in which models first localize and then answer (sketched below) actually degraded accuracy relative to answering without localization. This matters because it shows that high benchmark scores on medical AI conceal a concrete failure mode that would be dangerous in clinical deployment.
Score 0.9 · hallucination-grounding · Preprint
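For readers unfamiliar with the setup, the audited 'self-grounding' pipeline is just a two-stage prompt chain: the model is first asked to localize the relevant region, then answers conditioned on its own localization. A minimal sketch, with a stub `vlm` callable standing in for any of the audited models; the exact prompts are illustrative assumptions:

```python
def vlm(image: bytes, prompt: str) -> str:
    """Stub standing in for a call to any audited vision-language model."""
    return "..."

def answer_direct(image: bytes, question: str) -> str:
    # Baseline: answer in one shot, with no explicit localization step.
    return vlm(image, question)

def answer_self_grounded(image: bytes, question: str) -> str:
    """Locate-then-answer chain; the audit found this *degrades* accuracy,
    because the second call is conditioned on the model's own, often
    wrong, localization."""
    box = vlm(image, f"Give the bounding box of the region relevant to: {question}")
    return vlm(image, f"Focusing on the region {box}, answer: {question}")
```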
From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
When asked to convert circuit diagrams into hardware description code, multimodal models score well on standard metrics even when the diagram is replaced with a blank image — they are pattern-matching on module name identifiers rather than reading the circuit. Replacing those identifiers with anonymous labels causes sharp accuracy drops across eight tested models, exposing that apparent visual reasoning is largely a shortcut exploit. The paper proposes identifier anonymization during training and refusal augmentation as concrete fixes, making this one of the cleaner diagnosis-plus-remedy papers on multimodal grounding failures.
Score 0.9 · hallucination-grounding · Preprint
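The proposed anonymization fix is easy to picture: strip semantically meaningful identifiers before training or evaluation, so a model cannot infer a circuit's function from its module names alone. A minimal sketch, assuming a simple regex pass over Verilog source; the paper's exact scheme may differ:

```python
import re

def anonymize_verilog_identifiers(src: str) -> str:
    """Rename user-defined Verilog module names to anonymous labels
    (mod_0, mod_1, ...) so a model cannot pattern-match on them.
    Illustrative only; it ignores escaped identifiers and name clashes."""
    names = re.findall(r"\bmodule\s+(\w+)", src)
    mapping = {name: f"mod_{i}" for i, name in enumerate(dict.fromkeys(names))}
    for name, alias in mapping.items():
        # Rewrite both the declaration and any instantiations of the module.
        src = re.sub(rf"\b{re.escape(name)}\b", alias, src)
    return src

example = """module full_adder(input a, b, cin, output sum, cout);
  assign {cout, sum} = a + b + cin;
endmodule"""
print(anonymize_verilog_identifiers(example))  # "full_adder" becomes "mod_0"
```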
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
In multi-agent AI workflows, introducing corrupted or adversarially perturbed data into shared artifacts (documents, tables) does not reliably cause visible errors — agents sometimes reach correct answers through structurally deviant execution paths, and sometimes produce wrong answers with structurally normal traces. Across 614 paired runs on real benchmark tasks using three different language models, the authors identify three contamination signatures: silent corruption, detour-with-recovery, and early termination. This is important because it means correctness of the final output is a poor proxy for whether a multi-agent pipeline was compromised.
Score 0.9 · agent-tool-use · Preprint
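The three signatures amount to a two-axis classification of each paired clean/contaminated run: did the final answer change, and did the trace's structure change? A schematic sketch; the decision rules are an illustrative reading of the summary above, not the authors' exact criteria:

```python
from dataclasses import dataclass

@dataclass
class Run:
    steps: list[str]        # ordered agent/tool actions in the trace
    answer_correct: bool
    terminated_early: bool

def contamination_signature(clean: Run, contaminated: Run) -> str:
    """Classify a paired (clean, contaminated) run into one of the three
    signatures. Illustrative rules, not the paper's exact criteria."""
    structurally_deviant = contaminated.steps != clean.steps
    if contaminated.terminated_early:
        return "early-termination"
    if contaminated.answer_correct and structurally_deviant:
        return "detour-with-recovery"   # right answer, abnormal path
    if not contaminated.answer_correct and not structurally_deviant:
        return "silent-corruption"      # wrong answer, normal-looking trace
    return "other"

pair = (
    Run(["search", "read", "summarize"], True, False),
    Run(["search", "read", "summarize"], False, False),
)
print(contamination_signature(*pair))   # -> silent-corruption
```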
PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning
Standard practice for training multimodal AI is to first do supervised fine-tuning (SFT) and then apply reinforcement learning, but this paper shows that SFT shifts the model's output distribution in ways that hurt subsequent RL training — with perception errors and reasoning errors drifting in distinct, compounding patterns. PRISM inserts an adversarial distillation step between SFT and RL, using a mixture-of-experts discriminator to realign distributions before RL begins, and this consistently improves performance across three different RL algorithms at 4B and 8B model scales. The practical takeaway is that the order and nature of post-training stages matter more than previously assumed.
Score 0.9 · reasoning-reliability · Preprint
Synthetic Computers at Scale for Long-Horizon Productivity Simulation
Training AI agents to perform real computer tasks requires realistic environments, but creating thousands of these by hand is impractical; this paper automates the process by generating 1,000+ synthetic user computers — complete with realistic file systems, documents, and spreadsheets — and then running agents on month-long simulated productivity objectives over 2,000+ turns each. Agents trained on these synthetic environments improved on both in-domain and out-of-domain productivity benchmarks, and 100 environments plus 500 simulation reports are publicly released. The significance is methodological: it demonstrates a scalable path to training data for long-horizon computer-use agents without requiring real user data.
Score 0.8 · agent-tool-use · Preprint
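Most of the work here is in the generation pipeline. A toy sketch of populating one synthetic user machine with a plausible directory tree and a manifest the simulator can hook objectives into; the real pipeline presumably uses generative models to write realistic document and spreadsheet contents, and all paths below are invented for illustration:

```python
import json
import random
from pathlib import Path

def generate_synthetic_computer(root: Path, n_files: int = 20, seed: int = 0) -> None:
    """Toy generator for one synthetic user machine: a plausible directory
    tree, placeholder documents, and a manifest the simulator can use to
    pose long-horizon productivity objectives grounded in the files."""
    rng = random.Random(seed)
    dirs = ["Documents/reports", "Documents/invoices", "Spreadsheets", "Downloads"]
    for d in dirs:
        (root / d).mkdir(parents=True, exist_ok=True)
    for i in range(n_files):
        target = root / rng.choice(dirs) / f"file_{i:03d}.txt"
        target.write_text(f"placeholder content {i}\n")   # real pipeline: LLM-written
    (root / "manifest.json").write_text(json.dumps({"dirs": dirs, "files": n_files}))

generate_synthetic_computer(Path("synthetic_user_000"))
```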
Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering
MED-VRAG retrieves information from a 350,000-page medical literature corpus by treating pages as images rather than extracted text chunks, then iteratively refines its answers using a memory bank that accumulates evidence across retrieval rounds. On four standard medical QA benchmarks, this approach achieves 78.6% average accuracy, with the image-based retrieval alone contributing +1.0 point and the iterative refinement adding another +1.5 points over simpler baselines. The result suggests that preserving document layout and visual structure during retrieval carries meaningful signal that text extraction discards.
Score 0.8 · multimodal-understanding · Preprint
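The refinement loop is the interesting mechanism: evidence notes accumulate in a memory bank and feed back into the next retrieval round. A schematic sketch; the function names, query-refinement step, and fixed round count are all assumptions, with stubs in place of the real page-image retriever and VLM:

```python
def retrieve_pages(query: str, k: int = 4) -> list[str]:
    """Stub: an image-level retriever would return the k page images
    (not extracted text chunks) most similar to the query."""
    return [f"page_{i}.png" for i in range(k)]

def vlm_answer(question: str, pages: list[str], memory: list[str]) -> tuple[str, str]:
    """Stub: a vision-language model reads the page images plus accumulated
    evidence notes and returns (current answer, new evidence note)."""
    return "answer", f"evidence drawn from {pages}"

def iterative_vrag(question: str, rounds: int = 3) -> str:
    memory: list[str] = []          # evidence bank grows across rounds
    answer, query = "", question
    for _ in range(rounds):
        pages = retrieve_pages(query)                  # pages treated as images
        answer, note = vlm_answer(question, pages, memory)
        memory.append(note)                            # accumulate evidence
        query = f"{question} {note}"                   # refine next round's query
    return answer

print(iterative_vrag("What is the first-line treatment for condition X?"))
```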
Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions
This benchmark tests whether LLMs can translate plain-English financial instructions — like 'swap 1 ETH for USDC on Uniswap' — into correct blockchain transactions, using 300 days of real Ethereum mainnet data to build nearly 32,000 test cases across 11 transaction categories. The key finding is that models frequently produce syntactically valid outputs that nonetheless fail to achieve the intended financial state change, and multi-step transaction sequences remain largely unsolved even by top models. Evaluating correctness via actual execution on a forked blockchain environment is a methodological contribution that sets a higher bar than text-similarity metrics for this class of tasks.
Score 0.8 · reasoning-reliability · Preprint
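Execution-based grading means the check is a state diff on a forked chain, not string similarity against a reference transaction. A minimal sketch using web3.py against a locally forked node (for example, one started with anvil's fork mode); the success predicate shown, recipient ETH balance increased, is a simplified stand-in for the benchmark's per-category state checks:

```python
from web3 import Web3

# Assumes a local fork of mainnet, e.g. started with: anvil --fork-url <RPC_URL>
w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))

def grade_by_execution(model_tx: dict, recipient: str) -> bool:
    """Execute the model-generated transaction on the fork and grade it by
    the resulting state change. The predicate here (recipient ETH balance
    increased) is a simplified stand-in for per-category checks."""
    before = w3.eth.get_balance(recipient)
    tx_hash = w3.eth.send_transaction(model_tx)          # runs on the fork only
    receipt = w3.eth.wait_for_transaction_receipt(tx_hash)
    after = w3.eth.get_balance(recipient)
    return receipt["status"] == 1 and after > before

# A syntactically valid transaction can still fail this check, e.g. by
# sending to the wrong address or encoding the wrong token amount.
```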
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
This survey maps the state of AI agents that operate graphical user interfaces (web browsers, desktop apps) trained with reinforcement learning, identifying why supervised fine-tuning alone breaks down: it cannot handle long sequences of actions where early mistakes only become apparent many steps later. A key practical finding is that input/output latency with live GUI environments is becoming a bottleneck, pushing researchers toward world-model-based training where agents simulate the GUI in their own representations rather than interacting with the real one. As a survey it introduces no new experiments, but its taxonomy of reward architectures is useful for understanding where the field is heading.
Score 0.8 · agent-tool-use · Preprint
The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models
Using the Iterated Prisoner's Dilemma as a controlled test bed, this study shows that presenting a vision-language model with images depicting concepts like kindness or aggression measurably shifts its subsequent cooperative or competitive behavior — even though the images are irrelevant to the task. Color-coding of the game's reward matrices also influences decisions, and different model architectures vary substantially in how susceptible they are to this visual priming. This matters for AI safety because it demonstrates that deployed multimodal systems can have their decision-making altered by ambient visual context in ways that are neither intended nor transparent.
Score 0.8 · alignment-safety · Preprint
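The design is simple to reproduce: attach a task-irrelevant priming image to each round's prompt and compare cooperation rates across priming conditions. A schematic sketch with a stub in place of the model call; the image filenames are invented:

```python
import random

def vlm_choose(prime_image: str, history: list[tuple[str, str]]) -> str:
    """Stub: a real call would attach prime_image to the prompt (despite
    its irrelevance to the game) and parse 'C' (cooperate) or 'D' (defect)."""
    return random.choice(["C", "D"])

def cooperation_rate(prime_image: str, rounds: int = 100) -> float:
    """Fraction of rounds in which the primed model cooperates in an
    Iterated Prisoner's Dilemma against a second (here identical) player."""
    history: list[tuple[str, str]] = []
    cooperations = 0
    for _ in range(rounds):
        a = vlm_choose(prime_image, history)    # model under test
        b = vlm_choose(prime_image, history)    # opponent
        history.append((a, b))
        cooperations += (a == "C")
    return cooperations / rounds

# The measured priming effect is the gap between conditions:
print(cooperation_rate("kindness.png") - cooperation_rate("aggression.png"))
```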
Rethinking Agentic Reinforcement Learning In Large Language Models
This survey maps how reinforcement learning is being adapted for LLM-based agents that must set their own goals, reflect on failures, and plan over long time horizons — capabilities that go well beyond the episodic reward structures RL was originally designed for. The authors organize the space around four agent components (action, planning, memory, tools) and argue that handling real-world uncertainty requires cognitive-like mechanisms such as meta-reasoning and self-reflection built into the learning loop. As a narrative survey without original experiments it carries limited evidentiary weight, but it provides a useful conceptual map for understanding where agentic RL research is currently focused.
Score 0.8 · reasoning-reliability · Preprint
🔬 Roadblock Activity
Data Quality & Curation (127 papers · Active): Highest-volume roadblock today; the synthetic computer environments paper offers one of the few concrete positive results — a publicly released dataset that may ease the data scarcity problem for long-horizon agent training.
Reasoning Reliability (115 papers · Active): Two empirical papers (Intent2Tx, PRISM) add evidence that syntactic correctness and benchmark scores systematically overstate reliable reasoning, particularly for sequential decision tasks.
Hallucination & Grounding (111 papers · Active): The day's dominant theme. Both the medical VQA audit and the circuit-to-Verilog mirage paper independently confirm that visual grounding is largely absent in current multimodal models, which substitute pattern-matching on textual cues for genuine perception.
Interpretability (107 papers · Active): Activity is high in volume, but today's papers are mostly narrative reviews (pharmaceutical AI, autonomous systems); no new mechanistic interpretability methods appeared in the top results.
Alignment & Safety (84 papers · Active): The visual priming study adds a concrete behavioral vulnerability — ambient images shifting model decisions — alongside several ethics and regulatory opinion pieces that contribute volume but limited empirical grounding.
Multimodal Understanding (80 papers · Active): The iterative multimodal RAG paper shows incremental gains from image-based document retrieval, while the grounding audit results set a sobering baseline for how far frontier models still are from reliable multimodal perception.
Agent Tool Use (78 papers · Active): Three papers address agent reliability from different angles: contamination in multi-agent traces, GUI agent training with RL, and synthetic computer environments for long-horizon practice — a notably coherent cluster.
Efficiency & Scaling (74 papers · Active): No top-10 papers directly address efficiency or scaling today; the roadblock is active by volume, but today's strong papers are concentrated elsewhere.
Long Context (35 papers · Active): Moderate activity; the synthetic computers simulation paper (2,000+ turn episodes) touches long-horizon planning, but long-context modeling itself had no dedicated strong papers today.
Embodied AI (33 papers · Active): Low activity and no top-10 papers; the visual generation survey touches embodied reasoning tangentially but produces no empirical results.
DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io