DeepScience

DeepScience — Artificial Intelligence

Artificial Intelligence · Daily Digest

June 13, 2026

295 Papers · 11/11 Roadblocks Active

Connections

⚡ Signal of the Day

• AI agents are getting better at using tools while also proving more vulnerable than previously understood: two trends that collide in today's papers.

• HyperTool roughly doubled tool-use accuracy on a multi-tool benchmark by rethinking how agents compose tool calls, while a separate study showed that a single poisoned web page can fool all 12 tested LLMs (commercial and open-weights) into recommending fake products up to 27% of the time, rising to 73.8% when all three top retrieved pages are poisoned.

• Watch the gap between benchmark-level agent performance and real-world robustness: EpiBench found no AI system passed a majority of professional epigenomics analysis tasks, and EvoBrowseComp highlighted that static benchmarks are themselves unreliable due to contamination — the evaluation infrastructure for AI agents is under stress.

📄 Top 10 Papers

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

Current AI agents call tools one step at a time, forcing the model to make many sequential decisions even when the correct sequence is deterministic. HyperTool lets agents fold predictable sequences of tool calls into a single compound action, reducing unnecessary decision points and context consumption. This nearly doubled accuracy on a multi-tool benchmark for both small (8B) and large (32B) models, suggesting the interface design between an agent and its tools matters as much as the underlying model.
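The folding idea can be illustrated with a minimal Python sketch. This is not HyperTool's actual interface: the helper `make_compound_tool` and the three toy tools are hypothetical names invented here, and the sketch only shows how a deterministic chain of calls can be exposed to a model as one action instead of three.

```python
from typing import Any, Callable, Dict, List

Tool = Callable[[Any], Any]


def make_compound_tool(steps: List[Tool]) -> Tool:
    """Fold a fixed sequence of tools into one callable (hypothetical helper)."""
    def compound(value: Any) -> Any:
        # Each step consumes the previous step's output, so the agent
        # makes one decision instead of one decision per step.
        for step in steps:
            value = step(value)
        return value
    return compound


# Hypothetical toy tools, standing in for real tool endpoints.
def fetch_record(record_id: int) -> Dict[str, Any]:
    return {"id": record_id, "price_cents": 1250, "qty": 3}


def compute_total(record: Dict[str, Any]) -> int:
    return record["price_cents"] * record["qty"]


def format_total(total_cents: int) -> str:
    return f"${total_cents / 100:.2f}"


# One model-visible action replaces three sequential tool calls.
order_total = make_compound_tool([fetch_record, compute_total, format_total])
print(order_total(7))
```

In a step-wise interface the model would see and act on each intermediate result; here only the final string re-enters the context, which is the context-saving effect the summary describes.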

██████████ 0.9 agent-tool-use Preprint

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

Frontier AI agents that try to control professional software like AutoCAD by watching and clicking the screen achieve near-zero success on real industrial CAD tasks. ComAct replaces that visual approach with deterministic Python script generation that talks directly to software internals via the Component Object Model protocol, yielding substantial immediate gains. The core insight is that some software interactions are too geometrically precise for vision-based control — a direct programmatic interface sidesteps the problem entirely.

██████████ 0.9 agent-tool-use Preprint

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

All 12 commercial and open-weights LLMs tested are vulnerable to adversarial manipulation of web content used for product recommendations: inserting a single rewritten page with a fake brand name fooled models up to 27% of the time, and swapping all top-3 retrieved pages raised that to 73.8%. The vulnerability correlates with how weakly the model's prior knowledge anchors a given product category — less familiar topics are more manipulable. Standard defenses like skepticism prompting provided only partial protection.

██████████ 0.9 hallucination-grounding Preprint

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Widely used benchmarks for web search agents are vulnerable to data contamination — models can score well by recalling memorized facts rather than actually retrieving information. EvoBrowseComp addresses this by automatically generating fresh questions via live web traversal, creating a benchmark that updates continuously to stay ahead of training data cutoffs. The evaluation of current agents on this benchmark reveals that genuine multi-step retrieval remains a significant unsolved challenge.

██████████ 0.9 hallucination-grounding Preprint

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

AI hallucinations in medical imaging are not just wrong answers — they manifest as fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements that can look clinically plausible. This review synthesizes existing taxonomies across five imaging modalities and finds that general-purpose foundation models actually outperform narrowly fine-tuned medical models on hallucination benchmarks, suggesting that aggressive domain specialization can introduce new confabulation risks. No single existing framework covers the full imaging pipeline, leaving a gap the authors map against FDA regulatory guidance.

██████████ 0.9 hallucination-grounding Preprint

EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

Despite strong general coding and reasoning capabilities, no AI system tested passed a majority of real epigenomics analysis tasks; the best performer, GPT-4.5, succeeded on only 45% of attempts. The benchmark evaluates agents on short-horizon tasks drawn from realistic bioinformatics workflows, so the failures stem not from task complexity alone but from gaps in domain-specific procedural knowledge. Performance varied substantially by assay type, pointing to specific technical bottlenecks rather than uniform weakness.

██████████ 0.9 reasoning-reliability Preprint

Automated reproducibility assessments in the social and behavioral sciences using large language models

An LLM pipeline tasked with reproducing statistical analyses from published social science papers recovered the original effect sizes in 41% of cases (within a ±0.05 Cohen's d tolerance), compared with 34% for human reanalysts, and reached the same qualitative conclusion as the original study 96% of the time. This suggests LLMs can automate a significant portion of reproducibility auditing, which is currently expensive and rarely done. The 96% qualitative agreement rate matters because it means the models are not just numerically close but are interpreting findings consistently with the original authors.
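To make the ±0.05 tolerance concrete, here is a minimal sketch of how an effect size could be computed and compared. It assumes the standard pooled-standard-deviation form of Cohen's d for two independent groups; the function names `cohens_d` and `is_recovered` are illustrative, and the paper's own matching procedure may differ.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d for two independent groups, using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def is_recovered(original_d: float, reproduced_d: float,
                 tolerance: float = 0.05) -> bool:
    """True if the reproduced effect size is within the tolerance (assumed rule)."""
    return abs(original_d - reproduced_d) <= tolerance


# Toy data, invented for illustration only.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d, is_recovered(0.50, 0.54), is_recovered(0.50, 0.56))
```

Under this rule a reanalysis reporting d = 0.54 against an original 0.50 counts as recovered, while 0.56 does not, which shows how strict a 0.05 window is.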

██████████ 0.9 reasoning-reliability Preprint

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

Current multimodal LLMs struggle to diagnose UX problems in mobile app screenshots — tasks like detecting broken visual hierarchy or content inconsistency that human designers handle readily. The authors build a 2,000-sample benchmark and train a 4B-parameter model using reinforcement learning with task-specific reward routing, achieving 79.6% accuracy versus 65.5% for Claude-4.5-Sonnet. The result is notable because a small fine-tuned model outperforms a much larger frontier model when the reward signal is carefully matched to the task structure.
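One plausible reading of "task-specific reward routing" is a dispatch table that selects a scoring function by task type. The sketch below assumes that reading; the task keys, both reward functions, and the name `routed_reward` are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict

RewardFn = Callable[[str, str], float]


def exact_label_reward(prediction: str, target: str) -> float:
    """1.0 when the predicted label matches the target, ignoring case and padding."""
    return 1.0 if prediction.strip().lower() == target.strip().lower() else 0.0


def token_overlap_reward(prediction: str, target: str) -> float:
    """Jaccard overlap of word sets, a crude stand-in for free-text scoring."""
    pred = set(prediction.lower().split())
    tgt = set(target.lower().split())
    if not pred or not tgt:
        return 0.0
    return len(pred & tgt) / len(pred | tgt)


# Hypothetical routing table: each task type gets its own reward signal.
REWARD_ROUTES: Dict[str, RewardFn] = {
    "visual_hierarchy": exact_label_reward,
    "content_inconsistency": token_overlap_reward,
}


def routed_reward(task_type: str, prediction: str, target: str) -> float:
    """Score a model output with the reward function registered for its task."""
    return REWARD_ROUTES[task_type](prediction, target)


print(routed_reward("visual_hierarchy", "Broken ", "broken"))
print(routed_reward("content_inconsistency", "price label mismatch", "price mismatch"))
```

The design point is that a classification-style task and a free-text task are never scored by the same metric, so the training signal stays matched to the structure of each task.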

██████████ 0.9 multimodal-understanding Preprint

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

Weather and climate models can forecast accurately but cannot reason about their outputs in language, while LLMs can reason in language but cannot operate on high-dimensional gridded Earth-system data. TerraBench creates 403 executable scientific tasks bridging this gap, giving agents 77 specialized tools and evaluating both whether they used the right tools and whether their numeric answers fell within tolerance. The benchmark exposes that current agents handle routine retrieval better than multi-step scientific inference, which is where actual research value lies.
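The two-part check described above (right tools, numeric answer within tolerance) might look roughly like the sketch below. The tool names, the order-insensitive matching rule, and the 1% relative tolerance are all assumptions made for illustration; TerraBench's actual scoring rules are not given in this digest.

```python
import math
from typing import Dict, Sequence


def tools_match(expected: Sequence[str], used: Sequence[str]) -> bool:
    """Assumed rule: every expected tool must appear among those used, in any order."""
    return set(expected) <= set(used)


def within_tolerance(answer: float, reference: float,
                     rel_tol: float = 0.01) -> bool:
    """Accept a numeric answer within a relative tolerance of the reference."""
    return math.isclose(answer, reference, rel_tol=rel_tol)


def score_task(expected_tools: Sequence[str], used_tools: Sequence[str],
               answer: float, reference: float,
               rel_tol: float = 0.01) -> Dict[str, bool]:
    """Report tool-use correctness and numeric correctness separately."""
    return {
        "tools_ok": tools_match(expected_tools, used_tools),
        "answer_ok": within_tolerance(answer, reference, rel_tol),
    }


# Hypothetical task: a regional mean temperature in kelvin.
result = score_task(
    expected_tools=["load_grid", "area_mean"],
    used_tools=["load_grid", "regrid", "area_mean"],
    answer=288.4,
    reference=288.15,
)
print(result)
```

Keeping the two flags separate makes it possible to tell an agent that picked the right tools but miscomputed from one that guessed a plausible number without doing the analysis.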

██████████ 0.9 agent-tool-use Preprint

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Robot control models trained on household tasks fail in scientific labs because lab environments involve precise instrument handling, transparent liquids, and rigid protocol sequences that household demonstrations never cover. LabVLA trains a vision-language-action model using synthetically generated laboratory demonstrations across 16 robot embodiments, then evaluates on a dedicated LabUtopia benchmark. It achieves the highest success rate among tested baselines in both in-distribution and out-of-distribution settings, but the broader finding is that data scarcity for lab-specific robot demonstrations is the central bottleneck — not model architecture.

██████████ 0.8 embodied-ai Preprint

🔬 Roadblock Activity

Roadblock	Papers	Status	Signal
Hallucination & Grounding	130	Active	Three papers today converge on the same vulnerability: LLMs are manipulable through their retrieval context — whether web pages, medical images, or benchmark queries — and standard defenses remain insufficient.
Agent Tool Use	58	Active	HyperTool and ComAct both show that redesigning the interface between agents and tools, rather than improving the model itself, can substantially raise success rates on practical tasks; HyperTool roughly doubled accuracy on its multi-tool benchmark.
Reasoning Reliability	91	Active	EpiBench confirms that domain-specific procedural reasoning remains a hard wall for current AI agents, with no system clearing 50% on professional bioinformatics tasks.
Multimodal Understanding	79	Active	Task-specific fine-tuning with carefully designed reward signals is enabling small models to outperform frontier multimodal models on structured visual reasoning tasks, as shown by the UX reasoning work.
Data Quality & Curation	137	Active	EvoBrowseComp highlights that benchmark contamination is degrading the reliability of AI evaluation infrastructure itself — static benchmarks are becoming unreliable proxies for real capability.
Alignment & Safety	58	Active	Today's alignment-relevant papers are primarily conceptual frameworks rather than empirical results, with no strong new safety findings from this batch.
Embodied AI	33	Active	LabVLA makes a case that scientific laboratory automation is a distinct and underserved embodied AI domain, with data scarcity — not model capability — as the primary bottleneck.
Efficiency & Scaling	66	Active	HyperTool provides an indirect efficiency signal: reducing unnecessary model-visible decision steps improves both accuracy and context efficiency, suggesting interface design as a lever for scaling efficiency.
Interpretability	65	Active	No strong interpretability-focused papers surfaced in today's top results; activity in this roadblock was spread across tangential work without a clear focal advance.
Long Context	32	Active	Long-context was not a primary focus in today's top papers, though HyperTool's context-reduction mechanism touches on managing context length in multi-step agent workflows.
Domain Specificity & Generalization	1	Low	Only one paper explicitly tagged this roadblock today; domain generalization themes appeared implicitly in LabVLA and EpiBench but were not the primary framing.

DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io
