
[Artificial Intelligence] Daily digest — 295 papers, 0 strong connections (2026-06-13)

Artificial Intelligence · Daily Digest
June 13, 2026
Papers: 295 · Roadblocks Active: 11/11 · Connections: 0
⚡ Signal of the Day
• AI agents are simultaneously getting better at using tools and being shown to be more vulnerable than previously understood — two trends colliding in today's papers.
• HyperTool roughly doubled tool-use accuracy on a multi-tool benchmark by letting agents fold predictable tool-call sequences into a single compound action. A separate study showed that a single poisoned web page fooled all 12 tested LLMs (commercial and open-weights) into recommending fake products up to 27% of the time, rising to 73.8% when all top-3 retrieved pages were swapped.
• Watch the gap between benchmark-level agent performance and real-world robustness. EpiBench found that no AI system passed a majority of professional epigenomics analysis tasks, and EvoBrowseComp showed that static benchmarks are themselves unreliable because of contamination. The evaluation infrastructure for AI agents is under stress.
📄 Top 10 Papers
HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents
Current AI agents call tools one step at a time, forcing the model to make many sequential decisions even when the correct sequence is deterministic. HyperTool lets agents fold predictable sequences of tool calls into a single compound action, reducing unnecessary decision points and context consumption. This nearly doubled accuracy on a multi-tool benchmark for both small (8B) and large (32B) models, suggesting the interface design between an agent and its tools matters as much as the underlying model.
█████████ 0.9 agent-tool-use Preprint
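The folding idea described for HyperTool can be illustrated with a minimal sketch. This is not the paper's implementation: the tool registry, the tool names, and the helpers `run_stepwise` and `make_compound` are all hypothetical, chosen only to show how a predictable sequence of calls can be exposed to the agent as one action so that it costs one decision instead of several.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical tool registry; these toy tools are illustrative only.
TOOLS: Dict[str, Callable[[Any], Any]] = {
    "strip": lambda s: s.strip(),
    "lower": lambda s: s.lower(),
    "split": lambda s: s.split(","),
}


def run_stepwise(value: Any, steps: List[str]) -> Tuple[Any, int]:
    """Baseline: every tool call counts as one model-visible decision."""
    decisions = 0
    for name in steps:
        value = TOOLS[name](value)
        decisions += 1
    return value, decisions


def make_compound(steps: List[str]) -> Callable[[Any], Any]:
    """Fold a fixed, predictable sequence of tool calls into one action."""
    def compound(value: Any) -> Any:
        for name in steps:
            value = TOOLS[name](value)
        return value
    return compound


steps = ["strip", "lower", "split"]

# Step-wise execution: three separate decisions.
stepwise_result, stepwise_decisions = run_stepwise("  A,B,C ", steps)

# Compound execution: the same sequence registered as a single tool.
TOOLS["normalize_csv"] = make_compound(steps)
compound_result, compound_decisions = run_stepwise("  A,B,C ", ["normalize_csv"])

print(stepwise_result, stepwise_decisions)
print(compound_result, compound_decisions)
```

Both paths produce the same result; only the number of decision points the agent has to get right differs, which is the lever the summary above attributes the accuracy gain to.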
ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm
Frontier AI agents that try to control professional software like AutoCAD by watching and clicking the screen achieve near-zero success on real industrial CAD tasks. ComAct replaces that visual approach with deterministic Python script generation that talks directly to software internals via the Component Object Model protocol, yielding substantial immediate gains. The core insight is that some software interactions are too geometrically precise for vision-based control — a direct programmatic interface sidesteps the problem entirely.
█████████ 0.9 agent-tool-use Preprint
One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders
All 12 commercial and open-weights LLMs tested are vulnerable to adversarial manipulation of web content used for product recommendations: inserting a single rewritten page with a fake brand name fooled models up to 27% of the time, and swapping all top-3 retrieved pages raised that to 73.8%. The vulnerability correlates with how weakly the model's prior knowledge anchors a given product category — less familiar topics are more manipulable. Standard defenses like skepticism prompting provided only partial protection.
█████████ 0.9 hallucination-grounding Preprint
EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge
Widely used benchmarks for web search agents are vulnerable to data contamination — models can score well by recalling memorized facts rather than actually retrieving information. EvoBrowseComp addresses this by automatically generating fresh questions via live web traversal, creating a benchmark that updates continuously to stay ahead of training data cutoffs. The evaluation of current agents on this benchmark reveals that genuine multi-step retrieval remains a significant unsolved challenge.
█████████ 0.9 hallucination-grounding Preprint
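The contamination concern behind EvoBrowseComp can be made concrete with a small sketch. It assumes each question carries the date of its supporting evidence and keeps only questions newer than a model's training cutoff. The `Question` type, its fields, the placeholder questions, and the cutoff date are hypothetical; the paper's actual generation pipeline (live web traversal) is not reproduced here.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Question:
    text: str
    evidence_date: date  # hypothetical field: when the supporting page appeared


def fresh_questions(pool: List[Question], training_cutoff: date) -> List[Question]:
    """Keep only questions whose evidence postdates the training cutoff,
    so a correct answer cannot come from memorized training data."""
    return [q for q in pool if q.evidence_date > training_cutoff]


# Placeholder pool; dates and texts are made up for illustration.
pool = [
    Question("placeholder question A", date(2025, 1, 10)),
    Question("placeholder question B", date(2026, 5, 2)),
]

fresh = fresh_questions(pool, training_cutoff=date(2025, 12, 31))
print([q.text for q in fresh])
```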
Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints
AI hallucinations in medical imaging are not just wrong answers — they manifest as fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements that can look clinically plausible. This review synthesizes existing taxonomies across five imaging modalities and finds that general-purpose foundation models actually outperform narrowly fine-tuned medical models on hallucination benchmarks, suggesting that aggressive domain specialization can introduce new confabulation risks. No single existing framework covers the full imaging pipeline, leaving a gap the authors map against FDA regulatory guidance.
█████████ 0.9 hallucination-grounding Preprint
EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis
Despite strong general coding and reasoning capabilities, no AI system tested passed a majority of real epigenomics analysis tasks; the best performer, GPT-4.5, succeeded on only 45% of attempts. The benchmark evaluates agents on short-horizon tasks drawn from realistic bioinformatics workflows, so the failures stem not from task complexity alone but from gaps in domain-specific procedural knowledge. Performance varied substantially by assay type, pointing to specific technical bottlenecks rather than uniform weakness.
█████████ 0.9 reasoning-reliability Preprint
Automated reproducibility assessments in the social and behavioral sciences using large language models
An LLM pipeline tasked with reproducing statistical analyses from published social science papers recovered the original effect sizes in 41% of cases (within a tolerance of ±0.05 Cohen's d) and reached the same qualitative conclusion as the original study 96% of the time. Human reanalysts, by comparison, recovered effect sizes in only 34% of cases. This suggests LLMs can automate a significant portion of reproducibility auditing, which is currently expensive and rarely done. The 96% qualitative agreement matters because it indicates the models are not just numerically close but also interpret findings consistently with the original authors.
█████████ 0.9 reasoning-reliability Preprint
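The two metrics in the reproducibility summary above can be sketched as follows. The ±0.05 Cohen's d tolerance comes from the digest; everything else is an assumption for illustration: the effect-size pairs are made up, and "same qualitative conclusion" is approximated here as "same direction of effect", which may not match the paper's definition.

```python
from typing import List, Tuple

TOLERANCE = 0.05  # Cohen's d tolerance stated in the digest


def recovered(original_d: float, reproduced_d: float, tol: float = TOLERANCE) -> bool:
    """True if the reproduced effect size is within the tolerance."""
    return abs(original_d - reproduced_d) <= tol


def same_direction(original_d: float, reproduced_d: float) -> bool:
    """Assumed proxy for qualitative agreement: same sign of the effect."""
    return (original_d > 0) == (reproduced_d > 0)


# Made-up (original, reproduced) effect-size pairs, not data from the paper.
pairs: List[Tuple[float, float]] = [
    (0.30, 0.32),   # recovered, same direction
    (0.50, 0.41),   # not recovered, same direction
    (0.20, 0.21),   # recovered, same direction
    (0.10, -0.02),  # not recovered, direction flipped
]

recovery_rate = sum(recovered(o, r) for o, r in pairs) / len(pairs)
agreement_rate = sum(same_direction(o, r) for o, r in pairs) / len(pairs)

print(f"recovery rate: {recovery_rate:.2f}")
print(f"qualitative agreement: {agreement_rate:.2f}")
```

The sketch shows why the two numbers can diverge so widely: a reanalysis can miss the numeric tolerance while still pointing in the same direction as the original finding.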
Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach
Current multimodal LLMs struggle to diagnose UX problems in mobile app screenshots — tasks like detecting broken visual hierarchy or content inconsistency that human designers handle readily. The authors build a 2,000-sample benchmark and train a 4B-parameter model using reinforcement learning with task-specific reward routing, achieving 79.6% accuracy versus 65.5% for Claude-4.5-Sonnet. The result is notable because a small fine-tuned model outperforms a much larger frontier model when the reward signal is carefully matched to the task structure.
█████████ 0.9 multimodal-understanding Preprint
TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?
Weather and climate models can forecast accurately but cannot reason about their outputs in language, while LLMs can reason in language but cannot operate on high-dimensional gridded Earth-system data. TerraBench creates 403 executable scientific tasks bridging this gap, giving agents 77 specialized tools and evaluating both whether they used the right tools and whether their numeric answers fell within tolerance. The benchmark exposes that current agents handle routine retrieval better than multi-step scientific inference, which is where actual research value lies.
█████████ 0.9 agent-tool-use Preprint
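The two-part evaluation described for TerraBench (correct tools and numeric answer within tolerance) could look roughly like the sketch below. The `TaskResult` fields, the tool names, the 1% relative tolerance, and the rule that the expected tools must be a subset of the tools used are all assumptions for illustration, not details from the benchmark.

```python
import math
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class TaskResult:
    expected_tools: Set[str]
    used_tools: Set[str]
    expected_value: float
    answer: float
    rel_tol: float = 0.01  # hypothetical per-task relative tolerance


def tools_ok(r: TaskResult) -> bool:
    """Assumed rule: every expected tool must appear among the tools used."""
    return r.expected_tools <= r.used_tools


def answer_ok(r: TaskResult) -> bool:
    """Numeric answer must fall within the relative tolerance."""
    return math.isclose(r.answer, r.expected_value, rel_tol=r.rel_tol)


def score(r: TaskResult) -> Dict[str, bool]:
    t, a = tools_ok(r), answer_ok(r)
    return {"tools": t, "answer": a, "passed": t and a}


# Made-up examples with hypothetical tool names.
good = TaskResult({"load_grid", "area_mean"}, {"load_grid", "area_mean"}, 288.0, 287.9)
wrong_value = TaskResult({"load_grid", "area_mean"}, {"load_grid", "area_mean"}, 288.0, 300.0)
missing_tool = TaskResult({"load_grid", "area_mean"}, {"load_grid"}, 288.0, 288.0)

for r in (good, wrong_value, missing_tool):
    print(score(r))
```

Scoring the two criteria separately makes it possible to tell an agent that picked the right tools but computed the wrong number from one that reached a plausible number without the expected tools.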
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
Robot control models trained on household tasks fail in scientific labs because lab environments involve precise instrument handling, transparent liquids, and rigid protocol sequences that household demonstrations never cover. LabVLA trains a vision-language-action model using synthetically generated laboratory demonstrations across 16 robot embodiments, then evaluates on a dedicated LabUtopia benchmark. It achieves the highest success rate among tested baselines in both in-distribution and out-of-distribution settings, but the broader finding is that data scarcity for lab-specific robot demonstrations is the central bottleneck — not model architecture.
████████ 0.8 embodied-ai Preprint
🔬 Roadblock Activity
Roadblock | Papers | Status | Signal
Hallucination & Grounding | 130 | Active | Three papers today converge on weak grounding: LLM outputs can be steered or corrupted through their inputs (polluted web pages, medical images, contaminated benchmark queries), and standard defenses such as skepticism prompting give only partial protection.
Agent Tool Use | 58 | Active | HyperTool and ComAct both show that redesigning the interface between agents and tools, rather than improving the model itself, can sharply raise success rates on practical tasks; HyperTool roughly doubled benchmark accuracy.
Reasoning Reliability | 91 | Active | EpiBench confirms that domain-specific procedural reasoning remains a hard wall for current AI agents, with no system clearing 50% on professional epigenomics analysis tasks.
Multimodal Understanding | 79 | Active | Task-specific fine-tuning with carefully designed reward signals is enabling small models to outperform frontier multimodal models on structured visual reasoning tasks, as shown by the UX reasoning work.
Data Quality & Curation | 137 | Active | EvoBrowseComp highlights that benchmark contamination is degrading the reliability of AI evaluation infrastructure itself: static benchmarks are becoming unreliable proxies for real capability.
Alignment & Safety | 58 | Active | Today's alignment-relevant papers are primarily conceptual frameworks rather than empirical results, with no strong new safety findings from this batch.
Embodied AI | 33 | Active | LabVLA makes a case that scientific laboratory automation is a distinct and underserved embodied AI domain, with data scarcity, not model capability, as the primary bottleneck.
Efficiency & Scaling | 66 | Active | HyperTool provides an indirect efficiency signal: reducing unnecessary model-visible decision steps improves both accuracy and context efficiency, suggesting interface design as a lever for scaling efficiency.
Interpretability | 65 | Active | No strong interpretability-focused papers surfaced in today's top results; activity in this roadblock was spread across tangential work without a clear focal advance.
Long Context | 32 | Active | Long context was not a primary focus in today's top papers, though HyperTool's context-reduction mechanism touches on managing context length in multi-step agent workflows.
Domain Specificity & Generalization | 1 | Low | Only one paper explicitly tagged this roadblock today; domain generalization themes appeared implicitly in LabVLA and EpiBench but were not the primary framing.
DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io