DeepScience

Artificial Intelligence · Daily Digest

June 10, 2026

284 Papers · 10/10 Roadblocks Active

⚡ Signal of the Day

• Today's dominant signal is a consistent, multi-benchmark collapse in real-world agentic AI performance: frontier models top out at roughly 21–31% success on physical tool use and professional GUI workflows, far below the 95%+ human baseline.

• This matters because it exposes a two-layer failure: agents can recognise tools and instructions in isolation but cannot chain perception, planning, and execution across many steps without error propagation — a gap that scaling alone has not closed.

• Watch for whether new critic architectures (like HiViG) and structured neuro-symbolic planners begin moving these numbers, and whether the MemVenom memory-poisoning findings accelerate security audits of production agent deployments.

📄 Top 10 Papers

MemVenom: Triggered Poisoning of Multimodal Memories in Web Agents

MemVenom shows that the external memory web agents use to remember past actions can be poisoned with malicious text-image content, causing the agent to follow attacker instructions on future matching tasks with up to 99% success. The attack works in a black-box setting — no access to model weights is needed, only the ability to write one entry into the agent's memory store. This matters because external memory is a standard component of production agents, and the attack transfers across multiple agent frameworks and frontier vision-language models.

██████████ 0.9 agent-tool-use Preprint

Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

This benchmark uses an 8-stage automated evaluation pipeline to test whether vision-language models can solve engineering problems the way an engineer would — reading a technical diagram, selecting the right physical principle, and checking that the answer is physically valid. Models that score well on general reasoning benchmarks consistently fail at interpreting technical diagrams and produce solutions that look plausible but violate basic physics. The stage-wise design pinpoints exactly where reasoning breaks down, which is more actionable than a single aggregate score.

██████████ 0.9 reasoning-reliability Preprint

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

PhysTool-Bench tests 13 vision-language models on 2,510 real-world queries involving physical tools from agriculture, manufacturing, and healthcare. The best model (Gemini-3.1-Pro) correctly identifies only 59% of tools in a scene and completes only 21% of tasks end-to-end. The study separates two distinct failure modes: poor visual recognition of domain-specific tools, and an inability to translate what is seen into a sensible action plan. Both directly limit real-world robot and agentic deployment in any context outside curated web interfaces.

██████████ 0.9 embodied-ai Preprint

AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning

AllDayNav trains a robot navigation policy with reinforcement learning that autonomously generates its own training goals, with no human labelling needed, and stores experience in a multimodal memory that holds visual keyframes, semantic descriptions, and time context. The authors report near-100% success across room-crossing and task-switching scenarios in both simulation and physical environments, outperforming map-based and VLM baselines on path efficiency. The self-supervised training loop is the key contribution: it allows the agent to keep improving from its own experience without pre-built maps or manual annotation.

██████████ 0.9 embodied-ai Preprint

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Workflow-GYM is a benchmark that requires GUI agents to complete professional workflows involving at least 40 sequential actions in real software environments — the kind of task a junior office worker would handle. State-of-the-art agents succeed on only about 30% of tasks, failing primarily through objective drift (losing track of what they were supposed to do), skipping required workflow stages, and misreading professional software interfaces. The benchmark fills a gap between short toy tasks, where agents look capable, and the sustained multi-step execution that real professional deployment requires.

██████████ 0.9 agent-tool-use Preprint
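
The gap between short-task and long-horizon performance can be illustrated with simple compounding arithmetic. The sketch below is not taken from the Workflow-GYM paper: it assumes every action succeeds independently with the same probability, which real agents will not satisfy exactly, and the function names are hypothetical. Under that assumption, it shows what per-action reliability a roughly 30% success rate over 40 actions would imply.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` actions succeed, assuming each one
    succeeds independently with probability `per_step` (an assumption)."""
    return per_step ** steps


def implied_per_step(task_success: float, steps: int) -> float:
    """Invert the formula above: the per-action reliability that would
    produce `task_success` over `steps` independent actions."""
    return task_success ** (1.0 / steps)


if __name__ == "__main__":
    steps = 40        # minimum workflow length quoted in the digest
    observed = 0.30   # approximate task success quoted in the digest

    p = implied_per_step(observed, steps)
    print(f"Implied per-action reliability: {p:.4f}")

    # How quickly small per-action error rates compound over 40 actions.
    for per_step in (0.90, 0.95, 0.99):
        total = end_to_end_success(per_step, steps)
        print(f"per-action {per_step:.2f} -> end-to-end {total:.3f}")
```

Under the independence assumption, even 99% per-action reliability leaves about a third of 40-step tasks failing, which is one way to read why short-task results can overstate deployment readiness.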

A History-Aware Visually Grounded Critic for Computer Use Agents

HiViG adds a critic model to GUI agents that does two things existing critics do not: it tracks a compact summary of what the agent has already accomplished (macro-action history) to prevent short-sighted replanning, and it visually verifies that the next intended action's screen coordinates actually make sense before executing it. Tested on web, mobile, and desktop benchmarks with both open-weight and commercial policy models, HiViG outperforms verbal and scalar critics. The key insight is that failures in long GUI tasks are often caused by forgetting prior context or clicking the wrong area, not by reasoning errors in isolation.

██████████ 0.9 agent-tool-use Preprint
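
The two checks described above can be sketched in a few lines. This is a rule-based toy, not HiViG itself: the paper's critic is a learned model, whereas `ToyCritic`, `Box`, and the rejection rules here are hypothetical stand-ins chosen only to show where a history check and a coordinate check would sit before an action executes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned screen region in pixels (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


@dataclass
class ToyCritic:
    """Keeps a compact list of completed macro-actions and vets the next click."""
    history: List[str] = field(default_factory=list)

    def record(self, macro_action: str) -> None:
        """Store a macro-action the agent has already completed."""
        self.history.append(macro_action)

    def review(self, macro_action: str, click: Tuple[int, int],
               target: Box) -> Tuple[bool, str]:
        """Return (accepted, reason) for a proposed action."""
        # History check: do not redo work that is already recorded.
        if macro_action in self.history:
            return False, f"'{macro_action}' is already in the history"
        # Grounding check: the click must land inside the intended region.
        x, y = click
        if not target.contains(x, y):
            return False, f"click {click} falls outside the target region"
        return True, "accepted"


if __name__ == "__main__":
    critic = ToyCritic()
    critic.record("open settings")
    button = Box(left=0, top=0, right=100, bottom=50)
    print(critic.review("open settings", (10, 10), button))
    print(critic.review("toggle dark mode", (500, 10), button))
    print(critic.review("toggle dark mode", (10, 10), button))
```

A learned critic would replace both rules with model judgements; the sketch only shows that each check can veto an action independently of the policy's own reasoning.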

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Using China's standardized national office software certification exam as a benchmark, the study finds that frontier LLMs in single-turn mode score only 37% while humans score 96%. Adding agentic execution-feedback loops raises the model score to 69%, still less than three-quarters of the human score. The test is grounded in verifiable office document outputs (spreadsheets, presentations, documents) rather than open-ended text, so results are harder to game. The persistent 27-point gap after agentic augmentation suggests the bottleneck is fine-grained document manipulation skill, not just planning.

██████████ 0.9 agent-tool-use Preprint

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

This systematic review of 247 papers organises LLM agent security research into a unified lifecycle framework covering how attacks enter (prompt injection, tool misuse, memory corruption), how they spread (multi-agent propagation), and what defences exist. The key finding is that current defences address individual attack vectors reasonably well but fail when combined — there is no principled way to compose them into a coherent security posture. For AI practitioners deploying agents in production, this is a practical gap inventory rather than an abstract academic exercise.

██████████ 0.9 alignment-safety Preprint

Kwai Keye-VL-2.0 Technical Report

Keye-VL-2.0 is a 30-billion-parameter mixture-of-experts model that activates only 3 billion parameters per forward pass, yet achieves top results among similarly sized models on video understanding, temporal grounding, and STEM benchmarks. A key technical contribution is adapting DeepSeek's sparse attention mechanism to support 256,000-token context windows without accuracy loss in a multimodal setting, something not previously demonstrated in this architecture class. The combination of MoE efficiency and long-context video capability in a single open-weights model is what makes this practically notable.

██████████ 0.9 long-context Preprint

Bridging Semantics and Physical Execution: A Neuro-Symbolic Framework for Multi-Pair Robotic Assembly

This paper addresses multi-pair robotic assembly — the problem of fitting many component pairs together without any single pair's assembly interfering with another's — using a framework that splits work between an LLM for local symbolic planning and a lightweight Transformer for resolving cross-pair conflicts. The system reaches 97% executability in offline evaluation and is deployed on a physical UR3 arm, with the LLM constrained to high-level atomic actions to reduce the hallucinated steps that plague unconstrained LLM planners. The result demonstrates that constraining where LLMs operate in a pipeline is often more effective than trying to make LLMs reliable end-to-end.

██████████ 0.9 embodied-ai Preprint
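
One way to picture the constraint on the LLM planner is a filter that accepts only steps drawn from a fixed action vocabulary. The sketch below assumes a hypothetical vocabulary (`pick`, `align`, `insert`, `release`) and a hypothetical `validate_plan` helper; the paper's actual action set and validation logic are not described in this digest.

```python
from typing import List, Tuple

# A plan step is (action, part). Both the step format and this vocabulary
# are illustrative assumptions, not the paper's action set.
Step = Tuple[str, str]
ALLOWED_ACTIONS = frozenset({"pick", "align", "insert", "release"})


def validate_plan(plan: List[Step]) -> Tuple[List[Step], List[Step]]:
    """Split a proposed plan into steps that use an allowed atomic action
    and steps that do not (for example, hallucinated actions)."""
    accepted: List[Step] = []
    rejected: List[Step] = []
    for action, part in plan:
        if action in ALLOWED_ACTIONS:
            accepted.append((action, part))
        else:
            rejected.append((action, part))
    return accepted, rejected


if __name__ == "__main__":
    # "glue" stands in for a step the robot cannot actually execute.
    proposed = [("pick", "peg_a"), ("glue", "peg_a"), ("insert", "peg_a")]
    accepted, rejected = validate_plan(proposed)
    print("accepted:", accepted)
    print("rejected:", rejected)
```

In a full pipeline a rejected step would more plausibly trigger replanning than be silently dropped; the sketch only shows that a closed vocabulary makes unexecutable steps detectable before they reach the arm.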

🔬 Roadblock Activity

Roadblock	Papers	Status	Signal
Data Quality & Curation	113	Active	Highest paper volume today; activity concentrated in benchmark construction papers that expose how dataset curation choices directly determine apparent model capability.
Interpretability	109	Active	Second-highest paper count; the systematic review on AI in anaesthesiology underscores that interpretability gaps remain the primary barrier to clinical AI adoption after a decade of research.
Reasoning Reliability	104	Active	Multiple benchmarks today independently confirmed that models fail at multi-step physical and engineering reasoning even when they appear capable on general reasoning tasks.
Efficiency & Scaling	90	Active	Keye-VL-2.0's MoE design (3B active / 30B total) with 256K context support shows sparse activation is now a viable path to long-context multimodal capability without proportional compute cost.
Hallucination & Grounding	88	Active	The Data Journalist Agent and neuro-symbolic robotics papers both suggest that constraining LLM scope to structured subtasks, rather than end-to-end generation, is among the more reliable current mitigations for hallucination in high-stakes pipelines.
Multimodal Understanding	71	Active	P3D-Bench and PhysTool-Bench both revealed that models handling natural images competently collapse when images carry technical or domain-specific visual encodings (CAD diagrams, physical tools in cluttered scenes).
Agent Tool Use	71	Active	A cluster of new benchmarks and attack papers converged today, collectively showing that agents fail at long-horizon tool use (30% GUI success, 21% physical tool success) while also being newly exposed as vulnerable to memory-poisoning attacks with near-perfect attack success rates.
Alignment & Safety	66	Active	MemVenom and the LLM agent security survey both highlight that agentic system security is fragmented — defences exist for individual attack types but cannot yet be composed into a coherent safe deployment posture.
Embodied AI	36	Active	Three papers today advanced embodied AI from different angles — navigation (AllDayNav), physical tool recognition (PhysTool-Bench), and robotic assembly (neuro-symbolic framework) — with real-world robot results in two of three cases.
Long Context	32	Active	DocTrace's 53% compute reduction over strong RAG baselines and Keye-VL-2.0's lossless 256K context extension both suggest that architectural query-conditioning — processing only what is needed — is more tractable than scaling brute-force context windows.

DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io
