DeepScience

Artificial Intelligence · Daily Digest

May 19, 2026

286 Papers · 11/11 Roadblocks Active

⚡ Signal of the Day

• A cluster of empirical papers today collectively exposes a consistent theme: AI agents fail at memory, multi-step reasoning, and simulating humans — all three gaps surfacing on the same day across independent research groups.

• The LongMINT benchmark puts a number on memory failure (27.9% average accuracy across seven systems), EnvFactory targets LLM-simulated training environments that hallucinate tool behavior, and two independent UX studies find that GPT systematically misrepresents human preferences. Together these results suggest that reliability ceilings are closer than capability headlines imply.

• Watch the embodied-AI sub-cluster: four papers (ESI-Bench, Key-Gram, Robo-Cortex, Seeing Together) all advance evaluation or architecture for physical agents simultaneously, signaling the field is maturing from proof-of-concept demos toward systematic benchmarking.

📄 Top 10 Papers

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

EnvFactory builds a pipeline of three AI agents that autonomously discover, verify, and package real software tools into executable training environments — replacing LLM-hallucinated simulators that teach agents wrong tool behavior. Using only 85 verified environments, it generates enough training data to match or beat systems trained on five times more synthetic data. This matters because the dominant bottleneck for tool-using AI agents has been the cost and unreliability of training data, and executable verification directly addresses both.

0.9 · agent-tool-use · Preprint
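
The executable-verification idea in the EnvFactory summary can be illustrated with a minimal sketch. This is not the paper's pipeline: `CandidateTool`, `verify`, `build_environments`, and the two toy tools are hypothetical names chosen for illustration. The only point shown is that a tool enters the training set after it has actually been run and checked, rather than being simulated by an LLM.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class CandidateTool:
    """A discovered tool plus the checks proposed for it (hypothetical structure)."""
    name: str
    func: Callable[..., Any]
    checks: List[Tuple[tuple, Any]]  # (arguments, expected output) pairs


def verify(tool: CandidateTool) -> bool:
    """Accept a tool only if every check passes when the tool is really executed."""
    for args, expected in tool.checks:
        try:
            if tool.func(*args) != expected:
                return False
        except Exception:
            # A crash during execution counts as a failed verification.
            return False
    return True


def build_environments(candidates: List[CandidateTool]) -> Dict[str, Callable[..., Any]]:
    """Package only the verified tools into a name -> callable mapping."""
    return {tool.name: tool.func for tool in candidates if verify(tool)}


candidates = [
    CandidateTool("add", lambda a, b: a + b, [((2, 3), 5)]),
    # This tool raises ZeroDivisionError on its own check, so it is rejected.
    CandidateTool("broken_div", lambda a, b: a / b, [((1, 0), 0)]),
]
envs = build_environments(candidates)
```

In this toy run only `add` survives, which mirrors the summary's claim that a small set of verified environments can be more useful than a larger unverified one.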

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a benchmark of 15,600 questions over contexts averaging 138,800 tokens, where information is frequently updated and interleaved, forcing agents to track multiple changing facts simultaneously. Seven representative systems — including RAG, vector memory, and long-context LLMs — averaged only 27.9% accuracy, with the sharpest failures on tasks requiring aggregation across multiple evidence fragments. The result quantifies a specific and practically important gap: current memory systems are adequate for simple lookups but collapse under the kind of multi-threaded reasoning real tasks require.

0.9 · long-context · Preprint
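
A toy example may help show why interleaved updates make aggregation harder than lookup. The data and function names below are invented for illustration and are not taken from LongMINT. The sketch contrasts a naive retriever that returns the first matching fact with a tracker that replays every update.

```python
# Interleaved updates to several facts (hypothetical toy data).
events = [
    ("alice_budget", 100),
    ("bob_budget", 50),
    ("alice_budget", 80),   # overrides the earlier alice value
    ("carol_budget", 30),
    ("bob_budget", 90),     # overrides the earlier bob value
]


def first_match(events, key):
    """Naive retrieval: return the first value seen for a key, which may be stale."""
    for k, v in events:
        if k == key:
            return v
    return None


def latest_state(events):
    """Replay every update so each key ends up holding its most recent value."""
    state = {}
    for k, v in events:
        state[k] = v
    return state


state = latest_state(events)

# Aggregation question: "What is the total budget right now?"
stale_total = sum(first_match(events, k) for k in state)  # built from stale facts
current_total = sum(state.values())                       # built from current facts
```

A single stale fact is enough to make the aggregate wrong, which is consistent with the summary's observation that the sharpest failures involve combining multiple evidence fragments.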

Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models

SP-CoR lets multiple robots pool their individual camera views to answer spatial questions that no single robot could answer alone, using physics-informed fusion during training and distilling pose knowledge into prompt tokens so robots need no special sensors at test time. Evaluated on a 114,227-question benchmark across Habitat and iGibson simulators plus real quadruped robots, it improves over the strongest baseline by 3.9% on Habitat and 7.1% on iGibson. The practical implication is a plausible path to multi-robot spatial awareness without requiring expensive shared sensing infrastructure.

0.9 · multimodal-understanding · Preprint

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

ESI-Bench tests AI agents on 3,081 spatial reasoning tasks derived from Spelke's core knowledge systems, comparing agents that actively choose what to look at versus those given fixed or random extra views. Active exploration substantially outperformed passive observation, while random multi-view acquisition actually hurt performance by adding visual noise. This quantifies something intuitive but previously unmeasured: purposeful looking is not interchangeable with simply seeing more.

0.9 · embodied-ai · Preprint

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

CrossView Suite introduces a dataset, fine-tuned model, and benchmark specifically for training multimodal language models to reason consistently about the same objects seen from different angles — a capability current models handle poorly. The core finding is that a three-stage training progression (perception, then cross-view alignment, then reasoning) is necessary; skipping any stage degrades results. This provides a concrete curriculum for the underexplored problem of viewpoint-invariant spatial understanding in vision-language models.

0.9 · multimodal-understanding · Preprint

Key-Gram: Extensible World Knowledge for Embodied Manipulation

Key-Gram adds an external memory of language-derived world knowledge to robot manipulation models, retrievable in O(1) time via deterministic hashing, without retraining the underlying model. By separating what language instructions imply from what the robot currently sees, it achieves a 29.5% relative improvement on the π₀ backbone on RoboTwin2.0. The modular design means this knowledge store can be updated or extended for new tasks without touching the core model — important for deploying robots across changing environments.

0.9 · embodied-ai · Preprint
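
The O(1) retrieval claim for Key-Gram can be pictured with a small sketch of a hash-keyed store. The summary does not describe how Key-Gram builds its keys, so the normalization step, the use of SHA-256, and the `KnowledgeStore` class are all assumptions made for illustration. The sketch shows only the general technique: a deterministic hash of the instruction phrase indexes a table that can be extended without touching the model.

```python
import hashlib


def key_for(phrase: str) -> str:
    """Deterministic key: normalize case and whitespace, then hash (assumed scheme)."""
    normalized = " ".join(phrase.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class KnowledgeStore:
    """External memory indexed by hash, giving average O(1) insert and lookup."""

    def __init__(self):
        self._table = {}

    def add(self, phrase: str, fact: str) -> None:
        self._table[key_for(phrase)] = fact

    def lookup(self, phrase: str):
        # Returns None when no knowledge is stored for the phrase.
        return self._table.get(key_for(phrase))


store = KnowledgeStore()
store.add("fragile object", "grasp with low force")
store.add("hot mug", "hold by the handle")

hint = store.lookup("  Fragile   OBJECT ")  # same key after normalization
```

New entries are added with `store.add(...)` alone, which is the sense in which such a store can grow without retraining the underlying policy.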

Bio-Harness: Reliable Local-First Bioinformatics Agents with a Calibrated Fast-Signal Methodology

Bio-Harness combines LLM-based planning with deterministic template compilers and strict tool-configuration contracts that prevent the LLM from ever hallucinating a tool name or setting a wrong parameter — the agent reasons, but execution is locked to validated templates. Across 144 test cases it recorded zero hallucination events, zero fallbacks, and zero fail-open failures. The architecture demonstrates that separating LLM reasoning from execution enforcement is a viable and auditable solution to tool-use reliability in regulated scientific domains.

0.9 · hallucination-grounding · Peer-reviewed
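
The separation of reasoning from execution described for Bio-Harness can be sketched as a contract check that fails closed. This is an illustrative sketch, not Bio-Harness's implementation: the `CONTRACTS` table, the `align_reads` tool, the `aligner` command line, and `ContractViolation` are hypothetical. The planner proposes a step, and the step is compiled into a command only if the tool name and every parameter match a validated template.

```python
from string import Template

# Hypothetical contract table: allowed tools, their parameter types, and templates.
CONTRACTS = {
    "align_reads": {
        "params": {"threads": int, "reference": str},
        "template": Template("aligner --threads $threads --ref $reference"),
    },
}


class ContractViolation(Exception):
    """Raised when a proposed step does not match a validated contract."""


def compile_step(step: dict) -> str:
    """Compile an LLM-proposed step into a command, or refuse (fail closed)."""
    contract = CONTRACTS.get(step.get("tool"))
    if contract is None:
        raise ContractViolation(f"unknown tool: {step.get('tool')!r}")
    params = step.get("params", {})
    if set(params) != set(contract["params"]):
        raise ContractViolation(f"unexpected parameter set: {sorted(params)}")
    for name, expected_type in contract["params"].items():
        if not isinstance(params[name], expected_type):
            raise ContractViolation(f"wrong type for parameter {name!r}")
    return contract["template"].substitute(params)


command = compile_step(
    {"tool": "align_reads", "params": {"threads": 4, "reference": "hg38.fa"}}
)
```

Because an unknown tool or a malformed parameter raises an exception instead of falling back to free-form text, every executed command is traceable to a template, which is the auditable property the summary highlights.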

Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction

Robo-Cortex addresses the problem that embodied navigation agents forget useful lessons between episodes by building a two-tier memory: a short-term reflective store for current-task observations, and a long-term library of distilled heuristics (guiding rules and cautionary patterns) extracted from past trajectories by an Autonomous Knowledge Induction mechanism. An adaptive variant updates this heuristic library during inference, not just training. Evaluated on three navigation benchmarks, the system shows that structured knowledge distillation from experience outperforms raw trajectory replay.

0.9 · embodied-ai · Preprint
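
The two-tier memory in the Robo-Cortex summary can be pictured with a small data-structure sketch. The class and method names are hypothetical, and the induction step here is a trivial success/failure rule standing in for the paper's Autonomous Knowledge Induction mechanism, whose details the summary does not give.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DualGrainMemory:
    """Short-term observations plus a long-term heuristic library (illustrative)."""
    short_term: List[Tuple[str, bool]] = field(default_factory=list)
    guiding_rules: List[str] = field(default_factory=list)
    cautionary_patterns: List[str] = field(default_factory=list)

    def observe(self, action: str, succeeded: bool) -> None:
        """Record an outcome for the current task only."""
        self.short_term.append((action, succeeded))

    def end_episode(self) -> None:
        """Distill the episode into heuristics, then clear the short-term store."""
        for action, succeeded in self.short_term:
            target = self.guiding_rules if succeeded else self.cautionary_patterns
            note = f"prefer: {action}" if succeeded else f"avoid: {action}"
            if note not in target:  # keep the library free of duplicates
                target.append(note)
        self.short_term.clear()


memory = DualGrainMemory()
memory.observe("follow the corridor to reach the kitchen", True)
memory.observe("cut through the closed door", False)
memory.end_episode()
```

After `end_episode()` the raw trajectory is gone and only the distilled heuristics remain, which is the contrast with raw trajectory replay that the summary draws.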

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

MementoGUI adds a learned memory controller to GUI-operating AI agents that selectively compresses and stores task-relevant interface events — including region-of-interest visual evidence and text summaries — rather than replaying full interaction histories. The system works as a plug-in without retraining the underlying vision-language model backbone, and outperforms both no-history and text-only memory baselines on GUI-Odyssey and MM-Mind2Web. For anyone building AI assistants that operate computers over many steps, this points toward selective memory as preferable to raw context growth.

0.8 · agent-tool-use · Preprint
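
Selective memory, as opposed to replaying the full interaction history, can be sketched in a few lines. MementoGUI uses a learned controller; the word-overlap score, the threshold, and the capacity below are stand-in assumptions used only to make the idea concrete.

```python
def relevance(event_text: str, goal: str) -> float:
    """Fraction of goal words that also appear in the event (stand-in scorer)."""
    goal_words = set(goal.lower().split())
    event_words = set(event_text.lower().split())
    return len(goal_words & event_words) / len(goal_words) if goal_words else 0.0


def select_memory(events, goal, threshold=0.3, capacity=3):
    """Keep only task-relevant events, bounded to the most recent `capacity` items."""
    kept = [e for e in events if relevance(e, goal) >= threshold]
    return kept[-capacity:]


goal = "book flight to Paris"
events = [
    "opened browser home page",
    "typed Paris into flight search",
    "dismissed cookie banner",
    "selected flight to Paris on Friday",
]
memory = select_memory(events, goal)
```

Here two of the four events are stored, so the context passed to the backbone grows with task-relevant evidence rather than with the number of steps taken.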

Distorted Perspectives of LLM-Simulated Preferences: Can AI Mislead Design?

Using real design preference data from a UX research platform, this study tested multiple GPT configurations — varying reasoning mode, sampling temperature, and persona specificity — against actual human responses and found systematic misalignment that no configuration corrected. LLM-generated justifications for design choices were found to lack the nuance and variability of genuine human reasoning, even when they appeared plausible. The practical implication is that using AI-simulated users as a substitute for human testing in product design carries a measurable risk of steering decisions in the wrong direction.

0.8 · alignment-safety · Preprint

🔬 Roadblock Activity

Roadblock	Papers	Status	Signal
Training Data Quality & Curation	133	Active	Highest-volume roadblock today, with EnvFactory and CrossView Suite both contributing architectural approaches to synthetic data quality — executable verification and multi-agent curation pipelines respectively.
Model Interpretability	116	Active	Second-highest volume with no standout papers surfacing in today's top selection, suggesting broad but diffuse activity without a clear methodological breakthrough.
Hallucination & Grounding	106	Active	Bio-Harness demonstrates a deterministic contract architecture achieving zero hallucination events in tool execution, while EnvFactory shows that executable environment verification reduces hallucination in agent training data.
Reasoning Reliability	102	Active	LongMINT's 27.9% average accuracy and the UX simulation studies both quantify reasoning failures, and chain-of-thought prompting is shown not to rescue LLM reliability under real task pressure.
Efficiency & Scaling	86	Active	Active but no papers in today's top selection directly address this roadblock; the volume suggests broad background activity in architecture and inference optimization.
Multimodal Understanding	79	Active	Strong day: ESI-Bench, CrossView Suite, and Seeing Together all advance evaluation and training frameworks for spatial and multi-view reasoning, a subfield that is rapidly maturing from isolated demos to systematic benchmarks.
Agent Tool Use	76	Active	EnvFactory and MementoGUI both advance training and memory infrastructure for tool-using agents, with EnvFactory's executable verification approach being the most concrete methodological contribution.
Alignment & Safety	75	Active	Two independent UX studies confirm that LLM-simulated human preferences are systematically wrong in ways that resist prompt engineering, raising practical concerns about AI standing in for users in product and design decisions.
Long Context	43	Active	LongMINT establishes a new benchmark exposing that memory-augmented agents average only 27.9% accuracy on long contexts with frequent updates, providing a concrete measurement target for future work.
Embodied AI	41	Active	Unusually dense day with four papers advancing embodied AI evaluation and architecture simultaneously — ESI-Bench, Key-Gram, Robo-Cortex, and Seeing Together — suggesting a coordinated maturation of the subfield toward reproducible benchmarking.
Training Data Quality & Curation (subcategory)	1	Low	Minimal activity under this specific tag today; subsumed by the broader data-quality-curation roadblock.

DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io
