

DeepScience · Artificial Intelligence · Daily Digest

AI Agents Hit Three Different Walls on the Same Day

Today's papers ask one sharp question: when does AI stop being impressive and start being unreliable?
April 14, 2026
Three papers landed today that, taken together, tell a coherent story — and it isn't a flattering one for AI agents. Each one went looking for the ceiling of current systems in a different domain, and each one found it. Let me walk you through what they found and what it actually means.
Today's stories
01 / 03

Zero AI-Generated Investment Banking Reports Were Good Enough for Real Clients

Zero — that's how many AI-generated investment banking reports were deemed ready to hand to a real client.

Picture someone who can pull any number from any database in seconds, draft polished slides, and work through the night without complaining. Sounds useful. Here's the problem: 502 real investment bankers from Goldman Sachs, JPMorgan, and Evercore just evaluated AI-generated work — and not one output made the cut for an actual client. Not one.

The benchmark, called BankerToolBench, was built in direct collaboration with those 502 bankers. Each task — valuing a company, building a deal model, writing a pitch — mirrors actual workflows, with rubrics covering over 100 criteria each. Nine AI systems, including GPT-5.4, were tested in a realistic environment with access to market data feeds and company filings. The best model still failed nearly half the rubric criteria.

Think of it like making a layered cake for a dinner party. You might get the sponge right, the filling right, the decoration right — but if the layers don't align and the flavours clash, nobody serves it. The core failure the researchers identified is what they call cross-artifact consistency: the numbers in the Excel model don't match the claims in the slide deck, which don't match the narrative in the written memo. The pieces exist. They just don't cohere.

The honest catch: a task that takes a human banker up to 21 hours is genuinely hard. This benchmark may sit near the outer edge of what complex professional work even looks like. What it tells us is that AI agents need better ways of keeping their own outputs consistent across a multi-document workflow — and nobody has cracked that yet. A demanding test, not a final verdict on AI in finance broadly.
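The cross-artifact consistency failure can be made concrete with a toy check. The artifact structure and field names below are my own illustrative assumptions, not BankerToolBench's actual format:

```python
# Toy sketch of a cross-artifact consistency check: flag metrics whose
# values diverge across a model, a deck, and a memo. Artifact layout and
# field names are invented for illustration.

def find_inconsistencies(artifacts: dict[str, dict[str, float]],
                         tolerance: float = 0.01) -> list[str]:
    """Return metrics whose reported values differ across artifacts
    by more than `tolerance` (relative)."""
    # Collect every metric mentioned in any artifact.
    metrics = {m for values in artifacts.values() for m in values}
    issues = []
    for metric in sorted(metrics):
        reported = {name: vals[metric]
                    for name, vals in artifacts.items() if metric in vals}
        lo, hi = min(reported.values()), max(reported.values())
        if lo != hi and (hi - lo) / max(abs(hi), 1e-9) > tolerance:
            issues.append(f"{metric}: {reported}")
    return issues

artifacts = {
    "excel_model": {"ev_usd_m": 1250.0, "ebitda_margin": 0.31},
    "slide_deck":  {"ev_usd_m": 1250.0, "ebitda_margin": 0.28},
    "memo":        {"ev_usd_m": 1190.0},
}
for issue in find_inconsistencies(artifacts):
    print(issue)
```

Trivial for a script, yet exactly the kind of bookkeeping today's agents skip: each artifact is generated in its own pass, and nothing reconciles them afterwards.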

Glossary
cross-artifact consistency: The ability to keep numbers, claims, and conclusions aligned across multiple linked documents — e.g., a spreadsheet, a slide deck, and a written memo all saying the same thing.
rubric: A detailed scoring checklist defining what a correct or complete answer looks like, used here to judge AI outputs against professional standards.
02 / 03

A Robot That Learns to Forget the Right Things Stays Useful Longer

Your kitchen junk drawer fills up until you can't find the scissors anymore — and it turns out robots have exactly the same problem.

A robot running continuously for months, if it stores every single thing it observes, will eventually drown in irrelevant footage of empty hallways and routine tasks. The computing cost of answering even a simple question balloons. The system slows.

The researchers behind Armar-7 — a large humanoid robot used for household research — built a system called H²-EMV to tackle this directly: they taught the robot to selectively forget. The system uses a language model — an AI trained on text — to estimate which memories are worth keeping, based on what kinds of questions the robot actually gets asked. Footage of an empty corridor at 3am? Low priority. A specific shelf location the user asks about repeatedly? Keep it.

Tested on 20.5 hours of continuous real-world robot recordings, the approach cut stored memory by 45% and reduced the computing cost of answering questions by 35% — while maintaining roughly the same accuracy. There's a clever two-round design: in the first pass, the robot answers using its current memory. After receiving user feedback, it reorganises what it keeps. Think of it like cleaning out your junk drawer based on what you actually needed last month. Second-round question-answering accuracy improved by a relative 70% after that feedback loop ran.

The honest catch is important: first-round accuracy for the forgetting system is roughly half that of a system that forgets nothing at all. The trade is storage efficiency in exchange for a real upfront loss in accuracy. The team tested this on one robot platform. Whether the approach transfers to different hardware, different home environments, or different kinds of tasks is still wide open.
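A minimal sketch of the relevance-based forgetting idea, with a keyword heuristic standing in for the paper's language-model relevance estimator (the memory descriptions and scoring rule below are my own assumptions, not H²-EMV's implementation):

```python
# Sketch of relevance-based forgetting: score each episodic memory
# against recent user questions, then keep only a fixed budget of the
# highest-scoring ones. The scoring function is a toy word-overlap
# heuristic; a real system would query a language model instead.

from dataclasses import dataclass

@dataclass
class Memory:
    description: str
    score: float = 0.0

def score_memory(mem: Memory, recent_queries: list[str]) -> float:
    # Stand-in relevance estimate: fraction of recent queries that
    # share at least one word with the memory's description.
    words = set(mem.description.lower().split())
    hits = sum(any(w in words for w in q.lower().split())
               for q in recent_queries)
    return hits / max(len(recent_queries), 1)

def prune(memories: list[Memory], recent_queries: list[str],
          budget: int) -> list[Memory]:
    for m in memories:
        m.score = score_memory(m, recent_queries)
    # Keep only the `budget` most relevant memories; the rest are forgotten.
    return sorted(memories, key=lambda m: m.score, reverse=True)[:budget]

memories = [
    Memory("empty corridor at 3am"),
    Memory("scissors stored in kitchen drawer"),
    Memory("user watering plants on balcony"),
]
queries = ["where are the scissors", "which drawer has the scissors"]
kept = prune(memories, queries, budget=1)
print([m.description for m in kept])
```

The two-round design maps onto re-running `prune` with an updated query history after feedback: what the user actually asked about last reshapes what survives the next cut.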

Glossary
episodic memory: A record of specific past events — what happened, when, and where — as opposed to general knowledge or skills.
language model: An AI system trained on large amounts of text that can read, summarise, and reason about written content.
03 / 03

AI Research Assistants Still Fail When Asked to Read 1,500 Papers at Once

What happens when you ask a 'deep research' AI to synthesise not one paper but 1,500 of them at once?

There's a class of AI tools marketed specifically for scientific research — systems that claim to read literature, find connections, and synthesise findings across many sources. A team of researchers just built a rigorous stress test for them, and the results are humbling.

PaperScope was assembled from nearly 25,500 papers published on ArXiv and at major AI conferences between 2023 and 2025. The team built a knowledge graph — think of it as a map where ideas are towns and citations are the roads between them — connecting over 2,000 AI papers. From that map, they generated 2,400 questions covering tasks like summarising conflicting documents, spotting trends across dozens of sources, and diagnosing why a method underperformed. The hardest questions require synthesising more than 5,000 pages of material. That's not something you can answer by skimming titles.

Sixteen AI systems were tested, including OpenAI's Deep Research and Tongyi Deep Research. All of them scored poorly. The benchmark is deliberately designed so that correct answers require actually integrating information spread across hundreds of documents — no single source contains the full answer. Long-context retrieval and multi-source reasoning are where today's systems fall apart.

A few caveats worth naming: PaperScope is brand new, has zero independent citations yet, and was constructed by a single team whose methods haven't been peer-reviewed. It's possible some questions are better designed than others, and specific numerical scores aren't yet public in the version I read. But the directional finding — that AI research assistants struggle badly when evidence is scattered across hundreds of documents — matches what working researchers report daily. This is a plausible, useful stress test. Just not a final verdict.
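The map-of-towns-and-roads idea can be sketched in a few lines. The paper names and edges below are invented; PaperScope's actual graph schema is not public in the version described here:

```python
# Toy knowledge graph in the style described: papers as nodes, named
# relations ("cites", "contradicts") as directed edges. A multi-source
# question then becomes a graph traversal rather than a single-document
# lookup. All paper names and edges here are hypothetical.

from collections import defaultdict

edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

def add_edge(src: str, relation: str, dst: str) -> None:
    edges[src].append((relation, dst))

add_edge("PaperA", "cites", "PaperB")
add_edge("PaperB", "contradicts", "PaperC")
add_edge("PaperA", "cites", "PaperC")

def reachable(start: str) -> set[str]:
    """All papers reachable from `start`: the evidence set a synthesis
    question anchored at `start` might need to integrate."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, dst in edges.get(node, []):
            if dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

print(sorted(reachable("PaperA")))
```

The point of building questions from such a graph is that the answer's evidence set is known by construction — which is what lets the benchmark guarantee no single source contains the full answer.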

Glossary
knowledge graph: A structured map of concepts and the relationships between them, where nodes are ideas or entities and edges are named connections like 'cites,' 'contradicts,' or 'builds on.'
long-context retrieval: The ability of an AI system to find and use relevant information from a very large body of text — the challenge is that most systems lose track of details buried deep in long documents.
benchmark: A standardised test with known correct answers, used to measure and compare the performance of different AI systems on the same tasks.
The bigger picture

These three papers are measuring the same cliff from different angles. In banking, AI fails because it can't keep its own story consistent across multiple linked deliverables. In robotics, a long-running agent collapses under the weight of its own accumulated memory if it isn't taught what to discard. In scientific research, AI tools fall apart when evidence is scattered across hundreds of documents rather than neatly packaged in one place. What they share is a gap between handling a single, well-formed task and handling a task that requires sustained, coherent reasoning across many sources and many steps. Short demos look impressive. Extended deployments, high-stakes workflows, and genuinely multi-document questions expose the seam. None of these are new problems in principle. But having three separate teams measure the same gap — in banking, in robotics, in science — on the same day suggests this is the real frontier right now. Not raw capability. Sustained coherence.

What to watch next

The BankerToolBench paper names GPT-5.4 as a tested model — a version not yet widely public — which suggests this benchmark may be timing itself to coincide with an upcoming release cycle. Keep an eye on whether OpenAI or competing labs respond with updated agent performance claims. On the robotics side, the Armar-7 team's approach to relevance-based forgetting is the kind of component that tends to get absorbed into larger robotics frameworks quietly — worth tracking whether it appears in deployments over the next few months. The open question I'd most want answered: can any of today's systems pass PaperScope's hardest questions if given more time and more compute? Nobody has tested that yet.

Thanks for reading — and if you used an AI research assistant today, you now know exactly where to push back on it. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io