
DeepScience · Artificial Intelligence · Daily Digest

AI Agents Keep Failing When the Real World Pushes Back

Three papers this week measure exactly how far AI still is from doing real work reliably.
May 04, 2026
Today's batch handed me three papers that all, independently, ran experiments on AI agents doing actual tasks — and all three found the same thing: the gap between a polished demo and reliable execution is still uncomfortably wide. That's not doom, and it's not hype. Let me walk you through what the numbers actually say.
Today's stories
01 / 03

Robots plan better when they think in pictures and words together

Before this robot arm moves a single joint, it draws itself a storyboard — and that changes everything.

Imagine you're about to assemble flat-pack furniture you've never seen before. You could just start grabbing pieces, or you could first read the whole instruction sheet, mentally picture each stage, and sketch what the shelf should look like halfway through. The second approach is obviously better. A team behind a system called IVLR — Interleaved Vision-Language Reasoning — built a robot controller that works exactly this way. Before the robot arm moves at all, the AI generates a complete plan: not just in words ('pick up the red block, move it left') but also as a series of visual snapshots showing what each in-between state should look like. Text and pictures, alternating, like frames from a comic strip. The robot then executes against that pre-built storyboard.

The numbers from a benchmark called LIBERO-Long — tasks requiring many sequential steps in a row — are striking. The full interleaved system: 92.4% success. Strip out the visual snapshots, words only: 62%. Strip out the words, images only: 68.4%. No plan at all: 37.7%. The combination isn't just better — it's in a different league.

WHY IT MATTERS: Long chains of robot actions are notoriously fragile. Each small error compounds the next. A planning step that commits to a full sequence — and checks it against visual targets — gives the robot something to recover against when things drift.

THE CATCH: Every single test here ran in simulation, not on a physical robot in a real room. Simulated environments don't have dust, wobble, inconsistent lighting, or objects that roll differently than expected. The jump from simulated success to physical reliability is notoriously hard. This is a well-evidenced direction, not a deployed product.
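The fragility claim is easy to make concrete with back-of-envelope arithmetic (illustrative only; the per-step probabilities below are my assumptions, not numbers from the paper): if every step in a chain succeeds independently with probability p, the whole chain succeeds with probability p raised to the number of steps.

```python
# Illustrative arithmetic, not from the paper: why long action chains are fragile.
# If every step succeeds independently with probability p, an n-step task
# succeeds end-to-end with probability p ** n.

def chain_success(p: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return p ** steps

# A per-step success rate that sounds excellent still collapses over 30 steps:
print(round(chain_success(0.97, 30), 3))   # 0.401
# To keep a 30-step chain above ~90%, each step must be near-perfect:
print(round(chain_success(0.997, 30), 3))  # 0.914
```

That is the intuition behind a pre-committed plan with visual checkpoints: it gives the controller a way to detect and correct drift rather than letting small errors multiply silently.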

Glossary
LIBERO-Long: A standard simulated benchmark for robot arms, specifically testing tasks that require many sequential steps to complete.
interleaved: Mixed together in alternating order — here, text steps and visual snapshots taking turns in the plan.
02 / 03

AI coding agents only succeed at half of real scientific reproduction tasks

If AI can read a scientific paper and reproduce the experiment — great. If it can only do that 54% of the time, we have a problem.

Picture trying to bake a wedding cake from a newspaper review. The review tells you it had 'moist layers with a hint of vanilla,' but not the oven temperature, not how long to fold the batter, not which flour. That is roughly the situation an AI coding agent faces when handed a published research paper and told to reproduce the experiment. The researchers behind a benchmark called AUTOMAT curated 85 real tasks from computational materials science — think calculating how atoms arrange themselves in a new metal alloy, or simulating how electrons move through a crystal. Each task came from a published paper and was hand-verified by a domain expert. Then five different AI coding agent setups were asked to reproduce the findings.

The best agent succeeded on 54.1% of tasks. Just over half. Agents did noticeably better when they were handed the actual code or data files the original authors used — worse when they had to reconstruct the method from the paper's text alone. The three dominant failure modes were: missing steps that the paper described implicitly, doing things differently from the reference method, and code that crashed inside the specialized scientific software environment.

WHY IT MATTERS: Scientific reproducibility is already a crisis in many fields. If AI tools are supposed to help, they need to work reliably. A 54% pass rate is not reliable.

THE CATCH: Computational materials science uses unusually specialized toolchains. Failure rates in other domains — data analysis, biology, economics — might look different. And 85 tasks is a small sample. But 54% is hard to spin positively.
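The 'small sample' caveat can be quantified with a standard back-of-envelope interval (my sketch, using a normal approximation to the binomial; the paper itself may report uncertainty differently): with 85 tasks and a 54.1% pass rate, the plausible range is wide.

```python
import math

# Back-of-envelope uncertainty for a pass rate measured on few tasks
# (normal approximation to the binomial; an assumption, not from the paper).
def pass_rate_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a success proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

low, high = pass_rate_ci(0.541, 85)
print(f"{low:.3f} to {high:.3f}")  # 0.435 to 0.647
```

So 'about half' could plausibly be anywhere from the low forties to the mid sixties, which is one more reason a larger task set matters before drawing fine-grained conclusions.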

Glossary
computational materials science: A branch of science that uses computer simulations to predict how materials — metals, ceramics, semiconductors — will behave at the atomic level.
benchmark: A standardized test with defined tasks and scoring, used to compare AI systems on the same footing.
03 / 03

The best AI assistant fails one in three real office tasks

Thirteen of the most capable AI models available today were asked to do real office work — and none of them cleared 70%.

There is a version of the AI-assistant pitch that goes like this: describe a task, the AI handles it across your software tools, you go back to thinking. Scheduling is done. The HR form is filed. The data is synced. That pitch is now being tested with actual software and actual tasks — and the results are sobering.

Claw-Eval-Live was built by tracking what workflows real people most want automated — the top 500 skills requested on a platform called ClawHub. From that signal, researchers assembled 105 concrete tasks: scheduling meetings across systems, managing HR approvals, syncing data between different business tools, and so on. They then connected 18 real software services in a controlled sandbox and ran 13 of the best currently available AI models through each task. The top-scoring model cleared 66.7% of tasks. No model reached 70%. The hardest categories were HR workflows, management tasks, and anything requiring coordination across multiple systems at once. One finding worth sitting with: two models could score identically on overall pass rate while looking completely different in how thoroughly they finished each task — one scraping past the finish line, the other genuinely completing the job.

WHY IT MATTERS: Business process automation is one of the highest-stakes near-term applications of AI agents. If the best available systems fail a third of the time on a benchmark drawn from real demand, that gap needs to be closed before organizations can depend on these tools.

THE CATCH: 105 tasks from one platform's user base is a specific slice of all possible office work. And a binary pass/fail score hides a lot of texture. But a ceiling of 66.7% among today's frontier models is a concrete data point, not an opinion.
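The pass-rate-versus-thoroughness point is worth making concrete. Here is a minimal sketch (the per-task scores and the 0.8 pass threshold are hypothetical, invented purely for illustration; the benchmark's actual scoring rule may differ): two agents with identical pass rates and very different completion quality.

```python
# Hypothetical per-task completion fractions for two agents (invented numbers).
# A task "passes" when its completion score clears a threshold; the 0.8
# threshold is an assumption for illustration, not Claw-Eval-Live's actual rule.
PASS_THRESHOLD = 0.8

def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (pass rate, mean completion) for one agent's task scores."""
    pass_rate = sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
    mean_completion = sum(scores) / len(scores)
    return pass_rate, mean_completion

scraper  = [0.8, 0.8, 0.8, 0.8, 0.0, 0.1]  # barely clears the bar when it passes
finisher = [1.0, 1.0, 1.0, 1.0, 0.6, 0.7]  # thorough even on the tasks it fails

print([round(x, 3) for x in summarize(scraper)])   # [0.667, 0.55]
print([round(x, 3) for x in summarize(finisher)])  # [0.667, 0.883]
```

Identical pass rate, very different mean completion: a binary leaderboard would rank these two agents as equals, which is exactly the texture the benchmark's authors warn a single pass/fail number hides.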

Glossary
frontier model: One of the most capable AI language models publicly available at a given time, from companies like OpenAI, Anthropic, and Google.
sandbox: A controlled, isolated environment where software runs without affecting real data or systems — like a practice kitchen where nothing you burn actually matters.
The bigger picture

Here is what these three papers are collectively telling you: AI agents are genuinely capable of impressive things inside controlled conditions, and they fall apart at a consistent rate when conditions get real. The robot manipulation paper shows that better planning architectures can dramatically improve sequential task success — but only in simulation. The materials science paper shows that reading a document and correctly executing what it describes is still beyond current agents half the time. The workflow benchmark shows that even the best models today can't reliably handle the messy, multi-system, real-software environment of an actual office.

The pattern is not 'AI is failing.' The pattern is: AI works well when the environment is clean, the instructions are explicit, and the feedback is immediate. It struggles when any of those three conditions breaks down. That is a precise and useful thing to know — because it tells you exactly where the next few years of work have to go.

What to watch next

The embodied AI field is heading toward benchmarks on physical robots, not simulations — watch for results from teams testing manipulation systems in real lab environments later this year. On the agent side, several companies have announced agent evaluation frameworks for enterprise workflows; whether those numbers look better or worse than 66.7% will tell us a lot. The open question I'd most want answered: do these failure rates improve linearly as models get bigger, or is there a structural barrier that bigger alone won't fix?

Thanks for reading — these are the weeks where the data is more interesting than the press releases. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io