

DeepScience · Artificial Intelligence · Daily Digest

AI plays Mario, robots plan ahead, and agent skills have holes

Three papers show AI agents getting sharper at long, multi-step tasks — and one shows how badly that can go wrong.
May 04, 2026
Happy Monday. Three papers landed today that, taken together, tell a surprisingly coherent story: AI systems are learning to act across dozens or hundreds of steps in a row — and that new reach is exciting, fragile, and already being exploited. Let me walk you through each one.
Today's stories
01 / 03

Training an AI to beat Super Mario 100 moves at a time

What if teaching an AI to play Super Mario Land actually tells you something important about how robots will one day navigate a hospital?

Think of the difference between making a single left turn and driving across an unfamiliar city. The first is a one-shot decision. The second requires you to remember what happened ten minutes ago, adapt when a road is closed, and keep a destination in mind the whole time. Most AI systems are good at the left turn. The city drive is still hard.

A team from Odysseus tackled this by training a vision-language model — an AI that can see images and read text at the same time — to play Super Mario Land using reinforcement learning (RL), a training method where the AI gets rewarded for progress and penalised for mistakes over millions of tries. The game requires more than 100 consecutive decisions per run, which is a genuinely long horizon for current AI. The model they started with, Qwen3-VL-8B, scored 270 on average across the game's first five levels. After RL training with a special stabilising addition called a turn-level critic — a lightweight referee that scores each individual move rather than waiting for the game to end — the same model scored 1,512, more than a fivefold improvement. It also outperformed GPT-5.4 (which scored 310) and GLM-4.6V (513) without any game-specific training.

Here is the catch: Super Mario Land is a controlled world with clear rules and a visible score. Real tasks — planning a warehouse route, assisting in surgery, managing a calendar across a week — are messier and harder to reward automatically. The paper doesn't test those. What it does show is that RL training plus a smarter feedback structure can unlock long-horizon planning that raw model size alone does not buy you. That is a real and useful result.

Glossary
vision-language model (VLM): An AI model that can process both images and text as inputs, and generate text as output.
reinforcement learning (RL): A training method where an AI learns by trying actions, receiving rewards for good outcomes, and adjusting its behaviour over many repeated attempts.
turn-level critic: A lightweight scoring component that evaluates each individual step in a long sequence, rather than waiting until the end to assign a reward.
long-horizon decision-making: The ability to complete a task that requires many sequential decisions, where early choices affect what happens much later.
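To make the turn-level idea concrete, here is a minimal sketch, in Python, of the credit-assignment difference it addresses. This is illustrative only: the function names, reward numbers, and critic values are my own, not from the paper. The point is that an episode-level signal spreads one final score across every move, while a per-turn critic lets a single good move stand out even in a losing run.

```python
def episode_level_advantages(rewards):
    """Every turn gets the same signal: the total return for the run."""
    total = sum(rewards)
    return [total] * len(rewards)

def turn_level_advantages(rewards, critic_values):
    """Each turn is scored against the critic's estimate for that state,
    so individual good moves are credited even when the run goes badly."""
    return [r - v for r, v in zip(rewards, critic_values)]

# A toy 5-turn run: one rewarding move (turn 2) among neutral ones.
rewards = [0, 2, 0, 0, 0]
critic = [0.4] * 5  # the critic expects roughly 0.4 reward per turn

print(episode_level_advantages(rewards))       # every turn credited equally
print(turn_level_advantages(rewards, critic))  # only turn 2 gets positive credit
```

Over 100-plus turns, the episode-level version gives the same blurred signal to move 3 and move 97, which is exactly the instability the turn-level critic is meant to fix.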
02 / 03

Robots that think in pictures AND words plan tasks much better

Imagine trying to assemble flat-pack furniture using only written instructions — then imagine how much easier it gets when you can also picture each step before doing it.

That gap between reading instructions and actually visualising them is exactly what a research team addressed with a system they call Interleaved Vision-Language Reasoning, or IVLR. The idea is simple enough to state: before a robot arm starts a task, it generates a mental plan that alternates between text notes ('pick up the red cup') and visual snapshots of what each intermediate state should look like. Then it works through the task guided by that combined plan.

To test whether this mattered, the team ran experiments on LIBERO-Long, a simulation benchmark where a robot arm must complete multi-step manipulation sequences — the kind that involves opening a drawer, placing an object, and closing a lid, in order. Without any plan, the robot succeeded 37.7% of the time. With a text-only plan, it climbed to 62%. With a vision-only plan (snapshots but no descriptions), 68.4%. With the full interleaved plan — text and images woven together — it reached 92.4%. That jump from 68.4% to 92.4% by adding language on top of vision, and from 62% to 92.4% by adding vision on top of language, suggests the two types of representation aren't just complementary — they cover each other's blind spots in a way that neither can do alone.

The honest limit here is that all tests ran in simulation, not on a physical robot in a real room. Simulation benchmarks are useful, but they strip out the chaos of the physical world: dust, inconsistent lighting, objects that slip. Whether the interleaved plan survives contact with reality is the next question nobody has answered yet.

Glossary
LIBERO-Long: A simulation benchmark that tests whether a robot arm can complete long sequences of manipulation steps, like stacking and placing objects in a specific order.
manipulation: In robotics, the ability to physically move, pick up, or reposition objects using a robot arm or hand.
interleaved reasoning trace: A step-by-step plan that alternates between written descriptions of goals and visual snapshots of what each stage should look like.
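The shape of an interleaved trace is easy to picture in code. The sketch below is a hypothetical data structure, not IVLR's actual implementation: the class names and the snapshot identifiers are mine, and real snapshots would be generated images rather than string IDs. What it shows is the alternation the paper's ablations tested — text-only, vision-only, or both woven together.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextStep:
    instruction: str        # e.g. "pick up the red cup"

@dataclass
class ImageStep:
    snapshot_id: str        # placeholder for a generated goal-state image

Plan = List[Union[TextStep, ImageStep]]

def is_interleaved(plan: Plan) -> bool:
    """True if no two consecutive steps are of the same kind."""
    kinds = [type(step) for step in plan]
    return all(a != b for a, b in zip(kinds, kinds[1:]))

plan: Plan = [
    TextStep("open the drawer"),
    ImageStep("frame_001"),   # what the open drawer should look like
    TextStep("place the bowl inside"),
    ImageStep("frame_002"),
    TextStep("close the drawer"),
]

print(is_interleaved(plan))  # True
```

A text-only or vision-only plan is just this structure with one of the two step types dropped, which is exactly the comparison behind the 62% / 68.4% / 92.4% results above.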
03 / 03

More than half of AI agent 'skills' carry serious security risks

You wouldn't install a browser extension without knowing what it can access, so why trust an AI agent plugin with any less scrutiny?

AI agents are increasingly sold with add-on 'skills' — small programs that let the agent send emails, query databases, browse the web, or control software on your behalf. Think of them like apps on a phone, except the AI can trigger them automatically. A team of researchers built a tool called Semia to audit these skills at scale, and what they found should make you pause. Out of 13,728 real-world agent skills pulled from public marketplaces, more than half carried at least one critical semantic risk. These aren't typos or crashes — they're logical flaws: conditions buried in the prose description of a skill that, if exploited, could let a malicious actor hijack what the agent does. Standard code-checking tools miss these because the dangerous logic is written in plain English, not in code.

Semia works like a building inspector who reads not just the blueprints but also every sticky note left on the walls. It translates each skill's natural-language description into a structured representation, then runs 11 automated checks across seven known vulnerability types. On a labelled test set of 541 skills vetted by human experts, Semia achieved 97.7% recall — meaning it catches nearly all real problems — and an F1 score of 90.6%, a combined measure of accuracy and completeness. It also found 17 confirmed, deployable vulnerabilities in live skills, verified by the OpenClaw registry maintainers.

The catch: Semia itself relies on a language model to generate those structured representations, and its precision sits at 84.5% — meaning roughly one in six flagged skills is a false alarm. That's good enough to be a useful first filter, but not good enough to replace a human reviewer on anything critical.

Glossary
agent skill: A plug-in module that gives an AI agent the ability to perform a specific action, like sending an email or querying a database.
semantic risk: A security flaw that arises from the meaning or logic of a text description, not from a coding error that a traditional scanner would catch.
recall: The fraction of real problems that a detection system successfully identifies — high recall means few dangerous things slip through.
F1 score: A single number combining both precision (how often a flagged item is genuinely a problem) and recall (how many real problems are found).
zero-day vulnerability: A security flaw that is present in deployed software but not yet known to or patched by its developers.
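The three Semia numbers hang together arithmetically: F1 is the harmonic mean of precision and recall, and the reported precision (84.5%) and recall (97.7%) do reproduce the reported F1 of 90.6%. A quick check:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The paper's reported precision and recall for Semia.
f1 = f1_score(0.845, 0.977)
print(round(f1 * 100, 1))  # prints 90.6
```

The same precision figure is where the 'one in six false alarms' line comes from: 1 - 0.845 = 0.155, or roughly one flagged skill in six.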
The bigger picture

Here is what today's three papers, side by side, are quietly telling you. AI systems are crossing a threshold: they can now sustain coherent behaviour across dozens or hundreds of steps, whether that's navigating a video game, planning a robot's manipulation sequence, or executing tasks on your behalf through agent plugins. That is genuinely new, compared to a model that just answers one question at a time. But longer chains of action mean longer chains of things that can go wrong. Odysseus shows you need special training tricks just to keep RL stable across 100 turns. IVLR shows that without an explicit multi-modal plan, robots fall apart at step 10 of a 20-step task. And Semia shows that the infrastructure we've already built for agents — the marketplace of skills — is riddled with flaws nobody was looking for. The more capable the agent, the more consequential those flaws become. That's the pattern worth watching.

What to watch next

The agent security finding from Semia is likely to draw rapid follow-up: expect researchers to probe whether similar auditing approaches generalise to other plugin ecosystems, including tool-use frameworks in widely deployed assistants. On the robotics side, the next meaningful test for IVLR's interleaved planning is whether it survives transfer to a physical robot arm — keep an eye on upcoming results from groups using WidowX and similar hardware in real-world settings. The open question I'd most want answered: can the same long-horizon RL training that worked in Super Mario be made stable and efficient enough for a task that matters outside a game environment?

Thanks for reading — see you tomorrow. JB.
DeepScience — Cross-domain scientific intelligence
deepsci.io