DeepScience — Artificial Intelligence

DeepScience · Artificial Intelligence · Daily Digest

AI Sees but Can't Act, Fakes Security Know-How

Today's papers reveal a gap between what AI perceives and what it can do about it — and that gap has real consequences.

June 19, 2026

Three papers landed today that, taken together, made me stop and reread them twice. None of them is a triumph story. All three are honest about where something breaks. Let me walk you through what AI can see but not act on, why your code-vulnerability scanner may be a coin flip, and how a fleet of robot arms trained itself to near-perfection — no babysitter required.

Today's stories

01 / 03

AI Can Count the Problem but Can't Point to It

Humans ace this test at 98.8%, while the hardest-hit AI model loses more than 42 percentage points the moment you ask it to act on what it just saw.

Imagine you're watching a cooking show and someone asks: how many of the five dishes look burnt? You count three. Easy. Now they say: pick up the burnt pasta specifically, not the salad. Still easy for you: the human tester in this study got that second step right 98.8% of the time. For most AI models, that second step is where things fall apart.

The team behind the ROSE benchmark built 1,512 visual scenes and 7,560 paired tasks designed to hold everything constant (same image, same AI model) while switching from counting to acting. They ran nine of the leading multimodal models (AI systems that process both images and text) through both steps and measured the gap.

The numbers are blunt. Qwen-3.6-Plus drops 42.6 percentage points between the two tasks. Gemini 3.1 Pro drops 28.6. GPT-5.5 drops 9.5, the best of the bunch and still more than four times the human's 2.2-point drop. The researchers confirmed it wasn't a coordinate problem: the models weren't just failing to click in the right place. Something deeper breaks when you ask the model to apply what it noticed to a specific, conditioned action.

Why does this matter outside a benchmark? Because AI is being wired into tools that must do exactly this: flag a specific record, select a particular component, guide a robot hand. If the model can describe the scene but can't act on it reliably, you have something that looks competent in demos and fails in deployment.

The catch: these are deliberately simple, abstract visual tasks. Whether the same gap appears in medical imaging review or industrial inspection hasn't been tested here. But the gap's persistence (across models, across scene types, and on paired examples where the same model got the counting right) makes it hard to wave away.
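For readers who want to see how a paired gap like this is computed, here is a minimal sketch. It assumes each scene is scored twice for the same model, once on the counting question and once on the conditioned action. The record layout, function names, and toy data are hypothetical and are not taken from the ROSE paper or its code.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """One scene scored twice for the same model (hypothetical record layout)."""
    scene_id: str
    perceived_ok: bool  # the counting / perception question was answered correctly
    acted_ok: bool      # the conditioned action on the same scene was correct


def perception_action_gap(results):
    """Return (perception accuracy, action accuracy, gap), all in percentage points."""
    if not results:
        raise ValueError("need at least one paired result")
    n = len(results)
    perception = 100.0 * sum(r.perceived_ok for r in results) / n
    action = 100.0 * sum(r.acted_ok for r in results) / n
    return perception, action, perception - action


def action_failure_given_perception(results):
    """Share (%) of scenes where the model perceived correctly but still failed to act."""
    seen = [r for r in results if r.perceived_ok]
    if not seen:
        return 0.0
    return 100.0 * sum(not r.acted_ok for r in seen) / len(seen)


# Toy data invented for illustration only; these are not benchmark results.
toy = [
    PairedResult("s1", True, True),
    PairedResult("s2", True, False),
    PairedResult("s3", True, False),
    PairedResult("s4", False, False),
]

perception, action, gap = perception_action_gap(toy)
print(f"perception {perception:.1f}%, action {action:.1f}%, gap {gap:.1f} points")
print(f"failed to act despite perceiving: {action_failure_given_perception(toy):.1f}%")
```

The second function is the more telling one: it only looks at scenes the model already got right on the perception step, which is the kind of paired comparison that makes the gap hard to blame on not seeing the scene.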

Glossary

multimodal model — An AI system that takes both images and text as input, rather than text alone.

perception-to-action gap — The difference in performance between identifying something in an image and taking a targeted action based on that identification.

Source: ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

02 / 03

AI Code-Security Tools Are Barely Better Than a Coin Flip

The best AI system for catching dangerous flaws in software code scores 52.1% — just 2.1 points above random chance.

Picture a smoke detector that is right 52% of the time about whether there is actually a fire. You would not install it. That is roughly the performance ceiling of current AI systems when asked to identify real security flaws in software, according to researchers who built a carefully verified benchmark called CWE-Trace.

The team constructed 834 pairs of vulnerable and patched Linux kernel code samples drawn from 417 real security reports, verified by two human reviewers. They then tested eight standard large language models and 15 fine-tuned versions. Fine-tuning is the standard technique for specialising a general AI model on a specific task: you update a small slice of its parameters, a bit like adjusting the tension on one string of a guitar rather than rebuilding the instrument.

The headline finding: fine-tuning mostly shifts the model's output threshold without changing what it actually understands. The researchers call this 'calibration without comprehension.' The model adjusts where it draws the line between 'vulnerable' and 'safe', and sounds more confident doing so, without actually learning to reason about the code differently. Asked to name the specific type of flaw (from a standard catalogue of vulnerability categories), models score below 1.3% accuracy. And in 84% of cases where a model had supposedly been trained on the vulnerable code, it carried no usable memory of it.

The catch: this is one benchmark, one codebase (Linux kernel), one detection approach. The finding is not that AI is useless for security. It is that training general models on existing vulnerability datasets, the dominant current approach, is not producing real understanding. That is a narrower, more actionable conclusion. It points toward needing better training data and, probably, architectures designed for code reasoning from the ground up.
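To make 'calibration without comprehension' concrete, here is a toy sketch. It assumes a detector that gives each code sample a score and flags anything at or above a threshold. Fine-tuning that only recalibrates is modelled, as a deliberate simplification, by adding a constant to every score. The scores, labels, and function names are invented for illustration and do not come from the paper.

```python
def flagged_rate(scores, threshold):
    """Fraction of samples the detector labels 'vulnerable'."""
    return sum(s >= threshold for s in scores) / len(scores)


def pairwise_ranking(scores, labels):
    """AUC-style score: how often a vulnerable sample outranks a patched one.

    0.5 is chance level and 1.0 is perfect separation; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: 1 = vulnerable, 0 = patched.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.62, 0.60, 0.41, 0.45, 0.70, 0.66, 0.38, 0.35]

# Assumption: "calibration only" fine-tuning shifts every score by a constant.
shifted = [s + 0.2 for s in scores]

print("flagged before:", flagged_rate(scores, 0.5))
print("flagged after: ", flagged_rate(shifted, 0.5))
print("ranking before:", pairwise_ranking(scores, labels))
print("ranking after: ", pairwise_ranking(shifted, labels))
```

In this toy case the shifted detector flags every sample instead of half of them, yet its ability to rank vulnerable code above patched code does not move at all. That is the shape of the problem the researchers describe: the line moved, the understanding did not.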

Glossary

fine-tuning — Adapting a general AI model to a specific task by updating a portion of its internal settings on task-specific examples.

CVE — Common Vulnerabilities and Exposures — the official catalogue of publicly known software security flaws.

CWE — Common Weakness Enumeration — a taxonomy of software vulnerability types (e.g., buffer overflow, SQL injection).

Source: Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

03 / 03

Eight Robot Stations Taught Themselves, No Human Required

Eight two-armed robot stations, no human watching, thousands of self-directed practice runs: the robots ended up hitting 99% success on delicate assembly tasks.

Nobody was in the room. A coding AI wrote a training program, sent it to a fleet of eight bimanual robot stations, and watched whether each robot succeeded or failed at inserting a pin into a hole, cutting a zip tie, or slotting a GPU card into a motherboard. It noted what went wrong, rewrote the code, and tried again. Over and over. By the end, the robots were hitting 99% success on the dexterous tasks, without a human adjusting anything in between.

The system, called ENPIRE, was built by a team that tested three frontier AI coding agents (GPT-5.5, Claude Opus 4.7, and Kimi K2.6) against the same robot fleet. The key design insight is the closed loop itself: automated scene reset, automated success checking, automated refinement. Think of it like a student who fails a practice exam, reviews only their wrong answers, rewrites their study notes, and retakes the test, except the student is also the tutor, working overnight, across eight parallel exam rooms at once.

The biggest single improvement came from one specific technique the AI discovered worked best: behavioural cloning regularisation, which added 10.8 percentage points of success rate. Scaling from one to eight parallel robot agents meaningfully cut the wall-clock time to reach high performance.

The catch: these are structured, repeatable factory-style tasks on purpose-built hardware, not a robot improvising in your kitchen. A human still had to set up the initial environment, define the safety rules, and wire up the feedback sensors. 'No human in the loop' means no human needed during the improvement phase, not during setup. That is a real and important distinction. But the improvement phase used to require constant human judgment. Now, apparently, it does not.
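The closed loop (reset, check, refine) can be sketched as a toy simulation. Everything here is a stand-in: the robot is replaced by a seeded coin flip, and the agent rewriting its training code is replaced by a made-up rule that removes a fixed share of the remaining failures each round. None of the names or numbers come from ENPIRE.

```python
import random


def run_trial(skill, rng):
    """Hypothetical stand-in for one robot attempt: succeeds with probability `skill`."""
    return rng.random() < skill


def improvement_loop(stations=8, rounds=12, trials_per_station=25, seed=0):
    """Toy closed loop: roll out on all stations, check success, refine, repeat."""
    rng = random.Random(seed)
    skill = 0.30  # invented starting success probability
    history = []
    for _ in range(rounds):
        # 1) Automated scene reset and rollout: every station runs its trials.
        outcomes = [run_trial(skill, rng) for _ in range(stations * trials_per_station)]
        # 2) Automated success checking: measure the pooled success rate.
        history.append(sum(outcomes) / len(outcomes))
        # 3) Automated refinement: toy rule standing in for the agent rewriting
        #    its code. It removes 30% of the remaining failure probability.
        skill = min(0.99, skill + 0.3 * (1.0 - skill))
    return history


history = improvement_loop()
for round_number, rate in enumerate(history, start=1):
    print(f"round {round_number:2d}: measured success {rate:.2f}")
```

The structure is the point, not the numbers: no step in the loop waits for a person, and adding stations only increases how many trials each round can collect in the same wall-clock time.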

Glossary

behavioural cloning regularisation — A training technique that keeps a robot's new learned behaviour anchored to successful demonstrations, preventing it from drifting into worse strategies while improving.

bimanual robot — A robot with two arms, designed to handle tasks that require coordinating both hands simultaneously.

dexterous manipulation — Fine motor tasks requiring precise control — inserting pins, cutting small objects — as opposed to simple pick-and-place.

Source: ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

The bigger picture

Put these three papers next to each other and a specific picture emerges: not of AI racing forward, but of AI bumping into the same wall from different directions. ROSE shows that models can perceive but struggle to act on what they perceive. The vulnerability paper shows that models can appear to know something (answer confidently, return plausible outputs) without actually understanding the underlying structure. Both are versions of the same problem: performance that looks like competence until you test the part that matters in the real world.

ENPIRE is the counterpoint, and it is a real one. In a tightly controlled physical setting, the gap between perception and action was closed, not by smarter reasoning but by a relentless automated feedback loop. That is worth watching. It suggests that the path around the reasoning wall may be iteration at scale rather than cleverness at design. Whether that path generalises beyond structured factory tasks is the open question that matters most right now.

What to watch next

The ROSE benchmark is public, so expect other teams to run their own models against it and report gaps in the coming weeks — the numbers will move. On the security side, the CWE-Trace dataset is also available, and it would be worth watching whether any team proposes a genuinely different architecture (rather than more fine-tuning) and shows a meaningful jump above 52%. For ENPIRE, the interesting follow-up question is whether the same self-improvement loop works on tasks where success isn't automatically verifiable — that is the line between factory robotics and anything more open-ended.