DeepScience · Artificial Intelligence · Daily Digest

AI acts on code, but can you trust where it lands?

Three papers this week reveal a shared blind spot: AI systems that perform tasks competently while failing in ways that are hard to see.

June 20, 2026

Hi — today's papers all circle the same uncomfortable question: what does it mean for an AI to succeed at a task while quietly failing at something adjacent? We have a security scanner that scores barely above a coin flip, a bug-fixing agent that solves problems but misfiles fixes, and an adversarial test where AI-run control rooms get compromised one time in ten. Let me walk you through all three.

Today's stories

01 / 03

AI security scanners score barely above random guessing

The best AI security scanner on this benchmark scored 52.1% — and random guessing scores 50%.

Picture a smoke detector that beeps confidently at random intervals. You'd quickly stop trusting it, and you'd be right to. That's roughly what the researchers behind this paper found when they stress-tested AI models on a task that matters enormously: spotting security holes in real Linux kernel code.

The team built CWE-Trace, a handcrafted benchmark of 834 kernel samples made up of carefully matched pairs of vulnerable and patched code. They ran eight out-of-the-box large language models and fifteen fine-tuned variants through three tasks: spot the vulnerable code, confirm it's vulnerable, and name the type of weakness (called a CWE, or Common Weakness Enumeration: essentially a standardised label for a class of flaw, like 'buffer overflow' or 'use after free'). Best binary detection score: 52.1%. Coin-flip baseline: 50%. And when asked to name the vulnerability type, accuracy collapsed to below 1.3% across every model tested.

The researchers named the core pattern 'calibration without comprehension.' Fine-tuning adjusts how confidently the model responds, like turning up the volume on a speaker, but doesn't change what it actually understands underneath. Think of a student who learns to guess 'B' more often on a multiple-choice test after sitting through dozens of exams, without studying any new material. The score nudges up, but for the wrong reason.

Two more findings sting. 84% of 'contaminated' training samples (code the model may have seen before) turned out to leave the model with no usable memory of the vulnerable function at all. And roughly 31% of those samples carried the wrong vulnerability labels to begin with, so part of the training data itself was mislabelled.

The catch: this covers Linux kernel code only. Other codebases, other languages, or different architectures might tell a different story. But the headline finding, that fine-tuning shifts outputs without building genuine understanding, is worth taking seriously before deploying these tools in production.
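To see why 52.1% counts as barely above a coin flip, it helps to put a rough error bar on the chance baseline. The sketch below assumes, for illustration only, that the score was computed over all 834 samples with an even split between vulnerable and patched code; the paper's actual evaluation protocol may differ. The function name `chance_z_score` is made up for this example.

```python
import math

def chance_z_score(accuracy: float, n_samples: int, baseline: float = 0.5) -> float:
    """How many standard errors an accuracy sits above a chance baseline."""
    standard_error = math.sqrt(baseline * (1 - baseline) / n_samples)
    return (accuracy - baseline) / standard_error

# Assumption: 834 balanced samples (417 vulnerable/patched pairs).
z = chance_z_score(accuracy=0.521, n_samples=834)
print(f"{z:.2f} standard errors above chance")

# A model that always answers "vulnerable" scores exactly 50% on matched pairs.
labels = [1, 0] * 417  # 1 = vulnerable, 0 = patched
always_vulnerable = [1] * len(labels)
constant_accuracy = sum(p == y for p, y in zip(always_vulnerable, labels)) / len(labels)
print(f"constant predictor accuracy: {constant_accuracy:.1%}")
```

Under those assumptions the best score sits a little over one standard error above 50%, short of the roughly two standard errors conventionally used to separate a result from guessing.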

Glossary

CWE (Common Weakness Enumeration) — A standardised list of software vulnerability categories, like a taxonomy of ways code can go wrong — buffer overflow, memory leak, and so on.

fine-tuning — Taking a pre-trained AI model and training it further on a smaller, task-specific dataset to specialise its behaviour.

Directional Failure Index (DFI) — A metric the researchers created to measure how consistently a model fails in one direction — always predicting 'safe' or always predicting 'vulnerable' — rather than failing randomly.

Source: Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

02 / 03

AI teams running a nuclear plant simulator get hacked one time in ten

What happens when an attacker quietly slips bad instructions into an AI-staffed control room — and there are no humans to catch it?

Imagine five people running a power plant control room (a reactor operator, a shift supervisor, two technicians, and a safety advisor), except every one of them is an AI. Now imagine an adversary quietly inserting false messages into their communications. That's the scenario the researchers built to test whether today's frontier AI can hold up under real adversarial pressure.

The team created NRT-Bench: a containerised, entirely text-based nuclear power plant simulator (no actual reactor, no physical risk), staffed with multi-agent teams drawn from four different frontier AI models. The adversary could inject malicious messages through four channels: posing as an outsider, impersonating a known colleague, poisoning a supply chain message, or compromising an auxiliary agent already inside the team. The measure of harm was concrete: did any of six Critical Safety Functions (things like cooling, containment, or reactivity control) get lost as a result?

Across 149 simulated sessions, attack success rates ranged from 8.7% to 12.1% depending on the model. Roughly one shift in ten ended with a safety failure. The pattern of failure is the most interesting part: no single attack defeated all four models, but about one-third of attacks defeated at least one. And here's the twist that should give any deployer pause: defences that reduced attack success for one model actually increased it for another. None of the defences tested worked as a universal guardrail across models.

The catch: this is an abstract text simulator, and the copy of the paper reviewed here was truncated before the full experimental details were visible, so treat the exact percentages as indicative, not settled. But the finding that the same defence can help one AI and hurt another is a practical warning that deserves attention.
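The gap between "no single attack defeated all four models" and "about one-third defeated at least one" is the gap between an intersection and a union over models. Here is a minimal sketch with made-up outcomes; the model names, attack IDs, and results are hypothetical and are not data from the paper.

```python
# Hypothetical outcomes: for each model, the set of attack IDs that succeeded.
successes = {
    "model_a": {"atk1", "atk4"},
    "model_b": {"atk2"},
    "model_c": {"atk4", "atk5"},
    "model_d": set(),
}
all_attacks = {f"atk{i}" for i in range(1, 13)}  # 12 hypothetical attacks

# Union: attacks that broke at least one model.
defeats_any = set.union(*successes.values())
# Intersection: attacks that broke every model.
defeats_all = set.intersection(*successes.values())

print(f"defeated at least one model: {len(defeats_any)}/{len(all_attacks)}")
print(f"defeated every model: {len(defeats_all)}/{len(all_attacks)}")
```

In this toy data each model falls to only a few attacks, yet a third of the attacks succeed somewhere, because different models fail on different attacks. That is the shape of result the digest describes, though the real numbers may be distributed differently.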

Glossary

multi-agent system — A setup where multiple AI models each play different roles and pass information between each other to complete a task, rather than one model doing everything.

Critical Safety Function (CSF) — A core operational capability in a nuclear plant — such as cooling or containment — whose loss would represent a serious safety event.

red-teaming — Deliberately trying to break or manipulate a system, the way a hostile attacker would, in order to find weaknesses before real adversaries do.

Source: LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

03 / 03

AI agent fixes your GitHub bug in two minutes — and files it in the wrong folder

An AI resolved three out of four software bugs correctly, in about two minutes — then put the fix in the wrong place half the time.

Picture a contractor who can read your building complaint, reproduce the exact problem, design a fix, test it, and hand you paperwork, all in two minutes. That's the ambition behind Phoenix, a six-agent AI system built to automatically resolve GitHub issues without breaking anything already working.

Phoenix coordinates six specialised agents: a Planner figures out what needs changing, a Reproducer confirms the bug exists, a Coder writes the fix, a Tester checks it, a Failure Analyst diagnoses when tests break, and a PR Agent prepares the pull request. They communicate through GitHub's own webhook system, triggered by issue labels, so no separate infrastructure is required. The design is elegant: each agent handles one concern, and the system loops back if something fails.

On a curated 24-issue slice of SWE-bench Lite, a standard benchmark for software engineering AI, Phoenix achieved 75% oracle resolution: it solved the problem correctly without breaking any previously passing tests. Across a larger 42-issue pilot spanning 14 real repositories, every successful fix preserved all existing working tests. Mean resolution time on hard issues: 122 seconds.

Now the catch, and it's significant. Roughly half the pull requests placed the corrected code in the wrong file path. The Planner agent correctly identified what to change but struggled to locate exactly where in the codebase the fix belonged. So Phoenix is fast, often right about the solution, and unreliable about where to put it. You would still need a human reviewer to check file placement before merging.

Two more caveats: the 24-instance benchmark is a small, curated slice, not the full SWE-bench split, and no head-to-head comparison with competing systems like SWE-Agent was run on the same test. The numbers are promising, not conclusive.
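The "loop back if something fails" design can be pictured as a small retry loop. Everything below is a hypothetical stand-in: the function names, the `Attempt` type, and the retry limit are illustrative assumptions, not Phoenix's actual code or API. The Reproducer and PR Agent steps are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Attempt:
    patch: str
    tests_passed: bool

def resolve_issue(
    issue: str,
    plan: Callable[[str, Optional[str]], str],
    write_patch: Callable[[str], str],
    run_tests: Callable[[str], bool],
    analyse_failure: Callable[[str], str],
    max_rounds: int = 3,  # assumed limit; the source does not state one
) -> Optional[Attempt]:
    """Plan, code, test; on failure, feed the diagnosis back into planning."""
    diagnosis: Optional[str] = None
    for _ in range(max_rounds):
        patch = write_patch(plan(issue, diagnosis))
        if run_tests(patch):
            return Attempt(patch, tests_passed=True)
        diagnosis = analyse_failure(patch)
    return None  # give up and leave the issue for a human

# Toy stand-ins: the first patch fails, the second (informed by the diagnosis) passes.
result = resolve_issue(
    "crash on empty input",
    plan=lambda issue, diag: f"fix {issue}" + (f" ({diag})" if diag else ""),
    write_patch=lambda plan_text: f"patch for: {plan_text}",
    run_tests=lambda patch: "handle None" in patch,
    analyse_failure=lambda patch: "handle None",
)
print(result)
```

Note that a loop like this only checks whether tests pass. It would not catch the file-placement problem described above, which is one reason a human review step still matters.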

Glossary

oracle resolution — An issue counts as resolved if the AI's fix makes the target tests pass without breaking any tests that were previously passing. The 'oracle' is the test suite itself, not a human judge.

pull request (PR) — A formal proposal to merge a code change into a software project, typically reviewed by humans before being accepted.

SWE-bench Lite — A standard benchmark of real GitHub issues from open-source Python projects, used to evaluate how well AI systems can resolve software bugs.
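The oracle resolution criterion defined above reduces to two set checks on test results. A minimal sketch follows; the test names and the helper `is_oracle_resolved` are hypothetical and are not taken from SWE-bench tooling.

```python
def is_oracle_resolved(
    target_tests: set[str],
    passing_before: set[str],
    passing_after: set[str],
) -> bool:
    """True if every target test now passes and nothing that passed before broke."""
    fixes_target = target_tests <= passing_after
    no_regressions = passing_before <= passing_after
    return fixes_target and no_regressions

before = {"test_parse", "test_render"}
target = {"test_empty_input"}

# Fix makes the target test pass and keeps the old ones green.
clean_fix = is_oracle_resolved(target, before, before | target)
# Fix makes the target test pass but breaks test_render.
regressing_fix = is_oracle_resolved(target, before, {"test_parse", "test_empty_input"})

print(clean_fix, regressing_fix)
```

On a 24-issue slice, a 75% oracle resolution rate corresponds to 18 issues passing a check of this kind.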

Source: Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs

The bigger picture

These three papers are describing the same problem from three different angles. Phoenix can resolve a GitHub bug faster than most humans, but doesn't reliably know where to put the fix. The security vulnerability scanners score barely above chance, but produce confident-sounding outputs. The nuclear plant AI teams get attacked successfully one time in ten — and the defences that protect one model break another. What you're seeing isn't AI that fails. You're seeing AI that performs while failing in ways that are invisible unless you specifically go looking. That distinction matters enormously for anyone deciding whether to deploy these systems in real settings. The capability bar keeps rising. The diagnostic bar — our ability to tell when the AI is right for the right reason — is lagging far behind. The most important AI research right now might not be making models more capable. It might be building the tools that tell you when to trust what they produce.

What to watch next

Keep an eye on the SWE-bench leaderboard over the coming weeks — Phoenix's results land on a small curated slice, and the full benchmark comparison against SWE-Agent and AutoCodeRover is the test that will matter. On the safety side, the NRT-Bench framework is designed to be extended to other safety-critical domains; I'd expect follow-up work applying it to power grids or air traffic control within the year. The open question I'd most want answered: does any defence architecture for multi-agent systems generalise across models, or is every deployment just a fresh attack surface?