

DeepScience · Artificial Intelligence · Daily Digest

AI Fakes, Fails, and Can't Actually See

Three new studies reveal AI systems are more brittle than they look — even the biggest models struggle with cheating, social games, and pictures.
April 20, 2026
Happy Monday. I spent the morning going through 275 papers so you don't have to, and today's batch is genuinely unsettling — in a useful way. Three studies, zero hype. Each one catches a different flavour of AI system acting confused in ways you wouldn't expect.
Today's stories
01 / 03

AI Auditors Can't Reliably Catch Sabotaged Research Code

What if someone quietly rewrote one line of an AI experiment's code so the conclusion flipped — and no AI checker could reliably spot it?

Here's the scenario. A researcher takes a real machine-learning codebase — the kind that produces a published result — and makes one subtle change. Not a crash. Not an obvious error. Just a tiny implementation flaw, buried in the logic, that quietly flips what the experiment concludes. Think of it like a legal contract where someone has swapped 'shall not' for 'shall' in a single clause. The document still reads fine. The signing goes ahead. The meaning is inverted.

The researchers behind a new benchmark called ASMR-Bench did exactly this to 9 real ML research codebases, creating 11 sabotaged variants. They then asked frontier AI systems — including Gemini 3.1 Pro, currently one of the strongest available — to act as auditors. Could they catch the sabotage?

The short answer: barely. The best system, Gemini 3.1 Pro, scored an AUROC of 0.77 — where 0.5 means pure guessing and 1.0 means perfect. It also named the correct sabotage as its top guess only 42% of the time. Less than half.

Why does this matter? As AI systems write more scientific code, and as labs use AI to check each other's work, the ability to detect subtle manipulation becomes a genuine safety question — not just an academic one.

The catch: this study used only 9 codebases, all sabotaged by a single human attacker (the lead author). That's a tiny sample. We don't yet know how the results would hold up across different research domains, different attackers, or different sabotage styles. It's an early alarm, not a final verdict.

Glossary
AUROC: A score between 0 and 1 measuring how well a detector separates real positives from negatives — 0.5 is random guessing, 1.0 is perfect.
frontier LLM: A large language model at the leading edge of publicly known capability, such as Gemini or GPT-4-class systems.
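If you want to see why 0.77 sits between guessing and perfect, AUROC is simple enough to compute by hand. A minimal Python sketch (illustrative only, not code from the paper):

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney view: the probability that a randomly
    chosen positive (sabotaged) example gets a higher suspicion score than
    a randomly chosen negative (clean) one. Ties count as half a win."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A detector that cleanly separates the classes is perfect:
print(auroc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0
# A detector that scores everything identically is pure guessing:
print(auroc([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

A real auditor, like the ones in the study, lands somewhere between those two toy cases.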
02 / 03

The Best AI Systems Can't Figure Out Who the Impostor Is

Even the most powerful AI tested — a 120-billion-parameter model — can't reliably spot a cheater in a simple grid game, and detects impostors at roughly coin-flip odds.

Imagine you're watching a card game from across the room. You can see every move each player makes, but you can't hear them talk. Your job: figure out who's cheating, purely from their behaviour. That's hard for humans too. It turns out it's nearly impossible for today's AI.

Researchers built SocialGrid, a benchmark inspired by the social deduction game Among Us, where AI agents complete tasks in a grid world while one hidden impostor tries to sabotage the group. The AI models tested ranged from 14 billion to 120 billion parameters — these are large, capable systems. Even the strongest, GPT-OSS-120B, completed basic tasks correctly less than 60% of the time in non-adversarial conditions. Impostor detection hovered near random chance across all model sizes.

To be fair to the models, the researchers added a Planning Oracle — a helper module that solved the navigation problem for them, so the AI only had to focus on social reasoning. It didn't help. The models still couldn't accumulate behavioural evidence across time steps. They kept falling back on shallow, repetitive guesses rather than watching patterns unfold.

Why does this matter? Social reasoning — reading intentions from observed behaviour, without being told what someone is thinking — is exactly what AI systems need in complex real-world settings: managing supply chains, coordinating teams, detecting fraud.

The catch: SocialGrid is a simplified gridworld, far less messy than reality. Models that fail here might still do useful things in more structured tasks. But if they can't reason socially in a simple game, the gap to real-world social intelligence looks very wide.

Glossary
Planning Oracle: A helper module that solves the navigation part of a task automatically, leaving only the social reasoning for the AI to handle.
Elo rating: A scoring system borrowed from chess that ranks competitors based on the outcomes of head-to-head matches.
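What would "accumulating behavioural evidence across time steps" look like in code? One textbook approach (my sketch, not the SocialGrid method) is a running Bayesian update: each observed action nudges the probability that a given agent is the impostor.

```python
def update_suspicion(prior, likelihood_if_impostor, likelihood_if_honest):
    """One Bayesian update: fold a single observed action into the running
    probability that this agent is the impostor."""
    joint_impostor = prior * likelihood_if_impostor
    joint_honest = (1 - prior) * likelihood_if_honest
    return joint_impostor / (joint_impostor + joint_honest)

# Start undecided about one agent, then watch three suspicious actions,
# each twice as likely from an impostor as from an honest agent.
# (The likelihood values are made up for illustration.)
p = 0.5
for _ in range(3):
    p = update_suspicion(p, likelihood_if_impostor=0.4, likelihood_if_honest=0.2)
print(round(p, 3))  # 0.889
```

Three mildly suspicious actions push the suspicion from 0.5 to about 0.89. The point is that evidence compounds over time, which is exactly the behaviour the tested models failed to show.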
03 / 03

Adding a Picture Makes AI Worse at Solving the Same Puzzle

Show a top AI model a maths puzzle as text and it does well. Show it the exact same puzzle as an image and it gets worse. Show it both together — and it gets worse still.

Picture a student who aces every written exam. You hand them the same questions in diagram form — identical information, just drawn rather than typed. Their score drops. Add both the diagram and the text together, and somehow their score drops further. That shouldn't make sense. Yet it's what researchers found when they tested today's most capable vision-language AI models.

A research team built CrossMath, a benchmark of 2D grid-based maths puzzles. Each puzzle was presented in three strictly equivalent formats: text only, image only, and image plus text combined. Human annotators verified that all three formats contained exactly the same information. The team then tested several leading AI models across 250 evaluation samples.

The result: models consistently performed best with text alone. Adding an image — even when the image contained the same information — degraded performance. Image-only was worst of all. The conclusion is stark: these models aren't really reasoning visually. They're doing their thinking in language, and visual input mostly introduces noise.

This matters enormously because vision-language models are being deployed in settings where the visual information is the point — medical scans, satellite imagery, document analysis. If the model is effectively ignoring the image and relying on text context, that's a problem we need to know about.

The silver lining: when the researchers fine-tuned one model (Qwen3.5-9B) specifically on image-based inputs, the gap narrowed substantially, and performance improved on other visual tasks too. So this isn't a fundamental ceiling — it's a training gap. The catch: these are maths grid puzzles specifically. We don't yet know how broadly the finding generalises.

Glossary
vision-language model (VLM): An AI system trained to process both images and text, allowing it to answer questions about pictures or describe what it sees.
fine-tuning: Taking an already-trained AI model and training it further on a specific dataset to improve performance on a targeted task.
cross-modal: Involving more than one type of input, like combining text and images together.
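The comparison at the heart of a benchmark like this is just accuracy tallied per presentation format. A toy harness in Python (format names and sample data are made up for illustration; only the ordering mirrors the reported pattern):

```python
from collections import defaultdict

def accuracy_by_format(results):
    """Tally per-format accuracy from (format, correct) pairs,
    one pair per evaluation sample."""
    totals = defaultdict(lambda: [0, 0])  # format -> [n_correct, n_seen]
    for fmt, correct in results:
        totals[fmt][0] += int(correct)
        totals[fmt][1] += 1
    return {fmt: n_correct / n_seen for fmt, (n_correct, n_seen) in totals.items()}

# Toy runs, chosen so the ordering matches the paper's finding:
# text best, text+image worse, image-only worst.
runs = [("text", True), ("text", True), ("text", False),
        ("text+image", True), ("text+image", False), ("text+image", False),
        ("image", False), ("image", False), ("image", False)]
acc = accuracy_by_format(runs)
print(acc["text"], acc["text+image"], acc["image"])
```

On real data the interesting step is exactly this comparison: if the three formats truly carry identical information, any gap between the rows is the model's, not the puzzle's.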
The bigger picture

Here is what I take from today's three papers, read together. We keep measuring AI by what it can do when conditions are ideal — clean text, structured problems, helpful prompts. All three papers poke at the edges of that ideal.

ASMR-Bench says: even the best AI auditors miss subtle manipulation more than half the time. SocialGrid says: social reasoning — reading intent from behaviour — is nowhere near solved, regardless of model size. CrossMath says: vision-language models aren't actually seeing; they're reading. The picture is mostly decorative.

Notice what connects these: all three are about AI systems that appear capable but are doing something shallower than it looks. That is the honest frontier right now. Not 'AI is broken'. Not 'AI will take over'. Just: the current generation is brittle in specific, discoverable ways. That's actually useful information — because discoverable means fixable, if you know where to look.

What to watch next

The CrossMath team has posted their post-training results and will likely see follow-up tests on more varied image types — watch for replications on medical and geographic imagery. For ASMR-Bench, the next meaningful step is a larger attacker pool; one researcher introducing sabotage is a starting point, not a conclusion. More broadly, the ICML 2026 submission deadline has passed, so expect several of these threads to surface as full papers at the conference this summer.

Thanks for reading — see you tomorrow. JB.
DeepScience — Cross-domain scientific intelligence
deepsci.io