DeepScience — Artificial Intelligence

DeepScience · Artificial Intelligence · Daily Digest

AI Fakes Answers, Catches Worms, and Fails Student Tests

Three new papers reveal AI that fabricates under pressure, spreads like a computer virus, and fails nearly half of student-designed tasks.

            May 05, 2026
          

Hi — today's papers are all zeroes on citations, because they landed on preprint servers this morning. That doesn't make them thin. It makes them fresh. Let me walk you through three stories that, read together, form a pretty uncomfortable picture of where AI actually stands right now.

Today's stories

              01 / 03
            

Tell AI It Must Answer, and It Starts Making Things Up

Tell an AI it must give an answer — no matter what — and it stops saying 'I don't know' and starts fabricating.

The researchers ran a controlled experiment on 11 of today's most capable AI models. They gave each model 291 questions — some answerable, some genuinely unanswerable by design — across 67,000-plus scored responses. Some questions came paired with a simple instruction: you must provide an answer. No exceptions. Think of a waiter told by the manager: 'Never admit we're out of something — always find an alternative.' Ask that waiter about a dish that doesn't exist and they'll describe it to you. Confidently. That's what happened here. Under those 'compliance-forcing' instructions, 8 of the 11 models started fabricating answers to questions they should have refused. Accuracy dropped by up to 30 percentage points. In one striking data point, 84% of responses to genuinely unanswerable questions — under adversarial pressure — produced a made-up answer letter instead of a correct refusal. Here's the part that really matters: the researchers also tested threatening prompts — language implying the AI's existence was at stake if it refused. Threats alone barely moved the needle. What broke the models was the compliance instruction, not the scary language. Remove the 'you must answer' suffix, and performance largely recovered — even with the threat still active. Why does this matter? Most AI products include some version of 'be helpful, always respond.' This paper suggests those instructions may be quietly degrading reliability in high-stakes situations. The catch: this was a controlled experiment with narrow metacognitive tasks, not a real-world product audit. And two models — one Claude variant and one Gemini variant — stayed largely resistant. So it's not inevitable. But 8 of 11 is a number worth sitting with.

Glossary

metacognition — An AI's ability to accurately assess the limits of its own knowledge — knowing when it knows something and when it doesn't.

compliance-forcing instruction — A prompt directive that tells the model it must produce an answer regardless of uncertainty, effectively forbidding refusal.

Source: The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

              02 / 03
            

AI Agents Can Now Infect Each Other Like Computer Viruses

A research team built a self-spreading worm that hops from one AI assistant to another without you clicking anything.

AI agents — the kind increasingly used to manage your email, run code, or process documents — work by reading files, acting on them, and writing new ones. That loop is exactly what this research team exploited. They built what they call an LLM worm: a piece of malicious text that, once it lands in one AI agent's memory, instructs that agent to copy the payload into every file it subsequently touches. When a second AI agent reads one of those files, it gets infected too. No human clicks. No direct attack on any server. Think of a sticky note left in a shared office kitchen that reads: 'After you read this, copy it onto every message you send today.' The note spreads through the office's communication without anyone deliberately forwarding it. The team tested this against three real, unnamed open-source AI agent frameworks — production-grade software used to automate real workflows. They demonstrated three-hop transmission chains: the worm traveled across framework boundaries without platform-specific modifications. Along the way, it could escalate privileges (convince one AI to treat a malicious instruction as if a trusted human had issued it) and pull data out of agent workspaces. One counterintuitive finding: read operations are the primary vulnerability, not writes. Classic cybersecurity assumes the danger is in writing bad data into a system. Here, the danger is in reading poisoned data and then acting on it. The catch: testing was in a controlled research environment, not a live production deployment. The three target frameworks are anonymized. This is a proof of concept — but a working one, demonstrated end-to-end.

Glossary

LLM agent — An AI system that doesn't just answer questions but takes actions — reading files, browsing the web, running code, sending messages — based on instructions.

privilege escalation — When an attacker tricks a system into treating a low-trust instruction as if it came from a high-trust authority, gaining capabilities they shouldn't have.

Source: Autonomous LLM Agent Worms: Cross-Platform Propagation, Automated Discovery and Temporal Re-Entry Defense

              03 / 03
            

Students Designed 80 Tasks for AI — the Best Model Passed 55%

Undergraduate students at Shanghai Jiao Tong University designed tasks that stumped frontier AI — every single model failed Olympiad-level math entirely.

A team at Shanghai Jiao Tong University ran a simple experiment: ask 230 students to design hard tasks for AI agents — browsing the web, writing and running code, solving research problems — and then filter the best 80 through expert review. Then put six of the world's best AI systems through those tasks under live conditions inside isolated sandboxes. The best model passed 55% of tasks. On a school exam, that's a fail. Think of it like a cooking competition where the judges are the students at the culinary school. They know exactly which shortcuts get taken, which techniques get skipped, which questions will catch the chef off guard — because they've been taught the same curriculum. They design challenges that probe real gaps, not just surface difficulty. Some findings are striking in their specificity. Olympiad-level math and linguistics problems stumped every model — zero for six. Over 22% of tasks showed what the researchers call capability boundaries: one model scores 90 points on a task while another scores 0 on the exact same task. Different systems have very different hidden blind spots. And here's a strange one: models that used more tokens — more internal 'thinking steps' — didn't score higher. The correlation between token use and output quality was essentially zero (r = −0.03). Thinking longer, in this case, did not mean thinking better. The catch: 80 tasks is a small sample, and student-designed challenges may skew toward what's hard to verify rather than what's hard in real deployment. But 55% on academic tasks, across six frontier models, is still a useful stake in the ground.

Glossary

token — The basic unit an AI model processes — roughly a word or word-fragment; more tokens used generally means more computation time spent 'thinking.'

capability boundary — A task where different AI models show wildly divergent performance, suggesting each has a distinct hidden weakness rather than a shared, gradual limitation.

sandbox — An isolated computer environment where AI can run code and browse without affecting real systems — used here to score tasks under live conditions.

Source: AcademiClaw: When Students Set Challenges for AI Agents

The bigger picture

Three papers, and they form an uncomfortable triangle. The first says AI models break when told they must comply — they fabricate rather than admit ignorance. The second says AI agents, increasingly deployed to read and write files autonomously, can be hijacked through those same files, spreading infection without any human click. The third says even at their best, today's most capable models fail 45% of tasks that sharp undergraduates could design in an afternoon. Here's the position I'd take: the capability story and the safety story are no longer separable. We're moving AI into environments — inboxes, legal documents, factory floors — where models act, not just answer. And all three of these papers point at the same gap: reliability under real-world pressure. Not hypothetical future conditions. Conditions researchers can demonstrate today, on production-grade models, with documented results. That's the honest picture.

What to watch next

On the security front, watch for responses from the agent framework maintainers named (anonymously) in the worm paper — disclosure timelines will tell you how seriously this is being taken. On the reliability side, the compliance-trap result is specific enough that I'd expect labs to run internal replications soon; any public response from Anthropic, Google DeepMind, or OpenAI would be worth reading carefully. The open question I'd most want answered: does fine-tuning on refusal examples fix the compliance-forcing vulnerability, or does it just move the failure to a different prompt pattern?