DeepScience — Artificial Intelligence

DeepScience · Artificial Intelligence · Daily Digest

Fake Pages Fool AI, Medical AI Invents Findings, Tools Get Smarter

Today's AI research asks a hard question: can we trust what AI agents read, see, and do on our behalf?

June 13, 2026

Hi — three stories today, all circling the same uncomfortable truth: AI systems are increasingly acting in the world on your behalf, and we keep finding new cracks in the foundation. I spent the morning reading papers so you don't have to. Let's dig in.

Today's stories

01 / 03

One Fake Webpage Can Trick AI Into Recommending Anything

Someone quietly edits a webpage, swaps in a brand you've never heard of — and suddenly every AI assistant you ask recommends it.

Imagine a restaurant review site where one bad actor can rewrite any listing overnight. Now imagine that AI shopping assistants read those listings to answer your questions. That's the scenario researchers tested with FORGE, a benchmark built around 225 real products across 15 categories. The team took real webpages, locally rewrote them to insert a fake brand, then fed those pages to 12 commercial and open-weights AI models and asked for recommendations.

With just one polluted page appearing in the AI's search results, some models recommended the fake brand up to 27% of the time. Give the attacker three polluted pages (the full top of the search results) and the fooled rate climbed to 73.8% for the most vulnerable model. The effect was nearly dose-dependent: more polluted pages, more fooled recommendations, almost every time.

Why it matters directly to you: if you've ever asked an AI assistant what blender to buy, which restaurant to try, or which service to use, this attack targets exactly that exchange. AI recommendations are already embedded in search tools and chatbots, and the attack requires nothing exotic, just control of a few webpages.

The catch is real, though. This was a controlled simulation. The researchers rewrote pages locally; they didn't have to actually publish and rank fake pages on the live web, which is genuinely harder. But here's the part that should make you pause: most proposed defences backfired. Telling the AI to 'be skeptical' of sources, which the researchers call skepticism prompting, actually raised fooled rates by 24 percentage points on average, and by 44 points for Gemini 3.1 Pro specifically. The AI argued itself into trusting the fake source harder. Nobody has a clean fix yet.

Glossary

parametric memorization — When an AI answers from facts baked into its training, rather than by actually looking anything up.

fooled rate — The percentage of the time an AI model recommended the fake brand after being shown polluted web content.
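To make the fooled-rate definition concrete, here is a minimal Python sketch of how such a figure could be computed and how a change "in percentage points" is read. The function name, the brand names, and the trial data are all hypothetical illustrations; none of them come from the FORGE paper, and the numbers are not its results.

```python
def fooled_rate(recommended_brands, fake_brand):
    """Percentage of trials in which the model recommended the fake brand."""
    if not recommended_brands:
        raise ValueError("need at least one trial")
    fooled = sum(1 for brand in recommended_brands if brand == fake_brand)
    return 100.0 * fooled / len(recommended_brands)


# Hypothetical trial logs: one entry per query, naming the brand the model picked.
# "Zentrova" stands in for the injected fake brand (an invented name).
baseline = ["Acme", "Acme", "Zentrova", "Borealis"]
with_skepticism_prompt = ["Zentrova", "Acme", "Zentrova", "Zentrova"]

base = fooled_rate(baseline, "Zentrova")
prompted = fooled_rate(with_skepticism_prompt, "Zentrova")

# A difference between two percentages is expressed in percentage points.
change_in_points = prompted - base
print(f"baseline {base:.1f}%, with prompt {prompted:.1f}%, "
      f"change {change_in_points:+.1f} points")
```

In this toy data the rate goes from 25% to 75%, a rise of 50 percentage points. The 24-point and 44-point figures in the story are differences of the same kind.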

Source: One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

02 / 03

AI Reading Medical Scans Sometimes Invents What It Sees

What if the AI reading your MRI confidently described a fracture that wasn't there — and nobody caught it?

When an AI model reads a medical scan, it doesn't flag uncertainty the way a tired radiologist might say 'I'd want a second look.' It produces a report, fluently and confidently. The problem that a review published this week maps out is that these reports can include fabricated anatomical structures, invented measurements, missed findings, and laterality errors, meaning the AI says something is on the left when it's on the right. The authors synthesised research across five imaging types (MRI, CT, X-ray, ultrasound, and pathology slides) and found that no single existing framework covers all the ways AI can hallucinate in this context. Three different taxonomies are needed together to map the whole problem.

Here's the counterintuitive finding. You'd assume that an AI fine-tuned specifically on thousands of medical images would hallucinate less than a general-purpose model trained on everything. The data suggests the opposite. General-purpose foundation models showed a median hallucination-free rate of 76.6%, compared with 51.3% for medical-specialist models. The suspected mechanism: when you drill a model too narrowly into one domain, it starts pattern-matching so aggressively that it invents findings matching its training patterns even when those findings aren't there, like a new hire who memorised the manual so hard they stopped using common sense.

The catch: this paper is a structured narrative review, not a new experiment. The authors synthesised other teams' benchmark results rather than running fresh tests. The comparison between general and specialised models comes from aggregating heterogeneous studies, so the two figures were not measured on a like-for-like basis. But the direction of the finding is striking, and the FDA is actively updating its regulatory framework for AI in medical imaging right now. The timing matters.

Glossary

hallucination — When an AI produces output that sounds plausible and confident but is factually wrong or entirely invented.

fine-tuning — Taking a general AI model and further training it on a specialised dataset to improve performance in a narrow area.

foundation model — A very large AI model trained on broad, general data — think of it as the generalist before any specialisation happens.

Source: Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

03 / 03

AI Agents Got Three Times Better by Stopping Micro-Narrating Every Step

Your AI assistant is burning through its thinking budget narrating every single click — and there's a surprisingly simple fix.

When an AI agent uses tools (browsing the web, running code, checking a database), it currently treats each micro-action as a separate decision. Want to look up a fact and summarise it? Step one: open the search. Step two: type the query. Step three: read the result. Step four: process it. Each step passes through the model's full reasoning machinery, chewing through its limited working memory, called a context window, in the process. The team behind HyperTool, working with Qwen models, describes this as an execution-granularity mismatch. Think of a sous chef who has to narrate every single knife stroke to the head chef before making the next cut. The sauce never gets made; everyone exhausts themselves on procedure.

HyperTool's fix is elegant: instead of one action at a time, let the model write a small script that bundles predictable steps into a single grouped call. The model calls the bundle, the bundle runs, the model gets back a result. It only needs to reason about the outcome, not each micro-step. Tested on MCP-Universe, a benchmark for multi-tool tasks, the results were stark. A smaller Qwen3-8B model went from 9.93% accuracy to 33.33%, more than tripling. The larger Qwen3-32B went from 15.69% to 35.29%, enough to beat GPT-OSS on the benchmark.

The catch: these gains come from supervised fine-tuning on synthetic examples, scenarios the team constructed themselves. MCP-Universe is one benchmark. Whether the improvement holds on messier, open-ended real-world tasks is still unknown, and the paper doesn't fully detail the training data size. A promising result, but one that needs replication outside the conditions it was designed for.
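The bundling idea can be sketched in a few lines of Python. This is only an illustration of the general pattern, not HyperTool's actual implementation: the tools (`search`, `fetch`, `summarise`), the `CountingAgent` class, and the bundled `lookup_and_summarise` script are all hypothetical stand-ins, and the "model turn" counter is a simplification of what a real agent loop does.

```python
# Hypothetical stand-ins for real tools; none of these names come from the paper.
def search(query):
    return [f"https://example.com/{query.replace(' ', '-')}"]


def fetch(url):
    return f"contents of {url}"


def summarise(text, max_words=5):
    return " ".join(text.split()[:max_words])


class CountingAgent:
    """Counts how many times a tool result is handed back to the model."""

    def __init__(self):
        self.model_turns = 0

    def call(self, tool, *args, **kwargs):
        # Every call returns to the model, which costs one reasoning step
        # and some context-window space.
        self.model_turns += 1
        return tool(*args, **kwargs)


def lookup_and_summarise(query):
    """Bundled script: three predictable steps run with no model turn in between."""
    urls = search(query)
    page = fetch(urls[0])
    return summarise(page)


# Step-wise execution: the model is consulted after every micro-action.
stepwise = CountingAgent()
urls = stepwise.call(search, "qwen3 context window")
page = stepwise.call(fetch, urls[0])
summary_stepwise = stepwise.call(summarise, page)

# Bundled execution: one grouped call, one result for the model to reason about.
bundled = CountingAgent()
summary_bundled = bundled.call(lookup_and_summarise, "qwen3 context window")

print(stepwise.model_turns, bundled.model_turns)
```

Both routes produce the same summary, but the step-wise route consults the model three times and the bundled route once. The design trade-off is that a bundle only makes sense for steps whose order is predictable in advance; anything that needs a judgment call mid-way still has to go back to the model.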

Glossary

context window — The maximum amount of text an AI model can hold in its working memory at one time — like a desk with limited surface area.

supervised fine-tuning — Training an AI model on labelled examples of correct behaviour, whether written by people or generated synthetically, so it learns to imitate those examples.

MCP (Model Context Protocol) — A standard for connecting AI models to external tools, like a universal plug socket for AI capabilities.
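The accuracy figures quoted for HyperTool are easy to sanity-check. The short Python sketch below uses only the four percentages reported in the story; the function name `relative_gain` is an illustrative choice, not something from the paper.

```python
def relative_gain(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


# (accuracy before, accuracy after) on MCP-Universe, in percent, as quoted above.
results = {
    "Qwen3-8B": (9.93, 33.33),
    "Qwen3-32B": (15.69, 35.29),
}

gains = {}
for model, (before, after) in results.items():
    gains[model] = relative_gain(before, after)
    print(f"{model}: {before}% -> {after}% "
          f"(+{after - before:.2f} points, {gains[model]:.2f}x)")
```

Run as written, this shows the smaller model improving a little more than threefold, while the larger model improves a little more than twofold. The "three times better" headline therefore describes the 8B result specifically.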

Source: HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

The bigger picture

Three stories today, and they share a thread you should hold onto. AI systems are increasingly operating in the world on your behalf: searching the web for you, reading medical scans for you, clicking through tools for you. And each story reveals a different fragility in that premise. The web pollution paper shows that the content AI reads can be quietly manipulated to steer your decisions. The medical hallucination review shows that even purpose-built specialist AI can invent clinical findings, and that narrowing a model's training doesn't reliably fix it. HyperTool shows that tool-using agents carried an architectural inefficiency, and that fixing it roughly tripled the smaller model's accuracy, which implies there was a ceiling we accepted without knowing it was artificial.

The uncomfortable through-line: we are deploying these systems in high-stakes contexts (medicine, commerce, software pipelines) while still discovering fairly fundamental flaws. That's not an argument to stop. It's an argument to test adversarially, audit honestly, and take the caveats in these papers seriously.

What to watch next

On web pollution, the open question is whether AI systems can be trained to verify source provenance rather than simply trust retrieved content — nobody has a working solution yet. On medical imaging, the FDA's ongoing updates to its Total Product Life Cycle framework for AI devices will be the regulatory signal to watch in the second half of 2026. And for HyperTool, the real test comes when someone runs it on genuinely open-ended tasks rather than a controlled benchmark — I'd want to see that replication before drawing firm conclusions.