
DeepScience · Artificial Intelligence · Daily Digest

AI Medicine Fumbles Left and Right — Literally

Today's papers ask whether AI that sounds confident can actually be trusted to act in the real world.
May 01, 2026
Hi — three papers today, all pointing at the same awkward truth from different angles. AI has learned to say the right things faster than it has learned to do the right things. Let me walk you through what that looks like in a hospital, a fake office, and an image generator.
Today's stories
01 / 03

Top AI Medical Tools Can't Locate What They're Describing

The best AI medical tools can tell you all about a finding on an X-ray — they just can't reliably point to where it is.

A team audited five of the most capable AI vision models currently available (Gemini 2.5 Pro, GPT-4o, o3, GLM-4.5V, and Qwen 2.5-VL) on a specific medical task: look at an image, find the relevant region, answer a question about it. The results are uncomfortable.

The best-performing model correctly located the target region (a lesion, an organ, a highlighted structure) only 19.1% of the time. Localization was scored with a measure called IoU; think of IoU as the percentage overlap between the box the AI drew and the box a doctor would draw. The average IoU score was 0.23 out of 1.0. That's like asking someone to circle the right ingredient on a recipe card and watching them draw the circle in the wrong place four times out of five.

Worse: every single model showed systematic left-right confusion on chest X-rays. That is not a minor calibration issue. In medicine, left and right are the difference between operating on the correct lung and the wrong one. The team also tried a two-step approach: ask the model to locate the region first, then answer based on what it found. That made accuracy worse for all five models, not better. On one test, the format failure rate (the model producing output so garbled it couldn't even be parsed) reached 99% for some models.

There is a catch worth naming. The researchers fine-tuned one smaller model, Qwen 2.5 VL 7B, on training data drawn from the same benchmarks being tested, and it hit 85.5% accuracy, a genuine improvement. But a model trained on examples from a test benchmark tells us less about real-world readiness than it might seem. The open question is how these systems perform on real hospital data they have never seen, under the clinical conditions where being wrong costs something.
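For the curious, IoU is simple enough to compute by hand. Here is a minimal Python sketch (the boxes and pixel coordinates are made up for illustration, not taken from the paper) showing why left-right confusion scores exactly zero and why even a near miss scores low:

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    # Overlap rectangle: the region covered by both boxes.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A doctor's box on the LEFT lung vs. a model box mirrored to the RIGHT:
doctor = (100, 200, 220, 380)   # hypothetical pixel coordinates
mirrored = (412, 200, 532, 380) # same shape, wrong side of the chest
print(iou(doctor, mirrored))    # 0.0 -- no overlap at all

# A near miss still scores low: same box shifted 60 px to the right.
near = (160, 200, 280, 380)
print(round(iou(doctor, near), 2))  # 0.33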

Glossary
IoU (Intersection over Union): A score from 0 to 1 measuring how much two regions overlap. 1.0 means perfect overlap; 0 means no overlap at all.
VQA (Visual Question Answering): A task where an AI looks at an image and answers a natural-language question about it.
fine-tuning: Taking an already-trained AI model and doing additional training on a specific dataset to improve it for a particular task.
02 / 03

Building a Thousand Fake Offices to Teach AI How Computers Work

If you want to train an AI to work on a real computer, you might need to build a thousand fake ones first.

Most AI training works like this: here are ten thousand examples of the right answer, now learn the pattern. That works well for tidy tasks. It does not work well for the messy, multi-step work most of us actually do on computers: opening a file, cross-checking a calendar, drafting an email, going back to the file, noticing a mistake, fixing it. There is no single right answer at each step, and a wrong move twenty steps ago might only hurt you fifty steps later.

A research team built 1,000 synthetic computers to solve this problem. Each fake computer has a realistic folder structure, documents, spreadsheets, and presentations generated to match a fictional user persona: their job title, their work history, their habits. Think of it like building a thousand detailed dollhouses, each with a fictional resident's entire working life inside, then sending a robot into each one to figure out how offices function.

They then ran AI agents on these synthetic machines for over 2,000 turns each, more than eight hours of runtime per simulation, having one agent set monthly productivity goals and another agent actually try to achieve them. Agents trained this way showed improvements on productivity tasks both similar to and different from the training environments.

The honest catch: the paper says those improvements are 'significant' but does not give the actual numbers in the publicly available text. That is a real gap. The team has released 100 of the synthetic computers and logs from 500 simulations publicly, which means other researchers can now poke at the method independently. Whether 1,000 carefully constructed fake computers can capture enough of the chaos of real office work is the question that only follow-up will answer.
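The paper's actual pipeline is far richer, but the core recipe (invent a persona, then generate a working life consistent with it) can be sketched in a few lines. Everything below, the role list, the field names, the folder layout, is invented for illustration; a real pipeline would swap the random choices for LLM generations conditioned on the persona.

```python
import random

ROLES = ["accountant", "recruiter", "project-manager", "paralegal"]
FILES = ["notes.docx", "budget.xlsx", "status.pptx"]

def make_persona(seed):
    """Invent a fictional office worker. A real pipeline would use an LLM here."""
    rng = random.Random(seed)
    role = rng.choice(ROLES)
    return {
        "role": role,
        "tenure_years": rng.randint(1, 12),
        "projects": [f"{role}-project-{i}" for i in range(rng.randint(2, 5))],
    }

def make_filesystem(persona):
    """Lay out a folder tree consistent with the persona's working life."""
    tree = {}
    for project in persona["projects"]:
        tree[f"/home/user/Documents/{project}"] = list(FILES)
    return tree

persona = make_persona(0)
tree = make_filesystem(persona)
print(persona["role"], "owns", len(tree), "project folders")
```

Scale that loop to 1,000 seeds, and you have a toy version of the dollhouse factory the paper describes.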

Glossary
long-horizon task: A task that requires many sequential decisions over a long stretch of time, where early choices affect late outcomes.
credit assignment: The problem of figuring out which earlier action in a long sequence deserves credit or blame for a result that only appeared much later.
03 / 03

AI Image Generators Are Beautiful and Physically Clueless

Ask an AI to generate a completed jigsaw puzzle and it will produce something gorgeous — with pieces that don't fit together.

A survey synthesizing work across dozens of visual generation models has built a clear catalogue of where today's AI image tools quietly fail. The images look beautiful. The underlying reality is absent.

Ask a model to generate a completed jigsaw puzzle: the result is visually convincing but the pieces don't geometrically fit. Ask it to draw a metro map: the stops are in the wrong places for the routes to make sense. Ask it to simulate a fluid pouring: the physics is wrong. Ask it to apply consistent edits across ten rounds of conversation: each edit slightly corrupts what came before, like painting over a canvas that keeps changing while you're not looking. Think of someone who learned to paint entirely from photographs, never once picking up an object or watching water flow. The pictures look right. The physical understanding is missing.

The survey team proposes a five-level framework for visual generation capability, from basic pixel generation all the way up to genuine 'world modeling', where a system understands structure, physics, causality, and persistent state. Their honest assessment: even the best current systems sit around level three. They can follow instructions and context well. They cannot reason about what should physically happen next or maintain a consistent picture of the world across many edits.

The catch here is structural: this is a survey, not a controlled experiment. The failures are demonstrated through expert-designed stress tests chosen specifically to expose weaknesses, not through a random sample of everyday prompts. The authors also argue that current benchmarks overstate progress because they measure perceptual quality (does it look good?) rather than physical coherence (is it actually correct?). That critique rings true, but a replacement benchmark that everyone agrees on does not yet exist.
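You can watch the 'painting over a changing canvas' failure in a toy simulation. In the sketch below, each edit reconstructs the image only from the previous version plus a little error (the Markovian setup), so errors compound; an editor that re-reads the original stays put. This is purely illustrative, not the survey's methodology, and the numbers stand in for any image property you like.

```python
import random

random.seed(42)
original = 100.0          # stand-in for some image property (say, a wall's color)
markov = grounded = original

for step in range(1, 11):
    err = random.gauss(0, 2.0)   # small reconstruction error on every edit
    markov += err                 # conditions only on the latest version: errors stack
    grounded = original + err     # re-reads the original each time: errors don't stack
    print(f"edit {step:2d}: markov drift {markov - original:+6.2f} | "
          f"grounded drift {grounded - original:+6.2f}")
```

After ten edits the Markovian editor has typically wandered several times further from the original than the grounded one, which is exactly the accumulation the survey calls Markovian drift.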

Glossary
world modeling: The ability of an AI system to maintain an internal, consistent understanding of how the world works, including space, time, physics, and cause-and-effect, not just how things look.
Markovian drift: A pattern where each new edit to an image only uses recent context, causing small errors to accumulate and corrupt the overall result over many steps.
The bigger picture

Three different scales, one recurring pattern. Today's papers are all about the gap between what AI appears to be able to do and what it can actually be trusted to do when the consequences are real. Medical AI can describe a chest finding fluently but can't reliably locate it. Image generators produce physically impossible scenes with complete visual confidence. AI agents trained on clean, labeled examples fail to hold up across the long, messy sequences that real computer work requires, which is why one team is now building synthetic reality at scale just to generate better training data.

The surface capability arrived before the underlying understanding. AI got very good at producing plausible output before it got good at grounding that output in something real. The synthetic computers paper is the most explicit attempt here to close that gap by engineering the messy context AI has been missing. Whether more data and more simulation are enough, or whether the architecture itself needs to change, is the question that connects all three stories and that nobody has answered yet.

What to watch next

The medical AI audit is particularly timely: the US FDA and the EU regulators implementing the AI Act are both actively working out how to classify and regulate AI tools used in clinical settings, and 'grounding failures' of exactly this kind are what they are trying to catch. Worth watching for any response from the model developers named in the paper. On the synthetic computers front, the team has released data publicly; look for follow-up benchmarks from other labs in the coming months that will tell us whether the training gains hold up outside the original evaluation.

Thanks for reading — and if any doctor forwards this to their IT department, I would not be surprised. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io