DeepScience · Artificial Intelligence · Daily Digest

AI sees the right answer, then outputs the wrong one

Today's AI research reveals that the failures aren't random — they're structural, predictable, and worth understanding before the next deployment.

June 01, 2026

Three papers landed today that are, on the surface, about completely different things: gender representation, sports strategy, and shared AI notepads. I spent the morning reading all three and kept hitting the same wall. Each one is showing you a different face of the same problem: AI systems that look capable at the surface layer and fall apart — quietly, in specific ways — the moment you need more from them. Let me walk you through all three.

Today's stories

01 / 03

AI models internally register 'woman' — then output 'man' anyway

The AI looked at a photo of a babysitter, internally registered 'female' — and then said 'male' out loud.

Imagine a translator who reads the original sentence, understands it says 'she,' and then writes 'he' on the page, with no conscious decision to do so. That is roughly what a team of researchers found when they built a tool called LALS (Latent Association Leaning Score) to act like an X-ray for the inside of vision-language models. They fed these models 900 AI-generated images of people in ambiguous situations, images where you genuinely cannot tell the person's gender, across 15 occupations, including babysitter, nurse, and engineer. Then they tracked what the model was 'thinking' layer by layer as it processed each image.

What they found is striking. Midway through the model's internal processing, it activated female associations for strongly female-stereotyped occupations. But by the time it reached the output stage and produced a word, that female signal had been filtered away. The models defaulted to male. That pattern held across all four tested models and all 15 occupations under forced-choice prompting. They also found that changing the clothing in the image from blue to pink substantially shifted the internal signal, which suggests the models had absorbed cultural colour-gender shortcuts from their training data.

This matters because AI tools are being used in hiring, content generation, and image captioning. If a bias audit only checks the model's output, it will miss this entirely, because the suppression happens before the answer is written.

The catch: the images were AI-generated, not real photographs, and only one human annotator verified that they were genuinely ambiguous. The study covered four models in the 7–8 billion parameter range. Whether much larger commercial models behave the same way is an open question nobody has answered yet.

Glossary

vision-language model (VLM) — An AI system trained to process both images and text together, so it can describe photos or answer questions about pictures.

LALS (Latent Association Leaning Score) — A measurement tool introduced by this study that projects the model's internal activations into language space to read which concepts the model is associating with an input, before it produces any output.

forced-choice prompting — Asking a model to pick between two specific options (e.g., 'is this person male or female?') rather than letting it describe freely.
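
To make the "X-ray" idea concrete, here is a minimal Python sketch of a logit-lens-style readout, one common way to project a hidden state into vocabulary space and see which words it favours. It is an illustration only: the paper's exact LALS formula is not given in this digest, and the function `leaning_score`, the toy `UNEMBED` matrix, and every number below are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project(hidden, unembed):
    """Project one hidden state onto each vocabulary row (dot products)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]

def leaning_score(hidden, unembed, female_ids, male_ids):
    """Hypothetical score in [-1, 1]: positive leans female, negative leans male."""
    probs = softmax(project(hidden, unembed))
    female = sum(probs[i] for i in female_ids)
    male = sum(probs[i] for i in male_ids)
    return (female - male) / (female + male)

# Toy 2-dimensional model with a 3-word vocabulary (all values invented).
VOCAB = ["woman", "man", "person"]
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# One hidden state per layer: the middle layer leans female, the last leans male.
LAYERS = [[0.1, 0.1], [2.0, 0.5], [0.5, 2.0]]

scores = [leaning_score(h, UNEMBED, female_ids=[0], male_ids=[1]) for h in LAYERS]
for depth, score in enumerate(scores):
    print(f"layer {depth}: leaning {score:+.2f}")
```

In this toy setup the score starts neutral, turns positive in the middle layer, and flips negative at the last layer. An audit that only reads the final output would see the male lean and nothing else.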

Source: Vision-Language Models Suppress Female Representations Under Ambiguous Input

02 / 03

AI can spot a foul but collapses when asked to plan a play

The best AI model scored 73% on identifying what just happened in a game — and 5% on deciding what to do about it.

A team built SVI-Bench, a benchmark using 35,000 hours of basketball, soccer, and hockey footage (15 million annotated actions, plus expert commentary and statistical records) to test AI models at progressively harder levels of understanding. Think of it like a staircase. The first step is perception: can you spot that a foul just happened? The second is causal reasoning: why did it happen? The third is strategic simulation: what should the team do next? The fourth, the top step, is agentic synthesis: go find relevant evidence across a library of 1.8 million clips and make a decision.

The staircase collapses dramatically. The best tested model scored about 73% on basic action identification, which is genuinely respectable, like a knowledgeable fan watching the match. But accuracy fell sharply at every higher step, bottoming out at 5% on tasks requiring autonomous evidence-gathering. That is a drop of nearly 70 percentage points from the best perception score to the hardest reasoning task.

Why use sports? Because team sports give you something rare: real-world complexity (10 to 22 coordinated players under adversarial pressure) combined with explicit rules and verifiable right answers. It is a clean test bed that avoids the messiness of open-ended real-world domains. My bet is that the same collapse shows up in other high-stakes domains, such as medical diagnosis, legal reasoning, and logistics, where we can't measure it as cleanly, though this benchmark alone does not show that.

The catch: this is a new benchmark, and the team deferred full construction details to a longer paper. Human performance baselines are not yet reported, so we don't know exactly how large the human-AI gap is at the top of the staircase. The direction of the finding, though, is consistent with what other research keeps showing.

Glossary

benchmark — A standardised test set used to measure how well AI models perform on a defined task, so different models can be compared fairly.

agentic synthesis — A task that requires an AI to autonomously search for information, gather evidence from multiple sources, and combine it into a decision — rather than answering a single question with materials already provided.

NDCG (Normalized Discounted Cumulative Gain) — A scoring method that rewards correct answers ranked higher in a list more than correct answers buried lower down.
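
For readers who want to see the NDCG idea in numbers, here is a minimal Python sketch of the standard linear-gain formulation. Whether SVI-Bench uses this exact variant is not stated in this digest, and the two example rankings are invented.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the best-first ordering; 0.0 if nothing is relevant."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical retrieval results: 1 = relevant clip, 0 = irrelevant clip.
relevant_first = [1, 0, 0]
relevant_last = [0, 0, 1]

print(ndcg(relevant_first))  # 1.0: the relevant clip is already on top
print(ndcg(relevant_last))   # 0.5: same clip, but buried at rank 3
```

Both rankings contain the same single relevant clip. Only its position changes, and the score halves when it drops from first place to third.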

Source: SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

03 / 03

When AI agents share notes, their mistakes spread like rumours

Giving two AI agents a shared whiteboard to collaborate on didn't help — it made the errors harder to fix.

The idea sounds sensible: if one AI is reading a document and another is checking the work, letting them share a notepad should improve accuracy. That is the premise of multi-agent collaboration, and it is already being built into coding assistants and document-review tools. A team built a diagnostic framework called CoSee to test whether this actually works on reasoning-heavy tasks, such as answering questions about charts or slide presentations.

The results are a warning. The team found two failure modes that kick in as soon as you introduce a shared workspace. The first they call Noise Reinforcement. One agent writes an uncertain guess on the shared board. The second agent picks it up as established fact. The original error is now circulating as evidence. Imagine two cooks in a kitchen: one tastes the soup and scribbles 'maybe needs salt?' on a sticky note. The second cook reads the note as a confirmed instruction and adds salt. The first cook sees the updated note and concludes the problem is solved. Nobody actually checked the soup.

The second failure is Policy Collapse: having a shared board nudges models toward short, vague answers, as if the extra context creates a kind of decision paralysis. On reasoning-heavy benchmarks (ChartQAPro and VQAonline), the two-agent setup actually performed worse than a single model working alone.

The catch: these tests used small models, 4 to 8 billion parameters, on a single GPU. That is a realistic budget constraint for many deployed systems, but larger models may be more robust. The team found that adding a quality-check gate before anything gets written to the shared board partially fixed the ChartQAPro problem, so the failure is not inevitable, just easy to walk into.
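
The quality-check gate can be pictured with a few lines of Python. This is a sketch under stated assumptions, not CoSee's implementation: the digest does not describe how the team's gate works, so the `Note` and `SharedBoard` classes, the self-reported `confidence` field, and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Note:
    author: str
    claim: str
    confidence: float  # hypothetical self-reported confidence in [0, 1]

@dataclass
class SharedBoard:
    min_confidence: float = 0.0  # 0.0 means no gate: every note is accepted
    notes: list = field(default_factory=list)

    def write(self, note):
        """Store a note only if it clears the gate; return whether it was stored."""
        if note.confidence < self.min_confidence:
            return False
        self.notes.append(note)
        return True

    def read(self):
        """Readers see bare claims, so a stored guess looks like an established fact."""
        return [note.claim for note in self.notes]

guess = Note(author="reader", claim="bar 3 is the tallest", confidence=0.4)

ungated = SharedBoard()
gated = SharedBoard(min_confidence=0.7)

ungated.write(guess)  # the uncertain guess lands on the board
gated.write(guess)    # the gate rejects it

print(ungated.read())
print(gated.read())
```

The ungated board hands the second agent the guess with its uncertainty stripped off, which is the Noise Reinforcement pattern in miniature. The gated board stays empty until a note clears the threshold.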

Glossary

hallucination — When an AI confidently states something false — generating plausible-sounding but unsupported information.

multi-agent system — An AI setup where two or more separate models work together, each taking on a role, and passing information between them.

document VQA (Visual Question Answering) — The task of answering questions about the content of a document — a chart, a slide, a form — by reading both its text and its visual layout.

Source: Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

The bigger picture

Put these three papers side by side and a pattern appears. In the gender study, the model internally registers a female signal, but that signal is filtered out before the output stage and a male default takes over. In the sports study, AI perception is strong but strategic reasoning collapses: the model can see but can't plan. In the collaboration study, connecting two agents doesn't pool their strengths; it pools their uncertainties and amplifies them.

What these studies are collectively mapping is the frontier between surface-level AI capability and structural AI reliability. The models we have are remarkably good reporters: they describe what is visible, match patterns, and name things. They fail in specific, predictable places: carrying an internal signal through to the output, reasoning from what just happened to what to do next, and coordinating without reinforcing each other's mistakes. That is not a scattered list of bugs. It is a structural problem. The next wave of AI research, the work that will actually change what these systems can do, is going to live at this exact boundary. We are not there yet, and these three papers are pointing at the same wall from different angles.

What to watch next

The SVI-Bench team explicitly flagged a longer paper with full benchmark construction details. When that lands, the human performance baselines will be the numbers to look at. On the gender-suppression research, the EU AI Act's August 2026 enforcement deadline for high-risk AI systems is approaching fast, and studies like this one are likely to feed compliance and auditing debates. The open question I keep coming back to: is female-signal suppression happening because of training data distribution, because of the fine-tuning process, or because of something in the architecture itself? Nobody has cleanly untangled that yet, and the answer matters enormously for how you fix it.