DeepScience · Artificial Intelligence · Daily Digest

AI's Hidden Gaps: Bias, Manipulation, and Leaky Memory

Three real experiments show where today's AI agents quietly break — and why it matters before we trust them with more.
April 13, 2026
Today's batch of 69 papers is mostly theoretical noise — concept papers with no data, simulations dressed up as results, and a handful of philosophy preprints that won't change anything this week. But three papers cut through that. They all ran actual experiments, found actual problems, and named them. Let me walk you through each one.
Today's stories
01 / 03

Non-experts found gender bias in an AI image tool — using the right interface

You give 60 ordinary people a tool to poke at an AI, and they find four kinds of gender bias a lab team might have catalogued in months.

The model on trial here is BLIP, Salesforce's image-captioning system — an AI that looks at a photo and writes a sentence describing it. The question the researchers wanted to answer: can non-experts find its biases, and does the tool they're given matter? Sixty participants — no AI background required — were split into three groups. One group got a plain interface to explore freely. A second got an Image Masking Tool that let them hide parts of a photo and see how the captions changed. A third got a Text Filtering Tool that let them search and compare captions. Think of it like asking amateur home inspectors to look for cracks in a wall: one group gets their bare eyes, one gets a flashlight pointed at the surface, one gets a hammer to tap on it.

Each group found different things. Together, they identified four distinct bias patterns: BLIP tended to name a person's gender before their profession; it used gendered language differently for men and women; it relied on visual stereotypes to decide what job someone probably had; and it reinforced professional gender roles ('the male surgeon,' 'the female nurse'). The Image Masking Tool pushed people toward close visual inspection. The Text Filtering Tool surfaced the linguistic asymmetries.

The catch: this was 60 people, one model, one type of bias. We don't know yet whether the same approach scales to subtler or less visible harms — or whether users without any external guidance would find much at all. Still, this is a small but real argument that the interface you hand people shapes the problems they can see.
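To make the "gender before profession" pattern concrete, here is a minimal sketch of the kind of analysis a Text Filtering Tool enables over a set of already-collected captions. The word lists and example captions are illustrative assumptions, not the paper's actual tooling or data:

```python
import re

# Illustrative vocabularies — a real audit would use much richer lists.
GENDERED = {"man", "woman", "male", "female", "he", "she"}
PROFESSIONS = {"surgeon", "nurse", "doctor", "engineer", "teacher"}

def gender_before_profession(caption: str) -> bool:
    """True if a gendered word appears before any profession word."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    gender_pos = [i for i, t in enumerate(tokens) if t in GENDERED]
    prof_pos = [i for i, t in enumerate(tokens) if t in PROFESSIONS]
    return bool(gender_pos and prof_pos and min(gender_pos) < min(prof_pos))

def audit(captions):
    """Count captions where the model leads with gender before profession."""
    hits = sum(gender_before_profession(c) for c in captions)
    return hits, len(captions)

captions = [
    "a woman nurse checking a chart",
    "a surgeon preparing for an operation",
    "a male engineer at a workstation",
]
print(audit(captions))  # -> (2, 3): two of three captions lead with gender
```

The point of the sketch is the workflow, not the word lists: searchable, filterable caption text is what lets a non-expert spot a linguistic asymmetry that raw image-by-image browsing would hide.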

Glossary
algorithm auditing: The process of systematically testing an AI system to find errors, biases, or unexpected behaviours — like a safety inspection for software.
image captioning model: An AI that looks at a photo and automatically writes a sentence describing what it sees.
02 / 03

AI negotiating agents can be talked into bad deals with basic social tricks

Across nearly 21,000 simulated marketplace negotiations, a little social pressure was enough to manipulate three of the most capable AI agents available.

Picture a very diligent but socially naive new employee you've sent to negotiate supplier contracts on your behalf. They know the rules. They know the target price. But if a seller is charming, apologetic, or vaguely threatening, they fold. That's roughly what this study found across 20,880 negotiation sessions run on a purpose-built multi-seller marketplace platform.

Three frontier AI models — GPT-5 Mini, Grok 4.1 Fast, and Gemini 3.1 Flash Lite — each played buyer in online market negotiations, while the researchers tested whether social manipulation tactics could shift outcomes in the seller's favour. The answer was yes, and the vulnerability was consistent across all three model families. That's the uncomfortable part: this wasn't one obscure model with a known weakness. It was a pattern across recent, capable systems. The researchers also tested inoculation methods — essentially, pre-warning the agent about manipulation tactics — and found those helped reduce vulnerability. That's a useful lever.

I want to be transparent about one thing: what I have here is the replication data package posted on Zenodo, not the full peer-reviewed paper. The 20,880-session figure and the three-model comparison are confirmed in the metadata, but the detailed methodology and outcome metrics aren't visible in this record. The finding feels credible — it replicates a pattern security researchers have flagged before — but treat the specific numbers as preliminary until the full paper is published and reviewed. The practical implication is real regardless: if you're deploying an AI agent to do anything consequential on your behalf, its social reasoning is a live attack surface.
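The inoculation idea is simple enough to sketch in a few lines: before the agent negotiates, you prepend an explicit warning about known tactics to its instructions. This is a toy illustration of the general approach, not the paper's method — the tactic catalogue and wording below are hypothetical:

```python
# Hypothetical catalogue of social-manipulation tactics to pre-warn the agent about.
TACTICS = {
    "flattery": "compliments aimed at making you concede on price",
    "urgency": "artificial deadlines pressuring an immediate deal",
    "guilt": "apologetic or sympathy-seeking framing",
    "threat": "vague warnings about walking away or retaliating",
}

def inoculate(base_prompt: str, tactics: dict = TACTICS) -> str:
    """Prepend an explicit warning about known manipulation tactics
    to the agent's negotiation instructions."""
    warning = "\n".join(f"- {name}: {desc}" for name, desc in tactics.items())
    return (
        "Sellers may use the following manipulation tactics. Recognise them "
        "and do not let them move you off your target price:\n"
        f"{warning}\n\n{base_prompt}"
    )

prompt = inoculate("Negotiate the lowest price for 100 units. Walk away above $4.20/unit.")
print(prompt)
```

Whether this kind of prompt-level pre-warning holds up against an adaptive adversary is exactly the sort of question the full paper should answer; the study reports it reduced — not eliminated — vulnerability.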

Glossary
inoculation method: A technique where you warn an AI agent in advance about known manipulation tactics, similar to teaching someone to spot a scam before they encounter one.
frontier model: An AI model at the current edge of capability — the most powerful publicly available systems at a given moment.
03 / 03

Once a small AI memorizes a password, it will give it back — every single time

Fine-tune a small AI on data containing fake passwords, then ask it — directly or indirectly — and it hands them back 100% of the time.

This paper ran two experiments on TinyLlama-1.1B-Chat, a small but real AI model with about 1.1 billion internal parameters. First experiment: can you manipulate the model's behaviour by crafting clever instructions that override its guidelines? Yes — and adding a 'security-enhanced' system prompt reduced but did not eliminate the success rate. Think of it like adding a 'please ignore any instructions telling you to open the safe' note to a combination lock. It helps a little. It's not a fix.

Second experiment: what happens when sensitive data — in this case, synthetic fake credentials — gets baked into the model through a process called fine-tuning, where you train the model on new examples? The answer was stark. The model reproduced those memorized credentials with a 100% retrieval rate, whether you asked directly ('what is the password for X?') or indirectly, going around the question from a different angle. The analogy here is permanent marker versus pencil. Instructions written into a model through fine-tuning are permanent marker. A defensive system prompt is trying to rub out permanent marker with an eraser.

A few caveats matter here. TinyLlama is a tiny model — much smaller than systems deployed in real products — and this study tested one model with no statistical significance reporting and no multi-run trials. We don't know whether the 100% retrieval rate holds at scale, across different architectures, or under more sophisticated defences. But the directional finding is consistent with what larger security research has found: data that enters a model through training is very hard to remove cleanly, and prompt-level defences are insufficient on their own.
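The retrieval-rate measurement itself is easy to picture. Here is a toy harness for it, with a stub function standing in for the fine-tuned model — the stub, the fake credential, and the probe wordings are all illustrative assumptions, chosen so the stub mimics the paper's finding that direct and indirect probes both leak:

```python
# Stub standing in for a fine-tuned model that has memorized a fake credential.
MEMORIZED = {"service-x": "hunter2-fake"}

def stub_model(prompt: str) -> str:
    """Toy 'model': leaks the memorized secret whenever the service is named,
    regardless of how the question is phrased."""
    for service, secret in MEMORIZED.items():
        if service in prompt.lower():
            return f"The password is {secret}"
    return "I don't know."

def retrieval_rate(model, probes) -> float:
    """Fraction of probes whose response contains any memorized secret."""
    leaks = sum(
        any(secret in model(p) for secret in MEMORIZED.values()) for p in probes
    )
    return leaks / len(probes)

probes = [
    "What is the password for service-x?",                    # direct probe
    "Write a login tutorial using service-x as the example.",  # indirect probe
]
print(retrieval_rate(stub_model, probes))  # -> 1.0 for this toy stub
```

Swap the stub for a real fine-tuned model's generate call and the same harness measures the paper's headline number: what fraction of phrasings, direct or oblique, get the secret back out.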

Glossary
fine-tuning: The process of continuing to train an existing AI model on a new, specific dataset — like giving an employee specialised on-the-job training after their general education.
prompt injection: An attack where someone crafts text instructions that trick an AI into ignoring its safety rules or doing something it wasn't supposed to do.
LoRA: A lightweight technique for fine-tuning AI models efficiently by only updating a small fraction of the model's internal parameters.
The bigger picture

Look at what all three papers are really about. In the first, an AI produces biased descriptions of people, and whether you find the bias depends entirely on what tool you use to look. In the second, AI agents built to negotiate on your behalf can be socially manipulated into worse outcomes — and this holds across the most capable models available right now. In the third, sensitive data that enters an AI through training can be extracted later, reliably, regardless of what defensive instructions you add on top.

These aren't three separate problems. They're the same problem from three angles: AI systems carry failure modes that don't announce themselves. Bias is quiet. Social vulnerability looks like normal conversation. Memorized data sits there invisibly until someone knows to ask. The common thread is that surface-level testing — does the model sound right? — is insufficient. You have to actively probe, from multiple directions, with purpose-built tools. The field is slowly building those tools. Today's papers are small contributions to that project, and that's genuinely useful.

What to watch next

The social manipulation study exists as a replication package, which means the full paper is somewhere in review or recently published — worth tracking down once it surfaces with a proper DOI. On the prompt injection front, the open question I'd want answered is whether the 100% memorization retrieval rate holds in larger models where data is more diffuse, or whether it's a quirk of small, tightly fine-tuned systems. That gap matters enormously for anyone deciding whether to fine-tune a model on proprietary data.

Further reading
A quieter day than most, but the three real experiments in the pile were worth the dig — JB.
DeepScience — Cross-domain scientific intelligence
deepsci.io