DeepScience · Mental Health · Daily Digest

AI mental health tools look right but keep getting things wrong

Three new studies ask the same uncomfortable question: do our mental health technologies actually work, or do they just look like they do?
April 27, 2026
Good morning. Today's pile of papers is heavy on a theme I find genuinely important right now: the gap between a tool that appears to understand mental health and one that actually does. I've pulled three studies that each catch that gap in a different place — in AI simulators, in chatbot habits, and in the wristband you're probably wearing right now. Let's dig in.
Today's stories
01 / 03

AI Generates Believable Fake Patients But Gets the Statistics Completely Wrong

An AI can write you a perfectly believable depressed patient — and still get the statistics of depression completely wrong.

Imagine asking ten people to describe 'a typical song' from a playlist. They'd all pick something from the middle — moderate tempo, familiar structure. The resulting playlist would sound perfectly normal track by track, but you'd never hear the extremes: the seven-minute epic, the 40-second noise burst. That's essentially what a team of researchers discovered when they stress-tested four major AI systems — GPT-4o-mini, DeepSeek-V3, Gemini Flash, and GLM-4.7 — by asking each one to generate thousands of fake patient profiles and answer a standard depression questionnaire. They created 28,800 synthetic profiles spread across 120 demographic groups and compared the resulting population against real epidemiological surveys.

Each individual profile read as clinically plausible — no obvious nonsense. But the population those profiles added up to was wrong in almost every direction. The AIs squeezed the full human spectrum into a narrower band, and the most severe cases were underrepresented. Depression scores were inflated by 3 to 6 PHQ-9 points on average for most groups — a large shift on a scale that tops out at 27. For transgender women, the models went the other direction, underestimating depression by 5 points and capturing only 8 to 46% of the minority-stress effect that real-world research documents. The most unsettling number: even though the AIs gave very similar scores on repeat runs, 37% of virtual patients crossed the clinical threshold for depression between one test and the next. Same profile, different diagnosis. That level of instability would disqualify a human clinician.

The catch worth naming: this study tests simulated profiles, not live clinical tools. But developers routinely use AI-generated synthetic data to train the next generation of mental health software. If the training population is statistically distorted, the resulting tools will be too — quietly, invisibly.
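If you want a concrete sense of what an audit like this checks, here is a minimal, purely illustrative sketch. The numbers and arrays are invented placeholders, not the study's data or code; it only shows the shape of the three failure modes described above: variance compression, mean inflation, and threshold instability across repeat runs.

```python
import numpy as np

# Hypothetical audit of a synthetic patient population against real survey data.
# phq9_real / phq9_synth stand in for PHQ-9 total scores (0-27) per person;
# run1 / run2 are scores for the *same* synthetic profiles on repeated generations.
rng = np.random.default_rng(0)
phq9_real = rng.integers(0, 28, size=5000)             # placeholder for survey data
phq9_synth = np.clip(rng.normal(12, 3, 5000), 0, 27)   # placeholder for model output

# Variance compression: synthetic scores clustering too tightly around the mean.
compression = phq9_synth.std() / phq9_real.std()
print(f"variance ratio (synthetic / real): {compression:.2f}")  # < 1 means compressed

# Mean inflation: a simple difference of group means, in PHQ-9 points.
print(f"mean shift: {phq9_synth.mean() - phq9_real.mean():+.1f} points")

# Threshold instability: same profile, different diagnosis between repeat runs.
CUTOFF = 10  # a common PHQ-9 threshold for clinically significant depression
run1 = np.clip(phq9_synth + rng.normal(0, 2, 5000), 0, 27)
run2 = np.clip(phq9_synth + rng.normal(0, 2, 5000), 0, 27)
crossed = np.mean((run1 >= CUTOFF) != (run2 >= CUTOFF))
print(f"profiles crossing the clinical threshold between runs: {crossed:.0%}")
```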

Glossary
PHQ-9: A standard nine-question depression questionnaire, scored from 0 to 27, that doctors use to measure the severity of a patient's symptoms.
epidemiological fidelity: How accurately a simulated population matches the statistical patterns of the real population it is supposed to represent.
variance compression: When a model produces outputs that cluster too tightly around the average and under-represents extreme or unusual cases.
02 / 03

Using AI as a Tool May Help Your Brain; Using It as a Therapist May Not

Using an AI chatbot to finish your homework might be good for your brain; using it to process your loneliness might not be.

A team at a Chinese university scanned the brains of 222 students and asked them a pointed question: do you use AI chatbots to get things done, or do you use them to talk about how you feel? The distinction turned out to matter more than anyone expected. Students who used AI functionally — for writing, research, problem-solving — had modestly better grades and showed larger gray matter volume in the dorsolateral prefrontal cortex, the region involved in planning and executive control. Think of it like a muscle that stays in shape because it keeps getting used, even if the AI is doing some of the lifting.

Students who used AI socio-emotionally — chatting with bots about loneliness, anxiety, or personal struggles — showed the opposite pattern: lower gray matter volume in regions linked to social processing and emotion, and higher rates of self-reported depression and social anxiety. The two habits were also statistically independent of each other, meaning these aren't just heavy versus light AI users; they are genuinely different types of people doing genuinely different things. For context: 82.5% of students reported frequent functional use, while only 6.8% reported frequent socio-emotional use. So this is a minority behaviour — but not a negligible one.

Now the catch, and it is a real one: this is a cross-sectional study, a single photograph in time. Causality is completely open. Do AI chatbots reshape your brain, or do people who already struggle emotionally reach for AI as a substitute for human connection? Nobody knows yet. The sample is also 222 Chinese university students — a narrow slice of humanity. I'd want to see this replicated in different populations and followed over time before drawing firm conclusions.
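To make 'statistically independent' concrete, here is a toy sketch with invented scores (not the study's data or analysis): if the two usage scores are essentially uncorrelated, knowing how much someone uses AI for schoolwork tells you almost nothing about whether they also confide in it.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative only: 222 invented self-report scores on a 1-5 frequency scale.
# functional = task-oriented chatbot use; socioemotional = emotional confiding.
rng = np.random.default_rng(1)
functional = rng.integers(1, 6, size=222)
socioemotional = rng.integers(1, 6, size=222)

# "Statistically independent" here means the two habits are essentially
# uncorrelated: heavy functional users are not automatically heavy
# socio-emotional users, so this isn't just heavy vs. light AI use.
r, p = pearsonr(functional, socioemotional)
print(f"r = {r:.2f}, p = {p:.3f}")  # a near-zero r is what independence looks like
```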

Glossary
gray matter volume: The amount of brain tissue in a given region, sometimes used as a rough proxy for how active or developed that region is.
dorsolateral prefrontal cortex: A brain region involved in planning, decision-making, and self-control — roughly, the part that helps you stay organised and on task.
cross-sectional study: A study that measures everyone at one point in time, rather than following the same people over months or years, which limits what you can say about cause and effect.
03 / 03

Your Smartwatch Can Spot a Hard Bike Ride but Misses Emotional Stress

Your fitness tracker knows when you just finished a bike ride — it has almost no idea whether you just bombed a job interview.

Six volunteers spent multiple sessions in a lab. One session: hard cycling at 75% of maximum heart rate. Another: a social stress test — presenting to a panel of evaluators who sat expressionless and gave no encouraging feedback, a well-validated way to induce genuine psychological distress. A third session: rest. A wearable device tracked heart rate, heart rate variability — the tiny fluctuations between heartbeats that reflect your nervous system — sweat response, and movement the whole time. Separately, saliva samples were collected at five points per session to measure cortisol, the hormone your body releases when it perceives a threat.

Using machine learning on the wearable data alone, the team achieved 77.8% overall accuracy. Sounds solid. But pull it apart: the algorithm was excellent at detecting physical exertion and rest, which are easy to read from movement and heart rate. When it came to psychological stress — sitting completely still and feeling awful — it was right about half the time. Barely better than a coin flip. Add the cortisol readings, and overall accuracy jumped to 94.4%, while recall for psychological stress climbed from 50% to 83%. The hormone carried the information the wristband couldn't see. Think of it like trying to tell whether your oven is preheating just by looking at it from across the kitchen — you can't, until you open the door and feel the heat rushing out.

The enormous catch: six people. This is a proof-of-concept study, not a product validation. But it identifies a real structural problem: the outer physical signal and the inner emotional state often diverge, and almost every consumer wearable today tracks only the outer one.
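The accuracy-versus-recall gap is easy to see with a toy confusion matrix. The numbers below are invented to mirror the reported pattern rather than taken from the paper: a respectable overall accuracy can coexist with coin-flip recall on the one class that matters most.

```python
import numpy as np

# Toy confusion matrix (rows = true class, columns = predicted class), invented
# to mirror the reported pattern: rest and exertion are easy, psychological
# stress is missed about half the time. Classes: rest, exertion, psych. stress.
cm = np.array([
    [17, 1, 0],   # rest
    [2, 16, 0],   # physical exertion
    [6, 3, 9],    # psychological stress: only 9 of 18 detected
])

accuracy = np.trace(cm) / cm.sum()       # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)    # per class: caught / actually present

print(f"overall accuracy: {accuracy:.1%}")               # 77.8% -- looks solid
print(f"psychological-stress recall: {recall[2]:.1%}")   # 50.0% -- a coin flip
```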

Glossary
heart rate variability: The natural variation in time between consecutive heartbeats — higher variability generally signals a more relaxed, adaptable nervous system.
cortisol: A hormone released by your adrenal glands in response to stress; often called the 'stress hormone' because its levels rise during both physical and psychological threat.
recall: In machine learning classification, the percentage of real cases in a category that the model correctly identifies — low recall means it misses a lot of actual cases.
The bigger picture

Read these three stories together and a pattern emerges that I think is worth sitting with. We have AI systems that produce individual outputs which look clinically right but fail statistically at scale. We have evidence that the way people relate to AI tools — whether as instruments or as confidants — is already leaving measurable traces in how they feel and possibly how their brains are organised. And we have wearables that detect the easy signals while missing the ones that matter most for mental health. None of these is a catastrophe finding. None of them says the technology is useless. What they collectively say is that the mental health tech space has a measurement problem: our tools optimise for the visible and the plausible, and real psychological suffering keeps slipping through the gaps. The uncomfortable question these three papers together put to anyone building in this space is: what exactly are you measuring, and is it the thing that matters?

What to watch next

The PsychBench findings about LLM bias in mental health simulation (the synthetic-patient audit in today's first story) feel like the beginning of a conversation, not an endpoint. I'd watch for responses from the teams behind GPT-4o and DeepSeek, and for whether benchmark audits like this start appearing as a requirement in AI mental health research. On the wearables side, the cortisol-plus-wearable combination will only matter once someone figures out how to get a cortisol reading without making someone spit into a tube in a lab — non-invasive cortisol sensing from sweat patches is an active area to follow.

Thanks for reading — and honestly, these three felt like they belonged together today in a way that doesn't always happen. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io