
DeepScience · Mental Health · Daily Digest

Your Smartwatch, Your Biases, and Your AI Therapist's Blind Spots

Three new papers reveal how AI reads, misreads, and outright cheats when it comes to detecting and simulating depression.
April 22, 2026
Today's papers are all circling the same uncomfortable question: can we trust AI to measure mental health? I spent the morning with three studies that each poke a different hole in the current picture — one hopeful, one humbling, one a little alarming. Let me walk you through them.
Today's stories
01 / 03

Your Smartwatch's Sleep Data May Signal Depression Risk

What if your fitness tracker noticed you were depressed before you did — just from your bedtime drift?

A team built CoDaS, a multi-agent AI system that reads wearable sensor data — the kind your fitness tracker or smartwatch already collects — and hunts for patterns that correlate with depression. Think of it like a detective scanning a year of diary entries, except the diary is made of step counts, resting heart rates, and bedtimes rather than words.

Working across three datasets totalling nearly 10,000 participants, CoDaS surfaced 41 candidate digital biomarkers — measurable signals from everyday devices that track with mental health outcomes. The finding that replicated most reliably across two independent depression cohorts: people with depression tend to have erratic sleep timing. Not necessarily less sleep, but sleep that starts at wildly different hours night to night. A consistent 11 p.m. bedtime looks very different from a pattern that swings between 9 p.m. and 2 a.m. week to week, and that instability showed up clearly in the data.

Why does this matter? If wearables can reliably signal depression risk, that opens a door to passive, continuous monitoring for the hundreds of millions of people who never see a therapist, or who get a clinical assessment only once a year.

The catch: 'correlation' is doing a lot of work here. Adding CoDaS features improved depression prediction models by a delta-R² of 0.040 — that is real, but modest. The study cannot tell us whether erratic sleep causes depression, accompanies it, or is caused by it. And this was validated on research datasets, not a deployed clinic. Before your fitness app starts sending you depression alerts, a lot of real-world testing still has to happen.
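
I can't show you the paper's own feature code, but the core idea, turning bedtime drift into a single number, is easy to sketch. Here is a minimal, hypothetical Python version (my illustration, not CoDaS's actual feature) using the circular standard deviation of sleep-onset times. Bedtimes wrap around midnight, so an ordinary standard deviation would badly mis-score someone who alternates between 11:50 p.m. and 12:10 a.m.:

```python
import numpy as np

def sleep_onset_instability(onset_hours):
    """Circular standard deviation of nightly sleep-onset times, in hours.

    Bedtimes wrap around midnight, so each onset is mapped to an angle on
    a 24-hour circle before measuring spread. Higher values mean more
    erratic sleep timing.
    """
    angles = np.asarray(onset_hours, dtype=float) * 2 * np.pi / 24.0
    # Mean resultant length R: 1.0 = identical bedtimes, near 0 = no pattern.
    R = np.hypot(np.cos(angles).mean(), np.sin(angles).mean())
    # Circular standard deviation, converted from radians back to hours.
    return np.sqrt(-2 * np.log(R)) * 24.0 / (2 * np.pi)

steady = [23.0, 23.2, 22.8, 23.1, 23.0, 22.9, 23.1]    # ~11 p.m. every night
erratic = [21.0, 26.0, 23.5, 21.5, 25.0, 22.0, 26.5]   # 9 p.m. to 2:30 a.m.
print(f"steady:  {sleep_onset_instability(steady):.2f} h")
print(f"erratic: {sleep_onset_instability(erratic):.2f} h")
```

The steady sleeper scores a fraction of an hour; the erratic one scores several times higher. That gap, aggregated over thousands of people, is the kind of signal the study is reporting.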

Glossary
digital biomarker: A measurable signal collected from a digital device — like step count or heart rate — that is statistically linked to a health outcome.
delta-R²: How much additional predictive power new features add to a statistical model; 0.040 means the new features explain 4 more percentage points of variance. (A toy calculation follows this glossary.)
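
To make delta-R² less abstract, here is a toy calculation on synthetic data. Nothing below comes from the study; the features and effect sizes are invented. The recipe is simply: fit a prediction model with and without the new features, and compare the variance explained on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
baseline = rng.normal(size=(n, 5))   # stand-ins for age, survey scores, etc.
wearable = rng.normal(size=(n, 3))   # stand-ins for candidate biomarkers
# Outcome depends mostly on baseline features, weakly on one wearable feature.
y = baseline @ rng.normal(size=5) + 0.4 * wearable[:, 0] + rng.normal(size=n)

Xb_tr, Xb_te, Xw_tr, Xw_te, y_tr, y_te = train_test_split(
    baseline, wearable, y, random_state=0)

r2_base = r2_score(y_te, LinearRegression().fit(Xb_tr, y_tr).predict(Xb_te))
full_tr, full_te = np.hstack([Xb_tr, Xw_tr]), np.hstack([Xb_te, Xw_te])
r2_full = r2_score(y_te, LinearRegression().fit(full_tr, y_tr).predict(full_te))

print(f"baseline R^2: {r2_base:.3f}")
print(f"with wearables: {r2_full:.3f}  (delta-R^2 = {r2_full - r2_base:.3f})")
```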
02 / 03

AI Can Fake One Depressed Patient Convincingly But Gets the Crowd All Wrong

An AI that plays a single depressed patient flawlessly can still be a terrible stand-in for ten thousand of them.

Researchers built PsychBench to stress-test whether large language models — GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, and GLM-4.7 — can realistically simulate psychiatric patients at population scale. They generated 28,800 synthetic patient profiles across different age, gender, race, and socioeconomic backgrounds, gave each simulated patient standardised mental health questionnaires, then compared the results to real population health surveys.

Individually, the simulated patients sounded plausible — no glaring clinical nonsense. But zoom out and the picture collapsed. Think of a chef who makes one perfect dish but, when asked to run a full restaurant, serves a slightly modified version of that same dish to everyone.

The AI squeezed out the extremes: the very severe cases, the mildest presentations, the unusual profiles. Depending on the model, symptom variance was compressed by between 14% and 62% compared to real populations. Depression severity was systematically overestimated for most demographic groups by 3 to 6 PHQ-9 points. And 37% of simulated patients crossed different diagnostic thresholds between two identical runs — like a thermometer that reliably reads 'around normal' but randomly shows either 36.2 or 38.1 whenever you look.

Why this matters: these simulations are increasingly used to generate synthetic training data for clinical AI tools and to evaluate chatbot therapists. A distorted simulation will pass its distortions downstream.

The catch: some models (GLM-4.7) performed noticeably better than others. This is not a universal failure, and the paper is a measurement tool, not a clinical trial. It identifies the problem more than it solves it.
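
For the curious, the two headline metrics are simple to compute. This sketch uses invented PHQ-9 score distributions, chosen only to echo the shape of the findings, not the paper's actual data or pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented PHQ-9 scores: a realistic population spread vs a "compressed"
# simulated cohort with an inflated mean and a squeezed distribution.
real = np.clip(rng.normal(loc=8, scale=6, size=10_000), 0, 27)
sim = np.clip(rng.normal(loc=12, scale=3.5, size=10_000), 0, 27)

print(f"variance compression: {1 - sim.var() / real.var():.0%}")
print(f"mean severity shift: +{sim.mean() - real.mean():.1f} PHQ-9 points")

# Test-retest flakiness: the same profiles scored on two identical runs,
# with illustrative run-to-run noise added.
run1 = np.clip(rng.normal(12, 3.5, 10_000), 0, 27)
run2 = np.clip(run1 + rng.normal(0, 3, 10_000), 0, 27)
cuts = [5, 10, 15, 20]   # standard PHQ-9 severity band boundaries
crossed = np.digitize(run1, cuts) != np.digitize(run2, cuts)
print(f"crossed a severity band between runs: {crossed.mean():.0%}")
```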

Glossary
PHQ-9: A nine-question questionnaire that scores depression severity from 0 (none) to 27 (severe); a 3–6 point overestimate is clinically meaningful.
variance compression: When a system produces outputs that cluster too tightly around the average, erasing the realistic spread of mild, moderate, and severe cases.
03 / 03

Depression-Detecting AI Was Learning the Doctor's Script, Not the Patient's Words

An AI built to detect depression scored 98% accuracy on one dataset — using only the therapist's side of the conversation.

A research team studying automated depression detection found something embarrassing in the data: the AI models weren't always learning to recognise depression. They were learning the interviewer's rhythm. Semi-structured clinical interviews follow a loose but consistent format — the clinician asks roughly the same prompts in roughly the same order every time.

When the researchers trained models using only what the interviewer said — stripping out the patient's words entirely — the AI still classified depressed versus healthy subjects with near-perfect accuracy on one dataset, hitting a macro-F1 score of 0.98. It had learned to spot structural fingerprints of the interview format itself, not signs of illness. Think of a student who passes an exam by memorising which question appears in which position, rather than actually knowing the subject.

The team tested this across three linguistically different datasets — two in North American English (DAIC-WOZ and E-DAIC) and one in Italian (ANDROIDS) — and two different AI architectures. The interviewer bias appeared in all of them, though it was strongest in ANDROIDS. When models were restricted to participant speech only, the decision evidence spread more naturally across the whole conversation rather than clustering around fixed prompt positions.

Why this matters: the benchmarks used to validate depression-detection tools are partly measuring the wrong thing. High scores on these datasets may reflect script-learning as much as genuine clinical insight.

The catch: participant-only models still detected depression, so the fix works, but it comes with some loss of accuracy. The finding calls for better experimental design, not a wholesale rejection of the field.
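
The diagnostic at the heart of the paper is easy to reproduce in spirit: train a classifier on one speaker channel at a time and see what it can still do. Here is a minimal sketch, assuming transcripts stored as (speaker, utterance) pairs and a simple bag-of-words model rather than whatever architectures the authors actually used:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def channel_only_score(transcripts, labels, speaker):
    """Train and score a text classifier on ONE side of each interview.

    `transcripts` is a list of interviews, each a list of
    (speaker, utterance) pairs; `speaker` picks which channel the model
    is allowed to see ("interviewer" or "participant").
    """
    docs = [" ".join(u for s, u in t if s == speaker) for t in transcripts]
    X_tr, X_te, y_tr, y_te = train_test_split(docs, labels, random_state=0)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    return f1_score(y_te, model.predict(X_te), average="macro")

# A healthy benchmark should make this gap large:
# channel_only_score(transcripts, labels, "interviewer")  -> near chance
# channel_only_score(transcripts, labels, "participant")  -> the real signal
```

If the interviewer-only score rivals the participant-only score, the model is reading the script, not the patient. That is the red flag the paper raises.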

Glossary
macro-F1 score: A measure of classification accuracy that balances precision and recall across all categories; 0.98 out of 1.0 is very high. (A toy example follows this glossary.)
semi-structured clinical interview: A mental health assessment where a clinician follows a loose script of standard questions but can adapt the conversation — making the format consistent enough to accidentally teach an AI the script.
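
If you want to see macro-F1 in action, here is a toy example with made-up labels, showing why it treats a rare class as seriously as a common one:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])  # 1 = depressed, 0 = healthy
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

# F1 is computed separately for each class, then averaged with equal weight,
# so the rarer "depressed" class counts as much as the majority class.
per_class = f1_score(y_true, y_pred, average=None)
print(per_class)              # [F1 for healthy, F1 for depressed]
print(per_class.mean())       # macro-F1
print(f1_score(y_true, y_pred, average="macro"))   # same value
```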
The bigger picture

Step back from these three papers and a pattern comes into focus — and it is not a comfortable one. We have AI that can find real signals in wearable data (CoDaS), but those signals are modest and unproven in the wild. We have AI that can simulate mental health patients convincingly at the individual level, but systematically distorts the population picture (PsychBench). And we have AI that appears to detect depression well, but may partly be detecting the clinician's habits instead (the interviewer-bias paper). None of these studies say AI in mental health is worthless. What they collectively say is that the field is validating itself against the wrong benchmarks — small research datasets, synthetic populations, structured interviews — and then extrapolating to a far messier real world. The gap between 'works in the lab' and 'trustworthy in a clinic' is wider than the headline accuracy numbers suggest. That is worth knowing.

What to watch next

The CoDaS finding about sleep-timing instability is the one I'd most want to see replicated in a prospective clinical trial — where participants are tracked forward in time, not analysed retrospectively. On the benchmarking side, the PsychBench paper opens a question that no one has cleanly answered yet: if LLM patient simulations are this miscalibrated, what happens to the chatbot therapists trained partly on synthetic data? That audit hasn't been done publicly yet, as far as I can tell.

Thanks for reading — and if your bedtime has been drifting lately, maybe just notice that. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io