

DeepScience · Mental Health · Daily Digest

What Your Sleep and Stories Quietly Say About Depression

AI is learning to read depression from your wearable data and your storytelling — but simulating mental health populations is a different, trickier problem.
May 10, 2026
Three papers landed this week that, taken together, tell you something useful: the signals for depression are hiding in places nobody thought to look systematically until very recently — your sleep schedule's chaos, the structure of how you narrate your day, and the gap between what AI gets right about individuals versus what it quietly gets wrong about crowds. Let me walk you through each one.
Today's stories
01 / 03

Your Wearable's Sleep Data May Already Be Tracking Depression Signals

What if the most honest record of your mental health isn't what you'd tell a therapist, but how erratically your body falls asleep?

The CoDaS team built an AI system — a multi-agent setup where different software components generate hypotheses, run statistics, challenge each other's conclusions, and then write up a summary — and turned it loose on wearable data from over 9,000 participants. Think of it like a team of analysts working a spreadsheet in shifts, each one checking the previous person's work before passing it on.

The system surfaced 41 candidate signals — called digital biomarkers, meaning measurable patterns from your body or behaviour that might flag a health condition — linked to depression. The two that showed up independently in two separate depression datasets were both sleep-related: how much your sleep duration varies night to night, and how much your sleep start time shifts around. Not whether you sleep eight hours. How unpredictable those hours are. The correlation numbers are modest — roughly 0.13 to 0.25 on a scale where 1.0 would be perfect prediction — and adding these features to a basic prediction model improved accuracy by about 4 percentage points. That's real, but it's not dramatic.

Here is the catch. These are candidates, not confirmed biomarkers. The study is cross-sectional — meaning it's a snapshot in time, not a before-and-after — so we cannot say yet whether sleep irregularity causes depression, accompanies it, or just correlates by coincidence. Before any of this reaches a clinician's desk, the signals need validation in prospective studies where researchers follow people over time. For now, what CoDaS shows is that AI can help surface patterns a human analyst might miss across enormous datasets. The hypothesis engine works. The hypotheses still need testing.

Glossary
digital biomarker: A measurable signal from a device or behaviour — like sleep timing from a wearable — that may indicate a health condition.
cross-sectional study: Research that captures data at one point in time rather than following people forward to track cause and effect.
Spearman correlation: A number between -1 and 1 measuring how consistently two variables move together in the same direction; values near 0 mean little relationship.
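To make the sleep-variability idea concrete, here is a toy sketch — entirely simulated data, not the CoDaS pipeline or its dataset — that computes one of the candidate biomarkers described above (night-to-night standard deviation of sleep duration) and its Spearman correlation with an invented symptom score:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cohort: 200 hypothetical participants, 14 nights of wearable data each.
n, nights = 200, 14
per_person_sigma = rng.uniform(0.2, 2.0, n)  # how erratic each sleeper is
sleep_hours = rng.normal(7.5, per_person_sigma[:, None], (n, nights))

# Candidate biomarker from the study's description: night-to-night
# standard deviation of sleep duration (the variability, not the average).
duration_sd = sleep_hours.std(axis=1)

# Invented symptom score, loosely tied to that variability so the toy
# data shows a modest association of the kind the digest reports.
symptom = 5 + 1.0 * duration_sd + rng.normal(0, 2.5, n)

def spearman(x, y):
    """Spearman rank correlation (no tie handling; fine for continuous data)."""
    rx, ry = x.argsort().argsort(), y.argsort().argsort()
    return float(np.corrcoef(rx, ry)[0, 1])

rho = spearman(duration_sd, symptom)
print(round(rho, 2))  # modest positive correlation, by construction
```

The point of the sketch is only that "variability of sleep, not amount of sleep" is a one-line feature to compute once you have per-night durations; the clinical meaning of that feature is exactly what still needs prospective validation.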
02 / 03

It Is Not the Words You Choose — It Is How You Structure Your Story

Depressed people don't just use sadder words — they tell stories that are structurally harder to follow.

Researchers analyzing 830 therapeutic writing samples from Chinese participants — ranging from schoolchildren to adults, across six studies conducted between 2018 and 2024 — tested three different ways of reading text for mental health signals. The first approach counted word types: how often someone used first-person pronouns, negative-emotion words, and so on. The second looked at whether sentences felt semantically connected to each other — a kind of paragraph-level flow check. The third asked a large language model to evaluate the overall structure of each piece of writing: did it have a clear beginning, a complicating event, a resolution? Did the narrative hold together causally?

This third approach, the structural one, beat the other two by a significant margin for predicting depression, anxiety, and trauma severity. Think of it like music. You can analyse a song by listing the notes used (word choice), or by checking whether adjacent notes sound harmonious (sentence flow), or by asking whether the song has a proper verse-chorus-bridge architecture that builds and resolves tension (narrative structure). The researchers found that the architecture is what carries the clinical signal. Two specific patterns stood out: people with depression tended to show temporal disorganisation — their story's timeline was scrambled — while people with anxiety tended to show weak spatial grounding, describing events without anchoring them in a place.

The catch is real: all samples were in Chinese, all writing was therapeutic (not casual text), and the statistical methods for comparison are only partially described in the published version. Replication in other languages and contexts is the obvious next step. I simplified here — the LLM evaluation is zero-shot, meaning the model received no specific training on depression; it was reading for story structure cold.

Glossary
lexical features: Characteristics based purely on word choice — which words appear and how often, without considering sentence or story structure.
semantic embeddings: A way of converting text into numerical coordinates that capture meaning, so that sentences with similar meanings end up mathematically close to each other.
zero-shot: An AI system given a task it has not been specifically trained for, relying only on its general knowledge.
RST coherence: Rhetorical Structure Theory — a framework for analysing how parts of a text logically and rhetorically connect to form a whole.
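Of the three approaches, the first — lexical feature counting — is simple enough to sketch in a few lines. The word lists below are tiny illustrative stand-ins, not the study's actual lexicons, and the sample sentence is invented:

```python
import re

# Tiny illustrative lexicons -- stand-ins, not the study's word lists.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
NEG_EMOTION = {"sad", "tired", "alone", "hopeless", "worried"}

def lexical_features(text: str) -> dict:
    """Rate of first-person and negative-emotion words per total words."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    return {
        "first_person_rate": sum(w in FIRST_PERSON for w in words) / total,
        "neg_emotion_rate": sum(w in NEG_EMOTION for w in words) / total,
    }

sample = "I stayed home because I was tired and sad. My week felt hopeless."
print(lexical_features(sample))  # both rates come out to 3/13 here
```

Notice what this kind of counting cannot see: it treats the text as a bag of words, so a scrambled timeline and a coherent one produce identical scores — which is exactly the gap the structural approach fills.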
03 / 03

AI Simulates Plausible Depressed Individuals But Gets Populations Badly Wrong

An AI can produce a convincing portrait of a single depressed person — but ask it to paint a crowd and everyone looks suspiciously similar.

The PsychBench team generated 28,800 synthetic patient profiles across four major AI models — GPT-4o-mini, DeepSeek-V3, Gemini-Flash, and GLM-4.7 — using standardised psychiatric questionnaires, then compared what the AI produced against actual population data from large U.S. health surveys. Individual profiles looked clinically plausible. No model generated a patient whose symptoms violated basic diagnostic logic — if depression was flagged, the right gateway symptoms were present.

But the moment you zoomed out to look at the whole crowd of simulated patients, something went wrong. The AI models dramatically compressed variation. In a real population, mental health severity is spread across a wide range — some people are mildly affected, some severely, and the extremes matter enormously for clinical planning. The AI models squeezed everyone toward the middle. DeepSeek-V3 eliminated 62% of the real-world spread; even the best model, GLM-4.7, still eliminated 14%. Worse: 37% of simulated cases switched diagnostic category — depressed or not — between two runs of the same prompt, even though the overall correlation looked high. That is like a thermometer that reads 98.6°F reliably on average but randomly reads 96 or 103 for individual patients on different days.

The bias was not uniform. AI models overestimated depression severity for most demographic groups by 3 to 6 points on a standard scale, while simultaneously underestimating it for transgender women by 5 points — capturing only a fraction of the documented mental health burden that group actually carries. Why does this matter? Researchers are increasingly using synthetic data to train and test AI diagnostic tools. If the training crowd is artificially smoothed and demographically miscalibrated, the tools built on it inherit those blind spots. The portrait looks fine. The population data is quietly broken.

Glossary
synthetic patient profiles: AI-generated fictional patient records that are meant to statistically resemble real patients, used to train or test medical algorithms without using actual patient data.
variance compression: When a model produces outputs that cluster too close to the average, erasing the real-world spread between mild and severe cases.
ICC (Intraclass Correlation Coefficient): A measure of how consistently a rating system produces the same result when repeated; high ICC means reliable, but it does not guarantee the ratings are accurate.
epidemiological fidelity: How accurately a simulated population reflects the real-world distribution of a condition, not just whether individual cases look believable.
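Both failure modes — variance compression and test-retest category flipping — are easy to measure once you frame them as statistics. Here is a toy sketch on invented numbers (a PHQ-9-style 0–27 scale with the conventional cutoff of 10; the distributions are made up, not PsychBench's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "real" population: symptom scores with wide spread on a 0-27 scale.
real = np.clip(rng.normal(9, 6, 10_000), 0, 27)

# Toy "synthetic" population: same centre, spread squeezed toward the
# middle, mimicking the variance compression the digest describes.
synthetic = np.clip(rng.normal(9, 2.5, 10_000), 0, 27)

# Share of the real-world spread (standard deviation) that was eliminated.
compression = 1 - synthetic.std() / real.std()

# Test-retest flip rate: two runs of the same "prompt" with independent
# noise; a case flips if the two runs disagree about the cutoff (>= 10).
run1 = synthetic + rng.normal(0, 3, synthetic.size)
run2 = synthetic + rng.normal(0, 3, synthetic.size)
flip_rate = np.mean((run1 >= 10) != (run2 >= 10))

print(f"spread eliminated: {compression:.0%}, flip rate: {flip_rate:.0%}")
```

The sketch also shows why a high average correlation can coexist with a high flip rate: the noise mostly cancels in aggregate statistics, but any individual case sitting near the diagnostic cutoff can land on either side of it from run to run.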
The bigger picture

Here is what these three papers are collectively telling you: the field is getting better at finding depression signals in unexpected places — erratic sleep timing, fractured story structure — but it is also discovering that the tools doing the finding have systematic blind spots baked in. CoDaS finds real patterns in wearable data; the narrative paper finds real patterns in how people structure language. Both are genuinely useful steps. But PsychBench is the uncomfortable third piece: the AI systems being used to simulate, train, and evaluate these tools distort the very populations they claim to represent, and they do it unevenly — smoothing out extremes and mis-weighting minority groups. You cannot separate the signal-finding from the tool-auditing. Better biomarkers built on miscalibrated synthetic populations will carry those distortions forward. The honest position is: the detection science is advancing faster than the trust infrastructure around it.

What to watch next

The immediate question is whether the sleep irregularity signals from CoDaS hold up in a prospective study — meaning one that follows people forward in time rather than capturing a snapshot. That kind of validation is the difference between a candidate and a clinical tool, and it typically takes two to three years. On the AI simulation problem, watch for whether journals and funding bodies start requiring epidemiological audits of synthetic datasets used in mental health AI research — that norm does not yet exist, and PsychBench is essentially making the case that it should.

Thanks for reading — back tomorrow with whatever the preprint servers surface overnight. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io