
DeepScience · Mental Health · Daily Digest

Wristbands, Fake Patients, and Your AI Companion's Brain Bill

Three new papers ask whether our best mental health tools are measuring the right things — and whether we can trust what they find.
May 02, 2026
Happy Saturday. I spent the morning working through a pile of 282 papers so you don't have to. Three of them genuinely stopped me: one about wristbands quietly flagging depression clues, one about AI-generated fake patients that look right but are statistically broken, and one that scanned students' brains and found that chatting with AI for emotional support looks different — and worse — than using it to write essays. Let's dig in.
Today's stories
01 / 03

A Wristband AI Found 41 Possible Depression Clues in Sleep Data

Your sleep schedule's wobble from night to night might say more about your mental health than almost anything you'd tell a doctor.

The team behind CoDaS built a multi-agent AI pipeline — think of it as a team of six specialized analysts passing a file around, each adding a layer of checking — and fed it wearable sensor data from over 9,000 people across three separate studies. It wasn't looking at raw sleep duration. It was looking at variability: how much your sleep length shifts from Monday to Tuesday to Wednesday, and how much your bedtime drifts across the week. Two of those circadian instability ("wobble") features showed up independently in two separate depression datasets. That's the kind of replication that makes researchers sit up. In total, the system flagged 41 candidate digital biomarkers for mental health and 25 for metabolic health. Adding the best of these features to a prediction model improved its ability to explain depression scores by about 4 percentage points — modest, but statistically real and validated on held-out data.

Why does this matter? Most depression diagnosis today relies on questionnaires you fill out in a doctor's office. You self-report how you've been feeling. But you might not notice you've been sleeping erratically for six weeks until someone asks you to look back. A wristband that spots the pattern before you do is a very different kind of early warning system.

The catch: "candidate biomarker" is researcher language for "we found a correlation worth investigating." These patterns do not diagnose depression, and a 4-point improvement in explained variance is a hint, not a clinical tool. What needs to happen next is a prospective trial — follow people over time, collect wearable data, and see whether the patterns predict who develops depression before it arrives. That trial has not been run yet.
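If you're curious what a "circadian wobble" feature actually is, it's simple to picture. Here is a minimal sketch (not the CoDaS pipeline; the numbers are invented and the features are plain standard deviations) of the two kinds of variability described above:

```python
from statistics import pstdev

# Hypothetical week of wearable data: nightly sleep duration in hours,
# and bedtime expressed as hours after midnight (negative = before midnight).
sleep_hours = [7.5, 6.0, 8.2, 5.5, 7.9, 6.4, 8.0]
bedtimes = [-0.5, 1.0, 0.2, 1.8, -0.2, 1.5, 0.0]

def circadian_instability(durations, onsets):
    """Two toy 'wobble' features: the night-to-night spread of sleep
    length and of sleep onset time, measured as standard deviations.
    A perfectly regular sleeper scores (0.0, 0.0)."""
    return pstdev(durations), pstdev(onsets)

dur_sd, onset_sd = circadian_instability(sleep_hours, bedtimes)
print(f"sleep-duration variability: {dur_sd:.2f} h")
print(f"bedtime variability:        {onset_sd:.2f} h")
```

The point of the sketch: two people can average the same eight hours a night and still look completely different on these features, which is exactly the signal a raw sleep-duration number throws away.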

Glossary
digital biomarker: A measurable signal from a device (like a wristband or phone) that might indicate something about your health.
circadian instability: Day-to-day irregularity in your body's 24-hour rhythms, such as shifting sleep and wake times.
prospective trial: A study that follows real people forward in time rather than looking backward at existing data.
02 / 03

AI-Generated Fake Patients Look Convincing but Get Reality Badly Wrong

An AI model can write a clinically perfect fake depressed patient and still produce a population of fake patients that looks nothing like the real world.

The PsychBench team generated 28,800 synthetic patient profiles using four major AI models — GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, and GLM-4.7 — then held those profiles up against two large American health surveys to see how well the fake population matched the real one. At the individual level, the AI did fine. Zero cases violated clinical rules — no fake patient was given a diagnosis that was logically impossible given their symptoms. But zoom out to the full population and something quietly broke.

Think of it like a costume department that nails every single outfit but somehow dresses the whole cast in shades of beige: the extremes vanished. In real life, symptom severity runs from barely-there to absolutely crushing. In the AI populations, that range was compressed by between 14% and 62% depending on the model. The sickest people — the tails of the distribution — were systematically erased.

There were also calibration failures. Depression severity was overestimated for most demographic groups by 3 to 6 PHQ-9 points — a clinically significant gap. Transgender women, meanwhile, were dramatically underestimated, with models capturing only 8–46% of their documented vulnerability to symptoms. And 37% of simulated cases flipped their diagnosis — depressed or not — just from one run to the next.

This matters because researchers and companies are increasingly using AI-generated synthetic patients to train other tools, test guidelines, and model therapies. If the fake population is systematically skewed, everything downstream inherits that skew. The honest limit here: this is an audit of AI behavior. It does not yet tell us whether clinical tools built on synthetic patients have harmed anyone. That evidence does not exist yet.
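Variance compression is easy to see with toy numbers. The sketch below uses invented PHQ-9 scores (not PsychBench data): a "real" sample spanning the full 0–27 range against a synthetic sample that clusters near the middle, then measures how much of the spread went missing:

```python
from statistics import pstdev

# Hypothetical PHQ-9 severity scores (0-27 scale).
# The real-world sample spans the full range; the synthetic sample
# hugs the middle, mimicking the tail-erasure described above.
real = [0, 2, 4, 7, 9, 12, 14, 17, 20, 24, 27]
synthetic = [3, 5, 7, 9, 10, 12, 13, 15, 17, 19, 21]

# Compression = how much narrower the synthetic spread is than the
# real one. 0% means the spreads match; 100% means all variety is gone.
compression = 1 - pstdev(synthetic) / pstdev(real)
print(f"variance compression: {compression:.0%}")
```

In this toy example the synthetic sample loses roughly a third of the real spread, which happens to land inside the 14–62% range the paper reports, even though no individual synthetic score looks implausible on its own. That is the costume-department problem in miniature.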

Glossary
epidemiological fidelity: How accurately a simulated population matches the real-world distribution of a condition across different groups.
PHQ-9: A nine-question questionnaire that measures depression severity on a 0–27 scale.
variance compression: When a model produces outputs that cluster near the average and eliminates the extreme cases found in reality.
03 / 03

Using AI for Homework and Using It for Friendship Look Different in the Brain

Among 222 university students, leaning on AI to do tasks and leaning on it for emotional company pointed in opposite directions — in grades, in anxiety scores, and in brain scans.

A team scanned the brains of 222 university students with high-resolution MRI and asked them how they use AI tools — but crucially, they separated two very different kinds of use. Functional use: writing essays, summarizing readings, solving problems. Socio-emotional use: chatting with AI for companionship, processing feelings, using it as a social stand-in. Think of it as the difference between using a calculator and talking to the calculator.

Students who used AI more for tasks had, on average, modestly higher grades, slightly larger gray matter volume in the part of the brain associated with planning and executive function (the dorsolateral prefrontal cortex), and more efficient connectivity in the hippocampal network — the brain's filing and retrieval system for memories. Students who used AI more for emotional support showed the reverse: lower gray matter in regions tied to social processing and emotional recognition, higher depression scores, and higher social anxiety scores. Only 6.8% of participants reported frequent socio-emotional use, but the signal was clear.

This is some of the first brain imaging data connecting not how much you use AI but how you use it to measurable mental health differences. As AI companions become commercially mainstream, that distinction is going to matter.

The catch — and it's a big one — is that this is a snapshot, not a film. The researchers took measurements at one point in time. We cannot tell whether emotional AI use caused the differences, or whether people who were already more isolated or anxious were drawn to AI companionship in the first place. Effect sizes are also small. This is a signal worth tracking, not a verdict.

Glossary
gray matter volume: The amount of brain tissue in a region — more is generally associated with greater processing capacity in that area, though the relationship is complex.
hippocampal network: A set of connected brain regions centered on the hippocampus that handles memory formation and spatial navigation.
cross-sectional study: A study that takes a single measurement at one point in time rather than following the same people over months or years.
The bigger picture

Put these three papers next to each other and a pattern emerges. CoDaS is building better instruments — wristbands that catch signals your questionnaire would miss. PsychBench is asking whether the AI models we're building to represent patients actually reflect the full range of human suffering, and finding they quietly erase the extremes and misrepresent the most vulnerable groups. The brain study is asking what these tools are doing to the people using them in the first place — and finding that the answer depends entirely on what you're using them for.

These are not separate problems. If we deploy wearable biomarker tools trained on AI-generated synthetic populations that underrepresent severely ill people, we will miss the people who need help most. If we normalize emotional reliance on AI without understanding the brain-level consequences, we may be trading one kind of loneliness for another. The field is doing the right thing by asking these questions now. But the questions are running ahead of the answers.

What to watch next

The big open question from today is whether any of the CoDaS biomarker candidates will survive a prospective validation trial — one that follows real people forward, not backward. That kind of trial takes years to run, so don't hold your breath for a result soon, but watch for announcements of trials being registered. On the PsychBench side, it will be worth seeing whether AI health companies respond by auditing their synthetic training data — or quietly ignore the findings. And the brain imaging study on AI use types is crying out for a longitudinal replication: follow the same students over two or three years and see whether the patterns hold or reverse.

Thanks for reading — JB.
DeepScience — Cross-domain scientific intelligence
deepsci.io