DeepScience · Artificial Intelligence · Daily Digest

Wearables That Know You, Robots That Check Themselves

Three papers show AI learning from your pulse, your robot's mistakes, and a tractor that still can't pick the right wrench.

May 24, 2026

Three stories today, and honestly it's a mixed bag — one is genuinely impressive, one is a small but honest step forward, and one is a useful reality check. Let me walk you through each. No overpromising.

Today's stories

01 / 03

An AI Trained on a Trillion Minutes of Heartbeats and Sleep Data

What if a doctor had silently watched five million people sleep, exercise, and stress out — without ever treating a single one of them?

That is roughly the logic behind SensorFM, a new foundation model built by a team whose paper dropped this week on arXiv. They pretrained it on more than one trillion minutes of unlabeled wearable sensor data (accelerometer readings, heart rate, blood oxygen, skin temperature) collected from five million people across 100 countries and 20+ device types. One trillion minutes, by the way, is roughly 1.9 million years of continuous recording. So: big.

Think of a foundation model like a cook who has tasted ten thousand dishes before ever reading a recipe. When you hand them a new dish to learn, they pick it up faster than someone starting cold. SensorFM works the same way: after all that pretraining on raw sensor signals, it only needs a small number of labeled examples to learn new health prediction tasks, such as detecting metabolic conditions, estimating sleep stages, or flagging mental health signals.

The team validated SensorFM on three independent studies covering 13,985 participants and 35 different health prediction tasks. Performance improved over traditional approaches by somewhere between 10% and 70% depending on the task: a wide range, but consistently in the right direction. They also tested a 'personal health agent' that answered health questions using the model; across 1,860 clinician ratings, its responses were judged relevant and safe.

The catch: this model was trained by a company on proprietary data. You cannot reproduce it. The paper also does not report confidence intervals for most of its results, so the headline gains are harder to trust than they look. And 'the model says your sleep is poor' is very different from 'a doctor has reviewed your sleep.' Do not throw away your GP.
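The scale figures above are easy to sanity-check. Here is a quick sketch, assuming a 365.25-day year and an even split of data across participants. The even split is purely my assumption for illustration: the numbers quoted here give only totals, so the per-person figure is a back-of-envelope average, not something the paper reports.

```python
# Back-of-envelope check of the dataset scale quoted above.
TOTAL_MINUTES = 1_000_000_000_000  # "more than one trillion minutes" (lower bound)
PARTICIPANTS = 5_000_000           # "five million people"

MINUTES_PER_DAY = 60 * 24
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365.25  # assumption: 365.25-day year


def total_years(minutes: int) -> float:
    """Convert a total recording time in minutes to years."""
    return minutes / MINUTES_PER_YEAR


def avg_days_per_person(minutes: int, people: int) -> float:
    """Average days of data per person, assuming an even split (illustrative only)."""
    return minutes / people / MINUTES_PER_DAY


years = total_years(TOTAL_MINUTES)
days_each = avg_days_per_person(TOTAL_MINUTES, PARTICIPANTS)

print(f"{years / 1e6:.2f} million years in total")      # 1.90 million years in total
print(f"{days_each:.0f} days per person on average")    # 139 days per person on average
```

So the 1.9-million-year figure checks out, and under the even-split assumption it works out to a bit under five months of recording per person.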

Glossary

foundation model — A large AI model pretrained on vast amounts of raw data so it can be quickly adapted to many specific tasks with little additional training.

self-supervised pretraining — Training an AI on unlabeled data by getting it to predict or reconstruct parts of its own input, without human-provided labels.

scaling law — The observed pattern that AI model performance improves predictably as you increase both the size of the model and the amount of training data.

Source: Towards a General Intelligence and Interface for Wearable Health Data

02 / 03

A Robot That Double-Checks Its Own Moves Before It Makes Them

Your spell-checker catches typos before you hit send — what if your robot could do the same with physical actions?

Modern robots increasingly rely on what researchers call vision-language-action models, or VLAs for short. You can think of a VLA as the brain that takes in a camera feed and a verbal instruction ('pick up the cup,' 'open the drawer'), then outputs a movement command. The problem is that these brains sometimes generate bad moves, and once a robot starts a bad move, errors compound fast.

Pre-VLA, from a team posting on arXiv this week, is a module that sits in front of those movement commands and acts like a fast spell-checker. Before the robot commits to a motion, Pre-VLA scores it: is this action likely to succeed, or is it about to go sideways? If the score is low, it asks the brain for a different option. The whole check takes about 184 milliseconds, roughly a fifth of a second.

Tested on the LIBERO robotic manipulation benchmark, a standard simulation suite of household tasks, Pre-VLA pushed the overall task success rate from a baseline of 30.79% to 37.62%. That is an absolute gain of about seven percentage points. Not a solved problem (the system still fails on roughly six tasks in ten), but a real and consistent improvement.

The catch, and it is a big one: all of this was tested in simulation, not on a physical robot. Simulators are clean. Real kitchens have loose cables, wet surfaces, and cats. The 184-millisecond verification time also assumes you have a reasonably fast computer nearby; deploy this on constrained hardware and those numbers change. Small but real step forward.
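The 'score it, and ask again if the score is low' idea can be sketched in a few lines. To be clear about what is assumed: this is not Pre-VLA's code. The function names, the threshold, the retry limit, and the 'fall back to the best candidate seen' rule are all hypothetical choices of mine, and the toy proposer and scorer stand in for a real VLA and a real verifier.

```python
from typing import Callable, List, Optional, Tuple

Action = List[float]  # hypothetical: a vector of motor commands


def pick_verified_action(
    propose: Callable[[], Action],      # stands in for the VLA
    score: Callable[[Action], float],   # stands in for the verifier
    threshold: float = 0.5,             # assumption: accept at or above this score
    max_tries: int = 4,                 # assumption: cap on re-proposals
) -> Tuple[Action, float, bool]:
    """Return (action, score, accepted).

    Accepts the first candidate whose score clears the threshold. If none
    does, returns the best-scoring candidate with accepted=False, so the
    caller can decide whether to act, stop, or ask for help.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    best_action: Optional[Action] = None
    best_score = float("-inf")
    for _ in range(max_tries):
        candidate = propose()
        s = score(candidate)
        if s >= threshold:
            return candidate, s, True
        if s > best_score:
            best_action, best_score = candidate, s
    return best_action, best_score, False


# Toy stand-ins: the 'policy' replays a fixed list of candidates, and the
# 'verifier' prefers actions close to a target value of 0.2.
candidates = iter([[0.9], [0.6], [0.2], [0.1]])


def toy_propose() -> Action:
    return next(candidates)


def toy_score(action: Action) -> float:
    return 1.0 - abs(action[0] - 0.2)


action, s, accepted = pick_verified_action(toy_propose, toy_score, threshold=0.9)
print(action, accepted)  # [0.2] True
```

The first two candidates score too low and are rejected; the third clears the bar. In a real system every rejected candidate costs another round of proposal plus verification, which is why the per-check latency matters.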

Glossary

vision-language-action model (VLA) — An AI model that takes visual input and text instructions together and outputs physical movement commands for a robot.

closed-loop success rate — The percentage of tasks a robot completes successfully when it is continuously reacting to what it sees, rather than following a fixed pre-planned script.

LIBERO benchmark — A standard simulation environment used to test robotic manipulation skills across a set of household tasks.

Source: Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

03 / 03

Give AI a Farming Toolkit and Watch It Fumble

Handing a powerful AI a set of farming tools and asking it to help a grower turns out to be a lot harder than it looks.

There is a gap between 'AI can identify a diseased leaf in a photo' and 'AI can help a farmer actually do something about it.' AgroTools, a new benchmark from researchers posting on arXiv this week, is designed to measure exactly that gap.

Imagine giving someone a full tool shed (soil sensors, weather APIs, crop calendars, pesticide calculators) and asking them to diagnose why a field is underperforming. The right answer requires picking the right tools in the right order, feeding them the right inputs, handling failures when a tool returns an error, and then synthesising everything into a clear recommendation. That multi-step, recover-and-keep-going process is where current AI agents fall apart.

The team built a benchmark of 539 agricultural questions paired with 1,097 images drawn from 12 public datasets, covering five task families and 14 custom tools. They tested 13 different multimodal AI models: nine open-source, four from major commercial providers. The finding is blunt: all of them struggle. The specific failure points are tool planning (choosing the right tool), argument generation (giving it the right inputs), execution recovery (handling errors gracefully), and final synthesis (pulling it all together into a useful answer). Stronger models benefit more from having tools; weaker models actually perform worse when given tools because they can't manage the extra complexity.

The honest reading here: AI-assisted precision agriculture is not arriving next season. The gap is not just model capability. It is that we do not yet have good ways to teach AI systems to navigate realistic, multi-step, tool-dependent workflows. AgroTools at least tells us precisely where the problem is.
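To make the four failure points concrete, here is a toy version of the loop an agent has to get right: a plan of tool calls, arguments for each, one retry when a tool errors, and a final synthesis step. Everything in it is hypothetical. The tool names, the readings, the thresholds, and the scripted plan (which stands in for what a model would generate) are mine, not AgroTools' actual tools or harness.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ToolError(Exception):
    """Raised by a (hypothetical) tool when a call cannot be served."""


def soil_moisture(field_id: str) -> float:
    readings = {"north": 0.12}  # made-up volumetric water content
    if field_id not in readings:
        raise ToolError(f"no sensor for field {field_id!r}")
    return readings[field_id]


def rain_forecast_mm(field_id: str) -> float:
    return 2.0  # made-up forecast for the coming days


TOOLS: Dict[str, Callable[..., float]] = {
    "soil_moisture": soil_moisture,
    "rain_forecast_mm": rain_forecast_mm,
}

# Each step: (tool name, arguments, fallback arguments or None).
Step = Tuple[str, Dict[str, str], Optional[Dict[str, str]]]


def run_plan(plan: List[Step]):
    """Execute each step; on ToolError, retry once with the fallback arguments."""
    results: Dict[str, float] = {}
    trace: List[Tuple[str, Dict[str, str], str]] = []
    for name, args, fallback in plan:
        for attempt in (args, fallback):
            if attempt is None:
                continue
            try:
                results[name] = TOOLS[name](**attempt)
                trace.append((name, attempt, "ok"))
                break
            except ToolError as err:
                trace.append((name, attempt, f"error: {err}"))
    return results, trace


def synthesise(results: Dict[str, float]) -> str:
    """Final synthesis: turn tool outputs into advice (thresholds are illustrative)."""
    moisture = results.get("soil_moisture")
    rain = results.get("rain_forecast_mm")
    if moisture is None or rain is None:
        return "Not enough data to advise."
    if moisture < 0.15 and rain < 5.0:
        return "Irrigate: soil is dry and little rain is expected."
    return "No irrigation needed right now."


plan: List[Step] = [
    # The first arguments are wrong on purpose, to exercise the recovery path.
    ("soil_moisture", {"field_id": "North"}, {"field_id": "north"}),
    ("rain_forecast_mm", {"field_id": "north"}, None),
]
results, trace = run_plan(plan)
print(synthesise(results))  # Irrigate: soil is dry and little rain is expected.
```

Here the plan and the fallback are handed over ready-made. The hard part the benchmark measures is that a model has to produce both on its own, and notice from the error message what went wrong.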

Glossary

multimodal AI model — An AI system that can process multiple types of input at once — typically images and text — rather than just one.

tool-augmented agent — An AI that can call external software tools (databases, calculators, sensors) during a task, rather than relying only on what it already knows.

process-level evaluation — Measuring not just whether an AI gets the right final answer, but whether it used the correct intermediate steps to get there.
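As an illustration of process-level evaluation, here is one simple way to give partial credit for the order of tool calls, separately from whether the final answer is right. The metric (longest common subsequence divided by the longer sequence length) and the tool names are my own hypothetical choices; I am not claiming this is how AgroTools scores its agents.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two tool-call sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def process_score(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Order-aware overlap in [0, 1]; 1.0 means the same calls in the same order."""
    if not reference:
        return 1.0 if not predicted else 0.0
    return lcs_length(predicted, reference) / max(len(predicted), len(reference))


# Hypothetical tool names, for illustration only.
reference = ["identify_crop", "detect_disease", "lookup_treatment"]
predicted = ["detect_disease", "identify_crop", "lookup_treatment"]

score = process_score(predicted, reference)
print(f"{score:.2f}")  # 0.67
```

The predicted run calls the right three tools but swaps the first two, so it gets two-thirds credit on process even if its final answer happens to be correct.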

Source: AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture

The bigger picture

Put these three papers side by side and a pattern emerges. SensorFM shows what happens when you throw vast amounts of real-world sensor data at a model and let it learn without being told what to look for: you get genuine gains on health tasks, and the scaling keeps working. Pre-VLA shows a complementary instinct — instead of making the model bigger, add a lightweight checker that catches mistakes before they cascade. Both are strategies for wringing more reliability out of systems that are already capable but not yet trustworthy. AgroTools is the useful cold water. It reminds you that 'capable on a clean benchmark' and 'useful in a messy, multi-step real-world task' are still very different things. The farming experiment is a stand-in for dozens of practical domains — logistics, field medicine, construction — where the same gap exists. We are not short of capable AI. We are short of AI that can navigate failure gracefully and use tools like a sensible adult would.

What to watch next

The SensorFM paper hints that the team is moving toward clinical validation studies. Watch for downstream trials that test whether the model's health predictions hold up against physician diagnoses in real patient populations, not just retrospective datasets. On the robotics side, the Pre-VLA team still needs to run the system on physical hardware to find out whether the simulation gains survive contact with reality; a follow-up paper doing that would be worth tracking. Open question I'd want answered: does Pre-VLA's 'check before you move' approach generalise to VLAs trained by other labs, or is it tuned specifically to RynnVLA-002?