DeepScience · Artificial Intelligence · Daily Digest

AI That Admits Its Limits — Three Small, Real Steps

Today in AI research: keeping algorithms honest in genomics, clinical notes, and your home heating system.

May 20, 2026

Honest warning before we start: today's batch of 89 papers is mostly philosophical preprints, bibliometric datasets, and conference slide decks — not exactly headline material. But three papers buried in there are worth your time. Let me pull them out.

Today's stories

01 / 03

A Genomic AI Tool That Never Makes Up Tool Names

What if the AI running your genetic analysis just… invented the name of the software it was supposed to use?

That is not a hypothetical. AI agents (software that plans and executes multi-step tasks) have a habit of hallucinating tool names, making up configuration settings, and generally improvising in ways that break silently. In genomics, where one misconfigured parameter can invalidate an entire analysis, that is a serious problem.

A team working on Bio-Harness built a system to fix exactly this. Think of it as the difference between asking a kitchen assistant to cook 'something nice' and handing them a laminated recipe card with every measurement pre-filled. Bio-Harness uses eleven deterministic template compilers, the strict recipe cards, that the AI must fill in correctly before it can execute any step. If a tool name or parameter doesn't match what's actually installed, the whole thing stops. No improvising, no silent failure.

The results, across 144 test cases spanning two open AI model families (Qwen and Gemma), were 144 passes, zero repairs needed, zero fallbacks. The system also runs entirely on local hardware, with no cloud calls, which matters enormously when the data is patient genomic information that often cannot legally be sent to an external server.

The catch: Bio-Harness was evaluated against its own internal benchmarks, not against other published systems. 144 cases is a reasonable validation set for a lab, but it is small relative to the variety of real clinical deployments. This is a promising architecture, not a finished product. Still, 144 passes with zero repairs and zero fallbacks is a result worth paying attention to.
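To make the recipe-card idea concrete, here is a minimal sketch of what a strict template check could look like. It is not Bio-Harness's actual code: the tool registry, the parameter names, and the `compile_step` function are all hypothetical, invented for illustration. The point is only the behaviour described above: an unknown tool or parameter stops the step instead of being quietly improvised.

```python
# Hypothetical registry of locally installed tools and their allowed parameters.
# These names are illustrative only and are not taken from the Bio-Harness paper.
INSTALLED_TOOLS = {
    "samtools_sort": {"threads": int, "input_bam": str, "output_bam": str},
    "fastqc": {"input_fastq": str, "outdir": str},
}


class TemplateError(ValueError):
    """Raised when a proposed step does not match its template exactly."""


def compile_step(tool: str, params: dict) -> list[str]:
    """Turn a model-proposed step into a command, or stop with an error."""
    if tool not in INSTALLED_TOOLS:
        raise TemplateError(f"unknown tool: {tool!r}")

    schema = INSTALLED_TOOLS[tool]
    unknown = set(params) - set(schema)
    missing = set(schema) - set(params)
    if unknown or missing:
        raise TemplateError(
            f"unknown={sorted(unknown)} missing={sorted(missing)}"
        )

    for name, expected_type in schema.items():
        if not isinstance(params[name], expected_type):
            raise TemplateError(f"{name} must be {expected_type.__name__}")

    # Deterministic rendering: the same inputs always give the same command.
    return [tool] + [f"--{name}={params[name]}" for name in sorted(schema)]
```

A valid call such as `compile_step("fastqc", {"input_fastq": "a.fq", "outdir": "qc"})` returns a command list, while a misspelled tool name like `"fastqcc"` raises `TemplateError` before anything runs.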

Glossary

AI agent — Software that plans and executes a sequence of steps autonomously, rather than just answering a single question.

deterministic template compiler — A rule that forces the AI to fill in a fixed, pre-approved structure instead of generating freeform output.

hallucination — When an AI confidently produces something incorrect — like inventing a software tool name that doesn't exist.

Source: Bio-Harness: Reliable Local-First Bioinformatics Agents with a Calibrated Fast-Signal Methodology

02 / 03

AI Reads Hospital Notes to Flag Suicide Risk More Accurately

A hospital stay generates thousands of sentences of notes, and most of them have nothing to do with suicide risk. The hard part is finding the few that do.

Electronic health records are messy. A patient admitted for a broken arm generates notes about medication doses, nursing handovers, billing codes, and discharge logistics. Buried somewhere in there might be a sentence that signals serious suicide risk. Finding it automatically is genuinely hard.

A research team proposed a 'waterfall' architecture to solve this, and it works much as the name suggests. Imagine reading a dense document by first crossing out everything clearly irrelevant, then crossing out anything contradictory, and only then reading what's left carefully. The system processes clinical notes in stages, filtering out irrelevant sentences before ever attempting a risk classification.

Tested on the ScAN benchmark, a standard dataset of real clinical notes annotated for suicide attempts and ideation, the framework reached an overall macro F1 score of 0.93. More importantly, the hardest categories improved dramatically. Cases labelled 'unsure' or 'negative' had previously scored 0.52, far too low to rely on. With the waterfall approach, those same categories scored 0.83.

The catch is real: this was tested on one benchmark dataset with unknown train-test splits, and the paper text available to me was truncated, so I cannot fully verify the baseline comparisons. There is also an odd detail: the 'unsure' and 'negative' categories appear to share the same baseline score, which suggests some reporting ambiguity. Nobody is putting this into a hospital next week. But the gap from 0.52 to 0.83 on the difficult cases is the kind of improvement that makes a clinical tool worth taking seriously.
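The staged filtering described above can be sketched in a few lines. This is an illustration of the general waterfall pattern, not the paper's system: the keyword lists and the two stage functions are hypothetical stand-ins for whatever learned filters the authors actually use.

```python
from typing import Callable, Iterable

# A stage takes a sentence and returns True if the sentence should be kept.
Stage = Callable[[str], bool]

# Hypothetical keyword rules standing in for the paper's real filters.
RELEVANT_TERMS = ("suicid", "overdose", "self-harm")
NEGATION_TERMS = ("denies", "no history of")


def is_relevant(sentence: str) -> bool:
    """Stage 1: keep only sentences that mention the topic at all."""
    text = sentence.lower()
    return any(term in text for term in RELEVANT_TERMS)


def is_not_negated(sentence: str) -> bool:
    """Stage 2: drop sentences that explicitly rule the topic out."""
    text = sentence.lower()
    return not any(term in text for term in NEGATION_TERMS)


def waterfall(sentences: Iterable[str], stages: list[Stage]) -> list[str]:
    """Apply each stage in order; later stages only see what survived."""
    remaining = list(sentences)
    for stage in stages:
        remaining = [s for s in remaining if stage(s)]
    return remaining


notes = [
    "Patient admitted with fractured left arm.",
    "Patient denies suicidal ideation.",
    "Prior overdose attempt noted in 2019.",
    "Billing code updated.",
]
for_classifier = waterfall(notes, [is_relevant, is_not_negated])
```

Only the sentences left in `for_classifier` would be passed on to the final risk classifier, which is the expensive and error-prone step the earlier stages are protecting.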

Glossary

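macro F1-score — The unweighted average of the F1 score for each category, where F1 balances precision (how many flagged cases were right) against recall (how many real cases were found). Every category counts equally, including rare ones, so poor performance on small groups is not hidden.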

EHR (electronic health record) — The digital record of everything that happens to a patient in a hospital: notes, test results, prescriptions, and more.

ScAN — Suicide Attempt and Ideation Events Dataset, a benchmark of real clinical notes used to test and compare AI risk classification systems.
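For readers who want to see why macro F1 is harder to game than plain accuracy, here is a small worked sketch. The labels and predictions are made up for illustration and have nothing to do with the paper's data.

```python
def f1_for_class(y_true: list[str], y_pred: list[str], label: str) -> float:
    """F1 for one category: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-category F1, so rare categories count fully."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Made-up example: a common category and a rare one.
y_true = ["positive"] * 8 + ["unsure"] * 2
y_pred = ["positive"] * 8 + ["positive", "unsure"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case nine of ten predictions are correct, yet the macro F1 comes out noticeably lower than the accuracy, because missing one of only two 'unsure' cases drags that category's score down and the average gives it equal weight.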

Source: Enhancing Suicide Risk Classification: A Multi-Stage Framework with Sentence-Level Waterfall Architecture for Clinical Notes Analysis

03 / 03

AI Learns Faster to Control Home Heat Pumps by Knowing Physics

Teaching an AI to control your heating by letting it make mistakes is slow — what if it already knew a bit of physics first?

Heat pumps are one of the most promising technologies for cutting household energy use, but controlling a cluster of them (say, across a street of twenty homes) is surprisingly complex. Each building loses heat differently. Outdoor temperatures change. Grid electricity prices fluctuate. An AI that learns purely by trial and error takes a long time to get good, and in a real building, mistakes cost real money.

A research team in Belgium proposed a framework called Dyna-PINN that gives the AI a head start by baking in physics knowledge. Think of it as the difference between learning to ride a bike by falling off repeatedly and having someone first explain how balance works. You still need practice, but you get competent much faster.

The system combines two approaches: model-based learning (the AI builds an internal simulation of how heat moves through a building, guided by known physics equations) and model-free learning (it also learns directly from what actually happens). A second component, called ScaleONet, acts as a surrogate: a fast, learned shortcut that can mimic the thermal behaviour of buildings without running a full physics simulation each time. The framework was also tested with multiple AI agents coordinating across a cluster of homes, rather than each home acting independently.

The honest catch: this is a master's thesis, not a peer-reviewed journal paper, and the results have not been validated in real deployed systems. The methods are sound and the ideas are well motivated, but treat the numbers as 'promising in simulation' rather than 'ready for your boiler room'.
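What 'baking in physics' can mean in practice is easier to see with a toy example. The sketch below assumes the simplest textbook building model, a single thermal resistance and capacitance (often called 1R1C); I do not know which equations the thesis actually uses, and every name and number here is hypothetical. The idea it illustrates is a loss that punishes a learned model both for disagreeing with measurements and for predicting temperatures that violate the heat-balance equation.

```python
def rc_step(t_in: float, t_out: float, q_heat: float,
            r: float, c: float, dt: float) -> float:
    """One explicit Euler step of an assumed 1R1C building model.

    t_in, t_out in degrees C, q_heat in W, r in K/W, c in J/K, dt in s.
    """
    d_temp_dt = (t_out - t_in) / (r * c) + q_heat / c
    return t_in + dt * d_temp_dt


def physics_informed_loss(pred: list[float], measured: list[float],
                          t_out: list[float], q_heat: list[float],
                          r: float, c: float, dt: float,
                          weight: float = 1.0) -> float:
    """Data misfit plus a penalty for breaking the 1R1C equation."""
    n = len(pred)
    data_loss = sum((p - m) ** 2 for p, m in zip(pred, measured)) / n
    # Residual: how far each predicted step is from what physics would give.
    residuals = [
        pred[k + 1] - rc_step(pred[k], t_out[k], q_heat[k], r, c, dt)
        for k in range(n - 1)
    ]
    physics_loss = sum(e ** 2 for e in residuals) / (n - 1)
    return data_loss + weight * physics_loss


# Hypothetical building and a 15-minute control interval.
R, C, DT = 0.005, 1e7, 900.0
outdoor = [5.0] * 4
heating = [2000.0] * 4

# Build a trajectory that obeys the model exactly.
trajectory = [20.0]
for k in range(3):
    trajectory.append(rc_step(trajectory[k], outdoor[k], heating[k], R, C, DT))

consistent = physics_informed_loss(trajectory, trajectory, outdoor, heating, R, C, DT)
implausible = physics_informed_loss(
    [20.0, 25.0, 18.0, 22.0], trajectory, outdoor, heating, R, C, DT
)
```

A prediction that follows the physics incurs no penalty, while one that jumps around by several degrees in fifteen minutes is penalised even before it is compared with real data. That is the head start: the learner is steered away from physically impossible behaviour without having to discover it by trial and error.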

Glossary

physics-informed learning — A technique where known scientific equations are built into an AI's training process, so it learns faster and stays within physically plausible behaviour.

reinforcement learning — A way of training AI by letting it try actions, observe results, and gradually learn which actions lead to better outcomes — like training a dog with rewards.

surrogate model — A fast approximation of an expensive simulation — like using a weather app instead of running a full atmospheric model every time you want a forecast.

multi-agent coordination — Multiple AI systems working together toward a shared goal, each controlling its own part of the system.

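Source: Modellering en regeling van warmtepompen in clusters van residentiële gebouwen met behulp van machine learning - Op weg naar energie-flexibiliteit (Dutch; in English: "Modelling and control of heat pumps in clusters of residential buildings using machine learning: towards energy flexibility")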

The bigger picture

Look at what connects today's three stories and you see the same underlying problem approached from three directions: AI that acts in the world makes mistakes, and mistakes in high-stakes domains — genomic analysis, clinical risk assessment, home energy systems — are not acceptable. Bio-Harness answers this by constraining what the AI is allowed to do. The suicide risk paper answers it by filtering what the AI is allowed to see before it decides. The heat pump work answers it by giving the AI knowledge before it starts learning. Three different levers: constrain the outputs, filter the inputs, frontload the knowledge. None of these is a general solution — each is a bespoke fix for a specific domain. But that is exactly where applied AI is right now. Not one elegant answer, but a growing toolkit of partial fixes, each one earning a little more trust in a little more context. That is slower and less exciting than the headlines suggest. It is also how reliable things actually get built.

What to watch next

The ScAN benchmark used in the suicide risk paper is a standard against which several teams are now publishing results — it is worth watching whether the waterfall approach holds up when other groups independently test it. On the energy side, the European Commission is finalising rules for 'demand-response' home energy systems in late 2026, which will determine how commercially viable local AI-controlled heat pump clusters can become. The open question I would want answered: do these physics-informed agents actually behave better when something unexpected happens — a broken sensor, a cold snap — or only in the clean conditions of a simulation?