DeepScience · Artificial Intelligence · Daily Digest

Tiny models, ancient medicine, and the hallucination that won't quit

Today in AI research: one real clinical study, one surprising training lesson, and one very honest failure report.
May 11, 2026
Hi — today's batch of 214 papers is mostly protocol documents and white papers with zero experimental data, which I'll spare you. But three papers buried in the pile are worth your time: a genuine multi-hospital AI study fusing traditional Chinese medicine with modern machine learning, a solo engineer's honest report on teaching an 81-megabyte model to use security tools, and a one-researcher deep dive into why language models lie — and what almost works. Let's dig in.
Today's stories
01 / 03

AI Learns Traditional Chinese Medicine Diagnosis Across Five Hospitals

What happens when you ask an AI to read a tongue and feel a pulse — then validate it across five different hospitals?

The Mingzheng system, tied to a paper appearing in the journal Information Fusion, tries to do something genuinely unusual: combine traditional Chinese medicine diagnosis — things like tongue appearance and pulse character — with cancer comorbidity data, and let a machine learning model make sense of both at once. Think of it like teaching a single translator to work fluently in two very different dialects simultaneously, while also checking medical records.

The team used data from 478 patients spread across five hospital sites, which matters a lot — most AI medical studies train and test at the same hospital, which inflates results. Here, they deliberately tested the model at hospitals it had never learned from, rotating through the sites one at a time. That's called Leave-One-Site-Out validation, and it's a much harder test (a small sketch of this setup follows the glossary below). The main model scored a Macro-AUC of 0.818, which roughly means that, averaged across conditions, it ranks a patient who has a given condition above one who doesn't about 82% of the time. That's not perfect, but it's meaningfully above chance, and the methodology is solid enough to take seriously.

The catch: the external prospective cohort — real new patients enrolled going forward — shrank from 105 enrolled to just 47 eligible after exclusions. That's a 55% exclusion rate before evaluation even began, and we don't know yet how the system performs in real clinical deployment, outside of a research protocol. The researchers also benchmarked several well-known language models, including DeepSeek-R1 and Llama-3.1, as comparisons — and the specialized fusion system outperformed all of them on this task.

Glossary
Macro-AUC: A score from 0 to 1 measuring how well a model distinguishes between multiple categories on average; 0.5 is random chance, 1.0 is perfect.
Leave-One-Site-Out validation: A method where the model is tested on data from a hospital it never trained on, rotating through each site — a stricter test than keeping training and testing at the same location.
multimodal fusion: Combining different types of data — in this case, tongue images, pulse descriptions, and clinical records — into a single model.
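For readers who like to see the mechanics, here is a minimal Python sketch of what Leave-One-Site-Out evaluation with a macro-averaged AUC looks like in practice. It illustrates the validation scheme only; the toy data, the feature count, and the simple logistic-regression classifier are stand-ins, not the Mingzheng implementation.

# Minimal sketch of Leave-One-Site-Out validation with a macro-averaged AUC,
# in the spirit of the Mingzheng evaluation. All variable names and the choice
# of classifier are illustrative assumptions, not the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

# Toy stand-ins: 478 patients, 20 features, 4 diagnostic classes, 5 hospital sites.
rng = np.random.default_rng(0)
X = rng.normal(size=(478, 20))          # fused features (tongue, pulse, records)
y = rng.integers(0, 4, size=478)        # diagnostic label per patient
site = rng.integers(0, 5, size=478)     # which hospital each patient came from

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=site):
    # Train on four hospitals, test on the one the model has never seen.
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])
    # Macro-AUC: one-vs-rest AUC per class, averaged with equal weight.
    auc = roc_auc_score(y[test_idx], proba, multi_class="ovr", average="macro")
    scores.append(auc)

print(f"Leave-One-Site-Out macro-AUC: {np.mean(scores):.3f} ± {np.std(scores):.3f}")

On random toy data this hovers around 0.5 — the point is the loop structure: every hospital takes a turn as a completely held-out test site, which is why the 0.818 figure carries more weight than a same-hospital score would.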
02 / 03

An 81-Megabyte Security AI Teaches Us What Training Data Really Does

A single engineer built a cybersecurity AI the size of a few MP3 files — and accidentally discovered something important about how AI learns to use tools.

VectraYX-Nano is a language model with 42 million parameters — tiny by today's standards, fitting in roughly 81 megabytes. Its creator trained it from scratch in Spanish, focused entirely on cybersecurity, and then tried to teach it to use software tools: pick the right command, format the right call, get the right output. The engineering story here is less important than what the training revealed.

When the tool-use examples were diluted into a general training dataset, the model scored exactly zero — a complete failure — at selecting the right tool. Not near-zero. Zero. But when the researcher rebalanced the mix so that 1 in every 22 training examples was specifically about tool use, the score jumped to a clearly non-zero result (a sketch of that kind of rebalancing follows the glossary below). Think of it like a new hire learning a job: if the onboarding manual is 95% company history and 5% the actual task they need to do on day one, they'll struggle with that task no matter how smart they are. You have to give them enough repetition on the specific skill.

The study also found a counterintuitive result: lower text perplexity — normally a sign of better language modeling — actually correlated with worse conversational behavior at this tiny scale, suggesting the usual rules of thumb don't transfer cleanly to very small models. The honest limits: this is a single-author project, evaluated with non-standard metrics and just four random seeds. It's an interesting data point, not a settled finding.

Glossary
parameters: The numbers inside a neural network that are adjusted during training; more parameters generally means more capacity to learn, though not always.
perplexity: A measure of how surprised a language model is by text; lower perplexity usually means the model has learned language patterns well.
BLEU-4: A score used to compare AI-generated text to a reference answer; 0 means no match, 1 means perfect match.
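To make the data-composition point concrete, here is a small Python sketch of the kind of rebalancing the write-up describes: upsampling a tiny set of tool-use demonstrations until they make up roughly 1 in 22 of the training mix. The dataset sizes, names, and the helper function are illustrative assumptions, not the VectraYX-Nano pipeline.

# Minimal sketch of controlling the tool-use share of a training mix — the lever
# the VectraYX-Nano write-up credits for the zero-to-working jump. The sizes and
# the 1-in-22 target are stand-ins, not the real corpus.
import random

def build_mix(general_examples, tool_examples, tool_ratio=1 / 22, seed=0):
    """Return a shuffled training list where roughly `tool_ratio` of the
    examples are tool-use demonstrations, upsampling them if needed."""
    rng = random.Random(seed)
    total = len(general_examples) / (1 - tool_ratio)   # implied size of the full mix
    n_tool = round(total * tool_ratio)                 # tool-use slots to fill
    # Repeat the (usually tiny) tool-use set until it fills its share of the mix.
    upsampled = [tool_examples[i % len(tool_examples)] for i in range(n_tool)]
    mix = list(general_examples) + upsampled
    rng.shuffle(mix)
    return mix

# Toy usage: 10,000 general security texts, only 40 tool-call demonstrations.
general = [f"general-{i}" for i in range(10_000)]
tools = [f"tool-call-{i}" for i in range(40)]
mix = build_mix(general, tools)
print(len(mix), sum(x.startswith("tool-call") for x in mix) / len(mix))

The only knob that matters here is tool_ratio: the lesson from the paper is that the share of task-specific examples, not the total volume of data, is what moved the tool-selection score off zero.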
03 / 03

Every Attempt to Fix Hallucination Inside a Model Failed. One Workaround Didn't.

Someone ran 58 experiments trying to stop GPT-2 from lying — and every single internal fix failed completely.

A solo researcher ran 58 rounds of experiments on GPT-2 — a small, older language model — trying to understand exactly where and why it suppresses facts it appears to 'know'. The core observation is striking: about 70% of facts that are present in the model's middle layers get ranked lower and lower as you approach the final output layer, until the wrong answer comes out. It's a bit like a message being passed down a chain of people in a game of telephone: by the time it reaches the last person, the original fact has been nudged aside.

The researcher tried 12 different methods to interrupt this suppression from inside the model — including techniques called gradient descent corrections, attention surgery, and Monte Carlo tree search. Every single one achieved exactly 0% improvement. The one thing that worked: a method called Logit Lens, which doesn't fix the model at all, but instead reads off the answer at an intermediate layer before the suppression completes (a sketch of the idea follows the glossary below). That improved factual recall from 10% to 40%.

Now — and this is important — you should be skeptical here. This study uses only GPT-2, a 124-million-parameter model that is nothing like the systems you use today. The sample size is 27 prompts for some key measurements. The researcher is working alone, with no external peer review visible yet, and several results are suspiciously clean: twelve different methods all failing at exactly zero is a number that should make you raise an eyebrow. The Logit Lens finding aligns with other published work and is the most credible piece here. The rest warrants an 'interesting if true' label.

Glossary
Logit Lens: A technique for reading a language model's 'working answer' at intermediate layers before the final output, revealing what the model 'knows' mid-process.
attention surgery: A method of directly modifying the parts of a neural network that decide which words to focus on, hoping to correct errors mid-generation.
gradient descent correction: Nudging the model's internal settings in real time to push it toward a correct answer — applied here during inference, not training.
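If you want to see what 'reading off the answer at an intermediate layer' means in code, here is a minimal logit-lens sketch using the publicly available GPT-2 small model via the Hugging Face transformers library. It illustrates the general technique, not the study's own implementation, and the prompt is an arbitrary example.

# Minimal logit-lens sketch on GPT-2: read off the "working answer" at each
# layer by pushing that layer's hidden state through the final layer norm and
# the unembedding matrix. Illustrates the idea; not the study's codebase.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

# hidden_states[0] is the embedding layer; [1..12] are the transformer blocks.
for layer, h in enumerate(out.hidden_states):
    last = h[0, -1]                                      # hidden state at the final position
    logits = model.lm_head(model.transformer.ln_f(last)) # project to vocabulary space
    top = tok.decode(logits.argmax().item())             # the layer's current best guess
    print(f"layer {layer:2d} -> {top!r}")

If the correct answer surfaces at a middle layer and then gets displaced by the final layers, that is the suppression pattern the study describes, and reading the middle layer directly is the workaround that lifted recall from 10% to 40% in its experiments.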
The bigger picture

Three papers, three very different scales — a five-hospital clinical study, an 81-megabyte model built by one engineer, and a 58-experiment solo investigation. What they share is an honest collision with the limits of current AI. Mingzheng shows that when you build a specialized system with real methodology and diverse data, you get results worth paying attention to — but also that deployment always shrinks the promising numbers down. VectraYX-Nano shows that data composition is often more important than model size: you can't skip past specificity with raw compute. And the hallucination study, Aletheia, flaws and all, puts a concrete name on something many people suspect — that AI systems often 'know' the right answer at some internal stage and then talk themselves out of it by the final word. What connects these: the AI field is not short of ideas about what to build. It is short of clean, validated evidence that those ideas work outside the lab, at scale, with real users. That gap is the story today.

What to watch next

Keep an eye on whether the Mingzheng paper's full Information Fusion publication includes the complete external cohort results — the 55% exclusion rate is the number that needs explaining. On the hallucination front, watch for follow-up work applying Logit Lens to modern large models; if the 10%→40% recall improvement holds at scale, that becomes genuinely useful. The open question I'd want answered: does Logit Lens still work when the model is 100 times larger, or is the suppression pattern different in today's systems?

Further reading
Today's batch was mostly documents dressed as research — thanks for letting me do the sorting. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io