DeepScience · Artificial Intelligence · Daily Digest

AI fails quietly when the job gets complex

Today's papers show AI at its sharpest and most fragile — sometimes on the same task.
April 21, 2026
Three papers today, and they pull in opposite directions at once. On one side: AI writing psychology questionnaires and reading brain tumour scans with real promise. On the other: a careful autopsy of exactly how and why AI assistants collapse when real-world work gets messy. Let's dig in.
Today's stories
01 / 03

Why AI Coding Assistants Break Down on Complex, Real Jobs

If you've ever watched an AI assistant confidently give you wrong code for the tenth time, someone finally wrote down exactly why.

If you've used an AI coding assistant for more than a few hours, you've probably noticed something: it handles small, isolated tasks brilliantly, then does something bafflingly wrong when the project gets large. A team of researchers studied this across multiple real software development workflows and gave the feeling three names worth knowing. The first is the Complexity Cliff. Think of baking a layered cake where each layer depends on the one below. Up to a point, a good recipe helps. But once enough layers are interdependent, a small error cascades — and performance doesn't slowly degrade. It drops off a cliff. The second failure mode is Context Window Blindness. Every AI has a limit on how much text it can see at once — like a sliding spotlight on a stage. A codebase spread across dozens of files means the AI can read only what fits in that spotlight. It makes promises about code in one file without knowing another file has already broken those rules. Silently. Without flagging it. The third is the Memory Illusion. Start a new session with an AI assistant and it forgets everything from yesterday. Architectural decisions, agreed conventions, the context you spent an hour building — gone. These aren't bugs that a patch will fix soon. They're structural limits of how today's large language models work. The catch: this is a case-study analysis, not a controlled experiment. The researchers observed these failures across real workflows but didn't systematically measure how often each failure occurs, or under what exact conditions. The taxonomy is sharp and useful. The numbers behind it are thin. Think of it as a map drawn by someone who walked the terrain — not a satellite image.
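The "sliding spotlight" idea can be made concrete with a toy sketch. This is not how any real assistant is implemented — the function name and file sizes are invented for illustration — but it shows why a fixed-size window makes cross-file rules invisible: the file that states the conventions is simply the first thing to fall outside the spotlight.

```python
# Toy sketch (hypothetical names and sizes): why a fixed context window
# makes cross-file constraints invisible to an assistant.

def visible_context(files, window_limit):
    """Greedily pack files (most recently touched first) until the window fills."""
    visible, used = [], 0
    for name, text in reversed(files):
        if used + len(text) > window_limit:
            break  # everything earlier in the project is simply invisible
        visible.append(name)
        used += len(text)
    return visible

project = [
    ("conventions.md", "x" * 600),  # where the team's rules live
    ("models.py",      "x" * 500),
    ("api.py",         "x" * 400),  # the file currently being edited
]

# With a generous window the rules file fits; shrink it and the rules
# silently drop out -- the assistant never knows they existed.
print(visible_context(project, 2000))  # ['api.py', 'models.py', 'conventions.md']
print(visible_context(project, 1000))  # ['api.py', 'models.py']
```

Note that nothing in the second call signals that `conventions.md` was dropped — which is exactly the "silently, without flagging it" failure the researchers describe.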

Glossary
Context window: The maximum amount of text an AI model can read and hold in 'attention' at one time — anything outside the window is invisible to it.
Large language model (LLM): A type of AI trained on vast amounts of text to predict and generate language — the technology behind tools like ChatGPT or GitHub Copilot.
02 / 03

AI Now Writes the Psychology Tests That Are Used to Measure You

Writing a good psychology survey question takes experts years to learn — a team just showed an AI can do a comparable job in minutes, tested on nearly 5,000 people.

Writing a good psychology survey question is harder than it looks. You need to know what you're measuring, avoid nudging the respondent toward an answer, and make sure your questions cluster meaningfully together — that a question about loneliness doesn't accidentally measure anxiety instead. For decades, that required a trained human psychometrician. A research team has now built a system called AI-GENIE — Automatic Item Generation and Validation with Network-Integrated Evaluation — that automates most of it. Here is how it works. You give a large language model a target concept — say, 'academic burnout' — and it generates many candidate survey questions. Then, instead of having an expert pick the winners, the system uses a technique called network psychometrics to analyse how the questions connect to each other — which ones genuinely cluster around the same idea — and prunes the weak ones. Think of it like tuning a guitar: the goal isn't that each string sounds fine in isolation, it's that they sound right together. The team tested five AI models, including GPT-4o and Llama 3, across five nationally representative U.S. samples totalling nearly 5,000 people. The AI-generated scales reached structural validity comparable to traditionally expert-built scales — meaning the questions actually organised themselves around the concept they were supposed to measure. The catch: structural validity is one dimension of a good test. It doesn't tell you whether the questions capture something culturally meaningful, whether they translate across languages, or whether a human expert would spot a blind spot the AI missed entirely. This is a powerful first filter. It is not a full replacement, and the team are honest about that.

Glossary
Psychometrics: The science of designing and validating tools — usually questionnaires — that measure psychological traits like personality, wellbeing, or ability.
Structural validity: The degree to which the items in a test actually cluster around the concept they are supposed to measure, rather than measuring something else.
Network psychometrics: A method that treats survey questions as nodes in a network and maps which ones are genuinely interconnected, to identify redundant or off-topic items.
03 / 03

AI Combines Brain Scans and Blood Tests to Diagnose Brain Tumours Better

The most aggressive brain tumour has a 10-year survival rate of roughly 0.71% — which is exactly why researchers are pushing AI to read every available signal at once.

Glioblastoma, the most aggressive form of brain tumour, kills most patients within 16 months of diagnosis. The 10-year survival rate is approximately 0.71% — not a typo. So any tool that helps doctors understand a patient's tumour faster, and more precisely, matters enormously. What AI researchers and clinicians are now building is something like a diagnostician who never has to choose which signal to focus on. Imagine a doctor who can simultaneously study your MRI images, your PET scan, and your blood markers — holding all three in mind at once, without forgetting a detail from one while reading another. That is the promise of multimodal AI applied to glioma: pulling spatial imaging data and biological signals into a single model that spots patterns no one signal would reveal alone. A narrative review published in Frontiers in Oncology synthesises where this is heading. Machine learning models applied to MRI can already segment a tumour — draw its boundary automatically — infer its molecular subtype, which determines how it should be treated, and estimate likely patient outcomes. Adding PET scans and advanced MRI techniques, such as diffusion and perfusion imaging, feeds the model biologically specific information that a standard structural scan simply cannot provide. Blood markers such as inflammatory ratios are being tested as an additional layer. The catch here is significant, and the paper is honest about it. This is a review, not a new study — the authors synthesised existing literature, not original data. None of these AI tools has yet been validated in large prospective clinical trials in a way that would justify routine clinical use today. The path from 'works in a study' to 'your neurosurgeon trusts it' is long, slow, and deliberately cautious. We are not there yet.
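The "never has to choose which signal to focus on" idea is, at its simplest, feature fusion: reduce each modality to numbers, combine them, and score the combination with one model. The sketch below is a deliberately minimal stand-in — every feature name, value, and weight is invented, and real systems use learned deep models rather than a hand-weighted sum — but it shows what "one model, three signals" means mechanically.

```python
# Hypothetical sketch of multimodal fusion. All names, values, and weights
# are illustrative inventions, not figures from the review.

def fuse(mri_features, pet_features, blood_markers):
    """Early fusion: one joint feature vector instead of three separate reads."""
    return mri_features + pet_features + blood_markers

def risk_score(features, weights):
    """A stand-in linear model over the fused vector; real systems learn this."""
    return sum(f * w for f, w in zip(features, weights))

patient = fuse(
    mri_features=[0.8, 0.3],  # e.g. tumour volume, boundary irregularity
    pet_features=[0.6],       # e.g. metabolic uptake
    blood_markers=[0.4],      # e.g. an inflammatory ratio
)
weights = [0.5, 0.2, 0.2, 0.1]
print(round(risk_score(patient, weights), 2))  # 0.62
```

The point of fusing before scoring is that the model can weight an imaging feature differently depending on a blood marker — the cross-signal patterns the article describes, which no single-modality reader can see.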

Glossary
Glioma: A family of brain tumours that grow from glial cells — the support cells of the nervous system — ranging from slow-growing to highly aggressive.
Multimodal AI: An AI system that takes in more than one type of input — for example, images and lab values — and processes them together rather than separately.
Molecular subtype: A classification of a tumour based on specific genetic or molecular markers, which determines which treatments are likely to work.
Perfusion imaging: An MRI technique that maps blood flow through the brain, helping identify how a tumour is fed by blood vessels.
The bigger picture

Put these three papers side by side and a pattern emerges that is worth sitting with. AI-GENIE works because the task is bounded and well-defined: generate items, run a network analysis, prune. The glioma work succeeds — at least in research settings — because the inputs are controlled: structured scans, structured blood markers, structured outcomes. The AI is a very fast pattern-matcher operating on clean, labelled data. The LLM failure paper tells you what happens the moment that structure disappears. The moment a task is distributed across dozens of files, or spans multiple sessions, or requires holding a city's worth of context in mind — the same technology that just wrote your psychology survey falls off a cliff without warning you. That is not a contradiction. It is the same underlying truth: AI is powerful within a spotlight and unreliable the moment the job grows larger than the spotlight. The exciting medical and psychological applications work because researchers have carefully designed the spotlight. Most of real working life has not.

What to watch next

The AI-GENIE team has not yet published cross-language or cross-cultural validation — that is the obvious next test, and it would tell us a great deal about how far this approach can generalise. On the medical side, several multimodal AI tools for neuro-oncology are entering early-stage clinical evaluation in 2026 at centres including those in the EORTC network; the first prospective results are worth watching. And the open question I would most want answered on AI coding assistants: what is the actual, measured failure rate on real enterprise codebases — not demos, not toy projects?

Thanks for reading — and if any of these three failure modes ring uncomfortably familiar from your own week, you are not imagining it. JB.
DeepScience — Cross-domain scientific intelligence
deepsci.io