
[Artificial Intelligence] AI Memory Collapses When Facts Change — Three Stories

DeepScience — Artificial Intelligence


Today's AI research shows agents that act impressively until the world changes — and then fall apart completely.
May 13, 2026
Three papers caught my eye today out of a large batch — I spent the morning filtering down to the ones you can actually feel. Two are new benchmarks that expose embarrassing gaps; one is a simulation result with traffic lights that made me stop and re-read. Let's dig in.
Today's stories
01 / 03

AI Memory Systems Totally Fail When Facts Change or Disappear

Ask an AI what your friend's job is and it answers correctly — then tell it your friend quit, and ask again.

Imagine you keep a notebook about the people in your life: who works where, which projects are active, what changed last month. A good assistant flipping through that notebook should handle two kinds of questions: 'Did the project survive the budget cut?' (answer depends on a chain of events) and 'Is Sarah still at that company?' (answer might simply be no, she left). Those are dependency questions and absence questions. They sound easy. They are not.

The MEME benchmark, published this week, tested six different AI memory systems on exactly these two tasks across 100 controlled episodes. The results were stunning. On dependency tasks — where the right answer requires noticing that one fact changed because of a connected event — average accuracy across all systems was 3%. On absence tasks — where the correct answer is 'this thing no longer exists' — accuracy was 1%. Not 30%. Not 10%. One percent.

The researchers tried everything to fix it: better prompting, deeper retrieval, stripping out noise, swapping in more powerful language models. Nothing moved the needle. The one partial fix was pairing a file-based agent architecture with Anthropic's most capable model. That combination helped — but at roughly 70 times the baseline computing cost.

Here is the catch: MEME is a synthetic benchmark in two narrow domains (personal life, software projects). Real memory is messier. But the failure is not at the edges. These systems can find the old stored fact. They just cannot reason that a newer fact has replaced it. It is like a friend who keeps sending mail to your old address — not because they lost the new one, but because they never connected 'you gave me a new address' to 'the old one is now wrong.'
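To make the failure mode concrete, here is a minimal sketch — my own illustration, not the MEME benchmark's actual harness or any tested system. The naive recall path can find a stored fact but never checks whether a later event superseded it; the second path shows the smallest possible fix, preferring the newest fact and honouring an explicit 'gone' marker:

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- the store, the '<gone>' marker, and both
# recall strategies are assumptions for demonstration, not MEME's design.

@dataclass
class MemoryStore:
    facts: list = field(default_factory=list)  # (timestamp, subject, value)

    def remember(self, t, subject, value):
        self.facts.append((t, subject, value))

    def recall_naive(self, subject):
        # Returns the first stored match: retrieval succeeds, but
        # supersession is never checked -- the stale-address failure.
        for t, s, v in self.facts:
            if s == subject:
                return v
        return None

    def recall_temporal(self, subject):
        # Minimal fix: take the newest fact for the subject, and treat an
        # explicit 'no longer true' marker as absence rather than a value.
        matches = [(t, v) for t, s, v in self.facts if s == subject]
        if not matches:
            return None
        _, latest = max(matches, key=lambda tv: tv[0])
        return None if latest == "<gone>" else latest

mem = MemoryStore()
mem.remember(1, "sarah.job", "engineer at Acme")
mem.remember(2, "sarah.job", "<gone>")  # Sarah quit

print(mem.recall_naive("sarah.job"))     # stale answer: engineer at Acme
print(mem.recall_temporal("sarah.job"))  # None -- correctly absent
```

The toy fix works here only because the update arrives as a direct overwrite of the same key. The hard case MEME tests is when the supersession is indirect — a connected event implies the old fact is wrong — which no amount of timestamp bookkeeping alone can resolve.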

Glossary
dependency reasoning — Working out that one fact changes because a connected fact changed — like knowing your rent goes up because your landlord sent a new lease.
memory system — The part of an AI that stores and later retrieves information from previous conversations, documents, or interactions.
absence reasoning — Correctly concluding that something no longer exists or applies, rather than defaulting to the last known state.
02 / 03

No Traffic Lights, Just an AI — Simulation Results Are Striking

Every time you sit at a red light watching empty lanes, you are experiencing a problem that has been theoretically solved for decades and practically solved almost nowhere.

LISA is a system that manages a four-way intersection with no traffic lights at all. Each vehicle tells the system where it wants to go and how fast it is moving. A large language model — specifically Google's Gemini 2.5 Flash Lite — acts like an air traffic controller: it reads all the incoming vehicle intentions simultaneously, figures out who can cross safely, and tells each car exactly what speed to hold. No red phase. No wasted green phase. Continuous flow.

The numbers from simulation are hard to dismiss. Compared to a standard fixed-cycle traffic light, LISA cut average waiting delay by up to 89%. Mean queue length — the worst jam that formed — fell 60%. Fuel consumption dropped close to 50%. On the standard traffic engineering grading scale, LISA maintained a C (acceptable flow) while every competing system, including smarter signal-based baselines, degraded to F, which is gridlock.

The analogy that fits: imagine a roundabout managed by a very fast, very patient coordinator who can see every car simultaneously and whisper 'slow to 30 kilometres per hour, then go' in each driver's ear. No wasted cycles, no empty lanes getting a green.

Now the catches, because there are several worth naming. LISA was tested only in SUMO, a standard but still simulated traffic environment, at what appears to be a single four-way intersection. Real intersections involve pedestrians, cyclists, delivery trucks, and drivers who ignore instructions. The paper also reports slightly different fuel savings figures in two different sections — 48.8% in one place, 51% in another — which suggests the analysis has some rough edges. And the entire system assumes connected vehicles that can talk to the AI in real time. Most roads today do not work that way.
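The coordinator-plus-cache idea is easy to sketch in miniature. The following is a hypothetical reservation-style arbitration with a memoized decision table — the conflict rules, speed commands, and function names are my illustrative assumptions, not LISA's published logic:

```python
from functools import lru_cache

# Hypothetical sketch: which straight-through movements at a four-way
# intersection cannot cross at the same time. Real conflict geometry
# (turns, pedestrians) is far richer than this toy table.
CONFLICTS = {
    ("N->S", "E->W"), ("E->W", "N->S"),
    ("N->S", "W->E"), ("W->E", "N->S"),
    ("S->N", "E->W"), ("E->W", "S->N"),
    ("S->N", "W->E"), ("W->E", "S->N"),
}

@lru_cache(maxsize=None)  # stand-in for the 'memoized arbitration table'
def arbitrate(intentions):
    """Given (vehicle_id, movement) tuples sorted by arrival order, grant
    crossing to each vehicle whose path conflicts with no earlier grant."""
    granted, decisions = [], {}
    for vid, move in intentions:
        clash = any((move, g) in CONFLICTS for g in granted)
        decisions[vid] = "hold at 30 km/h" if clash else "proceed"
        if not clash:
            granted.append(move)
    return decisions

plan = arbitrate((("car1", "N->S"), ("car2", "E->W"), ("car3", "S->N")))
print(plan)  # car1 and car3 proceed (parallel flows); car2 holds
```

The cache is the point of the glossary's Memoized Arbitration Table: identical intention patterns recur constantly at a real intersection, so reusing a previous decision avoids re-running the expensive language-model reasoning step every time.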

Glossary
Level of Service — A traffic engineering grade from A (free flow) to F (complete breakdown); C means some delay but acceptable movement.
Memoized Arbitration Table — A cache that stores the AI's previous crossing decisions so it does not have to re-reason every identical situation from scratch.
SUMO — An open-source traffic simulator widely used by researchers to test intersection and road management strategies before any real-world trial.
03 / 03

Give an AI a Recipe Book and It Gets 20% Better at Phone Tasks

Every time your phone's AI tries to do something complicated, it is essentially improvising from scratch — re-reading the whole instruction manual in real time.

When an AI agent navigates your phone — booking a flight, filling a form, opening the right settings screen — it typically figures out each step by looking at what is on screen and guessing the next move. That works passably for simple two-step tasks. For anything with ten steps, small errors compound and the whole thing falls apart.

EAM, short for Executable Agentic Memory, takes a different approach. The first time the system explores an app, it builds a Knowledge Graph — think of it like a recipe card file. Each card records a sequence of actions that actually worked to accomplish a specific goal inside that app. Later, when you ask it to do something, it pulls the right card and executes it, rather than improvising from nothing.

On AndroidWorld, a standard benchmark for phone-use tasks, EAM outperformed the best existing open-weight AI agent by 19.6 percentage points. It also ran at 2.8 seconds per step and used six times fewer tokens than a GPT-4o-based system — meaning it cost a fraction of the alternative at the same quality level.

The catch: EAM requires an upfront exploration phase to build its recipe file, so it has to 'learn' an app before it becomes efficient at using it. That is a real cost, not a footnote. The paper is also brand new with zero citations, posted this week, so independent replication has not happened yet. And a 19.6-point improvement on a benchmark is meaningful but does not automatically translate into the same gain in everyday use on real devices, where apps update, interfaces shift, and edge cases multiply.
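The recipe-card mechanism reduces to a lookup-then-replay pattern. Here is a deliberately tiny sketch of that pattern — the class, the flat (app, goal) key, and the action strings are all assumptions for illustration; EAM's actual knowledge graph and executor are richer than this:

```python
# Illustrative sketch of 'record once, replay later' agent memory.
# Not EAM's implementation -- a toy model of the core idea.

class AgentMemory:
    def __init__(self):
        # Knowledge graph flattened to (app, goal) -> known-good actions.
        self.cards = {}

    def record(self, app, goal, actions):
        """Store a UI action sequence that succeeded during exploration."""
        self.cards[(app, goal)] = list(actions)

    def act(self, app, goal):
        """Replay a stored card if one exists; otherwise fall back to
        expensive step-by-step improvisation (not modelled here)."""
        card = self.cards.get((app, goal))
        if card is not None:
            return ("replay", card)
        return ("improvise", [])

mem = AgentMemory()
mem.record("settings", "enable wifi",
           ["open settings", "tap Network", "toggle Wi-Fi on"])

mode, steps = mem.act("settings", "enable wifi")
print(mode, len(steps))  # replay 3
```

The trade-off in the story is visible even at this scale: every hit on a stored card skips the per-step reasoning loop entirely (hence the token savings), but any goal outside the explored set falls back to improvisation, and a card silently breaks when the app's interface changes underneath it.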

Glossary
Knowledge Graph — A structured map where each item is linked to related items — like an index card system where cards reference each other rather than sitting in a flat pile.
token — The basic unit an AI language model reads and writes; roughly three-quarters of a word. More tokens processed means higher cost.
AndroidWorld — A standard research benchmark where AI agents are scored on how reliably they complete real Android phone tasks end-to-end.
The bigger picture

Three papers, one thread: AI agents are getting better at acting in the world — navigating phones, directing traffic — but the moment they need to track how the world changes over time, they collapse. MEME shows that even the best memory architectures fail catastrophically when a fact gets replaced or disappears. EAM shows that a structured approach to memory makes phone agents dramatically more capable, but only for the narrow slice of tasks it has already explored. LISA sidesteps the memory problem entirely by working with live real-time data rather than anything stored.

My read: the bottleneck is not intelligence in the narrow sense. These systems can reason, plan, and execute. The bottleneck is dynamic memory — updating a model of the world as the world changes. MEME exposes it starkly. EAM makes a partial dent. LISA routes around it. None of them solves it.

That gap — between 'I remember this fact' and 'I know this fact is now outdated' — is, right now, one of the most underrated open problems in applied AI.

What to watch next

Watch for follow-up work citing the MEME benchmark — a 1% accuracy floor on absence reasoning is the kind of result that provokes fast responses from other labs. On LISA, the natural next step is a real-world pilot at a connected-vehicle test site; a few such sites exist in Europe and Singapore, and simulation results this clean tend to attract that kind of interest quickly. The open question I would most want answered: can any memory architecture close that gap without a 70-times cost penalty — or is cheap dynamic memory simply not possible with today's retrieval approaches?

Thanks for reading — JB.
DeepScience — Cross-domain scientific intelligence
deepsci.io