DeepScience · Artificial Intelligence · Daily Digest

AI Can Look Around a Room — But Can't Remember It

Today's AI research asks one urgent question: can an AI keep track of a changing world, not just describe a frozen snapshot of it?
April 27, 2026
Three papers today, and they happen to tell one connected story. Each one pokes at the same bruise: AI systems are surprisingly good at understanding a single image, and surprisingly bad at tracking what happens next. I'll walk you through the failure, one attempt to fix it from the vision side, and one from the robotics side.
Today's stories
01 / 03

AI Models Forget Where Objects Are When They Have to Walk Around

An AI that can describe a photo in perfect detail will still forget where the cup was the moment you move it.

Here is what SpaMEM is testing. Imagine you are playing a very simple game: you walk through a house, objects appear and disappear, and every few steps you are asked where something is. Not a photo quiz — an ongoing walk. You have to remember what you saw earlier and update your mental map as things change. Current AI vision models — the kind that power image description tools and visual assistants — were built for the photo quiz. They are trained to look at one frame and answer a question about it. SpaMEM's team built a large synthetic benchmark to test what happens when you take those same models and put them on the ongoing walk instead: ten million images, 25,000 interaction sequences, 1,000 procedurally generated houses.

The result is a clean collapse. One leading model, InternVL3, scored an F1 of 0.36 — already modest — on single static frames. Put it in an ongoing visual stream where it has to track changes itself, and that score drops to 0.13. That is not a small dip. That is the difference between passable and nearly random. The team also found a sharp cliff between two conditions: when you hand the model a text summary of the scene so far (like a helpful assistant whispering 'the red block is now on the left'), it does fine. Take away the text crutch and give it only raw egocentric video, and it falls apart.

The bottleneck is not sensor quality — adding depth cameras changed almost nothing. The problem is that these models have no reliable way to revise their internal picture of the world as evidence accumulates. The catch: this is a synthetic benchmark, not a real kitchen or factory floor. Whether the same collapse shows up in deployed systems — and how much it matters — is still an open question.
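To make the headline numbers concrete, here is a minimal Python sketch of how an F1 score like that 0.36 vs 0.13 gap gets computed. The object-location answers below are toy data invented for illustration, not SpaMEM's actual evaluation protocol.

```python
def f1_score(predicted, actual):
    # predicted / actual: sets of (object, location) answers
    tp = len(predicted & actual)        # answers that were right
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)     # right when it answers
    recall = tp / len(actual)           # answers when it should
    return 2 * precision * recall / (precision + recall)

# Toy contrast between the two test conditions: on a static frame the
# model gets three of four locations; after an ongoing walk, only one.
truth = {("cup", "table"), ("book", "shelf"), ("key", "desk"), ("hat", "bed")}
static_f1 = f1_score({("cup", "table"), ("book", "shelf"), ("key", "desk")}, truth)
stream_f1 = f1_score({("cup", "table")}, truth)
print(round(static_f1, 2), round(stream_f1, 2))  # 0.86 0.4
```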

Glossary
F1 score: A combined measure of accuracy that balances how often the model is right when it answers against how often it answers when it should.
egocentric observations: What the AI 'sees' from its own moving viewpoint, like a body camera, rather than from a fixed camera above the scene.
02 / 03

A Scientific AI That Zooms In and Crops Images While It Thinks

When the map is blurry, a good navigator doesn't just stare harder — they pull out a magnifying glass.

Most AI reasoning with images works like this: you hand the model a photo, it looks at the whole thing at once, and then it starts writing its answer. The image is frozen. The model never goes back to poke at it. S1-VL, a 32-billion-parameter model that the paper's team built by fine-tuning Qwen3-VL, tries something different. During its chain-of-thought reasoning — the internal 'thinking out loud' before it gives an answer — it can write Python code to actually manipulate the image: zoom into a corner, crop a section, sharpen a region. Then it looks at the result and keeps thinking. The team calls this 'Thinking-with-Images', and the analogy is apt: it is like a student who, mid-sentence while writing an answer, picks up the exam paper and holds it under a desk lamp to read small print.

The model is aimed at scientific tasks — reading high-resolution charts, diagrams, and real-world visual data — where that kind of active interrogation genuinely matters. On five dedicated benchmarks testing that ability, S1-VL claims the top slot across all five. On broader scientific reasoning tests, the team says it approaches leading models but stops short of claiming the top position there.

Two important catches. First, the paper does not report concrete numerical scores in the portions available for review — so 'state-of-the-art on five benchmarks' is the team's claim, not a figure you can verify independently right now. Second, this only works in domains where code execution in a safe sandbox is available. Real-world deployment, where you cannot run arbitrary code, is a different story entirely.
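For a flavor of what 'writing code to look closer' means in practice, here is the kind of snippet such a model might emit mid-reasoning, sketched with the Pillow imaging library. The filename, the crop coordinates, and the idea that the legend sits in the lower-right corner are all invented for illustration; the paper's actual tool interface may differ.

```python
from PIL import Image

# Hypothetical mid-reasoning action: the chart's legend is too small
# to read, so crop it out and enlarge it before continuing to reason.
img = Image.open("figure_3_chart.png")   # invented filename
w, h = img.size

# Crop the lower-right quadrant, where this imagined legend sits.
legend = img.crop((w // 2, h // 2, w, h))

# Upsample 3x so small labels and series names become legible.
zoomed = legend.resize((legend.width * 3, legend.height * 3), Image.LANCZOS)
zoomed.save("legend_zoomed.png")  # re-attached as a fresh observation
```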

Glossary
chain-of-thought reasoning: A technique where an AI writes out intermediate steps before giving its final answer, like showing its work on a maths problem.
fine-tuned: Describes a base AI model that has been further trained on a specific type of task to make it better at that task.
03 / 03

A Robot That Keeps a Running Map of Objects — Instead of Forgetting Between Glances

A robot arm that forgets where it put the red block the moment it looks away is useless for any real job.

Most robot vision systems, including the vision-language-action (VLA) models that turn camera input and natural-language instructions into physical movements, are built on a tidy assumption: the latest camera frame tells you everything you need to decide what to do next. Grab the cup, move it, look again, grab the next thing. Each action is independent. Researchers call this the Markovian assumption — fancy name, simple idea: the past does not matter, only right now does. For simple tasks, that is fine. But the real world constantly breaks it. An object rolls behind another one. A previous step determines whether the next step is even possible. The robot needs a memory, not just a camera.

CodeGraphVLP tackles this with two pieces. First, a persistent semantic graph — think of it as a sticky-note corkboard that the robot updates continuously, recording which objects are present, where they are, and how they relate to each other. This corkboard survives between glances. Second, instead of asking a large language model to decide every move in real time (slow, expensive), the system writes a short computer program once at the start of the task. That program then runs quickly, querying the corkboard as needed, without phoning home to the language model every few seconds.

Tested on three real-world tabletop manipulation tasks with non-Markovian dependencies — tasks where the current frame genuinely is not enough — CodeGraphVLP outperformed systems that use step-by-step text memory, memory tokens baked into the model, and systems that do query the language model continuously. It also ran faster. The catch: three tabletop tasks is a narrow test. The specific success rate numbers are not in the text available, and it is a long road from a lab bench to an unstructured environment.
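To see why the corkboard helps, here is a minimal sketch of a persistent scene graph, assuming a simple dictionary-backed design. The class name, fields, and coordinates are invented for illustration and are not CodeGraphVLP's actual implementation.

```python
class SceneGraph:
    """A persistent 'corkboard' the robot updates on every glance."""

    def __init__(self):
        self.positions = {}     # object name -> last known (x, y)
        self.relations = set()  # (subject, relation, object) triples

    def observe(self, name, position, relations=()):
        # Called per observation: overwrite stale entries, keep the rest.
        self.positions[name] = position
        self.relations.update(relations)

    def where(self, name):
        # Answerable even when the object is currently out of view.
        return self.positions.get(name)

graph = SceneGraph()
graph.observe("red_block", (0.12, 0.40), [("red_block", "left_of", "cup")])
graph.observe("cup", (0.31, 0.42))
# The camera looks away and the red block is occluded, but the plan
# program, written once up front, can still query the board cheaply:
print(graph.where("red_block"))  # (0.12, 0.4), no new LLM call needed
```

The split matters for speed: the expensive language model reasons once, to write the program, and the cheap program plus graph handle the moment-to-moment loop.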

Glossary
Markovian assumption: The idea that the current state of the world is all you need to decide your next action — previous history does not matter.
semantic graph: A data structure that stores objects and their relationships to each other, like a map that says 'the cup is to the left of the plate'.
vision-language-action model (VLA): An AI system that takes visual input and natural language instructions and outputs physical actions for a robot to perform.
The bigger picture

Look at what these three papers share. SpaMEM shows you the size of the gap: give an AI a moving visual world to track, and its performance falls off a cliff. S1-VL and CodeGraphVLP are two different attempts to build something back up. S1-VL says: let the AI reach back into the image itself during reasoning, actively, like a reader who re-reads a paragraph. CodeGraphVLP says: give the agent an explicit external memory it maintains over time, independent of what is currently on screen. Neither of those is a general solution. Both are engineering patches on a deeper problem: today's AI vision systems were mostly built to answer a question about a single frozen moment. The real world is not a series of frozen moments. It is a continuous stream where earlier events constrain later ones, objects disappear, and context accumulates. Until that gap closes, autonomy in physical environments — robots, self-driving systems, any AI that has to act rather than just describe — will keep hitting the same wall.

What to watch next

The SpaMEM benchmark is now public, which means other teams can run their models against it and report back — expect follow-up results within weeks as labs probe where their systems stand. The more interesting question to watch is whether techniques like S1-VL's active image manipulation and CodeGraphVLP's persistent graph can be combined: an agent that both pokes at its current input AND maintains a running map of what came before. Nobody has demonstrated that cleanly yet.

Thanks for reading — JB.
DeepScience — Cross-domain scientific intelligence
deepsci.io