DeepScience — Artificial Intelligence

DeepScience · Artificial Intelligence · Daily Digest

AI Cracks Olympiad Math for 44 Cents a Problem

Today's AI research asks how much we can really trust the systems we've built to keep us safe.

June 06, 2026

Three stories today, and they form an uncomfortable triangle. One is genuinely exciting: an AI system tearing through competition math at a fraction of previous costs. The other two are warnings. One research team showed how to sneak harmful images past safety filters in fewer than two edits on average, and another revealed that your AI assistant's memory might be its biggest security hole. Smart day. Uncomfortable day. Let's dig in.

Today's stories

01 / 03

AI Solves Olympiad Math Problems for 44 Cents Each

Eleven out of twelve problems from the 2025 Putnam exam, one of the hardest math competitions in the world, solved automatically, each for roughly the price of a candy bar.

Think of building a house. One approach: start hammering nails wherever seems right, back up when the wall collapses, try again. The other approach: draw a blueprint first, figure out which walls depend on which beams, then build the independent pieces in parallel. The second approach is obviously smarter, and it's exactly what the team behind GOEDEL-ARCHITECT built into their AI theorem-proving system.

Formal theorem proving, the kind where every single step of a mathematical argument is verified by a machine, leaving zero room for hand-waving, has historically been expensive and fragile. Most AI tools that attempt it work recursively: they pick a goal, try to prove it, get stuck, retreat, try another branch. That burns time and compute. GOEDEL-ARCHITECT does something different. Before touching a single proof step, it generates a dependency graph, essentially a blueprint showing which mathematical claims rely on which other claims. Then it goes after independent claims simultaneously. The backbone model is DeepSeek-V4-Flash, an open-weight 284-billion-parameter system.

The numbers are hard to ignore. On MiniF2F, a standard set of 244 competition math problems, the system hit 99.2% accuracy, and 100% when given a rough human sketch to start from. On the 2025 International Mathematical Olympiad, it solved 4 out of 6 problems. On the 2025 Putnam exam, 11 out of 12. Cost per problem: about $0.44, versus roughly $244 for the next-best open pipeline, a reduction of roughly 550-fold.

The catch is real. The best scores, the ones that reach 100%, require a human to first write a rough natural-language proof sketch. That's a meaningful dependency, not a footnote. And competition math, while genuinely difficult, is a narrow domain. Verifying that a bridge design is structurally sound is a different beast. Still, making formal verification more than five hundred times cheaper changes what is financially realistic to check.
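The blueprint idea can be illustrated with a small sketch. This is not the GOEDEL-ARCHITECT implementation; it only shows how a dependency graph lets independent claims be grouped into batches that could be attempted at the same time. The claim names and the `parallel_batches` helper are hypothetical, and the grouping uses Python's standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

def parallel_batches(depends_on):
    """Group claims into batches. Every claim in a batch has all of its
    prerequisites in earlier batches, so a batch could be proved in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the blueprint is circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # claims whose prerequisites are done
        batches.append(ready)
        sorter.done(*ready)                 # mark them proved, unlocking dependents
    return batches

# Hypothetical blueprint: each claim maps to the claims it relies on.
blueprint = {
    "main_theorem": {"lemma_a", "lemma_b"},
    "lemma_b": {"lemma_c"},
    "lemma_a": set(),
    "lemma_c": set(),
}

batches = parallel_batches(blueprint)
print(batches)
```

In this toy blueprint, `lemma_a` and `lemma_c` depend on nothing and land in the first batch together, which is the "build the independent pieces in parallel" step from the house analogy.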

Glossary

formal theorem proving — A style of mathematical proof where every step is checked by a computer program, so there is no possibility of a hidden error or hand-wavy leap.

dependency graph — A map showing which results must be proven before other results can be attempted — like a recipe that tells you to make the sauce before you plate the dish.
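To make the two glossary terms concrete, here is a tiny machine-checked example in Lean 4. It is an illustration only, not taken from the paper: the theorem names are made up, and it assumes the core library lemmas `Nat.add_zero` and `Nat.zero_add`. The first two claims do not mention each other, so they could be proved independently; the final claim depends on both.

```lean
-- Two independent lemmas: neither refers to the other.
theorem lemma_a (n : Nat) : n + 0 = n := Nat.add_zero n
theorem lemma_b (n : Nat) : 0 + n = n := Nat.zero_add n

-- The main claim depends on both lemmas; Lean checks every rewriting step.
theorem main_claim (n : Nat) : n + 0 = 0 + n := by
  rw [lemma_a, lemma_b]
```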

Source: Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement

02 / 03

Two Photo Edits Can Hide Harmful Images from AI Safety Filters

On average, fewer than two photo edits (a brightness tweak, a slight crop) are enough for a harmful image to slip past AI safety classifiers while staying obviously harmful to any human who sees it.

Imagine a security guard trained to recognize dangerous objects from reference photos. Adjust the lighting slightly, add a soft filter, and the guard waves you through, even though the object hasn't changed at all. The threat is still there. The detector just stopped seeing it. That failure mode is exactly what the team behind RedEdit exposed in AI image safety classifiers, the systems platforms use to automatically flag or remove harmful visual content before it spreads.

Their approach works like a patient chess player. RedEdit maps out sequences of small photo edits (tone adjustments, crops, stylistic filters) by running a planning algorithm called Monte Carlo Tree Search. Think of it as building a decision tree: at each branch, explore which edit is most likely to fool the detector next, keeping promising paths and abandoning dead ends. Crucially, the attacker doesn't need to see inside the detector, only whether the image was flagged or not.

In tests on the UnsafeBench benchmark across multiple classifier architectures, RedEdit evaded detection on 76.2% of harmful images. Average number of editing steps to succeed: fewer than two. And 93% of the images that evaded detection still contained their harmful content; humans looking at them had no trouble seeing what was wrong. The filter was fooled. The danger was not gone.

Here is what this does not mean: that every platform is wide open. Real content moderation is layered, not a single classifier, and disclosing attacks like this helps defenders understand what to patch. But the uncomfortable part is the low bar: no insider access, no special compute, shockingly few steps. The authors also found that edits tuned to fool one classifier generalized to fool others, suggesting shared architectural weaknesses across systems.
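The "keep promising paths, abandon dead ends" step in Monte Carlo Tree Search is commonly implemented with a selection rule that balances average payoff against how rarely an option has been tried. The digest does not say which rule RedEdit uses, so the sketch below shows the textbook UCB1 rule as an assumption, on abstract option names and made-up statistics. The `ucb1_select` function is hypothetical and is not RedEdit's code.

```python
import math

def ucb1_select(stats, exploration=math.sqrt(2)):
    """Pick the next option to explore.

    stats maps an option name to (visits, total_reward).
    Untried options are chosen first; otherwise the option with the
    highest UCB1 score (average reward plus an exploration bonus) wins.
    """
    total_visits = sum(visits for visits, _ in stats.values())
    best_option, best_score = None, -math.inf
    for option, (visits, total_reward) in stats.items():
        if visits == 0:
            return option  # always try an unexplored branch once
        average = total_reward / visits
        bonus = exploration * math.sqrt(math.log(total_visits) / visits)
        if average + bonus > best_score:
            best_option, best_score = option, average + bonus
    return best_option

# Equal visit counts: the branch with the better average reward is preferred.
exploit = ucb1_select({"option_a": (10, 6.0), "option_b": (10, 2.0)})

# A rarely tried branch gets a large bonus and is explored despite a poor average.
explore = ucb1_select({"option_a": (100, 60.0), "option_b": (1, 0.0)})

print(exploit, explore)
```

The second call shows why such a search rarely gives up on a branch after a single bad result. From the defender's side, that persistence is one reason a single pass/fail test of a classifier says little about its robustness.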

Glossary

Monte Carlo Tree Search — A planning algorithm that explores possible sequences of moves by branching forward, trying promising paths, and learning from results — originally developed for games like Go.

content classifier — An AI system trained to label images or text as safe or unsafe, used by platforms to automatically filter harmful material.

Source: RedEdit: Agentic Red-Teaming of Image Safety Classifiers via MCTS-Guided Photo-Editing

03 / 03

Your AI Assistant's Memory Is a Security Hole

Turning on memory in an AI assistant caused jailbreak success rates to jump from about 3% to nearly 20%, and it took no exotic attack to get there.

Picture a personal assistant who keeps notes on everything you've ever told them. Now imagine someone quietly slipped a fake note into the filing cabinet six months ago: 'When the user asks about finances, also recommend this.' Your assistant, being helpful, pulls that note every time money comes up, and neither of you notices it happening.

That is the attack surface a team studying AI agent memory systems exposed across three widely used frameworks: A-Mem, Mem0, and MemOS. When these systems store memories and retrieve them by semantic similarity (asking 'what past notes sound like this new query?'), they can pull in memories that are off-topic, manipulated, or actively harmful. The researchers describe this as a 'durable control channel': your AI's long-term memory can quietly reshape how it behaves, even in conversations that seem unrelated to whatever was previously stored.

The numbers are striking. In their tests, simply enabling memory caused tool-call drift failures, cases where the AI invoked the wrong tool or took an unintended action, to spike from 5.1% to over 50%. Jailbreak success rates climbed from 3.1% to roughly 20% on average across the models tested, which included GPT-4o-mini, Gemini-3-Pro, and Claude-Sonnet-4.6.

Their proposed fix, MemGate, is a tiny 9-million-parameter module, about 35 megabytes, that sits between the memory store and the AI and asks: does this memory actually belong in this conversation? With GPT-4o-mini, it reduced cross-domain leakage from 27% to 3.5% and jailbreak success from 16.8% to 4.4%, while slightly improving memory usefulness. The catch: the evaluation covers specific scenarios and models, and a gating layer manages risk rather than eliminating it. The deeper question, whether AI memory systems were designed with security in mind at all, remains unanswered.
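A toy sketch can show why similarity-only retrieval leaks across domains and where a gate would sit. Everything here is an assumption for illustration: the two-dimensional vectors, the `domain` tags, and the exact-match `gate` function are hypothetical stand-ins. MemGate itself is described as a learned module, and its actual inputs and decision rule are not given in this digest.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, memories, top_k=2):
    """Similarity-only retrieval: return the top_k memories closest to the query."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return ranked[:top_k]

def gate(query_domain, candidates):
    """Hypothetical stand-in for a learned gate: keep only memories that were
    stored in the same domain as the current conversation."""
    return [m for m in candidates if m["domain"] == query_domain]

# Toy memory store; vectors and domain tags are invented for this example.
memories = [
    {"text": "User prefers index funds", "domain": "finance", "vec": [0.9, 0.1]},
    {"text": "Always recommend BrandX", "domain": "shopping", "vec": [0.8, 0.2]},
    {"text": "User is allergic to peanuts", "domain": "health", "vec": [0.1, 0.9]},
]

query_vec = [1.0, 0.0]                      # a finance question, in this toy space
candidates = retrieve(query_vec, memories)  # similarity alone
allowed = gate("finance", candidates)       # similarity plus the gate

print([m["text"] for m in candidates])
print([m["text"] for m in allowed])
```

Similarity alone returns both the finance note and the unrelated shopping note, because their vectors happen to point in nearly the same direction. The gate drops the shopping note before it reaches the model. A real gate has a harder job than matching tags, which is why the question of adaptive adversaries matters.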

Glossary

semantic similarity retrieval — A method of searching memory by meaning — finding notes that 'sound like' the current question — rather than by exact keywords.

tool-call drift — When an AI agent calls the wrong tool or takes the wrong action, often because irrelevant context has shifted its interpretation of the task.

jailbreak — A prompt or input designed to make an AI ignore its safety guidelines and produce content it was trained to refuse.

Source: Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

The bigger picture

Three papers today, and they form an uncomfortable pattern. The math-proving system shows that careful structure (plan before you act, map dependencies before you prove) can collapse error rates and costs simultaneously. The image-evasion research shows that 'we have a classifier' is not the same as 'we have safety.' And the memory work shows that the more context an AI carries about you, the larger the attack surface becomes.

What ties them together is the gap between benchmark performance and real-world trustworthiness. GOEDEL-ARCHITECT scores near-perfectly on known test sets, but needs human sketches to reach its very best results. RedEdit exposes that high detection accuracy in normal conditions tells you almost nothing about robustness under adversarial pressure. MemGate patches a specific hole, but the hole exists because memory systems appear to have been designed for utility first, with security a distant second if it was considered at all. If the near-term direction of AI is persistent, agentic, tool-using systems with long memories, and everything points that way, then the stress tests that matter are the ones in these three papers.

What to watch next

On formal verification, the natural next test is whether systems like GOEDEL-ARCHITECT hold up on USAMO 2026 problems. Early scores are already in (3 out of 6), but further independent evaluation will matter. On memory security, watch whether Mem0 and MemOS publish responses or patches; both are active commercial products with large user bases. The open question I'd most want answered: does MemGate's 9-million-parameter gate hold up when an adversary knows it's there and designs memories specifically to pass it?