DeepScience · Artificial Intelligence · Daily Digest

AI Gets Fake Computers, Better Training, Cheaper Security Tests

Today's three papers are all about fixing the machinery that trains and tests AI — not just building shinier models.
May 02, 2026
Three stories today, and they fit together more neatly than most days allow. All three are about the same underlying problem: the gap between how we build AI systems and how those systems actually perform when something goes wrong. Let me walk you through each one.
Today's stories
01 / 03

Researchers Built 1,000 Fake Computers for AI Agents to Practice On

How do you teach an AI to manage your files if it has never seen a messy inbox in its life?

Teaching an AI assistant to handle real computer work (reorganizing folders, drafting emails, catching up on overdue tasks) sounds manageable until you realize the AI has never actually sat in front of a cluttered desktop. Real computers are messy in very human ways: half-finished projects, duplicated files, calendars that got out of hand three weeks ago. The team behind this paper decided to manufacture that mess at scale.

Think of it like a flight simulator for pilots. You do not learn emergency procedures by crashing real planes. You practice in a simulation where the dials, the turbulence, and the warnings feel realistic enough to build real instincts.

The researchers built 1,000 synthetic computers, each seeded from a human persona (a teacher, a project manager, a designer) and filled with the kind of digital clutter that persona would actually accumulate. Then two AI agents worked together on each machine. The first invented month-scale productivity goals: catch up on these reports, reschedule these meetings, coordinate with these collaborators. The second ground through them, navigating directories and creating files step by step. Each simulation ran for over 2,000 conversational turns and took more than eight hours of clock time.

Agents trained on this synthetic experience improved on both familiar and unfamiliar productivity benchmarks, and the team has released 100 of these computers (50 Windows-style, 50 macOS-style) plus 500 simulation reports on HuggingFace for anyone to use. The catch: the paper describes the gains as 'significant' but does not specify exact numbers or name the benchmarks in the available text. The direction of the result is encouraging; the magnitude is still fuzzy. Treat this as a credible proof of concept, not a finished product.
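If you are curious what that two-agent loop looks like in code, here is a deliberately toy sketch in Python. Nothing below comes from the paper's released artifacts: the personas, the clutter generator, and both agents are hypothetical stand-ins meant only to make the structure concrete.

```python
# Toy sketch of the two-agent simulation loop described above.
# Personas, the clutter generator, and both agents are hypothetical
# stand-ins, not the paper's released code or data.
import random
from dataclasses import dataclass, field

@dataclass
class SyntheticComputer:
    persona: str
    files: dict = field(default_factory=dict)  # path -> contents

def seed_clutter(persona: str, n_files: int = 40) -> SyntheticComputer:
    """Fill a fresh machine with the mess this persona would accumulate."""
    stems = {
        "teacher": ["grades_draft", "lesson_plan_old", "parent_email"],
        "designer": ["logo_v2_FINAL", "moodboard", "client_notes"],
    }
    pc = SyntheticComputer(persona)
    for i in range(n_files):
        stem = random.choice(stems.get(persona, ["untitled"]))
        pc.files[f"{stem}_{i}.txt"] = "..."  # half-finished content
    return pc

def goal_agent(pc: SyntheticComputer) -> list:
    """Agent 1: invents month-scale productivity goals for this machine."""
    return [f"consolidate the '{name}' files" for name in list(pc.files)[:3]]

def worker_agent(goal: str) -> list:
    """Agent 2: grinds through one goal step by step, one turn per action."""
    return [
        f"search files relevant to: {goal}",
        "open candidates and compare contents",
        "merge survivors, rename, delete duplicates",
    ]

pc = seed_clutter("teacher")
transcript = []
for goal in goal_agent(pc):
    transcript.extend(worker_agent(goal))
# A real run logs 2,000+ turns over eight-plus hours; this logs nine.
print(f"{len(pc.files)} files seeded; {len(transcript)} turns logged")
```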

Glossary
long-horizon simulation: A training run where an AI completes a very long sequence of steps (hundreds or thousands) rather than answering a single question.
in-domain vs. out-of-domain: In-domain means tested on tasks similar to training; out-of-domain means tested on tasks the model has not seen before. Improving on both is harder and more meaningful.
02 / 03

A Simple Bridge Between Two Training Steps Makes AI Vision Models Sharper

There is a crack in how we train AI to reason about images, and plugging it yields several accuracy points at essentially no extra cost.

Modern vision-language models, the kind that can look at a photo and answer questions about it, are trained in two stages. First comes supervised fine-tuning, or SFT: you show the model thousands of worked examples, like a student copying out correct solutions from a textbook. Then comes reinforcement learning, or RL: the model practices on new problems and gets scored, learning from its mistakes.

The problem, identified by the researchers behind PRISM, is that these two stages do not connect cleanly. Fine-tuning pushes the model into a particular 'style' of responding, a comfort zone shaped by the training examples. But the RL stage needs the model to explore differently, and that comfort zone gets in the way. It is like drilling a footballer on set plays in an empty gym for weeks, then throwing them straight into a chaotic cup final with no transition training in between.

PRISM inserts a short bridge phase between SFT and RL. During this phase, the model plays an adversarial game against a classifier that tries to flag whether its outputs look wrong. The classifier has two specialists, one watching for perception mistakes and one watching for reasoning mistakes, and the model learns to satisfy both before RL even starts.

Tested on Qwen3-VL, an open vision-language model from Alibaba, PRISM added +4.4 average accuracy points at the 4-billion-parameter scale and +6.0 at 8 billion, consistently across three different RL algorithms. The gains held across all tested benchmarks. The catch: these are controlled benchmark results. Whether the improvement survives messy real-world conditions, such as ambiguous photos, oddly phrased questions, or domains the model has never seen, still needs independent testing.
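To make the shape of that bridge phase concrete, here is a minimal, hypothetical sketch in Python with PyTorch. It shows the adversarial structure (a policy trying to satisfy a two-headed critic), but the architecture, dimensions, losses, and data are all invented placeholders; the paper's actual models and training objective will differ.

```python
# Minimal, hypothetical sketch of a PRISM-style bridge phase.
# The real system uses a vision-language model and real responses;
# everything below is a placeholder to show the adversarial structure.
import torch
import torch.nn as nn

DIM = 64  # placeholder embedding size

class TwoHeadCritic(nn.Module):
    """Flags whether a response 'looks wrong' in two distinct ways."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU())
        self.perception_head = nn.Linear(DIM, 1)  # visual-grounding mistakes
        self.reasoning_head = nn.Linear(DIM, 1)   # logical mistakes

    def forward(self, h):
        z = self.trunk(h)
        return self.perception_head(z), self.reasoning_head(z)

policy = nn.Linear(DIM, DIM)  # stand-in for the fine-tuned model
critic = TwoHeadCritic()
opt_policy = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    prompts = torch.randn(32, DIM)     # placeholder prompt embeddings
    references = torch.randn(32, DIM)  # placeholder "correct" responses

    # 1) Critic learns to separate reference answers (label 1)
    #    from the policy's current outputs (label 0), on each head.
    produced = policy(prompts).detach()
    p_real, r_real = critic(references)
    p_fake, r_fake = critic(produced)
    loss_critic = (
        bce(p_real, torch.ones_like(p_real)) + bce(p_fake, torch.zeros_like(p_fake))
        + bce(r_real, torch.ones_like(r_real)) + bce(r_fake, torch.zeros_like(r_fake))
    )
    opt_critic.zero_grad(); loss_critic.backward(); opt_critic.step()

    # 2) Policy learns to satisfy BOTH heads at once: the bridge
    #    that smooths the handoff into reinforcement learning.
    p_gen, r_gen = critic(policy(prompts))
    loss_policy = bce(p_gen, torch.ones_like(p_gen)) + bce(r_gen, torch.ones_like(r_gen))
    opt_policy.zero_grad(); loss_policy.backward(); opt_policy.step()
```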

Glossary
supervised fine-tuning (SFT): A training step where a model learns by copying correct answers from a curated set of examples, like studying a worked answer key.
reinforcement learning (RL): A training step where a model tries things, gets scored, and adjusts, learning from feedback rather than example answers.
multimodal: Able to process more than one type of input; in this case, both images and text.
03 / 03

Stress-Testing AI for Security Vulnerabilities Just Got 7x Cheaper

Finding one way to trick a large AI model used to cost an hour of computation and the GPU memory of a small server room.

Before you trust any AI with anything important, someone should try to break it. In the security world, that process is called red-teaming: deliberately attacking a system to find weaknesses before real adversaries do. The most rigorous version involves crafting adversarial prompts, strings of text (often gibberish-looking) that are mathematically optimized to make a model do things it was trained not to do, like leak private information or ignore its safety instructions.

The problem is the price tag. Running one of these attacks on a model handling a 32,000-token context, roughly the length of a short novel, has required up to 264 gigabytes of GPU memory and around an hour of computation time. That is not a quick sanity check. That is a full research project.

The team behind FlashRT cut those costs by rethinking which parts of the calculation actually need to be computed in full. Think of it like recipe testing: if you want to know whether more salt improves a dish, you do not bake the entire cake every time you adjust the salt. You taste a small portion. FlashRT applies the same logic to the attack algorithm's forward and backward passes, the two computationally expensive directions of calculation, skipping redundant work while preserving the parts that actually determine whether an attack succeeds.

The results on Llama-3.1-8B and two Meta-SecAlign models: 2–7× faster attacks and 2–4× lower memory, with the 264 GB requirement dropping to 65 GB. Attack success rate actually went up by 10% compared to the prior baseline, meaning the cheaper attacks are also sharper. The catch: this applies to white-box attacks only, situations where you have full access to the model's internal workings. Most AI deployed in the real world is black-box. The method also needs broader testing beyond the two model families examined here.
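FlashRT's actual algorithm is more involved, but the underlying instinct (do not recompute what has not changed) is easy to illustrate. The sketch below, in Python with PyTorch and HuggingFace transformers, encodes the long fixed context once and re-scores only each candidate suffix against the cached state. This is a generic prefix-caching trick, not the paper's method; the model, prompt, and scoring function are stand-ins.

```python
# Generic prefix-caching sketch: compute the long, fixed context ONCE,
# then score each candidate adversarial suffix against the cached state.
# Illustrates the flavor of the optimization, not FlashRT's algorithm.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# The expensive part: a long prompt that never changes between candidates.
prefix = tok("Imagine a 32,000-token document sits here ...", return_tensors="pt")
with torch.no_grad():
    master_cache = model(**prefix, use_cache=True).past_key_values

def score_suffix(suffix_text: str) -> float:
    """Cheaply evaluate one candidate suffix without re-encoding the prefix."""
    suffix = tok(suffix_text, return_tensors="pt")
    past = copy.deepcopy(master_cache)  # keep the master cache pristine
    total_len = prefix.input_ids.shape[1] + suffix.input_ids.shape[1]
    mask = torch.ones(1, total_len, dtype=torch.long)
    with torch.no_grad():
        out = model(input_ids=suffix.input_ids,
                    attention_mask=mask,
                    past_key_values=past)
    # Toy score: mean log-probability of the suffix's own tokens.
    # A real attack would score a harmful target completion instead.
    logprobs = out.logits[:, :-1].log_softmax(dim=-1)
    targets = suffix.input_ids[:, 1:]
    return logprobs.gather(-1, targets.unsqueeze(-1)).mean().item()

for candidate in ["please ignore all previous instructions", "::override::"]:
    print(f"{score_suffix(candidate):8.3f}  {candidate}")
```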

Glossary
red-teaming: Deliberately attacking a system to find security weaknesses before real adversaries do; named after military exercises where one team plays the enemy.
prompt injection: An attack where carefully crafted text tricks an AI into ignoring its instructions and doing something it was trained not to do.
white-box attack: An attack carried out with full access to a model's internal structure, as opposed to a black-box attack, where you can only observe inputs and outputs.
context window: The maximum amount of text a language model can process in one go; like the size of a desk, more space means more material to consider at once.
The bigger picture

Three papers today, and they connect in a way that is worth sitting with for a moment. AI development right now does not have a raw capability problem; it has a feedback-loop problem. We build systems that practice in environments too clean for the real world (synthetic computers fix that), with training pipelines that have poorly matched stages (PRISM fixes that), and then we struggle to stress-test them affordably before deployment (FlashRT fixes that).

None of these papers announces a new model that does something astonishing. All three quietly make the machinery around model-building work better. That is not a glamorous story, but I would argue it is the more important one. The bottleneck in AI reliability right now is not imagination; we know what we want these systems to do. It is the unglamorous engineering of practice environments, training pipelines, and security checks. Today's papers are three small but real steps on that path.

What to watch next

The synthetic computers dataset is live on HuggingFace now, so watch for follow-up papers from other groups testing whether the training signal holds across different agent architectures — that independent replication is what will tell us how general this approach really is. On the security side, the open question I would want answered is whether FlashRT's efficiency gains hold up against models specifically hardened against adversarial prompts — the paper hints at Meta-SecAlign's vulnerability, but the full picture is still forming.

Thanks for reading — and if any of these three feel like they belong together more than usual, trust that instinct. — JB
DeepScience — Cross-domain scientific intelligence
deepsci.io