DeepScience — Artificial Intelligence

DeepScience · Artificial Intelligence · Daily Digest

Satellites, Deepfakes, and Smarter Look-Ups: AI Checks Its Work

This week AI got better at chaining tools together, spotting fake videos, and knowing when to search before answering.

July 02, 2026

Happy Wednesday. Today's papers are technically varied but share a quiet theme: AI systems learning to reach outside themselves — to satellite databases, reference images, knowledge graphs — instead of relying on what they already 'know'. That's a modest but real pattern worth tracking. Let me walk you through three stories.

Today's stories

01 / 03

An AI dispatcher routes satellite tools for environmental monitoring

What if a scientist with no coding skills could ask a satellite system to check for algal blooms — and the AI figured out, on its own, which tools to call in what order?

That is roughly what a team building a system called Terra AI tried to demonstrate. They connected a large language model (the kind of AI that understands plain-language instructions) to a suite of satellite analysis tools, including a classifier that detects algal blooms in water and an estimator that measures moisture levels in peat bogs. The AI's job was to act like a kitchen manager during a dinner rush: when an order comes in, figure out which station needs to do what, and in what sequence, without a human writing out the recipe each time.

The key finding is deceptively simple. When the team gave the AI explicit workflow rules, essentially a written recipe encoded into its instructions, reliability jumped sharply. Their measure of how well the AI picked the right tool improved from 0.71 to 0.89 out of 1.0. Their measure of whether it put tools in the right order went from 0.79 to 0.99. Those numbers suggest that the AI's biggest weakness wasn't reasoning; it was the absence of clear instructions.

Why does this matter? Most specialist satellite analysis tools are built by scientists for scientists. Terra AI's approach uses the Model Context Protocol (MCP), a standardised way to wrap any software tool so an AI can call it, and hints at a future where a journalist or a park manager could query a satellite archive without knowing Python.

The catch: the benchmark used only 20 test cases. No statistical significance tests were run. The researchers don't say which underlying language model they used. This is a proof of concept, not a deployed system. Treat the numbers as directionally interesting, not definitive.
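To make "explicit workflow rules" a little more concrete, here is a minimal sketch of how an ordering check could work. Everything in it is an assumption for illustration: the tool names, the `WORKFLOW_RULES` table, and the `order_violations` helper are hypothetical and are not taken from the Terra AI paper, which encodes its rules in the model's instructions rather than in code like this.

```python
# Hypothetical workflow rules: each tool lists the tools that must run before it.
# The tool names are illustrative only, not the ones used by Terra AI.
WORKFLOW_RULES = {
    "fetch_imagery": [],
    "mask_clouds": ["fetch_imagery"],
    "classify_algal_bloom": ["mask_clouds"],
    "estimate_peat_moisture": ["mask_clouds"],
}


def order_violations(plan):
    """Return (tool, missing_prerequisite) pairs for a proposed tool sequence."""
    seen = set()
    violations = []
    for tool in plan:
        for prerequisite in WORKFLOW_RULES.get(tool, []):
            if prerequisite not in seen:
                violations.append((tool, prerequisite))
        seen.add(tool)
    return violations


good_plan = ["fetch_imagery", "mask_clouds", "classify_algal_bloom"]
bad_plan = ["classify_algal_bloom", "fetch_imagery", "mask_clouds"]

print(order_violations(good_plan))  # no violations: every prerequisite ran first
print(order_violations(bad_plan))   # the classifier ran before cloud masking
```

A check like this only verifies a plan after the fact. The paper's reported gain comes from telling the model the rules up front, so treat the sketch as a way to picture what "the right order" means, not as the team's method.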

Glossary

Model Context Protocol (MCP) — A standardised interface that lets an AI model call external software tools the same way a USB port lets any device plug into any computer.

F1 score — A single number between 0 and 1 that balances precision (how many of the system's picks were right) against recall (how many of the right answers it found); 1.0 is perfect.
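For readers who want to see the arithmetic, here is a small sketch of the standard F1 calculation applied to tool selection. The digest's source does not describe exactly how the team scored their 20 test cases, so the set-based comparison and the tool names below are assumptions for illustration.

```python
def precision_recall_f1(predicted, expected):
    """Standard precision, recall and F1 over two collections of tool names."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical test case: the model picked two tools, both correct,
# but missed the cloud-masking step it should also have called.
expected_tools = ["fetch_imagery", "mask_clouds", "classify_algal_bloom"]
predicted_tools = ["fetch_imagery", "classify_algal_bloom"]

precision, recall, f1 = precision_recall_f1(predicted_tools, expected_tools)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this made-up case every pick is right (precision 1.0) but one needed tool is missing (recall about 0.67), which gives an F1 of 0.8.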

Source: Agentic Workflow Architecture for Environmental Remote Sensing Analytics

02 / 03

Vision-language AI gets better at recognising fake faces and videos

Spotting a deepfake used to mean looking for blurry ears — now the fakes are good enough that you need a system that actually looks things up before deciding.

A team whose paper is written in French (it appears to be a thesis or academic report) has developed a deepfake detection approach that works more like an art authenticator than a simple filter. Instead of just asking 'does this face look odd?', the system pulls up visually similar real images from a reference database and lets a vision-language model, an AI that understands both images and text simultaneously, compare the suspect face against what it found. If the visual story doesn't hold together, it flags the content as likely synthetic.

Think of it like a sommelier who doesn't just taste a glass but also checks the bottle against a reference catalogue before deciding if it's genuine. The comparison step is what provides both the decision and an explanation for it.

For video, the team added a temporal layer: rather than checking one frame, the system aggregates evidence across time and accounts for real-world degradation, meaning the compression artifacts you get when a video gets uploaded to a social platform, resized, and re-encoded. That last part matters a lot in practice, because most deepfakes arrive slightly damaged by the time anyone sees them. The authors report improved generalisation across generative models the system had never seen during training, which is the hard problem in this field: today's detector is usually trained on last year's fakes.

The catch: the full paper text wasn't publicly available for this digest, so specific accuracy numbers can't be reported here. The claims are plausible and the methodology sounds solid, but I'd call this a 'watch this space' story rather than a confirmed result.
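The "look it up first" step can be pictured as a nearest-neighbour search over image embeddings. The sketch below is an assumption-heavy illustration: the three-number vectors stand in for real image embeddings, and `reference_gallery`, `cosine_similarity` and `retrieve_references` are hypothetical names. The full paper was not available, so this is not a description of the team's actual retrieval method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_references(query, gallery, k=2):
    """Return the names of the k gallery items most similar to the query."""
    ranked = sorted(
        gallery.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings standing in for real images in a reference database.
reference_gallery = {
    "real_face_a": [0.9, 0.1, 0.0],
    "real_face_b": [0.1, 0.9, 0.0],
    "real_face_c": [0.8, 0.2, 0.1],
}
suspect_embedding = [1.0, 0.0, 0.0]

closest = retrieve_references(suspect_embedding, reference_gallery)
print(closest)  # the two references a downstream model would compare against
```

In a pipeline like the one described, the retrieved references would then be handed to the vision-language model alongside the suspect image; that comparison step is not modelled here.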

Glossary

vision-language model (VLM) — An AI system trained to understand both images and text at the same time, allowing it to answer questions about what it sees.

temporal aggregation — Combining evidence from many video frames over time, rather than judging a single frozen image.

generalisation — How well a system performs on examples it has never seen before — the real test of whether a detector is useful in the wild.
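Temporal aggregation can be as simple as averaging per-frame scores. The sketch below goes one step further and down-weights badly degraded frames. The quality weighting is my own assumption about how "accounting for degradation" might look; the function name and the numbers are hypothetical, and the paper's actual aggregation scheme is not known from the available text.

```python
from statistics import mean


def aggregate_video_score(frame_scores, frame_quality=None):
    """Combine per-frame 'fake' scores into one video-level score.

    frame_scores: values in [0, 1], higher means more likely synthetic.
    frame_quality: optional weights in [0, 1], lower for heavily compressed frames.
    """
    if not frame_scores:
        raise ValueError("need at least one frame score")
    if frame_quality is None:
        return mean(frame_scores)
    if len(frame_quality) != len(frame_scores):
        raise ValueError("one quality weight per frame is required")
    total_weight = sum(frame_quality)
    if total_weight == 0:
        return mean(frame_scores)
    weighted = sum(s * w for s, w in zip(frame_scores, frame_quality))
    return weighted / total_weight


# Hypothetical clip: the second frame is badly compressed and scores low.
frame_scores = [0.9, 0.2, 0.8, 0.7]
frame_quality = [1.0, 0.1, 0.9, 0.8]

plain_average = aggregate_video_score(frame_scores)
quality_weighted = aggregate_video_score(frame_scores, frame_quality)
print(f"plain={plain_average:.2f} weighted={quality_weighted:.2f}")
```

The point of the example: a single damaged frame drags the plain average down, while the weighted version lets the cleaner frames carry more of the verdict.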

Source: Détection d'images et de vidéos générées par l'IA par apprentissage multimodal et guidé par la connaissance

03 / 03

Teaching AI to know when to look something up before answering

Every time an AI answers a question from pure memory, it risks confidently making something up. SPARKLE is a system designed to decide, question by question, whether to check first.

You probably know someone who always answers questions immediately, sometimes wrong, and someone else who says 'let me look that up' before committing. Language models are currently the first type. SPARKLE, built by a research team (the paper doesn't name an institution in the available text), tries to train a system that behaves more like the second.

The technical approach: they trained a small, lightweight model, call it a proxy, to act as a librarian standing between you and a large language model. For each question, the proxy decides whether the main AI should answer from memory, search a knowledge base first, or search again with a reformulated question. The proxy was trained using reinforcement learning: essentially trial and error, with rewards for getting questions right.

The concrete results are modest but real. On standard question-answering benchmarks the system hadn't been trained on, SPARKLE beat the best existing 'adaptive retrieval' systems by an average of 2.85 percentage points. On benchmarks it had seen during training, the improvement was 9.17 percentage points. Those numbers aren't huge, but they're consistent across seven different test sets, which is the right kind of evidence.

The important detail: the proxy is deliberately decoupled from both the language model and the search engine it uses. That means, in principle, you can drop it into an existing AI system without retraining everything, like adding a fact-checking layer to a workflow you already have.

The catch: question-answering benchmarks are a controlled setting. Real-world questions are messier, more ambiguous, and often don't have a clean correct answer sitting in a database. Whether 2.85 points of benchmark improvement translates into meaningfully fewer wrong answers in practice is genuinely unknown.

Glossary

retrieval-augmented generation (RAG) — A technique where an AI looks up relevant text from an external database before generating its answer, rather than relying only on what it learned during training.

reinforcement learning — A training method where an AI learns by trying things and receiving rewards or penalties based on the outcome — like training a dog, but the dog is a maths function.

adaptive retrieval — Deciding on the fly whether a question needs a database search, rather than always searching or never searching.
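Here is a toy version of the three-way decision an adaptive retrieval proxy makes. To be clear about the assumptions: SPARKLE learns its policy through reinforcement learning, whereas this sketch hard-codes the choice with hand-written rules. The `Action` names, the `choose_action` function, the 0.8 confidence threshold and the two-search budget are all hypothetical.

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer from memory"
    SEARCH = "search the knowledge base"
    REFORMULATE = "search again with a reworded question"


def choose_action(confidence, searches_done, evidence_found, max_searches=2):
    """Hand-written stand-in for a learned routing policy.

    confidence: hypothetical estimate (0 to 1) that the model already knows the answer.
    searches_done: how many searches have been run for this question so far.
    evidence_found: whether the last search returned something useful.
    """
    if searches_done >= max_searches:
        return Action.ANSWER  # search budget spent, answer with what we have
    if searches_done == 0:
        return Action.ANSWER if confidence >= 0.8 else Action.SEARCH
    return Action.ANSWER if evidence_found else Action.REFORMULATE


# Walk one uncertain question through the loop.
print(choose_action(confidence=0.4, searches_done=0, evidence_found=False).value)
print(choose_action(confidence=0.4, searches_done=1, evidence_found=False).value)
print(choose_action(confidence=0.4, searches_done=2, evidence_found=False).value)
```

Because the function only looks at a few signals and returns an action, it could sit in front of any language model and any search backend. That is the plug-and-play property the paper emphasises, although the real proxy is a trained model rather than three if-statements.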

Source: SPARKLE: A Structured and Plug-and-play Agentic Retrieval Policy for Adaptive RAG Models

The bigger picture

Look at these three papers side by side and a pattern emerges: all of them are about AI systems that reach outside themselves to be more reliable. Terra AI reaches out to satellite tools it was told to trust. SPARKLE reaches out to a database before committing to an answer. The deepfake detector reaches out to a reference gallery before making a judgment. This is different from the 'bigger model, more parameters' approach that dominated the last few years. It is closer to what humans do when we're being careful: we check. We look it up. We compare against something we already trust. The limitation today is that all three systems still need someone to design the scaffolding — what tools exist, what to search, what references to compare against. The AI doesn't discover those resources on its own. That gap between 'AI that uses tools reliably' and 'AI that discovers which tools it needs' is still wide open, and nobody has convincingly closed it.

What to watch next

The deepfake detection space is moving into competition season: a few major benchmarks, including FakeBench and the DFAD challenge, typically report updated results in the autumn. If SPARKLE's authors publish an institution-affiliated preprint with ablation tables, that will be worth reading closely. The open question I'd most want answered: does a roughly 3-point benchmark improvement on QA (2.85 points, in SPARKLE's case) translate into any measurable reduction in hallucinated citations in real document workflows?