DeepScience

Artificial Intelligence · Daily Digest

June 12, 2026

297 Papers · 10/10 Roadblocks Active

Connections

⚡ Signal of the Day

• MaxProof reports 35/42 on IMO 2025 and 36/42 on USAMO 2026, both at or above the human gold-medal threshold, making it the strongest publicly reported result for AI formal mathematical reasoning to date.

• The result is architecturally notable: it combines generative-verifier reinforcement learning with a defense-in-depth proof checker (low false-positive rate) and population-level test-time scaling via tournament selection, suggesting that reliable verification is as important as generation quality for frontier reasoning.

• Watch whether the verification architecture generalises. If generative verifiers with low false-positive rates can be ported to domains like code correctness or scientific derivation, similar step-change gains could follow in those areas. The tournament-selection scaling strategy also deserves independent replication.

📄 Top 10 Papers

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

MaxProof trains a proof-generation model with reinforcement learning while pairing it with a 'defense-in-depth' generative verifier designed to reject incorrect proofs with very low false-positive rates. At test time, many candidate proofs are generated and a tournament-selection process picks winners, allowing compute to be traded for accuracy. The system scores at or above the human gold-medal threshold on both IMO 2025 and USAMO 2026, making it the strongest reported result on competition-level formal mathematics and a concrete demonstration that reliable verification, not just generation, is the key bottleneck.

██████████ 1.0 reasoning-reliability Preprint
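
The tournament-selection step in the MaxProof summary above can be pictured with a minimal Python sketch. It assumes a single-elimination bracket and a toy scoring function standing in for the generative verifier; `tournament_select`, `toy_verifier_score`, and `prefer_by_verifier` are hypothetical names, and the paper's actual selection rule may differ.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tournament_select(candidates: List[T], prefer: Callable[[T, T], T]) -> T:
    """Single-elimination tournament: pair candidates, keep each pair's winner."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = [prefer(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # the unpaired candidate gets a bye
        pool = next_round
    return pool[0]


def toy_verifier_score(proof: str) -> float:
    """Assumption: a stand-in score, not a real proof checker."""
    return float(proof.count("QED")) - 0.1 * proof.count("gap")


def prefer_by_verifier(a: str, b: str) -> str:
    return a if toy_verifier_score(a) >= toy_verifier_score(b) else b


candidates = ["step gap gap", "step QED", "step gap QED", "step"]
best = tournament_select(candidates, prefer_by_verifier)
```

Generating more candidates enlarges the bracket, which is the sense in which compute is traded for accuracy. A verifier that wrongly accepts a flawed proof can still let it win the bracket, which is why the false-positive rate matters.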

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

TerraBench is a benchmark of 403 executable agentic tasks spanning weather forecasting, climate simulation, and geospatial processing, testing whether AI agents can coordinate specialist scientific tools — not just language — to answer Earth-science questions. The benchmark separates two failure modes: tool-use proficiency (did the agent call the right tools correctly?) and final numerical accuracy, exposing cases where agents produce plausible-sounding answers via wrong workflows. Current LLMs score significantly below what domain scientists need, pinpointing workflow coordination and precise tool parameterisation as the limiting factors.

██████████ 0.9 agent-tool-use Preprint
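
The two-axis scoring described in the TerraBench summary can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's scoring code: the trace format, the tool names (`load_reanalysis`, `regrid`), the exact-match rule, and the 1% tolerance are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical trace format: (tool name, parameters as a tuple of (key, value) pairs).
ToolCall = Tuple[str, Tuple[Tuple[str, object], ...]]


@dataclass
class TaskScore:
    tool_score: float     # fraction of reference calls the agent reproduced exactly
    answer_correct: bool  # final value within the relative tolerance


def score_task(agent_calls: List[ToolCall], reference_calls: List[ToolCall],
               answer: float, reference_answer: float,
               rel_tol: float = 0.01) -> TaskScore:
    matched = sum(1 for call in reference_calls if call in agent_calls)
    tool_score = matched / len(reference_calls) if reference_calls else 1.0
    correct = math.isclose(answer, reference_answer, rel_tol=rel_tol)
    return TaskScore(tool_score, correct)


reference = [("load_reanalysis", (("variable", "t2m"),)),
             ("regrid", (("resolution_deg", 0.25),))]
agent = [("load_reanalysis", (("variable", "t2m"),)),
         ("regrid", (("resolution_deg", 1.0),))]  # wrong parameterisation
score = score_task(agent, reference, answer=287.9, reference_answer=288.0)
```

In this toy case the answer passes the tolerance check while only half of the reference workflow was reproduced, which is the kind of plausible answer via a wrong workflow that a single accuracy number would hide.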

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

When LLMs generate product recommendations from live web evidence, injecting a single fake-product page into the top-3 retrieved results fools all 12 tested models, with success rates of up to 27%; replacing all three retrieved pages raises that to 73.8%. The attack works because models treat retrieved text as authoritative, particularly in product categories where the model has weak prior knowledge. Three defenses (skepticism prompting, prior-consensus filtering, cross-document consensus) reduce but do not eliminate the vulnerability, showing that RAG-based recommendations inherit a structural trust problem with the open web.

██████████ 0.9 hallucination-grounding Preprint
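
A rough sketch of why a cross-document consensus defense helps against one polluted page but not against three. This is a simplified stand-in, not the paper's implementation: pages are reduced to sets of product names, and `consensus_filter` and `min_support` are hypothetical.

```python
from collections import Counter
from typing import List, Set


def consensus_filter(pages: List[Set[str]], min_support: int = 2) -> Set[str]:
    """Keep only products mentioned by at least `min_support` retrieved pages."""
    counts = Counter(item for page in pages for item in page)
    return {item for item, n in counts.items() if n >= min_support}


# One polluted page among three: the fake product lacks support and is dropped.
one_polluted = [{"A", "B"}, {"A", "C"}, {"A", "B", "FakeX"}]
kept_one = consensus_filter(one_polluted)

# All three pages replaced: the fake product now is the consensus.
all_polluted = [{"FakeX"}, {"FakeX", "A"}, {"FakeX"}]
kept_all = consensus_filter(all_polluted)
```

The second case shows the structural limit: consensus only measures agreement among retrieved pages, so it cannot help once the attacker controls the whole bundle.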

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

This structured review maps how AI hallucination manifests across five medical imaging modalities — CT, MRI, PET/SPECT, ultrasound, and digital pathology — identifying clinically dangerous failure modes such as fabricated anatomical structures and incorrect laterality. A counter-intuitive finding is that general-purpose foundation models outperform medically fine-tuned models on hallucination benchmarks, suggesting that narrow domain fine-tuning introduces overfitting that increases confabulation rather than reducing it. The paper also maps these failure modes to current FDA regulatory guidance, which is useful for anyone building or approving clinical AI systems. Note: the paper is a narrative review, and some reported p-values lack transparent statistical basis.

██████████ 0.9 hallucination-grounding Preprint

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

Most CAD-generation models are one-shot: they produce a design from a prompt and stop. IterCAD treats CAD creation as a multi-turn loop in which an agent writes code, executes it in a CAD sandbox, inspects the result, and repairs errors — the same way a human engineer would. A geometry-aware reinforcement learning stage with 'viable-prefix masking' (only penalising steps that were geometrically reachable but wrong) significantly improves both code executability and shape accuracy. The approach demonstrates that closed-loop, self-correcting agents can handle structured technical domains, not just open-ended text tasks.

██████████ 0.8 agent-tool-use Preprint
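
The write, execute, inspect, repair loop from the IterCAD summary can be sketched generically. Everything here is an assumption for illustration: `closed_loop`, `toy_generate`, and `toy_execute` are hypothetical, the "CAD program" is a two-line Python snippet, and plain `exec` is not a real sandbox.

```python
from typing import Callable, Optional, Tuple


def closed_loop(generate: Callable[[Optional[str]], str],
                execute: Callable[[str], Tuple[bool, str]],
                max_turns: int = 4) -> Tuple[Optional[str], int]:
    """Generate a program, run it, and feed any error back until it passes."""
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        program = generate(feedback)
        ok, message = execute(program)
        if ok:
            return program, turn
        feedback = message  # the next attempt sees what went wrong
    return None, max_turns


def toy_generate(feedback: Optional[str]) -> str:
    # The first draft forgets to define `width`; the repair adds it.
    if feedback is None:
        return "volume = width * 3 * 4"
    return "width = 2\nvolume = width * 3 * 4"


def toy_execute(program: str) -> Tuple[bool, str]:
    namespace: dict = {}
    try:
        exec(program, namespace)  # illustration only; never run untrusted code this way
    except Exception as error:
        return False, f"{type(error).__name__}: {error}"
    if namespace.get("volume") != 24:
        return False, "volume does not match the target of 24"
    return True, "ok"


program, turns = closed_loop(toy_generate, toy_execute)
```

The toy run fails on the first turn with a `NameError` and succeeds on the second, mirroring the contrast with one-shot generation, where the first draft would simply be returned.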

InterleaveThinker: Reinforcing Agentic Interleaved Generation

InterleaveThinker adds multi-turn planning and self-critique to any frozen image generator without retraining it, via a planner agent that sequences the generation steps and a critic agent that detects deviations from the plan and rewrites instructions. The critic is trained with GRPO reinforcement learning using a dual reward that scores both per-step quality and final accuracy, allowing it to learn from multi-step trajectories using only single-step RL updates. This is practically useful because it decouples reasoning quality from the image generator itself — improvements in planning transfer to any generator.

██████████ 0.8 agent-tool-use Preprint

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

VLA models trained on household robot data consistently fail in scientific labs: transparent liquids, complex instruments, and rigid multi-step protocols break the assumptions baked in by everyday-object training data. LabVLA addresses this by pretraining an action-tokenisation layer on a chemistry-lab corpus, then applying flow-matching post-training with a separate action expert module insulated from the vision-language backbone, which prevents the robot policy from degrading language understanding. The model sets a new best on the LabUtopia benchmark under both in-distribution and held-out conditions, and the authors identify data scarcity as the remaining central bottleneck.

██████████ 0.8 embodied-ai Preprint

From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification

Fact verification pipelines have multiple stages (claim decomposition, evidence retrieval, reasoning, verdict), but most RL training only rewards the final verdict, leaving intermediate steps without a learning signal — a sparse supervision problem. ProFact introduces process-aware rewards that assign credit to each intermediate stage, enabling end-to-end optimisation of the full trajectory rather than just the final answer. The result is both higher verification accuracy and lower inference cost compared to strong baselines, making this approach relevant beyond fact-checking to any multi-stage agent pipeline where delayed feedback is a bottleneck.

██████████ 0.8 reasoning-reliability Preprint
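
To make the sparse-versus-process reward contrast in the ProFact summary concrete, here is a minimal sketch. The stage names, the per-stage scores, and the equal weighting are assumptions chosen for illustration; the paper's reward design is not reproduced here.

```python
from typing import Dict


def outcome_only_reward(verdict_correct: bool) -> float:
    """Sparse signal: only the final verdict is rewarded."""
    return 1.0 if verdict_correct else 0.0


def process_aware_reward(stage_scores: Dict[str, float], verdict_correct: bool,
                         process_weight: float = 0.5) -> float:
    """Blend the mean per-stage score with the final-verdict reward."""
    process = sum(stage_scores.values()) / len(stage_scores)
    outcome = outcome_only_reward(verdict_correct)
    return process_weight * process + (1.0 - process_weight) * outcome


# Hypothetical trajectory: good decomposition, partial retrieval, failed reasoning.
stages = {"decomposition": 1.0, "retrieval": 0.5, "reasoning": 0.0}
sparse = outcome_only_reward(verdict_correct=False)
dense = process_aware_reward(stages, verdict_correct=False)
```

With a wrong verdict, the outcome-only reward is zero and says nothing about which stage failed, while the blended reward still credits the stages that went well.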

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

Vision-language models can localise objects well in a single pass, but when asked to iteratively refine bounding box predictions by looking at their own rendered output, performance collapses catastrophically (79.6% to 48.7% accuracy) without special training. IVT teaches self-correction by generating 2,400 synthetic training examples where the model's own wrong predictions are the starting point, with a teacher model providing corrective reasoning traces, followed by RL fine-tuning rewarding IoU improvement. The key finding — that self-correction is a learnable skill acquirable with only 2,400 samples — suggests it can be grafted onto existing VLMs cheaply, and may generalise to other modalities where iterative refinement matters.

██████████ 0.8 multimodal-understanding Preprint
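
The IoU-improvement reward mentioned in the IVT summary can be written down directly. The IoU formula is standard; the reward shape (IoU after refinement minus IoU before) and the names `iou` and `refinement_reward` are assumptions about how such a reward might look, not the paper's definition.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refinement_reward(target: Box, before: Box, after: Box) -> float:
    """Positive when the refined box overlaps the target more than the old one."""
    return iou(target, after) - iou(target, before)


target = (0.0, 0.0, 10.0, 10.0)
first_guess = (5.0, 5.0, 15.0, 15.0)
refined = (1.0, 1.0, 11.0, 11.0)
reward = refinement_reward(target, first_guess, refined)
```

A reward of this shape is negative when a refinement step makes the box worse, which is the failure the summary reports for untrained iteration.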

MÖVE: A Holistic LLM Benchmark for the German Public Sector

MÖVE evaluates 39 LLMs across both performance tasks (summarisation, QA, topic extraction) and governance dimensions (hallucination, energy use, transparency, political value alignment) in German public administration contexts. The headline finding is that no single model dominates across all criteria, and model size is a poor predictor of quality — a result with direct procurement implications for government agencies that currently use scale as a proxy for reliability. The inclusion of governance metrics alongside accuracy is the methodological contribution, offering a template for how regulated-sector benchmarks should be structured.

██████████ 0.8 hallucination-grounding Preprint

🔬 Roadblock Activity

Roadblock	Papers	Status	Signal
Data Quality & Curation	145	Active	Highest paper volume today; activity is dominated by benchmark construction papers exposing label-quality and annotation-process failures as the primary bottleneck across domains from CAD to medical imaging.
Hallucination & Grounding	135	Active	Strong activity with practical attack-surface work (web-content pollution in recommenders) and domain-specific failure mapping (medical imaging), alongside a public-sector benchmark that treats hallucination as a governance metric.
Reasoning Reliability	87	Active	MaxProof's gold-medal-threshold result on IMO/USAMO is the standout signal, with supporting work on process-aware RL for multi-stage pipelines and Earth-system agent benchmarking highlighting that reliable multi-step reasoning remains the field's central challenge.
Multimodal Understanding	77	Active	Spatial self-correction in VLMs and iterative CAD generation both show that closed-loop visual feedback is tractable but requires targeted training; naive iteration without special supervision reliably degrades performance.
Interpretability	69	Active	Steady background volume with no landmark paper today; activity appears distributed across routine mechanistic-analysis work rather than concentrated on a single finding.
Efficiency & Scaling	68	Active	On-device deployment (TimeLens's 5.97 MB TFLite model at mAP 0.995) and tournament-selection test-time scaling (MaxProof) represent opposite ends of the compute spectrum, both active today.
Agent Tool Use	61	Active	Productive day with empirical benchmarks (TerraBench), closed-loop CAD agents (IterCAD), and multi-agent image generation (InterleaveThinker) all advancing the understanding of where tool-use coordination fails.
Alignment & Safety	60	Active	Activity today is primarily in position and framework papers (neuro-symbolic regulated-process agents, machine creativity requirements) rather than empirical safety work; the web-pollution attack paper is the most concrete safety-relevant empirical result.
Long Context	37	Active	VideoRAG work on hour-long egocentric video highlights that optimal retrieval granularity varies chunk-by-chunk, a finding that has implications for any long-context retrieval pipeline beyond video.
Embodied AI	33	Active	LabVLA's failure analysis of household-trained VLA models in scientific labs is the primary signal today, identifying domain shift and data scarcity — not architecture — as the binding constraint for laboratory robotics.

DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io
