
[Artificial Intelligence] Daily digest — 92 papers, 1 strong connection (2026-05-14)

DeepScience — Artificial Intelligence · Daily Digest
May 14, 2026
92 Papers · 10/10 Roadblocks Active · 2 Connections
⚡ Signal of the Day
• A clinical evaluation finds that LLM citation integrity and diagnostic accuracy vary meaningfully across four models on real otolaryngology cases — a concrete probe of hallucination risk where errors carry patient-safety consequences.
• The broader dataset is dominated by Zenodo software deposits, speculative whitepapers, and low-reproducibility preprints; today is a weak day for verifiable AI research findings.
• Watch the hallucination-grounding roadblock: both mechanistic interpretability work (Project Aletheia) and applied clinical benchmarking are converging on citation fabrication as a measurable failure mode, though methodology quality across these papers remains uneven.
📄 Top 10 Papers
Diagnostic accuracy and citation integrity of four large language models on otolaryngology vignettes
Four LLMs were tested on structured clinical vignettes in otolaryngology, measuring both whether the model gave the right diagnosis and whether the citations it provided actually existed. Diagnostic performance varied across models, and citation hallucination was present — meaning models confidently cited non-existent sources. This matters because medical deployment of LLMs requires both accuracy and traceable evidence; fabricated citations undermine the auditing that makes clinical AI trustworthy.
Score: 0.8 · hallucination-grounding · Peer-reviewed
Project Aletheia: The Seven Laws of LLM Hallucination Physics — From Phase Transitions to the Code Mode Switch
Using GPT-2 as a controlled testbed and logit-lens analysis to inspect layer-by-layer activations, the author identifies specific attention heads (L9H6 and L11H7) that systematically suppress factual tokens in favor of grammatically plausible ones — a phenomenon termed 'Grammatical Suppression of Facts.' The key claim is that this suppression is fact-specific: mathematical tokens experience near-zero suppression while factual tokens are suppressed ~70% of the time. Methodology is informal (single researcher, no peer review, no statistical testing), so these numbers should be treated as exploratory hypotheses rather than established results, but the mechanistic framing is concrete enough to be testable.
Score: 0.8 · hallucination-grounding
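The logit-lens technique the paper relies on can be illustrated with a toy example: project an intermediate hidden state through the unembedding matrix and see which token it currently favors. Everything below (the vocabulary, unembedding rows, and hidden states) is invented for illustration; real logit-lens work uses a transformer's actual activations, not these hand-picked vectors.

```python
# Toy illustration of the logit-lens idea: score vocabulary tokens from an
# intermediate hidden state via the unembedding matrix. All numbers here are
# illustrative assumptions, not values from the paper.

def logits_from_hidden(hidden, unembed):
    """Score each token by dotting the hidden state with its unembedding row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]

def top_token(hidden, unembed, vocab):
    logits = logits_from_hidden(hidden, unembed)
    return vocab[max(range(len(vocab)), key=logits.__getitem__)]

vocab = ["Paris", "London", "the", "a"]   # tiny stand-in vocabulary
unembed = [
    [1.0, 0.2, 0.0],  # "Paris": factual-completion direction
    [0.8, 0.1, 0.0],  # "London"
    [0.0, 0.0, 1.2],  # "the": grammatical-filler direction
    [0.0, 0.0, 1.0],  # "a"
]

# Hypothetical hidden states before and after a "suppressing" attention head.
before = [0.9, 0.1, 0.2]  # factual direction dominates
after = [0.3, 0.0, 0.9]   # grammatical direction dominates

print(top_token(before, unembed, vocab))  # prints "Paris"
print(top_token(after, unembed, vocab))   # prints "the"
```

In this toy, the "head" rotates the hidden state away from the factual direction, so the lens sees the top prediction flip from a factual token to a grammatical filler, which is the shape of the suppression effect the paper claims to measure.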
Seeking and mapping coral reef biological hotspots with an autonomous underwater vehicle
An autonomous underwater vehicle was deployed at two coral reef sites using a combination of passive acoustic fish-call detection, YOLO-based visual fish counting, and 3D rugosity mapping to autonomously locate biological hotspots — and then navigate back to them via onboard homing. Nine homing experiments and one real-time barracuda tracking run demonstrate that multimodal sensing can close the perception-action loop in unstructured open-water environments. All data and analysis code are publicly archived with checksums, making this one of the more reproducible embodied-AI field deployments in the dataset.
Score: 0.7 · embodied-ai · Peer-reviewed
Seeking and mapping coral reef biological hotspots with an autonomous underwater vehicle
This companion dataset record to the Science Robotics study archives 5.3 GB of raw AUV sensor data including .wav hydrophone recordings, orthomosaic imagery, 3D mesh files, and odometry logs alongside all scripts needed to reproduce Figures 1–7. The significance for AI is methodological: it demonstrates how passive acoustics and visual detection can be fused via multimodal regression against environmental covariates (rugosity) to model biological activity — a template for sensor-fusion grounding in real-world autonomous systems.
Score: 0.7 · multimodal-understanding · Peer-reviewed
Agentic Scientific Machine Learning for Autonomous Model Discovery in Systems Pharmacology
A four-agent LLM pipeline — Modeler, Implementer, Judge, Reporter — is proposed to autonomously generate mechanistic hypotheses, implement them in code, evaluate fit, and produce reports for pharmacological modeling tasks like tumor growth and drug resistance. The framework captured adaptive drug resistance through time-varying model components without manual intervention. This is a conference abstract only with no reproducibility details, so the claims cannot be verified, but the task decomposition pattern (generate → implement → evaluate → report) illustrates a concrete agentic architecture for scientific workflows.
Score: 0.7 · agent-tool-use · Peer-reviewed
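The generate → implement → evaluate → report decomposition can be sketched as a simple retry loop. The role prompts, the `call_llm` stub, and the acceptance rule below are placeholders, since the abstract gives no implementation details.

```python
# Hedged sketch of a four-agent generate/implement/evaluate/report loop.
# `call_llm` is a stand-in stub; a real system would wire each role to an LLM.

def call_llm(role: str, prompt: str) -> str:
    """Stub that returns a canned response tagged with the role name."""
    return f"[{role} output for: {prompt[:40]}...]"

def modeler(task):
    return call_llm("Modeler", f"Propose a mechanistic model for {task}")

def implementer(model):
    return call_llm("Implementer", f"Write simulation code for {model}")

def judge(code):
    # Toy acceptance rule standing in for a goodness-of-fit evaluation.
    return "accept" if "Implementer" in code else "revise"

def reporter(model, verdict):
    return f"Report: model={model!r}, verdict={verdict}"

def pipeline(task, max_rounds=3):
    """Iterate generate -> implement -> evaluate until the Judge accepts."""
    for _ in range(max_rounds):
        model = modeler(task)
        code = implementer(model)
        verdict = judge(code)
        if verdict == "accept":
            return reporter(model, verdict)
    return "Report: no accepted model"

print(pipeline("tumor growth under adaptive drug resistance"))
```

The point of the sketch is the control flow, not the stubs: each role has a narrow contract, and the Judge gates whether the loop terminates or the Modeler proposes again.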
Safe Robots Beyond the Brake: Learning Human Behavior Using Strategic Games and Vision-Language Models
A quadruped robot uses a vision-language model to estimate whether a nearby human is paying attention based on visual cues (early phase), then switches to Bayesian inference over the human's trajectory to assess whether they are actually adapting to the robot (later phase). These two signals are fused with phase-dependent weighting and fed into a chance-constrained motion planner. The idea that safety-relevant human state estimation requires different inference mechanisms at different interaction timescales is conceptually well-motivated, though the paper currently exists only as a .docx presentation with no experimental statistics.
Score: 0.7 · embodied-ai
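The phase-dependent weighting idea can be sketched as a time-varying blend of the two estimates. The sigmoid schedule, the switch time, and the probability values below are illustrative assumptions, not parameters from the paper.

```python
# Minimal sketch of phase-dependent fusion: weight a VLM-based attention
# estimate early in the interaction and a Bayesian trajectory-based estimate
# later, using a sigmoid on elapsed time. All values are illustrative.
import math

def fuse_attention(p_vlm: float, p_bayes: float, t: float,
                   t_switch: float = 2.0) -> float:
    """Blend the early (VLM) and late (Bayesian) estimates of human attention."""
    w_late = 1.0 / (1.0 + math.exp(-(t - t_switch)))  # ~0 early, ~1 late
    return (1.0 - w_late) * p_vlm + w_late * p_bayes

# A human who looks attentive on camera but whose trajectory never adapts:
early = fuse_attention(p_vlm=0.9, p_bayes=0.2, t=0.0)  # trusts the visual cue
late = fuse_attention(p_vlm=0.9, p_bayes=0.2, t=6.0)   # trusts the behavior
```

A downstream chance-constrained planner would then treat the fused value as the probability that the human is attending, so the same visual cue gets discounted once behavioral evidence accumulates against it.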
Smart Human Machine Interface Using Piezoelectric Sensors and Artificial Intelligence
PVDF piezoelectric sensors combined with 1D convolutional neural networks achieve 94.6% accuracy at discriminating material hardness from tactile contact signals, with a median inference latency of 0.21 seconds on the Hannes prosthetic hand platform. A 64-channel distributed sensing array further achieves 94.7% spatial accuracy for localizing contact across fingers and palm. These are real hardware results with quantified performance, and the sub-second latency demonstrates that edge-deployable CNNs can close the tactile perception loop for prosthetics without cloud offloading.
Score: 0.7 · embodied-ai · Peer-reviewed
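The core operation inside such a 1D CNN is a convolution slid along the raw sensor trace. Below is a minimal pure-Python sketch with a hand-picked edge-detecting kernel standing in for trained weights; the signal values are invented, not PVDF data.

```python
# Toy valid-mode 1D convolution (cross-correlation, as in most DL libraries)
# over a hypothetical tactile contact transient. Kernel and signal are
# illustrative stand-ins for a trained network and real sensor data.

def conv1d(signal, kernel):
    """Slide the kernel along the signal; output length is len(signal)-len(kernel)+1."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [0.0, 0.1, 0.9, 1.0, 0.8, 0.1, 0.0]  # hypothetical contact transient
edge_kernel = [-1.0, 0.0, 1.0]                 # responds to rising/falling edges
response = conv1d(signal, edge_kernel)         # [0.9, 0.9, -0.1, -0.9, -0.8]
```

A trained 1D CNN stacks many such filters with learned weights and nonlinearities; hard versus soft materials produce differently shaped transients, which is what the filters learn to separate.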
MAAC Study 1 Hypothesis Testing Dataset: Complexity Validation of LLM-Generated Decision Scenarios
A multi-LLM judge panel was used to evaluate the complexity of AI-generated decision scenarios, achieving an intraclass correlation of ICC = .997 — indicating near-perfect agreement between LLM evaluators on a composite four-framework complexity instrument. This contributes to the question of whether LLMs can reliably serve as evaluators for AI-generated content, which matters for scalable quality control in synthetic data pipelines. The underlying paper content is not fully accessible, limiting assessment of sample size and statistical rigor.
Score: 0.6 · data-quality-curation · 🔗 2 cited · Peer-reviewed
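The ICC statistic itself is simple to compute from a targets-by-raters matrix. Below is a pure-Python ICC(2,1) (two-way random effects, single rater) on a toy ratings matrix; the study's exact ICC variant and data are not accessible, so this is a generic illustration of the measure, not a reproduction.

```python
# ICC(2,1): two-way random-effects, single-rater agreement. The toy ratings
# matrix (4 scenarios x 3 hypothetical judges) is illustrative only.

def icc2_1(ratings):
    """ratings: list of rows (targets), each a list of rater scores."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    msr = ss_rows / (n - 1)  # between-targets mean square
    msc = ss_cols / (k - 1)  # between-raters mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Judges agree on the ordering; judge 2 has a consistent +1 offset.
ratings = [[9, 10, 9], [5, 6, 5], [7, 8, 7], [2, 3, 2]]
icc = icc2_1(ratings)  # high agreement despite the rater offset
```

ICC(2,1) penalizes systematic rater offsets (here the value lands around 0.96 rather than 1.0); average-measures variants like ICC(2,k) run higher, which is worth knowing when interpreting a panel-level figure like .997.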
From Prediction to Action: A Structured Scoping Review and Framework Synthesis of Integrative AI Decision Systems
A scoping review synthesizes how AI systems can be designed to move from generating predictions to actually supporting or triggering decisions, with clinical decision support as the primary domain. The central problem it addresses is structural: most AI models optimize for predictive accuracy but are deployed in contexts where the relevant outcome is whether a decision improves — a gap that requires workflow integration, not just model performance. As a framework synthesis rather than an empirical study, it offers a taxonomy rather than new data.
Score: 0.6 · interpretability · Peer-reviewed
Gold-Standard AGI: Outer AGI Superalignment
This theoretical paper decomposes the AGI alignment problem into outer alignment (correctly specifying what humanity wants as a final goal) and inner alignment (ensuring the system pursues that goal), then proposes a conceptual solution to outer alignment for superintelligent systems. The paper is written to be accessible to policymakers as well as technical audiences, which reflects a real gap in alignment discourse. The work is entirely philosophical with no formal proofs or empirical components, so the claimed 'solution' cannot be evaluated technically from the available abstract.
Score: 0.5 · alignment-safety · Peer-reviewed
🔬 Roadblock Activity
• Data Quality & Curation (49 papers, Active): Highest-volume roadblock today, but much of the activity consists of synthetic-data deposits with unspecified generation procedures — a red flag for downstream reliability.
• Multimodal Understanding (39 papers, Active): Real-world multimodal fusion from the coral reef AUV work (acoustics + vision + rugosity) stands out as a credible concrete result amid a cluster of weaker deposits.
• Interpretability (32 papers, Active): Project Aletheia proposes mechanistic attention-head-level explanations for factual suppression in GPT-2, but the single-researcher methodology limits confidence in the specific numerical claims.
• Reasoning Reliability (28 papers, Active): Several papers touch reasoning reliability indirectly through clinical vignette evaluation and agent orchestration, but no paper today directly advances the mechanistic understanding of reasoning failures.
• Hallucination & Grounding (23 papers, Active): Citation integrity testing in a clinical domain (otolaryngology) provides the most externally valid signal today: LLM hallucination of references is measurable and varies across models in high-stakes contexts.
• Embodied AI (22 papers, Active): The coral reef AUV and prosthetic hand tactile sensing offer two concrete hardware deployments with quantified performance, making this roadblock's papers among the more credible in today's set.
• Efficiency & Scaling (18 papers, Active): The RCA vertical cascade architecture proposes a hierarchical token-densification alternative to flat GPU clusters, but the work is an invention disclosure with no benchmarks.
• Agent Tool Use (17 papers, Active): The agentic systems pharmacology framework illustrates a promising generate-implement-evaluate-report loop, but the absence of reproducibility details prevents meaningful assessment of whether the approach actually works.
• Alignment & Safety (16 papers, Active): Activity today is split between applied robot-safety work (VLM-based human-awareness estimation) and abstract AGI alignment theory — little bridging the two.
• Long Context (6 papers, Open): Minimal activity today; the RCA architecture touches long-context via Reactive State compression but makes no empirical claims about context length handling.
View Full Analysis
DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io