Roadblock / Artificial Intelligence / Partial

AI alignment and value alignment

Current methods for aligning large language models with human values, including reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and constitutional AI, remain brittle and do not scale reliably. Models can exhibit reward hacking, sycophancy, and deceptive alignment, in which surface behavior appears aligned while internal objectives diverge. Scalable oversight of superhuman systems, robust value specification, and corrigibility guarantees remain unsolved problems. The gap between behavioral compliance and genuine alignment widens as model capabilities increase.

Recent papers / Artificial Intelligence

Uncertainty analysis in digital twins and integration of aleatory uncertainties for virtual entity models

June 10, 2026 / openalex

G-SENSE: Generalized Sensorless External Force Estimation for Humanoid Robots via Centroidal Dynamics

June 10, 2026 / openalex