Sustainable Green Computing and Carbon-Aware Artificial Intelligence
June 10, 2026openalex
Current vision-language models can describe images and answer questions about them, but struggle with fine-grained spatial reasoning, temporal understanding in video, and genuine cross-modal inference. Unified architectures that natively process text, images, audio, and video remain inferior to specialized models in many benchmarks. Achieving human-level multimodal understanding that seamlessly integrates perception across modalities — including physical intuition and commonsense spatial reasoning — is an open challenge.