Alampara Nawaf, Schilling-Wilhelmi Mara, Ríos-García Martiño, Mandal Indrajeet, Khetarpal Pranav, Grover Hargun Singh, Krishnan N M Anoop, Jablonka Kevin Maik
Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Jena, Germany.
School of Interdisciplinary Research, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India.
Nat Comput Sci. 2025 Oct;5(10):952-961. doi: 10.1038/s43588-025-00836-3. Epub 2025 Aug 11.
Recent advancements in artificial intelligence have sparked interest in scientific assistants that could support researchers across the full spectrum of scientific workflows, from literature review to experimental design and data analysis. A key capability for such systems is the ability to process and reason about scientific information in both visual and textual forms, from interpreting spectroscopic data to understanding laboratory set-ups. Here we introduce MaCBench, a comprehensive benchmark for evaluating how vision language models handle real-world chemistry and materials science tasks across three core aspects: data extraction, experimental execution and results interpretation. Through a systematic evaluation of leading models, we find that although these systems show promising capabilities in basic perception tasks (achieving near-perfect performance in equipment identification and standardized data extraction), they exhibit fundamental limitations in spatial reasoning, cross-modal information synthesis and multi-step logical inference. Our insights have implications beyond chemistry and materials science, suggesting that developing reliable multimodal AI scientific assistants may require advances in curating suitable training data and approaches to training those models.