Mandal Indrajeet, Soni Jitendra, Zaki Mohd, Smedskjaer Morten M, Wondraczek Katrin, Wondraczek Lothar, Gosvami Nitya Nand, Krishnan N M Anoop
School of Interdisciplinary Research, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India.
Department of Materials Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India.
Nat Commun. 2025 Oct 14;16(1):9104. doi: 10.1038/s41467-025-64105-7.
Large language models (LLMs) are transforming laboratory automation by enabling self-driving laboratories (SDLs) that could accelerate materials research. However, current SDL implementations rely on rigid protocols that fail to capture the adaptability and intuition of expert scientists in dynamic experimental settings. Here, we show that LLM agents can automate atomic force microscopy (AFM) through our Artificially Intelligent Lab Assistant (AILA) framework. Further, we develop AFMBench, a comprehensive evaluation suite challenging LLM agents across the complete scientific workflow from experimental design to results analysis. We find that state-of-the-art LLMs struggle with basic tasks and coordination scenarios. Notably, models excelling at materials science question-answering perform poorly in laboratory settings, showing that domain knowledge does not translate to experimental capabilities. Additionally, we observe that LLM agents can deviate from instructions, a phenomenon referred to as sleepwalking, raising safety alignment concerns for SDL applications. Our ablations reveal that multi-agent frameworks significantly outperform single-agent approaches, though both remain sensitive to minor changes in instruction formatting or prompting. Finally, we evaluate AILA's effectiveness in increasingly advanced experiments: AFM calibration, feature detection, mechanical property measurement, graphene layer counting, and indenter detection. These findings establish the necessity for benchmarking and robust safety protocols before deploying LLM agents as autonomous laboratory assistants across scientific disciplines.