Kamal Choudhary
Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States.
Department of Electrical and Computer Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21218, United States.
J Phys Chem Lett. 2025 Jul 10;16(27):7028-7035. doi: 10.1021/acs.jpclett.5c01257. Epub 2025 Jul 1.
Determining complete atomic structures directly from microscopy images remains a long-standing challenge in materials science. MicroscopyGPT is a vision-language model (VLM) that leverages multimodal generative pretrained transformers to predict full atomic configurations, including lattice parameters, element types, and atomic coordinates, from scanning transmission electron microscopy (STEM) images. The model is trained on a chemically and structurally diverse data set of simulated STEM images generated with the AtomVision tool from the JARVIS-DFT and C2DB two-dimensional (2D) materials databases. The training set for fine-tuning comprises approximately 5000 2D materials, enabling the model to learn the complex mapping from image features to crystallographic representations. I fine-tune the 11-billion-parameter LLaMA model, which allows efficient training on resource-constrained hardware. The rise of VLMs and the growth of materials data sets offer a major opportunity for microscopy-based analysis. This work highlights the potential of automated structure reconstruction from microscopy, with broad implications for materials discovery, nanotechnology, and catalysis.
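To make the data-preparation step concrete, the sketch below shows one plausible way to build (STEM image, structure-text) training pairs from the JARVIS-DFT 2D database using the open-source jarvis-tools package. The serialization format in structure_to_text and the slice size are illustrative assumptions, not the exact pipeline described in the paper; the paired STEM micrographs would come from AtomVision simulations, which are omitted here.

```python
# Hypothetical sketch of building image-to-text training pairs for a VLM.
# Assumes jarvis-tools is installed; prompt/target format is an assumption.
from jarvis.db.figshare import data
from jarvis.core.atoms import Atoms


def structure_to_text(atoms):
    """Serialize lattice parameters, element types, and fractional
    coordinates into a plain-text target for supervised fine-tuning."""
    a, b, c = atoms.lattice.abc
    alpha, beta, gamma = atoms.lattice.angles
    lines = [f"lattice: {a:.3f} {b:.3f} {c:.3f} {alpha:.1f} {beta:.1f} {gamma:.1f}"]
    for el, (x, y, z) in zip(atoms.elements, atoms.frac_coords):
        lines.append(f"{el} {x:.4f} {y:.4f} {z:.4f}")
    return "\n".join(lines)


# Download the JARVIS-DFT 2D entries; C2DB entries would be handled analogously.
dft_2d = data("dft_2d")

pairs = []
for entry in dft_2d[:100]:  # small slice for illustration only
    atoms = Atoms.from_dict(entry["atoms"])
    target_text = structure_to_text(atoms)
    # The matching model input would be a simulated STEM image of this
    # structure generated with AtomVision (simulation call not shown here).
    pairs.append({"jid": entry["jid"], "target": target_text})

print(pairs[0]["target"])
```

Pairs of this kind could then be used to fine-tune a vision-capable LLaMA model to emit the crystallographic text directly from the STEM image; the abstract notes that training the 11-billion-parameter model was made feasible on resource-constrained hardware, though the specific fine-tuning recipe is not detailed here.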