Kjær Emil T S, Anker Andy S, Kirsch Andrea, Lajer Joakim, Aalling-Frederiksen Olivia, Billinge Simon J L, Jensen Kirsten M Ø
Department of Chemistry and Nano-Science Center, University of Copenhagen, 2100 Copenhagen Ø, Denmark
Department of Applied Physics and Applied Mathematics Science, Columbia University, New York, NY 10027, USA
Digit Discov. 2024 Mar 27;3(5):908-918. doi: 10.1039/d4dd00001c. eCollection 2024 May 15.
Synchrotron X-ray techniques are essential for studies of the intrinsic relationship between synthesis, structure, and properties of materials. Modern synchrotrons can produce up to 1 petabyte of data per day. Such amounts of data can speed up materials development, but they also come with a staggering growth in workload, as the data generated must be stored and analyzed. We present an approach for quickly identifying an atomic structure model from pair distribution function (PDF) data from (nano)crystalline materials. Our model, MLstructureMining, uses a tree-based machine learning (ML) classifier. MLstructureMining has been trained to classify chemical structures from a PDF and gives a top-3 accuracy of 99% on simulated PDFs not seen during training, with a total of 6062 possible classes. We also demonstrate that MLstructureMining can identify the chemical structure from experimental PDFs from nanoparticles of CoFe₂O₄ and CeO₂, and we show how it can be used to treat an in situ PDF series collected during Bi₂Fe₄O₉ formation. Additionally, we show how MLstructureMining can be used in combination with the well-known methods principal component analysis (PCA) and non-negative matrix factorization (NMF) to analyze data from in situ experiments. MLstructureMining thus allows for real-time structure characterization by screening vast quantities of crystallographic information files in seconds.
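For readers unfamiliar with the metric, a top-3 accuracy counts a prediction as correct when the true class is among the three classes the classifier scores highest. The following is a minimal sketch of that calculation only, not the MLstructureMining code: the function name `top_k_accuracy`, the toy score table, and the four-class setup are all hypothetical and chosen for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 3,
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes.

    `scores[i][c]` is the classifier's score for class c on sample i
    (hypothetical input layout, assumed for this sketch).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score; keep the first k.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data: 4 samples, 4 candidate structure classes (illustrative values only).
toy_scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1 is ranked first
    [0.40, 0.30, 0.20, 0.10],  # true class 2 is ranked third
    [0.05, 0.10, 0.15, 0.70],  # true class 0 is ranked last
    [0.30, 0.10, 0.50, 0.10],  # true class 2 is ranked first
]
toy_labels = [1, 2, 0, 2]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
print(f"top-1 accuracy: {top1:.2f}, top-3 accuracy: {top3:.2f}")
```

With thousands of candidate classes, a top-3 figure is a natural choice when the shortlisted structures are subsequently checked by a conventional refinement, since near-identical structures can be difficult to separate from a PDF alone.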