Huang Yiru, Zhang Lei
Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China.
J Chem Theory Comput. 2024 Aug 13;20(15):6790-6800. doi: 10.1021/acs.jctc.4c00465. Epub 2024 Jul 22.
Directly applying large language models to material and molecular design is not straightforward, particularly in real-world scenarios where accuracy must be confirmed by experimental validation. In this study, we propose a multimode descriptor design method for materials prediction and analysis that combines a natural language processing model trained on the scientific literature with density functional theory (DFT) calculations, assisted by a genetic algorithm (GA). As a case study, we predict the aqueous photocurrents of the multisolvent-engineered halide perovskite CH₃NH₃PbI₃ and carry out follow-up validation experiments. The multimode descriptors reach an unprecedented experimental validation accuracy of 87.5% via the GA, compared with only 50% for other common machine learning models. The improved experimental accuracy is attributed to the successful deployment of a language model that incorporates concise scientific information from >1 million articles into the molecular descriptors, in combination with DFT calculations. Subsequent machine learning analysis highlights the importance of cation···π interactions and crystallization in molecule-modified halide perovskite materials, reflecting an ontological and conceptual understanding. Importantly, the genetic process affords an accurate "white-box" model of perovskite stability (accuracy = 90.2% for the test data set and 92.3% for the training data set), expressed as an explicit mathematical equation whose variables correspond to atomic-level structural and chemical details such as cation···π interactions and highest occupied molecular orbital levels. This study offers a feasible descriptor design route for accurately predicting complex material properties by leveraging both language models and density functional theory.
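The idea of letting a genetic algorithm choose among language-model and DFT descriptors can be illustrated with a minimal sketch. This is not the authors' implementation: the descriptor names, the synthetic data, the nearest-centroid classifier, and all GA settings below are assumptions made purely for illustration. The sketch evolves a bit mask over a mixed descriptor pool and keeps the mask that scores best on a validation split.

```python
import random

# Hypothetical descriptor names (assumptions): three stand in for
# language-model (text-derived) features, three for DFT-derived features.
FEATURES = ["text_0", "text_1", "text_2", "dft_homo", "dft_dipole", "dft_binding"]


def make_synthetic_data(n, rng):
    """Synthetic samples only; the label depends on text_1 and dft_homo."""
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in FEATURES]
        X.append(row)
        y.append(1 if row[1] + row[3] > 0.0 else 0)
    return X, y


def centroid_accuracy(mask, X_train, y_train, X_val, y_val):
    """Accuracy of a nearest-centroid classifier using only masked features."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0  # an empty descriptor set cannot classify anything
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        if not rows:
            return 0.0
        centroids[label] = [sum(r[i] for r in rows) / len(rows) for i in idx]
    correct = 0
    for x, t in zip(X_val, y_val):
        dist = {
            label: sum((x[i] - c) ** 2 for i, c in zip(idx, cen))
            for label, cen in centroids.items()
        }
        correct += min(dist, key=dist.get) == t
    return correct / len(y_val)


def run_ga(fitness, n_bits, rng, pop_size=20, generations=15, p_mut=0.1):
    """Tiny GA: tournament selection, one-point crossover, bit flips, elitism."""
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        next_pop = [elite[:]]  # elitism: the best mask always survives
        while len(next_pop) < pop_size:
            a = max(rng.sample(pop, 3), key=fitness)  # tournament of size 3
            b = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randint(1, n_bits - 1)
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


rng = random.Random(0)
X, y = make_synthetic_data(200, rng)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]


def fitness(mask):
    return centroid_accuracy(mask, X_train, y_train, X_val, y_val)


best_mask, history = run_ga(fitness, len(FEATURES), rng)
selected = [name for name, bit in zip(FEATURES, best_mask) if bit]
print("selected descriptors:", selected)
print("validation accuracy:", history[-1])
```

Because the best mask is copied unchanged into each new generation, the recorded best fitness never decreases. A symbolic-regression GA that evolves an explicit equation, as the "white-box" model in the abstract implies, would replace the bit mask with expression trees; that variant is not shown here.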