Korolev Vadim, Protsenko Pavel
Department of Chemistry, Lomonosov Moscow State University, 119991 Moscow, Russia.
Patterns (N Y). 2023 Aug 2;4(10):100803. doi: 10.1016/j.patter.2023.100803. eCollection 2023 Oct 13.
Property prediction accuracy has long been a key parameter of machine learning in materials informatics. Accordingly, models showing state-of-the-art performance have turned into highly parameterized black boxes that lack interpretability. Here, we present an elegant way to make their reasoning transparent. We propose human-readable, text-based descriptions, automatically generated with a suite of open-source tools, as a materials representation. Transformer language models pretrained on 2 million peer-reviewed articles take as input well-known terms such as chemical composition, crystal symmetry, and site geometry. Our approach outperforms crystal graph networks in classifying four of the five analyzed properties when all available reference data are considered. Moreover, the fine-tuned text-based models remain accurate in the ultra-small-data limit. Explanations of their internal machinery, produced with local interpretability techniques, are faithful and consistent with domain-expert rationales. This language-centric framework makes accurate property prediction accessible to people without artificial-intelligence expertise.
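The text-based representation described in the abstract can be illustrated with a minimal sketch. The `describe_material` helper below is hypothetical (it is not one of the paper's open-source tools, which generate such descriptions automatically from crystallographic data); it only shows how composition, crystal symmetry, and site geometry can be rendered as plain English suitable as input to a language model.

```python
# Hypothetical sketch: assemble a human-readable description of a crystal
# from well-known terms (composition, space group, site geometry).
def describe_material(composition, spacegroup, sites):
    """Build a plain-English description of a material as a string.

    sites: list of (element, coordination_number, geometry) tuples.
    """
    site_text = "; ".join(
        f"{element} is bonded to {n} neighbors in {geometry} geometry"
        for element, n, geometry in sites
    )
    return (
        f"{composition} crystallizes in the {spacegroup} space group. "
        f"{site_text}."
    )

# Example: rock-salt NaCl, both sites octahedrally coordinated.
description = describe_material(
    "NaCl",
    "Fm-3m",
    [("Na", 6, "octahedral"), ("Cl", 6, "octahedral")],
)
print(description)
```

A string like this, rather than a crystal graph, would then be tokenized and fed to a pretrained transformer fine-tuned for property classification.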