Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
Department of EECS and CSAIL, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
J Chem Inf Model. 2020 Mar 23;60(3):1194-1201. doi: 10.1021/acs.jcim.9b00995. Epub 2020 Jan 28.
Leveraging new data sources is a key step in accelerating the pace of materials design and discovery. To complement the strides in synthesis planning driven by historical, experimental, and computed data, we present an automated, unsupervised method for connecting scientific literature to inorganic synthesis insights. Starting from the natural language text, we apply word embeddings from language models, which are fed into a named entity recognition model, upon which a conditional variational autoencoder is trained to generate syntheses for any inorganic materials of interest. We show the potential of this technique by predicting precursors for two perovskite materials, using only training data published over a decade prior to their first reported syntheses. We demonstrate that the model learns representations of materials corresponding to synthesis-related properties and that the model's behavior complements the existing thermodynamic knowledge. Finally, we apply the model to perform synthesizability screening for proposed novel perovskite compounds.
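The generative step named in the abstract, a conditional variational autoencoder (CVAE), can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the architecture, so the single-layer encoder and decoder, the one-hot precursor vocabulary, the standard-normal prior, and every name below (`TinyCVAE`, `target_embedding`, `precursor_one_hot`) are assumptions made for illustration. The weights are random and untrained, so the output probabilities carry no chemical meaning. The sketch only shows how a condition vector (standing in for a target-material representation) enters both the encoder and the decoder, and how generation samples from the prior.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Dense layer: one row of weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


class TinyCVAE:
    """Hypothetical single-layer CVAE; weights are random, not trained."""

    def __init__(self, n_precursors, cond_dim, latent_dim, seed=0):
        self.rng = random.Random(seed)
        self.latent_dim = latent_dim
        # The encoder sees the data x concatenated with the condition c.
        self.enc_mu = init_layer(self.rng, n_precursors + cond_dim, latent_dim)
        self.enc_log_var = init_layer(self.rng, n_precursors + cond_dim, latent_dim)
        # The decoder sees the latent z concatenated with the same condition c.
        self.dec = init_layer(self.rng, latent_dim + cond_dim, n_precursors)

    def encode(self, x, c):
        h = x + c  # list concatenation
        return linear(*self.enc_mu, h), linear(*self.enc_log_var, h)

    def sample_latent(self, mu, log_var):
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, log_var)]

    def decode(self, z, c):
        return softmax(linear(*self.dec, z + c))

    def negative_elbo(self, x, c):
        """Single-sample estimate: cross-entropy reconstruction term plus KL term."""
        mu, log_var = self.encode(x, c)
        probs = self.decode(self.sample_latent(mu, log_var), c)
        reconstruction = -sum(xi * math.log(p) for xi, p in zip(x, probs))
        return reconstruction + kl_to_standard_normal(mu, log_var)

    def generate(self, c):
        """Sample z from the prior and decode it under condition c."""
        z = [self.rng.gauss(0.0, 1.0) for _ in range(self.latent_dim)]
        return self.decode(z, c)


model = TinyCVAE(n_precursors=4, cond_dim=3, latent_dim=2)
target_embedding = [0.2, -0.1, 0.5]        # made-up condition vector
precursor_one_hot = [0.0, 1.0, 0.0, 0.0]   # made-up training example
loss = model.negative_elbo(precursor_one_hot, target_embedding)
probs = model.generate(target_embedding)
```

In a trained model, `negative_elbo` would be minimized over a corpus of (precursor, target) pairs extracted from the literature, and `generate` would then be called with the representation of a material of interest to propose precursors.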