Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
Google LLC, Mountain View, CA, USA.
Nature. 2019 Jul;571(7763):95-98. doi: 10.1038/s41586-019-1335-8. Epub 2019 Jul 3.
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse with either traditional statistical methods or modern machine learning. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure-property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
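The core idea — that words used in similar contexts acquire similar vector representations, with no labels or chemical knowledge supplied — can be illustrated with a minimal sketch. The paper itself trains Word2Vec-style embeddings on millions of abstracts; the toy corpus, window size, and co-occurrence-count vectors below are simplifying assumptions made purely for illustration, not the authors' actual pipeline.

```python
from collections import Counter, defaultdict
import math

# Toy corpus standing in for materials-science abstracts (illustrative only).
corpus = [
    "LiFePO4 is a promising cathode material for lithium batteries",
    "LiCoO2 is a widely used cathode material in lithium batteries",
    "GaN is a wide bandgap semiconductor used in LEDs",
    "ZnO is a wide bandgap semiconductor with optical applications",
]

tokens = [sentence.lower().split() for sentence in corpus]
vocab = sorted({word for sent in tokens for word in sent})

# Symmetric co-occurrence counts within a small context window.
window = 5
cooc = defaultdict(Counter)
for sent in tokens:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[word][sent[j]] += 1

def vector(word):
    # Embed each word as its row of raw co-occurrence counts over the vocab.
    return [cooc[word][context] for context in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Two cathode materials share contexts ("cathode", "lithium", ...), so their
# vectors are closer to each other than to a semiconductor's vector.
sim_cathodes = cosine(vector("lifepo4"), vector("licoo2"))
sim_cross = cosine(vector("lifepo4"), vector("gan"))
print(sim_cathodes > sim_cross)  # → True
```

In the actual study, dense embeddings learned at scale additionally support vector arithmetic (analogies recovering periodic-table relationships) and ranking candidate materials by similarity to application terms such as "thermoelectric".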