J Chem Inf Model. 2019 Sep 23;59(9):3692-3702. doi: 10.1021/acs.jcim.9b00470. Epub 2019 Aug 19.
The number of published materials science articles has increased manyfold over the past few decades. A major bottleneck in the materials discovery pipeline now arises in connecting new results with the previously established literature. A potential solution to this problem is to map the unstructured raw text of published articles onto structured database entries that allow for programmatic querying. To this end, we apply text mining with named entity recognition (NER) for large-scale information extraction from the published materials science literature. The NER model is trained to extract summary-level information from materials science documents, including inorganic material mentions, sample descriptors, phase labels, material properties and applications, as well as any synthesis and characterization methods used. Our classifier achieves an accuracy (F1 score) of 87% and is applied to information extraction from 3.27 million materials science abstracts. We extract more than 80 million materials-science-related named entities, and the content of each abstract is represented as a database entry in a structured format. We demonstrate that simple database queries can be used to answer complex "meta-questions" about the published literature that would previously have required laborious, manual literature searches to answer. All of our data and functionality have been made freely available on our GitHub (https://github.com/materialsintelligence/matscholar) and website (http://matscholar.com), and we expect these results to accelerate the pace of future materials science discovery.
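To make the idea of answering a "meta-question" with a simple query concrete, the following is a minimal sketch in Python. It assumes that each abstract has already been reduced to a structured entry keyed by entity type. The field names (`materials`, `properties`, `applications`), the identifiers, the toy entries, and the helper `top_materials` are all hypothetical and chosen for illustration; they are not the Matscholar schema or API.

```python
from collections import Counter

# Toy, made-up entries: one dict per abstract, keyed by entity type.
# Field names and contents are assumptions for illustration only.
ENTRIES = [
    {
        "id": "abstract-1",
        "materials": ["Bi2Te3"],
        "properties": ["thermal conductivity"],
        "applications": ["thermoelectric"],
    },
    {
        "id": "abstract-2",
        "materials": ["Bi2Te3", "PbTe"],
        "properties": ["Seebeck coefficient"],
        "applications": ["thermoelectric"],
    },
    {
        "id": "abstract-3",
        "materials": ["LiFePO4"],
        "properties": ["capacity"],
        "applications": ["cathode"],
    },
]


def top_materials(entries, field, value, n=3):
    """Rank materials by the number of abstracts in which they co-occur
    with `value` in the given entity `field`."""
    counts = Counter()
    for entry in entries:
        if value in entry.get(field, []):
            # Count each material at most once per abstract.
            counts.update(set(entry.get("materials", [])))
    # Sort by descending count, breaking ties alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# "Which materials are most often mentioned alongside thermoelectric applications?"
print(top_materials(ENTRIES, "applications", "thermoelectric"))
```

The same pattern extends to other entity types, for example ranking materials by co-occurrence with a given property or characterization method, provided those fields exist in the stored entries.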