School of Artificial Intelligence, Beijing Normal University, Beijing, 100875, China.
Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, 100083, China.
Sci Data. 2022 Jul 13;9(1):401. doi: 10.1038/s41597-022-01492-2.
Information extraction (IE) in natural language processing (NLP) aims to extract structured information from unstructured text so that computers can better process natural language. Machine learning-based IE methods are more powerful and flexible, but they require a large and accurately labeled corpus. In materials science, producing reliable labels is laborious and requires the effort of many domain experts. To reduce manual intervention, we propose a semi-supervised IE framework for materials that trains on an automatically generated corpus. Taking the superalloy data extraction from our previous work as an example, the framework uses Snorkel to automatically label the corpus containing property values. An Ordered Neurons-Long Short-Term Memory (ON-LSTM) network is then trained on the generated corpus to obtain an information extraction model. The experimental results show F1-scores of 83.90%, 94.02%, and 89.27% for the γ' solvus temperature, density, and solidus temperature of superalloys, respectively. Similar experiments on other materials indicate that the proposed framework generalizes across the materials domain.
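The automatic labeling step described in the abstract can be illustrated with a minimal sketch of weak supervision. It does not use Snorkel itself and is not the authors' implementation: the three labeling functions, their regular expressions, and the `majority_label` helper are hypothetical, and a simple majority vote stands in for Snorkel's learned label model.

```python
import re
from collections import Counter

# Label values follow the usual weak-supervision convention.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_has_property_name(sentence):
    """Vote POSITIVE when a target property name appears (hypothetical rule)."""
    if re.search(r"solvus|solidus|density", sentence, re.IGNORECASE):
        return POSITIVE
    return ABSTAIN


def lf_has_value_with_unit(sentence):
    """Vote POSITIVE when a number is followed by a plausible unit (hypothetical rule)."""
    if re.search(r"\d+(\.\d+)?\s*(°C|K\b|g/cm3)", sentence):
        return POSITIVE
    return ABSTAIN


def lf_no_number(sentence):
    """Vote NEGATIVE when the sentence contains no digit at all (hypothetical rule)."""
    if not re.search(r"\d", sentence):
        return NEGATIVE
    return ABSTAIN


LABELING_FUNCTIONS = [lf_has_property_name, lf_has_value_with_unit, lf_no_number]


def majority_label(sentence):
    """Combine labeling-function votes by majority; abstain on ties or no votes.

    Assumption: majority vote replaces the generative label model that
    Snorkel would fit over the labeling-function outputs.
    """
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


if __name__ == "__main__":
    examples = [
        "The γ' solvus temperature of the alloy is 1150 °C.",
        "Superalloys are widely used in turbine blades.",
        "The density is discussed in detail elsewhere.",
    ]
    for text in examples:
        print(majority_label(text), text)
```

Sentences that receive a non-abstaining label would form the automatically generated training corpus for a downstream sequence model such as ON-LSTM; sentences on which the rules conflict are left unlabeled rather than guessed.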