Dagdelen John, Dunn Alexander, Lee Sanghoon, Walker Nicholas, Rosen Andrew S, Ceder Gerbrand, Persson Kristin A, Jain Anubhav
Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
Materials Science and Engineering Department, University of California, Berkeley, CA, USA.
Nat Commun. 2024 Feb 15;15(1):1418. doi: 10.1038/s41467-024-45563-x.
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
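As a rough illustration of the "list of JSON objects" output format mentioned above, the following minimal sketch shows how a completion returned by a fine-tuned model might be parsed and filtered into records for the dopant/host task. The schema (the `host` and `dopants` keys), the function name `parse_records`, and the example completion string are all assumptions made for illustration; they are not taken from the paper.

```python
import json

# Hypothetical schema for the dopant/host task; these field names are assumptions.
REQUIRED_KEYS = {"host", "dopants"}


def parse_records(completion: str) -> list[dict]:
    """Parse a completion that is expected to be a JSON list of objects.

    Returns only the objects that contain every required key.
    Returns an empty list if the text is not valid JSON or is not a list.
    """
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [
        record
        for record in data
        if isinstance(record, dict) and REQUIRED_KEYS.issubset(record)
    ]


# Hand-written example text standing in for a model completion (not real model output).
example_completion = '[{"host": "ZnO", "dopants": ["Al", "Ga"]}, {"host": "TiO2"}]'
records = parse_records(example_completion)
print(records)
```

Dropping malformed records rather than raising an error is one possible design choice when processing many paragraphs in bulk; a stricter pipeline might instead log and review every rejected completion.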