Structured information extraction from scientific text with large language models.

Authors

Dagdelen John, Dunn Alexander, Lee Sanghoon, Walker Nicholas, Rosen Andrew S, Ceder Gerbrand, Persson Kristin A, Jain Anubhav

Affiliations

Lawrence Berkeley National Laboratory, Berkeley, CA, USA.

Materials Science and Engineering Department, University of California, Berkeley, CA, USA.

Publication information

Nat Commun. 2024 Feb 15;15(1):1418. doi: 10.1038/s41467-024-45563-x.

DOI:10.1038/s41467-024-45563-x
PMID:38360817
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10869356/

Abstract

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
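
The abstract mentions returning extracted records as a list of JSON objects. As a minimal sketch of what that could look like for the dopant/host task, the snippet below pairs a passage with its annotated records as a prompt/completion example and parses a completion back into records. The field names (`host`, `dopants`), the separator strings, and both helper functions are illustrative assumptions, not the authors' schema or code.

```python
import json

# Hypothetical separators marking the end of the prompt and of the completion.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = "\n\nEND\n\n"


def make_training_example(passage: str, records: list[dict]) -> dict:
    """Pair a passage with its annotated records as one fine-tuning example."""
    return {
        "prompt": passage.strip() + PROMPT_END,
        "completion": " " + json.dumps(records, ensure_ascii=False) + COMPLETION_END,
    }


def parse_completion(completion: str) -> list[dict]:
    """Turn a model completion back into records; return [] if it is not a JSON list."""
    # Keep only the text before the (assumed) stop marker.
    text = completion.split(COMPLETION_END.strip())[0].strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Discard anything in the list that is not a JSON object.
    return [item for item in data if isinstance(item, dict)]


# Illustrative sentence and annotation (made up for this sketch).
passage = "ZnO nanorods doped with Al were grown on sapphire."
records = [{"host": "ZnO", "dopants": ["Al"]}]

example = make_training_example(passage, records)
recovered = parse_completion(example["completion"])
malformed = parse_completion("The host is ZnO and the dopant is Al.")
```

Falling back to an empty list on malformed output is one possible design choice; a pipeline could instead log or retry such completions.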

Figures

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/676b/10869356/831d8f44585c/41467_2024_45563_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/676b/10869356/ff33fed75eea/41467_2024_45563_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/676b/10869356/f4304072a91e/41467_2024_45563_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/676b/10869356/522ee80bda01/41467_2024_45563_Fig4_HTML.jpg
Fig. 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/676b/10869356/d37d4bf65c0c/41467_2024_45563_Fig5_HTML.jpg
Fig. 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/676b/10869356/aed5a35190ab/41467_2024_45563_Fig6_HTML.jpg
