Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature.

Publication information

J Chem Inf Model. 2019 Sep 23;59(9):3692-3702. doi: 10.1021/acs.jcim.9b00470. Epub 2019 Aug 19.

DOI:10.1021/acs.jcim.9b00470
PMID:31361962
Abstract

The number of published materials science articles has increased manyfold over the past few decades. A major bottleneck in the materials discovery pipeline now arises in connecting new results with the previously established literature. A potential solution to this problem is to map the unstructured raw text of published articles onto structured database entries that allow for programmatic querying. To this end, we apply text mining with named entity recognition (NER) for large-scale information extraction from the published materials science literature. The NER model is trained to extract summary-level information from materials science documents, including inorganic material mentions, sample descriptors, phase labels, material properties and applications, as well as any synthesis and characterization methods used. Our classifier achieves an accuracy (F1) of 87% and is applied to information extraction from 3.27 million materials science abstracts. We extract more than 80 million materials-science-related named entities, and the content of each abstract is represented as a database entry in a structured format. We demonstrate that simple database queries can be used to answer complex "meta-questions" of the published literature that would previously have required laborious, manual literature searches to answer. All of our data and functionality have been made freely available on our GitHub (https://github.com/materialsintelligence/matscholar) and website (http://matscholar.com), and we expect these results to accelerate the pace of future materials science discovery.
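The idea of turning each abstract into a structured entry and then answering a "meta-question" with a simple query can be illustrated with a minimal sketch. This is not the authors' model or the matscholar API: the trained NER classifier is replaced here by a hypothetical dictionary lookup (`LEXICON`), and the entity type names, function names, and sample abstracts are all invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon standing in for a trained NER model (assumption,
# not taken from the paper). Maps a surface term to an entity type.
LEXICON = {
    "Bi2Te3": "material",
    "PbTe": "material",
    "TiO2": "material",
    "thermoelectric": "application",
    "photocatalysis": "application",
    "band gap": "property",
    "sol-gel": "synthesis",
    "X-ray diffraction": "characterization",
}


def extract_entities(text):
    """Return {entity_type: sorted mentions} via case-insensitive substring lookup."""
    found = defaultdict(set)
    lowered = text.lower()
    for term, label in LEXICON.items():
        if term.lower() in lowered:
            found[label].add(term)
    return {label: sorted(terms) for label, terms in found.items()}


def build_database(abstracts):
    """Represent each abstract as one structured entry."""
    return [
        {"doc_id": doc_id, "entities": extract_entities(text)}
        for doc_id, text in abstracts.items()
    ]


def materials_for_application(database, application):
    """Meta-question: which materials are mentioned alongside an application?"""
    counts = Counter()
    for entry in database:
        entities = entry["entities"]
        if application in entities.get("application", []):
            counts.update(entities.get("material", []))
    return counts.most_common()


# Invented sample abstracts, for illustration only.
ABSTRACTS = {
    "doc1": "Bi2Te3 films were grown and their thermoelectric performance measured.",
    "doc2": "PbTe and Bi2Te3 are compared as thermoelectric materials.",
    "doc3": "TiO2 prepared by sol-gel shows a band gap suited to photocatalysis; "
            "X-ray diffraction confirmed the phase.",
}

database = build_database(ABSTRACTS)
ranking = materials_for_application(database, "thermoelectric")
print(ranking)
```

Once every abstract is reduced to typed entities, a question such as "which materials are studied for thermoelectric applications?" becomes a filter and a count over entries rather than a manual literature search. A real system would also need the normalization step named in the title, so that different spellings of the same material map to one entry.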

Similar articles

1
Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature.
J Chem Inf Model. 2019 Sep 23;59(9):3692-3702. doi: 10.1021/acs.jcim.9b00470. Epub 2019 Aug 19.
2
FamPlex: a resource for entity recognition and relationship resolution of human protein families and complexes in biomedical text mining.
BMC Bioinformatics. 2018 Jun 28;19(1):248. doi: 10.1186/s12859-018-2211-5.
3
Application of machine reading comprehension techniques for named entity recognition in materials science.
J Cheminform. 2024 Jul 2;16(1):76. doi: 10.1186/s13321-024-00874-5.
4
Text mining in livestock animal science: introducing the potential of text mining to animal sciences.
J Anim Sci. 2012 Oct;90(10):3666-76. doi: 10.2527/jas.2011-4841. Epub 2012 Jun 4.
5
Biomedical named entity recognition and linking datasets: survey and our recent development.
Brief Bioinform. 2020 Dec 1;21(6):2219-2238. doi: 10.1093/bib/bbaa054.
6
Full-text chemical identification with improved generalizability and tagging consistency.
Database (Oxford). 2022 Sep 28;2022. doi: 10.1093/database/baac074.
7
DEBBIE: The Open Access Database of Experimental Scaffolds and Biomaterials Built Using an Automated Text Mining Pipeline.
Adv Healthc Mater. 2023 Oct;12(25):e2300150. doi: 10.1002/adhm.202300150. Epub 2023 Aug 10.
8
Biomedical named entity recognition using deep neural networks with contextual information.
BMC Bioinformatics. 2019 Dec 27;20(1):735. doi: 10.1186/s12859-019-3321-4.
9
SPRENO: a BioC module for identifying organism terms in figure captions.
Database (Oxford). 2018 Jan 1;2018. doi: 10.1093/database/bay048.
10
Biomedical Literature Mining and Its Components.
Methods Mol Biol. 2022;2496:1-16. doi: 10.1007/978-1-0716-2305-3_1.

Articles citing this paper

1
Steering towards safe self-driving laboratories.
Nat Rev Chem. 2025 Aug 18. doi: 10.1038/s41570-025-00747-x.
2
Annotated textual dataset PV600 of perovskite bandgaps for information extraction from literature.
Sci Data. 2025 Aug 11;12(1):1401. doi: 10.1038/s41597-025-05637-x.
3
NMRExtractor: leveraging large language models to construct an experimental NMR database from open-source scientific publications.
Chem Sci. 2025 May 28. doi: 10.1039/d4sc08802f.
4
A Database of Stress-Strain Properties Auto-generated from the Scientific Literature using ChemDataExtractor.
Sci Data. 2024 Nov 23;11(1):1273. doi: 10.1038/s41597-024-03979-6.
5
Unsupervised SapBERT-based bi-encoders for medical concept annotation of clinical narratives with SNOMED CT.
Digit Health. 2024 Oct 21;10:20552076241288681. doi: 10.1177/20552076241288681. eCollection 2024 Jan-Dec.
6
An ontology-based text mining dataset for extraction of process-structure-property entities.
Sci Data. 2024 Oct 10;11(1):1112. doi: 10.1038/s41597-024-03926-5.
7
AI-Based Knowledge Extraction from the Bioprinting Literature for Identifying Technology Trends.
3D Print Addit Manuf. 2024 Aug 20;11(4):1495-1509. doi: 10.1089/3dp.2022.0316. eCollection 2024 Aug.
8
Enhancing chemical synthesis research with NLP: Word embeddings for chemical reagent identification-A case study on nano-FeCu.
iScience. 2024 Aug 29;27(10):110780. doi: 10.1016/j.isci.2024.110780. eCollection 2024 Oct 18.
9
Application of machine reading comprehension techniques for named entity recognition in materials science.
J Cheminform. 2024 Jul 2;16(1):76. doi: 10.1186/s13321-024-00874-5.
10
Creation of a structured solar cell material dataset and performance prediction using large language models.
Patterns (N Y). 2024 Mar 22;5(5):100955. doi: 10.1016/j.patter.2024.100955. eCollection 2024 May 10.