A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing.

Author information

Shetty Pranav, Rajan Arunkumar Chitteth, Kuenneth Chris, Gupta Sonakshi, Panchumarti Lakshmi Prerana, Holm Lauren, Zhang Chao, Ramprasad Rampi

Affiliations

School of Computational Science & Engineering, Atlanta, GA USA.

School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA.

Publication information

NPJ Comput Mater. 2023;9(1):52. doi: 10.1038/s41524-023-01003-w. Epub 2023 Apr 5.

DOI:10.1038/s41524-023-01003-w

PMID:37033291

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10073792/

Abstract

The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from the literature. We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, on 2.4 million materials science abstracts; it outperforms other baseline models on three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data were analyzed for a diverse range of applications, such as fuel cells, supercapacitors, and polymer solar cells, to recover non-trivial insights. The data extracted through our pipeline are available at polymerscholar.org, where they can be used to locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with extracted material property information.
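
To make the idea of a "material property record" concrete, the following is a minimal sketch of turning one abstract into structured records. It is not the authors' method: the abstract describes a pipeline built around the MaterialsBERT language model and named entity recognition, whereas this toy stands in for the model with small keyword lists and a regular expression. The `PropertyRecord` fields, the `MATERIALS` and `PROPERTIES` vocabularies, the unit list, and the linking rule (one material, one property, and one value per sentence) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


# Hypothetical record layout; the field names are assumptions,
# not the schema used by the paper or by polymerscholar.org.
@dataclass
class PropertyRecord:
    material: str
    property_name: str
    value: float
    unit: str


# Toy vocabularies standing in for the entity predictions of a trained NER model.
MATERIALS = ["polystyrene", "polyethylene", "poly(methyl methacrylate)"]
PROPERTIES = ["glass transition temperature", "tensile strength", "band gap"]

# A number followed by one of a few illustrative units, e.g. "100 °C" or "45.5 MPa".
VALUE_RE = re.compile(r"(-?\d+(?:\.\d+)?)\s*(°C|K|MPa|GPa|eV)")


def extract_records(abstract: str) -> List[PropertyRecord]:
    """Return one record per sentence that names exactly one material,
    one property, and one numeric value with a unit."""
    records = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        lowered = sentence.lower()
        materials = [m for m in MATERIALS if m in lowered]
        properties = [p for p in PROPERTIES if p in lowered]
        values = VALUE_RE.findall(sentence)
        # Link entities only when the sentence is unambiguous; anything else is skipped.
        if len(materials) == 1 and len(properties) == 1 and len(values) == 1:
            number, unit = values[0]
            records.append(
                PropertyRecord(materials[0], properties[0], float(number), unit)
            )
    return records


sample = (
    "The glass transition temperature of polystyrene was measured to be 100 °C. "
    "Films of polyethylene showed a tensile strength of 45.5 MPa and a band gap of 3.2 eV."
)
for record in extract_records(sample):
    print(record)
```

In this sketch the first sentence yields a record, while the second is skipped because it mentions two properties and two values and the toy rule has no way to pair them. Resolving that kind of ambiguity is where a learned model and more careful relation heuristics would be needed.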

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/649b/10073792/5e0c94a34511/41524_2023_1003_Fig1_HTML.jpg
