BatteryDataExtractor: battery-aware text-mining software embedded with BERT models.

Authors

Huang Shu, Cole Jacqueline M

Affiliations

Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, UK.

ISIS Neutron and Muon Source, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, UK.

Publication Information

Chem Sci. 2022 Sep 23;13(39):11487-11495. doi: 10.1039/d2sc04322j. eCollection 2022 Oct 12.

Abstract

Due to the massive growth of scientific publications, literature mining has become an increasingly popular way for researchers to explore scientific text thoroughly and extract its data to create new databases or augment existing ones. Efforts in literature-mining software design and implementation have improved text-mining productivity, but most text-mining toolkits are based on traditional machine-learning algorithms, which limits the performance of downstream text-mining tasks. Natural-language processing (NLP) and text-mining technologies have developed rapidly since the release of transformer models such as bidirectional encoder representations from transformers (BERT). Upgrading rule-based or machine-learning-based literature-mining toolkits by embedding transformer models in the software is therefore likely to improve their text-mining performance. To this end, we release BatteryDataExtractor, a Python-based literature-mining toolkit for the field of battery materials that embeds BatteryBERT models in its automated data-extraction pipeline. This pipeline employs BERT models for token-classification tasks, such as abbreviation detection, part-of-speech tagging, and chemical-named-entity recognition, together with new double-turn question-answering data-extraction models that auto-generate repositories of inter-related material and property data as well as general information. We demonstrate that BatteryDataExtractor exhibits state-of-the-art performance on the evaluation data sets for both token classification and automated data extraction. To aid its use, the BatteryDataExtractor code is provided as open-source software, with associated documentation that serves as a user guide.
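
The double-turn question-answering scheme mentioned in the abstract is the step that links each extracted property value back to its parent material. A minimal sketch of the idea follows, written against the Hugging Face transformers question-answering pipeline; the checkpoint name batterydata/batterybert-cased-squad-v1, the example sentence, and the two question templates are assumptions for illustration, not BatteryDataExtractor's own API.

```python
# Sketch of double-turn question answering for material-property extraction.
# Assumes the Hugging Face `transformers` library and a BatteryBERT QA
# checkpoint (here assumed to be "batterydata/batterybert-cased-squad-v1").
# Illustrates the idea only; it is not BatteryDataExtractor's implementation.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="batterydata/batterybert-cased-squad-v1",
)

context = (
    "LiFePO4 delivers a specific capacity of 160 mAh/g "
    "and a working voltage of 3.4 V."
)

# Turn 1: ask for the value of the target property.
turn1 = qa(question="What is the specific capacity?", context=context)
value = turn1["answer"]  # e.g. "160 mAh/g"

# Turn 2: feed that answer back to ask which material it belongs to,
# linking the property value to its parent material entity.
turn2 = qa(
    question=f"Which material has a specific capacity of {value}?",
    context=context,
)
material = turn2["answer"]  # e.g. "LiFePO4"

print({"material": material, "specific_capacity": value})
```

Conditioning the second question on the answer recovered in the first turn is what makes the scheme "double-turn": each property value is tied to a specific material entity rather than floating free in the sentence, and BatteryDataExtractor wraps logic of this kind into its automated data-extraction pipeline.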

Similar Articles

1. BatteryDataExtractor: battery-aware text-mining software embedded with BERT models.
   Chem Sci. 2022 Sep 23;13(39):11487-11495. doi: 10.1039/d2sc04322j. eCollection 2022 Oct 12.
2. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
   Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
3. BatteryBERT: A Pretrained Language Model for Battery Database Enhancement.
   J Chem Inf Model. 2022 Dec 26;62(24):6365-6377. doi: 10.1021/acs.jcim.2c00035. Epub 2022 May 9.
4. Transformers-sklearn: a toolkit for medical language understanding with transformer-based models.
   BMC Med Inform Decis Mak. 2021 Jul 30;21(Suppl 2):90. doi: 10.1186/s12911-021-01459-0.
7. OpticalBERT and OpticalTable-SQA: Text- and Table-Based Language Models for the Optical-Materials Domain.
   J Chem Inf Model. 2023 Apr 10;63(7):1961-1981. doi: 10.1021/acs.jcim.2c01259. Epub 2023 Mar 20.
9. BERT-GT: cross-sentence n-ary relation extraction with BERT and Graph Transformer.
   Bioinformatics. 2021 Apr 5;36(24):5678-5685. doi: 10.1093/bioinformatics/btaa1087.
10. Clinical concept extraction using transformers.
   J Am Med Inform Assoc. 2020 Dec 9;27(12):1935-1942. doi: 10.1093/jamia/ocaa189.

Cited By

1. Creation of a structured solar cell material dataset and performance prediction using large language models.
   Patterns (N Y). 2024 Mar 22;5(5):100955. doi: 10.1016/j.patter.2024.100955. eCollection 2024 May 10.
2. How Beneficial Is Pretraining on a Narrow Domain-Specific Corpus for Information Extraction about Photocatalytic Water Splitting?
   J Chem Inf Model. 2024 Apr 22;64(8):3205-3212. doi: 10.1021/acs.jcim.4c00063. Epub 2024 Mar 27.
5. ReactionDataExtractor 2.0: A Deep Learning Approach for Data Extraction from Chemical Reaction Schemes.
   J Chem Inf Model. 2023 Oct 9;63(19):6053-6067. doi: 10.1021/acs.jcim.3c00422. Epub 2023 Sep 20.
6. DigiMOF: A Database of Metal-Organic Framework Synthesis Information Generated via Text Mining.
   Chem Mater. 2023 May 18;35(11):4510-4524. doi: 10.1021/acs.chemmater.3c00788. eCollection 2023 Jun 13.
