Huang Shu, Cole Jacqueline M
Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, UK
ISIS Neutron and Muon Source, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, UK
Chem Sci. 2022 Sep 23;13(39):11487-11495. doi: 10.1039/d2sc04322j. eCollection 2022 Oct 12.
Owing to the rapid growth in the number of scientific publications, literature mining is increasingly used by researchers to explore scientific text systematically and to extract data from it, in order to create new databases or augment existing ones. Efforts in the design and implementation of literature-mining software have improved text-mining productivity, but most text-mining toolkits are based on traditional machine-learning algorithms, which limits the performance of downstream text-mining tasks. Natural-language processing (NLP) and text-mining technologies have developed rapidly since the release of transformer models, such as bidirectional encoder representations from transformers (BERT). Upgrading rule-based or machine-learning-based literature-mining toolkits by embedding transformer models into the software is therefore likely to improve their text-mining performance. To this end, we release a Python-based literature-mining toolkit for the field of battery materials, BatteryDataExtractor, which embeds BatteryBERT models in its automated data-extraction pipeline. This pipeline employs BERT models for token-classification tasks, such as abbreviation detection, part-of-speech tagging, and chemical-named-entity recognition, together with new double-turn question-answering data-extraction models that auto-generate repositories of inter-related material and property data, as well as general information. We demonstrate that BatteryDataExtractor exhibits state-of-the-art performance on the evaluation data sets for both token classification and automated data extraction. To aid the use of BatteryDataExtractor, its code is provided as open-source software, with associated documentation that serves as a user guide.
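The double-turn question-answering idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the BatteryDataExtractor API: the function `double_turn_extract`, the `Record` type, the question templates, and the confidence threshold are all hypothetical names and values chosen for illustration. The sketch assumes that a first question retrieves a property value and a second question, built from that value, retrieves the material it belongs to. The QA model is passed in as a plain callable, so a transformer-based model could be plugged in; here a toy stand-in is used so the example runs without any model download.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Assumed interface: (question, context) -> (answer text, confidence in [0, 1]).
QAFunction = Callable[[str, str], Tuple[str, float]]


@dataclass
class Record:
    """One linked material-property data record (hypothetical schema)."""
    material: str
    property_name: str
    value: str
    confidence: float


def double_turn_extract(
    context: str,
    property_name: str,
    qa: QAFunction,
    threshold: float = 0.1,  # illustrative cut-off, not taken from the paper
) -> Optional[Record]:
    """Ask two chained questions to link a property value to a material."""
    # Turn 1: find the value of the property in the text.
    value, value_score = qa(f"What is the value of {property_name}?", context)
    if not value or value_score < threshold:
        return None

    # Turn 2: use the extracted value to ask which material it belongs to.
    material, material_score = qa(
        f"Which material has {property_name} of {value}?", context
    )
    if not material or material_score < threshold:
        return None

    # Report the weaker of the two scores as the record confidence.
    return Record(material, property_name, value, min(value_score, material_score))


def toy_qa(question: str, context: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a fine-tuned QA model; returns fixed answers."""
    if question.startswith("What is the value"):
        return "170 mAh/g", 0.9
    if question.startswith("Which material"):
        return "LiFePO4", 0.8
    return "", 0.0


text = "LiFePO4 delivers a capacity of 170 mAh/g."
record = double_turn_extract(text, "capacity", toy_qa)
print(record)
```

Passing the QA model as a callable keeps the chaining logic independent of any particular model, so `toy_qa` could be replaced by a wrapper around a real extractive question-answering model without changing `double_turn_extract`.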