Snowball 2.0: Generic Material Data Parser for ChemDataExtractor.

Affiliations

Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, U.K.

ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.

Publication information

J Chem Inf Model. 2023 Nov 27;63(22):7045-7055. doi: 10.1021/acs.jcim.3c01281. Epub 2023 Nov 7.

Abstract

The ever-growing amount of chemical data found in the scientific literature has led to the emergence of data-driven materials discovery. The first step in the pipeline, to automatically extract chemical information from plain text, has been driven by the development of software toolkits such as ChemDataExtractor. Such data extraction processes have created a demand for parsers that efficiently enable text mining. Here, we present Snowball 2.0, a sentence parser based on a semisupervised machine-learning algorithm. It can be used to extract any chemical property without additional training. We validate its precision, recall, and F1-score by training and testing a model with sentences of semiconductor band gap information curated from journal articles. Snowball 2.0 builds on two previously developed Snowball algorithms. Evaluation of Snowball 2.0 shows a 15-20% increase in recall with marginally reduced precision over the previous version, which has been incorporated into ChemDataExtractor 2.0, giving Snowball 2.0 better performance in most configurations. Snowball 2.0 offers more and better parsing options for ChemDataExtractor, and it is more capable in the pipeline of automated data extraction. Snowball 2.0 also features better generalizability, performance, learning efficiencies, and user-friendliness.
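The three validation metrics named in the abstract follow the standard information-extraction definitions. The sketch below shows how they relate to one another. It is not taken from Snowball 2.0 or ChemDataExtractor: the function name `precision_recall_f1` is hypothetical, and the counts are made up for illustration and are not results from the paper.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from raw extraction counts.

    Hypothetical helper for illustration only; not part of ChemDataExtractor.
    """
    extracted = true_positives + false_positives   # records the parser produced
    relevant = true_positives + false_negatives    # records present in the text
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / relevant if relevant else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts (an assumption, not data from the paper)
p, r, f1 = precision_recall_f1(true_positives=80, false_positives=20, false_negatives=80)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, a sizeable gain in recall can raise F1 even when precision drops slightly, which is the kind of trade-off the abstract describes.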

