How Beneficial Is Pretraining on a Narrow Domain-Specific Corpus for Information Extraction about Photocatalytic Water Splitting?

Authors

Isazawa Taketomo, Cole Jacqueline M

Affiliations

Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.

ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.

Publication

J Chem Inf Model. 2024 Apr 22;64(8):3205-3212. doi: 10.1021/acs.jcim.4c00063. Epub 2024 Mar 27.

Abstract

Language models trained on domain-specific corpora have been employed to improve performance on specialized tasks. However, little prior work has examined how specific a "domain-specific" corpus should be. Here, we test a number of language models trained on corpora of varying specificity by employing them in the task of extracting information about photocatalytic water splitting. We find that more specific corpora can benefit performance on downstream tasks. Furthermore, PhotocatalysisBERT, a model pretrained from scratch on scientific papers about photocatalytic water splitting, demonstrates improved performance over previous work in associating the correct photocatalyst with the correct photocatalytic activity during information extraction, achieving a precision of 60.8(+11.5)% and a recall of 37.2(+4.5)%.
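To make the "pretraining from scratch on a narrow corpus" setup concrete, the sketch below shows one plausible way to pretrain a BERT-style masked language model on a domain corpus using the Hugging Face transformers and datasets libraries. The corpus file name (photocatalysis_corpus.txt), model size, and hyperparameters are illustrative assumptions, not the paper's actual configuration.

# Minimal sketch of domain-specific masked-language-model pretraining,
# in the spirit of PhotocatalysisBERT. File path and hyperparameters
# below are hypothetical placeholders.
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Plain-text corpus of photocatalysis papers, one passage per line.
corpus = load_dataset("text", data_files={"train": "photocatalysis_corpus.txt"})

# A WordPiece vocabulary trained on the domain corpus would normally
# accompany from-scratch pretraining; we reuse the standard BERT
# vocabulary here for brevity.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Randomly initialized weights: pretraining from scratch, as opposed to
# continued pretraining of an existing checkpoint.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# Standard 15% token masking for the masked-language-modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="photocatalysis-bert",
    per_device_train_batch_size=16,
    num_train_epochs=5,
    learning_rate=1e-4,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()

The random initialization is the key design choice the paper varies: a model pretrained from scratch sees only the narrow domain corpus, whereas continued pretraining inherits the distribution of a general-purpose checkpoint.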

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/81ab/11040717/55f2476df87b/ci4c00063_0001.jpg
