Isazawa Taketomo, Cole Jacqueline M
Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
J Chem Inf Model. 2024 Apr 22;64(8):3205-3212. doi: 10.1021/acs.jcim.4c00063. Epub 2024 Mar 27.
Language models trained on domain-specific corpora have been employed to improve performance on specialized tasks. However, little prior work has examined how specific a "domain-specific" corpus should be. Here, we test a number of language models trained on corpora of varying specificity by employing them to extract information from scientific papers on photocatalytic water splitting. We find that more specific corpora can benefit performance on downstream tasks. Furthermore, PhotocatalysisBERT, a model pretrained from scratch on scientific papers on photocatalytic water splitting, outperforms previous work in associating the correct photocatalyst with the correct photocatalytic activity during information extraction, achieving a precision of 60.8(+11.5)% and a recall of 37.2(+4.5)%.
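The reported precision and recall scores refer to exact-match evaluation of extracted records. A minimal sketch of how such scores are typically computed for an extraction task, assuming a predicted (photocatalyst, activity) pair counts as correct only if it exactly matches a gold-standard pair (the example pairs below are hypothetical illustrations, not data from this study):

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two sets of extracted records.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold records
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations and model predictions:
gold = {
    ("TiO2", "1200 umol/h"),
    ("CdS", "450 umol/h"),
    ("g-C3N4", "300 umol/h"),
}
predicted = {
    ("TiO2", "1200 umol/h"),  # correct pairing
    ("CdS", "900 umol/h"),    # catalyst right, activity wrong -> no credit
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.33
```

Under this exact-match criterion, mis-associating a photocatalyst with another compound's activity is penalized in both precision and recall, which is why correct catalyst-activity association is the quantity the abstract highlights.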