
How Beneficial Is Pretraining on a Narrow Domain-Specific Corpus for Information Extraction about Photocatalytic Water Splitting?

Authors

Isazawa Taketomo, Cole Jacqueline M

Affiliations

Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.

ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.

Publication information

J Chem Inf Model. 2024 Apr 22;64(8):3205-3212. doi: 10.1021/acs.jcim.4c00063. Epub 2024 Mar 27.

DOI:10.1021/acs.jcim.4c00063
PMID:38544337
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11040717/

Abstract

Language models trained on domain-specific corpora have been employed to improve performance on specialized tasks. However, little previous work has been reported on how specific a "domain-specific" corpus should be. Here, we test a number of language models trained on corpora of varying specificity by employing them in the task of extracting information about photocatalytic water splitting. We find that more specific corpora can benefit performance on downstream tasks. Furthermore, PhotocatalysisBERT, a model pretrained from scratch on scientific papers on photocatalytic water splitting, demonstrates improved performance over previous work in associating the correct photocatalyst with the correct photocatalytic activity during information extraction, achieving a precision of 60.8(+11.5)% and a recall of 37.2(+4.5)%.
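
The reported figures can be unpacked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes that the parenthesised values are percentage-point gains over the previous work, and it derives an F1 score that the abstract itself does not report. The helper name `f1_score` and the dictionaries are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values stated in the abstract, in percent.
reported = {"precision": 60.8, "recall": 37.2}

# Assumption: the "(+x)" values are absolute percentage-point gains
# over the previous work, not relative improvements.
gains = {"precision": 11.5, "recall": 4.5}

# Implied scores of the previous work under that assumption.
baseline = {name: round(reported[name] - gains[name], 1) for name in reported}

# Derived (not reported) F1 scores, in percent.
f1_new = 100 * f1_score(reported["precision"] / 100, reported["recall"] / 100)
f1_old = 100 * f1_score(baseline["precision"] / 100, baseline["recall"] / 100)

print(f"implied previous precision/recall: {baseline}")
print(f"derived F1: {f1_old:.1f}% -> {f1_new:.1f}%")
```

Under this reading, the precision gain is the larger of the two, while recall stays below 40%: the extracted photocatalyst-activity pairs are more often correct, but most pairs present in the text are still missed.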


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/81ab/11040717/2b320facca3d/ci4c00063_0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/81ab/11040717/55f2476df87b/ci4c00063_0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/81ab/11040717/599fa42ffff2/ci4c00063_0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/81ab/11040717/f843a0dacd9e/ci4c00063_0003.jpg
