Autogenerating a Domain-Specific Question-Answering Data Set from a Thermoelectric Materials Database to Enable High-Performing BERT Models.

Author Information

Sierepeklis Odysseas, Cole Jacqueline M

Affiliations

Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.

Science and Technology Facilities Council, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.

Publication Information

J Chem Inf Model. 2025 Aug 25;65(16):8579-8592. doi: 10.1021/acs.jcim.5c00840. Epub 2025 Aug 7.

Abstract

We present a method for autogenerating a large domain-specific question-answering (QA) data set from a thermoelectric materials database. We show that a small language model, BERT, once fine-tuned on this automatically generated data set of 99,757 QA pairs about thermoelectric materials, affords better performance in the field of thermoelectric materials than a BERT model fine-tuned on the generic English-language QA data set, SQuAD-v2. We further show that mixing the two data sets (ours and SQuAD-v2), which have significantly different syntactic and semantic scopes, allows the BERT model to achieve even better performance. The best-performing BERT model, fine-tuned on the mixed data set, outperforms the models fine-tuned on the other two data sets, scoring an exact match of 67.93% and an F1 score of 72.29% when evaluated on our test data set. This has important implications, as it demonstrates the ability to realize high-performing small language models, with modest computational resources, empowered by domain-specific materials data sets that can be generated according to our method.
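The exact-match and F1 figures quoted above are the standard metrics for extractive QA popularized by SQuAD: exact match checks whether the normalized predicted span equals the normalized gold answer, while F1 measures token overlap between the two. A minimal sketch of how these metrics are computed, assuming the common SQuAD normalization convention (lowercasing, stripping punctuation and English articles); the function names are illustrative, not the authors' code:

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    # SQuAD-style normalization: lowercase, drop punctuation,
    # drop English articles, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0.
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    # Harmonic mean of token-level precision and recall.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level scores such as the 67.93% / 72.29% reported here are then the averages of these per-question values over the test set (with the maximum taken over gold answers when a question has several).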


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b7f9/12381847/cf79ecc33558/ci5c00840_0001.jpg

Similar Articles

1
Fine-grained spatial information extraction in radiology as two-turn question answering.
Int J Med Inform. 2021 Nov 6;158:104628. doi: 10.1016/j.ijmedinf.2021.104628.
2
Cognitive decline assessment using semantic linguistic content and transformer deep learning architecture.
Int J Lang Commun Disord. 2024 May-Jun;59(3):1110-1127. doi: 10.1111/1460-6984.12973. Epub 2023 Nov 16.
3
BioInstruct: instruction tuning of large language models for biomedical natural language processing.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1821-1832. doi: 10.1093/jamia/ocae122.
4
Question Answering for Electronic Health Records: Scoping Review of Datasets and Models.
J Med Internet Res. 2024 Oct 30;26:e53636. doi: 10.2196/53636.

