Sierepeklis Odysseas, Cole Jacqueline M
Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
Science and Technology Facilities Council, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
J Chem Inf Model. 2025 Aug 25;65(16):8579-8592. doi: 10.1021/acs.jcim.5c00840. Epub 2025 Aug 7.
We present a method for autogenerating a large domain-specific question-answering (QA) data set from a thermoelectric materials database. We show that a small language model, BERT, once fine-tuned on this automatically generated data set of 99,757 QA pairs about thermoelectric materials, affords better QA performance in the thermoelectric-materials domain than a BERT model fine-tuned on the generic English-language QA data set, SQuAD-v2. We further show that mixing the two data sets (ours and SQuAD-v2), which have significantly different syntactic and semantic scopes, allows the BERT model to achieve even better performance. The best-performing BERT model, fine-tuned on the mixed data set, outperforms the models fine-tuned on the other two data sets, scoring an exact match of 67.93% and an F1 score of 72.29% when evaluated on our test data set. This has important implications, as it demonstrates that high-performing small language models can be realized with modest computational resources when empowered by domain-specific materials data sets generated according to our method.
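To make the pipeline concrete, the sketch below shows one plausible way to autogenerate SQuAD-v2-style QA pairs from database property records using question templates, together with the standard exact-match (EM) and token-level F1 metrics used for evaluation. This is a minimal sketch under stated assumptions: the record fields (compound, property, value, unit) and the question templates are illustrative, and the paper's actual database schema and generation rules may differ.

```python
# Hypothetical sketch: template-based QA generation from a materials
# property database, plus SQuAD-style EM/F1 scoring. The record schema
# and templates are illustrative assumptions, not the paper's own.
import re
import string
from collections import Counter

# Assumed record schema; the real thermoelectric database will differ.
records = [
    {"compound": "Bi2Te3", "property": "Seebeck coefficient",
     "value": "-287", "unit": "uV/K"},
    {"compound": "PbTe", "property": "thermal conductivity",
     "value": "2.2", "unit": "W/(m K)"},
]

def make_qa_pair(rec):
    """Turn one database record into a SQuAD-v2-style entry:
    a context sentence, a question, and the answer span offset."""
    answer = f"{rec['value']} {rec['unit']}"
    context = f"The {rec['property']} of {rec['compound']} is {answer}."
    question = f"What is the {rec['property']} of {rec['compound']}?"
    return {
        "context": context,
        "question": question,
        "answers": {"text": [answer],
                    "answer_start": [context.index(answer)]},
    }

# --- SQuAD-style evaluation (standard normalisation rules) ---
def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Token-level F1 over the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    common = Counter(pred_tokens) & Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    qa = make_qa_pair(records[0])
    print(qa["question"], "->", qa["answers"]["text"][0])
    print("EM:", exact_match("-287 uV/K", qa["answers"]["text"][0]))
    print("F1:", f1_score("-287 uV / K", qa["answers"]["text"][0]))
```

In practice, reaching the reported scale of 99,757 QA pairs would presumably require many templates per property type, and a SQuAD-v2-compatible data set would also need unanswerable examples; both are beyond this illustrative sketch.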