Sierepeklis Odysseas, Cole Jacqueline M
Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
Science and Technology Facilities Council, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
J Chem Inf Model. 2025 Aug 25;65(16):8579-8592. doi: 10.1021/acs.jcim.5c00840. Epub 2025 Aug 7.
We present a method for autogenerating a large domain-specific question-answering (QA) data set from a thermoelectric materials database. We show that a small language model, BERT, once fine-tuned on this automatically generated data set of 99,757 QA pairs about thermoelectric materials, affords better QA performance in the thermoelectric-materials domain than a BERT model fine-tuned on the generic English-language QA data set, SQuAD-v2. We further show that mixing the two data sets (ours and SQuAD-v2), which have significantly different syntactic and semantic scopes, allows the BERT model to achieve even better performance. The best-performing BERT model, fine-tuned on the mixed data set, outperforms the models fine-tuned on the other two data sets, scoring an exact match of 67.93% and an F1 score of 72.29% when evaluated on our test data set. This has important implications, as it demonstrates that high-performing small language models can be realized with modest computational resources when empowered by domain-specific materials data sets generated according to our method.
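To make the pipeline concrete, the sketch below shows one plausible way to autogenerate SQuAD-v2-style QA pairs from database property records using question templates, together with the standard exact-match (EM) and token-level F1 metrics used for evaluation. This is a minimal sketch under stated assumptions: the record fields (compound, property, value, unit) and the question templates are illustrative, and the paper's actual database schema and generation rules may differ.

```python
# Hypothetical sketch: template-based QA generation from a materials
# property database, plus SQuAD-style EM/F1 scoring. The record schema
# and templates are illustrative assumptions, not the paper's own.
import re
import string
from collections import Counter

# Assumed record schema; the real thermoelectric database will differ.
records = [
    {"compound": "Bi2Te3", "property": "Seebeck coefficient",
     "value": "-287", "unit": "uV/K"},
    {"compound": "PbTe", "property": "thermal conductivity",
     "value": "2.2", "unit": "W/(m K)"},
]

def make_qa_pair(rec):
    """Turn one database record into a SQuAD-v2-style entry:
    a context sentence, a question, and the answer span offset."""
    answer = f"{rec['value']} {rec['unit']}"
    context = f"The {rec['property']} of {rec['compound']} is {answer}."
    question = f"What is the {rec['property']} of {rec['compound']}?"
    return {
        "context": context,
        "question": question,
        "answers": {"text": [answer],
                    "answer_start": [context.index(answer)]},
    }

# --- SQuAD-style evaluation (standard normalisation rules) ---
def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Token-level F1 over the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    common = Counter(pred_tokens) & Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    qa = make_qa_pair(records[0])
    print(qa["question"], "->", qa["answers"]["text"][0])
    print("EM:", exact_match("-287 uV/K", qa["answers"]["text"][0]))
    print("F1:", f1_score("-287 uV / K", qa["answers"]["text"][0]))
```

In practice, reaching the reported scale of 99,757 QA pairs would presumably require many templates per property type, and a SQuAD-v2-compatible data set would also need unanswerable examples; both are beyond this illustrative sketch.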