Aghaarabi Elaheh, Murray David
Office of Disease Prevention, National Institutes of Health, 6705 Rockledge Dr, Bethesda, MD, 20892, United States, 1 3014964000.
JMIR Med Inform. 2025 May 9;13:e63267. doi: 10.2196/63267.
For the public health community, monitoring recently published articles is crucial for staying informed about the latest research developments. However, identifying publications about studies with specific research designs within the extensive body of public health literature remains difficult with currently available methods.
Our objective is to develop a fine-tuned pretrained language model that can accurately identify publications from clinical trials that use a group- or cluster-randomized trial (GRT), individually randomized group-treatment trial (IRGT), or stepped wedge group- or cluster-randomized trial (SWGRT) design within the biomedical literature.
We fine-tuned the BioMedBERT language model using a dataset of biomedical literature from the Office of Disease Prevention at the National Institutes of Health. The model was trained to assign each publication to one of three categories of clinical trials that use nested designs or to a negative class. Model performance was evaluated on unseen data and showed high sensitivity and specificity for each class.
When tested for generalizability on unseen data, our proposed model achieved the following sensitivity and specificity, in that order, for each class: negatives (0.95 and 0.93), GRTs (0.94 and 0.90), IRGTs (0.81 and 0.97), and SWGRTs (0.96 and 0.99).
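Per-class sensitivity and specificity in a multiclass setting are commonly computed one-vs-rest: each class in turn is treated as the positive class and all other classes as negative. The following is a minimal sketch of that calculation, not the authors' evaluation code. The function name, the label strings, and the toy labels and predictions are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical label names for the four classes reported in the abstract.
LABELS = ["negative", "GRT", "IRGT", "SWGRT"]


def per_class_sensitivity_specificity(
    y_true: List[str], y_pred: List[str], labels: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {label: (sensitivity, specificity)} using one-vs-rest counts."""
    results = {}
    for label in labels:
        tp = fn = fp = tn = 0
        for truth, pred in zip(y_true, y_pred):
            if truth == label and pred == label:
                tp += 1  # class correctly identified
            elif truth == label:
                fn += 1  # class missed
            elif pred == label:
                fp += 1  # another class mislabeled as this one
            else:
                tn += 1  # another class correctly not labeled as this one
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        results[label] = (sensitivity, specificity)
    return results


# Toy data (made up for illustration, not taken from the study).
y_true = ["negative", "negative", "GRT", "GRT", "IRGT", "SWGRT"]
y_pred = ["negative", "GRT", "GRT", "GRT", "negative", "SWGRT"]

metrics = per_class_sensitivity_specificity(y_true, y_pred, LABELS)
for label, (sens, spec) in metrics.items():
    print(f"{label}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this scheme the "negative" class is scored like any other class, which is consistent with the abstract reporting a sensitivity and specificity pair for negatives alongside the three trial designs.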
Our work demonstrates the potential of fine-tuned, domain-specific language models to accurately identify publications reporting on complex and specialized study designs, addressing a critical need in the public health research community. This model offers a valuable tool for the public health community to directly identify publications from clinical trials that use one of the three classes of nested designs.