Transformer-Based Language Models for Group Randomized Trial Classification in Biomedical Literature: Model Development and Validation.

Authors

Aghaarabi Elaheh, Murray David

Affiliation

Office of Disease Prevention, National Institutes of Health, 6705 Rockledge Dr, Bethesda, MD, 20892, United States, 1 3014964000.

Publication information

JMIR Med Inform. 2025 May 9;13:e63267. doi: 10.2196/63267.

Abstract

BACKGROUND

For the public health community, monitoring recently published articles is crucial for staying informed about the latest research developments. However, identifying publications about studies with specific research designs from the extensive body of public health publications is a challenge with the currently available methods.

OBJECTIVE

Our objective is to develop a fine-tuned pretrained language model that can accurately identify publications from clinical trials that use a group- or cluster-randomized trial (GRT), individually randomized group-treatment trial (IRGT), or stepped wedge group- or cluster-randomized trial (SWGRT) design within the biomedical literature.

METHODS

We fine-tuned the BioMedBERT language model using a dataset of biomedical literature from the Office of Disease Prevention at the National Institutes of Health. The model was trained to classify publications into three categories of clinical trials that use nested designs. The model performance was evaluated on unseen data and demonstrated high sensitivity and specificity for each class.

RESULTS

When our proposed model was tested for generalizability with unseen data, it delivered high sensitivity and specificity for each class as follows: negatives (0.95 and 0.93), GRTs (0.94 and 0.90), IRGTs (0.81 and 0.97), and SWGRTs (0.96 and 0.99), respectively.
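The abstract does not say how the per-class figures were computed. A common convention for a multiclass classifier is one-vs-rest: each class in turn is treated as the positive class and all others as negative. The sketch below illustrates that convention only. The function name `one_vs_rest_metrics`, the label strings, and the toy labels are assumptions made for illustration and are not data or code from the study.

```python
# Minimal sketch: per-class sensitivity and specificity, one-vs-rest.
# Assumption: the label names and the toy data below are hypothetical.

LABELS = ["negative", "GRT", "IRGT", "SWGRT"]


def one_vs_rest_metrics(y_true, y_pred, labels=LABELS):
    """Return {label: (sensitivity, specificity)} for each label.

    Each label is treated in turn as the positive class; every other
    label counts as negative.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    metrics = {}
    for label in labels:
        tp = fn = fp = tn = 0
        for truth, pred in zip(y_true, y_pred):
            if truth == label and pred == label:
                tp += 1  # correctly flagged as this class
            elif truth == label:
                fn += 1  # this class, but predicted as another
            elif pred == label:
                fp += 1  # another class, wrongly predicted as this one
            else:
                tn += 1  # another class, and not predicted as this one
        # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        metrics[label] = (sensitivity, specificity)
    return metrics


# Toy example with made-up labels (not study data).
y_true = ["negative", "negative", "GRT", "GRT", "IRGT", "IRGT", "SWGRT", "SWGRT"]
y_pred = ["negative", "GRT", "GRT", "GRT", "IRGT", "GRT", "SWGRT", "SWGRT"]

results = one_vs_rest_metrics(y_true, y_pred)
for label in LABELS:
    sens, spec = results[label]
    print(f"{label}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Under this convention a class can show high specificity alongside lower sensitivity, as reported for IRGTs (0.81 and 0.97), when its members are sometimes assigned to another class but few other publications are wrongly assigned to it.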

CONCLUSIONS

Our work demonstrates the potential of fine-tuned, domain-specific language models to accurately identify publications reporting on complex and specialized study designs, addressing a critical need in the public health research community. This model offers a valuable tool for the public health community to directly identify publications from clinical trials that use one of the three classes of nested designs.
