Peeters Casper, Vijverberg Koen, Pouwer Marianne, Westerman Bart, Boot Maikel, Verberne Suzan
Medstone Science, Amsterdam, The Netherlands.
Amsterdam University Medical Center (UMC), Amsterdam, The Netherlands.
BMC Med Res Methodol. 2025 Jul 31;25(1):184. doi: 10.1186/s12874-025-02624-z.
Medical decision-making is commonly guided by evidence-based analyses from systematic literature reviews (SLRs), which require large amounts of time and subject-matter expertise to perform. Automated extraction of key data points from clinical publications could speed up the assembly of systematic literature reviews. To this end, we built SURUS, a named entity recognition (NER) system consisting of a Bidirectional Encoder Representations from Transformers (BERT) model trained on a fine-grained dataset. The aim of this study was to assess the quality of SURUS classifications of PICO (patient, intervention, comparator and outcome) and study design elements of clinical study abstracts.
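The abstract does not give SURUS implementation details; as a minimal sketch of the general setup it describes (a PubMedBERT checkpoint fine-tuned for token-level NER over a fine-grained label set), the following assumes a publicly available PubMedBERT checkpoint and a BIO tagging scheme over the 25 entity types mentioned in the conclusions. The checkpoint name and label count are assumptions, not the authors' configuration.

```python
# Sketch only: fine-tuning a PubMedBERT-style model for token classification (NER).
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed checkpoint; the paper's exact base model/version is not specified here.
checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

# Assumed label scheme: BIO tags over 25 entity types, plus the "O" (outside) tag.
num_labels = 2 * 25 + 1

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=num_labels)

# Fine-tuning would then proceed with the standard token-classification loss
# over the manually annotated abstracts (e.g. via the transformers Trainer API).
```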
The PubMedBERT-based model was trained and evaluated on a dataset of 39,531 labels across 400 clinical abstracts, with an inter-annotator agreement of 0.81 (Cohen's κ) and 0.88 (F1). The labels were manually annotated following a strict annotation guideline. We evaluated the quality of the dataset and tested the utility of the model in the practice of systematic literature screening by comparing SURUS predictions to expert PICO and design classifications. Additionally, we tested the out-of-domain quality of the model across 7 other therapeutic areas and another study design.
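For readers unfamiliar with the two agreement figures reported above, the sketch below shows how Cohen's κ (token-level agreement) and an entity-level F1 between two annotators are typically computed; the toy label sequences and the use of scikit-learn and seqeval are illustrative assumptions, not the authors' evaluation code.

```python
# Sketch: inter-annotator agreement as token-level Cohen's kappa and entity-level F1.
from sklearn.metrics import cohen_kappa_score
from seqeval.metrics import f1_score

# Toy BIO-tagged annotations of the same sentence by two annotators (illustrative only).
annotator_a = [["O", "B-Intervention", "I-Intervention", "O", "B-Outcome"]]
annotator_b = [["O", "B-Intervention", "I-Intervention", "O", "O"]]

# Token-level agreement: flatten the sequences and compare label-by-label.
kappa = cohen_kappa_score(sum(annotator_a, []), sum(annotator_b, []))

# Entity-level agreement: F1 of one annotator's spans against the other's.
entity_f1 = f1_score(annotator_a, annotator_b)

print(f"Cohen's kappa: {kappa:.2f}, entity-level F1: {entity_f1:.2f}")
```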
The SURUS NER system achieved an overall F1 score of 0.95, with minor deviation between labels. In addition, SURUS achieved an NER F1 of 0.90 and 0.84 for out-of-domain therapeutic-area and observational study abstracts, respectively. Finally, the F1 of PICO and study design classifications was 0.89, with a recall of 0.96, compared to expert classifications.
The system reaches an F1 score of 0.95 across 25 contextually different medical named entities. This high-quality in-domain medical entity prediction by a fine-tuned BERT-based model was the result of a strict annotation guideline and high inter-annotator agreement. This prediction accuracy was largely preserved during extensive out-of-domain evaluation, indicating the system's utility across other indication areas and study types. Current approaches in the field lack the fine-grained training data and versatility demonstrated here. We think this approach sets a new standard in medical literature analysis and paves the way for creating fine-grained datasets of labelled entities that can be used for downstream analyses outside of traditional SLRs.
The online version contains supplementary material available at 10.1186/s12874-025-02624-z.