Center for Statistical Science, Tsinghua University, Beijing, 100084, China.
School of Medicine, Tsinghua University, Beijing, 100084, China.
Sci Data. 2023 Dec 18;10(1):909. doi: 10.1038/s41597-023-02814-8.
Retrieval-based Clinical Decision Support (ReCDS) can aid clinical workflow by providing relevant literature and similar patients for a given patient. However, the development of ReCDS systems has been severely obstructed by the lack of diverse patient collections and of publicly available, large-scale, patient-level annotation datasets. In this paper, we collect a novel dataset of patient summaries and relations, called PMC-Patients, to benchmark two ReCDS tasks: Patient-to-Article Retrieval (ReCDS-PAR) and Patient-to-Patient Retrieval (ReCDS-PPR). Specifically, we extract patient summaries from PubMed Central articles using simple heuristics and use the PubMed citation graph to define patient-article relevance and patient-patient similarity. PMC-Patients contains 167k patient summaries with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations, making it the largest-scale resource for ReCDS and one of the largest patient collections. Human evaluation and analysis show that PMC-Patients is a diverse dataset with high-quality annotations. We also implement and evaluate several ReCDS systems on the PMC-Patients benchmarks to show how challenging they are, and conduct several case studies to show the clinical utility of PMC-Patients.
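The idea of deriving annotations from a citation graph can be illustrated with a minimal sketch. The abstract does not spell out the exact rules, so the ones used here are assumptions made for illustration only: an article counts as relevant to a patient if it is the patient's source article or is linked to it by a citation in either direction, and two patients count as similar if they come from the same article or from two articles linked by a citation. The function name `build_annotations`, the input layout, and the toy identifiers are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_annotations(patient_source, citations):
    """Derive relevance and similarity labels from a citation graph.

    patient_source: dict mapping patient id -> id of the article it came from.
    citations: iterable of (citing_article, cited_article) pairs.
    The labeling rules are illustrative assumptions, not the paper's spec.
    """
    # Treat citations as undirected links between articles.
    neighbors = defaultdict(set)
    for citing, cited in citations:
        neighbors[citing].add(cited)
        neighbors[cited].add(citing)

    # Assumed rule: the source article and its citation neighbors are relevant.
    relevant = {
        pid: {src} | neighbors.get(src, set())
        for pid, src in patient_source.items()
    }

    # Assumed rule: patients are similar if their source articles are
    # identical or directly linked by a citation.
    similar = set()
    for p, q in combinations(sorted(patient_source), 2):
        a, b = patient_source[p], patient_source[q]
        if a == b or b in neighbors.get(a, set()):
            similar.add((p, q))
    return relevant, similar


# Toy data (hypothetical ids): four patients drawn from three articles.
patient_source = {"p1": "A", "p2": "A", "p3": "B", "p4": "C"}
citations = [("A", "B"), ("D", "A")]
relevant, similar = build_annotations(patient_source, citations)
```

In this toy graph, article C has no citation links, so patient `p4` receives only its own source article as relevant and has no similar patients. A real pipeline would also need the heuristic extraction step and would likely grade relevance and similarity rather than treat them as binary.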