Department of Life Sciences, Technical University of Munich, Freising, Germany.
Department of Quantitative Health Sciences, Cleveland Clinic Foundation, Cleveland, OH, USA.
BMC Med Res Methodol. 2022 Jul 21;22(1):200. doi: 10.1186/s12874-022-01674-x.
We compared six commonly used logistic regression methods for accommodating missing risk factor data from multiple heterogeneous cohorts, in which some cohorts do not collect some risk factors at all, and developed an online risk prediction tool that accommodates missing risk factors from the end-user.
Ten North American and European cohorts from the Prostate Biopsy Collaborative Group (PBCG) were used to fit a risk prediction tool for clinically significant prostate cancer, defined as Gleason grade group ≥ 2 on standard transrectal ultrasound (TRUS) prostate biopsy. One large European PBCG cohort was withheld for external validation, where calibration-in-the-large (CIL), calibration curves, and the area under the receiver operating characteristic curve (AUC) were evaluated. Ten-fold leave-one-cohort-out internal cross-validation further validated the optimal missing data approach.
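The validation metrics named above can be illustrated with a minimal Python sketch. It assumes CIL is the mean predicted risk minus the observed event rate, expressed in percentage points, so that negative values indicate under-prediction; this sign convention is inferred from the reported results rather than stated in the abstract. The function names and the toy data are hypothetical and are not PBCG data.

```python
def calibration_in_the_large(y_true, p_pred):
    """Mean predicted risk minus observed event rate, in percentage points.

    Assumed sign convention: negative means the model under-predicts on average.
    """
    n = len(y_true)
    return 100.0 * (sum(p_pred) / n - sum(y_true) / n)


def auc(y_true, p_pred):
    """Probability that a random case is ranked above a random non-case.

    Ties between a case and a non-case count as one half.
    """
    cases = [p for y, p in zip(y_true, p_pred) if y == 1]
    controls = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                wins += 1.0
            elif pc == pn:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def leave_one_cohort_out(cohort_labels):
    """Yield (held_out_cohort, train_indices, test_indices), one fold per cohort."""
    for held_out in sorted(set(cohort_labels)):
        train = [i for i, c in enumerate(cohort_labels) if c != held_out]
        test = [i for i, c in enumerate(cohort_labels) if c == held_out]
        yield held_out, train, test


# Toy data, made up for illustration only.
y = [0, 0, 1, 1, 0, 1]
p = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
cohorts = ["A", "A", "B", "B", "C", "C"]

cil = calibration_in_the_large(y, p)
area = auc(y, p)
folds = list(leave_one_cohort_out(cohorts))
```

With ten training cohorts, the splitter yields ten folds, each holding out one whole cohort, which is one way to read the ten-fold internal validation described above.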
Among 12,703 biopsies from 10 training cohorts, 3,597 (28%) had clinically significant prostate cancer, compared to 1,757 of 5,540 (32%) in the external validation cohort. In external validation, the available cases method, which pooled all individual patient data containing every risk factor input by an end-user, had the best CIL, under-predicting risk by 2.9 percentage points on average, and obtained an AUC of 75.7%. Imputation had the worst CIL (-13.3%). The available cases method was further validated as optimal in internal cross-validation and was therefore used to develop an online risk tool. For end-users of the risk tool, two risk factors were mandatory: serum prostate-specific antigen (PSA) and age. Ten were optional: digital rectal exam, prostate volume, prior negative biopsy, 5-alpha-reductase-inhibitor use, prior PSA screen, African ancestry, Hispanic ethnicity, and three family history items (first-degree prostate cancer, breast cancer, and second-degree prostate cancer).
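The record-selection step of an available cases approach can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the field names (`psa`, `age`, `dre`, `volume`), the `available_cases` helper, and the toy records are all hypothetical. The idea is that, for whichever risk factors an end-user supplies, the model is fit only on pooled records in which all of those factors were observed.

```python
# Hypothetical field names; PSA and age are the two mandatory inputs.
MANDATORY = ("psa", "age")


def available_cases(records, user_factors):
    """Return the pooled records that observed every factor the user supplied.

    A value of None marks a risk factor that was missing for that record,
    including factors a cohort never collected at all.
    """
    needed = set(MANDATORY) | set(user_factors)
    return [r for r in records if all(r.get(f) is not None for f in needed)]


# Toy records, made up for illustration only. Cohort B never collected
# digital rectal exam (dre) or prostate volume.
records = [
    {"cohort": "A", "psa": 4.2, "age": 63, "dre": 0, "volume": 40.0, "cancer": 0},
    {"cohort": "A", "psa": 8.1, "age": 70, "dre": 1, "volume": None, "cancer": 1},
    {"cohort": "B", "psa": 5.5, "age": 58, "dre": None, "volume": None, "cancer": 0},
    {"cohort": "B", "psa": 12.0, "age": 66, "dre": None, "volume": None, "cancer": 1},
]

subset_psa_age = available_cases(records, [])              # all four records
subset_dre = available_cases(records, ["dre"])             # cohort A only
subset_dre_vol = available_cases(records, ["dre", "volume"])  # one record
```

A logistic regression would then be fit on each subset using only the supplied factors as predictors, so users who enter fewer risk factors draw on more cohorts, while users who enter more draw on fewer but richer records.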
Developers of clinical risk prediction tools should optimize use of available data and sources even in the presence of high amounts of missing data and offer options for users with missing risk factors.