Wu Yirong, Fan Jun, Peissig Peggy, Berg Richard, Tafti Ahmad Pahlavan, Yin Jie, Yuan Ming, Page David, Cox Jennifer, Burnside Elizabeth S
University of Wisconsin, Madison, WI, USA.
Marshfield Clinic, Marshfield, WI, USA.
Proc SPIE Int Soc Opt Eng. 2018 Feb;10577. doi: 10.1117/12.2293954. Epub 2018 Mar 7.
Improved prediction of the "most harmful" breast cancers, those that cause the most substantive morbidity and mortality, would enable physicians to target more intensive screening and preventive measures at the women at highest risk; however, prediction models for the "most harmful" breast cancers have rarely been developed. Electronic health records (EHRs) are an underused data source with great research and clinical potential. Our goal was to quantify the value of EHR variables in predicting the risk of "most harmful" breast cancer. From an existing personalized medicine data repository, we identified 794 subjects who had breast cancer with primary non-benign tumors and whose earliest diagnosis was on or after 1/1/2004, including 395 "most harmful" breast cancer cases and 399 "least harmful" breast cancer cases. For these subjects, we collected EHR data comprising six components: demographics, diagnoses, symptoms, procedures, medications, and laboratory results. We developed two regularized prediction models, Ridge Logistic Regression (Ridge-LR) and Lasso Logistic Regression (Lasso-LR), to predict the "most harmful" breast cancer one year in advance. The area under the ROC curve (AUC) was used to assess model performance. The Ridge-LR and Lasso-LR models achieved AUCs of 0.818 and 0.839, respectively. For both models, the predictive performance of the full set of EHR variables was significantly higher than that of each individual component (p<0.001). In conclusion, EHR variables can be used to predict the "most harmful" breast cancer, making it possible to personalize care in clinical practice for the women at highest risk.
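The modelling setup described above (ridge- and lasso-penalized logistic regression scored by AUC) can be illustrated with a minimal sketch. This is not the authors' code or data: it uses synthetic features as a stand-in for EHR variables, a plain (proximal) gradient descent fit written in pure Python, and a single hold-out split. The function names, the penalty strength `lam=0.05`, the step size, and the iteration count are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_logistic(X, y, penalty, lam, step=0.5, n_iter=200):
    """Penalized logistic regression; "l2" is a ridge fit, "l1" a lasso fit.

    The intercept b is left unpenalized.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b
        if penalty == "l2":
            # The gradient of (lam / 2) * ||w||^2 is lam * w.
            w = [wj - step * (gj + lam * wj) for wj, gj in zip(w, grad_w)]
        elif penalty == "l1":
            # Gradient step on the loss, then soft-thresholding for lam * ||w||_1.
            w = [soft_threshold(wj - step * gj, step * lam)
                 for wj, gj in zip(w, grad_w)]
        else:
            raise ValueError("penalty must be 'l1' or 'l2'")
    return w, b


def auc(y_true, scores):
    # Probability that a random positive outranks a random negative;
    # tied scores count as one half.
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


def make_data(n, p, rng):
    # Synthetic data (hypothetical): only the first two columns carry signal.
    X, y = [], []
    for _ in range(n):
        xi = [rng.gauss(0.0, 1.0) for _ in range(p)]
        prob = sigmoid(2.0 * xi[0] - 2.0 * xi[1])
        X.append(xi)
        y.append(1 if rng.random() < prob else 0)
    return X, y


def run_demo(seed=0):
    rng = random.Random(seed)
    X_train, y_train = make_data(300, 10, rng)
    X_test, y_test = make_data(200, 10, rng)
    results = {}
    for name, penalty in (("ridge", "l2"), ("lasso", "l1")):
        w, b = fit_logistic(X_train, y_train, penalty, lam=0.05)
        scores = [b + sum(wj * xj for wj, xj in zip(w, xi)) for xi in X_test]
        results[name] = {"auc": auc(y_test, scores), "weights": w}
    return results


if __name__ == "__main__":
    for name, res in run_demo().items():
        print(name, round(res["auc"], 3))
```

The two penalties differ in how they treat uninformative features: the ridge penalty shrinks all weights smoothly, while the soft-thresholding step in the lasso fit can set weights exactly to zero. In practice the penalty strength would be tuned (for example by cross-validation) rather than fixed as it is here.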