Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Lübeck, Germany.
Institute for Cardiogenetics, Universität zu Lübeck, Lübeck, Germany.
Genet Epidemiol. 2020 Mar;44(2):125-138. doi: 10.1002/gepi.22279. Epub 2020 Jan 10.
Coronary artery disease (CAD) is the leading cause of mortality worldwide and is substantially heritable, with a polygenic architecture. Recent risk prediction approaches have relied on polygenic risk scores (PRS), which do not account for possible nonlinear effects and are restricted to genetic loci associated with CAD. We benchmarked PRS, (penalized) logistic regression, naïve Bayes (NB), random forests (RF), support vector machines (SVM), and gradient boosting (GB) on a data set of 7,736 CAD cases and 6,774 controls from Germany to identify the algorithm that classifies CAD status most accurately. The final models were tested on an independent data set from Germany (527 CAD cases and 473 controls). PRS was the best algorithm, yielding an area under the receiver operating characteristic curve (AUC) of 0.92 (95% CI [0.90, 0.95], 50,633 loci) in the German test data. NB and SVM (AUC ≈ 0.81) performed better than RF and GB (AUC ≈ 0.75). We conclude that PRS predicts CAD better than the machine learning methods we benchmarked.
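To make the two central quantities concrete, the following minimal sketch shows how a PRS and an AUC are commonly computed. It is an illustration under stated assumptions, not the pipeline used in the study: the function names `polygenic_risk_score` and `auc`, the toy genotype matrix, the per-locus weights, and the simulated case/control labels are all hypothetical. In practice the weights are typically per-locus effect estimates (for example, log odds ratios) from an association study.

```python
import numpy as np


def polygenic_risk_score(genotypes, weights):
    """Weighted sum of risk-allele counts (0, 1, or 2) for each individual."""
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return genotypes @ weights


def auc(scores, labels):
    """AUC as the probability that a random case scores higher than a
    random control; ties contribute one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    cases, controls = scores[labels], scores[~labels]
    diff = cases[:, None] - controls[None, :]  # all case/control pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)


# Synthetic toy data (assumption: purely illustrative, not study data).
rng = np.random.default_rng(0)
n_individuals, n_loci = 400, 50
genotypes = rng.integers(0, 3, size=(n_individuals, n_loci))  # allele counts
weights = rng.normal(0.0, 0.2, size=n_loci)                   # hypothetical effects

prs = polygenic_risk_score(genotypes, weights)

# Simulate disease status from a logistic model on the centered score.
prob = 1.0 / (1.0 + np.exp(-(prs - prs.mean())))
labels = rng.random(n_individuals) < prob

print(f"toy AUC: {auc(prs, labels):.2f}")
```

The pairwise formulation of the AUC is quadratic in sample size, which is acceptable for a toy example; for larger data sets a rank-based computation or an established library routine would be the more usual choice.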