Xu Yi, Cao Liyu, Zhao Xinyi, Yao Yinghao, Liu Qiang, Zhang Bin, Wang Yan, Mao Ying, Ma Yunlong, Ma Jennie Z, Payne Thomas J, Li Ming D, Li Lanjuan
State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China.
Department of Public Health Sciences, University of Virginia, Charlottesville, VA, United States.
Front Psychiatry. 2020 May 14;11:416. doi: 10.3389/fpsyt.2020.00416. eCollection 2020.
Smoking is a complex behavior with a heritability as high as 50%. Such a large genetic contribution creates an opportunity: by predicting inherited predisposition from genomic profiles, individuals who are susceptible to smoking dependence could be identified and kept from ever starting to smoke. Although previous studies have identified many susceptibility variants for smoking, these variants have limited power to predict smoking behavior. We applied support vector machine (SVM) and random forest (RF) methods to build prediction models for smoking behavior. We first built the models with 10-fold cross-validation on 1,431 smokers and 1,503 non-smokers of African origin, and then tested them on an independent dataset of 213 smokers and 224 non-smokers. The SVM model built on the top 500 single nucleotide polymorphisms (SNPs), with logistic regression (p<0.01) as the feature selection method, achieved areas under the curve (AUCs) of 0.691, 0.721, and 0.720 for the training, test, and independent test samples, respectively. The RF model built on the top 500 SNPs selected by logistic regression (p<0.01) achieved AUCs of 0.671, 0.665, and 0.667 for the training, test, and independent test samples, respectively. Finally, we combined logistic (p<0.01) and LASSO (λ=10) regression for feature selection and used the SVM algorithm for model building. The resulting SVM model with the top 500 SNPs achieved AUCs of 0.756, 0.776, and 0.897 for the training, test, and independent test samples, respectively. We conclude that machine learning methods are a promising means of building predictive models for smoking.
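As a reading aid for the evaluation terms used in the abstract, the following is a minimal Python sketch of how an AUC and a 10-fold split can be computed. It is not the authors' code: the function names `roc_auc` and `k_fold_indices`, the random seed, and the unstratified shuffling are all assumptions made for illustration, and the abstract does not state how the folds were actually drawn.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen smoker (label 1)
    receives a higher score than a randomly chosen non-smoker (label 0).
    Tied scores count as one half. Hypothetical helper for illustration."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample from each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds.
    Assumption: plain (unstratified) shuffling with a fixed seed."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


if __name__ == "__main__":
    # Toy scores only; these are not data from the study.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.2, 0.3, 0.1]))
    # 1,431 smokers + 1,503 non-smokers = 2,934 training samples.
    folds = k_fold_indices(1431 + 1503, k=10)
    print([len(f) for f in folds])
```

In each cross-validation round, one fold would be held out for scoring while a model is fit on the other nine. An AUC of 0.5 corresponds to chance-level ranking of smokers against non-smokers, and 1.0 to perfect ranking.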