Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA.
J Am Med Inform Assoc. 2014 Oct;21(e2):e312-9. doi: 10.1136/amiajnl-2013-002358. Epub 2014 Apr 15.
The objective of this investigation is to evaluate binary prediction methods for predicting disease status using high-dimensional genomic data. The central hypothesis is that the Bayesian network (BN)-based method called efficient Bayesian multivariate classifier (EBMC) will do well at this task because EBMC builds on BN-based methods that have performed well at learning epistatic interactions.
We evaluate how well eight methods perform binary prediction using high-dimensional discrete genomic datasets containing epistatic interactions. The methods are naive Bayes (NB), model averaging NB (MANB), feature selection NB (FSNB), EBMC, logistic regression (LR), support vector machines (SVM), Lasso, and extreme learning machines (ELM). Our evaluation uses one hundred simulated datasets of 1000 single nucleotide polymorphisms (SNPs) each, ten simulated 10,000-SNP datasets, six semi-synthetic datasets, and two real genome-wide association study (GWAS) datasets.
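To make the notion of a discrete SNP dataset with an epistatic-like interaction concrete, the following is a minimal sketch, not the simulation procedure used in the study. The function name `simulate_epistatic_snps`, the penetrance values, the minor allele frequency, and the choice of causal loci are all assumptions made for illustration. Unlike a strictly pure epistasis model, this simple penetrance scheme also induces marginal effects at each causal locus.

```python
import random

def simulate_epistatic_snps(n_samples=400, n_snps=50, causal=(3, 7),
                            maf=0.3, seed=0):
    """Return (X, y): genotypes coded 0/1/2 and a binary disease status.

    Hypothetical model: disease risk is elevated only when BOTH causal
    SNPs carry at least one minor allele (an epistatic-like interaction).
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        # Genotype = number of minor alleles, two independent draws per SNP.
        row = [(rng.random() < maf) + (rng.random() < maf)
               for _ in range(n_snps)]
        a, b = row[causal[0]], row[causal[1]]
        # Assumed penetrance values, chosen only for illustration.
        risk = 0.8 if (a > 0 and b > 0) else 0.1
        X.append(row)
        y.append(1 if rng.random() < risk else 0)
    return X, y

X, y = simulate_epistatic_snps()

# Compare case rates between joint carriers and everyone else.
joint = [label for row, label in zip(X, y) if row[3] > 0 and row[7] > 0]
other = [label for row, label in zip(X, y) if not (row[3] > 0 and row[7] > 0)]
joint_rate = sum(joint) / len(joint)
other_rate = sum(other) / len(other)
print(f"case rate, joint carriers: {joint_rate:.2f}; others: {other_rate:.2f}")
```

The remaining SNPs are pure noise, which mimics the setting in which only a few of many measured variants are associated with the outcome.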
In fivefold cross-validation, SVM performed best on the 1000-SNP datasets, while the BN-based methods performed best on the other datasets, with EBMC exhibiting the best overall performance. In-sample testing indicated that LR, SVM, Lasso, ELM, and NB tend to overfit the data.
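The contrast between in-sample testing and fivefold cross-validation can be illustrated with a small sketch. This is a plain naive Bayes classifier with Laplace smoothing written for illustration only; it is not the authors' implementation of NB or of any other method evaluated here. The data are assumed pure noise (random genotypes, random labels), so any accuracy well above chance on the training data reflects overfitting. All function names and parameters are hypothetical.

```python
import math
import random

def fit_nb(X, y, n_values=3, alpha=1.0):
    """Fit a naive Bayes model on discrete features with Laplace smoothing."""
    n_feat = len(X[0])
    counts = {c: [[alpha] * n_values for _ in range(n_feat)] for c in (0, 1)}
    class_n = {0: 0, 1: 0}
    for row, label in zip(X, y):
        class_n[label] += 1
        for j, v in enumerate(row):
            counts[label][j][v] += 1
    model = {}
    for c in (0, 1):
        prior = math.log((class_n[c] + alpha) / (len(y) + 2 * alpha))
        loglik = [[math.log(v / sum(col)) for v in col] for col in counts[c]]
        model[c] = (prior, loglik)
    return model

def predict_nb(model, row):
    """Return the class with the highest log posterior score."""
    scores = {c: prior + sum(ll[j][v] for j, v in enumerate(row))
              for c, (prior, ll) in model.items()}
    return max(scores, key=scores.get)

def accuracy(model, X, y):
    return sum(predict_nb(model, r) == t for r, t in zip(X, y)) / len(y)

def k_fold_cv_accuracy(X, y, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for held in folds:
        held_set = set(held)
        train = [i for i in idx if i not in held_set]
        model = fit_nb([X[i] for i in train], [y[i] for i in train])
        accs.append(accuracy(model, [X[i] for i in held], [y[i] for i in held]))
    return sum(accs) / k

# Assumed toy data: 100 samples, 200 noise SNPs, labels unrelated to genotypes.
rng = random.Random(1)
X = [[rng.randrange(3) for _ in range(200)] for _ in range(100)]
y = [rng.randrange(2) for _ in range(100)]

in_sample_acc = accuracy(fit_nb(X, y), X, y)
cv_acc = k_fold_cv_accuracy(X, y)
print(f"in-sample accuracy: {in_sample_acc:.2f}, fivefold CV accuracy: {cv_acc:.2f}")
```

Because the labels carry no signal, the cross-validated accuracy should hover near chance, while the in-sample accuracy is inflated by the many noise features. That gap is the overfitting pattern the in-sample tests are designed to expose.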
EBMC performed better than NB when there were several strong predictors, whereas NB performed better when there were many weak predictors. Furthermore, for all BN-based methods, prediction capability did not degrade as the dimension increased.
Our results support the hypothesis that EBMC performs well at binary outcome prediction using high-dimensional discrete datasets containing epistatic-like interactions. Future research using more GWAS datasets is needed to further investigate the potential of EBMC.