Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.
BMC Bioinformatics. 2011 Mar 31;12:89. doi: 10.1186/1471-2105-12-89.
Gene-gene epistatic interactions likely play an important role in the genetic basis of many common diseases. Recently, machine-learning and data mining methods have been developed for learning epistatic relationships from data. A well-known combinatorial method that has been successfully applied to detecting epistasis is Multifactor Dimensionality Reduction (MDR). Jiang et al. created a combinatorial epistasis learning method called BNMBL to learn Bayesian network (BN) epistatic models. They compared BNMBL to MDR using simulated data sets. Each of these data sets was generated from a model that associates two SNPs with a disease and includes 18 unrelated SNPs. For each data set, BNMBL and MDR were used to score all 2-SNP models, and BNMBL learned significantly more correct models. In real data sets, however, we ordinarily do not know how many SNPs influence the phenotype, and BNMBL may not perform as well if models containing more than two SNPs are also scored. Furthermore, a number of other BN scoring criteria have been developed, and they may detect epistatic interactions even better than BNMBL. Although BNs are a promising tool for learning epistatic relationships from data, we cannot confidently use them in this domain until we determine which scoring criteria work best, or even well, when we try to learn the correct model without knowing the number of SNPs in that model.
We evaluated the performance of 22 BN scoring criteria using 28,000 simulated data sets and a real Alzheimer's GWAS data set. Our results were surprising in that the Bayesian scoring criterion with large values of a hyperparameter called α performed best. This score performed better than other BN scoring criteria and MDR at recall using simulated data sets, at detecting the hardest-to-detect models using simulated data sets, and at substantiating previous results using the real Alzheimer's data set.
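As a rough illustration of what scoring candidate SNP models with a Bayesian criterion can look like, the sketch below scores a disease node whose parents are a chosen set of SNPs, then ranks all SNP subsets up to a given size. The abstract does not give the formula, so this assumes the standard BDeu form of the Bayesian score, in which the hyperparameter α is an equivalent sample size spread evenly over the parameters. The function names, the 0/1/2 genotype coding, and the binary phenotype are assumptions made for the example, not details taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import lgamma


def bdeu_log_score(genotypes, phenotype, snps, alpha=1.0, n_geno=3, n_pheno=2):
    """Log BDeu score of a disease node with the given SNPs as parents.

    genotypes: list of rows, each a sequence of genotype codes in 0..n_geno-1
    phenotype: list of disease states in 0..n_pheno-1
    snps:      tuple of column indices used as parents (hypothetical layout)
    """
    q = n_geno ** len(snps)          # number of parent configurations
    a_j = alpha / q                  # prior mass per parent configuration
    a_jk = alpha / (q * n_pheno)     # prior mass per (configuration, state)

    joint = Counter()
    parent = Counter()
    for row, d in zip(genotypes, phenotype):
        cfg = tuple(row[s] for s in snps)
        joint[(cfg, d)] += 1
        parent[cfg] += 1

    # Configurations with no observations contribute exactly 0, so only
    # observed configurations need to be visited.
    score = 0.0
    for cfg, n_j in parent.items():
        score += lgamma(a_j) - lgamma(a_j + n_j)
        for d in range(n_pheno):
            score += lgamma(a_jk + joint[(cfg, d)]) - lgamma(a_jk)
    return score


def rank_models(genotypes, phenotype, max_size=2, alpha=1.0):
    """Score every SNP subset of size 1..max_size and sort best first."""
    n_snps = len(genotypes[0])
    scored = []
    for size in range(1, max_size + 1):
        for snps in combinations(range(n_snps), size):
            scored.append((bdeu_log_score(genotypes, phenotype, snps, alpha), snps))
    return sorted(scored, reverse=True)
```

Raising `max_size` above 2 mimics the setting the abstract is concerned with, where the number of SNPs in the true model is not known in advance and models of different sizes compete on the same score.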
We conclude that representing epistatic interactions using BN models and scoring them using a BN scoring criterion holds promise for identifying epistatic genetic variants in data. In particular, the Bayesian scoring criterion with large values of a hyperparameter α appears more promising than a number of alternatives.