Comparison of approaches for machine-learning optimization of neural networks for detecting gene-gene interactions in genetic epidemiology.

Authors

Motsinger-Reif Alison A, Dudek Scott M, Hahn Lance W, Ritchie Marylyn D

Affiliation

Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, USA.

Publication

Genet Epidemiol. 2008 May;32(4):325-40. doi: 10.1002/gepi.20307.

DOI:10.1002/gepi.20307

PMID:18265411

Abstract

The detection of genotypes that predict common, complex disease is a challenge for human geneticists. The phenomenon of epistasis, or gene-gene interaction, is particularly problematic for traditional statistical techniques. Additionally, the explosion of genetic information makes exhaustive searches of multilocus combinations computationally infeasible. To address these challenges, neural networks (NN), a pattern recognition method, have been used. One limitation of the NN approach is that its success depends on the architecture of the network. To address this limitation, machine-learning approaches have been suggested to evolve the best NN architecture for a particular data set. In this study we provide a detailed technical description of the use of grammatical evolution to optimize neural networks (GENN) for use in genetic association studies. We compare the performance of GENN to that of a previous machine-learning NN application, genetic programming neural networks, in both simulated and real data. We show that GENN greatly outperforms genetic programming neural networks in data sets with a large number of single nucleotide polymorphisms. Additionally, we demonstrate that GENN has high power to detect disease-risk loci in a range of high-order epistatic models. Finally, we demonstrate the scalability of the GENN method with increasing numbers of variables, up to 500,000 single nucleotide polymorphisms.
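To give a rough sense of why exhaustive multilocus searches become infeasible at the scale mentioned in the abstract, the short sketch below counts the number of distinct SNP combinations of a given order. The function name `multilocus_combinations` and the choice of interaction orders 2 to 4 are illustrative assumptions, not taken from the paper; only the figure of 500,000 SNPs comes from the abstract.

```python
from math import comb


def multilocus_combinations(n_snps: int, order: int) -> int:
    """Return the number of distinct SNP subsets of size `order`.

    Hypothetical helper for illustration: it counts unordered
    combinations, i.e. the binomial coefficient C(n_snps, order).
    """
    return comb(n_snps, order)


if __name__ == "__main__":
    n_snps = 500_000  # upper end of the range mentioned in the abstract
    # Interaction orders 2-4 are an assumption chosen for illustration.
    for order in (2, 3, 4):
        count = multilocus_combinations(n_snps, order)
        print(f"order {order}: {count:,} combinations")
```

Even at order 2 the count is already on the order of 10^11 models, which is the kind of growth that motivates search strategies that do not enumerate every combination.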
