Rodin Andrei S, Boerwinkle Eric
Human Genetics Center, School of Public Health, University of Texas Health Science Center Houston, TX 77030, USA.
Bioinformatics. 2005 Aug 1;21(15):3273-8. doi: 10.1093/bioinformatics/bti505. Epub 2005 May 24.
The wealth of single nucleotide polymorphism (SNP) data within candidate genes and anticipated across the genome poses enormous analytical problems for studies of genotype-to-phenotype relationships, and modern data mining methods may be particularly well suited to meet the swelling challenges. In this paper, we introduce the method of Belief (Bayesian) networks to the domain of genotype-to-phenotype analyses and provide an example application.
A Belief network is a probabilistic graphical model that represents a joint multivariate probability distribution and reflects conditional independences between variables. Given the data, optimal network topology can be estimated with the assistance of heuristic search algorithms and scoring criteria. Statistical significance of edge strengths can be evaluated using Bayesian methods and bootstrapping. As an example application, the method of Belief networks was applied to 20 SNPs in the apolipoprotein (apo) E gene and plasma apoE levels in a sample of 702 individuals from Jackson, MS. Plasma apoE level was the primary target variable. These analyses indicate that the edge between SNP 4075, coding for the well-known ε2 allele, and plasma apoE level was strong. Belief networks can effectively describe complex uncertain processes and can both learn from data and incorporate prior knowledge.
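The idea of scoring candidate edges and checking their strength by bootstrapping can be illustrated with a minimal sketch. This is not the authors' software or their scoring criterion: it assumes a BIC score (one common choice among several), binary variables, and synthetic data, and every name in it (`bic_family`, `edge_preferred`, `bootstrap_edge_support`, `simulate`, `snp_a`, `snp_b`, `level`) is hypothetical. It compares only two local structures for one target node rather than searching over full network topologies.

```python
import math
import random
from collections import Counter


def bic_family(data, child, parents):
    """BIC contribution of one discrete node given a list of parent nodes."""
    n = len(data)
    # Counts of (parent configuration, child state) and of parent configurations.
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximized log-likelihood of the child's conditional distribution.
    loglik = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint.items())
    # Free parameters: (child states - 1) per observed parent configuration.
    child_states = len({row[child] for row in data})
    n_params = (child_states - 1) * len(parent_counts)
    return loglik - 0.5 * n_params * math.log(n)


def edge_preferred(data, parent, child):
    """True if the edge parent -> child raises the child's BIC score."""
    return bic_family(data, child, [parent]) > bic_family(data, child, [])


def bootstrap_edge_support(data, parent, child, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which the edge is preferred."""
    rng = random.Random(seed)
    n = len(data)
    hits = 0
    for _ in range(n_boot):
        # Resample individuals with replacement, keeping the sample size.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        hits += edge_preferred(sample, parent, child)
    return hits / n_boot


def simulate(n=700, seed=1):
    """Synthetic data (an assumption): snp_a shifts the level, snp_b does not."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        snp_a = int(rng.random() < 0.3)
        snp_b = int(rng.random() < 0.5)
        p_high = 0.7 if snp_a else 0.3
        level = int(rng.random() < p_high)
        rows.append({"snp_a": snp_a, "snp_b": snp_b, "level": level})
    return rows


data = simulate()
support_a = bootstrap_edge_support(data, "snp_a", "level")
support_b = bootstrap_edge_support(data, "snp_b", "level")
print("bootstrap support, snp_a -> level:", support_a)
print("bootstrap support, snp_b -> level:", support_b)
```

In this toy setting, an edge that reflects a real dependence should be selected in nearly all resamples, while an edge to an unrelated variable should be selected only occasionally. A full analysis would wrap a score like this inside a heuristic search over whole network topologies.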
Various alternative and supplemental networks (not given in the text), as well as source code extensions, are available from the authors.