Moore Jason H, Barney Nate, Tsai Chia-Ti, Chiang Fu-Tien, Gui Jiang, White Bill C
Computational Genetics Laboratory, Norris-Cotton Cancer Center, Dartmouth Medical School, Lebanon, NH 03756, USA.
Hum Hered. 2007;63(2):120-33. doi: 10.1159/000099184. Epub 2007 Feb 2.
The workhorse of modern genetic analysis is the parametric linear model. The advantages of the linear modeling framework are many and include a mathematical understanding of the model fitting process and ease of interpretation. However, an important limitation is that linear models make assumptions about the nature of the data being modeled. These assumptions may not be realistic for complex biological systems such as disease susceptibility, where nonlinearities in the genotype-to-phenotype mapping that result from epistasis, plastic reaction norms, locus heterogeneity, and phenocopy, for example, are the norm rather than the exception. We have previously developed a flexible modeling approach called symbolic discriminant analysis (SDA) that makes no assumptions about the patterns in the data. Rather, SDA lets the data dictate the size, shape, and complexity of a symbolic discriminant function that could include any set of mathematical functions from a list of candidates supplied by the user. Here, we outline a new five-step process for symbolic model discovery that uses genetic programming (GP) for coarse-grained stochastic searching, experimental design for parameter optimization, graphical modeling for generating expert knowledge, and estimation of distribution algorithms for fine-grained stochastic searching. Finally, we introduce function mapping as a new method for interpreting symbolic discriminant functions. We show that function mapping, when combined with measures of interaction information, facilitates statistical interpretation by providing a graphical approach to decomposing complex models to highlight synergistic, redundant, and independent effects of polymorphisms and their composite functions. We illustrate this five-step SDA modeling process with a real case-control dataset.
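The distinction between synergistic, redundant, and independent effects can be made concrete with a small calculation. The sketch below is not the authors' implementation. It assumes the common entropy-based definition of interaction information for two attributes A and B and a class C, I(A,B;C) - I(A;C) - I(B;C), where a positive value suggests synergy, a negative value suggests redundancy, and a value near zero suggests independence. The function names and the toy genotype data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_information(a, b, c):
    """I(A,B;C) - I(A;C) - I(B;C).

    Positive: A and B together tell more about C than separately (synergy).
    Negative: A and B carry overlapping information about C (redundancy).
    """
    joint = list(zip(a, b))  # treat the pair (A, B) as one attribute
    return (mutual_information(joint, c)
            - mutual_information(a, c)
            - mutual_information(b, c))


# Hypothetical toy data: case status is the XOR of two binary-coded
# polymorphisms, a purely epistatic pattern with no marginal effects.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
status = [0, 1, 1, 0]

synergy = interaction_information(snp_a, snp_b, status)

# Using the same variable three times gives a fully redundant pattern.
redundancy = interaction_information(snp_a, snp_a, snp_a)

print(f"XOR pattern:       {synergy:+.3f} bits")
print(f"Redundant pattern: {redundancy:+.3f} bits")
```

In the XOR example neither polymorphism is informative about case status on its own, yet the pair determines it completely, so the measure is positive. This is the kind of nonlinear effect that a purely additive linear model would miss.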