Nunkesser Robin, Bernholt Thorsten, Schwender Holger, Ickstadt Katja, Wegener Ingo
Collaborative Research Center 475, Department of Computer Science, University of Dortmund, Dortmund, Germany.
Bioinformatics. 2007 Dec 15;23(24):3280-8. doi: 10.1093/bioinformatics/btm522. Epub 2007 Nov 15.
High-order interactions of single nucleotide polymorphisms (SNPs), rather than individual SNPs, are assumed to be responsible for complex diseases such as cancer. One of the major goals of genetic association studies concerned with such genotype data is therefore the identification of these high-order interactions. The search is further impeded by the fact that these interactions often explain the disease status of only a relatively small subgroup of patients. Unfortunately, most of the feature selection methods proposed in the literature fail at this task: they either identify only individual variables or low-order interactions, or they try to find rules that explain a high percentage of the observations. In this article, we present a procedure based on genetic programming and multi-valued logic that enables the identification of high-order interactions of categorical variables such as SNPs. This method, called GPAS, can be used not only for feature selection but also for discrimination.
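As a rough illustration of what a high-order interaction of categorical variables can look like, the sketch below evaluates a multi-valued logic expression on toy genotype data. It is not the GPAS implementation and contains no genetic programming search. The 0/1/2 genotype coding, the literal forms (SNP equals or does not equal a value), the disjunctive normal form, the toy data and all function names are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: each SNP takes a genotype value 0, 1 or 2.
# A literal is (snp_name, value, negated): "SNP == value" or "SNP != value".
Literal = Tuple[str, int, bool]
Monomial = List[Literal]      # conjunction (AND) of literals
Genotype = Dict[str, int]     # one person's genotypes


def literal_holds(literal: Literal, person: Genotype) -> bool:
    """Evaluate a single multi-valued literal for one person."""
    snp, value, negated = literal
    return (person[snp] == value) != negated


def monomial_holds(monomial: Monomial, person: Genotype) -> bool:
    """A monomial holds if all of its literals hold."""
    return all(literal_holds(lit, person) for lit in monomial)


def dnf_holds(dnf: List[Monomial], person: Genotype) -> bool:
    """A disjunctive normal form holds if at least one monomial holds."""
    return any(monomial_holds(m, person) for m in dnf)


def coverage(dnf: List[Monomial],
             cases: List[Genotype],
             controls: List[Genotype]) -> Tuple[int, int]:
    """Count how many cases and how many controls the expression covers."""
    n_cases = sum(dnf_holds(dnf, p) for p in cases)
    n_controls = sum(dnf_holds(dnf, p) for p in controls)
    return n_cases, n_controls


# Toy data, invented for illustration only.
cases = [
    {"SNP1": 2, "SNP2": 0, "SNP3": 1},
    {"SNP1": 2, "SNP2": 0, "SNP3": 2},
    {"SNP1": 0, "SNP2": 1, "SNP3": 0},
    {"SNP1": 1, "SNP2": 1, "SNP3": 0},
]
controls = [
    {"SNP1": 2, "SNP2": 1, "SNP3": 1},
    {"SNP1": 0, "SNP2": 0, "SNP3": 0},
    {"SNP1": 1, "SNP2": 0, "SNP3": 0},
    {"SNP1": 2, "SNP2": 0, "SNP3": 0},
]

# Third-order interaction: SNP1 == 2 AND SNP2 == 0 AND SNP3 != 0.
rule = [[("SNP1", 2, False), ("SNP2", 0, False), ("SNP3", 0, True)]]

print(coverage(rule, cases, controls))  # (2, 0)
```

In this toy example the rule covers only two of the four cases and none of the controls, which mirrors the situation described above: an interaction can be strongly associated with the disease while explaining only a subgroup of patients.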
In an application to the genotype data from the GENICA study, an association study concerned with sporadic breast cancer, GPAS identifies high-order interactions of SNPs that are not found by other feature selection methods and that lead to a considerably increased breast cancer risk for different subsets of patients. As an application to a subset of the HapMap data shows, GPAS is not restricted to association studies comprising several tens of SNPs, but can also be employed to analyze whole-genome data.
Software can be downloaded from http://ls2-www.cs.uni-dortmund.de/~nunkesser/#Software