Kim Seyoung, Sohn Kyung-Ah, Xing Eric P
School of Computer Science, Carnegie Mellon University, Pittsburgh, USA.
Bioinformatics. 2009 Jun 15;25(12):i204-12. doi: 10.1093/bioinformatics/btp218.
Many complex disease syndromes such as asthma consist of a large number of highly related, rather than independent, clinical phenotypes, raising a new technical challenge in identifying genetic variations associated simultaneously with correlated traits. Although a causal genetic variation may influence a group of highly correlated traits jointly, most of the previous association analyses considered each phenotype separately, or combined results from a set of single-phenotype analyses.
We propose a new statistical framework called graph-guided fused lasso to address this issue in a principled way. Our approach represents the dependency structure among the quantitative traits explicitly as a network, and leverages this trait network to encode structured regularizations in a multivariate regression model over the genotypes and traits, so that the genetic markers that jointly influence subgroups of highly correlated traits can be detected with high sensitivity and specificity. While most traditional methods examine each phenotype independently, our approach analyzes all of the traits jointly in a single statistical model to discover the genetic markers that perturb a subset of correlated traits rather than a single trait. Using simulated datasets based on the HapMap consortium data and an asthma dataset, we compare the performance of our method with single-marker analysis and with other sparse regression methods that do not use any structural information in the traits. Our results show that there is a significant advantage in detecting the true causal single nucleotide polymorphisms when we incorporate the correlation pattern in traits using our proposed methods.
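To make the idea of a trait network acting as a regularizer concrete, the following is a minimal sketch of one plausible form of the penalized objective. It is not the released GFlasso software. The abstract does not state the formula, so everything here is an assumption: a squared-error loss, a lasso penalty weighted by `lam`, a fusion penalty weighted by `gamma` that pulls the coefficients of correlated traits together (sign-adjusted and weighted by |r|), and a trait graph built by thresholding pairwise correlations. The names `trait_graph`, `gflasso_objective`, `lam`, `gamma` and `threshold` are hypothetical.

```python
import numpy as np


def trait_graph(Y, threshold=0.5):
    """Build a trait network by thresholding pairwise trait correlations.

    Y is an (N samples x K traits) matrix. Returns a list of edges
    (m, l, r_ml) for trait pairs with |r_ml| > threshold.
    The thresholding rule is an assumption made for this sketch.
    """
    R = np.corrcoef(Y, rowvar=False)
    K = R.shape[0]
    return [
        (m, l, R[m, l])
        for m in range(K)
        for l in range(m + 1, K)
        if abs(R[m, l]) > threshold
    ]


def gflasso_objective(X, Y, B, edges, lam, gamma):
    """Evaluate an assumed graph-guided fused lasso objective.

    X: (N x J) genotype matrix, Y: (N x K) trait matrix,
    B: (J x K) regression coefficients, one column per trait.
    """
    # Squared-error loss of the multivariate regression Y ~ X B.
    loss = np.sum((Y - X @ B) ** 2)
    # Lasso term: encourages few markers with nonzero effects.
    sparsity = lam * np.sum(np.abs(B))
    # Fusion term: for each edge, penalize differences between the two
    # traits' coefficient columns, flipping sign for negative correlation.
    fusion = gamma * sum(
        abs(r) * np.sum(np.abs(B[:, m] - np.sign(r) * B[:, l]))
        for m, l, r in edges
    )
    return loss + sparsity + fusion


if __name__ == "__main__":
    # Toy data only: random genotypes coded 0/1/2 and three traits,
    # the first two driven by the same marker.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(50, 10)).astype(float)
    B_true = np.zeros((10, 3))
    B_true[2, :2] = 1.0
    Y = X @ B_true + 0.1 * rng.standard_normal((50, 3))
    edges = trait_graph(Y)
    print(gflasso_objective(X, Y, B_true, edges, lam=1.0, gamma=1.0))
```

Under these assumptions, a marker whose effects on two linked traits are equal incurs no fusion penalty, which is what lets the model favor markers that act jointly on a correlated subgroup of traits. Minimizing the objective requires a solver that is outside the scope of this sketch.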
Software for GFlasso is available at http://www.sailing.cs.cmu.edu/gflasso.html.