Ho Chiu Man, Hsu Stephen D H
Department of Physics and Astronomy, Michigan State University, 567 Wilson Road, East Lansing, 48824 MI USA.
Gigascience. 2015 Sep 14;4:44. doi: 10.1186/s13742-015-0081-6. eCollection 2015.
One of the fundamental problems of modern genomics is to extract the genetic architecture of a complex trait from a data set of individual genotypes and trait values. Establishing this important connection between genotype and phenotype is complicated by the large number of candidate genes, the potentially large number of causal loci, and the likely presence of some nonlinear interactions between different genes. Compressed Sensing methods obtain solutions to under-constrained systems of linear equations. These methods can be applied to the problem of determining the best model relating genotype to phenotype, and generally deliver better performance than simply regressing the phenotype against each genetic variant, one at a time. We introduce a Compressed Sensing method that can reconstruct nonlinear genetic models (i.e., including epistasis, or gene-gene interactions) from phenotype-genotype (GWAS) data. Our method uses L1-penalized regression applied to nonlinear functions of the sensing matrix.
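As a rough illustration of the last sentence above, the sketch below augments a genotype matrix with pairwise products (one simple choice of nonlinear functions of the sensing matrix) and fits an L1-penalized regression to the expanded matrix. It is not the authors' implementation. The function names, the simulated trait, the penalty value `lam`, and the coordinate-descent solver are all assumptions made for illustration, and only the Python standard library is used.

```python
import random
from itertools import combinations


def expand_features(genotypes):
    """Append all pairwise products g_i * g_j (i < j) to each genotype row."""
    expanded = []
    for row in genotypes:
        pairs = [row[i] * row[j] for i, j in combinations(range(len(row)), 2)]
        expanded.append(list(row) + pairs)
    return expanded


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; this is what produces sparsity."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 on centered data."""
    n, p = len(X), len(X[0])
    cols = []
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        cols.append([v - mean for v in col])  # centering absorbs the intercept
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]  # residual for b = 0
    col_sq = [sum(v * v for v in c) / n for c in cols]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            c = cols[j]
            rho = sum(c[i] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * c[i]
                beta[j] = new
    return beta


def simulate_and_fit(n_individuals=60, n_loci=12, lam=0.1, seed=0):
    """Hypothetical toy trait: two additive loci plus one pairwise interaction."""
    rng = random.Random(seed)
    genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_loci)]
                 for _ in range(n_individuals)]
    phenotype = [1.0 * g[0] - 1.5 * g[3] + 2.0 * g[1] * g[5] + rng.gauss(0.0, 0.1)
                 for g in genotypes]
    names = [f"g{i}" for i in range(n_loci)]
    names += [f"g{i}*g{j}" for i, j in combinations(range(n_loci), 2)]
    beta = lasso_coordinate_descent(expand_features(genotypes), phenotype, lam)
    return names, beta
```

With the defaults, the expanded matrix has 78 columns (12 loci plus 66 pairs) but only 60 rows, so the system is under-constrained and the L1 penalty is what selects a sparse solution. Calling `names, beta = simulate_and_fit()` and listing the entries where `beta` is nonzero shows which linear and interaction terms were selected; whether the three simulated terms are recovered depends on `lam`, the noise level, and the sample size.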
The computational and data resource requirements for our method are similar to those necessary for reconstruction of linear genetic models (or identification of gene-trait associations), assuming a condition of generalized sparsity, which limits the total number of gene-gene interactions. An example of a sparse nonlinear model is one in which a typical locus interacts with several or even many others, but only a small subset of all possible interactions exist. It seems plausible that most genetic architectures fall in this category. We give theoretical arguments suggesting that the method is nearly optimal in performance, and demonstrate its effectiveness on broad classes of nonlinear genetic models using simulated human genomes and the small amount of currently available real data. A phase transition (i.e., dramatic and qualitative change) in the behavior of the algorithm indicates when sufficient data is available for its successful application.
Our results indicate that predictive models for many complex traits, including a variety of human disease susceptibilities (e.g., with additive heritability h² ∼ 0.5), can be extracted from data sets composed of n* ∼ 100 × s individuals, where s is the number of distinct causal variants influencing the trait. For example, given a trait controlled by ∼10k loci, roughly a million individuals would be sufficient for application of the method.
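The sample-size estimate above is simple arithmetic, sketched here for convenience. The function name and the default `factor` are illustrative assumptions; the factor of 100 is the rough scaling quoted for traits with additive heritability near 0.5 and should not be read as an exact constant.

```python
def required_sample_size(n_causal_variants, factor=100):
    """Rough estimate n* ~ factor * s of the individuals needed."""
    if n_causal_variants < 0:
        raise ValueError("number of causal variants must be non-negative")
    return factor * n_causal_variants


# The example from the text: a trait controlled by about 10k loci.
estimate = required_sample_size(10_000)
```

Here `estimate` equals 1,000,000, matching the "roughly a million individuals" figure.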