Fusi Nicolo, Listgarten Jennifer
Microsoft Research, Cambridge, Massachusetts.
J Comput Biol. 2017 Jun;24(6):524-535. doi: 10.1089/cmb.2016.0174. Epub 2017 Jan 5.
Genome-wide association studies commonly examine one trait at a time. Occasionally they examine several related traits in the hope of increasing power; in such a setting, the traits generally do not vary smoothly along any axis such as time or space. Function-valued traits, however, often do vary smoothly along the axis of interest. For instance, for longitudinal traits such as growth curves, the axis of interest is time; for spatially varying traits such as chromatin accessibility, it is position along the genome. Although there have been efforts to perform genome-wide association studies with such function-valued traits, the statistical approaches developed for this purpose often have limitations, such as requiring the trait to behave linearly in time or space, or constraining the genetic effect itself to be constant or linear in time. Herein, we present a flexible model for this problem, the Partitioned Gaussian Process, which removes many such limitations and is especially effective as the number of time points increases. The theoretical basis of this model provides machinery for handling missing and unaligned function values, such as would occur when not all individuals are measured at the same time points. Furthermore, we make use of algebraic refactorizations to reduce the time complexity of our model substantially below that of the naive implementation. Finally, we apply our approach and several others to synthetic data before closing with some directions for improved modeling and statistical testing.
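The abstract's point about missing and unaligned function values can be illustrated with ordinary Gaussian process regression. The sketch below is not the Partitioned Gaussian Process itself, whose details the abstract does not give. It is a generic, dependency-free example, and the squared-exponential kernel, the hyperparameter values, the toy measurements, and all function names (`rbf`, `cholesky`, `solve_spd`, `gp_posterior_mean`) are assumptions made for illustration. It shows only why unaligned sampling is unproblematic for a Gaussian process: the covariance is defined for any pair of time points, so each individual's curve can be evaluated on a common grid.

```python
import math

def rbf(s, t, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two time points (assumed kernel)."""
    return signal_var * math.exp(-0.5 * ((s - t) / length_scale) ** 2)

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def solve_spd(a, b):
    """Solve a x = b for symmetric positive-definite a via its Cholesky factor."""
    low = cholesky(a)
    n = len(b)
    z = [0.0] * n
    for i in range(n):  # forward substitution: low z = b
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: low^T x = z
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def gp_posterior_mean(t_obs, y_obs, t_new, noise_var=1e-2, **kernel_args):
    """Zero-mean GP posterior mean at t_new given noisy observations at t_obs."""
    n = len(t_obs)
    # Covariance among observed time points, with noise added on the diagonal.
    k = [[rbf(t_obs[i], t_obs[j], **kernel_args) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve_spd(k, y_obs)
    # Cross-covariance between each new time point and the observed ones.
    return [sum(rbf(t, t_obs[i], **kernel_args) * alpha[i] for i in range(n))
            for t in t_new]

# Hypothetical toy data: two individuals measured at different time points.
measurements = {
    "a": ([0.0, 1.0, 2.5, 4.0], [0.1, 0.8, 1.9, 2.4]),
    "b": ([0.5, 3.0], [0.3, 2.0]),
}
common_grid = [0.0, 1.0, 2.0, 3.0, 4.0]

# Each individual's curve is evaluated on the same grid despite unaligned sampling.
curves = {name: gp_posterior_mean(t, y, common_grid, length_scale=1.5)
          for name, (t, y) in measurements.items()}
```

In practice the linear algebra would normally be delegated to a numerical library. The cubic cost of the solve in the number of observed time points is the kind of expense that algebraic refactorizations, such as those the abstract mentions, aim to reduce.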