Fore Ruby, Boehme Jaden, Li Kevin, Westra Jason, Tintle Nathan
Department of Biostatistics, Brown University, Providence, RI, United States.
Department of Mathematics, Oregon State University, Corvallis, OR, United States.
Front Genet. 2020 Nov 9;11:591606. doi: 10.3389/fgene.2020.591606. eCollection 2020.
Gene-based tests of association (e.g., variance components and burden tests) are now common practice for analyses attempting to elucidate the contribution of rare genetic variants to common disease. As sequencing datasets continue to grow in size, the number of variants within each set (e.g., gene) being tested also continues to grow. Pathway-based methods have been used to allow for the initial aggregation of gene-based statistical evidence and then the subsequent aggregation of evidence across the pathway. This "multi-set" approach (first a gene-based test, followed by a pathway-based test) has not been thoroughly explored for evaluating genotype-phenotype associations in the age of large, sequenced datasets. In particular, we ask whether there are statistical and biological characteristics that make the multi-set approach optimal compared with simply performing all gene-based tests. In this paper, we provide an intuitive framework for evaluating these questions and use simulated data to affirm this intuition. A real data application is provided demonstrating how our insights manifest themselves in practice. Ultimately, we find that when initial subsets are biologically informative (e.g., tending to aggregate causal genetic variants within one or more subsets, often genes), multi-set strategies can improve statistical power, with particular gains in cases where causal variants are aggregated in subsets with fewer variants overall (a high proportion of causal variants in the subset). However, we find that there is little advantage when the sets are non-informative (a similar proportion of causal variants in each subset). Our application to real data further demonstrates this intuition. In practice, we recommend wider use of pathway-based methods and further exploration of optimal ways of aggregating variants into subsets based on emerging biological evidence of the genetic architecture of complex disease.
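As a rough illustration of the second stage of a multi-set strategy, the sketch below combines per-gene p-values into a single pathway-level p-value. It assumes Fisher's combination method and independent gene-level tests; the abstract does not state which combination rule the authors use, so this is a stand-in rather than their procedure. The function name `fisher_combine`, the gene labels, and the p-values are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumed choice).

    Under the null, -2 * sum(log p_i) follows a chi-square distribution with
    2k degrees of freedom, where k is the number of p-values.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    # Half of Fisher's statistic: x / 2 = -sum(log p_i)
    half_stat = -sum(math.log(p) for p in p_values)

    # Chi-square survival function for even degrees of freedom 2k:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


# Hypothetical gene-level p-values (e.g., from burden tests) in one pathway
gene_p = {"GENE_A": 0.04, "GENE_B": 0.20, "GENE_C": 0.65}
pathway_p = fisher_combine(list(gene_p.values()))
print(f"pathway-level p-value: {pathway_p:.4f}")
```

With a single gene the combined p-value equals the gene-level p-value, so the pathway step only changes the result when evidence is spread across several subsets. Correlated gene-level tests would violate the independence assumption made here.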