Loehlein Fier Heide, Prokopenko Dmitry, Hecker Julian, Cho Michael H, Silverman Edwin K, Weiss Scott T, Tanzi Rudolph E, Lange Christoph
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America.
Working Group of Genomic Mathematics, University of Bonn, Bonn, Germany.
Genet Epidemiol. 2017 May;41(4):332-340. doi: 10.1002/gepi.22040. Epub 2017 Mar 20.
For the association analysis of whole-genome sequencing (WGS) studies, we propose an efficient and fast spatial-clustering algorithm. Existing analysis approaches for WGS data define the tested regions by either sliding or consecutive windows of fixed size along the variants. Grouping nearby variants into meaningful consecutive regions has two advantages: compared with sliding-window approaches, the number of tested regions is likely to be smaller, and compared with consecutive fixed-window approaches, nearby variants are more likely to be grouped together. Given existing biological evidence that disease-associated mutations tend to cluster physically in specific regions along the chromosome, identifying meaningful groups of nearby variants could thus yield a power gain for association analysis. Our algorithm defines consecutive genomic regions based on the physical positions of the variants, assuming an inhomogeneous Poisson process, and groups nearby variants together. Because parameters are estimated locally, the algorithm takes the varying variant density along the chromosome into account and provides a locally optimal partitioning of variants into consecutive regions. We discuss the theoretical advances of our algorithm over existing window-based approaches and demonstrate its performance and advantages in a simulation study and in an application to Alzheimer's disease WGS data. Our analysis identifies a region in the ITGB3 gene that potentially harbors disease susceptibility loci for Alzheimer's disease. The region-based association signal of ITGB3 replicates in an independent data set and formally achieves genome-wide significance. Software Implementation: An implementation of the algorithm in R is available at https://github.com/heidefier/cluster_wgs_data.
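The idea of position-based grouping with locally estimated parameters can be illustrated with a small sketch. This is not the authors' algorithm (their R implementation lives in the repository named above). It is a simplified stand-in that assumes gaps between neighbouring variants are locally exponential, as they would be under a Poisson process with a locally constant rate, and starts a new region whenever a gap is improbably large given the gaps around it. The function name `cluster_variants`, the parameters `half_window` and `alpha`, and the median-based rate estimate are all assumptions made for illustration.

```python
import math
from statistics import median
from typing import List


def cluster_variants(
    positions: List[int], half_window: int = 5, alpha: float = 0.05
) -> List[List[int]]:
    """Split variant positions into consecutive regions (illustrative only).

    A new region starts when the gap to the previous variant is improbably
    large under an exponential model whose rate is estimated from the
    neighbouring gaps only, so the threshold adapts to local variant density.
    """
    if not positions:
        return []
    pos = sorted(positions)
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    regions = [[pos[0]]]
    for i, gap in enumerate(gaps):
        # Neighbouring gaps on both sides, excluding the gap under test.
        lo = max(0, i - half_window)
        hi = min(len(gaps), i + half_window + 1)
        neighbours = gaps[lo:i] + gaps[i + 1:hi]
        tail_prob = 1.0
        if neighbours:
            local_median = median(neighbours)
            if local_median > 0:
                # For an exponential distribution, median = ln(2) / rate.
                # The median is used here because it is robust to the
                # occasional very large gap between clusters (assumption).
                rate = math.log(2) / local_median
                # P(gap >= observed) under the local exponential model.
                tail_prob = math.exp(-rate * gap)
        if tail_prob < alpha:
            regions.append([])  # improbably large gap: open a new region
        regions[-1].append(pos[i + 1])
    return regions


if __name__ == "__main__":
    toy_positions = [100, 110, 125, 130, 5000, 5010, 5020, 5035,
                     20000, 20100, 20200, 20300]
    for region in cluster_variants(toy_positions):
        print(region[0], region[-1], len(region))
```

On this toy input the sketch returns three regions. The last one is sparser (gaps of 100 rather than about 10) but is not split further, because each gap is judged against its local neighbourhood rather than a single chromosome-wide threshold. A fixed-size window, by contrast, would apply the same width to dense and sparse stretches alike.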