Department of Statistical and Actuarial Sciences, Western University, London, Ontario, Canada.
Department of Biology, Western University, London, Ontario, Canada.
PLoS One. 2018 Sep 25;13(9):e0204156. doi: 10.1371/journal.pone.0204156. eCollection 2018.
Mutation cluster analysis is critical for understanding certain mutational mechanisms relevant to genetic disease, diversity, and evolution. Yet whole-genome sequencing to detect mutation clusters is prohibitively expensive for most organisms and population surveys. Single nucleotide polymorphism (SNP) genotyping arrays, such as the Mouse Diversity Genotyping Array, offer a low-cost alternative: they screen for mutations at hundreds of thousands of loci across the genome, using experimental designs that permit capture of de novo mutations in any tissue. Formal statistical tools for genome-wide detection of mutation clusters under a microarray probe sampling system have yet to be established. A challenge in developing such methods is that microarray detection of mutation clusters is restricted to the select SNP loci captured by probes on the array. This paper develops a Monte Carlo framework for cluster testing and assesses test statistics, motivated by and incorporating the array design, for capturing potential deviations from spatial randomness. Null distributions of the test statistics are established under spatial randomness via the homogeneous Poisson process, while their power is evaluated under postulated types of Neyman-Scott clustering processes through Monte Carlo simulation. A new statistic is developed and recommended as a screening tool for mutation cluster detection. The statistic is shown to have excellent robustness and power, and to be useful for cluster analysis in settings with missing data. It can also be generalized to any one-dimensional system where every site is observed, such as DNA sequencing data. The paper illustrates how informal graphical tools for detecting clusters may be misleading. The statistic is used to find clusters of putative SNP differences in a mixture of different mouse genetic backgrounds, and clusters of de novo SNP differences arising between tissues with development and carcinogenesis.
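The Monte Carlo framework described above can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not define the recommended statistic, so `adjacent_pair_stat` (a count of neighbouring probe pairs that are both mutated) is a hypothetical stand-in, and all function and parameter names are assumptions made for illustration. The null distribution is simulated by conditioning on the observed number of mutated probes and assuming that, under spatial randomness, every subset of probes of that size is equally likely. The Neyman-Scott alternative is simplified by snapping each simulated offspring mutation to its nearest probe.

```python
import numpy as np

def adjacent_pair_stat(mutated):
    """Stand-in statistic (not the paper's): number of neighbouring
    probe pairs, in genomic order, that are both mutated."""
    m = np.asarray(mutated, dtype=bool)
    return int(np.sum(m[:-1] & m[1:]))

def monte_carlo_pvalue(mutated, n_sim=999, rng=None):
    """Monte Carlo p-value under spatial randomness on the probe set.
    Assumption: given k mutated probes, all k-subsets are equally likely."""
    rng = np.random.default_rng(rng)
    m = np.asarray(mutated, dtype=bool)
    observed = adjacent_pair_stat(m)
    k = int(m.sum())
    exceed = 0
    for _ in range(n_sim):
        sim = np.zeros(m.size, dtype=bool)
        sim[rng.choice(m.size, size=k, replace=False)] = True
        exceed += adjacent_pair_stat(sim) >= observed
    # Add-one correction so the p-value is never exactly zero.
    return (1 + exceed) / (n_sim + 1)

def simulate_neyman_scott(probe_pos, n_parents, mean_offspring, spread, rng=None):
    """Simplified Neyman-Scott clustering observed through the array:
    uniform parents, Poisson offspring counts, Gaussian displacements,
    each offspring assigned to its nearest probe (a simplification)."""
    rng = np.random.default_rng(rng)
    pos = np.asarray(probe_pos, dtype=float)  # must be sorted ascending
    parents = rng.uniform(pos[0], pos[-1], size=n_parents)
    counts = rng.poisson(mean_offspring, size=n_parents)
    offspring = np.repeat(parents, counts) + rng.normal(0.0, spread, size=counts.sum())
    idx = np.clip(np.searchsorted(pos, offspring), 1, pos.size - 1)
    left_closer = (offspring - pos[idx - 1]) < (pos[idx] - offspring)
    idx = idx - left_closer.astype(int)
    mutated = np.zeros(pos.size, dtype=bool)
    mutated[idx] = True
    return mutated

def estimate_power(probe_pos, n_rep=200, alpha=0.05, rng=None, **ns_kwargs):
    """Fraction of simulated clustered data sets rejected at level alpha."""
    rng = np.random.default_rng(rng)
    rejections = 0
    for _ in range(n_rep):
        mutated = simulate_neyman_scott(probe_pos, rng=rng, **ns_kwargs)
        rejections += monte_carlo_pvalue(mutated, n_sim=199, rng=rng) <= alpha
    return rejections / n_rep
```

For example, with sorted probe positions in `pos`, calling `estimate_power(pos, n_parents=5, mean_offspring=6, spread=500.0)` would estimate how often this stand-in statistic detects that particular clustering scenario. Comparing such estimates across candidate statistics and scenarios mirrors, in spirit, the kind of power assessment the paper reports.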