Beissinger Timothy M, Rosa Guilherme J M, Kaeppler Shawn M, Gianola Daniel, de Leon Natalia
Department of Plant Sciences, University of California, Davis, 95616, USA.
Department of Animal Sciences, University of Wisconsin, Madison, 53706, USA.
Genet Sel Evol. 2015 Apr 17;47(1):30. doi: 10.1186/s12711-015-0105-9.
High-density genomic data are often analyzed by combining information over windows of adjacent markers. Interpreting data grouped in windows, rather than at individual locations, may increase statistical power, simplify computation, reduce sampling noise, and reduce the total number of tests performed. However, use of adjacent marker information can result in over- or under-smoothing, undesirable window boundary specifications, or highly correlated test statistics. We introduce a method for defining windows based on statistically guided breakpoints in the data, as a foundation for the analysis of multiple adjacent data points. This method involves first fitting a cubic smoothing spline to the data and then identifying the inflection points of the fitted spline, which serve as the boundaries of adjacent windows. This technique does not require prior knowledge of linkage disequilibrium, and therefore can be applied to data collected from individual or pooled sequencing experiments. Moreover, in contrast to existing methods, an arbitrary choice of window size is not necessary, since window sizes are determined empirically and allowed to vary along the genome.
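The two steps described above (fit a cubic smoothing spline, then use its inflection points as window boundaries) can be illustrated with a minimal pure-Python sketch. This is not the GenWin implementation or its API: the function names `spline_second_derivs`, `inflection_points`, and `windows_from_breaks` are hypothetical, the smoothing parameter `lam` is fixed by hand as an assumption, and the input is a noise-free toy signal rather than FST data. The sketch uses the standard natural cubic smoothing spline formulation, in which the second derivative of the fitted curve is piecewise linear between knots, so inflection points are the zero crossings of that piecewise-linear function.

```python
import math

def solve(a, b):
    """Solve the dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def spline_second_derivs(x, y, lam):
    """Second derivatives at the knots of the natural cubic smoothing spline.

    Solves (R + lam * Q^T Q) gamma = Q^T y for the interior knots; the
    natural boundary conditions fix the second derivative at both ends to 0.
    """
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]
    k = n - 2  # number of interior knots
    q = [[0.0] * k for _ in range(n)]
    r = [[0.0] * k for _ in range(k)]
    for c in range(k):
        j = c + 1  # index of the interior knot for column c
        q[j - 1][c] = 1.0 / h[j - 1]
        q[j][c] = -1.0 / h[j - 1] - 1.0 / h[j]
        q[j + 1][c] = 1.0 / h[j]
        r[c][c] = (h[j - 1] + h[j]) / 3.0
        if c + 1 < k:
            r[c][c + 1] = r[c + 1][c] = h[j] / 6.0
    a = [[r[u][v] + lam * sum(q[t][u] * q[t][v] for t in range(n))
          for v in range(k)] for u in range(k)]
    rhs = [sum(q[t][u] * y[t] for t in range(n)) for u in range(k)]
    return [0.0] + solve(a, rhs) + [0.0]

def inflection_points(x, gamma):
    """Zero crossings of the piecewise-linear second derivative.

    Simplification: only strict sign changes between neighboring knots are
    detected; a second derivative that is exactly zero at a knot is ignored.
    """
    breaks = []
    for i in range(len(x) - 1):
        g0, g1 = gamma[i], gamma[i + 1]
        if g0 * g1 < 0:
            breaks.append(x[i] + (x[i + 1] - x[i]) * g0 / (g0 - g1))
    return breaks

def windows_from_breaks(x, breaks):
    """Turn breakpoints into (start, end) windows covering the whole range."""
    edges = [x[0]] + breaks + [x[-1]]
    return list(zip(edges[:-1], edges[1:]))

# Toy input (an assumption for illustration): 40 evenly spaced markers and
# one full sine period, which has a single interior inflection point.
positions = [float(i) for i in range(40)]
signal = [math.sin(2 * math.pi * p / 39) for p in positions]

gamma = spline_second_derivs(positions, signal, lam=0.05)
breaks = inflection_points(positions, gamma)
wins = windows_from_breaks(positions, breaks)
print(breaks)
print(wins)
```

On this toy signal the sketch yields one breakpoint midway along the range and therefore two windows of variable, data-driven extent. With real, noisy data the number and width of windows depend strongly on the smoothing parameter, which would need to be chosen by a data-driven criterion rather than fixed as it is here.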
Simulations applying this method were performed to identify selection signatures from pooled sequencing FST data, for which allele frequencies were estimated from a pool of individuals. The relative ratio of true to false positives was twice that generated by existing techniques. A comparison of the approach to a previous study that involved pooled sequencing FST data from maize suggested that outlying windows were more clearly separated from their neighbors than when using a standard sliding window approach.
We have developed a novel technique to identify window boundaries for subsequent analysis protocols. When applied to selection studies based on FST data, this method provides a high discovery rate and minimizes false positives. The method is implemented in the R package GenWin, which is publicly available from CRAN.