Tang H, Lewontin R C
Department of Statistics, Stanford University, Stanford, California 94305, USA.
Genetics. 1999 Sep;153(1):485-95. doi: 10.1093/genetics/153.1.485.
When DNA or protein sequences are compared between species, between paralogues, or among individuals within a species or population, different regions of the sequence often appear to be divergent or polymorphic to different degrees, suggesting differential constraint or diversifying selection operating in different regions. The problem is to test statistically whether the observed regional differences in the density of variant sites represent real differences and then to estimate as accurately as possible the location of the differential regions. A method is given for testing and locating regions of differential variation. The method consists of calculating G(x(k)) = k/n - x(k)/N, where x(k) is the position of the kth variant site along the sequence, n is the total number of variant sites, and N is the total sequence length. The estimated region is the longest stretch of adjacent sequence for which G(x(k)) is monotonically increasing (a hot spot) or decreasing (a cold spot). Critical values of this length for tests of significance are given, a sequential method is developed for locating multiple differential regions, and the power of the method against various alternatives is explored. The method locates the endpoints of hot spots and cold spots of variation with high accuracy.
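As an illustration only, the statistic G(x(k)) = k/n - x(k)/N and the search for the longest monotone stretch might be sketched as follows. The function names `g_values` and `longest_monotone_run` are hypothetical. The sketch assumes that "longest" is measured in the number of consecutive variant sites, that monotonicity is strict, and that ties go to the earliest run; none of these details is stated in the abstract. The critical values for significance and the sequential procedure for multiple regions are not reproduced.

```python
from typing import List, Tuple


def g_values(positions: List[int], seq_len: int) -> List[float]:
    """Return G(x(k)) = k/n - x(k)/N for each variant site, in sequence order."""
    xs = sorted(positions)  # x(k): position of the kth variant site
    n = len(xs)             # n: total number of variant sites
    return [k / n - x / seq_len for k, x in enumerate(xs, start=1)]


def longest_monotone_run(g: List[float], increasing: bool = True) -> Tuple[int, int]:
    """Return (first, last) 0-based indices of the longest strictly monotone run.

    Assumption: run length is counted in variant sites, and the earliest
    run wins a tie. increasing=True looks for a candidate hot spot,
    increasing=False for a candidate cold spot.
    """
    best = (0, 0)
    start = 0
    for i in range(1, len(g)):
        step_ok = g[i] > g[i - 1] if increasing else g[i] < g[i - 1]
        if not step_ok:
            start = i  # the run is broken; a new one starts at site i
        if i - start > best[1] - best[0]:
            best = (start, i)
    return best


# Toy data (made up for illustration): 8 variant sites in a sequence of length 100,
# with a dense cluster between positions 50 and 58.
sites = [10, 30, 50, 52, 54, 56, 58, 80]
g = g_values(sites, seq_len=100)

hot = longest_monotone_run(g, increasing=True)
cold = longest_monotone_run(g, increasing=False)
print("candidate hot spot:", sites[hot[0]], "to", sites[hot[1]])     # 50 to 58
print("candidate cold spot:", sites[cold[0]], "to", sites[cold[1]])  # 10 to 50
```

G rises between two consecutive variant sites exactly when their spacing is smaller than the average spacing N/n, which is why an increasing run marks a locally dense region and a decreasing run a locally sparse one. Whether a run is long enough to be significant would still have to be judged against the critical values given in the paper.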