Li Yong Fuga, Costello James C, Holloway Alisha K, Hahn Matthew W
School of Informatics, Indiana University, Bloomington, IN, USA.
Evolution. 2008 Dec;62(12):2984-94. doi: 10.1111/j.1558-5646.2008.00486.x. Epub 2008 Aug 26.
Rapid and inexpensive sequencing technologies are making it possible to collect whole genome sequence data on multiple individuals from a population. Data of this type can be used to quickly identify genes that control important ecological and evolutionary phenotypes by finding the targets of adaptive natural selection, and we therefore refer to such approaches as "reverse ecology." To quantify the power gained in detecting positive selection using population genomic data, we compare three statistical methods for identifying targets of selection: the McDonald-Kreitman test, the mkprf method, and a likelihood implementation for detecting d(N)/d(S) > 1. Because the first two methods use polymorphism data, we expect them to have more power to detect selection. However, when applied to population genomic datasets from human, fly, and yeast, the tests using polymorphism data were actually less powerful in two of the three datasets. We explore reasons why the simpler comparative method has identified more genes under selection, and suggest that the different methods may really be detecting different signals from the same sequence data. Finally, we find several statistical anomalies associated with the mkprf method, including an almost linear dependence between the number of positively selected genes identified and the prior distributions used. We conclude that interpreting the results produced by this method should be done with some caution.
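For readers unfamiliar with the first of the three methods, the following is a minimal sketch of a McDonald-Kreitman test for a single gene. It is not the authors' implementation. It assumes the common formulation in which nonsynonymous and synonymous fixed differences (Dn, Ds) are compared with nonsynonymous and synonymous polymorphisms (Pn, Ps) in a 2x2 table, tested here with a two-sided Fisher's exact test. The function names `fisher_exact_two_sided` and `mk_test` and the counts in the usage line are hypothetical and chosen only for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """Return (neutrality index, p-value) for one gene.

    dn, ds: nonsynonymous / synonymous fixed differences between species.
    pn, ps: nonsynonymous / synonymous polymorphisms within a species.
    """
    p_value = fisher_exact_two_sided(dn, ds, pn, ps)
    # Neutrality index (Pn/Ps) / (Dn/Ds); values below 1 indicate an excess
    # of nonsynonymous divergence, the pattern expected under positive selection.
    if ps and dn and ds:
        ni = (pn / ps) / (dn / ds)
    else:
        ni = float("nan")  # undefined when a denominator count is zero
    return ni, p_value


# Illustrative counts only (assumed for this sketch, not taken from the study).
ni, p = mk_test(dn=7, ds=17, pn=2, ps=42)
print(f"NI = {ni:.3f}, p = {p:.4f}")
```

In a genome-wide scan, a test like this would be run once per gene, so the resulting p-values would need a multiple-testing correction before genes are called as targets of selection.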