Leman Scotland C, Uyenoyama Marcy K, Lavine Michael, Chen Yuguo
Institute of Statistics and Decision Sciences, Duke University, Durham, NC, USA.
Bioinformatics. 2007 Aug 1;23(15):1962-8. doi: 10.1093/bioinformatics/btm264. Epub 2007 May 22.
Gene genealogies offer a powerful context for inferences about the evolutionary process based on presently segregating DNA variation. In many cases, it is the distribution of population parameters, marginalized over the effectively infinite-dimensional tree space, that is of interest. Our evolutionary forest (EF) algorithm uses Monte Carlo methods to generate posterior distributions of population parameters. A novel feature is the updating of parameter values based on a probability measure defined on an ensemble of histories (a forest of genealogies), rather than a single tree.
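The idea of basing a parameter update on an ensemble of histories rather than a single tree can be illustrated with a deliberately simplified sketch. This is not the EF algorithm itself, whose details are not given here. It assumes a standard neutral coalescent, summarizes each genealogy by its total branch length, uses the number of segregating sites as the only data, and places a flat prior on a grid of values of the scaled mutation rate theta. All function names are hypothetical.

```python
import math
import random


def sample_total_branch_length(n, rng):
    """Draw the total branch length of a standard coalescent tree for n sequences."""
    total = 0.0
    for k in range(2, n + 1):
        # While k lineages remain, the waiting time is Exp(k(k-1)/2)
        # and k branches accumulate length during that interval.
        total += k * rng.expovariate(k * (k - 1) / 2)
    return total


def log_poisson_pmf(s, mean):
    """Log of the Poisson probability of observing s events with the given mean."""
    return s * math.log(mean) - mean - math.lgamma(s + 1)


def forest_likelihood(s, theta, forest):
    """Average P(S = s | theta, tree) over the ensemble of sampled trees."""
    probs = (math.exp(log_poisson_pmf(s, theta * t / 2)) for t in forest)
    return sum(probs) / len(forest)


def grid_posterior(s, n, thetas, forest_size=2000, seed=0):
    """Posterior over a theta grid (flat prior), marginalized over the forest."""
    rng = random.Random(seed)
    forest = [sample_total_branch_length(n, rng) for _ in range(forest_size)]
    weights = [forest_likelihood(s, theta, forest) for theta in thetas]
    norm = sum(weights)
    return [w / norm for w in weights]


if __name__ == "__main__":
    thetas = [0.1 * i for i in range(1, 201)]
    posterior = grid_posterior(s=10, n=10, thetas=thetas)
    best = max(range(len(thetas)), key=posterior.__getitem__)
    print("posterior mode of theta on the grid:", thetas[best])
```

Because every value of theta is scored against the same forest, the tree space is integrated out by a Monte Carlo average instead of being tracked through one genealogy at a time. Here the trees are drawn from the prior, which is only practical for small toy problems.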
The EF algorithm generates samples from the correct marginal distribution of population parameters. Applied to actual data from closely related fruit fly species, it rapidly converged to posterior distributions that closely approximated the exact posteriors generated through massive computational effort. Applied to simulated data, it generated credible intervals that covered the actual parameter values in accordance with the nominal probabilities.
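The coverage claim for simulated data corresponds to a standard calibration check: draw a true parameter from the prior, simulate data, form a credible interval, and count how often the interval contains the truth. The sketch below illustrates that check on a conjugate normal model, chosen only because its posterior is available in closed form. It is an assumption for illustration and not the population-genetic model or the simulation design used by the authors; the function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist


def coverage(level=0.9, n_reps=5000, n_obs=5, prior_sd=1.0, noise_sd=1.0, seed=1):
    """Fraction of central credible intervals that contain the true parameter."""
    rng = random.Random(seed)
    # Half-width multiplier for a central interval at the requested level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(n_reps):
        # True parameter drawn from the N(0, prior_sd^2) prior.
        theta = rng.gauss(0.0, prior_sd)
        ys = [rng.gauss(theta, noise_sd) for _ in range(n_obs)]
        # Conjugate normal update with known noise variance and prior mean 0.
        post_prec = 1 / prior_sd**2 + n_obs / noise_sd**2
        post_mean = (sum(ys) / noise_sd**2) / post_prec
        post_sd = post_prec ** -0.5
        if abs(theta - post_mean) <= z * post_sd:
            hits += 1
    return hits / n_reps


if __name__ == "__main__":
    for level in (0.5, 0.9, 0.95):
        print(level, coverage(level=level))
```

When the true values are drawn from the same prior used in the analysis, the empirical coverage should match the nominal level up to Monte Carlo error, so a systematic shortfall would point to a sampler that is not targeting the correct posterior.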
A C++ implementation of this method is freely accessible at http://www.isds.duke.edu/~scl13