Searching for convergence in phylogenetic Markov chain Monte Carlo.

Author information

Beiko Robert G, Keith Jonathan M, Harlow Timothy J, Ragan Mark A

Affiliations

ARC Centre in Bioinformatics and Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland 4072, Australia.

Publication information

Syst Biol. 2006 Aug;55(4):553-65. doi: 10.1080/10635150600812544.

DOI:10.1080/10635150600812544

PMID:16857650

Abstract

Markov chain Monte Carlo (MCMC) is a methodology that is gaining widespread use in the phylogenetics community and is central to phylogenetic software packages such as MrBayes. An important issue for users of MCMC methods is how to select appropriate values for adjustable parameters such as the length of the Markov chain or chains, the sampling density, the proposal mechanism, and, if Metropolis-coupled MCMC is being used, the number of heated chains and their temperatures. Although some parameter settings have been examined in detail in the literature, others are frequently chosen with more regard to computational time or personal experience with other data sets. Such choices may lead to inadequate sampling of tree space or an inefficient use of computational resources. We performed a detailed study of convergence and mixing for 70 randomly selected, putatively orthologous protein sets with different sizes and taxonomic compositions. Replicated runs from multiple random starting points permit a more rigorous assessment of convergence, and we developed two novel statistics, delta and epsilon, for this purpose. Although likelihood values invariably stabilized quickly, adequate sampling of the posterior distribution of tree topologies took considerably longer. Our results suggest that multimodality is common for data sets with 30 or more taxa and that this results in slow convergence and mixing. However, we also found that the pragmatic approach of combining data from several short, replicated runs into a "metachain" to estimate bipartition posterior probabilities provided good approximations, and that such estimates were no worse in approximating a reference posterior distribution than those obtained using a single long run of the same length as the metachain. Precision appears to be best when heated Markov chains have low temperatures, whereas chains with high temperatures appear to sample trees with high posterior probabilities only rarely.
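The "metachain" idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes each sampled tree has already been reduced to a set of bipartitions, each written as a frozenset of the taxon names on one side of the split. The function name `bipartition_frequencies`, the `burn_in` parameter, and the toy data are hypothetical and are not taken from the paper or from MrBayes. The sketch covers only pooling and frequency counting, not the delta and epsilon statistics, which the abstract does not define.

```python
from collections import Counter
from itertools import chain


def bipartition_frequencies(runs, burn_in=0):
    """Pool post-burn-in samples from several runs into one "metachain"
    and return the sampled frequency of each bipartition.

    runs    -- list of runs; each run is a list of sampled trees, and each
               tree is a set of bipartitions (frozensets of taxon names)
    burn_in -- number of initial samples to discard from every run
               (hypothetical parameter, chosen here for illustration)
    """
    # Concatenate the retained samples of all runs into a single pooled sample.
    pooled = list(chain.from_iterable(run[burn_in:] for run in runs))
    if not pooled:
        raise ValueError("no samples left after burn-in")

    # Count how many pooled trees contain each bipartition.
    counts = Counter(split for tree in pooled for split in tree)

    # The frequency across the pooled sample approximates the posterior probability.
    return {split: n / len(pooled) for split, n in counts.items()}


# Toy data: two short replicated runs over taxa A-E (purely illustrative).
AB = frozenset({"A", "B"})
AC = frozenset({"A", "C"})
DE = frozenset({"D", "E"})

run1 = [{AB, DE}, {AB, DE}, {AC, DE}]
run2 = [{AC, DE}, {AB, DE}, {AB, DE}]

freqs = bipartition_frequencies([run1, run2], burn_in=1)
for split, p in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(sorted(split), round(p, 2))
```

With one sample discarded from each run, four trees remain in the pooled sample, so the frequencies are 1.0 for the D/E split, 0.75 for A/B, and 0.25 for A/C. In practice the bipartitions would be extracted from the tree files written by the MCMC software, and the choice of burn-in would need its own justification.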
