Hoff Michael, Orf Stefan, Riehm Benedikt, Darriba Diego, Stamatakis Alexandros
Karlsruhe Institute of Technology, Department of Informatics, Kaiserstraße 12, Karlsruhe, 76131, Germany.
The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, Heidelberg, 69118, Germany.
BMC Bioinformatics. 2016 Mar 24;17:143. doi: 10.1186/s12859-016-0985-x.
In the context of a master's-level programming practical at the computer science department of the Karlsruhe Institute of Technology, we developed and make available open-source code for testing all 203 possible nucleotide substitution models in the Maximum Likelihood (ML) setting under the common Akaike, corrected Akaike, and Bayesian information criteria (AIC, AICc, BIC). We address the question of whether model selection matters topologically, that is, whether conducting ML inferences under the optimal model, instead of a standard General Time Reversible (GTR) model, yields different tree topologies. We also assess the degree to which models selected and trees inferred under the three standard criteria differ. Finally, we assess whether the definition of the sample size (#sites versus #sites × #taxa) yields different models and, as a consequence, different tree topologies.
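The figure of 203 models matches the number of ways to partition the six substitution rates of a time-reversible nucleotide model into groups of equal rates (the Bell number B(6)). The following Python sketch enumerates these partitions as six-digit codes. It is an illustration only, not the authors' code: the rate order `AC, AG, AT, CG, CT, GT`, the digit-string encoding, and the function name `restricted_growth_strings` are assumptions made for this example.

```python
from typing import Iterator, List

# Assumed ordering of the six reversible substitution rates (illustrative).
RATES = ["AC", "AG", "AT", "CG", "CT", "GT"]


def restricted_growth_strings(n: int) -> Iterator[List[int]]:
    """Yield every set partition of n items as a restricted growth string.

    Position i holds the group label of item i. A new label may exceed the
    largest label used so far by at most one, so each partition is
    generated exactly once.
    """

    def extend(prefix: List[int], max_label: int) -> Iterator[List[int]]:
        if len(prefix) == n:
            yield list(prefix)
            return
        for label in range(max_label + 2):
            prefix.append(label)
            yield from extend(prefix, max(max_label, label))
            prefix.pop()

    yield from extend([], -1)


# One code per candidate model: equal digits mean "these rates are tied".
models = ["".join(str(label) for label in s)
          for s in restricted_growth_strings(len(RATES))]

if __name__ == "__main__":
    print(len(models), "rate partitions")
    print("all rates equal:", models[0])
    print("all rates free :", models[-1])
```

Under this encoding, a code in which all six digits are equal ties every rate together, while `012345` leaves all six rates free, as in GTR. A code such as `010010` ties the two transition rates (AG, CT) together and the four transversion rates together.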
We find that all three factors (in order of impact: nucleotide model selection, information criterion used, sample size definition) can yield substantially different final tree topologies (topological difference exceeding 10 %) for approximately 5 % of the tree inferences conducted on the 39 empirical datasets used in our study.
We find that using the best-fit nucleotide substitution model may change the final ML tree topology compared to an inference under a default GTR model. The effect is less pronounced when comparing distinct information criteria. Nonetheless, in some cases we did obtain substantial topological differences.
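To make the role of the sample size concrete, the sketch below computes the three criteria from their textbook definitions and evaluates them under both sample size definitions compared in the abstract (#sites versus #sites × #taxa). It is a minimal illustration, not the authors' implementation: the function names and the input values (`lnl`, `k`, `sites`, `taxa`) are hypothetical, and which parameters are counted in `k` (for example, whether branch lengths are included) is left as an assumption for the user to decide.

```python
import math


def aic(lnl: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * lnl


def aicc(lnl: float, k: int, n: int) -> float:
    """Corrected AIC: AIC + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires a sample size n > k + 1")
    return aic(lnl, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(lnl: float, k: int, n: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * lnl


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the calling pattern.
    lnl, k = -12345.6, 9      # log-likelihood and free parameter count
    sites, taxa = 1500, 40    # alignment dimensions

    for label, n in (("#sites", sites), ("#sites x #taxa", sites * taxa)):
        print(f"n = {label}: AIC={aic(lnl, k):.2f}, "
              f"AICc={aicc(lnl, k, n):.2f}, BIC={bic(lnl, k, n):.2f}")
```

AIC does not depend on the sample size, so only AICc and BIC can change with its definition. A larger n shrinks the AICc correction term and increases the BIC penalty per parameter, which is how the choice of n can shift the selected model.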