State Key Laboratory of Rice Biology, Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, 310058, Hangzhou, China.
Institute of Insect Sciences, Zhejiang University, 310058, Hangzhou, China.
Nat Commun. 2020 Nov 30;11(1):6096. doi: 10.1038/s41467-020-20005-6.
Phylogenetic trees are essential for studying biology, but their reproducibility under identical parameter settings remains unexplored. Here, we find that 3515 (18.11%) IQ-TREE-inferred and 1813 (9.34%) RAxML-NG-inferred maximum likelihood (ML) gene trees are topologically irreproducible when executing two replicates (Run1 and Run2) for each of 19,414 gene alignments in 15 animal, plant, and fungal phylogenomic datasets. Notably, coalescent-based ASTRAL species phylogenies inferred from Run1 and Run2 sets of individual gene trees are topologically irreproducible for 9/15 phylogenomic datasets, whereas concatenation-based phylogenies inferred twice from the same supermatrix are reproducible. Our simulations further show that irreproducible phylogenies are more likely to be incorrect than reproducible phylogenies. These results suggest that a considerable fraction of single-gene ML trees may be irreproducible. Efforts to increase reproducibility in ML inference will benefit from providing analyses' log files, which contain not only typically reported parameters (e.g., program, substitution model, number of tree searches) but also typically unreported ones (e.g., random starting seed number, number of threads, processor type).
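As an illustration of what "topologically irreproducible" means for a pair of replicate gene trees, the following is a minimal sketch, assuming the Run1 and Run2 trees are available as plain Newick strings over the same taxa. It treats both trees as unrooted, collects their non-trivial bipartitions, and reports the Robinson-Foulds distance; a distance of zero means the two runs recovered the same topology. The function names (`parse_newick`, `bipartitions`, `rf_distance`, `same_topology`) are hypothetical and not taken from the paper, whose authors may have used different tooling. The parser is deliberately simple and does not handle quoted labels or bracketed comments.

```python
import re

DELIMS = {"(", ")", ",", ";"}

def parse_newick(text):
    """Parse a Newick string into nested lists of leaf names.

    Branch lengths and internal node labels are discarded, since only
    the topology matters here.
    """
    text = re.sub(r"\s+", "", text)
    tokens = re.findall(r"[(),;]|[^(),;]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            children = [parse_node()]
            while tokens[pos] == ",":
                pos += 1
                children.append(parse_node())
            if tokens[pos] != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1
            # Skip an optional internal label and/or branch length.
            if pos < len(tokens) and tokens[pos] not in DELIMS:
                pos += 1
            return children
        # Leaf: keep the name, drop any ':length' suffix.
        label = tokens[pos].split(":")[0]
        pos += 1
        return label

    return parse_node()

def leaf_set(node):
    """Return the set of leaf names below a parsed node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*[leaf_set(child) for child in node])

def bipartitions(newick):
    """Return the non-trivial bipartitions of a tree, viewed as unrooted."""
    tree = parse_newick(newick)
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)  # fixes one canonical side per split
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*[visit(child) for child in node])
        other = all_leaves - clade
        if len(clade) > 1 and len(other) > 1:
            # Store the side that does not contain the reference leaf.
            splits.add(other if reference in clade else clade)
        return clade

    visit(tree)
    return all_leaves, splits

def rf_distance(newick_a, newick_b):
    """Robinson-Foulds distance between two trees on the same taxa."""
    leaves_a, splits_a = bipartitions(newick_a)
    leaves_b, splits_b = bipartitions(newick_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must contain the same taxa")
    return len(splits_a ^ splits_b)

def same_topology(newick_a, newick_b):
    """True if two replicate trees share an identical unrooted topology."""
    return rf_distance(newick_a, newick_b) == 0

if __name__ == "__main__":
    run1 = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.1):0.02,E:0.4);"
    run2 = "(E:0.4,(D:0.1,C:0.3):0.02,(B:0.2,A:0.1):0.05);"
    print(same_topology(run1, run2))
```

In this toy example the two strings differ only in child order and therefore describe the same topology. Applied to every Run1/Run2 pair, counting the pairs for which `same_topology` returns `False` and dividing by the number of alignments gives proportions of the kind reported above (for example, 3515 of 19,414 is about 18.11%).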