Biometry and Evolutionary Biology Laboratory (LBBE), University Claude Bernard Lyon 1, Lyon, France.
Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.
Mol Biol Evol. 2024 Jan 3;41(1). doi: 10.1093/molbev/msad277.
Simulating multiple sequence alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, such as neural networks. These models and the resulting simulated data need to be as realistic as possible, both to be indicative of how the developed tools perform on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal of representing the sequence evolution process as faithfully as possible and thus simulating empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution, with and without insertion/deletion (indel) events, using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods can predict whether a given MSA is simulated or empirical.
Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate several aspects of empirical MSAs, including site-wise rates as well as amino acid and nucleotide composition.
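The classification idea described above can be illustrated with a minimal sketch. It is not the study's method: the abstract does not specify the features or classifiers used, so everything here is an assumption made for illustration. The hypothetical `msa_features` function summarizes a DNA MSA by its nucleotide composition, gap fraction, and fraction of invariant columns (a crude stand-in for site-wise rate information), and a simple nearest-centroid rule (`fit`, `predict`) stands in for a supervised learner.

```python
from collections import Counter
from math import dist

ALPHABET = "ACGT"  # assumption: DNA alignments with '-' as the gap character


def msa_features(msa):
    """Summarize an MSA (list of equal-length strings) as a feature vector:
    [freq A, freq C, freq G, freq T, gap fraction, invariant-column fraction].
    """
    n_cols = len(msa[0])
    if any(len(seq) != n_cols for seq in msa):
        raise ValueError("all sequences in an MSA must have equal length")

    counts = Counter("".join(msa).upper())
    n_residues = sum(counts[c] for c in ALPHABET) or 1  # avoid division by zero
    composition = [counts[c] / n_residues for c in ALPHABET]
    gap_fraction = counts["-"] / (n_cols * len(msa))

    # A column counts as invariant if its non-gap characters are all identical.
    invariant = 0
    for column in zip(*msa):
        states = {c.upper() for c in column if c != "-"}
        if len(states) == 1:
            invariant += 1

    return composition + [gap_fraction, invariant / n_cols]


def fit(labelled_msas):
    """Compute one mean feature vector (centroid) per label.
    labelled_msas: iterable of (msa, label) pairs.
    """
    by_label = {}
    for msa, label in labelled_msas:
        by_label.setdefault(label, []).append(msa_features(msa))
    return {
        label: [sum(values) / len(values) for values in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def predict(model, msa):
    """Return the label whose centroid is closest (Euclidean) to the MSA's features."""
    features = msa_features(msa)
    return min(model, key=lambda label: dist(features, model[label]))
```

In practice, such a model would be fitted on many labelled alignments and evaluated on held-out ones; a classification accuracy near 50% would indicate simulations that these features cannot tell apart from empirical data, while a high accuracy would indicate a detectable mismatch.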