IASMA Research and Innovation Center, Via E. Mach 1, 38010, San Michele all'Adige, TN, Italy.
J Mol Evol. 2010 Dec;71(5-6):319-31. doi: 10.1007/s00239-010-9398-z. Epub 2010 Oct 26.
Noisy data, especially in combination with misalignment and model misspecification, can have an adverse effect on phylogeny reconstruction; however, effective methods to identify such data are few. One particularly important class of noisy data is saturated positions. To avoid potential errors related to saturation in phylogenomic analyses, we present an automated procedure involving the step-wise removal of the most variable positions in a given data set, coupled with a stopping criterion derived from correlation analyses of pairwise ML distances calculated from the deleted (saturated) and the remaining (conserved) subsets of the alignment. Through a comparison with existing methods, we demonstrate both the effectiveness of our proposed procedure for identifying noisy data and the effect of the removal of such data using a well-publicized case study involving placental mammals. At the least, our procedure will identify data sets requiring further data exploration, and we recommend its use to investigate the effect on phylogenetic analyses of removing subsets of variable positions exhibiting weak or no correlation to the rest of the alignment. However, we would argue that this procedure, by identifying and removing noisy data, facilitates the construction of more accurate phylogenies by, for example, ameliorating potential long-branch attraction artefacts.
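The step-wise removal described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: site variability is approximated by the number of distinct states per column, uncorrected p-distances stand in for the pairwise ML distances, and the stopping rule (halt once the removed subset correlates with the retained subset at or above a hypothetical `min_corr` threshold) is one plausible reading of a criterion the abstract does not specify in detail. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def column_variability(alignment):
    """Count distinct states per column (a crude proxy for site variability)."""
    n_cols = len(alignment[0])
    return [len({seq[i] for seq in alignment}) for i in range(n_cols)]


def p_distances(alignment, columns):
    """Proportion of differing sites for every sequence pair, over `columns`.

    Assumption: p-distances replace the ML distances used in the paper.
    """
    dists = []
    for a, b in combinations(alignment, 2):
        diffs = sum(a[i] != b[i] for i in columns)
        dists.append(diffs / len(columns))
    return dists


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def strip_noisy_positions(alignment, step=1, min_corr=0.5):
    """Remove the most variable columns `step` at a time.

    Stops (assumed rule) as soon as distances from the removed subset
    correlate with distances from the retained subset at >= min_corr.
    Returns (removed_columns, kept_columns) as sorted index lists.
    """
    var = column_variability(alignment)
    # Most variable columns first; ties keep their original order.
    order = sorted(range(len(var)), key=lambda i: var[i], reverse=True)
    removed = []
    while len(removed) + step < len(order):
        candidate = order[:len(removed) + step]
        kept = order[len(candidate):]
        r = pearson(p_distances(alignment, candidate),
                    p_distances(alignment, kept))
        if r >= min_corr:
            break  # the next batch of sites agrees with the conserved core
        removed = candidate
    return sorted(removed), sorted(order[len(removed):])


if __name__ == "__main__":
    toy = ["ACGTAA", "ACGAAC", "ACCTAG", "AGGTAT"]
    removed, kept = strip_noisy_positions(toy, step=1, min_corr=0.5)
    print("removed columns:", removed)
    print("kept columns:", kept)
```

In practice the retained columns would be passed on to tree inference, and results with and without the removed subset compared; the choice of `step` and `min_corr` here is arbitrary and would need to be explored for each data set.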