McTavish Emily Jane, Steel Mike, Holder Mark T
Heidelberg Institute for Theoretical Studies, Heidelberg, Germany; Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS, USA.
Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand.
Mol Phylogenet Evol. 2015 Dec;93:289-95. doi: 10.1016/j.ympev.2015.07.027. Epub 2015 Aug 6.
Statistically consistent estimation of phylogenetic trees or gene trees is possible if pairwise sequence dissimilarities can be converted to a set of distances that are proportional to the true evolutionary distances. Susko et al. (2004) reported some strikingly broad results about the forms of inconsistency in tree estimation that can arise if corrected distances are not proportional to the true distances. They showed that if the corrected distance is a concave function of the true distance, then inconsistency due to long branch attraction will occur. If these functions are convex, then two "long branch repulsion" trees will be preferred over the true tree, although these two incorrect trees are expected to be tied as the preferred tree. Here we extend their results, and demonstrate the existence of a tree shape (which we refer to as a "twisted Farris-zone" tree) for which a single incorrect tree topology will be guaranteed to be preferred if the corrected distance function is convex. We also report that the standard practice of treating gaps in sequence alignments as missing data is sufficient to produce non-linear corrected distance functions if the substitution process is not independent of the insertion/deletion process. Taken together, these results imply inconsistent tree inference under mild conditions. For example, if some positions in a sequence are constrained to be free of substitutions and insertion/deletion events while the remaining sites evolve with independent substitutions and insertion/deletion events, then the distances obtained by treating gaps as missing data can support an incorrect tree topology even given an unlimited amount of data.
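The concave and convex cases attributed to Susko et al. (2004) can be illustrated numerically on a four-taxon tree. The sketch below is not taken from the paper. It assumes a quartet ((A,B),(C,D)), uses the four-point criterion (choose the split with the smallest sum of pairwise distances) as a stand-in for a distance method, and applies hypothetical distortion functions, `1 - exp(-d)` for the concave case and `d**2` for the convex case. The branch lengths and all function names are illustrative choices. It does not attempt to reproduce the "twisted Farris-zone" shape, which the abstract does not specify.

```python
import math


def quartet_distances(a, b, c, d, internal):
    """True path-length distances on the tree ((A,B),(C,D)).

    a, b, c, d are the terminal branch lengths and `internal` is the
    length of the internal branch (all values are illustrative).
    """
    return {
        ("A", "B"): a + b,
        ("C", "D"): c + d,
        ("A", "C"): a + internal + c,
        ("B", "D"): b + internal + d,
        ("A", "D"): a + internal + d,
        ("B", "C"): b + internal + c,
    }


def split_scores(dist, f):
    """Four-point sums for each split after distorting distances with f."""
    return {
        "AB|CD": f(dist[("A", "B")]) + f(dist[("C", "D")]),
        "AC|BD": f(dist[("A", "C")]) + f(dist[("B", "D")]),
        "AD|BC": f(dist[("A", "D")]) + f(dist[("B", "C")]),
    }


def preferred_splits(scores, tol=1e-12):
    """Return every split whose sum is minimal (ties are all reported)."""
    best = min(scores.values())
    return sorted(s for s, v in scores.items() if v - best <= tol)


def identity(x):
    return x          # corrected distance proportional to true distance


def concave(x):
    return 1.0 - math.exp(-x)   # hypothetical saturating distortion


def convex(x):
    return x * x      # hypothetical over-correcting distortion


# Long branches A and C are NOT sisters (the classic attraction setting).
non_sister_long = quartet_distances(1.0, 0.05, 1.0, 0.05, 0.05)
# Long branches A and B ARE sisters (a Farris-zone style tree).
sister_long = quartet_distances(1.0, 1.0, 0.05, 0.05, 0.05)

print(preferred_splits(split_scores(non_sister_long, identity)))
print(preferred_splits(split_scores(non_sister_long, concave)))
print(preferred_splits(split_scores(sister_long, identity)))
print(preferred_splits(split_scores(sister_long, convex)))
```

With these particular branch lengths, the undistorted distances recover AB|CD on both trees. The concave distortion selects AC|BD, grouping the two long branches, and the convex distortion leaves AC|BD and AD|BC tied ahead of the true split, which matches the tie between the two "long branch repulsion" trees described above.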