Chifman Julia, Kubatko Laura
Department of Cancer Biology, Wake Forest School of Medicine, Winston-Salem, NC 27157, United States.
Department of Statistics, The Ohio State University, Columbus, OH 43210, United States; Department of Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH 43210, United States.
J Theor Biol. 2015 Jun 7;374:35-47. doi: 10.1016/j.jtbi.2015.03.006. Epub 2015 Mar 17.
Inferring the evolutionary history of a collection of organisms is a problem of fundamental importance in evolutionary biology. The abundance of DNA sequence data arising from genome sequencing projects has led to significant challenges in the inference of these phylogenetic relationships. Among these challenges is the inference of the evolutionary history of a collection of species from sequence information for several distinct genes sampled throughout the genome. It is widely accepted that each individual gene has its own phylogeny, which may not agree with the species tree. Many possible causes of this gene tree incongruence are known. The best studied is incomplete lineage sorting, which is commonly modeled by the coalescent process. Numerous methods based on the coalescent process have been proposed for estimating the phylogenetic species tree from DNA sequence data. However, use of these methods assumes that the species tree can be identified from DNA sequence data at the leaves of the tree, although this has not been formally established. We prove that the unrooted topology of the n-leaf phylogenetic species tree is generically identifiable given observed data at the leaves of the tree, where the data are assumed to have arisen from the coalescent process under a time-reversible substitution process, with the possibility of site-specific rate variation modeled by the discrete gamma distribution and a proportion of invariable sites.
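To make the idea of gene tree incongruence under incomplete lineage sorting concrete, the following minimal sketch computes gene tree topology probabilities for the simplest textbook case: a rooted three-species tree ((A,B),C) with one sampled lineage per species and an internal branch of length t in coalescent units. It is an illustration of the coalescent model only, not the identifiability argument of the paper, which concerns unrooted n-leaf trees and site pattern data. The function name `gene_tree_probabilities` and the topology labels are hypothetical choices made for this example.

```python
import math


def gene_tree_probabilities(t):
    """Gene tree topology probabilities under the coalescent for the
    rooted species tree ((A,B),C), one lineage sampled per species.

    t is the internal branch length in coalescent units (assumed >= 0).
    """
    if t < 0:
        raise ValueError("branch length t must be non-negative")
    # With probability exp(-t) the A and B lineages fail to coalesce on the
    # internal branch; the three lineages then coalesce in the ancestral
    # population, where each of the three topologies is equally likely.
    discordant = math.exp(-t) / 3.0
    concordant = 1.0 - 2.0 * discordant
    return {
        "((A,B),C)": concordant,   # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 2.0):
        probs = gene_tree_probabilities(t)
        print(t, {k: round(v, 4) for k, v in probs.items()})
```

As t shrinks toward zero the three topologies approach equal probability, which is why short internal branches make individual gene trees an unreliable guide to the species tree.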