Alexe G, Satya R Vijaya, Seiler M, Platt D, Bhanot T, Hui S, Tanaka M, Levine A J, Bhanot G
The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
J Mol Evol. 2008 Nov;67(5):465-87. doi: 10.1007/s00239-008-9148-7. Epub 2008 Oct 15.
Phylogenetic trees based on mtDNA polymorphisms are often used to infer the history of recent human migrations. However, there is no consensus on which method to use. Most methods make strong assumptions that may bias the choice of polymorphisms, and their computational complexity limits the analysis to a few samples or polymorphisms. For example, parsimony minimizes the number of mutations, which biases the results toward minimizing homoplasy events. Such biases may miss the global structure of the polymorphisms altogether, with the risk of identifying a "common" polymorphism as ancient without an internal check on whether it is homoplasic or appears ancient only because of sampling bias (from oversampling the population that carries the polymorphism). A signature of this problem is that different methods applied to the same data, or the same method applied to different datasets, yield different tree topologies. When the results of such analyses are combined, the consensus trees have low internal branch consensus. We determine human mtDNA phylogeny from 1737 complete sequences using a new, direct method based on principal component analysis (PCA) and unsupervised consensus ensemble clustering. PCA identifies polymorphisms representing robust variations in the data, and consensus ensemble clustering creates stable haplogroup clusters. The tree is derived from the bifurcating network that results when the data are split into k = 2, 3, 4, ..., kmax clusters, with equal sampling from each haplogroup. Our method assumes only that the data can be clustered into groups based on mutations; it is fast, is stable to sample perturbation, uses all significant polymorphisms in the data, works for arbitrary sample sizes, and avoids sample choice and haplogroup size bias. The internal branches of our tree have a 90% consensus accuracy.
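The consensus ensemble clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the base clustering algorithms, so k-medoids on Hamming distance is used here as a stand-in, the PCA-based selection of polymorphisms and the equal sampling per haplogroup are omitted, and the function names, parameters, and toy 0/1 polymorphism vectors are all hypothetical. The idea shown is only the resampling one: cluster many random subsamples, then record how often each pair of sequences lands in the same cluster.

```python
import random


def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def k_medoids(seqs, idx, k, rng, iters=20):
    """Cluster the sequences indexed by idx into k groups; returns {index: label}."""
    # Farthest-first seeding: a random first medoid, then repeatedly the
    # point farthest from all medoids chosen so far.
    medoids = [rng.choice(idx)]
    while len(medoids) < k:
        rest = [i for i in idx if i not in medoids]
        medoids.append(
            max(rest, key=lambda i: min(hamming(seqs[i], seqs[m]) for m in medoids))
        )
    for _ in range(iters):
        # Assign every sampled sequence to its nearest medoid.
        groups = [[] for _ in medoids]
        for i in idx:
            d = [hamming(seqs[i], seqs[m]) for m in medoids]
            groups[d.index(min(d))].append(i)
        # Move each medoid to the member with the smallest total distance
        # to its group; an empty group keeps its old medoid.
        updated = [
            min(g, key=lambda c: sum(hamming(seqs[c], seqs[j]) for j in g)) if g else m
            for g, m in zip(groups, medoids)
        ]
        if updated == medoids:
            break
        medoids = updated
    return {i: label for label, g in enumerate(groups) for i in g}


def consensus_matrix(seqs, k, runs=50, frac=0.67, seed=0):
    """Fraction of runs in which each co-sampled pair shares a cluster."""
    rng = random.Random(seed)
    n = len(seqs)
    m = max(k, round(frac * n))  # subsample size per run
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(runs):
        idx = rng.sample(range(n), m)
        labels = k_medoids(seqs, idx, k, rng)
        for i in idx:
            for j in idx:
                sampled[i][j] += 1
                together[i][j] += labels[i] == labels[j]
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Toy data (hypothetical): rows are samples, columns are polymorphic sites.
seqs = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
]
consensus = consensus_matrix(seqs, k=2)
```

Entries of `consensus` near 1 mark pairs that cluster together regardless of which samples are drawn, and entries near 0 mark pairs that are reliably separated. Repeating the call for k = 2, 3, ..., kmax and tracking how stable clusters split from one k to the next is one plausible way to read off a bifurcating network of the kind the abstract describes.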
In conclusion, our tree recreates the standard phylogeny of the N, M, L0/L1, L2, and L3 clades, confirming the African origin of modern humans and showing that the M and N clades arose in almost coincident migrations. However, the N clade haplogroups split along an East-West geographic divide, with a "European R clade" containing the haplogroups H, V, H/V, J, T, and U and a "Eurasian N subclade" including haplogroups B, R5, F, A, N9, I, W, and X. The haplogroup pairs (N9a, N9b) and (M7a, M7b) within N and M are placed in non-nearest locations, in agreement with the large time to most recent common ancestor (TMRCA) expected from studies of their migrations into Japan. For comparison, we also construct consensus maximum likelihood, parsimony, neighbor joining, and UPGMA-based trees using the same polymorphisms and show that these methods give consistent results only for the clade tree. For recent branches, the consensus accuracy of these methods is in the range of 1-20%. From a comparison of our haplogroups to two chimpanzee sequences and one bonobo sequence, and assuming a chimp-human coalescent time of 5 million years before present, we find a human mtDNA TMRCA of 206,000 ± 14,000 years before present.
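The abstract does not state how the chimp-human calibration was converted into a human TMRCA. One common approach, assumed here purely for illustration, is a strict molecular clock: the substitution rate is fixed by the calibrated split, and the depth of the human root is divided by that rate. The function name, its parameters, and the input depths below are hypothetical and are not taken from the paper.

```python
def tmrca_years(human_depth, outgroup_depth, calibration_years):
    """Scale the human root depth to years under a strict molecular clock.

    human_depth: mean substitutions per site from the human root to the tips.
    outgroup_depth: mean substitutions per site from the human/outgroup
        split to the tips.
    calibration_years: assumed age of that split, in years.
    """
    if outgroup_depth <= 0 or calibration_years <= 0:
        raise ValueError("outgroup_depth and calibration_years must be positive")
    rate = outgroup_depth / calibration_years  # substitutions per site per year
    return human_depth / rate


# Hypothetical depths chosen only to illustrate the arithmetic;
# they are not measurements from the study.
example = tmrca_years(
    human_depth=0.002,
    outgroup_depth=0.1,
    calibration_years=5_000_000,
)
```

Under this assumption the estimate scales linearly with the calibration date, so a different chimp-human coalescent time would rescale the TMRCA by the same factor.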