Combining multiple data sets in a likelihood analysis: which models are the best?

Author information

Pupko Tal, Huchon Dorothée, Cao Ying, Okada Norihiro, Hasegawa Masami

Affiliation

The Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan.

Publication information

Mol Biol Evol. 2002 Dec;19(12):2294-307. doi: 10.1093/oxfordjournals.molbev.a004053.

PMID:12446820

Abstract

Until recently, phylogenetic analyses were routinely based on homologous sequences of a single gene. Given the vast number of gene sequences now available, phylogenetic studies are increasingly based on the analysis of multiple genes. Thus, it has become necessary to devise statistical methods to combine multiple molecular data sets. Here, we compare several models for combining different genes for the purpose of evaluating the likelihood of tree topologies. Three methods of branch length estimation were studied: assuming all genes have the same branch lengths (concatenate model), assuming that branch lengths are proportional among genes (proportional model), or assuming that each gene has a separate set of branch lengths (separate model). We also compared three models of among-site rate variation: the homogeneous model, a model that assumes one gamma parameter for all genes, and a model that assumes one gamma parameter for each gene. On the basis of two nuclear and one mitochondrial amino acid data sets, our results suggest that, depending on the data set chosen, either the separate model or the proportional model represents the most appropriate method for branch length analysis. For all the data sets examined, one gamma parameter for each gene represents the best model for among-site rate variation. Using these models, we analyzed alternative mammalian tree topologies, and we describe the effect of the assumed model on the maximum likelihood tree. We show that the choice of the model has an impact on the best phylogeny obtained.
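To make the trade-off between the three branch length models concrete, the sketch below counts the free parameters each one implies on a fixed unrooted bifurcating tree. It rests on several assumptions that are not stated in the abstract: the proportional model is parameterized in the usual way, with one shared set of branch lengths plus one rate multiplier per gene and one multiplier fixed; substitution-matrix parameters are ignored; and the Akaike information criterion (AIC) is used as the comparison score, which may differ from the criterion the authors applied. The function names are hypothetical.

```python
def branch_length_params(n_taxa: int, n_genes: int, model: str) -> int:
    """Free branch-length parameters for one fixed unrooted bifurcating tree."""
    branches = 2 * n_taxa - 3  # number of branches in an unrooted binary tree
    if model == "concatenate":
        # All genes share a single set of branch lengths.
        return branches
    if model == "proportional":
        # One shared set plus a rate multiplier per gene (one multiplier fixed).
        return branches + (n_genes - 1)
    if model == "separate":
        # Every gene has its own set of branch lengths.
        return branches * n_genes
    raise ValueError(f"unknown branch length model: {model}")


def gamma_params(n_genes: int, model: str) -> int:
    """Free gamma shape parameters for among-site rate variation."""
    if model == "homogeneous":
        return 0          # no rate variation among sites
    if model == "shared_gamma":
        return 1          # one gamma parameter for all genes
    if model == "per_gene_gamma":
        return n_genes    # one gamma parameter for each gene
    raise ValueError(f"unknown rate variation model: {model}")


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: lower values indicate a preferred model."""
    return -2.0 * log_likelihood + 2.0 * n_params
```

Given maximized log-likelihoods for the same topology under each combination, one would pass each value to `aic` together with `branch_length_params(...) + gamma_params(...)` and prefer the combination with the lowest score. The separate model always fits at least as well as the other two, so the comparison turns on whether its extra parameters are justified.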
