Cheng Fuxia, Hartmann Stefanie, Gupta Mayetri, Ibrahim Joseph G, Vision Todd J
Department of Mathematics, Illinois State University, Normal, IL, USA.
Bioinformatics. 2009 Mar 1;25(5):592-8. doi: 10.1093/bioinformatics/btp015. Epub 2009 Jan 15.
Full-length DNA and protein sequences that span the entire length of a gene are ideally used for multiple sequence alignments (MSAs) and the subsequent inference of their relationships. Frequently, however, MSAs contain a substantial amount of missing data. For example, expressed sequence tags (ESTs), which are partial sequences of expressed genes, are the predominant source of sequence data for many organisms. The patterns of missing data typical for EST-derived alignments greatly compromise the accuracy of estimated phylogenies.
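To make the missing-data problem concrete, the following minimal sketch computes a naive pairwise distance from a toy alignment. It is not the authors' method: the gap symbol, the `p_distance` helper and the four sequences are all made up for illustration. It only shows that fragments covering different parts of a gene can share few or no aligned sites, so some pairwise distances rest on very little data or cannot be computed at all.

```python
from itertools import combinations

GAP = "-"  # assumed symbol for a missing or unaligned site


def p_distance(a, b, gap=GAP):
    """Proportion of differing sites among columns observed in both sequences.

    Returns None when the two sequences share no observed column.
    """
    shared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


# Toy alignment (made-up data): s3 and s4 are non-overlapping fragments,
# mimicking two ESTs sampled from opposite ends of the same gene.
alignment = {
    "s1": "MKTAYIAK",
    "s2": "MKSAYLAK",
    "s3": "MKTA----",
    "s4": "----YLAK",
}

distances = {
    (i, j): p_distance(alignment[i], alignment[j])
    for i, j in combinations(alignment, 2)
}

for pair, value in distances.items():
    print(pair, value)
```

In this toy case the distance between `s3` and `s4` is undefined, and the distances involving either fragment are based on only four sites. Entries like these are what a model-based estimate of the distance matrix has to fill in or stabilize.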
We present a statistical method for inferring phylogenetic trees from EST-based incomplete MSA data. We propose a class of hierarchical models for modeling pairwise distances between the sequences, and develop a fully Bayesian approach for estimation of the model parameters. Once the distance matrix is estimated, the phylogenetic tree may be constructed by applying neighbor-joining (or any other algorithm of choice). We also show that maximizing the marginal likelihood from the Bayesian approach yields similar results to a profile likelihood estimation. The proposed methods are illustrated using simulated protein families, for which the true phylogeny is known, and one real protein family.
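The final step described above, building a tree from an estimated distance matrix, can be sketched with a plain implementation of the standard neighbor-joining algorithm. This is an illustrative sketch only: it does not implement the hierarchical Bayesian model, the function name `neighbor_joining` and the internal labels `node0`, `node1`, ... are assumptions, and the input is a small additive toy matrix rather than data from the paper.

```python
def neighbor_joining(names, matrix):
    """Build an unrooted tree by neighbor-joining.

    names:  list of taxon labels.
    matrix: symmetric list of lists of pairwise distances, in the same order.
    Returns a list of (node, neighbor, branch_length) edges.
    """
    # Working copy of the distances as a dict of dicts keyed by node label.
    d = {a: {b: matrix[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    edges = []
    next_id = 0

    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes) for a in nodes}

        # Pick the pair that minimizes the Q criterion (first pair wins ties).
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best

        # Create the internal node joining a and b, with its branch lengths.
        u = f"node{next_id}"
        next_id += 1
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))

        # Distances from the new node to every remaining node.
        d[u] = {u: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]

    # Connect the last two nodes directly.
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Toy additive distance matrix for five taxa (illustrative values only).
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

edges = neighbor_joining(names, matrix)
for node, neighbor, length in edges:
    print(f"{node} -- {neighbor}: {length}")
```

Because the algorithm needs only the distance matrix, the same function could be applied to a matrix estimated by any method, which is why the choice of tree-building algorithm is left open in the approach described above.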
R code for fitting these models is available from: http://people.bu.edu/gupta/software.htm.