A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny.

Authors

Wang Huai-Chun, Li Karen, Susko Edward, Roger Andrew J

Affiliation

Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, N.S., B3H 1X5, Canada.

Publication

BMC Evol Biol. 2008 Dec 16;8:331. doi: 10.1186/1471-2148-8-331.

Abstract

BACKGROUND

Widely used substitution models for proteins, such as the Jones-Taylor-Thornton (JTT) or Whelan and Goldman (WAG) models, are based on empirical amino acid interchange matrices estimated from databases of protein alignments and incorporate the average amino acid frequencies of the data set under examination (e.g., JTT + F). Variation in the evolutionary process between sites is typically modelled by a rates-across-sites distribution such as the gamma (Gamma) distribution. However, sites in proteins also vary in the kinds of amino acid interchanges that are favoured, a feature that is ignored by standard empirical substitution matrices. Here we examine the degree to which the pattern of evolution at sites differs from that expected based on empirical amino acid substitution models and evaluate the impact of these deviations on phylogenetic estimation.

RESULTS

We analyzed 21 large protein alignments with two statistical tests designed to detect deviation of site-specific amino acid distributions from data simulated under the standard empirical substitution model, JTT + F + Gamma. We found that the number of states at a given site is, on average, smaller and the frequencies of these states are less uniform than expected based on a JTT + F + Gamma substitution model. With a four-taxon example, we show that phylogenetic estimation under the JTT + F + Gamma model is seriously biased by a long-branch attraction artefact if the data are simulated under a model utilizing the observed site-specific amino acid frequencies from an alignment. Principal components analyses indicate the existence of at least four major site-specific frequency classes in these 21 protein alignments. Using a mixture model with these four separate classes of site-specific state frequencies plus a fifth class of global frequencies (the JTT + cF + Gamma model), significant improvements in model fit for real data sets can be achieved. This simple mixture model also reduces the long-branch attraction problem, as shown by simulations and analyses of a real phylogenomic data set.
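The two site-level quantities mentioned above (how many amino acid states a site shows, and how uniform their frequencies are) can be illustrated with a minimal Python sketch. The abstract does not specify the test statistics the authors used, so the choices here are assumptions: the function name `site_summaries` is hypothetical, Shannon entropy stands in as one possible measure of uniformity, and the toy alignment is invented for illustration only.

```python
import math
from collections import Counter

# The 20 standard amino acids; gaps and ambiguity codes are ignored.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def site_summaries(alignment):
    """Return a (number_of_states, shannon_entropy) pair per alignment column.

    Hypothetical helper: entropy (in nats) is used as an illustrative
    uniformity measure, not as the statistic used in the paper.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same length")

    summaries = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        if total == 0:
            # Column contains only gaps or unknown characters.
            summaries.append((0, 0.0))
            continue
        # Entropy is 0 for an invariant site and grows as frequencies even out.
        entropy = sum((n / total) * math.log(total / n) for n in counts.values())
        summaries.append((len(counts), entropy))
    return summaries


if __name__ == "__main__":
    toy_alignment = ["MKLV", "MKIV", "MRLV", "MKL-"]  # invented example data
    for position, (n_states, entropy) in enumerate(site_summaries(toy_alignment), 1):
        print(position, n_states, round(entropy, 3))
```

In a comparison of the kind described above, summaries like these would be computed for both the real alignment and alignments simulated under JTT + F + Gamma, and the two distributions would then be contrasted.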

CONCLUSION

Protein families display site-specific evolutionary dynamics that are ignored by standard protein phylogenetic models. Accurate estimation of protein phylogenies requires models that accommodate the heterogeneity in the evolutionary process across sites. To this end, we have implemented a class frequency mixture model (cF) in a freely available program called QmmRAxML for phylogenetic estimation.
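For orientation, a class frequency mixture of this kind can be written in the generic form of a finite mixture likelihood. The abstract itself gives no formula, so the notation below is an assumption made for illustration: $D_i$ is the data at site $i$, $T$ the tree with branch lengths, $w_c$ the weight of frequency class $c$, $\pi^{(c)}$ the amino acid frequency vector of class $c$ (with the fifth class taken to be the global frequencies), $s_{ab}$ the JTT exchangeabilities, and $r_k$ one of $K$ equally weighted discrete Gamma rate categories.

```latex
$$
L_i \;=\; \sum_{c=1}^{5} w_c \left[ \frac{1}{K} \sum_{k=1}^{K}
      \Pr\!\left( D_i \mid T,\; r_k,\; Q^{(c)} \right) \right],
\qquad \sum_{c=1}^{5} w_c = 1, \quad w_c \ge 0,
$$
$$
Q^{(c)}_{ab} \;\propto\; s_{ab}\, \pi^{(c)}_{b} \quad (a \neq b).
$$
```

Under this reading, the standard JTT + F + Gamma model is the special case in which all weight falls on the global-frequency class.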

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5bb2/2628903/ccfac1250622/1471-2148-8-331-1.jpg
