Modeling protein evolution with several amino acid replacement matrices depending on site rates.

Author information

Méthodes et Algorithmes pour la Bioinformatique (LIRMM & IBC), Centre National de la Recherche Scientifique (CNRS)-Université Montpellier II, Montpellier Cedex 5, France.

Publication information

Mol Biol Evol. 2012 Oct;29(10):2921-36. doi: 10.1093/molbev/mss112. Epub 2012 Apr 6.

Abstract

Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution patterns. In this paper, we investigate the use of different substitution matrices for different site evolutionary rates. Indeed, the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factors among sites, and there is no reason to assume that the substitution patterns remain identical regardless of the evolutionary rate. We first introduce LG4M, which is composed of four matrices, each corresponding to one discrete gamma rate category (of four). These matrices differ in their amino acid equilibrium distributions and in their exchangeabilities, contrary to the standard gamma model where only the global rate differs from one category to another. Next, we present LG4X, which also uses four different matrices, but leaves aside the gamma distribution and follows a distribution-free scheme for the site rates. All these matrices are estimated from a very large alignment database, and our two models are tested using a large sample of independent alignments. Detailed analysis of resulting matrices and models shows the complexity of amino acid substitutions and the advantage of flexible models such as LG4M and LG4X. Both significantly outperform single-matrix models, providing gains of dozens to hundreds of log-likelihood units for most data sets. LG4X obtains substantial gains compared with LG4M, thanks to its distribution-free scheme for site rates. Since LG4M and LG4X display such advantages but require the same memory space and have comparable running times to standard models, we believe that LG4M and LG4X are relevant alternatives to single replacement matrices. Our models, data, and software are available from http://www.atgc-montpellier.fr/models/lg4x.
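The difference between the two schemes can be illustrated with a small sketch of the site likelihood as a mixture over four categories. This is not the authors' software: the function names, the numeric values, and the convention of rescaling rates to a weighted mean of 1 are assumptions made for illustration, and the per-category likelihoods are placeholders for what a tree-based computation with matrix k and rate r_k would return. Under the discrete gamma scheme used by LG4M, the four categories carry equal weight; under a distribution-free scheme such as that of LG4X, both weights and rates are free parameters.

```python
import math

def normalize_rates(weights, rates):
    """Rescale weights to sum to 1 and rates to a weighted mean of 1.

    The mean-rate-of-1 convention is an assumption of this sketch.
    """
    total_w = sum(weights)
    weights = [w / total_w for w in weights]
    mean_rate = sum(w * r for w, r in zip(weights, rates))
    rates = [r / mean_rate for r in rates]
    return weights, rates

def site_log_likelihood(weights, per_category_likelihoods):
    """Log of the mixture likelihood: log(sum_k w_k * L_k).

    L_k stands for the likelihood of the site under category k, i.e.
    under that category's own replacement matrix and rate.
    """
    mixture = sum(w * l for w, l in zip(weights, per_category_likelihoods))
    return math.log(mixture)

# Hypothetical per-category likelihoods for one alignment site.
site_likes = [1e-5, 3e-5, 2e-5, 5e-6]

# Gamma-style scheme: four equally weighted categories.
gamma_weights = [0.25] * 4

# Distribution-free scheme: free weights and rates (hypothetical values).
free_weights, free_rates = normalize_rates(
    [0.1, 0.3, 0.4, 0.2], [0.2, 0.6, 1.5, 4.0]
)

gamma_lnl = site_log_likelihood(gamma_weights, site_likes)
free_lnl = site_log_likelihood(free_weights, site_likes)
```

In both cases the memory footprint matches a standard four-category gamma model, since one partial likelihood is stored per category; only the matrix attached to each category changes.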
